I am extracting function and variable names from a block of text with my own function. The code and the text are shown below.
import re
text = '''
def function_cal_new1 (revenues_stories_new1, surplus_margin_new1, meadian_profit):
median_profit= revenues_stories_new1* surplus_margin_new1
return median_profit
def cal_tti_c_a(e_BusinessPropertyRightsSuccessor_c,e_Premium_c,e_Interests_c,e_OtherIncome_c,e_PassivelIncome_c,e_SubSubLease_c,estimation_base_BusinessPropertyRights_c,estimation_base_SubLease_c,estimation_base_SubLeaseBusiness_c,estimation_base_SolidWaste_c,estimation_base_Premium_c,estimation_base_PassiveGainsSaleSharePassive_c,estimation_base_PassiveGainsRealEstateThreeYear_c,estimation_base_PassiveGainssaleOtherMovableAssets_c,estimation_base_PassiveGainsSellsRealEstateFiveYear_c,tti_b_a):
tti_b_a=e_BusinessPropertyRightsSuccessor_c+e_Premium_c+e_Interests_c+e_OtherIncome_c+e_PassivelIncome_c+e_SubSubLease_c+estimation_base_BusinessPropertyRights_c+estimation_base_SubLease_c+estimation_base_SubLeaseBusiness_c+estimation_base_SolidWaste_c+estimation_base_Premium_c+estimation_base_PassiveGainsSaleSharePassive_c+estimation_base_PassiveGainsRealEstateThreeYear_c+estimation_base_PassiveGainssaleOtherMovableAssets_c+estimation_base_PassiveGainsSellsRealEstateFiveYear_c
return(tti_b_a)
'''
# Extracting text
def extraction_variables(text):
    splited_table1, splited_table2 = dict(), dict()
    lines = text.split('\n')
    for line in lines:
        x = re.search(r"^def.*:$", line)
        if x is not None:
            values = x[0].split('def ')[1].split('(')
            splited_table1 = values[0]
            splited_table2 = values[1][:-2].split(', ')  # <--- Probably error is here
            yield splited_table1, splited_table2
# Merging extracted text
splited_table1, splited_table2 = zip(*extraction_variables(text))
table = []
for elem in splited_table1:
    table.append(elem)
for sub_array in splited_table2:
    for elem in sub_array:
        table.append(elem)
# Converting to a list
final_table = list(table)
final_table
After running this code, I get the result below:
['function_cal_new1 ',
'cal_tti_c_a',
'revenues_stories_new1',
'surplus_margin_new1',
'meadian_profit',
'e_BusinessPropertyRightsSuccessor_c,e_Premium_c,e_Interests_c,e_OtherIncome_c,e_PassivelIncome_c,e_SubSubLease_c,estimation_base_BusinessPropertyRights_c,estimation_base_SubLease_c,estimation_base_SubLeaseBusiness_c,estimation_base_SolidWaste_c,estimation_base_Premium_c,estimation_base_PassiveGainsSaleSharePassive_c,estimation_base_PassiveGainsRealEstateThreeYear_c,estimation_base_PassiveGainssaleOtherMovableAssets_c,estimation_base_PassiveGainsSellsRealEstateFiveYear_c,tti_b_a']
The names from function_cal_new1 through meadian_profit are extracted correctly, but the parameters of the second function come back as one comma-separated string instead of separate elements.
Can anybody help me solve this so that each name is extracted as a separate element? In the end I need this output:
'function_cal_new1 ',
'cal_tti_c_a',
'revenues_stories_new1',
'surplus_margin_new1',
'meadian_profit',
'e_BusinessPropertyRightsSuccessor_c',
'e_Premium_c',
'e_Interests_c',
'e_OtherIncome_c',
'e_PassivelIncome_c',
'e_SubSubLease_c',
'estimation_base_BusinessPropertyRights_c',
'estimation_base_SubLease_c',
'estimation_base_SubLeaseBusiness_c',
'estimation_base_SolidWaste_c',
'estimation_base_Premium_c',
'estimation_base_PassiveGainsSaleSharePassive_c',
'estimation_base_PassiveGainsRealEstateThreeYear_c',
'estimation_base_PassiveGainssaleOtherMovableAssets_c',
'estimation_base_PassiveGainsSellsRealEstateFiveYear_c',
'tti_b_a'
>Solution :
You can use the ast module to walk through the code and pull out the parts you want. This is probably more robust than using a regular expression. The code does need to be valid Python, including its indentation, so any extra indents in your string have to be removed first. Given that, here is some code that walks the tree and yields the function names, argument names, and variable names.
text = '''
def function_cal_new1 (revenues_stories_new1, surplus_margin_new1, meadian_profit):
    median_profit= revenues_stories_new1* surplus_margin_new1
    return median_profit
def cal_tti_c_a(e_BusinessPropertyRightsSuccessor_c,e_Premium_c,e_Interests_c,e_OtherIncome_c,e_PassivelIncome_c,e_SubSubLease_c,estimation_base_BusinessPropertyRights_c,estimation_base_SubLease_c,estimation_base_SubLeaseBusiness_c,estimation_base_SolidWaste_c,estimation_base_Premium_c,estimation_base_PassiveGainsSaleSharePassive_c,estimation_base_PassiveGainsRealEstateThreeYear_c,estimation_base_PassiveGainssaleOtherMovableAssets_c,estimation_base_PassiveGainsSellsRealEstateFiveYear_c,tti_b_a):
    tti_b_a=e_BusinessPropertyRightsSuccessor_c+e_Premium_c+e_Interests_c+e_OtherIncome_c+e_PassivelIncome_c+e_SubSubLease_c+estimation_base_BusinessPropertyRights_c+estimation_base_SubLease_c+estimation_base_SubLeaseBusiness_c+estimation_base_SolidWaste_c+estimation_base_Premium_c+estimation_base_PassiveGainsSaleSharePassive_c+estimation_base_PassiveGainsRealEstateThreeYear_c+estimation_base_PassiveGainssaleOtherMovableAssets_c+estimation_base_PassiveGainsSellsRealEstateFiveYear_c
    return(tti_b_a)
'''
import ast
def get_names_and_functions(text):
    root = ast.parse(text)
    for node in ast.walk(root):
        if isinstance(node, ast.FunctionDef):
            yield node.name
            for arg in node.args.args:
                yield arg.arg
        elif isinstance(node, ast.Name):
            yield node.id
found = set(get_names_and_functions(text))
This will give you:
{'cal_tti_c_a',
'e_BusinessPropertyRightsSuccessor_c',
'e_Interests_c',
'e_OtherIncome_c',
'e_PassivelIncome_c',
'e_Premium_c',
'e_SubSubLease_c',
'estimation_base_BusinessPropertyRights_c',
'estimation_base_PassiveGainsRealEstateThreeYear_c',
'estimation_base_PassiveGainsSaleSharePassive_c',
'estimation_base_PassiveGainsSellsRealEstateFiveYear_c',
'estimation_base_PassiveGainssaleOtherMovableAssets_c',
'estimation_base_Premium_c',
'estimation_base_SolidWaste_c',
'estimation_base_SubLeaseBusiness_c',
'estimation_base_SubLease_c',
'function_cal_new1',
'meadian_profit',
'median_profit',
'revenues_stories_new1',
'surplus_margin_new1',
'tti_b_a'}
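It uses a set to get rid of the duplicates that appear because the same name can occur both as an argument and as a variable in the body of a function. If you only want function names and arguments, remove the elif branch that yields ast.Name nodes; if you do not want the arguments, remove the inner loop over node.args.args instead.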
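A set does not preserve order, while the output asked for in the question lists the function names first and the parameters afterwards. The following is a minimal sketch of one way to get that ordering, not a definitive implementation. The function name `ordered_names` and the `sample` string are made up for illustration. It assumes that only top-level functions and their ordinary positional parameters are of interest, and it uses `textwrap.dedent` to strip a common leading indent before parsing.

```python
import ast
import textwrap


def ordered_names(source):
    """Return function names, then parameter names, in source order without duplicates."""
    # dedent removes indentation shared by all lines, so an indented
    # triple-quoted string can still be parsed as valid Python.
    tree = ast.parse(textwrap.dedent(source))
    # Only top-level function definitions are considered (assumption of this sketch).
    functions = [node for node in tree.body if isinstance(node, ast.FunctionDef)]
    names = [fn.name for fn in functions]
    for fn in functions:
        # args.args holds the ordinary positional parameters only.
        names.extend(arg.arg for arg in fn.args.args)
    # dict.fromkeys keeps the first occurrence of each name and preserves order.
    return list(dict.fromkeys(names))


# Hypothetical sample input, indented on purpose to exercise dedent.
sample = '''
    def area(width, height):
        return width * height

    def volume(width,height,depth):
        return area(width, height) * depth
'''

print(ordered_names(sample))
# ['area', 'volume', 'width', 'height', 'depth']
```

Because the parser handles the parameter list, it makes no difference whether the commas are followed by a space. If you would rather keep the regular expression from the question, splitting on ',' and calling strip() on each piece removes the same dependency on ', '.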