Extraction of names of variables in separate rows

October 19, 2022

I am extracting some variable names from a text with my own function. Below you can see the code and the text.

    import re
    
    text = '''
    
    def function_cal_new1 (revenues_stories_new1, surplus_margin_new1, meadian_profit):
         median_profit= revenues_stories_new1* surplus_margin_new1
         return median_profit   
    
    def cal_tti_c_a(e_BusinessPropertyRightsSuccessor_c,e_Premium_c,e_Interests_c,e_OtherIncome_c,e_PassivelIncome_c,e_SubSubLease_c,estimation_base_BusinessPropertyRights_c,estimation_base_SubLease_c,estimation_base_SubLeaseBusiness_c,estimation_base_SolidWaste_c,estimation_base_Premium_c,estimation_base_PassiveGainsSaleSharePassive_c,estimation_base_PassiveGainsRealEstateThreeYear_c,estimation_base_PassiveGainssaleOtherMovableAssets_c,estimation_base_PassiveGainsSellsRealEstateFiveYear_c,tti_b_a):
        tti_b_a=e_BusinessPropertyRightsSuccessor_c+e_Premium_c+e_Interests_c+e_OtherIncome_c+e_PassivelIncome_c+e_SubSubLease_c+estimation_base_BusinessPropertyRights_c+estimation_base_SubLease_c+estimation_base_SubLeaseBusiness_c+estimation_base_SolidWaste_c+estimation_base_Premium_c+estimation_base_PassiveGainsSaleSharePassive_c+estimation_base_PassiveGainsRealEstateThreeYear_c+estimation_base_PassiveGainssaleOtherMovableAssets_c+estimation_base_PassiveGainsSellsRealEstateFiveYear_c   
        return(tti_b_a)
    
    '''
    
    # Extracting text
    def extraction_variables(text):
        splited_table1, splited_table2 = dict(), dict()
        lines = text.split('\n')
        for line in lines:
            x = re.search(r"^def.*:$", line)
            if x is not None:
                values = x[0].split('def ')[1].split('(')
                splited_table1 = values[0]
                splited_table2 = values[1][:-2].split(', ') # <--- Probably error is here
                yield splited_table1, splited_table2
    
    # Merging extracted text 
    splited_table1, splited_table2 = zip(*extraction_variables(text))
    table = []
    for elem in splited_table1:
        table.append(elem)
    for sub_array in splited_table2:
        for elem in sub_array:
            table.append(elem)
    
    # Converting to a list
    final_table = list(table)
    final_table

After executing this code, you can see the result below:

['function_cal_new1 ',
 'cal_tti_c_a',
 'revenues_stories_new1',
 'surplus_margin_new1',
 'meadian_profit',
'e_BusinessPropertyRightsSuccessor_c,e_Premium_c,e_Interests_c,e_OtherIncome_c,e_PassivelIncome_c,e_SubSubLease_c,estimation_base_BusinessPropertyRights_c,estimation_base_SubLease_c,estimation_base_SubLeaseBusiness_c,estimation_base_SolidWaste_c,estimation_base_Premium_c,estimation_base_PassiveGainsSaleSharePassive_c,estimation_base_PassiveGainsRealEstateThreeYear_c,estimation_base_PassiveGainssaleOtherMovableAssets_c,estimation_base_PassiveGainsSellsRealEstateFiveYear_c,tti_b_a']

Namely, the names from function_cal_new1 to meadian_profit are extracted correctly, but after that the names are not split into separate rows.

Can anybody help me solve this problem and extract those names in separate rows? In the end I need this output:

'function_cal_new1 ',
'cal_tti_c_a',
'revenues_stories_new1',
'surplus_margin_new1',
'meadian_profit',
'e_BusinessPropertyRightsSuccessor_c',
'e_Premium_c',
'e_Interests_c',
'e_OtherIncome_c',
'e_PassivelIncome_c',
'e_SubSubLease_c',
'estimation_base_BusinessPropertyRights_c',
'estimation_base_SubLease_c',
'estimation_base_SubLeaseBusiness_c',
'estimation_base_SolidWaste_c',
'estimation_base_Premium_c',
'estimation_base_PassiveGainsSaleSharePassive_c',
'estimation_base_PassiveGainsRealEstateThreeYear_c',
'estimation_base_PassiveGainssaleOtherMovableAssets_c',
'estimation_base_PassiveGainsSellsRealEstateFiveYear_c',
'tti_b_a'

>Solution :

You can use the ast module to walk through the code and pull out the bits you want. This is probably more robust than using a regular expression. The code does need to be valid, including indentation, so the extra indentation in your string has to be removed first. Given that, here’s some code that walks the tree and yields the names it finds.

text = '''
    
def function_cal_new1 (revenues_stories_new1, surplus_margin_new1, meadian_profit):
     median_profit= revenues_stories_new1* surplus_margin_new1
     return median_profit   

def cal_tti_c_a(e_BusinessPropertyRightsSuccessor_c,e_Premium_c,e_Interests_c,e_OtherIncome_c,e_PassivelIncome_c,e_SubSubLease_c,estimation_base_BusinessPropertyRights_c,estimation_base_SubLease_c,estimation_base_SubLeaseBusiness_c,estimation_base_SolidWaste_c,estimation_base_Premium_c,estimation_base_PassiveGainsSaleSharePassive_c,estimation_base_PassiveGainsRealEstateThreeYear_c,estimation_base_PassiveGainssaleOtherMovableAssets_c,estimation_base_PassiveGainsSellsRealEstateFiveYear_c,tti_b_a):
    tti_b_a=e_BusinessPropertyRightsSuccessor_c+e_Premium_c+e_Interests_c+e_OtherIncome_c+e_PassivelIncome_c+e_SubSubLease_c+estimation_base_BusinessPropertyRights_c+estimation_base_SubLease_c+estimation_base_SubLeaseBusiness_c+estimation_base_SolidWaste_c+estimation_base_Premium_c+estimation_base_PassiveGainsSaleSharePassive_c+estimation_base_PassiveGainsRealEstateThreeYear_c+estimation_base_PassiveGainssaleOtherMovableAssets_c+estimation_base_PassiveGainsSellsRealEstateFiveYear_c   
    return(tti_b_a)
    
    '''

import ast

def get_names_and_functions(text):
    root = ast.parse(text)
    for node in ast.walk(root):
        if isinstance(node, ast.FunctionDef):
            yield node.name
            for arg in node.args.args:
                yield arg.arg
        elif isinstance(node, ast.Name):
            yield node.id

found = set(get_names_and_functions(text))

This will give you:

{'cal_tti_c_a',
 'e_BusinessPropertyRightsSuccessor_c',
 'e_Interests_c',
 'e_OtherIncome_c',
 'e_PassivelIncome_c',
 'e_Premium_c',
 'e_SubSubLease_c',
 'estimation_base_BusinessPropertyRights_c',
 'estimation_base_PassiveGainsRealEstateThreeYear_c',
 'estimation_base_PassiveGainsSaleSharePassive_c',
 'estimation_base_PassiveGainsSellsRealEstateFiveYear_c',
 'estimation_base_PassiveGainssaleOtherMovableAssets_c',
 'estimation_base_Premium_c',
 'estimation_base_SolidWaste_c',
 'estimation_base_SubLeaseBusiness_c',
 'estimation_base_SubLease_c',
 'function_cal_new1',
 'meadian_profit',
 'median_profit',
 'revenues_stories_new1',
 'surplus_margin_new1',
 'tti_b_a'}

It’s using a set to get rid of the dupes, since the same name can show up both as an argument and as a variable in the body of the function. You can of course remove the elif branch if you only want the function names and their arguments, and not the names used inside the function bodies.
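A set does not keep any order, while the output asked for in the question lists the function names first and then the arguments in the order they appear. The original attempt also failed only because `split(', ')` needs a space after each comma, which the second signature does not have. The following is a minimal sketch of an order-preserving variant, not a definitive implementation: the helper name `ordered_names` and the short `sample` string are made up for illustration, `textwrap.dedent` is assumed to be enough to remove the extra indentation, and only top-level functions are considered.

```python
import ast
import textwrap

def ordered_names(source):
    """Return function names first, then their parameters, in source order."""
    # dedent strips the common leading whitespace so the text parses as valid code
    tree = ast.parse(textwrap.dedent(source))
    # Only top-level definitions are considered in this sketch
    funcs = [node for node in tree.body if isinstance(node, ast.FunctionDef)]
    names = [func.name for func in funcs]
    params = [arg.arg for func in funcs for arg in func.args.args]
    # dict.fromkeys drops duplicates while keeping first-seen order
    return list(dict.fromkeys(names + params))

# Hypothetical sample with the same quirks as the question:
# indented code, a space before "(", and commas without spaces
sample = '''
    def f (a, b):
        return a + b

    def g(x,y,a):
        return x
'''

print(ordered_names(sample))
# ['f', 'g', 'a', 'b', 'x', 'y']
```

Because the parser handles the signature, it does not matter whether the commas are followed by spaces, and the stray space after `function_cal_new1` in the regex output disappears as well.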