Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

How to recursively look for sub classes and at the same time aggregate their values?

Please find my dataframe this :

df = pd.DataFrame({'class': ['class_b', 'class_a', 'class_c', 'class_d'], 'sub_class': ['class_d', None, 'class_e', 'class_a'], 'entities': [5, 1, 7, 6]})
print(df)

     class sub_class  entities
0  class_b   class_d         5
1  class_a      None         1
2  class_c   class_e         7
3  class_d   class_a         6

As per the title, I’m just trying to look for subclasses like we do in os.walk but I can’t figure it out. For example, class_b has class_d as subclass and this one has also a subclass (class_a) and we could have sometimes up to 5 sub levels.

My expected output is this :

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

     class        all_subclass  sum_entites
0  class_a                  []            1
1  class_b  [class_d, class_a]           12
2  class_c           [class_e]            7
3  class_d           [class_a]            7
4  class_e                  []            0

I made a bad code below. I was thinking about making a while loop and keep mergin until there is no match but it feels not good.

df1 = df.merge(df, left_on='sub_class', right_on='class', how="left", indicator=True).filter(like='class')

result = df1.groupby('class_x').apply(lambda x: list(x.dropna().T.values.tolist())).to_dict()

result

{'class_a': [[], [], [], []],
 'class_b': [['class_b'], ['class_d'], ['class_d'], ['class_a']],
 'class_c': [[], [], [], []],
 'class_d': [[], [], [], []]}

Do you have any suggestions guys ?

>Solution :

This can be approached as a graph problem using networkx:

import networkx as nx

# build Series to later map values
# using groupby.sum in case of duplicates
s = df.groupby('class')['entities'].sum()

# create directed graph
G = nx.from_pandas_edgelist(df.fillna('NONE'), create_using=nx.DiGraph,
                            source='class', target='sub_class')
G.remove_node('NONE')

# for each node, get all descendants and sum the entities
out = pd.DataFrame([(n, (d:=nx.descendants(G, n)),
                    sum(s.get(x, 0) for x in d|{n}))
                    for n in G],
                  columns=['class', 'all_subclass', 'sum_entites'])

Output:

     class        all_subclass  sum_entites
0  class_b  {class_a, class_d}           12
1  class_d           {class_a}            7
2  class_a                  {}            1
3  class_c           {class_e}            7
4  class_e                  {}            0

Graph:

enter image description here

Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading