How to recursively look for sub classes and at the same time aggregate their values?

Please find my dataframe this :

df = pd.DataFrame({'class': ['class_b', 'class_a', 'class_c', 'class_d'], 'sub_class': ['class_d', None, 'class_e', 'class_a'], 'entities': [5, 1, 7, 6]})
print(df)

     class sub_class  entities
0  class_b   class_d         5
1  class_a      None         1
2  class_c   class_e         7
3  class_d   class_a         6

As per the title, I’m just trying to look for subclasses like we do in os.walk but I can’t figure it out. For example, class_b has class_d as subclass and this one has also a subclass (class_a) and we could have sometimes up to 5 sub levels.

My expected output is this :

     class        all_subclass  sum_entites
0  class_a                  []            1
1  class_b  [class_d, class_a]           12
2  class_c           [class_e]            7
3  class_d           [class_a]            7
4  class_e                  []            0

I made a bad code below. I was thinking about making a while loop and keep mergin until there is no match but it feels not good.

df1 = df.merge(df, left_on='sub_class', right_on='class', how="left", indicator=True).filter(like='class')

result = df1.groupby('class_x').apply(lambda x: list(x.dropna().T.values.tolist())).to_dict()

result

{'class_a': [[], [], [], []],
 'class_b': [['class_b'], ['class_d'], ['class_d'], ['class_a']],
 'class_c': [[], [], [], []],
 'class_d': [[], [], [], []]}

Do you have any suggestions guys ?

>Solution :

This can be approached as a graph problem using networkx:

import networkx as nx

# build Series to later map values
# using groupby.sum in case of duplicates
s = df.groupby('class')['entities'].sum()

# create directed graph
G = nx.from_pandas_edgelist(df.fillna('NONE'), create_using=nx.DiGraph,
                            source='class', target='sub_class')
G.remove_node('NONE')

# for each node, get all descendants and sum the entities
out = pd.DataFrame([(n, (d:=nx.descendants(G, n)),
                    sum(s.get(x, 0) for x in d|{n}))
                    for n in G],
                  columns=['class', 'all_subclass', 'sum_entites'])

Output:

     class        all_subclass  sum_entites
0  class_b  {class_a, class_d}           12
1  class_d           {class_a}            7
2  class_a                  {}            1
3  class_c           {class_e}            7
4  class_e                  {}            0

Graph:

enter image description here

Leave a Reply