Please find my dataframe this :
df = pd.DataFrame({'class': ['class_b', 'class_a', 'class_c', 'class_d'], 'sub_class': ['class_d', None, 'class_e', 'class_a'], 'entities': [5, 1, 7, 6]})
print(df)
class sub_class entities
0 class_b class_d 5
1 class_a None 1
2 class_c class_e 7
3 class_d class_a 6
As per the title, I’m just trying to look for subclasses like we do in os.walk
but I can’t figure it out. For example, class_b
has class_d
as subclass and this one has also a subclass (class_a
) and we could have sometimes up to 5 sub levels.
My expected output is this :
class all_subclass sum_entites
0 class_a [] 1
1 class_b [class_d, class_a] 12
2 class_c [class_e] 7
3 class_d [class_a] 7
4 class_e [] 0
I made a bad code below. I was thinking about making a while
loop and keep mergin until there is no match but it feels not good.
df1 = df.merge(df, left_on='sub_class', right_on='class', how="left", indicator=True).filter(like='class')
result = df1.groupby('class_x').apply(lambda x: list(x.dropna().T.values.tolist())).to_dict()
result
{'class_a': [[], [], [], []],
'class_b': [['class_b'], ['class_d'], ['class_d'], ['class_a']],
'class_c': [[], [], [], []],
'class_d': [[], [], [], []]}
Do you have any suggestions guys ?
>Solution :
This can be approached as a graph problem using networkx
:
import networkx as nx
# build Series to later map values
# using groupby.sum in case of duplicates
s = df.groupby('class')['entities'].sum()
# create directed graph
G = nx.from_pandas_edgelist(df.fillna('NONE'), create_using=nx.DiGraph,
source='class', target='sub_class')
G.remove_node('NONE')
# for each node, get all descendants and sum the entities
out = pd.DataFrame([(n, (d:=nx.descendants(G, n)),
sum(s.get(x, 0) for x in d|{n}))
for n in G],
columns=['class', 'all_subclass', 'sum_entites'])
Output:
class all_subclass sum_entites
0 class_b {class_a, class_d} 12
1 class_d {class_a} 7
2 class_a {} 1
3 class_c {class_e} 7
4 class_e {} 0
Graph: