Is there a way to remove rows with duplicated order within same traceId and average the ‘duration’ column?
For example,
start_df = pd.DataFrame({'traceId':['a','a','a','a','b','b','b','b', 'c','c','c','c'],
'service':['cartservice', 'frontend-proxy', 'frontend-proxy', 'frontend',
'cartservice', 'frontend-proxy', 'frontend-proxy', 'frontend', 'a','a','b','c'],
'operation': ['router frontend egress', 'ingress', 'oteldemo.CartService/GetCart', 'GET',
'router frontend egress', 'ingress', 'oteldemo.CartService/GetCart', 'GET', 'a-1', 'a-1', 'b-1', 'c-1'],
'duration': [1,2,3,4, 2,3,4,5, 5,5,5,5]})
start_df looks like this:
result_df should be this:
result_df = pd.DataFrame({'traceId':['a','a','a','a', 'c','c','c','c'],
'service':['cartservice', 'frontend-proxy', 'frontend-proxy', 'frontend', 'a','a','b','c'],
'operation': ['router frontend egress', 'ingress', 'oteldemo.CartService/GetCart', 'GET', 'a-1','a-1','b-1','c-1'],
'duration': [1.5, 2.5, 3.5, 4.5,5,5,5,5]})
as you can see, the duplicated rows (because they have same order of service+operation combination within same traceId) are removed and their durations of each service+operation were averaged out.
How can I achieve this with pandas code?
I tried
# Your DataFrame
merged_df = pd.DataFrame({'traceId':['a','a','a','a','b','b','b','b', 'c','c','c','c'],
'service':['cartservice', 'frontend-proxy', 'frontend-proxy', 'frontend',
'cartservice', 'frontend-proxy', 'frontend-proxy', 'frontend', 'a','a','b','c'],
'operation': ['router frontend egress', 'ingress', 'oteldemo.CartService/GetCart', 'GET',
'router frontend egress', 'ingress', 'oteldemo.CartService/GetCart', 'GET', 'a-1', 'a-1', 'b-1', 'c-1'],
'duration': [1,2,3,4, 2,3,4,5, 5,5,5,5]})
# Define a function to check if the 'service' and 'operation' order is the same
def has_same_order(group):
return group['service'].eq(group['service'].iloc[0]).all() and group['operation'].eq(group['operation'].iloc[0]).all()
# Group by 'traceId' and filter groups with the same 'service' and 'operation' order
result_df = merged_df.groupby('traceId').filter(has_same_order)
# Calculate the average duration for each group
result_df = result_df.groupby(['traceId', 'service', 'operation'], as_index=False)['duration'].mean()
print(result_df)
but it didn’t turn out right
>Solution :
Code
tmp = start_df.groupby('traceId')[['service', 'operation']].agg(tuple)
m = tmp.groupby(['service', 'operation']).ngroup()
grp = start_df.groupby(['traceId', 'service', 'operation']).cumcount()
out = (start_df.groupby([start_df['traceId'].map(m), grp, 'service', 'operation'], sort=False)
.agg({'traceId': 'first', 'duration':'mean'})
.droplevel([0, 1]).reset_index().reindex(start_df.columns, axis=1))
out:
traceId service operation duration
0 a cartservice router frontend egress 1.5
1 a frontend-proxy ingress 2.5
2 a frontend-proxy oteldemo.CartService/GetCart 3.5
3 a frontend GET 4.5
4 c a a-1 5.0
5 c a a-1 5.0
6 c b b-1 5.0
7 c c c-1 5.0