Home Is there way to drop duplicate rows based on the order(combinations) within same id?

Questions

Is there way to drop duplicate rows based on the order(combinations) within same id?

February 2, 2024

Is there a way to remove rows with duplicated order within same traceId and average the ‘duration’ column?

For example,

start_df = pd.DataFrame({'traceId':['a','a','a','a','b','b','b','b', 'c','c','c','c'],
                          'service':['cartservice', 'frontend-proxy', 'frontend-proxy', 'frontend',
                                     'cartservice', 'frontend-proxy', 'frontend-proxy', 'frontend', 'a','a','b','c'],
                          'operation': ['router frontend egress', 'ingress', 'oteldemo.CartService/GetCart', 'GET',
                                        'router frontend egress', 'ingress', 'oteldemo.CartService/GetCart', 'GET', 'a-1', 'a-1', 'b-1', 'c-1'],
                          'duration': [1,2,3,4, 2,3,4,5, 5,5,5,5]})

start_df looks like this:

start_df

result_df should be this:


result_df =  pd.DataFrame({'traceId':['a','a','a','a', 'c','c','c','c'],
                          'service':['cartservice', 'frontend-proxy', 'frontend-proxy', 'frontend', 'a','a','b','c'],
                          'operation': ['router frontend egress', 'ingress', 'oteldemo.CartService/GetCart', 'GET', 'a-1','a-1','b-1','c-1'],
                          'duration': [1.5, 2.5, 3.5, 4.5,5,5,5,5]})

result_df

as you can see, the duplicated rows (because they have same order of service+operation combination within same traceId) are removed and their durations of each service+operation were averaged out.

How can I achieve this with pandas code?

I tried



# Your DataFrame
merged_df = pd.DataFrame({'traceId':['a','a','a','a','b','b','b','b', 'c','c','c','c'],
                          'service':['cartservice', 'frontend-proxy', 'frontend-proxy', 'frontend',
                                     'cartservice', 'frontend-proxy', 'frontend-proxy', 'frontend', 'a','a','b','c'],
                          'operation': ['router frontend egress', 'ingress', 'oteldemo.CartService/GetCart', 'GET',
                                        'router frontend egress', 'ingress', 'oteldemo.CartService/GetCart', 'GET', 'a-1', 'a-1', 'b-1', 'c-1'],
                          'duration': [1,2,3,4, 2,3,4,5, 5,5,5,5]})

# Define a function to check if the 'service' and 'operation' order is the same
def has_same_order(group):
    return group['service'].eq(group['service'].iloc[0]).all() and group['operation'].eq(group['operation'].iloc[0]).all()

# Group by 'traceId' and filter groups with the same 'service' and 'operation' order
result_df = merged_df.groupby('traceId').filter(has_same_order)

# Calculate the average duration for each group
result_df = result_df.groupby(['traceId', 'service', 'operation'], as_index=False)['duration'].mean()

print(result_df)

but it didn’t turn out right

>Solution :

Code

tmp = start_df.groupby('traceId')[['service', 'operation']].agg(tuple)
m = tmp.groupby(['service', 'operation']).ngroup()
grp = start_df.groupby(['traceId', 'service', 'operation']).cumcount()
out = (start_df.groupby([start_df['traceId'].map(m), grp, 'service', 'operation'], sort=False)
               .agg({'traceId': 'first', 'duration':'mean'})
               .droplevel([0, 1]).reset_index().reindex(start_df.columns, axis=1))

out:

  traceId         service                     operation  duration
0       a     cartservice        router frontend egress       1.5
1       a  frontend-proxy                       ingress       2.5
2       a  frontend-proxy  oteldemo.CartService/GetCart       3.5
3       a        frontend                           GET       4.5
4       c               a                           a-1       5.0
5       c               a                           a-1       5.0
6       c               b                           b-1       5.0
7       c               c                           c-1       5.0