Is there a way to drop duplicate rows based on the order (combinations) within the same id?

Is there a way to remove rows whose service+operation order (within a traceId) duplicates that of another traceId, averaging the 'duration' column across the duplicates?

For example,

import pandas as pd

start_df = pd.DataFrame({'traceId':['a','a','a','a','b','b','b','b', 'c','c','c','c'],
                          'service':['cartservice', 'frontend-proxy', 'frontend-proxy', 'frontend',
                                     'cartservice', 'frontend-proxy', 'frontend-proxy', 'frontend', 'a','a','b','c'],
                          'operation': ['router frontend egress', 'ingress', 'oteldemo.CartService/GetCart', 'GET',
                                        'router frontend egress', 'ingress', 'oteldemo.CartService/GetCart', 'GET', 'a-1', 'a-1', 'b-1', 'c-1'],
                          'duration': [1,2,3,4, 2,3,4,5, 5,5,5,5]})

start_df looks like this:

   traceId         service                     operation  duration
0        a     cartservice        router frontend egress         1
1        a  frontend-proxy                       ingress         2
2        a  frontend-proxy  oteldemo.CartService/GetCart         3
3        a        frontend                           GET         4
4        b     cartservice        router frontend egress         2
5        b  frontend-proxy                       ingress         3
6        b  frontend-proxy  oteldemo.CartService/GetCart         4
7        b        frontend                           GET         5
8        c               a                           a-1         5
9        c               a                           a-1         5
10       c               b                           b-1         5
11       c               c                           c-1         5

result_df should be this:


result_df =  pd.DataFrame({'traceId':['a','a','a','a', 'c','c','c','c'],
                          'service':['cartservice', 'frontend-proxy', 'frontend-proxy', 'frontend', 'a','a','b','c'],
                          'operation': ['router frontend egress', 'ingress', 'oteldemo.CartService/GetCart', 'GET', 'a-1','a-1','b-1','c-1'],
                          'duration': [1.5, 2.5, 3.5, 4.5,5,5,5,5]})

  traceId         service                     operation  duration
0       a     cartservice        router frontend egress       1.5
1       a  frontend-proxy                       ingress       2.5
2       a  frontend-proxy  oteldemo.CartService/GetCart       3.5
3       a        frontend                           GET       4.5
4       c               a                           a-1       5.0
5       c               a                           a-1       5.0
6       c               b                           b-1       5.0
7       c               c                           c-1       5.0

As you can see, the rows of trace 'b' are removed because its service+operation combinations appear in the same order as trace 'a', and the durations of each matching service+operation pair are averaged; trace 'c', whose order is unique, is left unchanged.
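
To make "same order" concrete, collapsing each trace's service and operation columns into tuples gives a per-trace signature that can be compared directly (a minimal sketch, assuming the start_df defined above):

# Per-trace "signature": the ordered tuple of services and operations (assumes start_df above)
sig = start_df.groupby('traceId')[['service', 'operation']].agg(tuple)
print(sig.loc['a'].equals(sig.loc['b']))   # True  -> traces 'a' and 'b' share the same order
print(sig.loc['a'].equals(sig.loc['c']))   # False -> trace 'c' has its own order and stays as-is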

How can I achieve this with pandas code?

I tried



# Your DataFrame
merged_df = pd.DataFrame({'traceId':['a','a','a','a','b','b','b','b', 'c','c','c','c'],
                          'service':['cartservice', 'frontend-proxy', 'frontend-proxy', 'frontend',
                                     'cartservice', 'frontend-proxy', 'frontend-proxy', 'frontend', 'a','a','b','c'],
                          'operation': ['router frontend egress', 'ingress', 'oteldemo.CartService/GetCart', 'GET',
                                        'router frontend egress', 'ingress', 'oteldemo.CartService/GetCart', 'GET', 'a-1', 'a-1', 'b-1', 'c-1'],
                          'duration': [1,2,3,4, 2,3,4,5, 5,5,5,5]})

# Define a function to check if the 'service' and 'operation' order is the same
def has_same_order(group):
    return group['service'].eq(group['service'].iloc[0]).all() and group['operation'].eq(group['operation'].iloc[0]).all()

# Group by 'traceId' and filter groups with the same 'service' and 'operation' order
result_df = merged_df.groupby('traceId').filter(has_same_order)

# Calculate the average duration for each group
result_df = result_df.groupby(['traceId', 'service', 'operation'], as_index=False)['duration'].mean()

print(result_df)


but it didn't turn out right: the filter only keeps a trace whose rows all share one service and one operation, which never happens here, so the result comes out empty.
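
For reference, a quick check (using the merged_df and has_same_order defined in the attempt above) shows where it goes wrong:

# has_same_order tests whether every row of a trace has the same service and the same
# operation, which is never the case here, so the filter drops every group.
kept = merged_df.groupby('traceId').filter(has_same_order)
print(len(kept))   # 0 -> the subsequent mean() runs on an empty frame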

Solution:

Code

# Collapse each trace into a tuple "signature" of its service/operation order
tmp = start_df.groupby('traceId')[['service', 'operation']].agg(tuple)
# Traces with an identical signature get the same group number
m = tmp.groupby(['service', 'operation']).ngroup()
# Position counter so repeated rows within one trace (e.g. 'a-1' in trace 'c') stay separate
grp = start_df.groupby(['traceId', 'service', 'operation']).cumcount()
# Group matching traces row by row, keep the first traceId and average the duration
out = (start_df.groupby([start_df['traceId'].map(m), grp, 'service', 'operation'], sort=False)
               .agg({'traceId': 'first', 'duration': 'mean'})
               .droplevel([0, 1]).reset_index().reindex(start_df.columns, axis=1))

out:

  traceId         service                     operation  duration
0       a     cartservice        router frontend egress       1.5
1       a  frontend-proxy                       ingress       2.5
2       a  frontend-proxy  oteldemo.CartService/GetCart       3.5
3       a        frontend                           GET       4.5
4       c               a                           a-1       5.0
5       c               a                           a-1       5.0
6       c               b                           b-1       5.0
7       c               c                           c-1       5.0
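
As a quick sanity check (a sketch, assuming the start_df and result_df defined in the question), the output can be compared against the expected frame:

import pandas as pd

# Raises an AssertionError if the frames differ; here out should match the expected result_df.
pd.testing.assert_frame_equal(out, result_df)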