Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Group by and aggregate the values in pandas dataframe

I have following dataframe in python

    meddra_id   meddra_label              soc       cross_ref                       soc_term
2   10000081    Abdominal pain            10017947  http://snomed.info/id/21522001  Gastrointestinal disorders
3   10017999    Gastrointestinal pain     10017947  http://snomed.info/id/21522001  Gastrointestinal disorders
15  10000340    Abstains from alcohol     10041244  http://snomed.info/id/105542008 Social circumstances
35  10001022    Acute psychosis           10037175  http://snomed.info/id/69322001  Psychiatric disorders
36  10061920    Psychotic disorder        10037175  http://snomed.info/id/69322001  Psychiatric disorders

I would like to aggregate the values in ‘meddra_id, meddra_label, soc and soc_term’ columns using group by another column ‘cross_ref’ (and exclude the rows where there is single ‘meddra_id’ associated to ‘cross_ref’).

The expected output is:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

meddra_id           meddra_label                           soc      cross_ref                       soc_term
10000081,10017999   Abdominal pain,Gastrointestinal pain   10017947 http://snomed.info/id/21522001  Gastrointestinal disorders
10001022,10061920   Acute psychosis,Psychotic disorder     10037175 http://snomed.info/id/69322001  Psychiatric disorders

I am trying following lines of code.

df_terms = df.groupby('cross_ref').filter(lambda g: len(g) > 1).drop_duplicates(subset=['meddra_id', 'meddra_label', 'soc', 'soc_term'], keep="first")

#aggregate the values
df_terms = df_terms.groupby('cross_ref')['meddra_id', 'meddra_label', 'soc', 'soc_term'].agg(' , '.join).reset_index()

When I try to aggegate the value, the column ‘soc_term’ is not showed up in the new dataframe (df_terms)

Any help is highly appreciated.

>Solution :

Use agg to join the values in the different columns:

df_grouped = df.groupby('cross_ref') #group as you did
df_filtered = df_grouped.filter(lambda g: len(g['meddra_id'].unique()) > 1) # filter it for single values

df_aggregated = df_filtered.groupby('cross_ref').agg({
    'meddra_id': ', '.join,
    'meddra_label': ', '.join,
    'soc': lambda x: ', '.join(map(str, x)), # convert float values to strings
    'soc_term': lambda x: ', '.join(map(str, x)) # convert float values to strings
}).reset_index() #aggregate to join values in the different columns via a comma
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading