Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

How to create a dataframe with aggregated categories?

I have a pandas dataframe (df) with the following fields:

id name category
01 Eddie magician
01 Eddie plumber
02 Martha actress
03 Jeremy dancer
03 Jeremy actor

I want to create a dataframe (df2) like the following:

id name categories
01 Eddie magician, plumber
02 Martha actress
03 Jeremy dancer, actor

So, first of all, i create df2 and add an additional column by the following commands:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

df2 = df.groupby("id", as_index= False).count()
df2["categories"] = str()

(this also counts the occurrences of various categories, which is something useful for what I intend to do)

Then, I use this loop:

for i in df2.itertuples():
   for entries in df.itertuples():
      if i.id == entries.id:
         df2["categories"].iloc[i.Index] += entries.category
      else:
         pass

Using this code, I get the dataframe that I wanted. However, this implementation has several problems:

  1. Doesn’t look optimal.
  2. If there are repeated entries (such as another column with "Eddie" and "magician"), the entry for Eddie in df2 would have "magician, plumber, magician" in categories.

Therefore I would like to ask the community: is there a better way to do this?

Also keep in mind that this is my first question on this website!

Thanks in advance!

>Solution :

You can groupby your id and name columns and apply a function to the category one like this:

import pandas as pd

data = {
  'id': ['01', '01', '02', '03', '03'],
  'name': ['Eddie', 'Eddie', 'Martha', 'Jeremy', 'Jeremy'],
  'category': ['magician', 'plumber', 'actress', 'dancer', 'actor']
}

df = pd.DataFrame(data)
df2 = df.groupby(['id', 'name'])['category'].apply(lambda x: ', '.join(x)).reset_index()
df2

Output:

   id    name          category
0  01   Eddie  magician, plumber
1  02  Martha           actress
2  03  Jeremy      dancer, actor
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading