Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Panda rename rows after grouping by columns

I’ve recently started to play around with Pandas in order to manipulate some data and I am now trying to anonymize a few columns after a groupBy to find unique occurrences for persons.

For example, suppose the following DF:

   First Name Last Name         DOB
0  Bob        One               28/05/1973
1  Bob        One               28/05/1973
2  Ana        Two               28/07/1991
3  Ana        Two               28/07/1991
4  Ana        Two               28/07/1991
5  Jim        Three             07/01/1994

I can easily find unique person by First Name, Last Name and DOB by using df.groupby(['First Name', 'Last Name', 'DOB']).

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

However, I’d like to apply a function to every unique combination that would transform those names to a known anonymized (incremental) version.

   First Name Last Name         DOB
0  F1         L1                28/05/1973
1  F1         L1                28/05/1973
2  F2         L2                28/07/1991
3  F2         L2                28/07/1991
4  F2         L2                28/07/1991
5  F3         L3                07/01/1994

I’ve tried a few things with transform and apply functions of DF groupBy but with no lucky so far. How could I achieve this?

>Solution :

ids = (df.groupby(["FirstName", "LastName", "DOB"], sort=False)
         .ngroup().add(1)
         .astype(str))

df["FirstName"] = "F" + ids
df["LastName"]  = "L" + ids
  • identify the IDs of firstname, lastname and DOB triples over their group number (ngroup)

    • sort=False helps it retain the seen order
    • ngroup is 0-based, so we add(1)
  • then add prefixes "F" and "L" to the IDs and assign to appropriate columns

to get

>>> df
  FirstName LastName         DOB
0        F1       L1  28/05/1973
1        F1       L1  28/05/1973
2        F2       L2  28/07/1991
3        F2       L2  28/07/1991
4        F2       L2  28/07/1991
5        F3       L3  07/01/1994

where the ids was

>>> ids
0    1
1    1
2    2
3    2
4    2
5    3
dtype: object
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading