Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

how to replace a specifc column value based on max count of values in a groupby scenario with pandas?

I have a dataframe with two columns district and region. Below is the sample of how the input dataframe looks like:

    district     region

1   Aveiro       -
2   Aveiro       Entre Douro e Minho
3   Aveiro       Beira Litoral
4   Aveiro       Beira Litoral
5   Aveiro       Entre Douro e Minho
6   Aveiro       Beira Litoral
7   Braga        Trás-os-Montes
8   Braga        -
9   Braga        Trás-os-Montes

As you can see, There are no null values in the dataframes. But in the region column, there are some records that have this value "-" . Now i want to replcae all the "-" records in that column with the most frequent value based on a groupby scenario with column district: We can get that count with this…

df1['region'].groupby(df1['district']).value_counts()


district          region               
Aveiro            Beira Litoral             3
                  Entre Douro e Minho       2
                  -                         1
Braga             Trás-os-Montes            2
                  -                         1

As you can see, "Beira Litoral" is the most frequent value for Averio, then it should replace the "-" in region column. Similarly, "Trás-os-Montes" is the most frequent value for Braga.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

The output dataframe should look like this:

    district     region

1   Aveiro       Beira Litoral
2   Aveiro       Entre Douro e Minho
3   Aveiro       Beira Litoral
4   Aveiro       Beira Litoral
5   Aveiro       Entre Douro e Minho
6   Aveiro       Beira Litoral
7   Braga        Trás-os-Montes
8   Braga        Trás-os-Montes
9   Braga        Trás-os-Montes

If i had Nan instead of "-" then I could have solved that with something like this

>Solution :

You can filter out the -, then find the mode with groupby():

modes = (df1['region'].replace('-', np.nan)
   .groupby(df1['district'])
   .transform(lambda x: x.mode()[0])
)

df1['region'] = np.where(df1['region']=='-', modes, df1['region'])

Output:

  district               region
1   Aveiro        Beira Litoral
2   Aveiro  Entre Douro e Minho
3   Aveiro        Beira Litoral
4   Aveiro        Beira Litoral
5   Aveiro  Entre Douro e Minho
6   Aveiro        Beira Litoral
7    Braga       Trás-os-Montes
8    Braga       Trás-os-Montes
9    Braga       Trás-os-Montes
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading