Home how to replace a specifc column value based on max count of values in a groupby scenario with pandas?

Questions

how to replace a specifc column value based on max count of values in a groupby scenario with pandas?

January 14, 2022

I have a dataframe with two columns district and region. Below is the sample of how the input dataframe looks like:

    district     region

1   Aveiro       -
2   Aveiro       Entre Douro e Minho
3   Aveiro       Beira Litoral
4   Aveiro       Beira Litoral
5   Aveiro       Entre Douro e Minho
6   Aveiro       Beira Litoral
7   Braga        Trás-os-Montes
8   Braga        -
9   Braga        Trás-os-Montes

As you can see, There are no null values in the dataframes. But in the region column, there are some records that have this value "-" . Now i want to replcae all the "-" records in that column with the most frequent value based on a groupby scenario with column district: We can get that count with this…

df1['region'].groupby(df1['district']).value_counts()


district          region               
Aveiro            Beira Litoral             3
                  Entre Douro e Minho       2
                  -                         1
Braga             Trás-os-Montes            2
                  -                         1

As you can see, "Beira Litoral" is the most frequent value for Averio, then it should replace the "-" in region column. Similarly, "Trás-os-Montes" is the most frequent value for Braga.

The output dataframe should look like this:

    district     region

1   Aveiro       Beira Litoral
2   Aveiro       Entre Douro e Minho
3   Aveiro       Beira Litoral
4   Aveiro       Beira Litoral
5   Aveiro       Entre Douro e Minho
6   Aveiro       Beira Litoral
7   Braga        Trás-os-Montes
8   Braga        Trás-os-Montes
9   Braga        Trás-os-Montes

If i had Nan instead of "-" then I could have solved that with something like this

>Solution :

You can filter out the -, then find the mode with groupby():

modes = (df1['region'].replace('-', np.nan)
   .groupby(df1['district'])
   .transform(lambda x: x.mode()[0])
)

df1['region'] = np.where(df1['region']=='-', modes, df1['region'])

Output:

  district               region
1   Aveiro        Beira Litoral
2   Aveiro  Entre Douro e Minho
3   Aveiro        Beira Litoral
4   Aveiro        Beira Litoral
5   Aveiro  Entre Douro e Minho
6   Aveiro        Beira Litoral
7    Braga       Trás-os-Montes
8    Braga       Trás-os-Montes
9    Braga       Trás-os-Montes