Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Pandas impute values from grouped dataframe to another dataframe with an object

For the dataset I’m working with, I want to impute null values with medians grouped by country. I have created a grouped table (median_data in the code below), which contains all the median values by country.

I need to do this median calculation and imputation in separate steps since the end goal is to create an object with ‘fit’ and ‘transform’ methods, so that I can calculate medians based only on the train data, and impute to the test data.

Here’s dummy data I’m working with:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

data =  [['A', 10, 20, np.nan, np.nan, 50, 30], ['A', 2, 1, 5, np.nan, 34, 35], ['A', 13, 212, 3, 6, np.nan, 37],
         ['B', 120, 230, 53, np.nan, 63, 23], ['B', 22, 115, 15, 61, 4, 15], ['B', np.nan, 22, 12, np.nan, np.nan, 31],
         ['C', 105, 120, np.nan, 22, 520, 3], ['C', 26, 11, 15, np.nan, 34, 3], ['C', 13, np.nan, 13, 234, np.nan, 10],
         ['D', 101, 220, 654, 143, 634, 123], ['D', 32, 21, 61, 24, np.nan, 32], ['D', 11, 72, 23, np.nan, 534, 30]
        ]
df = pd.DataFrame(data, columns=['Country','col1','col2','col3','col4','col5','col6'])


median_data = df.groupby('Country').median().reset_index()

Currently not using an object, just trying to figure out how to do it. Using loops isn’t working. I tried different iterations, here’s where I’m at right now:

df_new = df.copy()
for country in median_data.Country:
    country_data = median_data[median_data.Country == country].copy()
    for col in median_data.columns[2:]:
        df_new[col] = df_new[col].fillna(country_data[col])

The dataset, df:

enter image description here

The table with medians grouped by ‘Country’:

enter image description here

Result from the code above (clearly incorrect).
As an example, col4 for Country A should be [6, 6, 6], but the values I get are [6, 61, 6]:

enter image description here

Is there a way to separately calculate median values and impute them? Efficiency is not my primary concern, but an efficient solution would obviously be preferred.

>Solution :

Use DataFrame.update with GroupBy.transform with median:

df.update(df.groupby('Country').transform('median'), overwrite=False)
print (df)
   Country   col1   col2   col3   col4   col5  col6
0        A   10.0   20.0    4.0    6.0   50.0    30
1        A    2.0    1.0    5.0    6.0   34.0    35
2        A   13.0  212.0    3.0    6.0   42.0    37
3        B  120.0  230.0   53.0   61.0   63.0    23
4        B   22.0  115.0   15.0   61.0    4.0    15
5        B   71.0   22.0   12.0   61.0   33.5    31
6        C  105.0  120.0   14.0   22.0  520.0     3
7        C   26.0   11.0   15.0  128.0   34.0     3
8        C   13.0   65.5   13.0  234.0  277.0    10
9        D  101.0  220.0  654.0  143.0  634.0   123
10       D   32.0   21.0   61.0   24.0  584.0    32
11       D   11.0   72.0   23.0   83.5  534.0    30

Details:

print (df.groupby('Country').transform('median'))
    col1   col2  col3   col4   col5  col6
0   10.0   20.0   4.0    6.0   42.0    35
1   10.0   20.0   4.0    6.0   42.0    35
2   10.0   20.0   4.0    6.0   42.0    35
3   71.0  115.0  15.0   61.0   33.5    23
4   71.0  115.0  15.0   61.0   33.5    23
5   71.0  115.0  15.0   61.0   33.5    23
6   26.0   65.5  14.0  128.0  277.0     3
7   26.0   65.5  14.0  128.0  277.0     3
8   26.0   65.5  14.0  128.0  277.0     3
9   32.0   72.0  61.0   83.5  584.0    32
10  32.0   72.0  61.0   83.5  584.0    32
11  32.0   72.0  61.0   83.5  584.0    32

Alternative solution with DataFrame.combine_first:

df1 = df.combine_first(df.groupby('Country').transform('median'))
print (df1)
   Country   col1   col2   col3   col4   col5  col6
0        A   10.0   20.0    4.0    6.0   50.0    30
1        A    2.0    1.0    5.0    6.0   34.0    35
2        A   13.0  212.0    3.0    6.0   42.0    37
3        B  120.0  230.0   53.0   61.0   63.0    23
4        B   22.0  115.0   15.0   61.0    4.0    15
5        B   71.0   22.0   12.0   61.0   33.5    31
6        C  105.0  120.0   14.0   22.0  520.0     3
7        C   26.0   11.0   15.0  128.0   34.0     3
8        C   13.0   65.5   13.0  234.0  277.0    10
9        D  101.0  220.0  654.0  143.0  634.0   123
10       D   32.0   21.0   61.0   24.0  584.0    32
11       D   11.0   72.0   23.0   83.5  534.0    30
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading