Home Pandas: Filling in null values for numeric columns (for subsets of data) using min value from that particular column

Questions

Pandas: Filling in null values for numeric columns (for subsets of data) using min value from that particular column

byMR

October 4, 2022

Question

I am trying to replace null values in a data frame with the corresponding minimum values for that particular column for a given year (and then replicate for all years).

However my code doesn’t seem to work, could someone inform me why and help form the correct code for this potential use case (my method my not be optimal so open to better process).

Resources

I have added the following code snippet with a sample dataframe and what I am trying to do

# Test dataframe trying to replicate scenario
test_df = pd.DataFrame(data={'name': ['institute_1', 'institute_2', 'institute_2', 'institute_1', 'institute_2',],
                             'score_1':[0.21, np.nan, 0.5, 0.3, np.nan],
                             'score_2':[np.nan, 0.38, 0.53, 0.3, 0.986],
                             'year':[2017,2017,2018,2018,2019]
                             },
                      columns=['name', 'score_1', 'score_2', 'year' ])

# Viewing df before changes
test_df

# Finding all numeric columns (in reality the year shouldn't be included but can perform manual selection if need be later, no a big deal)
numeric_columns = test_df.select_dtypes(include=['number']).columns
numeric_columns

# Filling na values with min value for that particular numeric column in that given year (performing for all years)
for year in range(2017, 2020):
    # select numeric columns
    minimum_score_per_column = test_df[test_df['year'] == year][numeric_columns].min()
    test_df[test_df['year'] == year].fillna(minimum_score_per_column, axis=0, inplace=True)

# Viewing df after changes
test_df

Note : As you can see from the screenshot below the null values haven’t been filled

>Solution :

Because you are doing inplace operation on a sliced data test_df[test_df["year"] == year].

Do manual assignment for the last block:

for year in range(2017, 2020):
    # select numeric columns
    m = test_df['year'] == year
    minimum_score_per_column = test_df[m][numeric_columns].min()
    test_df[m] = test_df[m].fillna(minimum_score_per_column, axis=0)

Output:

          name  score_1  score_2  year
0  institute_1     0.21    0.380  2017
1  institute_2     0.21    0.380  2017
2  institute_2     0.50    0.530  2018
3  institute_1     0.30    0.300  2018
4  institute_2      NaN    0.986  2019