Compare GROUPBY and np.where with Rs datatable package

November 17, 2021

I’m struggling to create a new variable in python based on a condition with a groupby statement. I know there is a new datatable package in python, but I would like to explore the options below.

In R, I did the following:

require(data.table)
id <- c(1,2,3,4,1,5)
units <- c(34, 31, 16, 32, NA, 35)
unitsnew <- c(23, 11, 21, 22, 27, 11)
df = data.table(id, units, unitsnew)
df[, col:= ifelse(is.na(units), min(unitsnew, na.rm = T), units), by = id]

id units unitsnew col
1    34     23    34
2    31     11    31
3    16     21    16
4    32     22    32
1    NA     27    23
5    35     11    35

In Python,

import pandas as pd
import numpy as np
matrix = [(1, 34, 23),
          (2, 31, 11),
          (3, 16, 21),
          (4, 32, 22),
          (1, np.nan, 27),
          (5, 35, 11)]
df = pd.DataFrame(matrix, columns = ['id', 'units', 'unitsnew'])
df.assign(col = np.where(np.isnan(df.units), np.nanmin(df.unitsnew), df.units))

id units unitsnew col
1    34     23    34
2    31     11    31
3    16     21    16
4    32     22    32
1    NA     27    11
5    35     11    35

I’m not sure where to fit the groupby statement such that the minimum value per id is taken. The desired output is generated with R. In python, the nan value is filled with 11 because I failed to add the groupby statement. How can I solve this? Thanks,

>Solution :

First let’s look how to get the minimum value per group using groupby+transform('min'):

df.groupby('id')['unitsnew'].transform('min')

output:

0    23
1    11
2    21
3    22
4    23
5    11
Name: unitsnew, dtype: int64

Now we can use where to assign the above value when df['units'].isna():

df['col'] = df['units'].where(df['units'].notna(),
                              df.groupby('id')['unitsnew'].transform('min'))

output:

   id  units  unitsnew   col
0   1   34.0        23  34.0
1   2   31.0        11  31.0
2   3   16.0        21  16.0
3   4   32.0        22  32.0
4   1    NaN        27  23.0
5   5   35.0        11  35.0