Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Compare GROUPBY and np.where with Rs datatable package

I’m struggling to create a new variable in python based on a condition with a groupby statement. I know there is a new datatable package in python, but I would like to explore the options below.

In R, I did the following:

require(data.table)
id <- c(1,2,3,4,1,5)
units <- c(34, 31, 16, 32, NA, 35)
unitsnew <- c(23, 11, 21, 22, 27, 11)
df = data.table(id, units, unitsnew)
df[, col:= ifelse(is.na(units), min(unitsnew, na.rm = T), units), by = id]

id units unitsnew col
1    34     23    34
2    31     11    31
3    16     21    16
4    32     22    32
1    NA     27    23
5    35     11    35

In Python,

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

import pandas as pd
import numpy as np
matrix = [(1, 34, 23),
          (2, 31, 11),
          (3, 16, 21),
          (4, 32, 22),
          (1, np.nan, 27),
          (5, 35, 11)]
df = pd.DataFrame(matrix, columns = ['id', 'units', 'unitsnew'])
df.assign(col = np.where(np.isnan(df.units), np.nanmin(df.unitsnew), df.units))

id units unitsnew col
1    34     23    34
2    31     11    31
3    16     21    16
4    32     22    32
1    NA     27    11
5    35     11    35

I’m not sure where to fit the groupby statement such that the minimum value per id is taken. The desired output is generated with R. In python, the nan value is filled with 11 because I failed to add the groupby statement. How can I solve this? Thanks,

>Solution :

First let’s look how to get the minimum value per group using groupby+transform('min'):

df.groupby('id')['unitsnew'].transform('min')

output:

0    23
1    11
2    21
3    22
4    23
5    11
Name: unitsnew, dtype: int64

Now we can use where to assign the above value when df['units'].isna():

df['col'] = df['units'].where(df['units'].notna(),
                              df.groupby('id')['unitsnew'].transform('min'))

output:

   id  units  unitsnew   col
0   1   34.0        23  34.0
1   2   31.0        11  31.0
2   3   16.0        21  16.0
3   4   32.0        22  32.0
4   1    NaN        27  23.0
5   5   35.0        11  35.0
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading