I have below the dataframe below df1:
ID Label Value
id_1 A
id_1 B
id_1 C
id_1 D
id_1 E
id_1 10
id_1 20
id_1 30
id_2 F
id_2 G
id_2 H
id_2 40
id_2 50
id_2 60
id_2 70
id_2 80
id_2 90
I would like to group the rows based on the ID column in the following way :
ID Label Value
id_1 A 10
id_1 B 20
id_1 C 30
id_1 D
id_1 E
id_2 F 40
id_2 G 50
id_2 H 60
id_2 70
id_2 80
id_2 90
My goal is to always align the first value in column "Label" for a given ID, with the first value of column "Value" for the same ID (the empty cells inbetwwen numbers are expected).
How can I do this in the most optimal way ?
I tried with groupby feature but didn’t manage to get what I want, Im pretty sure there is an optimal to do this but can’t figure it out right now.
>Solution :
Assuming empty cells are NaN/None, you could count the number of leading empty rows in Value, of trailing ones in Label (with isna+cummin+sum), then groupby.apply to shift "Value" up and remove empty rows in the end with head:
def cust_shift(g):
# number of leading empty rows
n1 = g['Value'].isna().cummin().sum()
# number of trailing empty rows
n2 = g.loc[::-1, 'Label'].isna().cummin().sum()
# shift Value up and remove trailing empty rows
return g.assign(Value=g['Value'].shift(-n1)).head(-min(n1, n2))
out = df.groupby('ID', group_keys=False)[list(df)].apply(cust_shift)
Output:
ID Label Value
0 id_1 A 10.0
1 id_1 B 20.0
2 id_1 C 30.0
3 id_1 D NaN
4 id_1 E NaN
8 id_2 F 40.0
9 id_2 G 50.0
10 id_2 H 60.0
11 id_2 None 70.0
Reproducible input:
from numpy import nan
df = pd.DataFrame({'ID': ['id_1', 'id_1', 'id_1', 'id_1', 'id_1', 'id_1', 'id_1', 'id_1',
'id_2', 'id_2', 'id_2', 'id_2', 'id_2', 'id_2', 'id_2'],
'Label': ['A', 'B', 'C', 'D', 'E', None, None, None, 'F', 'G', 'H', None, None, None, None],
'Value': [nan, nan, nan, nan, nan, 10.0, 20.0, 30.0, nan, nan, nan, 40.0, 50.0, 60.0, 70.0]})
empty cells are empty strings
If your empty cells are in fact empty strings, just adapt the above code to use eq('') in place of isna and add fill_value='' in shift:
def cust_shift(g):
# number of leading empty rows
n1 = g['Value'].eq('').cummin().sum()
# number of trailing empty rows
n2 = g.loc[::-1, 'Label'].eq('').cummin().sum()
return (g.assign(Value=g['Value'].shift(-n1, fill_value=''))
.head(-min(n1, n2))
)
out = (df.groupby('ID', group_keys=False)[list(df)]
.apply(cust_shift)
)
Output:
ID Label Value
0 id_1 A 10
1 id_1 B 20
2 id_1 C 30
3 id_1 D
4 id_1 E
8 id_2 F 40
9 id_2 G 50
10 id_2 H 60
11 id_2 70
Alternative input:
df = pd.DataFrame({'ID': ['id_1', 'id_1', 'id_1', 'id_1', 'id_1', 'id_1', 'id_1', 'id_1',
'id_2', 'id_2', 'id_2', 'id_2', 'id_2', 'id_2', 'id_2'],
'Label': ['A', 'B', 'C', 'D', 'E', '', '', '', 'F', 'G', 'H', '', '', '', ''],
'Value': ['', '', '', '', '', 10, 20, 30, '', '', '', 40, 50, 60, 70]})