How to group rows based on column ID in a pandas dataframe?

July 19, 2024

I have below the dataframe below df1:

ID       Label   Value
id_1     A
id_1     B
id_1     C
id_1     D
id_1     E
id_1             10
id_1             20
id_1             30
id_2     F
id_2     G
id_2     H
id_2             40
id_2             50
id_2             60
id_2             70
id_2             80
id_2             90

I would like to group the rows based on the ID column in the following way :

ID      Label   Value
id_1     A      10
id_1     B      20
id_1     C      30
id_1     D
id_1     E      
id_2     F      40
id_2     G      50
id_2     H      60
id_2            70
id_2            80
id_2            90

My goal is to always align the first value in column "Label" for a given ID, with the first value of column "Value" for the same ID (the empty cells inbetwwen numbers are expected).

How can I do this in the most optimal way ?

I tried with groupby feature but didn’t manage to get what I want, Im pretty sure there is an optimal to do this but can’t figure it out right now.

>Solution :

Assuming empty cells are NaN/None, you could count the number of leading empty rows in Value, of trailing ones in Label (with isna+cummin+sum), then groupby.apply to shift "Value" up and remove empty rows in the end with head:

def cust_shift(g):
    # number of leading empty rows
    n1 = g['Value'].isna().cummin().sum()
    # number of trailing empty rows
    n2 = g.loc[::-1, 'Label'].isna().cummin().sum()
    # shift Value up and remove trailing empty rows
    return g.assign(Value=g['Value'].shift(-n1)).head(-min(n1, n2))

out = df.groupby('ID', group_keys=False)[list(df)].apply(cust_shift)

Output:

      ID Label  Value
0   id_1     A   10.0
1   id_1     B   20.0
2   id_1     C   30.0
3   id_1     D    NaN
4   id_1     E    NaN
8   id_2     F   40.0
9   id_2     G   50.0
10  id_2     H   60.0
11  id_2  None   70.0

Reproducible input:

from numpy import nan
df = pd.DataFrame({'ID': ['id_1', 'id_1', 'id_1', 'id_1', 'id_1', 'id_1', 'id_1', 'id_1',
                          'id_2', 'id_2', 'id_2', 'id_2', 'id_2', 'id_2', 'id_2'],
                   'Label': ['A', 'B', 'C', 'D', 'E', None, None, None, 'F', 'G', 'H', None, None, None, None],
                   'Value': [nan, nan, nan, nan, nan, 10.0, 20.0, 30.0, nan, nan, nan, 40.0, 50.0, 60.0, 70.0]})

empty cells are empty strings

If your empty cells are in fact empty strings, just adapt the above code to use eq('') in place of isna and add fill_value='' in shift:

def cust_shift(g):
    # number of leading empty rows
    n1 = g['Value'].eq('').cummin().sum()
    # number of trailing empty rows
    n2 = g.loc[::-1, 'Label'].eq('').cummin().sum()
    return (g.assign(Value=g['Value'].shift(-n1, fill_value=''))
             .head(-min(n1, n2))
            )

out = (df.groupby('ID', group_keys=False)[list(df)]
         .apply(cust_shift)
      )

Output:

      ID Label Value
0   id_1     A    10
1   id_1     B    20
2   id_1     C    30
3   id_1     D      
4   id_1     E      
8   id_2     F    40
9   id_2     G    50
10  id_2     H    60
11  id_2          70

Alternative input:

df = pd.DataFrame({'ID': ['id_1', 'id_1', 'id_1', 'id_1', 'id_1', 'id_1', 'id_1', 'id_1',
                          'id_2', 'id_2', 'id_2', 'id_2', 'id_2', 'id_2', 'id_2'],
                   'Label': ['A', 'B', 'C', 'D', 'E', '', '', '', 'F', 'G', 'H', '', '', '', ''],
                   'Value': ['', '', '', '', '', 10, 20, 30, '', '', '', 40, 50, 60, 70]})