Adding a series to pandas DF with a value based on groups of data in other series

I’m trying to add a column to this DataFrame as efficiently as possible (it is huge):

Time    Time_slot   Veh_ID      Pwr
1       1           100         10
1       2           100         12
2       1           100         3
2       2           100         13
3       1           100         22
3       2           100         13
1       1           55          8
1       2           55          2
2       1           55          12
2       2           55          11
6000    1           100         7
6000    2           100         6
6001    1           100         11
6001    2           100         14
6001    1           55          7
6001    2           55          9
6002    1           55          9
6002    2           55          13
6003    1           55          10
6003    2           55          9

The idea is to add a column called "Group_ID" that concatenates the Veh_ID with an incremental counter: a new group starts whenever the same Veh_ID reappears with a Time more than, say, 5000 after its previous occurrence. E.g. –

Time    Time_slot   Veh_ID  Pwr Group_ID
1       1           100     10          100_1
1       2           100     12          100_1
2       1           100     3           100_1
2       2           100     13          100_1
3       1           100     22          100_1
3       2           100     13          100_1
1       1           55      8           55_1
1       2           55      2           55_1
2       1           55      12          55_1
2       2           55      11          55_1
6000    1           100     7           100_2
6000    2           100     6           100_2
6001    1           100     11          100_2
6001    2           100     14          100_2
6001    1           55      7           55_2
6001    2           55      9           55_2
6002    1           55      9           55_2
6002    2           55      13          55_2
6003    1           55      10          55_2
6003    2           55      9           55_2

Here the Group_IDs are generated per Veh_ID, one per run of records whose consecutive timestamps satisfy np.diff < 5000. While I can describe the rule, I can’t get the code to work in Python.


I have code that is just too complicated, and I’m sure pandas provides a more elegant way to do this:

df.insert(loc = 4, column = 'Group_ID',  value = 0)
for v in df.Veh_ID.unique():
    diffs = np.diff(df[df.Veh_ID == v].Time)
    change_indxs = np.where(diffs > 5000)[0] # Each location is the last of the group
    start_indx = df.index[0]
    last_indx  = df.index[-1]
    abreak     = last_indx
    for i, abreak in enumerate(change_indxs):
        aChunk = df.loc[start_indx:abreak+1]
        aChunk['Group_ID'].loc[start_indx:abreak+1] = str(v) + str(i)

which doesn’t work and raises: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame

I’m a bit stumped.
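For what it’s worth, I can silence the warning by doing the selection and the assignment in a single .loc call on the original frame instead of writing into a slice — a minimal sketch of that pattern on toy data (not my real frame):

```python
import pandas as pd

df = pd.DataFrame({'Time': [1, 2, 6000, 6001],
                   'Veh_ID': [100, 100, 100, 100]})
df['Group_ID'] = ''  # placeholder column

# One .loc call does both the row selection and the assignment,
# so pandas never sees a write through a chained slice.
df.loc[df['Time'] < 5000, 'Group_ID'] = '100_1'
df.loc[df['Time'] >= 5000, 'Group_ID'] = '100_2'
print(df['Group_ID'].tolist())  # ['100_1', '100_1', '100_2', '100_2']
```

But that only fixes the warning; my group-numbering logic is still wrong, so I’d rather find a cleaner approach.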

>Solution :

A simple way can be:

gid = df['Time'].floordiv(5000).add(1).astype(str)
df['Group_ID'] = df['Veh_ID'].astype(str) + '_' + gid
print(df)

# Output
    Time  Time_slot  Veh_ID  Pwr Group_ID
0      1          1     100   10    100_1
1      1          2     100   12    100_1
2      2          1     100    3    100_1
3      2          2     100   13    100_1
4      3          1     100   22    100_1
5      3          2     100   13    100_1
6      1          1      55    8     55_1
7      1          2      55    2     55_1
8      2          1      55   12     55_1
9      2          2      55   11     55_1
10  6000          1     100    7    100_2
11  6000          2     100    6    100_2
12  6001          1     100   11    100_2
13  6001          2     100   14    100_2
14  6001          1      55    7     55_2
15  6001          2      55    9     55_2
16  6002          1      55    9     55_2
17  6002          2      55   13     55_2
18  6003          1      55   10     55_2
19  6003          2      55    9     55_2

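One caveat worth noting: floordiv bins by absolute Time, not by the gap between consecutive rows, so two rows only 2 apart can land in different groups when they straddle a multiple of 5000 (toy data to illustrate):

```python
import pandas as pd

df = pd.DataFrame({'Time': [4999, 5001], 'Veh_ID': [100, 100]})

# Fixed bins of width 5000: 4999 // 5000 == 0 but 5001 // 5000 == 1,
# so adjacent rows are split even though their gap is only 2.
gid = df['Time'].floordiv(5000).add(1).astype(str)
print((df['Veh_ID'].astype(str) + '_' + gid).tolist())  # ['100_1', '100_2']
```

The rolling-diff version below keeps such rows together, since it looks at the difference between consecutive Times rather than at fixed bins.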
A more robust way (with a rolling diff):

new_group = lambda x: x.diff().floordiv(5000).cumsum().fillna(0).astype(int).add(1)
gid = df.groupby('Veh_ID')['Time'].transform(new_group).astype(str)
df['Group_ID'] = df['Veh_ID'].astype(str) + '_' + gid
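As a sanity check, here is the same groupby-transform logic run end to end on a reduced version of the example data (assuming, as in the question, that Time is non-decreasing within each Veh_ID):

```python
import pandas as pd

df = pd.DataFrame({
    'Time':   [1, 2, 3, 1, 2, 6000, 6001, 6001, 6002],
    'Veh_ID': [100, 100, 100, 55, 55, 100, 100, 55, 55],
})

# Within each Veh_ID: a gap of 5000 or more between consecutive Times
# bumps the running group counter (diff -> floordiv -> cumsum).
new_group = lambda x: x.diff().floordiv(5000).cumsum().fillna(0).astype(int).add(1)
gid = df.groupby('Veh_ID')['Time'].transform(new_group).astype(str)
df['Group_ID'] = df['Veh_ID'].astype(str) + '_' + gid
print(df['Group_ID'].tolist())
# ['100_1', '100_1', '100_1', '55_1', '55_1', '100_2', '100_2', '55_2', '55_2']
```

transform keeps the original row order, so the interleaved Veh_IDs 100 and 55 each get their own independent group counter.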