Adding a series to pandas DF with a value based on groups of data in other series

I’m trying to add a column to this DataFrame as efficiently as possible (it is huge):

Time    Time_slot   Veh_ID      Pwr
1       1           100         10
1       2           100         12
2       1           100         3
2       2           100         13
3       1           100         22
3       2           100         13
1       1           55          8
1       2           55          2
2       1           55          12
2       2           55          11
6000    1           100         7
6000    2           100         6
6001    1           100         11
6001    2           100         14
6001    1           55          7
6001    2           55          9
6002    1           55          9
6002    2           55          13
6003    1           55          10
6003    2           55          9

The idea is to add a column called "Group_ID" that concatenates the Veh_ID with an incremental counter: a new group starts whenever the same Veh_ID reappears with a Time more than, say, 5000 after its previous occurrence. E.g. –

Time    Time_slot   Veh_ID  Pwr Group_ID
1       1           100     10          100_1
1       2           100     12          100_1
2       1           100     3           100_1
2       2           100     13          100_1
3       1           100     22          100_1
3       2           100     13          100_1
1       1           55      8           55_1
1       2           55      2           55_1
2       1           55      12          55_1
2       2           55      11          55_1
6000    1           100     7           100_2
6000    2           100     6           100_2
6001    1           100     11          100_2
6001    2           100     14          100_2
6001    1           55      7           55_2
6001    2           55      9           55_2
6002    1           55      9           55_2
6002    2           55      13          55_2
6003    1           55      10          55_2
6003    2           55      9           55_2

Here the Group_IDs are generated per Veh_ID, one per run of records whose consecutive timestamps satisfy np.diff < 5000. While I can describe the rule, I can’t get the code to work in Python.


I have code that is just too complicated, and I’m sure pandas provides a more elegant way to do this:

df.insert(loc = 4, column = 'Group_ID',  value = 0)
for v in df.Veh_ID.unique():
    diffs = np.diff(df[df.Veh_ID == v].Time)
    change_indxs = np.where(diffs > 5000)[0] # Each location is the last of the group
    start_indx = df.index[0]
    last_indx  = df.index[-1]
    abreak     = last_indx
    for i, abreak in enumerate(change_indxs):
        aChunk = df.loc[start_indx:abreak+1]
        aChunk['Group_ID'].loc[start_indx:abreak+1] = str(v) + str(i)

which doesn’t work and raises: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame

I’m a bit stumped.
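For what it’s worth, I can silence the warning by doing the selection and the assignment in a single .loc call on the original frame instead of writing into a slice — a minimal sketch of that pattern on toy data (not my real frame):

```python
import pandas as pd

df = pd.DataFrame({'Time': [1, 2, 6000, 6001],
                   'Veh_ID': [100, 100, 100, 100]})
df['Group_ID'] = ''  # placeholder column

# One .loc call does both the row selection and the assignment,
# so pandas never sees a write through a chained slice.
df.loc[df['Time'] < 5000, 'Group_ID'] = '100_1'
df.loc[df['Time'] >= 5000, 'Group_ID'] = '100_2'
print(df['Group_ID'].tolist())  # ['100_1', '100_1', '100_2', '100_2']
```

But that only fixes the warning; my group-numbering logic is still wrong, so I’d rather find a cleaner approach.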

>Solution :

A simple way can be:

gid = df['Time'].floordiv(5000).add(1).astype(str)
df['Group_ID'] = df['Veh_ID'].astype(str) + '_' + gid
print(df)

# Output
    Time  Time_slot  Veh_ID  Pwr Group_ID
0      1          1     100   10    100_1
1      1          2     100   12    100_1
2      2          1     100    3    100_1
3      2          2     100   13    100_1
4      3          1     100   22    100_1
5      3          2     100   13    100_1
6      1          1      55    8     55_1
7      1          2      55    2     55_1
8      2          1      55   12     55_1
9      2          2      55   11     55_1
10  6000          1     100    7    100_2
11  6000          2     100    6    100_2
12  6001          1     100   11    100_2
13  6001          2     100   14    100_2
14  6001          1      55    7     55_2
15  6001          2      55    9     55_2
16  6002          1      55    9     55_2
17  6002          2      55   13     55_2
18  6003          1      55   10     55_2
19  6003          2      55    9     55_2

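One caveat worth noting: floordiv bins by absolute Time, not by the gap between consecutive rows, so two rows only 2 apart can land in different groups when they straddle a multiple of 5000 (toy data to illustrate):

```python
import pandas as pd

df = pd.DataFrame({'Time': [4999, 5001], 'Veh_ID': [100, 100]})

# Fixed bins of width 5000: 4999 // 5000 == 0 but 5001 // 5000 == 1,
# so adjacent rows are split even though their gap is only 2.
gid = df['Time'].floordiv(5000).add(1).astype(str)
print((df['Veh_ID'].astype(str) + '_' + gid).tolist())  # ['100_1', '100_2']
```

The rolling-diff version below keeps such rows together, since it looks at the difference between consecutive Times rather than at fixed bins.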
A more robust way (with a rolling diff):

new_group = lambda x: x.diff().floordiv(5000).cumsum().fillna(0).astype(int).add(1)
gid = df.groupby('Veh_ID')['Time'].transform(new_group).astype(str)
df['Group_ID'] = df['Veh_ID'].astype(str) + '_' + gid
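As a sanity check, here is the same groupby-transform logic run end to end on a reduced version of the example data (assuming, as in the question, that Time is non-decreasing within each Veh_ID):

```python
import pandas as pd

df = pd.DataFrame({
    'Time':   [1, 2, 3, 1, 2, 6000, 6001, 6001, 6002],
    'Veh_ID': [100, 100, 100, 55, 55, 100, 100, 55, 55],
})

# Within each Veh_ID: a gap of 5000 or more between consecutive Times
# bumps the running group counter (diff -> floordiv -> cumsum).
new_group = lambda x: x.diff().floordiv(5000).cumsum().fillna(0).astype(int).add(1)
gid = df.groupby('Veh_ID')['Time'].transform(new_group).astype(str)
df['Group_ID'] = df['Veh_ID'].astype(str) + '_' + gid
print(df['Group_ID'].tolist())
# ['100_1', '100_1', '100_1', '55_1', '55_1', '100_2', '100_2', '55_2', '55_2']
```

transform keeps the original row order, so the interleaved Veh_IDs 100 and 55 each get their own independent group counter.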