I’m trying to add a column to this DataFrame as efficiently as possible (it is huge):
Time Time_slot Veh_ID Pwr
1 1 100 10
1 2 100 12
2 1 100 3
2 2 100 13
3 1 100 22
3 2 100 13
1 1 55 8
1 2 55 2
2 1 55 12
2 2 55 11
6000 1 100 7
6000 2 100 6
6001 1 100 11
6001 2 100 14
6001 1 55 7
6001 2 55 9
6002 1 55 9
6002 2 55 13
6003 1 55 10
6003 2 55 9
The idea is to add a column called "Group_ID" that concatenates the Veh_ID with an incremental counter: consecutive records of the same Veh_ID belong to one group as long as the Time gap to the next occurrence of that same Veh_ID is less than, say, 5000. E.g.:
Time Time_slot Veh_ID Pwr Group_ID
1 1 100 10 100_1
1 2 100 12 100_1
2 1 100 3 100_1
2 2 100 13 100_1
3 1 100 22 100_1
3 2 100 13 100_1
1 1 55 8 55_1
1 2 55 2 55_1
2 1 55 12 55_1
2 2 55 11 55_1
6000 1 100 7 100_2
6000 2 100 6 100_2
6001 1 100 11 100_2
6001 2 100 14 100_2
6001 1 55 7 55_2
6001 2 55 9 55_2
6002 1 55 9 55_2
6002 2 55 13 55_2
6003 1 55 10 55_2
6003 2 55 9 55_2
Here the Group_IDs are generated per Veh_ID, and records are grouped together wherever the timestamp gaps (np.diff) are < 5000. While I can describe it, I can’t get the code to work in Python.
The code I have is just too complicated, and I’m sure pandas provides a more elegant way to do this:
df.insert(loc=4, column='Group_ID', value=0)
for v in df.Veh_ID.unique():
    diffs = np.diff(df[df.Veh_ID == v].Time)
    change_indxs = np.where(diffs > 5000)[0]  # Each location is the last of the group
    start_indx = df.index[0]
    last_indx = df.index[-1]
    abreak = last_indx
    for i, abreak in enumerate(change_indxs):
        aChunk = df.loc[start_indx:abreak + 1]
        aChunk['Group_ID'].loc[start_indx:abreak + 1] = str(v) + str(i)
which doesn’t work and raises: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame
I’m a bit stumped.
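(Aside: the warning comes from the chained indexing in `aChunk['Group_ID'].loc[...]`, which writes to a possibly temporary copy. Assigning through a single `df.loc[mask, column]` call avoids it. A minimal sketch of a loop-based fix along those lines, re-entering the sample data above:)

```python
import pandas as pd

df = pd.DataFrame({
    'Time': [1, 1, 2, 2, 3, 3, 1, 1, 2, 2,
             6000, 6000, 6001, 6001, 6001, 6001, 6002, 6002, 6003, 6003],
    'Time_slot': [1, 2] * 10,
    'Veh_ID': [100] * 6 + [55] * 4 + [100] * 4 + [55] * 6,
    'Pwr': [10, 12, 3, 13, 22, 13, 8, 2, 12, 11,
            7, 6, 11, 14, 7, 9, 9, 13, 10, 9],
})

df['Group_ID'] = ''
for v in df['Veh_ID'].unique():
    mask = df['Veh_ID'] == v
    times = df.loc[mask, 'Time']
    # Start a new group whenever the gap to the previous row
    # of the same Veh_ID is >= 5000 (first row diffs to NaN -> no break)
    group_no = times.diff().ge(5000).cumsum() + 1
    # One df.loc write per Veh_ID -- no SettingWithCopyWarning
    df.loc[mask, 'Group_ID'] = str(v) + '_' + group_no.astype(str)
```
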
>Solution :
A simple way (this bins the absolute Time into fixed 5000-wide buckets, which happens to line up with the sample data):
gid = df['Time'].floordiv(5000).add(1).astype(str)
df['Group_ID'] = df['Veh_ID'].astype(str) + '_' + gid
print(df)
# Output
Time Time_slot Veh_ID Pwr Group_ID
0 1 1 100 10 100_1
1 1 2 100 12 100_1
2 2 1 100 3 100_1
3 2 2 100 13 100_1
4 3 1 100 22 100_1
5 3 2 100 13 100_1
6 1 1 55 8 55_1
7 1 2 55 2 55_1
8 2 1 55 12 55_1
9 2 2 55 11 55_1
10 6000 1 100 7 100_2
11 6000 2 100 6 100_2
12 6001 1 100 11 100_2
13 6001 2 100 14 100_2
14 6001 1 55 7 55_2
15 6001 2 55 9 55_2
16 6002 1 55 9 55_2
17 6002 2 55 13 55_2
18 6003 1 55 10 55_2
19 6003 2 55 9 55_2
A more robust way, using the row-to-row Time difference within each Veh_ID:
new_group = lambda x: x.diff().floordiv(5000).cumsum().fillna(0).astype(int).add(1)
gid = df.groupby('Veh_ID')['Time'].transform(new_group).astype(str)
df['Group_ID'] = df['Veh_ID'].astype(str) + '_' + gid
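As a quick sanity check, the transform version run end-to-end on the sample data (re-entered here) reproduces the expected Group_ID column:

```python
import pandas as pd

df = pd.DataFrame({
    'Time': [1, 1, 2, 2, 3, 3, 1, 1, 2, 2,
             6000, 6000, 6001, 6001, 6001, 6001, 6002, 6002, 6003, 6003],
    'Time_slot': [1, 2] * 10,
    'Veh_ID': [100] * 6 + [55] * 4 + [100] * 4 + [55] * 6,
    'Pwr': [10, 12, 3, 13, 22, 13, 8, 2, 12, 11,
            7, 6, 11, 14, 7, 9, 9, 13, 10, 9],
})

# Within each Veh_ID: diff the times, bump the counter whenever the
# gap reaches a new multiple of 5000, and number the groups from 1
new_group = lambda x: x.diff().floordiv(5000).cumsum().fillna(0).astype(int).add(1)
gid = df.groupby('Veh_ID')['Time'].transform(new_group).astype(str)
df['Group_ID'] = df['Veh_ID'].astype(str) + '_' + gid

print(df['Group_ID'].unique().tolist())
# ['100_1', '55_1', '100_2', '55_2']
```

One detail worth noting: with `floordiv`, a single gap of 10000 or more increments the counter by more than 1, so group numbers stay unique but can skip values; replacing `floordiv(5000)` with `ge(5000)` (a boolean, summed) would keep them strictly consecutive.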