I’m trying to group messages that were sent shortly after one another. A parameter defines the maximum gap between consecutive messages for them to count as part of the same block. Each message added to a block extends the time window, so later messages can still join the block.
Example Input
| | datetime | message |
|---|---|---|
| 0 | 2023-01-01 12:00:00 | A |
| 1 | 2023-01-01 12:20:00 | B |
| 2 | 2023-01-01 12:30:00 | C |
| 3 | 2023-01-01 12:30:55 | D |
| 4 | 2023-01-01 12:31:20 | E |
| 5 | 2023-01-01 15:00:00 | F |
| 6 | 2023-01-01 15:30:30 | G |
| 7 | 2023-01-01 15:30:55 | H |
Expected output for the parameter set to 1min
| | datetime | message | datetime_last | n_block |
|---|---|---|---|---|
| 0 | 2023-01-01 12:00:00 | A | 2023-01-01 12:00:00 | 1 |
| 1 | 2023-01-01 12:20:00 | B | 2023-01-01 12:20:00 | 1 |
| 2 | 2023-01-01 12:30:00 | C\nD\nE | 2023-01-01 12:31:20 | 3 |
| 3 | 2023-01-01 15:00:00 | F | 2023-01-01 15:00:00 | 1 |
| 4 | 2023-01-01 15:30:30 | G\nH | 2023-01-01 15:30:55 | 2 |
My failing attempt
I was hoping to achieve this with a rolling window that would continuously append the message rows.
def join_messages(x):
    return '\n'.join(x)

df.rolling(window='1min', on='datetime').agg({
    'datetime': ['first', 'last'],
    'message': [join_messages, 'count']})  # somehow overwrite datetime with the aggregated datetime.first
Both aggregations fail with ValueError: invalid on specified as datetime, must be a column (of DataFrame), an Index or None.
I don’t see a clean way to make datetime "accessible" inside the window. Besides, rolling does not work well with strings either. I have the impression that this is a dead end and that there is a cleaner approach.
Snippets for input and expected data
df = pd.DataFrame({
'datetime': [pd.Timestamp('2023-01-01 12:00'),
pd.Timestamp('2023-01-01 12:20'),
pd.Timestamp('2023-01-01 12:30:00'),
pd.Timestamp('2023-01-01 12:30:55'),
pd.Timestamp('2023-01-01 12:31:20'),
pd.Timestamp('2023-01-01 15:00'),
pd.Timestamp('2023-01-01 15:30:30'),
pd.Timestamp('2023-01-01 15:30:55'),],
'message': list('ABCDEFGH')})
df_expected = pd.DataFrame({
'datetime': [pd.Timestamp('2023-01-01 12:00'),
pd.Timestamp('2023-01-01 12:20'),
pd.Timestamp('2023-01-01 12:30:00'),
pd.Timestamp('2023-01-01 15:00'),
pd.Timestamp('2023-01-01 15:30:30'),],
'message': ['A', 'B', 'C\nD\nE', 'F', 'G\nH'],
'datetime_last': [pd.Timestamp('2023-01-01 12:00'),
pd.Timestamp('2023-01-01 12:20'),
pd.Timestamp('2023-01-01 12:31:20'),
pd.Timestamp('2023-01-01 15:00'),
pd.Timestamp('2023-01-01 15:30:55'),],
'n_block': [1, 1, 3, 1, 2]})
> Solution:
Compare each datetime with the previous one to flag the rows where the difference exceeds 1 minute, then take a cumulative sum of the flags to assign a block id to each row. Finally, group the dataframe by these block ids and aggregate to get the result:
m = df['datetime'].diff() > pd.Timedelta(minutes=1)
df.groupby(m.cumsum(), as_index=False).agg(datetime=('datetime', 'first'),
datetime_last=('datetime', 'last'),
message=('message', '\n'.join),
n_block=('message', 'count'))
datetime datetime_last message n_block
0 2023-01-01 12:00:00 2023-01-01 12:00:00 A 1
1 2023-01-01 12:20:00 2023-01-01 12:20:00 B 1
2 2023-01-01 12:30:00 2023-01-01 12:31:20 C\nD\nE 3
3 2023-01-01 15:00:00 2023-01-01 15:00:00 F 1
4 2023-01-01 15:30:30 2023-01-01 15:30:55 G\nH 2