I’m trying to group messages that were sent shortly after one another. A parameter defines the maximum gap between consecutive messages for them to count as part of the same block. Each message added to a block extends the time window, so later messages can still join the block.
Example Input
| | datetime | message |
|---|---|---|
| 0 | 2023-01-01 12:00:00 | A |
| 1 | 2023-01-01 12:20:00 | B |
| 2 | 2023-01-01 12:30:00 | C |
| 3 | 2023-01-01 12:30:55 | D |
| 4 | 2023-01-01 12:31:20 | E |
| 5 | 2023-01-01 15:00:00 | F |
| 6 | 2023-01-01 15:30:30 | G |
| 7 | 2023-01-01 15:30:55 | H |
Expected output for the parameter set to 1min
| | datetime | message | datetime_last | n_block |
|---|---|---|---|---|
| 0 | 2023-01-01 12:00:00 | A | 2023-01-01 12:00:00 | 1 |
| 1 | 2023-01-01 12:20:00 | B | 2023-01-01 12:20:00 | 1 |
| 2 | 2023-01-01 12:30:00 | C\nD\nE | 2023-01-01 12:31:20 | 3 |
| 3 | 2023-01-01 15:00:00 | F | 2023-01-01 15:00:00 | 1 |
| 4 | 2023-01-01 15:30:30 | G\nH | 2023-01-01 15:30:55 | 2 |
My failing attempt
I was hoping to achieve this with a rolling window that would continuously append the message rows.
def join_messages(x):
    return '\n'.join(x)

df.rolling(window='1min', on='datetime').agg({
    'datetime': ['first', 'last'],
    'message': [join_messages, 'count']})  # somehow overwrite datetime with the aggregated datetime.first
Both aggregations fail with ValueError: invalid on specified as datetime, must be a column (of DataFrame), an Index or None.
I don’t see a clean way to make datetime "accessible" inside the window. Besides, rolling does not work well with strings either. I have the impression that this is a dead end and that there is a cleaner approach.
Snippets for input and expected data
df = pd.DataFrame({
'datetime': [pd.Timestamp('2023-01-01 12:00'),
pd.Timestamp('2023-01-01 12:20'),
pd.Timestamp('2023-01-01 12:30:00'),
pd.Timestamp('2023-01-01 12:30:55'),
pd.Timestamp('2023-01-01 12:31:20'),
pd.Timestamp('2023-01-01 15:00'),
pd.Timestamp('2023-01-01 15:30:30'),
pd.Timestamp('2023-01-01 15:30:55'),],
'message': list('ABCDEFGH')})
df_expected = pd.DataFrame({
'datetime': [pd.Timestamp('2023-01-01 12:00'),
pd.Timestamp('2023-01-01 12:20'),
pd.Timestamp('2023-01-01 12:30:00'),
pd.Timestamp('2023-01-01 15:00'),
pd.Timestamp('2023-01-01 15:30:30'),],
'message': ['A', 'B', 'C\nD\nE', 'F', 'G\nH'],
'datetime_last': [pd.Timestamp('2023-01-01 12:00'),
pd.Timestamp('2023-01-01 12:20'),
pd.Timestamp('2023-01-01 12:31:20'),
pd.Timestamp('2023-01-01 15:00'),
pd.Timestamp('2023-01-01 15:30:55'),],
'n_block': [1, 1, 3, 1, 2]})
> Solution:
Compare each datetime with the previous one to flag the rows where the difference exceeds 1 minute, then take a cumulative sum of the flags to assign a block id to each row. Finally, group the dataframe by these block ids and aggregate to get the result:
m = df['datetime'].diff() > pd.Timedelta(minutes=1)
df.groupby(m.cumsum(), as_index=False).agg(datetime=('datetime', 'first'),
datetime_last=('datetime', 'last'),
message=('message', '\n'.join),
n_block=('message', 'count'))
datetime datetime_last message n_block
0 2023-01-01 12:00:00 2023-01-01 12:00:00 A 1
1 2023-01-01 12:20:00 2023-01-01 12:20:00 B 1
2 2023-01-01 12:30:00 2023-01-01 12:31:20 C\nD\nE 3
3 2023-01-01 15:00:00 2023-01-01 15:00:00 F 1
4 2023-01-01 15:30:30 2023-01-01 15:30:55 G\nH 2