Aggregate string column close in time in pandas

I’m trying to group messages that were sent shortly after one another. A parameter defines the maximum gap between messages for them to be considered part of the same block. Each time a message is added to a block, the time window is extended, so later messages can still join the block.

Example Input

datetime message
0 2023-01-01 12:00:00 A
1 2023-01-01 12:20:00 B
2 2023-01-01 12:30:00 C
3 2023-01-01 12:30:55 D
4 2023-01-01 12:31:20 E
5 2023-01-01 15:00:00 F
6 2023-01-01 15:30:30 G
7 2023-01-01 15:30:55 H

Expected output for the parameter set to 1min

datetime message datetime_last n_block
0 2023-01-01 12:00:00 A 2023-01-01 12:00:00 1
1 2023-01-01 12:20:00 B 2023-01-01 12:20:00 1
2 2023-01-01 12:30:00 C\nD\nE 2023-01-01 12:31:20 3
3 2023-01-01 15:00:00 F 2023-01-01 15:00:00 1
4 2023-01-01 15:30:30 G\nH 2023-01-01 15:30:55 2

My failing attempt

I was hoping to achieve that with a rolling window, which would continuously append the message rows.

def join_messages(x):
    return '\n'.join(x)

df.rolling(window='1min', on='datetime').agg({
    'datetime': ['first', 'last'],
    'message': [join_messages, 'count']})  # Somehow overwrite datetime with the aggregated datetime.first.

Both aggregations fail with a ValueError: invalid on specified as datetime, must be a column (of DataFrame), an Index or None.

I don’t see a clean way to make datetime accessible inside the window. Besides, rolling does not handle string aggregation well either. I have the impression that this is a dead end and that there is a cleaner approach.

Snippets for input and expected data

df = pd.DataFrame({
    'datetime': [pd.Timestamp('2023-01-01 12:00'),
                 pd.Timestamp('2023-01-01 12:20'),
                 pd.Timestamp('2023-01-01 12:30:00'),
                 pd.Timestamp('2023-01-01 12:30:55'),
                 pd.Timestamp('2023-01-01 12:31:20'),
                 pd.Timestamp('2023-01-01 15:00'),
                 pd.Timestamp('2023-01-01 15:30:30'),
                 pd.Timestamp('2023-01-01 15:30:55'),],
    'message': list('ABCDEFGH')})


df_expected = pd.DataFrame({
    'datetime': [pd.Timestamp('2023-01-01 12:00'),
                 pd.Timestamp('2023-01-01 12:20'),
                 pd.Timestamp('2023-01-01 12:30:00'),
                 pd.Timestamp('2023-01-01 15:00'),
                 pd.Timestamp('2023-01-01 15:30:30'),],
    'message': ['A', 'B', 'C\nD\nE', 'F', 'G\nH'],
    'datetime_last': [pd.Timestamp('2023-01-01 12:00'),
                      pd.Timestamp('2023-01-01 12:20'),
                      pd.Timestamp('2023-01-01 12:31:20'),
                      pd.Timestamp('2023-01-01 15:00'),
                      pd.Timestamp('2023-01-01 15:30:55'),],
    'n_block': [1, 1, 3, 1, 2]})

Solution:

Compare each datetime with the previous one to flag the rows where the difference exceeds 1 minute, then take the cumulative sum of the flags to label the distinct blocks. Finally, group the dataframe by these labels and aggregate to get the result:

m = df['datetime'].diff() > pd.Timedelta(minutes=1)
df.groupby(m.cumsum(), as_index=False).agg(datetime=('datetime', 'first'),
                                           datetime_last=('datetime', 'last'),
                                           message=('message', '\n'.join),
                                           n_block=('message', 'count'))

             datetime       datetime_last  message  n_block
0 2023-01-01 12:00:00 2023-01-01 12:00:00        A        1
1 2023-01-01 12:20:00 2023-01-01 12:20:00        B        1
2 2023-01-01 12:30:00 2023-01-01 12:31:20  C\nD\nE        3
3 2023-01-01 15:00:00 2023-01-01 15:00:00        F        1
4 2023-01-01 15:30:30 2023-01-01 15:30:55     G\nH        2