How to get rows with consecutive dates with groupby in pandas

January 17, 2024

I have the following pandas dataframe

import pandas as pd
pd.DataFrame({
    'id': [1,1,1,1,1,
           2,2,2,2,2,
           3,3,3,3,3],
    'week': ['2022-W9','2022-W10', '2022-W11', '2022-W15', '2022-W17',
            '2022-W10','2022-W11', '2022-W15', '2022-W19', '2022-W24',
            '2022-W1','2022-W3', '2022-W19', '2022-W20', '2022-W42']
})
id  week
0   1   2022-W9
1   1   2022-W10
2   1   2022-W11
3   1   2022-W15
4   1   2022-W17
5   2   2022-W10
6   2   2022-W11
7   2   2022-W15
8   2   2022-W19
9   2   2022-W24
10  3   2022-W1
11  3   2022-W3
12  3   2022-W19
13  3   2022-W20
14  3   2022-W42

I would like to get only the rows that have consecutive weeks, by id.

The output should be this

pd.DataFrame({
    'id': [1,1,1,
           2,2,
           3,3],
    'week': ['2022-W9','2022-W10', '2022-W11', 
            '2022-W10','2022-W11', 
            '2022-W19', '2022-W20']
})
id  week
0   1   2022-W9
1   1   2022-W10
2   1   2022-W11
3   2   2022-W10
4   2   2022-W11
5   3   2022-W19
6   3   2022-W20

How could I do that ?

>Solution :

You can convert your week to_datetime, use a custom groupby.transform with diff to identify the successive weeks and boolean indexing:

date = pd.to_datetime(df['week']+'-1', format='%Y-W%W-%w')
mask = (date.groupby(df['id'])
            .transform(lambda s: (m:=s.diff().eq('7d')) | m.shift(-1))
        )

out = df[mask]

With python version older than 3.8, use:

date = pd.to_datetime(df['week']+'-1', format='%Y-W%W-%w')

def select_consecutive(s):
    m = s.diff().eq('7d')
    return m | m.shift(-1)

mask = date.groupby(df['id']).transform(select_consecutive)
out = df[mask]

Output:

    id      week
0    1   2022-W9
1    1  2022-W10
2    1  2022-W11
5    2  2022-W10
6    2  2022-W11
12   3  2022-W19
13   3  2022-W20

Intermediates:

    id      week       date     diff      m m.shift(-1)   mask
0    1   2022-W9 2022-02-28      NaT  False        True   True
1    1  2022-W10 2022-03-07   7 days   True        True   True
2    1  2022-W11 2022-03-14   7 days   True       False   True
3    1  2022-W15 2022-04-11  28 days  False       False  False
4    1  2022-W17 2022-04-25  14 days  False         NaN  False
5    2  2022-W10 2022-03-07      NaT  False        True   True
6    2  2022-W11 2022-03-14   7 days   True       False   True
7    2  2022-W15 2022-04-11  28 days  False       False  False
8    2  2022-W19 2022-05-09  28 days  False       False  False
9    2  2022-W24 2022-06-13  35 days  False         NaN  False
10   3   2022-W1 2022-01-03      NaT  False       False  False
11   3   2022-W3 2022-01-17  14 days  False       False  False
12   3  2022-W19 2022-05-09 112 days  False        True   True
13   3  2022-W20 2022-05-16   7 days   True       False   True
14   3  2022-W42 2022-10-17 154 days  False         NaN  False

Variant with weekly periods (should be slower but can be useful to generalize to non-fixed periods):

date = pd.to_datetime(df['week']+'-1', format='%Y-W%W-%w').dt.to_period('W')
def select_consecutive(s):
    m=s.diff().apply(lambda x: x if pd.isna(x) else x.n).eq(1)
    return m | m.shift(-1)

mask = date.groupby(df['id']).transform(select_consecutive)
out = df[mask]