Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

How to get rows with consecutive dates with groupby in pandas

I have the following pandas dataframe

import pandas as pd
pd.DataFrame({
    'id': [1,1,1,1,1,
           2,2,2,2,2,
           3,3,3,3,3],
    'week': ['2022-W9','2022-W10', '2022-W11', '2022-W15', '2022-W17',
            '2022-W10','2022-W11', '2022-W15', '2022-W19', '2022-W24',
            '2022-W1','2022-W3', '2022-W19', '2022-W20', '2022-W42']
})
id  week
0   1   2022-W9
1   1   2022-W10
2   1   2022-W11
3   1   2022-W15
4   1   2022-W17
5   2   2022-W10
6   2   2022-W11
7   2   2022-W15
8   2   2022-W19
9   2   2022-W24
10  3   2022-W1
11  3   2022-W3
12  3   2022-W19
13  3   2022-W20
14  3   2022-W42

I would like to get only the rows that have consecutive weeks, by id.

The output should be this

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

pd.DataFrame({
    'id': [1,1,1,
           2,2,
           3,3],
    'week': ['2022-W9','2022-W10', '2022-W11', 
            '2022-W10','2022-W11', 
            '2022-W19', '2022-W20']
})
id  week
0   1   2022-W9
1   1   2022-W10
2   1   2022-W11
3   2   2022-W10
4   2   2022-W11
5   3   2022-W19
6   3   2022-W20

How could I do that ?

>Solution :

You can convert your week to_datetime, use a custom groupby.transform with diff to identify the successive weeks and boolean indexing:

date = pd.to_datetime(df['week']+'-1', format='%Y-W%W-%w')
mask = (date.groupby(df['id'])
            .transform(lambda s: (m:=s.diff().eq('7d')) | m.shift(-1))
        )

out = df[mask]

With python version older than 3.8, use:

date = pd.to_datetime(df['week']+'-1', format='%Y-W%W-%w')

def select_consecutive(s):
    m = s.diff().eq('7d')
    return m | m.shift(-1)

mask = date.groupby(df['id']).transform(select_consecutive)
out = df[mask]

Output:

    id      week
0    1   2022-W9
1    1  2022-W10
2    1  2022-W11
5    2  2022-W10
6    2  2022-W11
12   3  2022-W19
13   3  2022-W20

Intermediates:

    id      week       date     diff      m m.shift(-1)   mask
0    1   2022-W9 2022-02-28      NaT  False        True   True
1    1  2022-W10 2022-03-07   7 days   True        True   True
2    1  2022-W11 2022-03-14   7 days   True       False   True
3    1  2022-W15 2022-04-11  28 days  False       False  False
4    1  2022-W17 2022-04-25  14 days  False         NaN  False
5    2  2022-W10 2022-03-07      NaT  False        True   True
6    2  2022-W11 2022-03-14   7 days   True       False   True
7    2  2022-W15 2022-04-11  28 days  False       False  False
8    2  2022-W19 2022-05-09  28 days  False       False  False
9    2  2022-W24 2022-06-13  35 days  False         NaN  False
10   3   2022-W1 2022-01-03      NaT  False       False  False
11   3   2022-W3 2022-01-17  14 days  False       False  False
12   3  2022-W19 2022-05-09 112 days  False        True   True
13   3  2022-W20 2022-05-16   7 days   True       False   True
14   3  2022-W42 2022-10-17 154 days  False         NaN  False

Variant with weekly periods (should be slower but can be useful to generalize to non-fixed periods):

date = pd.to_datetime(df['week']+'-1', format='%Y-W%W-%w').dt.to_period('W')
def select_consecutive(s):
    m=s.diff().apply(lambda x: x if pd.isna(x) else x.n).eq(1)
    return m | m.shift(-1)

mask = date.groupby(df['id']).transform(select_consecutive)
out = df[mask]
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading