I have the following pandas dataframe
import pandas as pd
pd.DataFrame({
'id': [1,1,1,1,1,
2,2,2,2,2,
3,3,3,3,3],
'week': ['2022-W9','2022-W10', '2022-W11', '2022-W15', '2022-W17',
'2022-W10','2022-W11', '2022-W15', '2022-W19', '2022-W24',
'2022-W1','2022-W3', '2022-W19', '2022-W20', '2022-W42']
})
id week
0 1 2022-W9
1 1 2022-W10
2 1 2022-W11
3 1 2022-W15
4 1 2022-W17
5 2 2022-W10
6 2 2022-W11
7 2 2022-W15
8 2 2022-W19
9 2 2022-W24
10 3 2022-W1
11 3 2022-W3
12 3 2022-W19
13 3 2022-W20
14 3 2022-W42
I would like to get only the rows that have consecutive weeks, by id.
The output should be this
pd.DataFrame({
'id': [1,1,1,
2,2,
3,3],
'week': ['2022-W9','2022-W10', '2022-W11',
'2022-W10','2022-W11',
'2022-W19', '2022-W20']
})
id week
0 1 2022-W9
1 1 2022-W10
2 1 2022-W11
3 2 2022-W10
4 2 2022-W11
5 3 2022-W19
6 3 2022-W20
How could I do that ?
>Solution :
You can convert your week to_datetime, use a custom groupby.transform with diff to identify the successive weeks and boolean indexing:
date = pd.to_datetime(df['week']+'-1', format='%Y-W%W-%w')
mask = (date.groupby(df['id'])
.transform(lambda s: (m:=s.diff().eq('7d')) | m.shift(-1))
)
out = df[mask]
With python version older than 3.8, use:
date = pd.to_datetime(df['week']+'-1', format='%Y-W%W-%w')
def select_consecutive(s):
m = s.diff().eq('7d')
return m | m.shift(-1)
mask = date.groupby(df['id']).transform(select_consecutive)
out = df[mask]
Output:
id week
0 1 2022-W9
1 1 2022-W10
2 1 2022-W11
5 2 2022-W10
6 2 2022-W11
12 3 2022-W19
13 3 2022-W20
Intermediates:
id week date diff m m.shift(-1) mask
0 1 2022-W9 2022-02-28 NaT False True True
1 1 2022-W10 2022-03-07 7 days True True True
2 1 2022-W11 2022-03-14 7 days True False True
3 1 2022-W15 2022-04-11 28 days False False False
4 1 2022-W17 2022-04-25 14 days False NaN False
5 2 2022-W10 2022-03-07 NaT False True True
6 2 2022-W11 2022-03-14 7 days True False True
7 2 2022-W15 2022-04-11 28 days False False False
8 2 2022-W19 2022-05-09 28 days False False False
9 2 2022-W24 2022-06-13 35 days False NaN False
10 3 2022-W1 2022-01-03 NaT False False False
11 3 2022-W3 2022-01-17 14 days False False False
12 3 2022-W19 2022-05-09 112 days False True True
13 3 2022-W20 2022-05-16 7 days True False True
14 3 2022-W42 2022-10-17 154 days False NaN False
Variant with weekly periods (should be slower but can be useful to generalize to non-fixed periods):
date = pd.to_datetime(df['week']+'-1', format='%Y-W%W-%w').dt.to_period('W')
def select_consecutive(s):
m=s.diff().apply(lambda x: x if pd.isna(x) else x.n).eq(1)
return m | m.shift(-1)
mask = date.groupby(df['id']).transform(select_consecutive)
out = df[mask]