I need to filter a large dataset of variables with entries for multiple dates. In this instance I want to keep only data entered on the very first date.
For example in the dataset below:
dfex = pd.DataFrame({'names':['jim','jim','jim','jim','jim','jim','jim','jim','jim',
'bob','bob','bob','bob','bob','bob',
'sara','sara','sara','sara','sara','sara','sara','sara','sara','sara'],
'dates':['01-01-19','01-01-19','01-01-19','01-05-19','01-06-19','01-07-19','01-08-19','01-09-19','01-10-19',
'01-05-19','01-05-19','01-07-19','01-08-19','01-09-19','01-10-19',
'01-02-19','01-02-19','01-02-19','01-02-19','01-05-19','01-06-19','01-07-19','01-08-19','01-09-19','01-10-19']})
dfex['dates'] = pd.to_datetime(dfex['dates'])
dfex
jim would keep the first 3 rows, bob the first 2 and sara the first 5.
>Solution :
For N = 1 (original question) use groupby.transform and boolean indexing:
out = dfex[dfex.groupby('names')['dates'].transform('first').eq(dfex['dates'])]
If the initial date can be repeated later on in the group and you don"t want to keep those further occurrences, use:
out = dfex[dfex.groupby('names')['dates']
.transform(lambda s: s.ne(s.shift()).cumsum()).eq(1)]
Output:
names dates
0 jim 2019-01-01
1 jim 2019-01-01
2 jim 2019-01-01
9 bob 2019-01-05
10 bob 2019-01-05
15 sara 2019-01-02
16 sara 2019-01-02
17 sara 2019-01-02
18 sara 2019-01-02
first N unique dates
Irrespective of their values (just by position)
N = 5
out = dfex[dfex.groupby('names')['dates']
.transform(lambda s: pd. factorize(s)[0]<N)]
Or for the first N oldest date:
N = 5
out = dfex[dfex.groupby('names')['dates']
.rank('dense').le(N)]
Output:
names dates
0 jim 2019-01-01
1 jim 2019-01-01
2 jim 2019-01-01
3 jim 2019-01-05
4 jim 2019-01-06
5 jim 2019-01-07
6 jim 2019-01-08
9 bob 2019-01-05
10 bob 2019-01-05
11 bob 2019-01-07
12 bob 2019-01-08
13 bob 2019-01-09
14 bob 2019-01-10
15 sara 2019-01-02
16 sara 2019-01-02
17 sara 2019-01-02
18 sara 2019-01-02
19 sara 2019-01-05
20 sara 2019-01-06
21 sara 2019-01-07
22 sara 2019-01-08
for the last N unique
N = 5
out = dfex[dfex[::-1].groupby('names')['dates']
.transform(lambda s: pd. factorize(s)[0]<N)]
Or N newest dates:
N = 5
out = dfex[dfex.groupby('names')['dates']
.rank('dense', ascending=False).le(N)]
Output:
names dates
4 jim 2019-01-06
5 jim 2019-01-07
6 jim 2019-01-08
7 jim 2019-01-09
8 jim 2019-01-10
9 bob 2019-01-05
10 bob 2019-01-05
11 bob 2019-01-07
12 bob 2019-01-08
13 bob 2019-01-09
14 bob 2019-01-10
20 sara 2019-01-06
21 sara 2019-01-07
22 sara 2019-01-08
23 sara 2019-01-09
24 sara 2019-01-10