Filter pandas dataframe for first N unique values per group

September 13, 2023

I need to filter a large dataset of variables with entries for multiple dates. In this instance I want to keep only data entered on the very first date.

For example in the dataset below:

dfex = pd.DataFrame({'names':['jim','jim','jim','jim','jim','jim','jim','jim','jim',
                           'bob','bob','bob','bob','bob','bob',
                           'sara','sara','sara','sara','sara','sara','sara','sara','sara','sara'],
                  'dates':['01-01-19','01-01-19','01-01-19','01-05-19','01-06-19','01-07-19','01-08-19','01-09-19','01-10-19',
                           '01-05-19','01-05-19','01-07-19','01-08-19','01-09-19','01-10-19',
                           '01-02-19','01-02-19','01-02-19','01-02-19','01-05-19','01-06-19','01-07-19','01-08-19','01-09-19','01-10-19']})

dfex['dates'] = pd.to_datetime(dfex['dates'])
dfex

jim would keep the first 3 rows, bob the first 2 and sara the first 5.

>Solution :

For N = 1 (original question) use groupby.transform and boolean indexing:

out = dfex[dfex.groupby('names')['dates'].transform('first').eq(dfex['dates'])]

If the initial date can be repeated later on in the group and you don"t want to keep those further occurrences, use:

out = dfex[dfex.groupby('names')['dates']
           .transform(lambda s: s.ne(s.shift()).cumsum()).eq(1)]

Output:

   names      dates
0    jim 2019-01-01
1    jim 2019-01-01
2    jim 2019-01-01
9    bob 2019-01-05
10   bob 2019-01-05
15  sara 2019-01-02
16  sara 2019-01-02
17  sara 2019-01-02
18  sara 2019-01-02

first `N` unique dates

Irrespective of their values (just by position)

N = 5
out = dfex[dfex.groupby('names')['dates']
           .transform(lambda s: pd. factorize(s)[0]<N)]

Or for the first N oldest date:

N = 5
out = dfex[dfex.groupby('names')['dates']
           .rank('dense').le(N)]

Output:

   names      dates
0    jim 2019-01-01
1    jim 2019-01-01
2    jim 2019-01-01
3    jim 2019-01-05
4    jim 2019-01-06
5    jim 2019-01-07
6    jim 2019-01-08
9    bob 2019-01-05
10   bob 2019-01-05
11   bob 2019-01-07
12   bob 2019-01-08
13   bob 2019-01-09
14   bob 2019-01-10
15  sara 2019-01-02
16  sara 2019-01-02
17  sara 2019-01-02
18  sara 2019-01-02
19  sara 2019-01-05
20  sara 2019-01-06
21  sara 2019-01-07
22  sara 2019-01-08

for the last `N` unique

N = 5
out = dfex[dfex[::-1].groupby('names')['dates']
           .transform(lambda s: pd. factorize(s)[0]<N)]

Or N newest dates:

N = 5
out = dfex[dfex.groupby('names')['dates']
           .rank('dense', ascending=False).le(N)]

Output:

   names      dates
4    jim 2019-01-06
5    jim 2019-01-07
6    jim 2019-01-08
7    jim 2019-01-09
8    jim 2019-01-10
9    bob 2019-01-05
10   bob 2019-01-05
11   bob 2019-01-07
12   bob 2019-01-08
13   bob 2019-01-09
14   bob 2019-01-10
20  sara 2019-01-06
21  sara 2019-01-07
22  sara 2019-01-08
23  sara 2019-01-09
24  sara 2019-01-10