Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Filter pandas dataframe for first N unique values per group

I need to filter a large dataset of variables with entries for multiple dates. In this instance I want to keep only data entered on the very first date.

For example in the dataset below:

dfex = pd.DataFrame({'names':['jim','jim','jim','jim','jim','jim','jim','jim','jim',
                           'bob','bob','bob','bob','bob','bob',
                           'sara','sara','sara','sara','sara','sara','sara','sara','sara','sara'],
                  'dates':['01-01-19','01-01-19','01-01-19','01-05-19','01-06-19','01-07-19','01-08-19','01-09-19','01-10-19',
                           '01-05-19','01-05-19','01-07-19','01-08-19','01-09-19','01-10-19',
                           '01-02-19','01-02-19','01-02-19','01-02-19','01-05-19','01-06-19','01-07-19','01-08-19','01-09-19','01-10-19']})

dfex['dates'] = pd.to_datetime(dfex['dates'])
dfex

jim would keep the first 3 rows, bob the first 2 and sara the first 5.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

>Solution :

For N = 1 (original question) use groupby.transform and boolean indexing:

out = dfex[dfex.groupby('names')['dates'].transform('first').eq(dfex['dates'])]

If the initial date can be repeated later on in the group and you don"t want to keep those further occurrences, use:

out = dfex[dfex.groupby('names')['dates']
           .transform(lambda s: s.ne(s.shift()).cumsum()).eq(1)]

Output:

   names      dates
0    jim 2019-01-01
1    jim 2019-01-01
2    jim 2019-01-01
9    bob 2019-01-05
10   bob 2019-01-05
15  sara 2019-01-02
16  sara 2019-01-02
17  sara 2019-01-02
18  sara 2019-01-02

first N unique dates

Irrespective of their values (just by position)

N = 5
out = dfex[dfex.groupby('names')['dates']
           .transform(lambda s: pd. factorize(s)[0]<N)]

Or for the first N oldest date:

N = 5
out = dfex[dfex.groupby('names')['dates']
           .rank('dense').le(N)]

Output:

   names      dates
0    jim 2019-01-01
1    jim 2019-01-01
2    jim 2019-01-01
3    jim 2019-01-05
4    jim 2019-01-06
5    jim 2019-01-07
6    jim 2019-01-08
9    bob 2019-01-05
10   bob 2019-01-05
11   bob 2019-01-07
12   bob 2019-01-08
13   bob 2019-01-09
14   bob 2019-01-10
15  sara 2019-01-02
16  sara 2019-01-02
17  sara 2019-01-02
18  sara 2019-01-02
19  sara 2019-01-05
20  sara 2019-01-06
21  sara 2019-01-07
22  sara 2019-01-08

for the last N unique

N = 5
out = dfex[dfex[::-1].groupby('names')['dates']
           .transform(lambda s: pd. factorize(s)[0]<N)]

Or N newest dates:

N = 5
out = dfex[dfex.groupby('names')['dates']
           .rank('dense', ascending=False).le(N)]

Output:

   names      dates
4    jim 2019-01-06
5    jim 2019-01-07
6    jim 2019-01-08
7    jim 2019-01-09
8    jim 2019-01-10
9    bob 2019-01-05
10   bob 2019-01-05
11   bob 2019-01-07
12   bob 2019-01-08
13   bob 2019-01-09
14   bob 2019-01-10
20  sara 2019-01-06
21  sara 2019-01-07
22  sara 2019-01-08
23  sara 2019-01-09
24  sara 2019-01-10
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading