I have a dataframe of id numbers and dates:
import pandas as pd
df = pd.DataFrame([['1','01/01/2000'], ['1','01/07/2002'],['1', '04/05/2003'],
['2','01/05/2010'], ['2','08/08/2009'],
['3','12/11/2008']], columns=['id','start_date'])
df
  id  start_date
0  1  01/01/2000
1  1  01/07/2002
2  1  04/05/2003
3  2  01/05/2010
4  2  08/08/2009
5  3  12/11/2008
I am looking for a way to keep, for each id, only the first two dates (i.e. the two earliest dates).
For the example above, the output would be:
  id  start_date
0  1  01/01/2000
1  1  01/07/2002
2  2  08/08/2009
3  2  01/05/2010
4  3  12/11/2008
Thanks!
>Solution :
Full code:
import pandas as pd
df = pd.DataFrame([['1','01/01/2000'], ['1','01/07/2002'],['1', '04/05/2003'],
['2','01/05/2010'], ['2','08/08/2009'],
['3','12/11/2008']], columns=['id','start_date'])
# 1. convert the 'start_date' column to datetime
df['start_date'] = pd.to_datetime(df['start_date'])
# 2. sort the dataframe in ascending order by 'start_date'
df.sort_values(by='start_date', ascending=True, inplace=True)
# 3. select only the first two occurrences of each id
df.groupby('id').head(2)
Output:
  id start_date
0  1 2000-01-01
1  1 2002-01-07
5  3 2008-12-11
4  2 2009-08-08
3  2 2010-01-05
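Note that the output above is ordered by date, not by id, so it does not match the layout requested in the question. If the rows should come out grouped by id with a fresh index, a variant along these lines might work. It is only a sketch: the `format='%m/%d/%Y'` argument is an assumption (month first, which is consistent with how the dates were parsed above), and `result` is just an illustrative name.

```python
import pandas as pd

df = pd.DataFrame(
    [['1', '01/01/2000'], ['1', '01/07/2002'], ['1', '04/05/2003'],
     ['2', '01/05/2010'], ['2', '08/08/2009'],
     ['3', '12/11/2008']],
    columns=['id', 'start_date'],
)

# Assumption: the strings are month/day/year. Change the format if they are day/month/year.
df['start_date'] = pd.to_datetime(df['start_date'], format='%m/%d/%Y')

# Sort by id first, then by date, so each id's rows are contiguous and ascending.
# The ids are strings here, so they sort lexicographically.
result = (
    df.sort_values(['id', 'start_date'])
      .groupby('id')
      .head(2)                  # keep the two earliest rows of each id
      .reset_index(drop=True)   # renumber the rows 0..n-1
)
print(result)
```

Ids with only one row, such as id 3 here, are kept as-is, since `head(2)` returns at most two rows per group.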