Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

How do I group into different dates based on change in another column values in Pandas

I have data that looks like this

df = pd.DataFrame({'ID': [1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
 'DATE': ['1/1/2015','1/2/2015', '1/3/2015','1/4/2015','1/5/2015','1/6/2015','1/7/2015','1/8/2015',
  '1/9/2016','1/2/2015','1/3/2015','1/4/2015','1/5/2015','1/6/2015','1/7/2015'],
 'CD': ['A','A','A','A','B','B','A','A','C','A','A','A','A','A','A']})

enter image description here

What I would like to do is group by ID and CD and get the start and stop change for each change. I tried using groupby and agg function but it will group all A together even though they needs to be separated since there is B in between 2 A.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

df1 = df.groupby(['ID','CD'])
df1 = df1.agg(
    Start_Date = ('Date',np.min),
    End_Date=('Date', np.min)
).reset_index()

What I get is :

enter image description here

I was hoping if some one could help me get the result I need. What I am looking for is :

enter image description here

>Solution :

make grouper for grouping

grouper = df['CD'].ne(df['CD'].shift(1)).cumsum()

grouper:

0     1
1     1
2     1
3     1
4     2
5     2
6     3
7     3
8     4
9     5
10    5
11    5
12    5
13    5
14    5
Name: CD, dtype: int32

then use groupby with grouper

df.groupby(['ID', grouper, 'CD'])['DATE'].agg([min, max]).droplevel(1)

output:

        min         max
ID  CD      
1   A   1/1/2015    1/4/2015
    B   1/5/2015    1/6/2015
    A   1/7/2015    1/8/2015
    C   1/9/2016    1/9/2016
2   A   1/2/2015    1/7/2015

change column name and use reset_index and so on..for your desired output

(df.groupby(['ID', grouper, 'CD'])['DATE'].agg([min, max]).droplevel(1)
 .set_axis(['Start_Date', 'End_Date'], axis=1)
 .reset_index()
 .assign(CD=lambda x: x.pop('CD')))

result

    ID  Start_Date  End_Date    CD
0   1   1/1/2015    1/4/2015    A
1   1   1/5/2015    1/6/2015    B
2   1   1/7/2015    1/8/2015    A
3   1   1/9/2016    1/9/2016    C
4   2   1/2/2015    1/7/2015    A
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading