Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Calculate difference to the last date

I have a problem. I want to get the difference of the last date. For example 2021-03-22 until the next date (2021-03-18) it is 4 days. I want to calculate the difference in days between the row date and the last date for the customerId. So the complete calculation should be for each customer. And the last date should be None because I dont have any older date. The problem is that if the same date occurs more than once, the second date becomes 0. It should look again at when the previous date was and not take the current date.

Dataframe

    customerId    fromDate otherInformation
0            1  2021-02-22              Cat
1            1  2021-02-22              Dog
2            1  2021-03-18          Elefant
3            1  2021-03-18              Cat
4            1  2021-03-18              Cat
5            1  2021-03-22              Cat
6            1  2021-02-10              Cat
7            1  2021-09-07              Cat
8            1        None          Elefant
9            1  2022-01-18             Fish
10           2  2021-05-17             Fish

Code

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

import pandas as pd


d = {'customerId': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2],
     'fromDate': ['2021-02-22','2021-02-22', '2021-03-18','2021-03-18', '2021-03-18', '2021-03-22', 
'2021-02-10', '2021-09-07', None, '2022-01-18', '2021-05-17'],
     'otherInformation': ['Cat', 'Dog', 'Elefant', 'Cat', 'Cat','Cat', 'Cat', 'Cat', 'Elefant', 'Fish', 'Fish']
    }
df = pd.DataFrame(data=d)
print(df)
df['fromDate'] = pd.to_datetime(df['fromDate'], errors='coerce')
df['lastindays'] = df['fromDate'] - df.sort_values('fromDate').groupby('customerId')['fromDate'].shift()
print(df)

What I have

    customerId   fromDate otherInformation lastindays
0            1 2021-02-22              Cat    12 days
1            1 2021-02-22              Dog     0 days
2            1 2021-03-18          Elefant    24 days
3            1 2021-03-18              Cat     0 days
4            1 2021-03-18              Cat     0 days
5            1 2021-03-22              Cat     4 days
6            1 2021-02-10              Cat        NaT
7            1 2021-09-07              Cat   169 days
8            1        NaT          Elefant        NaT
9            1 2022-01-18             Fish   133 days
10           2 2021-05-17             Fish        NaT

What I want

    customerId   fromDate otherInformation lastindays
0            1 2021-02-22              Cat    12 days
1            1 2021-02-22              Dog    12 days # from 0 -> 12
2            1 2021-03-18          Elefant    24 days
3            1 2021-03-18              Cat    24 days # from 0 -> 24
4            1 2021-03-18              Cat    24 days # from 0 -> 24
5            1 2021-03-22              Cat     4 days
6            1 2021-02-10              Cat        NaT
7            1 2021-09-07              Cat   169 days
8            1        NaT          Elefant        NaT
9            1 2022-01-18             Fish   133 days
10           2 2021-05-17             Fish        NaT

>Solution :

Add DataFrame.drop_duplicates with GroupBy.ffill:

df['fromDate'] = pd.to_datetime(df['fromDate'], errors='coerce')
df['lastindays'] = df['fromDate'] - df.sort_values('fromDate').drop_duplicates(['customerId','fromDate']).groupby('customerId')['fromDate'].shift()
df['lastindays'] = df.groupby(['customerId','fromDate'])['lastindays'].ffill()
print(df)
    customerId   fromDate otherInformation lastindays
0            1 2021-02-22              Cat    12 days
1            1 2021-02-22              Dog    12 days
2            1 2021-03-18          Elefant    24 days
3            1 2021-03-18              Cat    24 days
4            1 2021-03-18              Cat    24 days
5            1 2021-03-22              Cat     4 days
6            1 2021-02-10              Cat        NaT
7            1 2021-09-07              Cat   169 days
8            1        NaT          Elefant        NaT
9            1 2022-01-18             Fish   133 days
10           2 2021-05-17             Fish        NaT

Another idea is create helper DataFrame without duplicates, use DataFrameGroupBy.diff and for new column left join:

df1 = df.sort_values('fromDate').drop_duplicates(['customerId','fromDate'])
df1['lastindays'] = df1.groupby('customerId')['fromDate'].diff()
df = df.merge(df1[['lastindays','customerId','fromDate']], on=['customerId','fromDate'], how='left')
print(df)
    customerId   fromDate otherInformation lastindays
0            1 2021-02-22              Cat    12 days
1            1 2021-02-22              Dog    12 days
2            1 2021-03-18          Elefant    24 days
3            1 2021-03-18              Cat    24 days
4            1 2021-03-18              Cat    24 days
5            1 2021-03-22              Cat     4 days
6            1 2021-02-10              Cat        NaT
7            1 2021-09-07              Cat   169 days
8            1        NaT          Elefant        NaT
9            1 2022-01-18             Fish   133 days
10           2 2021-05-17             Fish        NaT
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading