I have a problem. I want to get the difference of the last date. For example 2021-03-22 until the next date (2021-03-18) it is 4 days. I want to calculate the difference in days between the row date and the last date for the customerId. So the complete calculation should be for each customer. And the last date should be None because I dont have any older date. The problem is that if the same date occurs more than once, the second date becomes 0. It should look again at when the previous date was and not take the current date.
Dataframe
customerId fromDate otherInformation
0 1 2021-02-22 Cat
1 1 2021-02-22 Dog
2 1 2021-03-18 Elefant
3 1 2021-03-18 Cat
4 1 2021-03-18 Cat
5 1 2021-03-22 Cat
6 1 2021-02-10 Cat
7 1 2021-09-07 Cat
8 1 None Elefant
9 1 2022-01-18 Fish
10 2 2021-05-17 Fish
Code
import pandas as pd
d = {'customerId': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2],
'fromDate': ['2021-02-22','2021-02-22', '2021-03-18','2021-03-18', '2021-03-18', '2021-03-22',
'2021-02-10', '2021-09-07', None, '2022-01-18', '2021-05-17'],
'otherInformation': ['Cat', 'Dog', 'Elefant', 'Cat', 'Cat','Cat', 'Cat', 'Cat', 'Elefant', 'Fish', 'Fish']
}
df = pd.DataFrame(data=d)
print(df)
df['fromDate'] = pd.to_datetime(df['fromDate'], errors='coerce')
df['lastindays'] = df['fromDate'] - df.sort_values('fromDate').groupby('customerId')['fromDate'].shift()
print(df)
What I have
customerId fromDate otherInformation lastindays
0 1 2021-02-22 Cat 12 days
1 1 2021-02-22 Dog 0 days
2 1 2021-03-18 Elefant 24 days
3 1 2021-03-18 Cat 0 days
4 1 2021-03-18 Cat 0 days
5 1 2021-03-22 Cat 4 days
6 1 2021-02-10 Cat NaT
7 1 2021-09-07 Cat 169 days
8 1 NaT Elefant NaT
9 1 2022-01-18 Fish 133 days
10 2 2021-05-17 Fish NaT
What I want
customerId fromDate otherInformation lastindays
0 1 2021-02-22 Cat 12 days
1 1 2021-02-22 Dog 12 days # from 0 -> 12
2 1 2021-03-18 Elefant 24 days
3 1 2021-03-18 Cat 24 days # from 0 -> 24
4 1 2021-03-18 Cat 24 days # from 0 -> 24
5 1 2021-03-22 Cat 4 days
6 1 2021-02-10 Cat NaT
7 1 2021-09-07 Cat 169 days
8 1 NaT Elefant NaT
9 1 2022-01-18 Fish 133 days
10 2 2021-05-17 Fish NaT
>Solution :
Add DataFrame.drop_duplicates with GroupBy.ffill:
df['fromDate'] = pd.to_datetime(df['fromDate'], errors='coerce')
df['lastindays'] = df['fromDate'] - df.sort_values('fromDate').drop_duplicates(['customerId','fromDate']).groupby('customerId')['fromDate'].shift()
df['lastindays'] = df.groupby(['customerId','fromDate'])['lastindays'].ffill()
print(df)
customerId fromDate otherInformation lastindays
0 1 2021-02-22 Cat 12 days
1 1 2021-02-22 Dog 12 days
2 1 2021-03-18 Elefant 24 days
3 1 2021-03-18 Cat 24 days
4 1 2021-03-18 Cat 24 days
5 1 2021-03-22 Cat 4 days
6 1 2021-02-10 Cat NaT
7 1 2021-09-07 Cat 169 days
8 1 NaT Elefant NaT
9 1 2022-01-18 Fish 133 days
10 2 2021-05-17 Fish NaT
Another idea is create helper DataFrame without duplicates, use DataFrameGroupBy.diff and for new column left join:
df1 = df.sort_values('fromDate').drop_duplicates(['customerId','fromDate'])
df1['lastindays'] = df1.groupby('customerId')['fromDate'].diff()
df = df.merge(df1[['lastindays','customerId','fromDate']], on=['customerId','fromDate'], how='left')
print(df)
customerId fromDate otherInformation lastindays
0 1 2021-02-22 Cat 12 days
1 1 2021-02-22 Dog 12 days
2 1 2021-03-18 Elefant 24 days
3 1 2021-03-18 Cat 24 days
4 1 2021-03-18 Cat 24 days
5 1 2021-03-22 Cat 4 days
6 1 2021-02-10 Cat NaT
7 1 2021-09-07 Cat 169 days
8 1 NaT Elefant NaT
9 1 2022-01-18 Fish 133 days
10 2 2021-05-17 Fish NaT