I have a dataframe with creation_timestamps and personal_ids. Each personal_id has one or more application_ids. An application_id can have many rows, but they all share the same creation_timestamp.
I want to create a column with the number of days between two consecutive application_ids for a given personal_id. All the offer rows within an application_id should show the same number of days between the two application_ids.
Here’s the code I’ve tried to modify in a bunch of ways:
```python
# Calculate days between consecutive applications for each personal_id
df['days_between_applications'] = (
    df.groupby('personal_id')['creation_timestamp']
      .diff()
      .dt.days
)
```
Here’s a preview of the current result:
| | personal_id | application_id | creation_timestamp | days_between_applications |
|---|---|---|---|---|
| 1007 | c1de3552cfb0 | e18e5d77199f | 2019-05-16 06:53:57.817842 | NaN |
| 1008 | c1de3552cfb0 | e18e5d77199f | 2019-05-16 06:53:57.817842 | 0.0 |
| 1010 | c1de3552cfb0 | e18e5d77199f | 2019-05-16 06:53:57.817842 | 0.0 |
| 1006 | c1de3552cfb0 | e18e5d77199f | 2019-05-16 06:53:57.817842 | 0.0 |
| 7094153 | c1de3552cfb0 | 1f64dd61aee2 | 2023-08-07 11:01:45.588173 | 1533.0 |
| 7094147 | c1de3552cfb0 | 1f64dd61aee2 | 2023-08-07 11:01:45.588173 | 0.0 |
This is what I’m trying to achieve:
| | personal_id | application_id | creation_timestamp | days_between_applications |
|---|---|---|---|---|
| 1007 | c1de3552cfb0 | e18e5d77199f | 2019-05-16 06:53:57.817842 | NaN |
| 1008 | c1de3552cfb0 | e18e5d77199f | 2019-05-16 06:53:57.817842 | NaN |
| 1010 | c1de3552cfb0 | e18e5d77199f | 2019-05-16 06:53:57.817842 | NaN |
| 1006 | c1de3552cfb0 | e18e5d77199f | 2019-05-16 06:53:57.817842 | NaN |
| 7094153 | c1de3552cfb0 | 1f64dd61aee2 | 2023-08-07 11:01:45.588173 | 1533.0 |
| 7094147 | c1de3552cfb0 | 1f64dd61aee2 | 2023-08-07 11:01:45.588173 | 1533.0 |
I’ve tried forward filling, sorting and then forward filling, grouping by personal_id and application_id, etc., but nothing has worked.
> Solution:
You could `drop_duplicates`, then compute the difference between successive applications per `personal_id` with `groupby.diff`, and finally map the result back onto the original DataFrame:
```python
import pandas as pd

df['creation_timestamp'] = pd.to_datetime(df['creation_timestamp'])

# keep one row per (personal_id, application_id), then diff per personal_id
diff = (df
        .drop_duplicates(['personal_id', 'application_id'])
        .sort_values(by='creation_timestamp')
        .set_index('application_id')
        .groupby('personal_id')['creation_timestamp']
        .diff().dt.days
        )

# broadcast the per-application diff back to all rows of each application_id
df['days_between_applications'] = df['application_id'].map(diff)
```
Output:
```
         personal_id  application_id          creation_timestamp  days_between_applications
1007    c1de3552cfb0    e18e5d77199f  2019-05-16 06:53:57.817842                        NaN
1008    c1de3552cfb0    e18e5d77199f  2019-05-16 06:53:57.817842                        NaN
1010    c1de3552cfb0    e18e5d77199f  2019-05-16 06:53:57.817842                        NaN
1006    c1de3552cfb0    e18e5d77199f  2019-05-16 06:53:57.817842                        NaN
7094153 c1de3552cfb0    1f64dd61aee2  2023-08-07 11:01:45.588173                     1544.0
7094147 c1de3552cfb0    1f64dd61aee2  2023-08-07 11:01:45.588173                     1544.0
```
Intermediate diff:
```
application_id
e18e5d77199f       NaN
1f64dd61aee2    1544.0
Name: creation_timestamp, dtype: float64
```
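For reference, the same approach end to end on a minimal, made-up dataset (the ids `p1`/`a1`/`a2` and the dates are hypothetical, not from the question):

```python
import pandas as pd

# Hypothetical data: one personal_id with two applications, two rows each
df = pd.DataFrame({
    'personal_id': ['p1'] * 4,
    'application_id': ['a1', 'a1', 'a2', 'a2'],
    'creation_timestamp': ['2019-05-16', '2019-05-16',
                           '2023-08-07', '2023-08-07'],
})
df['creation_timestamp'] = pd.to_datetime(df['creation_timestamp'])

# one row per (personal_id, application_id), diff per personal_id
diff = (df
        .drop_duplicates(['personal_id', 'application_id'])
        .sort_values(by='creation_timestamp')
        .set_index('application_id')
        .groupby('personal_id')['creation_timestamp']
        .diff().dt.days
        )

# every row of an application_id gets the same value
df['days_between_applications'] = df['application_id'].map(diff)
print(df['days_between_applications'].tolist())  # [nan, nan, 1544.0, 1544.0]
```

Both rows of `a1` get NaN (no earlier application) and both rows of `a2` get 1544, matching the desired output above.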
Variant if the application_id values are not necessarily unique to one personal_id:
```python
df['creation_timestamp'] = pd.to_datetime(df['creation_timestamp'])

# dedupe, sort, and diff as before, but keep the (application_id, personal_id)
# pair in the index so shared application_ids stay distinguishable
diff = (df
        .drop_duplicates(['personal_id', 'application_id'])
        .sort_values(by='creation_timestamp')
        .set_index(['application_id', 'personal_id'])
        .groupby('personal_id')['creation_timestamp']
        .diff().dt.days
        )

# align on the (application_id, personal_id) pair instead of application_id alone
df['days_between_applications'] = (
    diff.reindex(df.set_index(['application_id', 'personal_id']).index).values
)
```
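A small sketch of why the pair index matters, on hypothetical data where application_id `a1` is shared by two people (all ids and dates here are invented for illustration):

```python
import pandas as pd

# Hypothetical data: 'a1' appears under both p1 and p2
df = pd.DataFrame({
    'personal_id':    ['p1', 'p1', 'p2', 'p2'],
    'application_id': ['a1', 'a2', 'a1', 'a3'],
    'creation_timestamp': ['2020-01-01', '2020-01-11',
                           '2020-02-01', '2020-02-06'],
})
df['creation_timestamp'] = pd.to_datetime(df['creation_timestamp'])

diff = (df
        .drop_duplicates(['personal_id', 'application_id'])
        .sort_values(by='creation_timestamp')
        .set_index(['application_id', 'personal_id'])
        .groupby('personal_id')['creation_timestamp']
        .diff().dt.days
        )

# reindex on the pair, so p1's 'a1' and p2's 'a1' each get their own value
df['days_between_applications'] = (
    diff.reindex(df.set_index(['application_id', 'personal_id']).index).values
)
print(df['days_between_applications'].tolist())  # [nan, 10.0, nan, 5.0]
```

Mapping on `application_id` alone would be ambiguous here (two rows share the index `a1`); reindexing on the pair keeps each person's first application at NaN and their second at its own day count.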