I have a dataframe with creation_timestamps and personal_ids. Each personal_id has one or more application_ids. An application_id can have many rows, but they all share the same creation_timestamp.
I want to create a column with the number of days between two consecutive application_ids for a given personal_id. All the offer rows within an application_id should show the same number of days between the two application_ids.
Here’s the code I’ve tried to modify in a bunch of ways:
```python
# Calculate days between consecutive applications for each personal_id
df['days_between_applications'] = (
    df.groupby('personal_id')['creation_timestamp']
      .diff()
      .dt.days
)
```
Here’s a preview of the current result:
| | personal_id | application_id | creation_timestamp | days_between_applications |
|---|---|---|---|---|
| 1007 | c1de3552cfb0 | e18e5d77199f | 2019-05-16 06:53:57.817842 | NaN |
| 1008 | c1de3552cfb0 | e18e5d77199f | 2019-05-16 06:53:57.817842 | 0.0 |
| 1010 | c1de3552cfb0 | e18e5d77199f | 2019-05-16 06:53:57.817842 | 0.0 |
| 1006 | c1de3552cfb0 | e18e5d77199f | 2019-05-16 06:53:57.817842 | 0.0 |
| 7094153 | c1de3552cfb0 | 1f64dd61aee2 | 2023-08-07 11:01:45.588173 | 1533.0 |
| 7094147 | c1de3552cfb0 | 1f64dd61aee2 | 2023-08-07 11:01:45.588173 | 0.0 |
This is what I’m trying to achieve:
| | personal_id | application_id | creation_timestamp | days_between_applications |
|---|---|---|---|---|
| 1007 | c1de3552cfb0 | e18e5d77199f | 2019-05-16 06:53:57.817842 | NaN |
| 1008 | c1de3552cfb0 | e18e5d77199f | 2019-05-16 06:53:57.817842 | NaN |
| 1010 | c1de3552cfb0 | e18e5d77199f | 2019-05-16 06:53:57.817842 | NaN |
| 1006 | c1de3552cfb0 | e18e5d77199f | 2019-05-16 06:53:57.817842 | NaN |
| 7094153 | c1de3552cfb0 | 1f64dd61aee2 | 2023-08-07 11:01:45.588173 | 1533.0 |
| 7094147 | c1de3552cfb0 | 1f64dd61aee2 | 2023-08-07 11:01:45.588173 | 1533.0 |
I’ve tried forward filling, sorting and then forward filling, grouping by personal_id and application_id, etc., but nothing has worked.
> Solution:
You could `drop_duplicates`, then compute the difference between successive applications per `personal_id` with `groupby.diff`, and finally map the result back onto the original DataFrame:
```python
import pandas as pd

df['creation_timestamp'] = pd.to_datetime(df['creation_timestamp'])

# keep one row per (personal_id, application_id), then diff per personal_id
diff = (df
        .drop_duplicates(['personal_id', 'application_id'])
        .sort_values(by='creation_timestamp')
        .set_index('application_id')
        .groupby('personal_id')['creation_timestamp']
        .diff().dt.days
        )

# broadcast the per-application diff back to all rows of each application_id
df['days_between_applications'] = df['application_id'].map(diff)
```
Output:
```
         personal_id  application_id          creation_timestamp  days_between_applications
1007    c1de3552cfb0    e18e5d77199f  2019-05-16 06:53:57.817842                        NaN
1008    c1de3552cfb0    e18e5d77199f  2019-05-16 06:53:57.817842                        NaN
1010    c1de3552cfb0    e18e5d77199f  2019-05-16 06:53:57.817842                        NaN
1006    c1de3552cfb0    e18e5d77199f  2019-05-16 06:53:57.817842                        NaN
7094153 c1de3552cfb0    1f64dd61aee2  2023-08-07 11:01:45.588173                     1544.0
7094147 c1de3552cfb0    1f64dd61aee2  2023-08-07 11:01:45.588173                     1544.0
```
Intermediate diff:
```
application_id
e18e5d77199f       NaN
1f64dd61aee2    1544.0
Name: creation_timestamp, dtype: float64
```
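For reference, the same approach end to end on a minimal, made-up dataset (the ids `p1`/`a1`/`a2` and the dates are hypothetical, not from the question):

```python
import pandas as pd

# Hypothetical data: one personal_id with two applications, two rows each
df = pd.DataFrame({
    'personal_id': ['p1'] * 4,
    'application_id': ['a1', 'a1', 'a2', 'a2'],
    'creation_timestamp': ['2019-05-16', '2019-05-16',
                           '2023-08-07', '2023-08-07'],
})
df['creation_timestamp'] = pd.to_datetime(df['creation_timestamp'])

# one row per (personal_id, application_id), diff per personal_id
diff = (df
        .drop_duplicates(['personal_id', 'application_id'])
        .sort_values(by='creation_timestamp')
        .set_index('application_id')
        .groupby('personal_id')['creation_timestamp']
        .diff().dt.days
        )

# every row of an application_id gets the same value
df['days_between_applications'] = df['application_id'].map(diff)
print(df['days_between_applications'].tolist())  # [nan, nan, 1544.0, 1544.0]
```

Both rows of `a1` get NaN (no earlier application) and both rows of `a2` get 1544, matching the desired output above.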
Variant if the application_id values are not necessarily unique to one personal_id:
```python
df['creation_timestamp'] = pd.to_datetime(df['creation_timestamp'])

# dedupe, sort, and diff as before, but keep the (application_id, personal_id)
# pair in the index so shared application_ids stay distinguishable
diff = (df
        .drop_duplicates(['personal_id', 'application_id'])
        .sort_values(by='creation_timestamp')
        .set_index(['application_id', 'personal_id'])
        .groupby('personal_id')['creation_timestamp']
        .diff().dt.days
        )

# align on the (application_id, personal_id) pair instead of application_id alone
df['days_between_applications'] = (
    diff.reindex(df.set_index(['application_id', 'personal_id']).index).values
)
```
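A small sketch of why the pair index matters, on hypothetical data where application_id `a1` is shared by two people (all ids and dates here are invented for illustration):

```python
import pandas as pd

# Hypothetical data: 'a1' appears under both p1 and p2
df = pd.DataFrame({
    'personal_id':    ['p1', 'p1', 'p2', 'p2'],
    'application_id': ['a1', 'a2', 'a1', 'a3'],
    'creation_timestamp': ['2020-01-01', '2020-01-11',
                           '2020-02-01', '2020-02-06'],
})
df['creation_timestamp'] = pd.to_datetime(df['creation_timestamp'])

diff = (df
        .drop_duplicates(['personal_id', 'application_id'])
        .sort_values(by='creation_timestamp')
        .set_index(['application_id', 'personal_id'])
        .groupby('personal_id')['creation_timestamp']
        .diff().dt.days
        )

# reindex on the pair, so p1's 'a1' and p2's 'a1' each get their own value
df['days_between_applications'] = (
    diff.reindex(df.set_index(['application_id', 'personal_id']).index).values
)
print(df['days_between_applications'].tolist())  # [nan, 10.0, nan, 5.0]
```

Mapping on `application_id` alone would be ambiguous here (two rows share the index `a1`); reindexing on the pair keeps each person's first application at NaN and their second at its own day count.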