Pandas groupby two columns and create a new column based on difference in days from third column

I have a dataframe with creation_timestamps and personal_ids. Each personal_id has one or more application_ids. An application_id can have a bunch of rows, but they all share the same creation_timestamp.

I want to create a column that shows the number of days between consecutive application_ids for a given personal_id. All the offer rows within an application_id should show the same number of days.

Here’s the code I’ve tried to modify in a bunch of ways:

# Calculate days between consecutive applications for each personal_id
df['days_between_applications'] = (
    df.groupby(['personal_id'])['creation_timestamp']
    .diff()
    .dt.days
)

Here’s a preview of the current result:

          personal_id application_id         creation_timestamp  days_between_applications
1007     c1de3552cfb0   e18e5d77199f 2019-05-16 06:53:57.817842                        NaN
1008     c1de3552cfb0   e18e5d77199f 2019-05-16 06:53:57.817842                        0.0
1010     c1de3552cfb0   e18e5d77199f 2019-05-16 06:53:57.817842                        0.0
1006     c1de3552cfb0   e18e5d77199f 2019-05-16 06:53:57.817842                        0.0
7094153  c1de3552cfb0   1f64dd61aee2 2023-08-07 11:01:45.588173                     1533.0
7094147  c1de3552cfb0   1f64dd61aee2 2023-08-07 11:01:45.588173                        0.0

This is what I’m trying to achieve:

          personal_id application_id         creation_timestamp  days_between_applications
1007     c1de3552cfb0   e18e5d77199f 2019-05-16 06:53:57.817842                        NaN
1008     c1de3552cfb0   e18e5d77199f 2019-05-16 06:53:57.817842                        NaN
1010     c1de3552cfb0   e18e5d77199f 2019-05-16 06:53:57.817842                        NaN
1006     c1de3552cfb0   e18e5d77199f 2019-05-16 06:53:57.817842                        NaN
7094153  c1de3552cfb0   1f64dd61aee2 2023-08-07 11:01:45.588173                     1533.0
7094147  c1de3552cfb0   1f64dd61aee2 2023-08-07 11:01:45.588173                     1533.0

I've tried forward filling, sorting and then forward filling, grouping by personal_id and application_id, etc., but nothing has worked.

Solution:

You could drop_duplicates, then compute the diff between successive applications per personal_id with groupby.diff, and finally map the result back onto the original DataFrame:

import pandas as pd

# Ensure the timestamps are datetimes, not strings
df['creation_timestamp'] = pd.to_datetime(df['creation_timestamp'])

# One row per (personal_id, application_id), then diff consecutive
# applications within each personal_id
diff = (df
   .drop_duplicates(['personal_id', 'application_id'])
   .sort_values(by='creation_timestamp')
   .set_index('application_id').groupby('personal_id')
   ['creation_timestamp'].diff().dt.days
)

# Broadcast the per-application diff back to every row
df['days_between_applications'] = df['application_id'].map(diff)

Output:

          personal_id application_id         creation_timestamp  days_between_applications
1007     c1de3552cfb0   e18e5d77199f 2019-05-16 06:53:57.817842                        NaN
1008     c1de3552cfb0   e18e5d77199f 2019-05-16 06:53:57.817842                        NaN
1010     c1de3552cfb0   e18e5d77199f 2019-05-16 06:53:57.817842                        NaN
1006     c1de3552cfb0   e18e5d77199f 2019-05-16 06:53:57.817842                        NaN
7094153  c1de3552cfb0   1f64dd61aee2 2023-08-07 11:01:45.588173                     1544.0
7094147  c1de3552cfb0   1f64dd61aee2 2023-08-07 11:01:45.588173                     1544.0

Intermediate diff:

application_id
e18e5d77199f       NaN
1f64dd61aee2    1544.0
Name: creation_timestamp, dtype: float64
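
As a quick, self-contained sanity check of this approach, here is a toy DataFrame (the ids p1/a1/a2 and the dates are made up for illustration, not the asker's data):

```python
import pandas as pd

# Toy data: one personal_id with two applications, two rows each,
# created ten days apart
df = pd.DataFrame({
    'personal_id': ['p1', 'p1', 'p1', 'p1'],
    'application_id': ['a1', 'a1', 'a2', 'a2'],
    'creation_timestamp': ['2019-05-16 06:53:57', '2019-05-16 06:53:57',
                           '2019-05-26 11:01:45', '2019-05-26 11:01:45'],
})
df['creation_timestamp'] = pd.to_datetime(df['creation_timestamp'])

# Same pipeline as above: dedupe, sort, diff per personal_id
diff = (df
   .drop_duplicates(['personal_id', 'application_id'])
   .sort_values(by='creation_timestamp')
   .set_index('application_id').groupby('personal_id')
   ['creation_timestamp'].diff().dt.days
)
df['days_between_applications'] = df['application_id'].map(diff)

print(df['days_between_applications'].tolist())  # [nan, nan, 10.0, 10.0]
```

Both rows of the second application get the same 10-day gap, and both rows of the first application stay NaN, which is exactly the shape requested in the question.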

Variant, if the application_id values are not necessarily unique to a single personal_id:

df['creation_timestamp'] = pd.to_datetime(df['creation_timestamp'])

# Index by both keys so the diff can be aligned back correctly even when
# an application_id appears under several personal_ids
diff = (df
   .drop_duplicates(['personal_id', 'application_id'])
   .sort_values(by='creation_timestamp')
   .set_index(['application_id', 'personal_id']).groupby('personal_id')
   ['creation_timestamp'].diff().dt.days
)

# Align on the (application_id, personal_id) pairs of the original rows
df['days_between_applications'] = (diff
   .reindex(df.set_index(['application_id', 'personal_id']).index)
   .values
)
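
A small hypothetical example of why the variant matters, where application_id 'a1' is reused by two different personal_ids (ids and dates are made up):

```python
import pandas as pd

# 'a1' appears under both p1 and p2; mapping on application_id alone
# would be ambiguous here, so we align on the (application_id, personal_id) pair
df = pd.DataFrame({
    'personal_id':    ['p1', 'p1', 'p2'],
    'application_id': ['a1', 'a2', 'a1'],
    'creation_timestamp': ['2023-01-01', '2023-01-08', '2023-03-01'],
})
df['creation_timestamp'] = pd.to_datetime(df['creation_timestamp'])

diff = (df
   .drop_duplicates(['personal_id', 'application_id'])
   .sort_values(by='creation_timestamp')
   .set_index(['application_id', 'personal_id']).groupby('personal_id')
   ['creation_timestamp'].diff().dt.days
)
df['days_between_applications'] = (diff
   .reindex(df.set_index(['application_id', 'personal_id']).index)
   .values
)

print(df['days_between_applications'].tolist())  # [nan, 7.0, nan]
```

Each personal_id gets its own first-application NaN, and p2's 'a1' row is not contaminated by p1's gap.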