Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Check for pauses in pandas df

I have a following data frame:

import pandas as pd

dict_df = {
    "id": [1, 2, 2, 2, 3, 3, 3, 3],
    "time": [
        "2022-08-30 08:00:02",
        "2022-08-30 08:03:07",
        "2022-08-30 08:05:12",
        "2022-08-30 12:06:52",
        "2022-08-30 10:00:02",
        "2022-08-30 10:00:27",
        "2022-08-30 13:00:45",
        "2022-08-30 13:02:27",
    ],
}

df = pd.DataFrame(dict_df)

I would like to get a new df with columns start_time and end_time based on time in the df above. For each employee I would like to check if the pause between time is larger than 10 minutes. If yes I would like to consider it as a break and record it in a new row.

Desired output is:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

new_dict = {
    "id": [1, 2, 2, 3, 3],
    "start_time": [
        "2022-08-30 08:00:02",
        "2022-08-30 08:03:07",
        "2022-08-30 12:06:52",
        "2022-08-30 10:00:02",
        "2022-08-30 13:00:45",
    ],

    "end_time": [
        "2022-08-30 08:00:02",
        "2022-08-30 08:05:12",
        "2022-08-30 12:06:52",
        "2022-08-30 10:00:27",
        "2022-08-30 13:02:27",
    ],
}

new_df = pd.DataFrame(new_dict)

How can I do it, please?

>Solution :

Create groups by custom function in GroupBy.transform with compare differencies with cumulative sum and then aggregate GroupBy.first and
GroupBy.last:

df['time'] = pd.to_datetime(df['time'])

minutes = 10
f = lambda x: x.diff().dt.total_seconds().gt(minutes * 60).cumsum()
df['g'] = df.groupby('id')['time'].transform(f)

df = (df.groupby(['id','g'])
        .agg(start_time=('time','first'),end_time=('time','last'))
        .droplevel(1)
        .reset_index())
print (df)
   id          start_time            end_time
0   1 2022-08-30 08:00:02 2022-08-30 08:00:02
1   2 2022-08-30 08:03:07 2022-08-30 08:05:12
2   2 2022-08-30 12:06:52 2022-08-30 12:06:52
3   3 2022-08-30 10:00:02 2022-08-30 10:00:27
4   3 2022-08-30 13:00:45 2022-08-30 13:02:27
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading