Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

How to sum durations in Polar dataframe?

I have the following dataframe:

import datetime

import polars as pl


df = pl.DataFrame(
    {
        "idx": [259, 123],
        "timestamp": [
            [
                datetime.datetime(2023, 4, 20, 1, 45),
                datetime.datetime(2023, 4, 20, 1, 51, 7),
                datetime.datetime(2023, 4, 20, 2, 29, 50),
            ],
            [
                datetime.datetime(2023, 4, 19, 6, 0, 1),
                datetime.datetime(2023, 4, 19, 6, 0, 17),
                datetime.datetime(2023, 4, 19, 6, 0, 26),
                datetime.datetime(2023, 4, 19, 19, 53, 29),
                datetime.datetime(2023, 4, 19, 19, 54, 4),
                datetime.datetime(2023, 4, 19, 19, 57, 52),
            ],
        ],
    }
)
print(df)
# Output
shape: (2, 2)
┌─────┬───────────────────────────────────────────────────────────────────┐
│ idx ┆ timestamp                                                         │
│ --- ┆ ---                                                               │
│ i64 ┆ list[datetime[μs]]                                                │
╞═════╪═══════════════════════════════════════════════════════════════════╡
│ 259 ┆ [2023-04-20 01:45:00, 2023-04-20 01:51:07, 2023-04-20 02:29:50]   │
│ 123 ┆ [2023-04-19 06:00:01, 2023-04-19 06:00:17, … 2023-04-19 19:57:52] │
└─────┴───────────────────────────────────────────────────────────────────┘

I want to know the total duration of each id, so I do:

df = df.with_columns(
    pl.col("timestamp")
    .apply(lambda x: [x[i + 1] - x[i] for i in range(len(x)) if i + 1 < len(x)])
    .alias("duration")
)

Which gives me:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

shape: (2, 2)
┌─────┬─────────────────────┐
│ idx ┆ duration            │
│ --- ┆ ---                 │
│ i64 ┆ list[duration[μs]]  │
╞═════╪═════════════════════╡
│ 259 ┆ [6m 7s, 38m 43s]    │
│ 123 ┆ [16s, 9s, … 3m 48s] │
└─────┴─────────────────────┘

Now, in Pandas, I would have used total_seconds when calling apply and sum the list, like this:

df["duration"] = (
    df["timestamp"]
    .apply(
        lambda x: sum(
            [(x[i + 1] - x[i]).total_seconds() for i in range(len(x)) if i + 1 < len(x)]
        )
    )
    .astype(int)
)

Which would give me the expected result:

print(df[["idx", "duration"]])
# Output

   idx  duration
0  259      2690
1  123     50271

What would be the equivalent, idiomatic way, to do this in Polars?

>Solution :

There is an arr.diff method for list types, which then can be summed, and the total seconds can be calculated with dt.seconds:

df.select(
    "idx",
    duration=pl.col("timestamp")
        .arr.diff(null_behavior="drop")
        .arr.sum()
        .dt.seconds(),
)
┌─────┬──────────┐
│ idx ┆ duration │
│ --- ┆ ---      │
│ i64 ┆ i64      │
╞═════╪══════════╡
│ 259 ┆ 2690     │
│ 123 ┆ 50271    │
└─────┴──────────┘

An equivalent expression if you really don’t need the intermediate durations at all, that should perform better, would be subtracting the first element of the list from the last:

duration=(
  pl.col("timestamp").arr.last() - pl.col("timestamp").arr.first()
).dt.seconds()
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading