The restrictions I have:
- I have to concatenate lots of CSV files horizontally, where the proper types cannot be inferred, therefore I have to scan the CSV files with string columns:
  data_lf = [pl.scan_csv(file_path, infer_schema=False) for file_path in csv_file_paths]
- a column of unix timestamps with float values (it will be handled as a string column after reading all the files)
Wrong value I get:
import polars as pl

df = pl.DataFrame({
    "unix_timestamp": ["1634083172.2470002"]
})
df = df.with_columns(
    (pl.col("unix_timestamp").cast(pl.Float64) * 1_000_000_000).cast(pl.Datetime("ns")).alias("datetime")
)
print(df)
┌────────────────────┬───────────────────────────────┐
│ unix_timestamp ┆ datetime │
│ --- ┆ --- │
│ str ┆ datetime[ns] │
╞════════════════════╪═══════════════════════════════╡
│ 1634083172.2470002 ┆ 2021-10-12 23:59:32.247000320 │
└────────────────────┴───────────────────────────────┘
What I want:
- to keep nanosecond precision after the conversion from unix timestamps to datetimes happens; instead of
2021-10-12 23:59:32.247000320 I want to get 2021-10-12 23:59:32.247000200
I understand how float number representation causes the difference, but I still need higher precision.
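That rounding can be reproduced in plain Python: the literal 1634083172.2470002 is not exactly representable as a Float64, so scaling it to nanoseconds lands 120 ns away from the intended value:

```python
# The nearest Float64 to "1634083172.2470002" is ~1634083172.2470002174,
# and multiplying by 1e9 rounds again, landing on ...320 instead of ...200.
x = float("1634083172.2470002")
ns = int(x * 1e9)
print(ns)  # 1634083172247000320, not 1634083172247000200
```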
I was experimenting with Python's Decimal type, which did solve my problem in pandas, but I couldn't figure out how to make it work with polars.
Does anyone have any idea?
> Solution:
Instead of performing a numerical multiplication by 1e9, you could use a regex to shift the decimal point:
df = pl.DataFrame({
    'unix_timestamp': ['1634083172.2470002', '1634083172.247']
})

df.with_columns(
    pl.col('unix_timestamp')
    .add('0' * 9)                          # right-pad with nine zeros
    .str.replace(r'\.(\d{9}).*', '${1}')   # drop the '.', keep exactly 9 fractional digits
    .cast(pl.Int64)                        # integer nanoseconds since epoch
    .cast(pl.Datetime('ns'))
    .alias('datetime')
)
NB: I'm assuming all strings have a decimal separator. If some don't, you would need an extra step to pad them with zeros correctly.
Output:
┌────────────────────┬───────────────────────────────┐
│ unix_timestamp ┆ datetime │
│ --- ┆ --- │
│ str ┆ datetime[ns] │
╞════════════════════╪═══════════════════════════════╡
│ 1634083172.2470002 ┆ 2021-10-12 23:59:32.247000200 │
│ 1634083172.247 ┆ 2021-10-12 23:59:32.247 │
└────────────────────┴───────────────────────────────┘
For completeness, it is also possible to use the unstable Decimal datatype, as suggested in a comment by @Dogbert, but you need to activate it first:
pl.Config.activate_decimals()

df.with_columns(
    (pl.col('unix_timestamp').cast(pl.Decimal(None, 9)) * 1_000_000_000)
    .cast(pl.Int64)
    .cast(pl.Datetime('ns'))
    .alias('datetime')
)
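As a sanity check, the exact arithmetic the Decimal route relies on can be verified with Python's built-in decimal module:

```python
from decimal import Decimal

# Decimal keeps all 7 fractional digits exactly, so scaling to ns is lossless
ns = int(Decimal("1634083172.2470002") * 1_000_000_000)
print(ns)  # 1634083172247000200
```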