How can I convert a float value to datetime with higher precision in polars?

The restrictions I have:

  • I have to concatenate lots of CSV files horizontally, where the proper types cannot be inferred, so I have to scan the CSV files with string columns: data_lf = [pl.scan_csv(file_path, infer_schema=False) for file_path in csv_file_paths]
  • a column of unix timestamps with float values (handled as a string column after reading all the files)

Wrong value I get:

df = pl.DataFrame({
        "unix_timestamp": ["1634083172.2470002"]
})

df = df.with_columns(
    (pl.col("unix_timestamp").cast(pl.Float64) * 1_000_000_000).cast(pl.Datetime("ns")).alias("datetime")
)

print(df)
┌────────────────────┬───────────────────────────────┐
│ unix_timestamp     ┆ datetime                      │
│ ---                ┆ ---                           │
│ str                ┆ datetime[ns]                  │
╞════════════════════╪═══════════════════════════════╡
│ 1634083172.2470002 ┆ 2021-10-12 23:59:32.247000320 │
└────────────────────┴───────────────────────────────┘

What I want:

  • to keep nanosecond precision after the conversion from unix timestamps to datetimes: instead of 2021-10-12 23:59:32.247000320 I want to get 2021-10-12 23:59:32.247000200

I understand that floating-point representation causes the difference, but I still need the higher precision.
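The precision loss is easy to reproduce in plain Python, without polars: a 64-bit float cannot represent 1634083172.2470002 exactly, so multiplying by 1e9 does not recover the original digits:

```python
# Round-tripping the timestamp through float64 corrupts the last digits.
ts = float("1634083172.2470002")
ns = int(ts * 1_000_000_000)
# ns is not 1634083172247000200 -- the low digits have drifted.
print(ns)
```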

I was experimenting with the Decimal type from Python's decimal module, which solved my problem in pandas, but I couldn't figure out how to make it work with polars.

Does anyone have any idea?

Solution:

Instead of performing a numerical multiplication by 1e9, you could work on the string directly and use a regex to shift the decimal point:

df = pl.DataFrame({
    'unix_timestamp': ['1634083172.2470002', '1634083172.247']
})

df.with_columns(
    pl.col('unix_timestamp')
      .add('0' * 9)                          # append 9 zeros so short fractions get padded
      .str.replace(r'\.(\d{9}).*', '${1}')   # keep exactly 9 fractional digits, drop the dot
      .cast(pl.Int64)
      .cast(pl.Datetime('ns'))
      .alias('datetime')
)

NB: I'm assuming all strings have a decimal separator. If not, you would need an extra step to pad with zeros correctly.
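The digit-shifting-with-padding idea, including the case without a decimal separator, can be sketched in pure Python (the helper name to_ns_string is made up for illustration):

```python
def to_ns_string(s: str) -> int:
    """Convert a decimal unix-timestamp string to integer nanoseconds, losslessly."""
    # Split on '.'; frac is '' when there is no decimal separator.
    whole, _, frac = s.partition('.')
    # Keep at most 9 fractional digits and right-pad with zeros to exactly 9.
    return int(whole + frac[:9].ljust(9, '0'))

print(to_ns_string("1634083172.2470002"))  # 1634083172247000200
print(to_ns_string("1634083172"))          # 1634083172000000000
```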

Output:

┌────────────────────┬───────────────────────────────┐
│ unix_timestamp     ┆ datetime                      │
│ ---                ┆ ---                           │
│ str                ┆ datetime[ns]                  │
╞════════════════════╪═══════════════════════════════╡
│ 1634083172.2470002 ┆ 2021-10-12 23:59:32.247000200 │
│ 1634083172.247     ┆ 2021-10-12 23:59:32.247       │
└────────────────────┴───────────────────────────────┘

For completeness, it is also possible to use the unstable Decimal dtype, as suggested in a comment by @Dogbert, but you need to activate it first:

pl.Config.activate_decimals() 

df.with_columns(
    (pl.col('unix_timestamp').cast(pl.Decimal(None, 9)) * 1_000_000_000)
      .cast(pl.Int64).cast(pl.Datetime('ns')).alias('datetime')
)
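As a sanity check that decimal arithmetic is exact here, Python's standard decimal module performs the same scaling without any precision loss:

```python
from decimal import Decimal

# Decimal stores the digits exactly, so scaling by 1e9 is lossless.
ns = int(Decimal("1634083172.2470002") * 1_000_000_000)
print(ns)  # 1634083172247000200
```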