The restrictions I have:
- I have to concatenate lots of CSV files horizontally, where the proper types cannot be inferred, therefore I have to scan the CSV files with string columns:
  data_lf = [pl.scan_csv(file_path, infer_schema=False) for file_path in csv_file_paths]
- a column of unix timestamps with float values (it will be handled as a string column after reading all the files)
Wrong value I get:
import polars as pl

df = pl.DataFrame({
    "unix_timestamp": ["1634083172.2470002"]
})
df = df.with_columns(
    (pl.col("unix_timestamp").cast(pl.Float64) * 1_000_000_000).cast(pl.Datetime("ns")).alias("datetime")
)
print(df)
┌────────────────────┬───────────────────────────────┐
│ unix_timestamp ┆ datetime │
│ --- ┆ --- │
│ str ┆ datetime[ns] │
╞════════════════════╪═══════════════════════════════╡
│ 1634083172.2470002 ┆ 2021-10-12 23:59:32.247000320 │
└────────────────────┴───────────────────────────────┘
What I want:
- to keep nanosecond precision after the conversion from unix timestamps to datetimes happens; instead of
2021-10-12 23:59:32.247000320 I want to get 2021-10-12 23:59:32.247000200
I understand how float number representation causes the difference, but I still need higher precision.
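That rounding can be reproduced in plain Python: the literal 1634083172.2470002 is not exactly representable as a Float64, so scaling it to nanoseconds lands 120 ns away from the intended value:

```python
# The nearest Float64 to "1634083172.2470002" is ~1634083172.2470002174,
# and multiplying by 1e9 rounds again, landing on ...320 instead of ...200.
x = float("1634083172.2470002")
ns = int(x * 1e9)
print(ns)  # 1634083172247000320, not 1634083172247000200
```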
I was experimenting with Python's Decimal type, which did solve my problem in pandas, but I couldn't figure out how to make it work with polars.
Does anyone have any idea?
> Solution:
Instead of performing a numerical multiplication by 1e9, you could use a regex to shift the decimal point:
df = pl.DataFrame({
    'unix_timestamp': ['1634083172.2470002', '1634083172.247']
})

df.with_columns(
    pl.col('unix_timestamp')
    .add('0' * 9)                          # right-pad with nine zeros
    .str.replace(r'\.(\d{9}).*', '${1}')   # drop the '.', keep exactly 9 fractional digits
    .cast(pl.Int64)                        # integer nanoseconds since epoch
    .cast(pl.Datetime('ns'))
    .alias('datetime')
)
NB: I'm assuming all strings have a decimal separator. If some don't, you would need an extra step to pad them with zeros correctly.
Output:
┌────────────────────┬───────────────────────────────┐
│ unix_timestamp ┆ datetime │
│ --- ┆ --- │
│ str ┆ datetime[ns] │
╞════════════════════╪═══════════════════════════════╡
│ 1634083172.2470002 ┆ 2021-10-12 23:59:32.247000200 │
│ 1634083172.247 ┆ 2021-10-12 23:59:32.247 │
└────────────────────┴───────────────────────────────┘
For completeness, it is also possible to use the unstable Decimal datatype, as suggested in a comment by @Dogbert, but you need to activate it first:
pl.Config.activate_decimals()

df.with_columns(
    (pl.col('unix_timestamp').cast(pl.Decimal(None, 9)) * 1_000_000_000)
    .cast(pl.Int64)
    .cast(pl.Datetime('ns'))
    .alias('datetime')
)
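As a sanity check, the exact arithmetic the Decimal route relies on can be verified with Python's built-in decimal module:

```python
from decimal import Decimal

# Decimal keeps all 7 fractional digits exactly, so scaling to ns is lossless
ns = int(Decimal("1634083172.2470002") * 1_000_000_000)
print(ns)  # 1634083172247000200
```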