I’m trying to learn a bit about using polars, so I set up a simple problem to test it out, where I ultimately want to run a group_by operation on the data.
While analyzing the data, I create a few extra series from the initial data by adding columns and then cumulating the result.
I understand that when you want to use a newly created column in an expression, it needs to be in another chained with_columns call, but I can’t seem to make it work.
I have the following example code, which I believe should be correct but fails:
import numpy as np
import polars as pl

data = np.random.random((50, 5))
df = pl.from_numpy(data, schema=["id", "sampling_time", "area", "val1", "area_corr"])
(df
    .with_columns([
        pl.col("id").cast(pl.Int32),
        pl.Series(name="total_area", values=df.select(pl.col("area") + pl.col("area_corr"))),
    ])
    .with_columns([
        pl.Series(name="cumulative_area", values=df.select(pl.cum_sum("total_area")) / 0.15),
    ])
    .with_columns([
        pl.Series(name="parcel_id", values=df.select(pl.col("cumulative_area").cast(pl.Int32))),
    ])
)
However, the snippet fails with the following stack trace:
Traceback (most recent call last):
File "C:\Users\xxx\anaconda3\envs\py38\lib\site-packages\IPython\core\interactiveshell.py", line 3508, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-8-8e61e84b3c85>", line 7, in <module>
pl.Series(name="cumulative_area", values=df.select(pl.cum_sum("total_area")) / 0.15),
File "C:\Users\xxx\anaconda3\envs\py38\lib\site-packages\polars\dataframe\frame.py", line 8142, in select
return self.lazy().select(*exprs, **named_exprs).collect(_eager=True)
File "C:\Users\xxx\anaconda3\envs\py38\lib\site-packages\polars\lazyframe\frame.py", line 1940, in collect
return wrap_df(ldf.collect())
polars.exceptions.ColumnNotFoundError: total_area
Error originated just after this operation:
DF ["id", "sampling_time", "area", "val1"]; PROJECT */5 COLUMNS; SELECTION: "None"
I don’t understand why the newly created total_area series is not found.
I’m on polars 0.20.7 with python 3.8.18
>Solution :
Diagnosis.
Your code fails because you explicitly reference df inside the second call to with_columns. The result of the previous with_columns call is never assigned back to the variable df, and df.select(...) is evaluated against the original df before the chained method runs. In particular, df doesn’t contain the total_area column.
Solution.
Usually, you only construct pl.Series objects within with_columns if the data you’d like to add to the dataframe is stored in an external variable and not computed from the data in the dataframe.
In your example, all data necessary to create the new columns already exists in the dataframe, so you can compute their values using the expressions API:
(
    df
    .with_columns(
        pl.col("id").cast(pl.Int32),
        total_area=pl.col("area") + pl.col("area_corr")
    )
    .with_columns(
        cumulative_area=pl.cum_sum("total_area") / 0.15
    )
    .with_columns(
        parcel_id=pl.col("cumulative_area").cast(pl.Int32)
    )
)