Chained expressions in polars not working

I’m trying to learn polars, so I set up a simple problem to test it out, in which I ultimately want to run a group_by operation on the data.

While analysing the data, I create a few extra series from the initial data by adding columns and then taking a cumulative sum.

I understand that to use a newly created column in an expression, the expression needs to go in a separate, chained with_columns call, but I can’t seem to make it work.

I have the following example code, which I believe should be correct but fails:

import numpy as np
import polars as pl

data = np.random.random((50,5))
df = pl.from_numpy(data, schema=["id", "sampling_time", "area", "val1", "area_corr"])

(df
.with_columns([
    pl.col("id").cast(pl.Int32),
    pl.Series(name="total_area", values=df.select(pl.col("area") + pl.col("area_corr"))),
])
.with_columns([
    pl.Series(name="cumulative_area", values=df.select(pl.cum_sum("total_area")) / 0.15),
])
.with_columns([
    pl.Series(name="parcel_id", values=df.select(pl.col("cumulative_area").cast(pl.Int32))),
])
)

The snippet fails with the following stack trace:

Traceback (most recent call last):
  File "C:\Users\xxx\anaconda3\envs\py38\lib\site-packages\IPython\core\interactiveshell.py", line 3508, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-8-8e61e84b3c85>", line 7, in <module>
    pl.Series(name="cumulative_area", values=df.select(pl.cum_sum("total_area")) / 0.15),
  File "C:\Users\xxx\anaconda3\envs\py38\lib\site-packages\polars\dataframe\frame.py", line 8142, in select
    return self.lazy().select(*exprs, **named_exprs).collect(_eager=True)
  File "C:\Users\xxx\anaconda3\envs\py38\lib\site-packages\polars\lazyframe\frame.py", line 1940, in collect
    return wrap_df(ldf.collect())
polars.exceptions.ColumnNotFoundError: total_area
Error originated just after this operation:
DF ["id", "sampling_time", "area", "val1"]; PROJECT */5 COLUMNS; SELECTION: "None"

I don’t understand why the newly created total_area series is not found.

I’m on polars 0.20.7 with Python 3.8.18.

Solution:

Diagnostics.

Your code fails because you explicitly reference df inside the second with_columns call. However, the result of the first with_columns call is never assigned back to the variable df, so df still holds only the original columns. In particular, df doesn’t contain the total_area column.

Solution.

Usually, you only construct pl.Series objects within with_columns if the data you’d like to add to the dataframe is stored in an external variable and not computed from the data in the dataframe.

In your example, all data necessary to create the new columns already exists in the dataframe, so you can compute their values with the expression API.

(
    df
    .with_columns(
        pl.col("id").cast(pl.Int32),
        total_area=pl.col("area") + pl.col("area_corr")
    )
    .with_columns(
        cumulative_area=pl.cum_sum("total_area") / 0.15
    )
    .with_columns(
        parcel_id=pl.col("cumulative_area").cast(pl.Int32)
    )
)