Add new column with multiple literal values to polars dataframe

Consider the following toy example:

import polars as pl

pl.Config(tbl_rows=-1)

df = pl.DataFrame({"group": ["A", "A", "A", "B", "B"], "value": [1, 2, 3, 4, 5]})

print(df)

shape: (5, 2)
┌───────┬───────┐
│ group ┆ value │
│ ---   ┆ ---   │
│ str   ┆ i64   │
╞═══════╪═══════╡
│ A     ┆ 1     │
│ A     ┆ 2     │
│ A     ┆ 3     │
│ B     ┆ 4     │
│ B     ┆ 5     │
└───────┴───────┘

Further, I have a list of indicator values, such as vals = [10, 20, 30].

I am looking for an efficient way to add each of these values to a new column called indicator using pl.lit(), expanding the dataframe vertically so that every existing row is repeated once for each element of vals.

My current solution is to add the new column to df once per value, collect the resulting dataframes in a list, and then combine them with pl.concat.

vals = [10, 20, 30]

print(pl.concat([df.with_columns(indicator=pl.lit(val)) for val in vals]))

shape: (15, 3)
┌───────┬───────┬───────────┐
│ group ┆ value ┆ indicator │
│ ---   ┆ ---   ┆ ---       │
│ str   ┆ i64   ┆ i32       │
╞═══════╪═══════╪═══════════╡
│ A     ┆ 1     ┆ 10        │
│ A     ┆ 2     ┆ 10        │
│ A     ┆ 3     ┆ 10        │
│ B     ┆ 4     ┆ 10        │
│ B     ┆ 5     ┆ 10        │
│ A     ┆ 1     ┆ 20        │
│ A     ┆ 2     ┆ 20        │
│ A     ┆ 3     ┆ 20        │
│ B     ┆ 4     ┆ 20        │
│ B     ┆ 5     ┆ 20        │
│ A     ┆ 1     ┆ 30        │
│ A     ┆ 2     ┆ 30        │
│ A     ┆ 3     ┆ 30        │
│ B     ┆ 4     ┆ 30        │
│ B     ┆ 5     ┆ 30        │
└───────┴───────┴───────────┘

As df could potentially have many rows and columns, I am wondering whether my solution is efficient in terms of both speed and memory allocation.

Just for my understanding: if I append a new pl.DataFrame to the list, will this dataframe use additional memory, or will only new pointers be created that refer to the chunks in memory holding the data of the original df?

Solution:

You could assign the list as a column and then .explode() it, which yields one row per list element:

df.with_columns(indicator=vals).explode("indicator")

shape: (15, 3)
┌───────┬───────┬───────────┐
│ group ┆ value ┆ indicator │
│ ---   ┆ ---   ┆ ---       │
│ str   ┆ i64   ┆ i64       │
╞═══════╪═══════╪═══════════╡
│ A     ┆ 1     ┆ 10        │
│ A     ┆ 1     ┆ 20        │
│ A     ┆ 1     ┆ 30        │
│ A     ┆ 2     ┆ 10        │
│ A     ┆ 2     ┆ 20        │
│ …     ┆ …     ┆ …         │
│ B     ┆ 4     ┆ 20        │
│ B     ┆ 4     ┆ 30        │
│ B     ┆ 5     ┆ 10        │
│ B     ┆ 5     ┆ 20        │
│ B     ┆ 5     ┆ 30        │
└───────┴───────┴───────────┘