Implement frequency encoding in polars

I want to replace each category with its occurrence frequency. My dataframe is lazy, and currently I cannot do this without two full passes over the data plus one pass over a single column to get the dataframe's length. Here is how I am doing it:

Input:

df = pl.DataFrame({"a": [1, 8, 3], "b": [4, 5, None], "c": ["foo", "bar", "bar"]}).lazy()
print(df.collect())
Output:
shape: (3, 3)
β”Œβ”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”
β”‚ a   ┆ b    ┆ c   β”‚
β”‚ --- ┆ ---  ┆ --- β”‚
β”‚ i64 ┆ i64  ┆ str β”‚
β•žβ•β•β•β•β•β•ͺ══════β•ͺ═════║
β”‚ 1   ┆ 4    ┆ foo β”‚
β”‚ 8   ┆ 5    ┆ bar β”‚
β”‚ 3   ┆ null ┆ bar β”‚
β””β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”˜

Required output:

shape: (3, 3)
β”Œβ”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ a   ┆ b    ┆ c                  β”‚
β”‚ --- ┆ ---  ┆ ---                β”‚
β”‚ i64 ┆ i64  ┆ str                β”‚
β•žβ•β•β•β•β•β•ͺ══════β•ͺ════════════════════║
β”‚ 1   ┆ 4    ┆ 0.3333333333333333 β”‚
β”‚ 8   ┆ 5    ┆ 0.6666666666666666 β”‚
β”‚ 3   ┆ null ┆ 0.6666666666666666 β”‚
β””β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Transformation code:

l = df.select("c").collect().shape[0]
rep = (
    df.group_by("c")
    .len()
    .collect()
    .with_columns(pl.col("len") / l)
    .lazy()
)
df_out = (
    df.with_context(rep.select(pl.all().name.prefix("context_")))
    .with_columns(pl.col("c").replace(pl.col("context_c"), pl.col("context_len")))
    .collect()
)
print(df_out)

Output:
shape: (3, 3)
β”Œβ”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ a   ┆ b    ┆ c                  β”‚
β”‚ --- ┆ ---  ┆ ---                β”‚
β”‚ i64 ┆ i64  ┆ str                β”‚
β•žβ•β•β•β•β•β•ͺ══════β•ͺ════════════════════║
β”‚ 1   ┆ 4    ┆ 0.3333333333333333 β”‚
β”‚ 8   ┆ 5    ┆ 0.6666666666666666 β”‚
β”‚ 3   ┆ null ┆ 0.6666666666666666 β”‚
β””β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

As you can see, I collect the full data twice, plus one collect over a single column. Can I do better?

>Solution:

pl.len() evaluates to the "column length" (the number of rows).

You can also use it in a grouped context (agg/over) to count the values per group.

df.with_columns((pl.len().over("c") / pl.len()).alias("c")).collect()
shape: (3, 3)
β”Œβ”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ a   ┆ b    ┆ c        β”‚
β”‚ --- ┆ ---  ┆ ---      β”‚
β”‚ i64 ┆ i64  ┆ f64      β”‚
β•žβ•β•β•β•β•β•ͺ══════β•ͺ══════════║
β”‚ 1   ┆ 4    ┆ 0.333333 β”‚
β”‚ 8   ┆ 5    ┆ 0.666667 β”‚
β”‚ 3   ┆ null ┆ 0.666667 β”‚
β””β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

By grouping by the values, their "frequency count" is the group length.

>>> df.group_by("c").len().collect()
shape: (2, 2)
β”Œβ”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”
β”‚ c   ┆ len β”‚
β”‚ --- ┆ --- β”‚
β”‚ str ┆ u32 β”‚
β•žβ•β•β•β•β•β•ͺ═════║
β”‚ foo ┆ 1   β”‚
β”‚ bar ┆ 2   β”‚
β””β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”˜