I want to replace each category with its occurrence frequency. My dataframe is lazy, and currently I cannot do this without two passes over the entire data plus one pass over a single column to get the length of the dataframe. Here is how I am doing it:
Input:
import polars as pl

df = pl.DataFrame({"a": [1, 8, 3], "b": [4, 5, None], "c": ["foo", "bar", "bar"]}).lazy()
print(df.collect())
output:
shape: (3, 3)
┌─────┬──────┬─────┐
│ a   ┆ b    ┆ c   │
│ --- ┆ ---  ┆ --- │
│ i64 ┆ i64  ┆ str │
╞═════╪══════╪═════╡
│ 1   ┆ 4    ┆ foo │
│ 8   ┆ 5    ┆ bar │
│ 3   ┆ null ┆ bar │
└─────┴──────┴─────┘
Required output:
shape: (3, 3)
┌─────┬──────┬────────────────────┐
│ a   ┆ b    ┆ c                  │
│ --- ┆ ---  ┆ ---                │
│ i64 ┆ i64  ┆ str                │
╞═════╪══════╪════════════════════╡
│ 1   ┆ 4    ┆ 0.3333333333333333 │
│ 8   ┆ 5    ┆ 0.6666666666666666 │
│ 3   ┆ null ┆ 0.6666666666666666 │
└─────┴──────┴────────────────────┘
transformation code:
l = df.select("c").collect().shape[0]
rep = df.group_by("c").len().collect().with_columns(pl.col("len")/l).lazy()
df_out = (
    df.with_context(rep.select(pl.all().name.prefix("context_")))
    .with_columns(pl.col("c").replace(pl.col("context_c"), pl.col("context_len")))
    .collect()
)
print(df_out)
output:
shape: (3, 3)
┌─────┬──────┬────────────────────┐
│ a   ┆ b    ┆ c                  │
│ --- ┆ ---  ┆ ---                │
│ i64 ┆ i64  ┆ str                │
╞═════╪══════╪════════════════════╡
│ 1   ┆ 4    ┆ 0.3333333333333333 │
│ 8   ┆ 5    ┆ 0.6666666666666666 │
│ 3   ┆ null ┆ 0.6666666666666666 │
└─────┴──────┴────────────────────┘
As you can see, I am collecting the full data twice, plus one more collect over a single column. Can I do better?
>Solution :
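pl.len() evaluates to the "column length", i.e. the number of rows.
You can also use it in a group context (agg/over) to count the rows in each group, so dividing the per-group length by the total length gives the frequency in a single collect: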
df.with_columns((pl.len().over("c") / pl.len()).alias("c")).collect()
shape: (3, 3)
┌─────┬──────┬──────────┐
│ a   ┆ b    ┆ c        │
│ --- ┆ ---  ┆ ---      │
│ i64 ┆ i64  ┆ f64      │
╞═════╪══════╪══════════╡
│ 1   ┆ 4    ┆ 0.333333 │
│ 8   ┆ 5    ┆ 0.666667 │
│ 3   ┆ null ┆ 0.666667 │
└─────┴──────┴──────────┘
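If it helps to see what that expression works out to, here is a rough plain-Python sketch of the same arithmetic (no Polars involved; the list and the variable names are made up for illustration). Each row gets the count of its own category, divided by the total number of rows:

```python
from collections import Counter

# Hypothetical stand-in for column "c" of the example frame.
c = ["foo", "bar", "bar"]

counts = Counter(c)   # per-category count, the role of pl.len().over("c")
total = len(c)        # total row count, the role of the bare pl.len()

# Broadcast each category's count back to its rows and normalise.
freq = [counts[value] / total for value in c]
print(freq)  # [0.3333333333333333, 0.6666666666666666, 0.6666666666666666]
```

This is only meant to show the semantics; the Polars expression does the same thing inside the lazy query.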
If you group by the values, each value's "frequency count" is simply the length of its group.
>>> df.group_by("c").len().collect()
shape: (2, 2)
┌─────┬─────┐
│ c   ┆ len │
│ --- ┆ --- │
│ cat ┆ u32 │
╞═════╪═════╡
│ foo ┆ 1   │
│ bar ┆ 2   │
└─────┴─────┘