df = pl.LazyFrame({'col1':['a','a','b','b','c','c'],'col2':['undefined','defined','defined','defined','undefined','undefined']})
I have the following DF
Imagine I want to select the rows where col2 has more than two different values when it is grouped by col1.
And I want to select those users and map them in a way I know they have more than two different values in col 2.
Wanted result
df = pl.LazyFrame({'col1':['a','a','b','b','c','c'],'col2':['undefined','defined','defined','defined','undefined','undefined'],'col3':[2,2,1,1,1,1]})
Any ideas on how can I achieve that? In a efficient manner avoiding lefts joins separate filters and so on
>Solution :
It looks like .n_unique()
df.with_columns(col3 = pl.col("col2").n_unique().over("col1"))
shape: (6, 3)
┌──────┬───────────┬──────┐
│ col1 ┆ col2 ┆ col3 │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ u32 │
╞══════╪═══════════╪══════╡
│ a ┆ undefined ┆ 2 │
│ a ┆ defined ┆ 2 │
│ b ┆ defined ┆ 1 │
│ b ┆ defined ┆ 1 │
│ c ┆ undefined ┆ 1 │
│ c ┆ undefined ┆ 1 │
└──────┴───────────┴──────┘
You may want to also try .unique().len() as there was a previous report n_unique about performance: https://github.com/pola-rs/polars/issues/11249
Not sure if those benchmarks are still relevant.