Polars map users with more than one event

November 23, 2023

df = pl.LazyFrame({'col1':['a','a','b','b','c','c'],'col2':['undefined','defined','defined','defined','undefined','undefined']})

I have the following DF

Imagine I want to select the rows where col2 has more than two different values when it is grouped by col1.
And I want to select those users and map them in a way I know they have more than two different values in col 2.

Wanted result

df = pl.LazyFrame({'col1':['a','a','b','b','c','c'],'col2':['undefined','defined','defined','defined','undefined','undefined'],'col3':[2,2,1,1,1,1]})

Any ideas on how can I achieve that? In a efficient manner avoiding lefts joins separate filters and so on

>Solution :

It looks like .n_unique()

df.with_columns(col3 = pl.col("col2").n_unique().over("col1"))

shape: (6, 3)
┌──────┬───────────┬──────┐
│ col1 ┆ col2      ┆ col3 │
│ ---  ┆ ---       ┆ ---  │
│ str  ┆ str       ┆ u32  │
╞══════╪═══════════╪══════╡
│ a    ┆ undefined ┆ 2    │
│ a    ┆ defined   ┆ 2    │
│ b    ┆ defined   ┆ 1    │
│ b    ┆ defined   ┆ 1    │
│ c    ┆ undefined ┆ 1    │
│ c    ┆ undefined ┆ 1    │
└──────┴───────────┴──────┘

You may want to also try .unique().len() as there was a previous report n_unique about performance: https://github.com/pola-rs/polars/issues/11249

Not sure if those benchmarks are still relevant.