Polars Modify Many Columns Based On Value In Another Column

March 31, 2023

Say I have a DataFrame that looks like this:

import numpy as np
import polars as pl

df = pl.DataFrame({
  "id": [1, 2, 3, 4, 5],
  "feature_a": np.random.randint(0, 3, 5),
  "feature_b": np.random.randint(0, 3, 5),
  "label": [1, 0, 0, 1, 1],
})

┌─────┬───────────┬───────────┬───────┐
│ id  ┆ feature_a ┆ feature_b ┆ label │
│ --- ┆ ---       ┆ ---       ┆ ---   │
│ i64 ┆ i64       ┆ i64       ┆ i64   │
╞═════╪═══════════╪═══════════╪═══════╡
│ 1   ┆ 2         ┆ 0         ┆ 1     │
│ 2   ┆ 1         ┆ 1         ┆ 0     │
│ 3   ┆ 2         ┆ 2         ┆ 0     │
│ 4   ┆ 1         ┆ 0         ┆ 1     │
│ 5   ┆ 0         ┆ 0         ┆ 1     │
└─────┴───────────┴───────────┴───────┘

I want to overwrite every feature column with a value derived from the label column, producing a new DataFrame:

┌─────┬───────────┬───────────┐
│ id  ┆ feature_a ┆ feature_b │
│ --- ┆ ---       ┆ ---       │
│ i64 ┆ i64       ┆ i64       │
╞═════╪═══════════╪═══════════╡
│ 1   ┆ 1         ┆ 1         │
│ 2   ┆ 0         ┆ 0         │
│ 3   ┆ 0         ┆ 0         │
│ 4   ┆ 1         ┆ 1         │
│ 5   ┆ 1         ┆ 1         │
└─────┴───────────┴───────────┘

I know I can select all the feature columns by using a regex in the column selector:

pl.col(r"^feature_.*$")

And I can use a when/then expression to evaluate the label column:

pl.when(pl.col("label") == 1).then(1).otherwise(0)

But I can’t seem to put the two together to modify all the selected columns in one fell swoop. It seems so simple; what am I missing?

Solution:

Here’s one way:

Polars recently added more ergonomic arguments to many methods, including with_columns and select. They now accept any number of keyword arguments, where each keyword acts like a trailing alias (i.e. it sets the name of the output column). So we can build a dict that maps each column to overwrite to its new expression and unpack it into the call, like so:

df.select('id', **{col: 'label' for col in df.columns if col.startswith('feature')})

In this simple case no when/then is needed, because the label column already holds the desired 0/1 values. In general, though, any expression that evaluates to a column of the same height as id can go into this dict comprehension.