How to Calculate Z-Scores for a List of Values in a Polars DataFrame

August 21, 2024

I’m working with a Polars DataFrame in Python, where I have a column containing lists of values. I need to calculate the Z-scores for each list using pre-computed mean and standard deviation values. Here’s a sample of my DataFrame:

import polars as pl

# Example DataFrame
data = {
    "transcript_id": ["ENST00000711184.1"],
    "OE": [[None, None, 3.933402, None, 1.057907, None, 3.116513]],
    "mean_OE": [11.882091],
    "std_OE": [3.889974],
}

df_human = pl.DataFrame(data)

For each list in the OE column, I want to subtract the mean (mean_OE) and divide by the standard deviation (std_OE) to obtain the Z-scores. I also want to handle None values in the lists by leaving them as None in the Z-scores list.
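To make the target concrete, here is a minimal plain-Python sketch of the result I am after for a single row. The helper name `z_scores` is purely illustrative and is not part of Polars; it just applies (x - mean) / std element-wise and passes None through.

```python
def z_scores(values, mean, std):
    """Return (x - mean) / std for each x, leaving None entries as None."""
    return [None if x is None else (x - mean) / std for x in values]


oe = [None, None, 3.933402, None, 1.057907, None, 3.116513]

# Use the pre-computed statistics from the mean_OE and std_OE columns.
z = z_scores(oe, mean=11.882091, std=3.889974)
print(z)
```

The question is how to get the same per-row result with Polars expressions rather than a Python loop.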

How can I correctly apply the Z-score calculation to each list while keeping None values intact?

Thanks in advance for any guidance!

Solution:

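If you are interested in the usual definition of the Z-score (using the summary statistics of the actual list data), you can use pl.Expr.list.eval. Note that the expression below only centers each list by subtracting its own mean; for a full Z-score you would also divide by pl.element().std().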

df_human.select(
    pl.col("OE").list.eval(pl.element() - pl.element().mean()).alias("z_OE")
)

shape: (1, 1)
┌───────────────────────────────────────────────────────┐
│ z_OE                                                  │
│ ---                                                   │
│ list[f64]                                             │
╞═══════════════════════════════════════════════════════╡
│ [null, null, 1.230795, null, -1.6447, null, 0.413906] │
└───────────────────────────────────────────────────────┘

If you want to compute the Z-score explicitly from the mean_OE and std_OE columns, you would ideally reference those columns inside pl.Expr.list.eval. However, referencing "external" columns within a list evaluation context is not supported at the time of writing.

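Instead, you can use the technique of exploding and imploding, as described here. The list is exploded into individual values, the arithmetic is applied against that row's mean_OE and std_OE, and the results are imploded back into one list per row by windowing over the row index.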

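# Explode each list, then standardize with the row's pre-computed statistics.
z_score_expr = (pl.col("OE").explode() - pl.col("mean_OE")) / pl.col("std_OE")

# Window over the row index so each row's values are imploded back into one list.
df_human.select(
    z_score_expr.implode().over(pl.int_range(pl.len())).alias("z_OE")
)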

shape: (1, 1)
┌───────────────────────────────────────────────────────────┐
│ z_OE                                                      │
│ ---                                                       │
│ list[f64]                                                 │
╞═══════════════════════════════════════════════════════════╡
│ [null, null, -2.043378, null, -2.782585, null, -2.253377] │
└───────────────────────────────────────────────────────────┘