How to Calculate Z-Scores for a List of Values in Polars DataFrame

I’m working with a Polars DataFrame in Python, where I have a column containing lists of values. I need to calculate the Z-scores for each list using pre-computed mean and standard deviation values. Here’s a sample of my DataFrame:

import polars as pl

# Example DataFrame
data = {
    "transcript_id": ["ENST00000711184.1"],
    "OE": [[None, None, 3.933402, None, 1.057907, None, 3.116513]],
    "mean_OE": [11.882091],
    "std_OE": [3.889974],
}

df_human = pl.DataFrame(data)

For each list in the OE column, I want to subtract the mean (mean_OE) and divide by the standard deviation (std_OE) to obtain the Z-scores. I also want to handle None values in the lists by leaving them as None in the Z-scores list.

How can I correctly apply the Z-score calculation to each list while keeping None values intact?

Thanks in advance for any guidance!

Solution:

If you are interested in the usual definition of the Z-score (using the summary statistics of the actual list data), you can simply use pl.Expr.list.eval as follows.

df_human.select(
    pl.col("OE").list.eval(
        # Center on the list's own mean, then scale by its own standard deviation
        (pl.element() - pl.element().mean()) / pl.element().std()
    ).alias("z_OE")
)
shape: (1, 1)
┌─────────────────────────────────────────────────────────┐
│ z_OE                                                    │
│ ---                                                     │
│ list[f64]                                               │
╞═════════════════════════════════════════════════════════╡
│ [null, null, 0.830631, null, -1.109966, null, 0.279334] │
└─────────────────────────────────────────────────────────┘
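
As a sanity check on what "the usual definition" means here, the following is a minimal plain-Python sketch: each non-null entry is centered on the list's own mean and divided by the list's own sample standard deviation (ddof=1, which is also the default of Polars' std). The helper name z_scores_within_list is hypothetical and only serves as an illustration.

```python
from statistics import mean, stdev


def z_scores_within_list(values):
    """Z-score each non-None entry against the list's own mean and sample std."""
    present = [v for v in values if v is not None]
    mu = mean(present)
    sigma = stdev(present)  # sample standard deviation (ddof=1); needs >= 2 values
    # None entries are passed through unchanged
    return [None if v is None else (v - mu) / sigma for v in values]


oe = [None, None, 3.933402, None, 1.057907, None, 3.116513]
z_within = z_scores_within_list(oe)
print(z_within)
```

Note that these statistics come from the three non-null values in the list itself, not from the mean_OE and std_OE columns.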

If you want to compute the Z-score explicitly using the mean_OE and std_OE columns, you’d ideally be able to use those within pl.Expr.list.eval. However, referencing "external" columns within a list evaluation context is not supported yet.

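Instead, you can use the explode-and-implode technique: explode the list into individual values, compute the Z-score against the row's mean_OE and std_OE, and implode the results back into one list per row.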

z_score_expr = ((pl.col("OE").explode() - pl.col("mean_OE")) / pl.col("std_OE"))

df_human.select(
    z_score_expr.implode().over(pl.int_range(pl.len())).alias("z_OE")
)
shape: (1, 1)
┌───────────────────────────────────────────────────────────┐
│ z_OE                                                      │
│ ---                                                       │
│ list[f64]                                                 │
╞═══════════════════════════════════════════════════════════╡
│ [null, null, -2.043378, null, -2.782585, null, -2.253377] │
└───────────────────────────────────────────────────────────┘
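
To cross-check these numbers without Polars, here is a minimal plain-Python sketch of the same arithmetic for a single row. The helper name z_scores_with_stats is hypothetical; the mean and standard deviation are the mean_OE and std_OE values from the sample DataFrame in the question.

```python
def z_scores_with_stats(values, mean_value, std_value):
    """Z-score each entry against pre-computed statistics, keeping None as None."""
    return [
        None if v is None else (v - mean_value) / std_value
        for v in values
    ]


oe = [None, None, 3.933402, None, 1.057907, None, 3.116513]
z_fixed = z_scores_with_stats(oe, mean_value=11.882091, std_value=3.889974)
print(z_fixed)
```

Rounded to six decimals, the non-null entries should agree with the z_OE list shown above.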