Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

How to filter a lazy polars dataframe with arbitrary expressions?

I want to fix my function in order that the expressions inside filter will vary based on the arguments passed. For example if we pass eq_cols = {"a": "aaa", "b": "bbb"}, then two expressions should be created and if not (the default), we don’t pass those expressions. Same for isin_col and select_cols.

import polars as pl
from typing import Dict, List

def query_parquet(
        file_path: str,
        eq_cols: Dict[str, str] = None,
        isin_cols: Dict[str, List[str]] = None,
        select_cols: List[str] = None,
    ) -> pl.DataFrame:

    query = (
        pl.scan_parquet(file_path)
        .filter(
            pl.col("A") == "aaa",
            # pl.col("B") == "bbb",
            # pl.col("C").is_in(["ccc1", "ccc2"])
            # pl.col("D").is_in(["ddd1", "ddd2"])
        )
        .select(["F", "G", "H"])
        .collect()
    )

    return query

Can you guys show me how to do it ? I hope I explained my question good enough.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

>Solution :

Expressions are Python objects and not evaluated when creating them, so you can do comprehensions like

pl.scan_parquet(file_path)
.filter(
  pl.all_horizontal(
    pl.all_horizontal(pl.col(k).eq(v) for k,v in eq_cols.items()),
    pl.all_horizontal(pl.col(k).isin(v) for k,v in isin_cols.items()),
  )
)
.select(select_cols)
.collect()

And so on in any combining logic you wish, not just all_horizontal necessarily. In the case of default None, you could make the default some type of always True expression or * for select, or do some if conditionals when building the query over multiple lines instead of just a one-liner like so:

query = pl.scan_parquet(file_path)

if eq_cols or isin_cols:
  filter_exprs = []
  if eq_cols:
    filter_exprs.append(pl.all_horizontal(pl.col(k).eq(v) for k,v in eq_cols.items())
  if isin_cols:
    filter_exprs.append(pl.all_horizontal(pl.col(k).isin(v) for k,v in isin_cols.items())
  query = query.filter(pl.all_horizontal(filter_exprs))

query = query.filter(filter_expr)
if select_cols:
  query = query.select(select_cols)

query = query.collect() # done here

Addendum: One further option for eq_cols is that in recent version of polars, filter has become more ergonomic. The logic of pl.all_horizontal is the default when passing in variadic positional arguments (you can even pass in a iterable straight up as one positional argument), and specific keyword arguments are equal to pl.col(k).eq(v). So you can unpack a dictionary (and optionally a list/generator, not needed though) to simplify the if statement above like so, given that the defaults are [] and {} instead of Nones:

if eq_cols or isin_cols:
  query = query.filter((pl.col(k).isin(v) for k,v in isin_cols.items()), **eq_cols)
  # can unpack the generator, e.g. *(pl.col(k) ... ), same results
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading