Filter a polars dataframe based on JSON in string column

Advertisements

I have a Polars dataframe like

df = pl.DataFrame({
    "tags": ['{"ref":"@1", "area": "livingroom", "type": "elec"}', '{"ref":"@2", "area": "kitchen"}', '{"ref":"@3", "type": "elec"}'],
     "name": ["a", "b", "c"],
})    
┌────────────────────────────────────────────────────┬──────┐
│ tags                                               ┆ name │
│ ---                                                ┆ ---  │
│ str                                                ┆ str  │
╞════════════════════════════════════════════════════╪══════╡
│ {"ref":"@1", "area": "livingroom", "type": "elec"} ┆ a    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ {"ref":"@2", "area": "kitchen"}                    ┆ b    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ {"ref":"@3", "type": "elec"}                       ┆ c    │
└────────────────────────────────────────────────────┴──────┘

What I would to do is create a filter function that filters dataframe based on the tags column. Particularly, I would like to only be left with rows where the tags column has an area key and a type key that has a value "elec".

How can I achieve this (ideally using the native expressions API)?

>Solution :

pl.Expr.str.json_path_match can be used to extract the first match of the JSON string with the a suitable path expression.

(
    df
    .filter(
        pl.col("tags").str.json_path_match("$.area").is_not_null(),
        pl.col("tags").str.json_path_match("$.type") == "elec",
    )
)
shape: (1, 2)
┌────────────────────────────────────────────────────┬──────┐
│ tags                                               ┆ name │
│ ---                                                ┆ ---  │
│ str                                                ┆ str  │
╞════════════════════════════════════════════════════╪══════╡
│ {"ref":"@1", "area": "livingroom", "type": "elec"} ┆ a    │
└────────────────────────────────────────────────────┴──────┘

More generally, pl.Expr.str.json_decode ca be used to obtain a struct column with the information of the JSON. This struct can be unnested and used for any downstream filtering operation.

(
    df
    .with_columns(
        pl.col("tags").str.json_decode()
    )
    .unnest("tags")
)
shape: (3, 4)
┌─────┬────────────┬──────┬──────┐
│ ref ┆ area       ┆ type ┆ name │
│ --- ┆ ---        ┆ ---  ┆ ---  │
│ str ┆ str        ┆ str  ┆ str  │
╞═════╪════════════╪══════╪══════╡
│ @1  ┆ livingroom ┆ elec ┆ a    │
│ @2  ┆ kitchen    ┆ null ┆ b    │
│ @3  ┆ null       ┆ elec ┆ c    │
└─────┴────────────┴──────┴──────┘

Leave a ReplyCancel reply