This runs on a single core, despite not using (seemingly) any non-Polars stuff. What am I doing wrong?
(the goal is to convert a list in doc_ids field in every row into its string representation, s.t. [1, 2, 3] (list[int]) -> '[1, 2, 3]' (string))
import polars as pl
df = pl.DataFrame(dict(ent = ['a', 'b'], doc_ids = [[2,3], [3]]))
df = (df.lazy()
.with_column(
pl.concat_str([
pl.lit('['),
pl.col('doc_ids').apply(lambda x: x.cast(pl.Utf8)).arr.join(', '),
pl.lit(']')
])
.alias('docs_str')
)
.drop('doc_ids')
).collect()
>Solution :
In general, we want to avoid apply at all costs. It acts like a black-box function that Polars cannot optimize, leading to single-threaded performance.
Here’s one way that we can eliminate apply: replace it with arr.eval. arr.eval allows us to treat a list as if it were an Expression/Series, which allows us to use standard expressions on it.
(
df.lazy()
.with_column(
pl.concat_str(
[
pl.lit("["),
pl.col("doc_ids")
.arr.eval(pl.element().cast(pl.Utf8))
.arr.join(", "),
pl.lit("]"),
]
).alias("docs_str")
)
.drop("doc_ids")
.collect()
)
shape: (2, 2)
┌─────┬──────────┐
│ ent ┆ docs_str │
│ --- ┆ --- │
│ str ┆ str │
╞═════╪══════════╡
│ a ┆ [2, 3] │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ b ┆ [3] │
└─────┴──────────┘