Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Polars Create Column with String Formatting

I have a polars dataframe:

df = pl.DataFrame({'schema_name': ['test_schema', 'test_schema_2'], 
                       'table_name': ['test_table', 'test_table_2'],
                       'column_name': ['test_column, test_column_2','test_column']})
schema_name table_name column_name
test_schema test_table test_column, test_column_2
test_schema_2 test_table_2 test_column

I have a string:

date_field_value_max_query = '''
    select '{0}' as schema_name, 
           '{1}' as table_name, 
           greatest({2})
    from {0}.{1}
    group by 1, 2
'''

I would like to use polars to add a column by using string formatting. The target dataframe is this:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

schema_name table_name column_name query
test_schema test_table test_column, test_column_2 select test_schema, test_table, greatest(test_column, test_column_2) from test_schema.test_table group by 1, 2
test_schema_2 test_table_2 test_column select test_schema_2, test_table_2, greatest(test_column) from test_schema_2.test_table_2 group by 1, 2

In pandas, I would do something like this:

df.apply(lambda row: date_field_value_max_query.format(row['schema_name'], row['table_name'], row['column_name']), axis=1)

For polars, I tried this:

df.with_column(
    (date_field_value_max_query.format(pl.col('schema_name'), pl.col('table_name'), pl.col('column_name')))
)

This doesn’t work, because with_column expects a single expression. I am able to get the output I want by doing this…

df.apply(lambda row: date_field_value_max_query.format(row[0], row[1], row[2]))

…but this returns only the one column, and I lose the original three columns. I know this approach is also not recommended for polars, when possible.

How can I perform string formatting across multiple dataframe columns with the output column attached to the original dataframe?

>Solution :

df.apply() returns a dataframe – you can stack dataframes horizontally:

df.hstack(
   df.apply(lambda row: date_field_value_max_query.format(*row))
     .rename({"apply": "query"})
)
shape: (2, 4)
┌───────────────┬──────────────┬────────────────────────────┬─────────────────────────────────────┐
│ schema_name   | table_name   | column_name                | query                               │
│ ---           | ---          | ---                        | ---                                 │
│ str           | str          | str                        | str                                 │
╞═══════════════╪══════════════╪════════════════════════════╪═════════════════════════════════════╡
│ test_schema   | test_table   | test_column, test_column_2 | select 'test_schema' as schema_n... │
├───────────────┼──────────────┼────────────────────────────┼─────────────────────────────────────┤
│ test_schema_2 | test_table_2 | test_column                | select 'test_schema_2' as schema... │
└───────────────┴──────────────┴────────────────────────────┴─────────────────────────────────────┘

You can add Series objects as columns:

df.with_columns(
   (df.apply(lambda row: date_field_value_max_query.format(*row))
      .to_series().rename("query"))
)
shape: (2, 4)
┌───────────────┬──────────────┬────────────────────────────┬─────────────────────────────────────┐
│ schema_name   | table_name   | column_name                | query                               │
│ ---           | ---          | ---                        | ---                                 │
│ str           | str          | str                        | str                                 │
╞═══════════════╪══════════════╪════════════════════════════╪═════════════════════════════════════╡
│ test_schema   | test_table   | test_column, test_column_2 | select 'test_schema' as schema_n... │
├───────────────┼──────────────┼────────────────────────────┼─────────────────────────────────────┤
│ test_schema_2 | test_table_2 | test_column                | select 'test_schema_2' as schema... │
└───────────────┴──────────────┴────────────────────────────┴─────────────────────────────────────┘

If you are creating the format string – you could perhaps use an expression instead:

df.with_columns((
   pl.lit("select '") + pl.col("schema_name") + pl.lit("' as schema_name,")
                      + pl.col("table_name")  + pl.lit("' as table_name,")
                      + pl.lit("greatest(")   + pl.col("column_name") + pl.lit(")") + 
   pl.lit("from ")    + pl.col("schema_name") + pl.lit(".") + pl.col("table_name")  + 
   pl.lit("group by 1, 2")
).alias("query"))
shape: (2, 4)
┌───────────────┬──────────────┬────────────────────────────┬─────────────────────────────────────┐
│ schema_name   | table_name   | column_name                | query                               │
│ ---           | ---          | ---                        | ---                                 │
│ str           | str          | str                        | str                                 │
╞═══════════════╪══════════════╪════════════════════════════╪═════════════════════════════════════╡
│ test_schema   | test_table   | test_column, test_column_2 | select 'test_schema' as schema_n... │
├───────────────┼──────────────┼────────────────────────────┼─────────────────────────────────────┤
│ test_schema_2 | test_table_2 | test_column                | select 'test_schema_2' as schema... │
└───────────────┴──────────────┴────────────────────────────┴─────────────────────────────────────┘
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading