import polars as pl
df1 = pl.DataFrame(
    {
        'type': ['A', 'A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'],
        'date': ['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04', '2022-01-01', '2022-01-02', '2022-01-01', '2022-01-02', '2022-01-03'],
        'value': [1, 3, 5, 7, 8, 9, 12, 0, 1]
    }
)
df2 = pl.DataFrame(
    {
        'type': ['β', 'β', 'β'],
        'date': ['2022-01-01', '2022-01-02', '2022-01-03'],
        'value': [100, 233, 666]
    }
)
df3 = pl.DataFrame(
    {
        'type': ['A', 'A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'],
        'date': ['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04', '2022-01-01', '2022-01-02', '2022-01-01', '2022-01-02', '2022-01-03'],
        'value': [1, 3, 5, 7, 8, 9, 12, 0, 1],
        'type_β': ['β', 'β', 'β', 'β', 'β', 'β', 'β', 'β', 'β'],
        'value_β': [100, 233, 666, None, 100, 233, 100, 233, 666]
    }
)
print(df1, df2, df3)
How can we combine df1 and df2 into df3 based on the date?
Missing dates do not need to be filled; the default of None is fine.
Please use Polars (version 0.17.11).
Solution:
You don't want to concat. Think of concat as adding rows, not combining datasets. You want a join.
I think your example output is off just a touch, though.
On the surface, you want to do this:
df1.join(df2, on='date', how='outer', suffix="_β")
which results in
shape: (9, 5)
┌──────┬────────────┬───────┬────────┬─────────┐
│ type ┆ date ┆ value ┆ type_β ┆ value_β │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 ┆ str ┆ i64 │
╞══════╪════════════╪═══════╪════════╪═════════╡
│ A ┆ 2022-01-01 ┆ 1 ┆ β ┆ 100 │
│ A ┆ 2022-01-02 ┆ 3 ┆ β ┆ 233 │
│ A ┆ 2022-01-03 ┆ 5 ┆ β ┆ 666 │
│ A ┆ 2022-01-04 ┆ 7 ┆ null ┆ null │
│ B ┆ 2022-01-01 ┆ 8 ┆ β ┆ 100 │
│ B ┆ 2022-01-02 ┆ 9 ┆ β ┆ 233 │
│ C ┆ 2022-01-01 ┆ 12 ┆ β ┆ 100 │
│ C ┆ 2022-01-02 ┆ 0 ┆ β ┆ 233 │
│ C ┆ 2022-01-03 ┆ 1 ┆ β ┆ 666 │
└──────┴────────────┴───────┴────────┴─────────┘
You can see that this has a null in the type_β column in the same row where you have a null in the value_β column, so I think that was a typo in your expected output, or else you want more behavior than you described.
If you do want to fill in type_β, you can follow up with
df1.join(df2, on='date', how='outer', suffix="_β") \
.with_columns(type_β = pl.col('type_β').forward_fill())
shape: (9, 5)
┌──────┬────────────┬───────┬────────┬─────────┐
│ type ┆ date ┆ value ┆ type_β ┆ value_β │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 ┆ str ┆ i64 │
╞══════╪════════════╪═══════╪════════╪═════════╡
│ A ┆ 2022-01-01 ┆ 1 ┆ β ┆ 100 │
│ A ┆ 2022-01-02 ┆ 3 ┆ β ┆ 233 │
│ A ┆ 2022-01-03 ┆ 5 ┆ β ┆ 666 │
│ A ┆ 2022-01-04 ┆ 7 ┆ β ┆ null │
│ B ┆ 2022-01-01 ┆ 8 ┆ β ┆ 100 │
│ B ┆ 2022-01-02 ┆ 9 ┆ β ┆ 233 │
│ C ┆ 2022-01-01 ┆ 12 ┆ β ┆ 100 │
│ C ┆ 2022-01-02 ┆ 0 ┆ β ┆ 233 │
│ C ┆ 2022-01-03 ┆ 1 ┆ β ┆ 666 │
└──────┴────────────┴───────┴────────┴─────────┘
The distinction between how='outer', how='left', and the default how='inner' is how rows are treated whose join-key values don't exist in both dfs being joined.
inner joins only return rows whose values in the join column exist on both sides of the join
left joins return all rows from the left side, regardless of whether they have a match on the right side
outer joins return all rows from both sides
In your example, it's not clear whether you prefer left or outer; it's only clear that you don't want inner.