I am dealing with a problem like this:
mydata_old
| x | y |
|-----|------|
| 1 | 10 |
| 2 | 15 |
| 3 | 12 |
| 4 | 21 |
mydata_new
| x | y |
|-----|------|
| 4 | 20 |
| 5 | 25 |
| 6 | 30 |
I would like to combine both tables to get:
mydata
| x | y |
|-----|------|
| 1 | 10 |
| 2 | 15 |
| 3 | 12 |
| 4 | 20 | # x = 4 does not show up twice
| 5 | 25 |
| 6 | 30 |
I tried with pandas.merge, but I cannot get the desired result.
mydata = pd.merge(mydata_new, mydata_old, how="left")
>Solution :
One way to do this is with an outer join (i.e., keep the index from both dataframes). You can achieve this using pd.merge(mydata_new, mydata_old, how='outer') and then keep the newest data available.
However, this could be more straight forward with pandas.concat.
Here, I am concatenating both dataframes. Note that I exclude rows from mydata_old if x appears in mydata_new. This way you can keep the newest data when it’s available in both dataframes (old and new).
# Concat frames and if data is available in both, keep data from mydata_new
mydata = pd.concat(
[
mydata_old.loc[~mydata_old['x'].isin(mydata_new['x'])],
mydata_new
],
axis=0)
This will print the following:
| x | y |
|---|---|
| 1 | 10 |
| 2 | 15 |
| 3 | 12 |
| 4 | 20 |
| 5 | 25 |
| 6 | 30 |