Merge should adopt Nan values if not existing value

Advertisements

However, I have the following problem:

  • If a year or date does not exist in df2 then a price and a listing_id is automatically added during the merge. But that should be NaN

  • The second problem is when merging, as soon as I have multiple data that were on the same day and year then the temperature is also merged to the second, for example:

d = {'id': [1], 'day': [1], 'temperature': [20], 'year': [2001]}
df = pd.DataFrame(data=d)
print(df)

   id  day  temperature  year
0   1    1           20  2001

d2 = {'id': [122, 244], 'day': [1, 1],
     'listing_id': [2, 4], 'price': [20, 440], 'year': [2001, 2001]}
df2 = pd.DataFrame(data=d2)
print(df2)

    id  day  listing_id  price  year
0  122    1           2     20  2001
1  244    1           4    440  2001

df3 = pd.merge(df,df2[['day', 'listing_id', 'price']],
              left_on='day', right_on = 'day',how='left')
print(df3)

   id  day  temperature  year  listing_id  price
0   1    1           20  2001           2     20
1   1    1           20  2001           4    440 # <-- The second temperature is wrong :/

This should not be so, because if I later still have a date from year 2002 which was in day 1 with a temperature of 30 and I want to calculate the average. Then I get the following formula: 20 + 20 + 30 = 23.3. The formula should be 20 + 30 = 25. Therefore, if a value has already been filled, there should be a NaN value in it.

Code Snippet

d = {'id': [1, 2, 3, 4, 5], 'day': [1, 2, 3, 4, 2], 
     'temperature': [20, 40, 50, 60, 20], 'year': [2001, 2002, 2004, 2005, 1999]}
df = pd.DataFrame(data=d)
print(df)

   id  day  temperature  year
0   1    1           20  2001
1   2    2           40  2002
2   3    3           50  2004
3   4    4           60  2005
4   5    2           20  1999


d2 = {'id': [122, 244, 387, 4454, 521], 'day': [1, 2, 3, 4, 2],
     'listing_id': [2, 4, 5, 6, 7], 'price': [20, 440, 500, 6600, 500], 
      'year': [2001, 2002, 2004, 2005, 2005]}
df2 = pd.DataFrame(data=d2)
print(df2)

     id  day  listing_id  price  year
0   122    1           2     20  2001
1   244    2           4    440  2002
2   387    3           5    500  2004
3  4454    4           6   6600  2005
4   521    2           7    500  2005


df3 = pd.merge(df,df2[['day','listing_id', 'price']],
              left_on='day', right_on = 'day',how='left').drop('day',axis=1)
print(df3)

   id  day  temperature  year  listing_id  price
0   1    1           20  2001           2     20
1   2    2           40  2002           4    440
2   2    2           40  2002           7    500
3   3    3           50  2004           5    500
4   4    4           60  2005           6   6600
5   5    2           20  1999           4    440
6   5    2           20  1999           7    500


What I want

   id  day  temperature  year  listing_id  price
0   1    1           20  2001           2     20
1   2    2           40  2002           4    440
2   2    2          NaN  2005           7    500
3   3    3           50  2004           5    500
4   4    4           60  2005           6   6600
5   5    2           20  1999         NaN    NaN

>Solution :

IIUC:

>>> df1.merge(df2[['day', 'listing_id', 'price', 'year']],
              on=['day', 'year'], how='outer')

    id  day  temperature  year  listing_id   price
0  1.0    1         20.0  2001         2.0    20.0
1  2.0    2         40.0  2002         4.0   440.0
2  3.0    3         50.0  2004         5.0   500.0
3  4.0    4         60.0  2005         6.0  6600.0
4  5.0    2         20.0  1999         NaN     NaN
5  NaN    2          NaN  2005         7.0   500.0

Leave a ReplyCancel reply