Fill missing value with with value from another row and if not take the next similar value

I have a problem. I have missing numbers inside the column materialNumber. But if a similar price is it should take the exactly same materialNumber. If more than two materialNumber occur with the same price it should take the first. If no materialNumber is found with the same price it should take the next nearst materialnumber depending on the price.

Dataframe

   customerId  materialNumber  price
0           1          1234.0    100
1           1          4562.0     20
2           2             NaN    100
3           2          4562.0     30
4           3          1547.0     40
5           3             NaN     37

Code

import pandas as pd
d = {
    "customerId": [1, 1, 2, 2, 3, 3],
    "materialNumber": [
        1234,
        4562,
        None,
        4562,
        1547,
        None, 
    ],
    "price": [100, 20, 100, 30, 40, 37],
}
df = pd.DataFrame(data=d)
print(df)

import numpy as np
def find_next(x):
  if(x['materialNumber'] == None):
    #if price occurs only once it should finde the next nearst price
    if(x['price'].value_counts().shape[0] == 1): 
      return x.drop_duplicates(subset=['price'], keep="first")
    else:
      return x.iloc[(x['price']-input).abs().argsort()[:2]]

df['materialNumber'] = df.apply(lambda x: find_next(x), axis=1)

What I want

   customerId  materialNumber  price
0           1          1234.0    100
1           1          4562.0     20
2           2            1234    100 # of index 0: 1234.0, 100 (same value)
3           2          4562.0     30
4           3          1547.0     40
5           3            1547     37 # of index 4: 1547.0, 40 (next similar value)

>Solution :

Use merge_asof with match rows with missing values per materialNumber by rows without missing values and assign values in DataFrame.loc:

m = df['materialNumber'].isna()
new = pd.merge_asof(df[m].reset_index().sort_values('price'),
                    df[~m].sort_values('price'), on='price', direction='nearest')

df.loc[m, 'materialNumber'] = new.set_index('index')['materialNumber_y']
print(df)
   customerId  materialNumber  price
0           1          1234.0    100
1           1          4562.0     20
2           2          1234.0    100
3           2          4562.0     30
4           3          1547.0     40
5           3          1547.0     37

Leave a Reply