I have a problem. I have missing numbers inside the column materialNumber. But if a similar price is it should take the exactly same materialNumber. If more than two materialNumber occur with the same price it should take the first. If no materialNumber is found with the same price it should take the next nearst materialnumber depending on the price.
Dataframe
customerId materialNumber price
0 1 1234.0 100
1 1 4562.0 20
2 2 NaN 100
3 2 4562.0 30
4 3 1547.0 40
5 3 NaN 37
Code
import pandas as pd
d = {
"customerId": [1, 1, 2, 2, 3, 3],
"materialNumber": [
1234,
4562,
None,
4562,
1547,
None,
],
"price": [100, 20, 100, 30, 40, 37],
}
df = pd.DataFrame(data=d)
print(df)
import numpy as np
def find_next(x):
if(x['materialNumber'] == None):
#if price occurs only once it should finde the next nearst price
if(x['price'].value_counts().shape[0] == 1):
return x.drop_duplicates(subset=['price'], keep="first")
else:
return x.iloc[(x['price']-input).abs().argsort()[:2]]
df['materialNumber'] = df.apply(lambda x: find_next(x), axis=1)
What I want
customerId materialNumber price
0 1 1234.0 100
1 1 4562.0 20
2 2 1234 100 # of index 0: 1234.0, 100 (same value)
3 2 4562.0 30
4 3 1547.0 40
5 3 1547 37 # of index 4: 1547.0, 40 (next similar value)
>Solution :
Use merge_asof with match rows with missing values per materialNumber by rows without missing values and assign values in DataFrame.loc:
m = df['materialNumber'].isna()
new = pd.merge_asof(df[m].reset_index().sort_values('price'),
df[~m].sort_values('price'), on='price', direction='nearest')
df.loc[m, 'materialNumber'] = new.set_index('index')['materialNumber_y']
print(df)
customerId materialNumber price
0 1 1234.0 100
1 1 4562.0 20
2 2 1234.0 100
3 2 4562.0 30
4 3 1547.0 40
5 3 1547.0 37