Home Is there any way match two different csv files with similar columns in python?

Questions

Is there any way match two different csv files with similar columns in python?

November 14, 2022

I am a beginner in numpy and I have two csv files that look like this:

csv1:

ID	item_size	Cost
0010	4.4
0010	5.5
0012	8
0012	10.1

csv2:

ID	item_size	Cost
0010	6.1	5
0010	7.2	2
0010	5.3	1
0010	3.1	3
0010	4.7	2
0012	7.6	5
0012	22	4
0012	13.1	2
0012	9.2	3
0012	11.1	3

The output should look like this:
csv_output:

ID	item_size	Cost
0010	4.4	2
0010	5.5	1
0012	8	5
0012	10.1	3

"The closest value for item_size for 4.4 is 4.7, they also have the same ID so the Cost column is filled with a 2."

The task is to match the ID between the csv1 and csv2. Also to get the closest item size and add the cost. Mantaining the csv1 as the output csv.

So I have to fill the missing cost of csv1 based on the nearest item_size value between csv1 and csv2.
Also, they have to belong to the same ID.
I tried separating this problem into several tasks such as: nearest value problem, similar ID problem and filling the cost problem. Then, to combine all of them.

I already solved the nearest value problem. Using this script:

import numpy as np
 
def closest_value(input_list, input_value):

    arr = np.asarray(input_list)

    i = (np.abs(arr - input_value)).argmin()
    return arr[i]
 
list1 = dataset_1['item_size'].values.tolist()
list2 = dataset_2['item_size'].values.tolist()

for i in range(len(list2)):

    val=closest_value(list1,list2[i])

    print("The closest value to the "+ str(list[i])+" is",val)

I also tried solving the similar ID problem using this:

import pandas as pd

csv1 = pd.read_csv('csv1.csv')
csv2= pd.read_csv('csv2.csv')

csv2[csv2['ID'].duplicated(keep=False)]
csv2.groupby('ID',axis=0).apply(lambda x: x.to_csv(str(x.name) + '.csv'))

However, this method creates csv files based on ID similarity which is too computational resourceful. if there is another method to solve this, I will gladly take it. I have been trying to solve this problem for at least 5 days so any help is appreciated. Best Regards

>Solution :

You can use a merge_asof:

# save the index to restore it later
# sort the data for the merge_asof (required)
(pd.merge_asof(df1.reset_index().sort_values(by='item_size').drop(columns='Cost'),
              df2.sort_values(by='item_size'),
              by='ID', on='item_size', direction='nearest'
              )
   .set_index('index').sort_index() # restore original order
 )

Output:

       ID  item_size  Cost
index                     
0      10        4.4     2
1      10        5.5     1
2      12        8.0     5
3      12       10.1     3