Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Is there any way match two different csv files with similar columns in python?

I am a beginner in numpy and I have two csv files that look like this:

csv1:

ID item_size Cost
0010 4.4
0010 5.5
0012 8
0012 10.1

csv2:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

ID item_size Cost
0010 6.1 5
0010 7.2 2
0010 5.3 1
0010 3.1 3
0010 4.7 2
0012 7.6 5
0012 22 4
0012 13.1 2
0012 9.2 3
0012 11.1 3

The output should look like this:
csv_output:

ID item_size Cost
0010 4.4 2
0010 5.5 1
0012 8 5
0012 10.1 3

"The closest value for item_size for 4.4 is 4.7, they also have the same ID so the Cost column is filled with a 2."

The task is to match the ID between the csv1 and csv2. Also to get the closest item size and add the cost. Mantaining the csv1 as the output csv.

So I have to fill the missing cost of csv1 based on the nearest item_size value between csv1 and csv2.
Also, they have to belong to the same ID.
I tried separating this problem into several tasks such as: nearest value problem, similar ID problem and filling the cost problem. Then, to combine all of them.

I already solved the nearest value problem. Using this script:

import numpy as np
 
def closest_value(input_list, input_value):

    arr = np.asarray(input_list)

    i = (np.abs(arr - input_value)).argmin()
    return arr[i]
 
list1 = dataset_1['item_size'].values.tolist()
list2 = dataset_2['item_size'].values.tolist()

for i in range(len(list2)):

    val=closest_value(list1,list2[i])

    print("The closest value to the "+ str(list[i])+" is",val)

I also tried solving the similar ID problem using this:

import pandas as pd

csv1 = pd.read_csv('csv1.csv')
csv2= pd.read_csv('csv2.csv')

csv2[csv2['ID'].duplicated(keep=False)]
csv2.groupby('ID',axis=0).apply(lambda x: x.to_csv(str(x.name) + '.csv'))

However, this method creates csv files based on ID similarity which is too computational resourceful. if there is another method to solve this, I will gladly take it. I have been trying to solve this problem for at least 5 days so any help is appreciated. Best Regards

>Solution :

You can use a merge_asof:

# save the index to restore it later
# sort the data for the merge_asof (required)
(pd.merge_asof(df1.reset_index().sort_values(by='item_size').drop(columns='Cost'),
              df2.sort_values(by='item_size'),
              by='ID', on='item_size', direction='nearest'
              )
   .set_index('index').sort_index() # restore original order
 )

Output:

       ID  item_size  Cost
index                     
0      10        4.4     2
1      10        5.5     1
2      12        8.0     5
3      12       10.1     3
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading