A More Efficient Way to Compare CSV Columns and Adding to a certain row on Second CSV (Python)

Advertisements

So i’m new to python, I want to discover it’s potential and get more knowledge about what I can do with it.
I did this code to compare CSVs, basically what it does you provide it 2 CSVs, CSV1 that has some id column and a column with values that you want to add to other CSV (CSV2)

Note: This script does exactly what I want, and it seems to work fine, hope it can also be useful to someone, my question is really what I can do to improve it’s performance or even make the code cleaner

# Made by Varqas
# CSV1 = CSV containing values that can be matched in CSV2 and a column that will be added
# CSV2 = CSV containing values that can be matched and column that will be concatenated at the end of the CSV (The last column values should be empty)

with open('csv1.csv', encoding="utf8") as check_file:
    # Get Column that will be used to Compare values and add it to a list
    columnToCompare = list([row.split(',')[0].strip() for row in check_file])

with open('csv1.csv', encoding="utf8") as check_file:
    # Get Column that will be used to add to a row values and add it to a list
    columnToAdd = list([row.split(',')[2].strip() for row in check_file])

with open('csv2.csv', 'r', encoding="utf8") as in_file, open('out.csv', 'w', encoding="utf8") as out_file:
    i = 0
    # For each Row in CSV2
    for line in in_file:
        # Write Headers
        if i == 0:
            out_file.write(line)
        else:
            # GET Column on CSV2 containing value that will be compared on CVS1
            value = line.split(',')[1].strip()
            # Check if first Column value on CSV2  either variable is in 
            if value in columnToCompare:
                # Check for duplicates in the list 
                numberOfOccurences = list(columnToCompare).count(value)
                concatRow = ""
                if numberOfOccurences > 1:
                    # Concatenate all values of occurences
                    for x in range(numberOfOccurences):
                        index = list(columnToCompare).index(value)
                        concatRow = concatRow + columnToAdd[index]
                        if x != numberOfOccurences - 1:
                            concatRow = concatRow + " + "
                        # Remove value so list.index doesn't found same row
                        columnToCompare[index] = ""
                else:
                    # Add other row that doesn't match
                    index = list(columnToCompare).index(value)
                    concatRow = columnToAdd[index]

                # Concat to last column of CSV2
                out_file.write(line.strip() + concatRow + "\n")
            else:
                # Still concat value in CSV2 to last column if not found in csv1 
                out_file.write(line.strip() + "not found" + "\n")
        i = i + 1

I know it can be improved, and minified perhaps using some libs… Let me know what you think!

I tried using pd merge, but I didn’t understand quite well how I could add concatenation and values inside it.

>Solution :

You can use the Pandas library to read the two CSV files into dataframes and merge the two columns into the second CSV and output a new CSV with the merged columns.

import pandas as pd

# read first CSV
df1 = pd.read_csv('first.csv')

# read second CSV
df2 = pd.read_csv('second.csv')

# merge the id column and a "column with values 
# that you want to add to other CSV (CSV2)"
# for the example the second column is named 'data'.
merged_df = pd.merge(df2, df1[['id', 'data']], on='id', how='left')

# save new dataframe to csv.
merged_df.to_csv('merged.csv', index=False)

Leave a ReplyCancel reply