Finding duplicates in two separate txt files line by line and printing only the duplicates

TL;DR: Open two txt files, use one to search the other, and then print any duplicates.

Hi everyone, this is my first time posting here and I'm very new to coding and Python. I've been searching for an answer but can't find anything that uses .txt files the way I'm trying to. I am trying to search test2 for a single string or a group of strings taken from the file test. I'm using txt files because they contain thousands of different strings, so typing each value into a Python list by hand would be impractical.

from itertools import chain

f1 = open(r"test.txt", "r")
f2 = open(r"test2.txt", "r")
file1 = f1.read().splitlines()
file2 = f2.read().splitlines()
x = [file1]
y = [file2]
z = list(chain([x,y]))
z.sort()
d = (x for x in z if z.count (x) > 1)
print (d)
f1.close()
f2.close()

The result I get is this:

<generator object <genexpr> at 0x7f10cc992420>

I expected a printout of any duplicates found in the combined list I created with list(chain()). Any help or suggestions would be greatly appreciated!

>Solution :

Expanding on my comment: it seems like you are tossing square brackets around things willy-nilly and hoping they will work, but every place you are currently using square brackets, you shouldn't be.

.splitlines() already returns a list. You don’t have to take that return and put it inside of another list.

chain() takes each iterable as its own argument, so sticking your two lists inside yet another list and passing that as a single argument isn't going to do what you want.
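As a rough illustration of that difference, here is a small sketch. The two short lists are made-up stand-ins for the lines read from test.txt and test2.txt, and the names wrapped and flat are only for this example.

```python
from itertools import chain

file1 = ["a", "b"]  # stand-in for the lines of test.txt (assumed data)
file2 = ["b", "c"]  # stand-in for the lines of test2.txt (assumed data)

# One argument (a list holding two lists): chain yields the two inner lists
wrapped = list(chain([file1, file2]))

# Two arguments: chain yields the items of file1, then the items of file2
flat = list(chain(file1, file2))

print(wrapped)  # [['a', 'b'], ['b', 'c']]
print(flat)     # ['a', 'b', 'b', 'c']
```

If the lists really are already collected inside one outer list, chain.from_iterable(outer) flattens that form instead.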

These mistakes are all pretty easy to catch with some basic debugging. For instance, if you had tossed a print(x) after setting that variable, you would have found it prints [['stuff', 'from', 'file', '1']]. The same goes for y: [['stuff', 'from', 'file', '2']]. You have a list inside of another list.

You could also do this for the argument you pass into chain(). print([x, y]) would show [[['stuff', 'from', 'file', '1']], [['stuff', 'from', 'file', '2']]]: list-ception.

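Lastly, the one spot where you do want square brackets is your comprehension. With parentheses it is a generator expression, which is why print(d) shows a generator object instead of the matches. Switch the parentheses to square brackets to make it a list comprehension.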
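A minimal sketch of what the brackets change, using a short made-up list in place of the combined file contents (the names gen, lst, and consumed are only for this example):

```python
z = ["a", "match", "match", "z"]  # assumed stand-in for the sorted, combined lines

# Parentheses: a lazy generator expression; printing it shows the object itself
gen = (x for x in z if z.count(x) > 1)
print(gen)  # something like <generator object <genexpr> at 0x...>

# Square brackets: a list comprehension that is evaluated immediately
lst = [x for x in z if z.count(x) > 1]
print(lst)  # ['match', 'match']

# A generator can still be turned into a list by consuming it
consumed = list(gen)
print(consumed)  # ['match', 'match']
```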

Instead:

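from itertools import chain

# Read both files into lists of lines; "with" closes each file automatically
with open("test.txt", "r") as f1, open("test2.txt", "r") as f2:
    file1 = f1.read().splitlines()
    file2 = f2.read().splitlines()

# Pass the two lists to chain() as separate arguments to get one flat list
z = list(chain(file1, file2))
z.sort()

# Square brackets build a list of every line that occurs more than once
d = [x for x in z if z.count(x) > 1]
print(d)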

This will print ['match', 'match'] (assuming the only line that appears in both files is the word 'match').
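Two things are worth knowing about the approach above: calling z.count() for every element rescans the whole list each time, which can get slow with thousands of lines, and a line repeated within a single file is also reported. If the goal is only the lines that occur in both files, a set intersection is one alternative. This is a sketch rather than the method from the question; read_lines and common_lines are hypothetical helper names, and the short lists stand in for the real files.

```python
def read_lines(path):
    """Return the lines of a text file without trailing newlines."""
    with open(path, "r") as handle:
        return handle.read().splitlines()


def common_lines(lines_a, lines_b):
    """Return each line present in both inputs once, in sorted order."""
    return sorted(set(lines_a) & set(lines_b))


# In-memory stand-ins for read_lines("test.txt") and read_lines("test2.txt")
file1 = ["apple", "match", "pear", "pear"]
file2 = ["match", "kiwi"]

result = common_lines(file1, file2)
print(result)  # ['match']
```

With the real files, the call would be common_lines(read_lines("test.txt"), read_lines("test2.txt")). Each shared line is printed once, and "pear" is not reported because it repeats only inside one file.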
