Not all duplicates are deleted from a text file in Python

I am new to Python. I am trying to delete duplicates from my text file by doing the following:

line_seen = set()

f = open('a.txt', 'r')
w = open('out.txt', 'w')

for i in f:
    if i not in line_seen:
        w.write(i)
        line_seen.add(i)

f.close()
w.close()

In the initial file I had

hello
world
python
world
hello

And in the output file I got

hello
world
python
hello

So it did not remove the last duplicate. Can anyone help me understand why this happened and how I could fix it?

>Solution :

The problem is the newline character ("\n"), which appears at the end of every line except the last one. The first line is read as "hello\n" and the last line as "hello", so the set treats them as two different strings. You can strip each line and combine set, map and join, as follows:

with open('a.txt', 'r') as f, open('out.txt', 'w') as w:
    # set() removes duplicates but does not keep the original line order
    w.write("\n".join(set(map(str.strip, f))))

out.txt (the order may differ from run to run, because a set is unordered)

python
world
hello
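
The output above no longer follows the order of the input, because a set does not keep one. If the order matters, one possible sketch is to use dict.fromkeys, which keeps the first occurrence of each key in insertion order (guaranteed for dict since Python 3.7). The function name dedupe_lines is hypothetical and only used for illustration; the sample list stands in for the lines of a.txt.

```python
def dedupe_lines(lines):
    """Return unique lines in first-seen order, ignoring trailing newlines."""
    # Remove only the trailing newline so "hello\n" and "hello" compare equal.
    stripped = (line.rstrip("\n") for line in lines)
    # dict.fromkeys keeps the first occurrence of each key, in insertion order.
    return list(dict.fromkeys(stripped))


# Same content as the example file; the last line has no trailing newline.
sample = ["hello\n", "world\n", "python\n", "world\n", "hello"]
print(dedupe_lines(sample))  # ['hello', 'world', 'python']
```

To write the result to a file, "\n".join(dedupe_lines(f)) can be passed to w.write in the same way as in the one-liner.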

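If you want to stick to your previous approach, strip each line before checking it and add the newline back when writing:

line_seen = set()

f = open('a.txt', 'r')
w = open('out.txt', 'w')

for i in f:
    i = i.strip()  # drop the trailing "\n" so equal lines compare equal
    if i not in line_seen:
        w.write(i + "\n")  # strip() removed the newline, so write it back
        line_seen.add(i)

f.close()
w.close()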
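
To check the diagnosis on your own data, printing repr() of each line makes the hidden newline visible. The sketch below uses io.StringIO as a stand-in for a.txt and assumes, as in the question, that the file does not end with a newline.

```python
import io

# Stand-in for a.txt; assumes the file does not end with a newline.
fake_file = io.StringIO("hello\nworld\npython\nworld\nhello")
lines = list(fake_file)

# repr() shows the trailing "\n" that a plain print would hide.
print([repr(line) for line in lines])

first, last = lines[0], lines[-1]
print(first == last)               # False: "hello\n" vs "hello"
print(first.rstrip("\n") == last)  # True once the newline is removed
```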