Not all duplicates are deleted from a text file in Python

March 18, 2022

I am new to Python. I am trying to delete duplicates from my text file by doing the following:

line_seen = set()

f = open('a.txt', 'r')
w = open('out.txt', 'w')

for i in f:
    if i not in line_seen:
        w.write(i)
        line_seen.add(i)

f.close()
w.close()

In the initial file I had

hello
world
python
world
hello

And in the output file I got

hello
world
python
hello

So it did not remove the last duplicate. Can anyone help me understand why this happened and how I could fix it?

>Solution :

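The main problem is the newline character ("\n"), which appears at the end of every line except the last one. The first hello is read as "hello\n" and the last one as "hello", so the set treats them as two different strings. You can strip each line and combine the set, map and join functions as follows: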

with open('a.txt', 'r') as f, open('out.txt', 'w') as w:
    # strip surrounding whitespace (including "\n") so "hello\n" and "hello" compare equal
    w.write("\n".join(set(map(str.strip, f))))

out.txt (the order of the lines may vary, because a set is unordered)

python
world
hello
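
If the original line order has to be kept, one possible sketch replaces set with dict.fromkeys. This assumes Python 3.7 or later, where dict preserves insertion order. The helper name unique_lines and the sample list are hypothetical, used only for illustration, and rstrip("\n") is used here instead of strip so that leading whitespace in a line is left alone:

```python
def unique_lines(lines):
    """Return the lines without trailing newlines, keeping only first occurrences in order."""
    # Remove only the trailing newline so "hello\n" and "hello" become the same key.
    stripped = (line.rstrip("\n") for line in lines)
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(stripped))


# Same content as a.txt in the question; the last line has no trailing newline.
sample = ["hello\n", "world\n", "python\n", "world\n", "hello"]
print(unique_lines(sample))  # ['hello', 'world', 'python']
```

With a real file, the result could be written back with "\n".join(unique_lines(f)) inside a with block.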

If you want to stick to your previous approach you can use:

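line_seen = set()

with open('a.txt', 'r') as f, open('out.txt', 'w') as w:
    for i in f:
        # remove the trailing newline so "hello\n" and "hello" match
        i = i.strip()
        if i not in line_seen:
            # strip() removed the newline, so add it back when writing
            w.write(i + "\n")
            line_seen.add(i)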
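
To see the cause directly, the following sketch reads the same five lines and compares the first and last one. Here io.StringIO stands in for a.txt, an assumption made only so the example runs without any files on disk:

```python
import io

# In-memory stand-in for a.txt; note there is no "\n" after the final "hello".
fake_file = io.StringIO("hello\nworld\npython\nworld\nhello")

# Iterating a file-like object yields lines with their trailing newline intact.
lines = list(fake_file)
print(lines)  # ['hello\n', 'world\n', 'python\n', 'world\n', 'hello']

# The first and last "hello" differ by the trailing newline...
print(lines[0] == lines[-1])  # False
# ...and compare equal once the newline is stripped.
print(lines[0].strip() == lines[-1].strip())  # True
```

Printing the lines as a list (or with repr) is a quick way to spot invisible characters like this when a comparison fails unexpectedly.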