How to strip a certain piece of text from each line of a text file?

December 28, 2021

I have downloaded the tab-separated Tatoeba dataset with English-German sentence pairs to train an NMT model on it. Unfortunately, each line ends with additional license and attribution information:

Go. Geh.    CC-BY 2.0 (France) Attribution: tatoeba.org #2877272 (CM) & #8597805 (Roujin)
Hi. Hallo!  CC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #380701 (cburgmer)

How can I strip the part after the second sentence for each line in the text file?

I tried doing it in Python:

for line in text:
  split = line.split('CC-BY', 1)
  line = split[0]

…but that didn’t work. What I’m looking for is a file that looks like this:

Go. Geh.
Hi. Hallo!

I would be very grateful for any help 🙂

Solution:

The idea of using split is correct, but assigning to the loop variable inside a for loop only rebinds the name line to a new string; the elements of the list are left unchanged.

You should also avoid using split as a variable name: it is already the name of a string method, so reusing it makes the code harder to read.
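To see why the loop from the question has no effect, here is a minimal sketch. The two sample lines are modelled on the question's data and assume the fields really are tab-separated; the names original and unchanged are only for illustration.

```python
# Sample lines modelled on the question's data (assumed tab-separated).
text = [
    "Go.\tGeh.\tCC-BY 2.0 (France) Attribution: tatoeba.org #2877272 (CM) & #8597805 (Roujin)",
    "Hi.\tHallo!\tCC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #380701 (cburgmer)",
]

original = list(text)  # shallow copy, kept for comparison

# The loop from the question: `line` is rebound to a new string,
# but the list still holds the old strings.
for line in text:
    line = line.split("CC-BY", 1)[0]

unchanged = text == original  # the list was not modified

# Assigning back by index does change the list in place.
for i, line in enumerate(text):
    text[i] = line.split("CC-BY", 1)[0].strip()

print(unchanged)
print(text)
```

Assigning by index works, but building a new list is usually shorter and easier to read.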

A list comprehension will do the job:

new_lines = [line.split('CC-BY', 1)[0].strip() for line in text]

The strip is added because you probably want to remove the trailing whitespace (the separator before the license text) left at the end of each line.
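Since the file is described as tab-separated, another option is to split on the tab character and keep only the first two fields, which does not depend on the license text starting with CC-BY. This is only a sketch under that assumption; keep_sentence_pair and strip_attribution are hypothetical helper names, not part of any library.

```python
def keep_sentence_pair(line: str) -> str:
    """Return only the first two tab-separated fields of a line."""
    fields = line.rstrip("\n").split("\t")
    return "\t".join(fields[:2])


def strip_attribution(src: str, dst: str) -> None:
    """Copy src to dst, keeping only the sentence pair on each line."""
    with open(src, encoding="utf8") as fin, open(dst, "w", encoding="utf8") as fout:
        for line in fin:
            fout.write(keep_sentence_pair(line) + "\n")


# Sample line modelled on the question's data (assumed tab-separated).
sample = "Go.\tGeh.\tCC-BY 2.0 (France) Attribution: tatoeba.org #2877272 (CM) & #8597805 (Roujin)"
print(keep_sentence_pair(sample))
```

Calling strip_attribution("text.txt", "clean.txt") would then write the cleaned lines to a new file; both file names are placeholders.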

With your input text saved as text.txt, the following code:

with open("text.txt", encoding="utf8") as f:
    text = f.read().splitlines()

new_lines = [line.split('CC-BY', 1)[0].strip() for line in text]

for line in new_lines:
    print(line)

gives the output: