I have downloaded the tab-separated Tatoeba dataset of English-German sentence pairs to train an NMT model on it. Unfortunately, each line ends with additional licence and attribution information:
Go. Geh. CC-BY 2.0 (France) Attribution: tatoeba.org #2877272 (CM) & #8597805 (Roujin)
Hi. Hallo! CC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #380701 (cburgmer)
How can I strip the part after the second sentence for each line in the text file?
I tried doing it in python:
for line in text:
    split = line.split('CC-BY', 1)
    line = split[0]
…but that didn’t work. What I’m looking for is a file that looks like this:
Go. Geh.
Hi. Hallo!
For any help I would be very grateful 🙂
>Solution :
The idea of using split is correct, but assigning to the loop variable inside a for loop only rebinds the name line; it does not change the elements of the list.
It is also better to avoid split as a variable name, since it is easily confused with the str.split method.
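To make the difference concrete, here is a small sketch using shortened versions of the two sample lines from the question. It first repeats the rebinding pattern, which leaves the list untouched, and then assigns by index, which does change it. The name unchanged is only there for illustration.

```python
text = [
    "Go. Geh. CC-BY 2.0 (France) Attribution: tatoeba.org #2877272 (CM)",
    "Hi. Hallo! CC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM)",
]

# Rebinding the loop variable only changes the local name `line`;
# the list elements stay exactly as they were.
for line in text:
    line = line.split("CC-BY", 1)[0]

unchanged = list(text)  # snapshot taken after the first loop

# Assigning by index replaces the list elements in place.
for i, line in enumerate(text):
    text[i] = line.split("CC-BY", 1)[0].strip()

print(unchanged[0])  # still contains the attribution
print(text)          # attribution removed
```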
A list comprehension will do the job:
new_lines = [line.split('CC-BY', 1)[0].strip() for line in text]
The strip() call is added because you probably want to remove the whitespace left at the end of each line after the split.
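Since the question describes the file as tab-separated, another option is to split on the tab character and keep only the first two fields. This is only a sketch: it assumes the real file separates the English sentence, the German sentence, and the attribution with tabs (the sample lines above do not show them), and keep_sentence_pair is a hypothetical helper name.

```python
def keep_sentence_pair(line):
    """Return the first two tab-separated fields of a line, joined by a tab."""
    fields = line.rstrip("\n").split("\t")
    return "\t".join(fields[:2])


# Assumed layout: English <TAB> German <TAB> attribution
sample = (
    "Go.\tGeh.\tCC-BY 2.0 (France) Attribution: tatoeba.org "
    "#2877272 (CM) & #8597805 (Roujin)\n"
)
print(repr(keep_sentence_pair(sample)))
```

This variant does not depend on the licence string starting with CC-BY, and it keeps the tab between the two sentences, which can make it easier to separate source and target later.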
With your input text saved as text.txt, the following code:
with open("text.txt", encoding="utf8") as f:
    text = f.read().splitlines()
new_lines = [line.split('CC-BY', 1)[0].strip() for line in text]
for line in new_lines:
    print(line)
gives the output:
Go. Geh.
Hi. Hallo!
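Because the question asks for a file rather than printed output, the cleaned lines can also be written straight back to disk. A minimal sketch, assuming an output file called cleaned.txt and a helper named strip_attribution (both names are made up here); it creates a throwaway text.txt with the two sample lines only so that the example runs on its own.

```python
import tempfile
from pathlib import Path


def strip_attribution(src, dst, marker="CC-BY"):
    """Copy src to dst, dropping everything from `marker` onwards on each line."""
    with open(src, encoding="utf8") as fin, open(dst, "w", encoding="utf8") as fout:
        for line in fin:
            fout.write(line.split(marker, 1)[0].strip() + "\n")


# Demo on a temporary copy of the two sample lines from the question.
workdir = Path(tempfile.mkdtemp())
src = workdir / "text.txt"
dst = workdir / "cleaned.txt"  # hypothetical output name
src.write_text(
    "Go. Geh. CC-BY 2.0 (France) Attribution: tatoeba.org #2877272 (CM) & #8597805 (Roujin)\n"
    "Hi. Hallo! CC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #380701 (cburgmer)\n",
    encoding="utf8",
)

strip_attribution(src, dst)
result = dst.read_text(encoding="utf8")
print(result)
```

Reading and writing line by line avoids holding the whole dataset in memory, which may matter if the full file is large.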