While scraping thousands of lines of data, I created a spellcheck function that automatically corrects specific, frequently misspelled terms before writing to file.
This works well for a standalone word like "apple" that I replace with "orange", but it becomes a problem with "pineapple", which turns into "pineorange". As a workaround, I pad the original term with a space on either side, but that misses occurrences followed by a character such as a period, "apple." for example.
What options do I have to improve the handling here? Preferably something other than a bunch of if checks on the last character.
spelling_dict = {
    "abc": "ABC",
    "apple": "Apple",
    "tortose": "Tortoise"
}

def spellcheck(line):
    for word, correction in spelling_dict.items():
        # Pad words with a space on either side
        word = word.center(len(word) + 2)
        correction = correction.center(len(correction) + 2)
        line = line.replace(word, correction)
    return line

myphrase = "For apple, I want to capitalize both occurrences of apple."
fixedphrase = spellcheck(myphrase)
print(fixedphrase)
>Solution :
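Looks like you want regular expressions. In this case, the pattern (the thing to look for) is the string apple wrapped in word boundaries, \b. In a regular Python string literal the backslash has to be escaped, so it is written "\\b"; a raw string, r"\bapple\b", is equivalent: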
import re
pattern = "\\bapple\\b"
phrase = "apple pineapple apples and apple."
print(re.sub(pattern, "orange", phrase))
Output:
orange pineapple apples and orange.
Notice how apple and apple. were replaced with orange and orange., but pineapple and apples remain unchanged.
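To carry this over to the dictionary from the question, one option is to build a single pattern from all the keys and look up the replacement for whichever term matched. This is only a minimal sketch: it assumes matching should stay case-sensitive and that every term starts and ends with a letter, digit, or underscore, since \b only matches at a transition between word and non-word characters. The lambda lookup and the use of re.escape are my additions, not something from the original code.

```python
import re

spelling_dict = {
    "abc": "ABC",
    "apple": "Apple",
    "tortose": "Tortoise",
}

# Build one alternation pattern: \b(?:abc|apple|tortose)\b
# re.escape guards against terms that contain regex metacharacters.
pattern = re.compile(
    r"\b(?:" + "|".join(re.escape(word) for word in spelling_dict) + r")\b"
)

def spellcheck(line):
    # Replace each whole-word match with its correction from the dictionary.
    return pattern.sub(lambda match: spelling_dict[match.group(0)], line)

print(spellcheck("For apple, I want to capitalize both occurrences of apple."))
# For Apple, I want to capitalize both occurrences of Apple.
print(spellcheck("pineapple and apples are left alone, abc is not."))
# pineapple and apples are left alone, ABC is not.
```

Compiling the pattern once also means each line is scanned a single time instead of once per dictionary entry, which may matter when processing thousands of lines.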