I have a list of strings that mix English and non-English (Devanagari) words. I want to keep only the English part of each string.
Example:
phrases = [
"S/O अशोक कुमार, ब्लॉक न.-4डी, S/O Ashok Kumar, Block no.-4D.",
"स्ट्रीट-15, विभाग 5. सिविक सेंटर Street-15, sector -5, Civic Centre",
"भिलाई, दुर्ग, भिलाई, छत्तीसगढ़, Bhilai, Durg. Bhilai, Chhattisgarh,",
]
My code so far:
import re
regex = re.compile("[^a-zA-Z0-9!@#$&()\\-`.+,/\"]+")
for i in phrases:
    print(regex.sub(' ', i))
My output:
["S/O , .-4 , S/O Ashok Kumar, Block no.-4D.",
"-15, 5. Street-15, sector -5, Civic Centre",
", , , , Bhilai, Durg. Bhilai, Chhattisgarh",]
My desired output:
["S/O Ashok Kumar, Block no.-4D.",
"Street-15, sector -5, Civic Centre",
"Bhilai, Durg. Bhilai, Chhattisgarh,"]
>Solution :
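Judging by your data, you could use the following. It relies on the third-party regex module, because the standard re module does not support \p{...} script properties: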
import regex as re
lst=["S/O अशोक कुमार, ब्लॉक न.-4डी, S/O Ashok Kumar, Block no.-4D.",
"स्ट्रीट-15, विभाग 5. सिविक सेंटर Street-15, sector -5, Civic Centre",
"भिलाई, दुर्ग, भिलाई, छत्तीसगढ़, Bhilai, Durg. Bhilai, Chhattisgarh,",]
for i in lst:
    print(re.sub(r'^.*\p{Devanagari}.+?\b', '', i))
Prints:
S/O Ashok Kumar, Block no.-4D.
Street-15, sector -5, Civic Centre
Bhilai, Durg. Bhilai, Chhattisgarh,
How the pattern works:
^ - Start-of-string anchor;
.*\p{Devanagari} - 0+ (greedy) characters up to the last Devanagari character;
.+?\b - 1+ (lazy) characters up to the first word boundary.
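If installing the regex module is not an option, the same idea can probably be sketched with the standard re module by replacing \p{Devanagari} with an explicit character range. This is only an approximation: it assumes the text stays within the basic Devanagari block (U+0900 to U+097F), whereas the script property also covers characters outside that block. The helper name strip_devanagari_prefix is made up for this example.

```python
import re

# Assumption: Devanagari is approximated by the basic Unicode block
# U+0900-U+097F. The regex module's \p{Devanagari} covers a wider set.
DEVANAGARI_PREFIX = re.compile(r'^.*[\u0900-\u097F].+?\b')


def strip_devanagari_prefix(text):
    """Remove everything up to the first word boundary that follows
    the last Devanagari character; leave other strings unchanged."""
    return DEVANAGARI_PREFIX.sub('', text)


phrases = [
    "S/O अशोक कुमार, ब्लॉक न.-4डी, S/O Ashok Kumar, Block no.-4D.",
    "स्ट्रीट-15, विभाग 5. सिविक सेंटर Street-15, sector -5, Civic Centre",
    "भिलाई, दुर्ग, भिलाई, छत्तीसगढ़, Bhilai, Durg. Bhilai, Chhattisgarh,",
]

cleaned = [strip_devanagari_prefix(p) for p in phrases]
for line in cleaned:
    print(line)
```

Note that this only works when the English text follows the Devanagari text, as it does in the sample data. A string with no Devanagari characters does not match the pattern and is returned as-is.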