Tagging foreign text using isascii in Python

I would like to create a very simple script for identifying non-English words: it should replace every word in a text with a <FOREIGN> tag if that word contains any non-ASCII character. For this I used the str.isascii() method.

I have the following sample string:

s = "abc def déf äëü المزيد 한글  - 1 2 3 4 5"

And the following is the expected output:

s_exp = "abc def <FOREIGN> <FOREIGN> <FOREIGN> <FOREIGN> - 1 2 3 4 5"

My current working implementation is:

import re
for word in s.split():
    if not word.isascii():
        s = re.sub(word, "<FOREIGN>", s)

While this works perfectly for small amounts of data, I am worried about its performance on hundreds of thousands of rows of text stored in a pandas DataFrame. Is there a solution that performs better than this for loop? At the moment, I am using
df['Text'].apply(lambda x: replace_nonenglish(x)) where replace_nonenglish is:

def replace_nonenglish(s):
    for word in s.split():
        if not word.isascii():
            s = re.sub(word, "<FOREIGN>", s)
    return s

Note:

I am fully aware that this will produce a number of false negatives, i.e. non-English words that are not tagged as <FOREIGN>, such as the French "bien" or the German "gut", but that is acceptable for now.
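
One caveat about the loop above: each word is passed to re.sub as a pattern, so a token containing regex metacharacters, such as "café)", is parsed as a regex and can raise re.error. Below is a minimal sketch of a safer variant, assuming literal matching is what is intended. The name replace_nonenglish_escaped and the sample sentence are made up for illustration.

```python
import re

def replace_nonenglish_escaped(s):
    # Hypothetical variant of replace_nonenglish: re.escape() makes each word
    # match literally instead of being parsed as a regex pattern.
    for word in s.split():
        if not word.isascii():
            s = re.sub(re.escape(word), "<FOREIGN>", s)
    return s

sample = "visit (le café) now"

# The unescaped token "café)" is not a valid pattern (unbalanced parenthesis).
try:
    re.sub("café)", "<FOREIGN>", sample)
    raised = False
except re.error:
    raised = True

print(raised)                              # True
print(replace_nonenglish_escaped(sample))  # visit (le <FOREIGN> now
```

Since split() only splits on whitespace, the closing parenthesis is part of the token and is replaced together with the word.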

Solution:

You can use a single regex substitution:

import re
s = "abc def déf äëü المزيد 한글  - 1 2 3 4 5"
print( re.sub(r"\b[a-zA-Z]*[^\W\d_a-zA-Z][^\W\d_]*\b", "<FOREIGN>", s) )
# => abc def <FOREIGN> <FOREIGN> <FOREIGN> <FOREIGN>  - 1 2 3 4 5

See the Python demo and a regex demo.

Details:

  • \b – a word boundary (it is Unicode aware in Python by default)
  • [a-zA-Z]* – zero or more ASCII letters
  • [^\W\d_a-zA-Z] – any Unicode letter except an ASCII letter
  • [^\W\d_]* – zero or more Unicode letters
  • \b – a word boundary.
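
For many rows, the pattern described above could be compiled once and wrapped in a function, so that each row is handled in a single substitution pass rather than one re.sub call per foreign word. This is only a sketch: the names FOREIGN_WORD, tag_foreign and rows are assumptions for illustration, and any actual speed-up should be measured on your own data.

```python
import re

# Compiled once, so the pattern is not looked up or re-parsed for every row.
FOREIGN_WORD = re.compile(r"\b[a-zA-Z]*[^\W\d_a-zA-Z][^\W\d_]*\b")

def tag_foreign(text):
    """Replace each letter-only word that contains a non-ASCII letter."""
    return FOREIGN_WORD.sub("<FOREIGN>", text)

# Stand-in for the rows of a text column.
rows = ["abc def déf äëü", "한글 - 1 2 3", "plain ascii only"]
tagged = [tag_foreign(row) for row in rows]
print(tagged)
# ['abc def <FOREIGN> <FOREIGN>', '<FOREIGN> - 1 2 3', 'plain ascii only']
```

With the DataFrame from the question, this would presumably be applied as df['Text'].apply(tag_foreign). Note one difference from the isascii loop: the pattern only matches runs of letters, so a token that mixes letters and digits, such as "déf2", is left unchanged.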

With the PyPI regex library (install it with pip install regex in your terminal) it looks a bit cleaner:

import regex
s = "abc def déf äëü المزيد 한글  - 1 2 3 4 5"
print( regex.sub(r"\b[a-zA-Z]*[^\P{L}a-zA-Z]\p{L}*\b", "<FOREIGN>", s) )

See this Python demo. Here, \p{L} matches any Unicode letter and \P{L} matches any char other than a Unicode letter.
