Tagging foreign text using isascii in Python

January 2, 2022

I would like to create a very simple script for identifying non-English words: it should replace a word in a text with a <FOREIGN> tag if that word contains any non-ASCII character. For this I used the str.isascii() method.

I have the following sample string:

s = "abc def déf äëü المزيد 한글  - 1 2 3 4 5"

And the following is the expected output:

s_exp = "abc def <FOREIGN> <FOREIGN> <FOREIGN> <FOREIGN> - 1 2 3 4 5"

My current working implementation is:

import re
for word in s.split():
    if not word.isascii():
        # re.escape stops regex metacharacters in the word from being interpreted
        s = re.sub(re.escape(word), "<FOREIGN>", s)

While this works perfectly for small amounts of data, I am worried about its performance on hundreds of thousands of rows of text stored in a pandas DataFrame. I was wondering whether there is a better-performing solution than this for loop. At the moment, I am using df['Text'].apply(lambda x: replace_nonenglish(x)), where replace_nonenglish is:

def replace_nonenglish(s):
    for word in s.split():
        if not word.isascii():
            # re.escape stops regex metacharacters in the word from being interpreted
            s = re.sub(re.escape(word), "<FOREIGN>", s)
    return s

Note:

I am fully aware that this will produce a number of false negatives, i.e. non-English words that are not tagged as <FOREIGN>, such as the French "bien" or the German "gut", but that is acceptable for now.
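As a quick illustration of the note above, the check below shows why such false negatives occur: str.isascii() only inspects code points and knows nothing about the language of a word. The word list and the `flags` name are made up for this sketch.

```python
# str.isascii() is True when every character is in the ASCII range,
# regardless of which language the word belongs to.
words = ["bien", "gut", "déf", "한글"]  # hypothetical sample words
flags = {word: word.isascii() for word in words}
print(flags)
# "bien" and "gut" are pure ASCII, so they would not be tagged.
```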

Solution:

You can also use

import re
s = "abc def déf äëü المزيد 한글  - 1 2 3 4 5"
print(re.sub(r"\b[a-zA-Z]*[^\W\d_a-zA-Z][^\W\d_]*\b", "<FOREIGN>", s))
# => abc def <FOREIGN> <FOREIGN> <FOREIGN> <FOREIGN>  - 1 2 3 4 5


Details:

\b – a word boundary (Unicode-aware by default for str patterns in Python 3)
[a-zA-Z]* – zero or more ASCII letters
[^\W\d_a-zA-Z] – any Unicode letter other than an ASCII letter
[^\W\d_]* – zero or more Unicode letters
\b – a word boundary.
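The breakdown above can also be kept next to the pattern itself by compiling it with re.VERBOSE. This is only a sketch of one way to write it; the names `VERBOSE_FOREIGN` and `tag_verbose` are hypothetical and not part of the original answer.

```python
import re

# Same pattern as above, spelled out with inline comments via re.VERBOSE.
VERBOSE_FOREIGN = re.compile(
    r"""
    \b                # word boundary
    [a-zA-Z]*         # zero or more ASCII letters
    [^\W\d_a-zA-Z]    # one Unicode letter that is not an ASCII letter
    [^\W\d_]*         # zero or more Unicode letters
    \b                # word boundary
    """,
    re.VERBOSE,
)

def tag_verbose(text: str) -> str:
    """Replace each word containing a non-ASCII letter with <FOREIGN>."""
    return VERBOSE_FOREIGN.sub("<FOREIGN>", text)

print(tag_verbose("abc déf äëü 한글 - 1 2"))
```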

With the regex library from PyPI (install it with pip install regex in your terminal/console window), the pattern looks a bit cleaner:

import regex
s = "abc def déf äëü المزيد 한글  - 1 2 3 4 5"
print(regex.sub(r"\b[a-zA-Z]*[^\P{L}a-zA-Z]\p{L}*\b", "<FOREIGN>", s))

Here, \p{L} matches any Unicode letter and \P{L} matches any character other than a Unicode letter.
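For the many-rows case raised in the question, one option is to compile the pattern once and reuse it for every row. The sketch below is an illustration under assumptions, not a benchmark: `FOREIGN_WORD`, `tag_foreign` and the sample `rows` list are hypothetical names, and a plain list stands in for the df['Text'] column.

```python
import re

# Compiled once and reused for every row (the re-based pattern from the answer).
FOREIGN_WORD = re.compile(r"\b[a-zA-Z]*[^\W\d_a-zA-Z][^\W\d_]*\b")

def tag_foreign(text: str) -> str:
    """Replace every word containing a non-ASCII letter with <FOREIGN>."""
    return FOREIGN_WORD.sub("<FOREIGN>", text)

# Hypothetical sample rows standing in for a DataFrame column.
rows = ["abc déf", "gut, bien", "한글 123"]
tagged = [tag_foreign(row) for row in rows]
print(tagged)
```

With pandas, the same function could presumably be applied in the way the question already does, e.g. df['Text'].apply(tag_foreign). Unlike the loop in the question, each row is scanned only once, however many foreign words it contains.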

regex

by MR

Published January 02, 2022
