Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Encounter an issue while trying to remove unicode emojis from strings

I am having a problem removing unicode emojis from my string. Here, I am providing some examples that I’ve seen in my data

['\\\\ud83d\\\\ude0e', '\\\\ud83e\\\\udd20', '\\\\ud83e\\\\udd23', '\\\\ud83d\\\\udc4d', '\\\\ud83d\\\\ude43', '\\\\ud83d\\\\ude31', '\\\\ud83d\\\\ude14', '\\\\ud83d\\\\udcaa', '\\\\ud83d\\\\ude0e', '\\\\ud83d\\\\ude09', '\\\\ud83d\\\\ude09', '\\\\ud83d\\\\ude18','\\\\ud83d\\\\ude01' , '\\\\ud83d\\\\ude44', '\\\\ud83d\\\\ude17']

I would like to remind that these are just some examples, not all of them and they are actually inside some strings in my data.

Here is the function I tried to remove them

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

def remove_emojis(data):
    emoji_pattern = re.compile(
        u"(\\\\ud83d[\\\\ude00-\\\\ude4f])|"  # emoticons
        u"(\\\\ud83c[\\\\udf00-\\\\uffff])|"  # symbols & pictographs (1 of 2)
        u"(\\\\ud83d[\\\\u0000-\\\\uddff])|"  # symbols & pictographs (2 of 2)
        u"(\\\\ud83d[\\\\ude80-\\\\udeff])|"  # transport & map symbols
        u"(\\\\ud83c[\\\\udde0-\\\\uddff])"  # flags (iOS)
        "+", flags=re.UNICODE)
    return re.sub(emoji_pattern, '', data)

If I use "Naja, gegen dich ist sie ein Waisenknabe \\\\ud83d\\\\ude02\\\\ud83d\\\\ude02\\\\ud83d\\\\ude02" as an input, my output is "Naja, gegen dich ist sie ein Waisenknabe \\\\ude02\\\\ude02\\\\ude02". However my desired output should be "Naja, gegen dich ist sie ein Waisenknabe ".

What is the mistake that I am doing and how can I fix that to get my desired results.

>Solution :

Since your text does not contain emoji chars themselves, but their representations in hexadecimal notation (\uXXXX), you can use

data = re.sub(r'\s*(?:\\+u[a-fA-F0-9]{4})+', '', data)

Details:

  • \s* – zero or more whitespaces
  • (?:\\+u[a-fA-F0-9]{4})+ – one or more sequences of
    • \\+ – one or more backslashes
    • u – a u char
    • [a-fA-F0-9]{4} – four hex chars.

See the regex demo.

Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading