Getting list of unique special characters

I want to obtain a list of all the unique characters in a text. The text includes characters built from a base letter plus a combining mark, such as s̈ and b̃, so when I split the text these characters come apart. For example, s̈ is split into two characters, s and ¨.

This is an example of the text I want to process.

sentence = "nejon ámas̈hó T̃iqu c̈ab̃op"
print(sentence)
print(list(set(sentence)))

I want to obtain a list with the unique characters. For this sentence, this list should be

expected_list = ['a', 'á', 'b̃', 'c̈', 'e', 'h', 'i', 'j', 'm', 'n', 'o', 'ó', 'p', 'q', 's̈', 'T̃', 'u']

but it is

actual_list = ['j', 'p', 'c', 'n', 'a', ' ', 'i', 'á', 'o', 'T', 'u', '̃', 'h', '̈', 'q', 's', 'e', 'm', 'b', 'ó']

I was reading that I can normalize the special characters as follows

import unicodedata
# Only for the character s̈
print(ascii(unicodedata.normalize('NFC', '\u0073\u00a8')))  #prints 's\xa8'

But I don’t know how to continue. Any help would be greatly appreciated.

Solution:

Composed characters like these are tricky in Python because each one is stored as several code points, and iterating over a str yields code points rather than what a reader sees as one character. Try the grapheme library, which splits text into grapheme clusters (textual units that are displayed as a single character).

Install the grapheme library using pip:

pip install grapheme

Alternatively, and this is the way I prefer, run pip through the interpreter so the package is installed for the Python you are actually using:

python3 -m pip install grapheme

Then, you can use it to extract the unique grapheme clusters from the sentence:

import grapheme

sentence = "nejon ámas̈hó T̃iqu c̈ab̃op"
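# graphemes() yields every cluster in order; a set keeps one copy of each.
unique_characters = set(grapheme.graphemes(sentence))
unique_characters.discard(" ")  # the expected list does not include the space

print(sorted(unique_characters))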
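
If installing a third-party package is not an option, a rough standard-library approximation is to attach each combining mark to the character before it, using unicodedata.combining. This is only a sketch under that assumption, not a full grapheme-cluster implementation: the helper name split_clusters is made up for illustration, and it does not handle cases such as emoji joined with zero-width joiners. It also applies NFC first so that letters like á are treated the same whether they arrive precomposed or decomposed.

```python
import unicodedata


def split_clusters(text):
    """Group each base character with the combining marks that follow it.

    Hypothetical helper for illustration; a simplification of real
    grapheme-cluster segmentation.
    """
    # Compose where a precomposed form exists (e.g. a + U+0301 -> á).
    text = unicodedata.normalize("NFC", text)
    clusters = []
    for ch in text:
        # combining() returns a non-zero class for combining marks.
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


sentence = "nejon ámas̈hó T̃iqu c̈ab̃op"
unique_characters = {c for c in split_clusters(sentence) if not c.isspace()}
print(sorted(unique_characters))
```

Note that sorted() orders by code point, so uppercase T̃ comes before the lowercase letters and á and ó come last; use a locale-aware sort key if you need alphabetical order.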