Getting list of unique special characters

I want to obtain a list of all the unique characters in a text. The text includes characters made of a base letter plus a combining mark, such as s̈ and b̃. When I split the text, these characters are broken apart. For example, s̈ is separated into two characters, s and ¨.

This is an example of the text I want to process.

sentence = "nejon ámas̈hó T̃iqu c̈ab̃op"

I want to obtain a list with the unique characters. For this sentence, this list should be

expected_list = ['a', 'á', 'b̃', 'c̈', 'e', 'h', 'i', 'j', 'm', 'n', 'o', 'ó', 'p', 'q', 's̈', 'T̃', 'u']

but it is

actual_list = ['j', 'p', 'c', 'n', 'a', ' ', 'i', 'á', 'o', 'T', 'u', '̃', 'h', '̈', 'q', 's', 'e', 'm', 'b', 'ó']

I was reading that I can normalize the special characters as follows

import unicodedata
# Only for the character s̈
print(ascii(unicodedata.normalize('NFC', '\u0073\u00a8')))  #prints 's\xa8'

But I don’t know how to continue. Any help would be greatly appreciated.

>Solution :

Handling composed characters in Python is tricky because a string is a sequence of code points, and a character such as s̈ is stored as two of them: a base letter followed by a combining mark. Try the grapheme library, which works on grapheme clusters (the textual units that are displayed as a single character).
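To see why normalization alone does not solve this, here is a small check using only the standard library. NFC can merge a base letter and a combining mark only when Unicode defines a single precomposed code point for the pair. That exists for á but not for s̈. Note also that the combining diaeresis is U+0308, whereas U+00A8 (used in the question) is the standalone spacing diaeresis. The variable names are just for illustration.

```python
import unicodedata

# 'a' + COMBINING ACUTE ACCENT has a precomposed form (U+00E1), so NFC merges it.
composed_a = unicodedata.normalize("NFC", "a\u0301")

# 's' + COMBINING DIAERESIS has no precomposed form, so NFC leaves two code points.
composed_s = unicodedata.normalize("NFC", "s\u0308")

print(len(composed_a))  # 1
print(len(composed_s))  # 2
```

So after normalization, s̈ is still two code points, and splitting by code point will still separate it.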

Install the grapheme library using pip:

pip install grapheme

Or, as I prefer, run pip through the interpreter so the package is installed for the Python binary you are actually using:

python3 -m pip install grapheme

Then, you can use it to extract the unique grapheme clusters from the sentence:

import grapheme

sentence = "nejon ámas̈hó T̃iqu c̈ab̃op"
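# graphemes() yields every cluster in order, so deduplicate with a set and drop the space
unique_characters = sorted(set(grapheme.graphemes(sentence)) - {" "})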
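If installing a third-party package is not an option, a rough standard-library alternative is to attach each combining mark to the character before it. This is only a sketch under a simplifying assumption: it treats "base character plus following combining marks" as one unit, which covers the characters in this sentence but is not the full Unicode grapheme cluster algorithm (it does not handle emoji sequences or Hangul jamo, for instance). The function name `unique_clusters` is made up for this example.

```python
import unicodedata

def unique_clusters(text):
    """Return the unique base+combining-mark units in text, ignoring whitespace."""
    clusters = []
    # NFC first, so that pairs with a precomposed form (like á) become one code point.
    for ch in unicodedata.normalize("NFC", text):
        # combining() returns a non-zero class for combining marks such as U+0303, U+0308.
        if unicodedata.combining(ch) and clusters:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    unique = {c for c in clusters if not c.isspace()}
    # Sort by the decomposed, case-folded form so accented letters sit next to their base letter.
    return sorted(unique, key=lambda c: unicodedata.normalize("NFD", c).casefold())

sentence = "nejon ámas̈hó T̃iqu c̈ab̃op"
print(unique_clusters(sentence))
```

The sort key is a design choice rather than a requirement. Sorting by plain code point would put T̃ first and á, ó last.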

