
Maximum number of codepoints in a grapheme cluster

I am using the ICU library from C++ and want to split a UTF-8 string into approximately equal chunks, with every chunk boundary falling on a grapheme cluster boundary. For memory and speed reasons, I do not want to convert the entire string to UTF-16. Instead, I want to convert only a small number of code points near each estimated chunk boundary to UTF-16, and then use ICU's BreakIterator to work out the exact boundaries.

Is there a hard upper limit on the number of code points that can make up a grapheme cluster? If so, what is it? I need to know this in order to determine the minimum number of code points that I have to convert from UTF-8 to UTF-16.

Solution:

Is there a hard upper limit of the number of codepoints that can make up a grapheme cluster?

No. There is no hard upper limit on how many code points a grapheme cluster (i.e. a user-perceived character) can consist of.

For example, you could keep appending ZERO WIDTH JOINER (U+200D) followed by another joined character (an emoji, for instance), and the cluster would keep growing.
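Combining marks show the same thing with even less machinery: under the Unicode grapheme cluster rules, a base letter followed by any number of combining marks is still a single cluster. The following is only a minimal sketch in plain C++ (no ICU); the helper names `make_long_cluster` and `count_code_points` are made up for illustration, and the counter assumes its input is valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: builds U+0061 'a' followed by n copies of
// U+0301 COMBINING ACUTE ACCENT, encoded as UTF-8.
// Grapheme segmentation never breaks before a combining mark, so the
// whole string is one grapheme cluster regardless of n.
std::string make_long_cluster(std::size_t n) {
    std::string s = "a";
    for (std::size_t i = 0; i < n; ++i) {
        s += "\xCC\x81";  // UTF-8 bytes of U+0301
    }
    return s;
}

// Hypothetical helper: counts code points in a string that is assumed to be
// valid UTF-8, by counting every byte that is not a continuation byte
// (continuation bytes have the bit pattern 10xxxxxx).
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

Calling `make_long_cluster(n)` yields a single cluster of `n + 1` code points for any `n`, so no fixed-size window around an estimated chunk boundary is guaranteed to contain a cluster boundary. One possible way to cope is to start with a small window and widen it until BreakIterator reports a boundary inside it.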
