Unicode Grapheme Clusters
This cluster focuses on the challenges of counting and handling Unicode strings correctly, debating whether grapheme clusters, code points, or bytes give an accurate character length in programming languages like Swift, JavaScript, and Python.
Sample Comments
Potentially, depending on how the character counting is done (if it counts grapheme clusters instead of bytes I guess?)
Unicode grapheme clusters (letters/ideograms plus accent marks) are inherently variable length, but most devs haven’t read about how that works. We shouldn’t be working with single codepoints or bytes for the same reason we don’t work with US-ASCII one bit at a time, because most slices of a bit string will be nonsense.
Yup. But to be clear, in Unicode a string will index code points, not characters. E.g. a single emoji can be made of multiple code points, as well as certain characters in certain languages. The Unicode name for a character like this is a "grapheme", and grapheme splitting is so complicated it generally belongs in a dedicated Unicode library, not a general-purpose string object.
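A quick sketch of this point in Python, whose strings index code points: an emoji assembled from several code points reports a length greater than one, and indexing slices it apart.

```python
# Python 3 strings are sequences of code points, not grapheme clusters.
thumbs_up = "\U0001F44D\U0001F3FD"  # THUMBS UP SIGN + a skin-tone modifier

print(len(thumbs_up))   # 2 code points, though it renders as one symbol
print(thumbs_up[0])     # indexing yields the base emoji alone
```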
I would be surprised if most languages implement it that way. Counting code points would be a start and would solve this particular problem. But that’s not glyph count. You really need to count grapheme clusters. For example, two strings that are visually identical and contain “é” might have a different number of code points, since one might use the combining acute accent code point while the other uses the precomposed “e with acute accent”. Even a modern language like Rust doesn’t
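The “é” case above can be demonstrated with Python's standard `unicodedata` module: the precomposed and combining forms compare unequal and have different lengths until they are normalized.

```python
import unicodedata

precomposed = "\u00E9"   # "é" as one code point, LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# Visually identical, yet different code point sequences:
print(len(precomposed), len(decomposed))   # 1 2
print(precomposed == decomposed)           # False

# NFC normalization composes the pair back into the single code point:
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```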
Counting Unicode characters is hard.
I could see issues if using it for splitting strings, but I guess it also depends what Swift assumes to be what normal people think a character is. Most normal people assume a Unicode grapheme to be a character. That would be the set of symbols you'd consider a single letter if you were to visually count the number of symbols in the string. In Unicode though, some code points are non-visible, yet are part of the string. So for example, the character at i+1 for some match against a letter mi
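The invisible code points mentioned above are easy to exhibit; as a sketch, the ZWJ-joined family emoji contains characters that never render on their own:

```python
# U+200D (ZERO WIDTH JOINER) is invisible by itself, yet is part of the string.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"  # man ZWJ woman ZWJ girl

print(len(family))          # 5 code points for what may render as one glyph
print("\u200D" in family)   # True: the string contains symbols you cannot see
```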
Be careful not to confuse UTF-16 and UCS-2. UTF-16 is a variable-width encoding, so using it doesn't actually make counting anything easier. UCS-2 is fixed-width, and evil. With UCS-2 you can easily count the number of code points, as long as they fall in the BMP (!), but this is not the same as counting graphemes.
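The BMP caveat can be illustrated from Python by encoding to UTF-16: a code point outside the Basic Multilingual Plane takes two 16-bit code units (a surrogate pair), so counting code units is not counting code points.

```python
# U+1F600 lies outside the BMP, so UTF-16 represents this single
# code point as a surrogate pair (two 16-bit code units).
face = "\U0001F600"

code_points = len(face)
utf16_units = len(face.encode("utf-16-le")) // 2  # each code unit is 2 bytes

print(code_points, utf16_units)  # 1 2
```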
> Combining characters are their own unicode codepoint, so they count towards length.

This is incredibly arbitrary - it depends entirely on what "length" means for a particular usecase. From the user's perspective there might only be a single character on the screen. Any nontrivial string operation must be based around grapheme clusters, otherwise it is fundamentally broken. Codepoints are a useful encoding agnostic way to handle basic (split & concatenate) opera
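The "it depends what length means" point is concrete in Python: one short string has three defensible lengths, and only two of them are available from the standard library.

```python
s = "e\u0301"  # "é" built from a base letter plus a combining accent

print(len(s.encode("utf-8")))  # 3 bytes in UTF-8
print(len(s))                  # 2 code points
# Grapheme clusters: 1 -- but the standard library cannot tell you that;
# counting them requires Unicode segmentation (e.g. a third-party library).
```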
I recommend googling "grapheme cluster"
There's one part of this document that I would push extremely hard against, and that's the notion that "extended grapheme clusters" are the one true, right way to think of characters in Unicode, and therefore any language that views the length in any other way is doing it wrong. The truth of the matter is that there are several different definitions of "character", depending on what you want to use it for. An extended grapheme cluster is largely defined on "t