Unicode Grapheme Clusters

This cluster covers the challenges of counting and handling Unicode strings correctly, debating whether grapheme clusters, code points, or bytes give the right character length in programming languages such as Swift, JavaScript, and Python.

📉 Falling 0.4x Programming Languages
Comments: 1,878
Years Active: 19
Top Authors: 5
Topic ID: #6674

Activity Over Time

Year  Comments
2008  1
2009  3
2010  11
2011  22
2012  47
2013  98
2014  74
2015  116
2016  101
2017  158
2018  67
2019  152
2020  90
2021  223
2022  164
2023  290
2024  82
2025  173
2026  6

Keywords

e.g PHP US ICU4C UI UAX UCS BreakIterator ASCII E.g unicode characters clusters character utf string code points length bytes

Sample Comments

lol768 Mar 14, 2020 View on HN

Potentially, depending on how the character counting is done (if it counts grapheme clusters instead of bytes I guess?)

erik_seaberg Apr 11, 2021

Unicode grapheme clusters (letters/ideograms plus accent marks) are inherently variable length, but most devs haven’t read about how that works. We shouldn’t be working with single codepoints or bytes for the same reason we don’t work with US-ASCII one bit at a time, because most slices of a bit string will be nonsense.
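The point about arbitrary slices being nonsense can be illustrated with a minimal Python sketch. The word "naïve" is an illustrative assumption, not something from the comment, and the example only uses the standard `str` and `bytes` types.

```python
text = "na\u00EFve"                 # "naïve"; ï is U+00EF, two bytes in UTF-8
data = text.encode("utf-8")

print(len(text), len(data))         # prints: 5 6

# Cutting the byte string after 3 bytes splits ï in the middle.
broken = data[:3]
try:
    broken.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
print(decoded_ok)                   # prints: False

# Slicing by code point always yields a valid str, but it can still
# separate a base letter from a following combining mark.
print(text[:3])                     # prints: naï
```

Slicing by code point avoids the decoding failure, yet it gives no guarantee about what a reader would see as whole characters.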

crazygringo Aug 22, 2025

Yup. But to be clear, in Unicode a string will index code points, not characters. E.g. a single emoji can be made of multiple code points, as well as certain characters in certain languages. The Unicode name for a character like this is a "grapheme", and grapheme splitting is so complicated it generally belongs in a dedicated Unicode library, not a general-purpose string object.
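As a rough illustration of one emoji spanning several code points, here is a minimal Python sketch. The family emoji is an illustrative choice, and Python is used only because its built-in `len()` counts code points; nothing here performs real grapheme segmentation.

```python
# One family emoji as most fonts render it: MAN + ZWJ + WOMAN + ZWJ + GIRL.
# The specific emoji is an illustrative assumption, not taken from the comment.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"

code_points = len(family)                           # str length is in code points
utf8_bytes = len(family.encode("utf-8"))            # bytes in UTF-8
utf16_units = len(family.encode("utf-16-le")) // 2  # 16-bit code units in UTF-16

print(code_points, utf8_bytes, utf16_units)  # prints: 5 18 8

# Indexing returns a single code point, here the invisible joiner,
# which is not something a user would call a character.
print(family[1] == "\u200D")  # prints: True
```

Three different "lengths" come out of the same string, and none of them is the single symbol a user sees.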

avgcorrection Nov 22, 2021

I would be surprised if most languages implement it that way. Counting code points would be a start and would solve this particular problem. But that’s not glyph count. You really need to count grapheme clusters. For example, two strings that are visually identical and contain “é” might have a different number of code points, since one might use the combining acute accent codepoint while the other might use the precomposed “e with acute accent”. Even a modern language like Rust doesn’t
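The “é” example can be sketched in Python with the standard `unicodedata` module. The helper `approx_user_chars` is hypothetical and only skips combining marks; it is an assumption for illustration, not the UAX #29 grapheme cluster algorithm.

```python
import unicodedata

precomposed = "\u00E9"   # é as one code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # é as two code points: "e" + COMBINING ACUTE ACCENT

print(len(precomposed), len(decomposed))   # prints: 1 2
print(precomposed == decomposed)           # prints: False

# Normalizing both to NFC makes the comparison and the count agree.
nfc = unicodedata.normalize("NFC", decomposed)
print(nfc == precomposed, len(nfc))        # prints: True 1


def approx_user_chars(text: str) -> int:
    """Hypothetical helper: count code points that are not combining marks.

    This is a crude approximation, NOT the UAX #29 grapheme cluster
    algorithm; it miscounts ZWJ emoji sequences, flags, and more.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


print(approx_user_chars(precomposed), approx_user_chars(decomposed))  # prints: 1 1
```

Normalization helps with comparison, but a correct count of user-perceived characters generally still needs a library that implements grapheme segmentation.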

withinboredom Oct 16, 2022

Counting Unicode characters is hard.

didibus Sep 26, 2020

I could see an issue if using it for splitting strings, but I guess it also depends on what Swift assumes to be what normal people think a character is. Most normal people assume Unicode graphemes to be characters. That would be the set of symbols you'd consider a single letter if you were to visually count the number of symbols in the string. In Unicode, though, some code points are non-visible, yet are part of the string. So for example, the character at i+1 for some match against a letter mi

pjscott Nov 27, 2012

Be careful not to confuse UTF-16 and UCS-2. UTF-16 is a variable-width encoding, so using it doesn't actually make counting anything easier. UCS-2 is fixed-width, and evil. With UCS-2 you can easily count the number of code points, as long as they fall in the BMP (!), but this is not the same as counting graphemes.
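The difference between counting UTF-16 code units and counting code points can be sketched in Python. The helper names `utf16_code_units` and `fits_in_bmp` are hypothetical, and the sample strings are illustrative assumptions.

```python
def utf16_code_units(text: str) -> int:
    """Number of 16-bit code units needed to encode text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


def fits_in_bmp(text: str) -> bool:
    """True if every code point is in the BMP (U+0000..U+FFFF)."""
    return all(ord(ch) <= 0xFFFF for ch in text)


bmp_only = "caf\u00E9"        # all code points inside the BMP
with_astral = "A\U0001D11E"   # "A" + MUSICAL SYMBOL G CLEF (outside the BMP)

# Inside the BMP, code units and code points agree.
print(len(bmp_only), utf16_code_units(bmp_only), fits_in_bmp(bmp_only))
# prints: 4 4 True

# Outside the BMP, UTF-16 needs a surrogate pair, so the counts diverge.
print(len(with_astral), utf16_code_units(with_astral), fits_in_bmp(with_astral))
# prints: 2 3 False
```

A counter that treats every 16-bit unit as one character, as UCS-2 does, would report 3 for the second string, and neither number is a grapheme count.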

d110af5ccf Mar 26, 2021

> Combining characters are their own unicode codepoint, so they count towards length.

This is incredibly arbitrary - it depends entirely on what "length" means for a particular usecase. From the user's perspective there might only be a single character on the screen. Any nontrivial string operation must be based around grapheme clusters, otherwise it is fundamentally broken. Codepoints are a useful encoding agnostic way to handle basic (split & concatenate) opera

charcircuit Jan 19, 2022

I recommend googling "grapheme cluster"

jcranmer Oct 2, 2023

There's one part of this document that I would push extremely hard against, and that's the notion that "extended grapheme clusters" are the one true, right way to think of characters in Unicode, and therefore any language that views the length in any other way is doing it wrong. The truth of the matter is that there are several different definitions of "character", depending on what you want to use it for. An extended grapheme cluster is largely defined on "t