Unicode Grapheme Clusters
This cluster focuses on the challenges of counting and handling Unicode strings correctly, debating whether grapheme clusters, code points, or bytes give an accurate character length in programming languages like Swift, JavaScript, and Python.
Sample Comments
Potentially, depending on how the character counting is done (if it counts grapheme clusters instead of bytes I guess?)
Unicode grapheme clusters (letters/ideograms plus accent marks) are inherently variable length, but most devs haven’t read about how that works. We shouldn’t be working with single codepoints or bytes for the same reason we don’t work with US-ASCII one bit at a time, because most slices of a bit string will be nonsense.
Yup. But to be clear, in Unicode a string will index code points, not characters. E.g. a single emoji can be made of multiple code points, as well as certain characters in certain languages. The Unicode name for a character like this is a "grapheme", and grapheme splitting is so complicated it generally belongs in a dedicated Unicode library, not a general-purpose string object.
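A quick sketch of this point in Python, whose strings index code points: an emoji assembled from several code points reports a length greater than one, and indexing slices it apart.

```python
# Python 3 strings are sequences of code points, not grapheme clusters.
thumbs_up = "\U0001F44D\U0001F3FD"  # THUMBS UP SIGN + a skin-tone modifier

print(len(thumbs_up))   # 2 code points, though it renders as one symbol
print(thumbs_up[0])     # indexing yields the base emoji alone
```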
I would be surprised if most languages implement it that way. Counting code points would be a start and would solve this particular problem. But that’s not glyph count. You really need to count grapheme clusters. For example, two strings that are visually identical and contain “é” might have a different number of code points, since one might use the combining acute accent code point while the other uses the precomposed “e with acute accent”. Even a modern language like Rust doesn’t
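The “é” case above can be demonstrated with Python's standard `unicodedata` module: the precomposed and combining forms compare unequal and have different lengths until they are normalized.

```python
import unicodedata

precomposed = "\u00E9"   # "é" as one code point, LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# Visually identical, yet different code point sequences:
print(len(precomposed), len(decomposed))   # 1 2
print(precomposed == decomposed)           # False

# NFC normalization composes the pair back into the single code point:
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```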
Counting Unicode characters is hard.
I could see issues if using it for splitting strings, but I guess it also depends what Swift assumes to be what normal people think a character is. Most normal people assume a Unicode grapheme to be a character. That would be the set of symbols you'd consider a single letter if you were to visually count the number of symbols in the string. In Unicode though, some code points are non-visible, yet are part of the string. So for example, the character at i+1 for some match against a letter mi
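The invisible code points mentioned above are easy to exhibit; as a sketch, the ZWJ-joined family emoji contains characters that never render on their own:

```python
# U+200D (ZERO WIDTH JOINER) is invisible by itself, yet is part of the string.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"  # man ZWJ woman ZWJ girl

print(len(family))          # 5 code points for what may render as one glyph
print("\u200D" in family)   # True: the string contains symbols you cannot see
```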
Be careful not to confuse UTF-16 and UCS-2. UTF-16 is a variable-width encoding, so using it doesn't actually make counting anything easier. UCS-2 is fixed-width, and evil. With UCS-2 you can easily count the number of code points, as long as they fall in the BMP (!), but this is not the same as counting graphemes.
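The BMP caveat can be illustrated from Python by encoding to UTF-16: a code point outside the Basic Multilingual Plane takes two 16-bit code units (a surrogate pair), so counting code units is not counting code points.

```python
# U+1F600 lies outside the BMP, so UTF-16 represents this single
# code point as a surrogate pair (two 16-bit code units).
face = "\U0001F600"

code_points = len(face)
utf16_units = len(face.encode("utf-16-le")) // 2  # each code unit is 2 bytes

print(code_points, utf16_units)  # 1 2
```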
> Combining characters are their own unicode codepoint, so they count towards length.

This is incredibly arbitrary - it depends entirely on what "length" means for a particular usecase. From the user's perspective there might only be a single character on the screen. Any nontrivial string operation must be based around grapheme clusters, otherwise it is fundamentally broken. Codepoints are a useful encoding agnostic way to handle basic (split & concatenate) opera
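The "it depends what length means" point is concrete in Python: one short string has three defensible lengths, and only two of them are available from the standard library.

```python
s = "e\u0301"  # "é" built from a base letter plus a combining accent

print(len(s.encode("utf-8")))  # 3 bytes in UTF-8
print(len(s))                  # 2 code points
# Grapheme clusters: 1 -- but the standard library cannot tell you that;
# counting them requires Unicode segmentation (e.g. a third-party library).
```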
I recommend googling "grapheme cluster"
There's one part of this document that I would push extremely hard against, and that's the notion that "extended grapheme clusters" are the one true, right way to think of characters in Unicode, and therefore any language that views the length in any other way is doing it wrong. The truth of the matter is that there are several different definitions of "character", depending on what you want to use it for. An extended grapheme cluster is largely defined on "t