UTF-8 vs UTF-16

This cluster covers debates about the Unicode encodings UTF-8, UTF-16, and UTF-32: their variable widths, byte efficiency for CJK text, surrogate pairs, and the difference between code points and characters.

📉 Falling 0.5x Programming Languages
Comments: 2,691
Years Active: 19
Top Authors: 5
Topic ID: #6457

Activity Over Time

2008: 7
2009: 23
2010: 43
2011: 86
2012: 132
2013: 122
2014: 92
2015: 109
2016: 116
2017: 160
2018: 184
2019: 195
2020: 163
2021: 186
2022: 349
2023: 259
2024: 192
2025: 251
2026: 26

Keywords

IEC CJK JS DFFF UTF8 UCS KR ASCII UTF32 en.m utf unicode bytes 16 encoding byte character utf8 characters 32

Sample Comments

ChrisSD, Apr 15, 2021

UTF-16 is a variable length encoding. As is UTF-32 for that matter. A single character can be made of multiple code points.
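
One way to read this claim: UTF-32 is fixed width per code point, but a user-perceived character can still span several code points. A minimal Python sketch, using a flag emoji as an illustrative example that does not come from the thread:

```python
# Sketch: one user-perceived character built from two code points.
# The flag sequence is an illustrative assumption, not taken from the comment.
flag = "\U0001F1EF\U0001F1F5"  # REGIONAL INDICATOR J + REGIONAL INDICATOR P

code_points = len(flag)                      # Python str length counts code points
utf32_bytes = len(flag.encode("utf-32-le"))  # 4 bytes per code point, no BOM
utf16_bytes = len(flag.encode("utf-16-le"))  # each code point needs a surrogate pair

print(code_points, utf32_bytes, utf16_bytes)  # 2 8 8
```

Even in UTF-32 the single visible flag occupies two 4-byte units, so indexing by code unit does not index by character.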

jrochkind1, Feb 24, 2022

Less bandwidth than the 1-4 bytes of a codepoint in UTF-8? How do you figure?

loeg, Jan 12, 2022

UTF-8 is variable width. The biggest valid codepoint is U+10FFFF, which has a 4-byte encoding in UTF-8. Other codepoints have 1-, 2-, or 3-byte encodings.
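
The 1- to 4-byte widths described here can be checked directly. A small Python sketch, where the helper name `utf8_len` and the sample code points are hypothetical choices made for illustration:

```python
def utf8_len(code_point: int) -> int:
    """Return how many bytes UTF-8 uses for a single code point."""
    # Note: surrogate code points (U+D800..U+DFFF) cannot be encoded
    # and would raise UnicodeEncodeError here.
    return len(chr(code_point).encode("utf-8"))

# Sample code points chosen for illustration, one per width class.
samples = {
    "U+0041": 0x41,       # Latin capital A
    "U+00E9": 0xE9,       # e with acute accent
    "U+20AC": 0x20AC,     # euro sign
    "U+10FFFF": 0x10FFFF, # largest valid code point
}

for label, cp in samples.items():
    print(label, utf8_len(cp))
```

The width boundaries fall at U+0080, U+0800, and U+10000.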

gaius, Mar 5, 2009

UTF is variable width; it'll use 8 bits/char if it can, and only use 16 bits if it has to.

jmdavis, Nov 27, 2013

UTF16 is variable length just like UTF8 is (so don't assume 2 bytes == 1 character).
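
The reason two bytes do not always equal one character is the surrogate pair mechanism. A minimal Python sketch of the standard split, with `to_surrogates` and `utf16_units` as hypothetical helper names:

```python
def to_surrogates(code_point: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000    # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)   # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits
    return high, low

def utf16_units(text: str) -> int:
    """Count 16-bit code units in the UTF-16 encoding of text (no BOM)."""
    return len(text.encode("utf-16-le")) // 2

high, low = to_surrogates(0x1F600)          # U+1F600 GRINNING FACE
print(hex(high), hex(low))                  # 0xd83d 0xde00
print(utf16_units("A"), utf16_units("\U0001F600"))  # 1 2
```

Code that assumes one 16-bit unit per character works for the Basic Multilingual Plane and silently breaks on anything above it.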

1ris, Jul 29, 2020

UTF-8 is not arbitrary precision. There is an artificial limit at 2^20. And Unicode does not map these to characters. Unicode maps these to so-called "code points". These "code points" are then mapped to characters ("grapheme clusters" in Unicode speak) in a very complicated way. A sequence of code points can be in different kinds of normal forms, for example.
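
The point about normal forms can be illustrated with the standard library alone. A short Python sketch, using an accented "e" as an assumed example (grapheme cluster segmentation itself is not in the standard library and is not shown):

```python
import unicodedata

# Two code point sequences that render as the same character.
composed = "\u00e9"     # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT, two code points

# Without normalization the strings compare as different.
print(composed == decomposed)  # False

# NFC composes where possible; NFD decomposes.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)
print(len(nfc), len(nfd))      # 1 2
```

Comparing or hashing text reliably therefore usually means normalizing both sides to the same form first.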

Avshalom, Apr 3, 2020

It's not taking into account UTF-8 though so maybe double or triple.

garethadams, Feb 6, 2012

Tell that to the Japanese, whose glyphs are always 3+ bytes in UTF-8!

adgjlsfhk1, Sep 2, 2025

UTF-16 is the worst of all worlds. Either use UTF32 where code-points are fixed, or if you care about space efficiency use UTF8

herge, Jan 20, 2012

Why doesn't everybody use UTF-8? How much overhead is incurred in encoding a non-ASCII language (say Chinese) in UTF-8 compared to UTF-16?
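
The overhead asked about here is easy to measure for a given text. A rough Python sketch, where `encoded_sizes` is a hypothetical helper and the two sample strings are assumptions chosen for illustration:

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Return the encoded byte length of text in each encoding (no BOM)."""
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

ascii_text = "hello"
chinese_text = "\u4f60\u597d\u4e16\u754c"  # four common CJK ideographs

print(encoded_sizes(ascii_text))    # {'utf-8': 5, 'utf-16-le': 10, 'utf-32-le': 20}
print(encoded_sizes(chinese_text))  # {'utf-8': 12, 'utf-16-le': 8, 'utf-32-le': 16}
```

For these samples, pure CJK text is 50% larger in UTF-8 than in UTF-16, while ASCII text is half the size. Real documents often mix in ASCII markup and punctuation, so the actual ratio depends on the content.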