UTF-8 vs UTF-16
This cluster focuses on debates about Unicode encodings such as UTF-8, UTF-16, and UTF-32, covering their variable widths, byte efficiency for languages such as CJK, surrogate pairs, and code points versus characters.
Sample Comments
UTF-16 is a variable length encoding. As is UTF-32 for that matter. A single character can be made of multiple code points.
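A quick Python check of that point; the flag emoji here is my own illustrative example, not from the original comment:

    # One user-perceived character made of two code points: the US flag is
    # U+1F1FA + U+1F1F8 (regional indicators). Even in UTF-32 it takes 8 bytes.
    flag = "\U0001F1FA\U0001F1F8"
    print(len(flag))                      # 2 code points
    print(len(flag.encode("utf-32-le")))  # 8 bytes
    print(len(flag.encode("utf-16-le")))  # 8 bytes (two surrogate pairs)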
Less bandwidth than the 1-4 bytes of a codepoint in UTF-8? How do you figure?
UTF-8 is variable width. The biggest valid codepoint is U+10FFFF, which has a 4-byte encoding in UTF-8. Other codepoints have 1-, 2-, or 3-byte encodings.
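Those byte lengths can be verified with a short Python sketch; the boundary code points below are my own choice for illustration:

    # UTF-8 byte lengths at the encoding boundaries.
    for cp in (0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF):
        print(f"U+{cp:04X}: {len(chr(cp).encode('utf-8'))} bytes")
    # U+007F: 1, U+0080: 2, U+07FF: 2, U+0800: 3,
    # U+FFFF: 3, U+10000: 4, U+10FFFF: 4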
UTF-8 is variable width; it'll use 8 bits/char if it can, and only use 16, 24, or 32 bits if it has to.
UTF16 is variable length just like UTF8 is (so don't assume 2 bytes == 1 character).
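Concretely, any code point above U+FFFF needs a surrogate pair in UTF-16. A minimal Python illustration, using an arbitrary emoji as the example:

    # U+1F600 is above the Basic Multilingual Plane, so UTF-16 needs two
    # 16-bit code units (a surrogate pair) for this single code point.
    ch = "\U0001F600"
    print(len(ch.encode("utf-16-le")))  # 4 bytes, not 2
    print(len(ch.encode("utf-8")))      # 4 bytes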
UTF-8 is not arbitrary precision. There is an artificial limit at U+10FFFF (a bit over 2^20). And Unicode does not map these to characters. Unicode maps these to so-called "code points". These "code points" are then mapped to characters ("grapheme clusters" in Unicode speak) in a very complicated way. A sequence of code points can be in different kinds of normal forms, for example.
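A small Python sketch of the normal-forms point, using the standard unicodedata module; the accented character is an arbitrary example:

    import unicodedata

    # The same user-perceived character can be spelled as different code point
    # sequences: precomposed U+00E9 (NFC) vs 'e' + combining acute U+0301 (NFD).
    nfc = unicodedata.normalize("NFC", "e\u0301")
    nfd = unicodedata.normalize("NFD", "\u00e9")
    print(len(nfc), len(nfd))                        # 1 2
    print(nfc == nfd)                                # False, though they render the same
    print(unicodedata.normalize("NFC", nfd) == nfc)  # True after normalization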
It's not taking UTF-8 into account, though, so maybe double or triple that.
Tell that to the Japanese, whose glyphs are always 3+ bytes in UTF-8!
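A quick check of that claim in Python; the kanji below is an arbitrary example:

    # Japanese characters in the Basic Multilingual Plane take 3 bytes in UTF-8
    # but only 2 bytes in UTF-16.
    ch = "日"  # U+65E5
    print(len(ch.encode("utf-8")))      # 3
    print(len(ch.encode("utf-16-le")))  # 2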
UTF-16 is the worst of all worlds. Either use UTF-32, where code points are fixed-width, or, if you care about space efficiency, use UTF-8.
Why doesn't everybody use UTF-8? How much overhead is incurred in encoding a non-ASCII language (say, Chinese) in UTF-8 compared to UTF-16?
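As a rough answer: pure CJK text in the Basic Multilingual Plane is about 50% larger in UTF-8 (3 bytes per character) than in UTF-16 (2 bytes), though real documents often contain enough ASCII markup to tip the balance back toward UTF-8. A minimal Python comparison, with a sample sentence chosen arbitrarily:

    # Byte counts for a short Chinese sentence in UTF-8 vs UTF-16.
    text = "你好，世界。这是一个测试句子。"   # 15 BMP code points
    u8 = text.encode("utf-8")
    u16 = text.encode("utf-16-le")
    print(len(u8), len(u16), len(u8) / len(u16))  # 45 30 1.5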