UTF-8 vs UTF-16
This cluster focuses on debates about Unicode encodings such as UTF-8, UTF-16, and UTF-32, covering their variable widths, byte efficiency for languages such as CJK, surrogate pairs, and code points versus characters.
Sample Comments
UTF-16 is a variable length encoding. As is UTF-32 for that matter. A single character can be made of multiple code points.
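A quick Python check of that point; the flag emoji here is my own illustrative example, not from the original comment:

    # One user-perceived character made of two code points: the US flag is
    # U+1F1FA + U+1F1F8 (regional indicators). Even in UTF-32 it takes 8 bytes.
    flag = "\U0001F1FA\U0001F1F8"
    print(len(flag))                      # 2 code points
    print(len(flag.encode("utf-32-le")))  # 8 bytes
    print(len(flag.encode("utf-16-le")))  # 8 bytes (two surrogate pairs)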
Less bandwidth than the 1-4 bytes of a codepoint in UTF-8? How do you figure?
UTF-8 is variable width. The biggest valid codepoint is U+10FFFF, which has a 4-byte encoding in UTF-8. Other codepoints have 1-, 2-, or 3-byte encodings.
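Those byte lengths can be verified with a short Python sketch; the boundary code points below are my own choice for illustration:

    # UTF-8 byte lengths at the encoding boundaries.
    for cp in (0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF):
        print(f"U+{cp:04X}: {len(chr(cp).encode('utf-8'))} bytes")
    # U+007F: 1, U+0080: 2, U+07FF: 2, U+0800: 3,
    # U+FFFF: 3, U+10000: 4, U+10FFFF: 4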
UTF-8 is variable width; it'll use 8 bits/char if it can, and only use 16, 24, or 32 bits if it has to.
UTF16 is variable length just like UTF8 is (so don't assume 2 bytes == 1 character).
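Concretely, any code point above U+FFFF needs a surrogate pair in UTF-16. A minimal Python illustration, using an arbitrary emoji as the example:

    # U+1F600 is above the Basic Multilingual Plane, so UTF-16 needs two
    # 16-bit code units (a surrogate pair) for this single code point.
    ch = "\U0001F600"
    print(len(ch.encode("utf-16-le")))  # 4 bytes, not 2
    print(len(ch.encode("utf-8")))      # 4 bytes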
UTF-8 is not arbitrary precision. There is an artificial limit at U+10FFFF (a bit over 2^20). And Unicode does not map these to characters. Unicode maps these to so-called "code points". These "code points" are then mapped to characters ("grapheme clusters" in Unicode speak) in a very complicated way. A sequence of code points can be in different kinds of normal forms, for example.
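A small Python sketch of the normal-forms point, using the standard unicodedata module; the accented character is an arbitrary example:

    import unicodedata

    # The same user-perceived character can be spelled as different code point
    # sequences: precomposed U+00E9 (NFC) vs 'e' + combining acute U+0301 (NFD).
    nfc = unicodedata.normalize("NFC", "e\u0301")
    nfd = unicodedata.normalize("NFD", "\u00e9")
    print(len(nfc), len(nfd))                        # 1 2
    print(nfc == nfd)                                # False, though they render the same
    print(unicodedata.normalize("NFC", nfd) == nfc)  # True after normalization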
It's not taking UTF-8 into account, though, so maybe double or triple that.
Tell that to the Japanese, whose glyphs are always 3+ bytes in UTF-8!
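A quick check of that claim in Python; the kanji below is an arbitrary example:

    # Japanese characters in the Basic Multilingual Plane take 3 bytes in UTF-8
    # but only 2 bytes in UTF-16.
    ch = "日"  # U+65E5
    print(len(ch.encode("utf-8")))      # 3
    print(len(ch.encode("utf-16-le")))  # 2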
UTF-16 is the worst of all worlds. Either use UTF-32, where code points are fixed-width, or, if you care about space efficiency, use UTF-8.
Why doesn't everybody use UTF-8? How much overhead is incurred in encoding a non-ASCII language (say, Chinese) in UTF-8 compared to UTF-16?
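As a rough answer: pure CJK text in the Basic Multilingual Plane is about 50% larger in UTF-8 (3 bytes per character) than in UTF-16 (2 bytes), though real documents often contain enough ASCII markup to tip the balance back toward UTF-8. A minimal Python comparison, with a sample sentence chosen arbitrarily:

    # Byte counts for a short Chinese sentence in UTF-8 vs UTF-16.
    text = "你好，世界。这是一个测试句子。"   # 15 BMP code points
    u8 = text.encode("utf-8")
    u16 = text.encode("utf-16-le")
    print(len(u8), len(u16), len(u8) / len(u16))  # 45 30 1.5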