String Encoding Debates
This cluster collects debates about string representations in programming languages: UTF-8 vs. UTF-16 encodings, bytes vs. text, the cost of random access, and the string implementations in Rust, Python, Java, and other languages.
Sample Comments
I forget that people think strings are different from sequences of bytes.
ByteStrings are not the alternative to Strings! Text is.
Encoding strings internally as UTF-8 is a bad idea, since you can't do efficient constant time access (utf-8 isn't fixed width).
What are the downsides of Ruby's alternate approach of having strings be bytes that carry an encoding object around?
Honestly feels like a very, very different use case from utf8-ish strings, at a quick read?
UTF-8 is not well suited for a general purpose string implementation because it is a variable length encoding and therefore addressing a character becomes a linear time operation. UTF-16 would probably be a better choice in most cases.
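A note on the claim above: UTF-16 is also a variable-length encoding, because code points outside the Basic Multilingual Plane take two 16-bit code units (a surrogate pair). The following is a minimal Python sketch that counts the storage width of a few sample characters in both encodings. The helper names `utf8_len` and `utf16_units` are made up for this illustration.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes the character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


def utf16_units(ch: str) -> int:
    """Number of 16-bit code units the character occupies in UTF-16."""
    # "utf-16-le" avoids the byte-order mark that plain "utf-16" prepends.
    return len(ch.encode("utf-16-le")) // 2


# Sample characters: ASCII, Latin-1 range, BMP, and outside the BMP.
samples = ["a", "\u00e9", "\u20ac", "\U0001F600"]
widths = [(ch, utf8_len(ch), utf16_units(ch)) for ch in samples]

for ch, n_bytes, n_units in widths:
    print(f"U+{ord(ch):04X}: {n_bytes} UTF-8 byte(s), {n_units} UTF-16 unit(s)")
```

Only UTF-32 stores every code point at a fixed width, so indexing by code point in UTF-16 needs a scan as well whenever the text may contain characters outside the BMP.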
Don't most other languages use utf16 for strings?
All Rust Strings and &strs are UTF-8 encoded; there are also other string types.
Not to disagree, but you can barely perform random access on a utf8 string. You need to explode it out to utf16 or utf32 which isn't what most languages have built in. Rust and go largely work with utf8 while c and c++ love them byte arrays (not sure I've even seen std::wstring in the wild)
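To make the random-access point concrete, here is a rough Python sketch of what "find the n-th code point" involves on raw UTF-8 bytes: a linear scan that skips continuation bytes. The function name `byte_offset_of_code_point` is hypothetical, and the sketch assumes the input is already valid UTF-8.

```python
def byte_offset_of_code_point(data: bytes, n: int) -> int:
    """Return the byte offset where the n-th code point (0-based) starts.

    Assumes `data` is valid UTF-8. Runs in time linear in the offset found.
    """
    seen = 0
    for offset, byte in enumerate(data):
        # Continuation bytes look like 10xxxxxx; any other byte starts a code point.
        if (byte & 0xC0) != 0x80:
            if seen == n:
                return offset
            seen += 1
    raise IndexError("code point index out of range")


data = "h\u00e9llo \U0001F600!".encode("utf-8")
print(byte_offset_of_code_point(data, 2))  # first "l", after the 2-byte "é"
print(byte_offset_of_code_point(data, 7))  # "!", after the 4-byte emoji
```

Byte offsets themselves are still O(1) to index and slice, which is why UTF-8 string APIs tend to hand out byte positions rather than code point positions.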
Nah, that's just dumb. Rust's way of all strings being UTF-8 and providing the different lengths depending on your needs is far superior. If you want something other than UTF-8 you can use another data type, like a vector of bytes.
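The "different lengths" mentioned here are the byte count, the code point count, and the UTF-16 code unit count of the same text. In Rust these correspond to `s.len()`, `s.chars().count()`, and `s.encode_utf16().count()`. Below is a small Python sketch of the same three measurements, with a sample string chosen for illustration.

```python
# Sample text: "naïve" plus a space and one emoji outside the BMP.
s = "na\u00efve \U0001F600"

byte_len = len(s.encode("utf-8"))              # UTF-8 bytes
code_points = len(s)                           # Python str length counts code points
utf16_units = len(s.encode("utf-16-le")) // 2  # 16-bit code units, no BOM

print(byte_len, code_points, utf16_units)
```

None of the three is the number of user-perceived characters (grapheme clusters); counting those needs Unicode segmentation rules, which Python's standard library does not provide.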