Python 3 Unicode Handling
This cluster debates the changes to string and Unicode handling in Python 3 compared to Python 2, focusing on the separation of the str (Unicode) and bytes types, the backward compatibility problems that followed, and whether the change fixed or worsened text processing.
Sample Comments
Yes, you are missing that the str as bytes to str as Unicode transition was a tough problem. That's the major incompatibility. Once that decision (mistake?) was made, the break was irrevocable. Perhaps not in theory, but in the practice of how Python 2 folks have used strs.
Most builtins can't use bytes instead of string safely for all data, and the problem is that it propagates: in Python 2 you often have the case that library A uses library B that uses a builtin that treats a piece of text like a string of bytes, and so you can't use library A because it will give you broken results in certain conditions. We've spent time updating some third-party open source libraries to support Python 3, and that was well worth the time to avoid the waste
My main criticism of Python 3's changes to strings is that it has become much more specific about strings. In Python 2, if you have a series of bytes -or- a "string", the language has no opinion about the encoding. It just passes around the bytes. If that set of bytes enters and exits Python without being changed, its format is of no concern. Interactions do not force you to define an encoding. This is not correct, but it is often functional. Python 3, on the oth
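To illustrate the pass-through case this comment describes, here is a minimal Python 3 sketch. The sample byte string is made up for the example. `surrogateescape` is a standard error handler, shown here only as one possible way to carry undecodable bytes through a str and back, not as something the commenter proposed.

```python
# Hypothetical input: bytes of unknown encoding (0xff is never valid in UTF-8).
raw = b"caf\xe9 \xff"

# Passing bytes through untouched needs no encoding decision at all.
passed_through = raw

# Treating them as text does: a strict UTF-8 decode rejects this data.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# The surrogateescape error handler maps undecodable bytes to lone
# surrogates, so the original bytes can be recovered exactly on output.
as_text = raw.decode("utf-8", errors="surrogateescape")
round_tripped = as_text.encode("utf-8", errors="surrogateescape")
```

In this sketch the bytes survive both routes unchanged; the difference is that the second route makes the encoding assumption visible in the code.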
The "unicode" string changes in Python 3 are enough for me to avoid it, as they somehow managed to go from broken to brain dead. Mandatory UTF-8 strings would have been a reasonably nice solution I think.
What about Python3's unicode handling is botched?
Hard disagree, there's plenty to complain about with Python strings but drawing a formal distinction between str and bytes is one of the smartest things they did for the language. It made the transition from 2->3 a huge PITA but it's one of the things that forces you to write better code. You have to actually acknowledge when you're doing an encoding/decoding step and what encoding you expect. Python3 caught a programming error for you and you're mad about it. The
If you're referring to the python2->python3 change, this was a pretty important one. Having a string type that isn't required to have valid encoding is an awful idea.
In python2 you can easily mix encoded and unencoded strings. This doesn't actually work, but at worst it'll produce mangled strings. You used to see this sort of thing online quite a lot with webpages having all sorts of weird characters and question marks showing up. Python3 enforces the difference between encoded and unencoded strings. It forces you to deal correctly with unicode. If your code base was already handling unicode correctly it wasn't much hassle to m
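The "weird characters and question marks" mentioned above can be reproduced in Python 3 when the wrong codec is chosen. The following is a small sketch with a made-up sample word, not a reconstruction of any particular bug:

```python
original = "caf\u00e9"               # 'café' as a str of code points
encoded = original.encode("utf-8")   # the two-byte UTF-8 form of é is C3 A9

# Decoding with the wrong codec does not fail; it yields mangled text.
mangled = encoded.decode("latin-1")

# Forcing text into a codec that cannot represent it yields question marks.
lossy = original.encode("ascii", errors="replace")

# Decoding with the codec that was actually used restores the text.
restored = encoded.decode("utf-8")
```

Here `mangled` contains the two characters Ã© in place of é, and `lossy` contains a literal `?`, which are the two symptoms the comment describes.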
I thought Python3 sort-of did fix it, by forcing strings to be (abstractly) unicode and forcing you to explicitly convert them to bytes with whichever codec you want (e.g. utf-8) when you need to. What are you missing?
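The explicit conversion described in this comment looks roughly like the sketch below. The sample text is invented for illustration; `str.encode` and `bytes.decode` are the standard methods.

```python
text = "na\u00efve caf\u00e9"   # str: an abstract sequence of code points
data = text.encode("utf-8")     # bytes: produced by an explicit codec choice

# The two types have different lengths because the two accented
# characters each take two bytes in UTF-8.
n_code_points = len(text)
n_bytes = len(data)

# Mixing str and bytes raises instead of silently producing mangled output.
try:
    text + data
    mixed_error = None
except TypeError as exc:
    mixed_error = exc

# Going back to text is equally explicit.
decoded = data.decode("utf-8")
```

The `TypeError` on `text + data` is the enforcement step: the program has to state which codec applies before the two kinds of value can be combined.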
The problem I have with python2 vs python3 isn't that it broke backward compatibility, but that their "solution" for UNICODE strings is such a weird mess (treating string- and byte-streams as something completely separate instead of treating strings as UTF-8 encoded views on byte streams). The only string encoding that matters today is UTF-8, all others are relics from the early 90s and the sooner we get rid of those the better - e.g. Python caused a whole lot of pain for a solution th