Python 3 Unicode Handling

This cluster debates the changes to string and Unicode handling in Python 3 compared to Python 2, focusing on the separation of the str (Unicode text) and bytes types, the resulting backward compatibility issues, and whether the change fixed or worsened text processing.

Trend: 📉 Falling (0.2x)
Category: Programming Languages
Comments: 1,987
Years Active: 18
Top Authors: 5
Topic ID: #3150

Activity Over Time

Year  Comments
2008  9
2009  13
2010  17
2011  36
2012  49
2013  88
2014  148
2015  199
2016  206
2017  265
2018  188
2019  174
2020  210
2021  92
2022  77
2023  109
2024  63
2025  44

Keywords

e.g pocoo.org TIL UTF8 IMO XML ASCII UNICODE GCC RSS unicode python bytes strings string utf encoding python3 handling str

Sample Comments

msellout Jan 2, 2016

Yes, you are missing that the str as bytes to str as Unicode transition was a tough problem. That's the major incompatibility. Once that decision (mistake?) was made, the break was irrevocable. Perhaps not in theory, but in the practice of how Python 2 folks have used strs.

PeterisP Apr 20, 2018

Most builtins can't use bytes instead of string safely for all data, and the problem is that it propagates - in Python 2 you often have the case that library A uses library B that uses a builtin that treats a piece of text like a string of bytes, and so you can't use library A because it will give you broken results in certain conditions. We've spent time updating some third party open source libraries to support python 3, and that was well worth the time to avoid the waste
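The kind of breakage described here can be sketched in a few lines. This is only an illustration under an assumption: it uses Python 3's bytes type to stand in for Python 2's byte-oriented str, and the sample word is made up for the example.

```python
# Assumption: Python 3 bytes stands in for a Python 2 byte string here.
word = "héllo"
raw = word.encode("utf-8")  # the same text as UTF-8 bytes

# Length counts code points for str, but octets for bytes.
char_count = len(word)
byte_count = len(raw)

# Case conversion on bytes only touches ASCII letters.
text_upper = word.upper()
raw_upper = raw.upper()

# Slicing bytes can cut a multi-byte character in half.
text_prefix = word[:2]
raw_prefix = raw[:2]
try:
    raw_prefix.decode("utf-8")
    prefix_is_valid = True
except UnicodeDecodeError:
    prefix_is_valid = False

print(char_count, byte_count, text_upper, raw_upper, prefix_is_valid)
```

Each operation works on pure ASCII input and only misbehaves once non-ASCII text arrives, which is consistent with the "broken results in certain conditions" the comment mentions.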

aeturnum Oct 6, 2018

My main criticism of Python 3's changes to strings is that it has become much more specific about strings. In Python 2, if you have a series of bytes -or- a "string", the language has no opinion about the encoding. It just passes around the bytes. If that set of bytes enters and exits Python without being changed, its format is of no concern. Interactions do not force you to define an encoding. This is not correct, but it is often functional. Python 3, on the oth

kpil Nov 23, 2016

The "unicode" string changes in python 3 are enough for me to avoid it, as they somehow managed to go from broken to brain dead. Mandatory utf-8 strings would have been a reasonably nice solution I think.

Siecje Mar 6, 2018

What about Python3's unicode handling is botched?

Spivak Oct 19, 2023

Hard disagree, there's plenty to complain about with python strings but drawing a formal distinction between str and bytes is one of the smartest things they did for the language. It made the transition from 2->3 a huge PITA but it's one of the things that forces you to write better code. You have to actually acknowledge when you're doing an encoding/decoding step and what encoding you expect. Python3 caught a programming error for you and you're mad about it. The

kpcyrd Jun 6, 2019

If you're referring to the python2->python3 change, this was a pretty important one. Having a string type that isn't required to have valid encoding is an awful idea.
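As a rough sketch of what "required to have valid encoding" looks like in practice: in Python 3, turning bytes into str goes through a decode step that by default rejects byte sequences that are invalid for the named codec. The sample bytes are an assumption chosen for illustration.

```python
# Latin-1 bytes for "café"; the trailing 0xE9 is not valid UTF-8 on its own.
invalid = b"caf\xe9"

# Strict decoding (the default) refuses the malformed input.
try:
    invalid.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# Naming the codec the data was actually written in succeeds.
as_latin1 = invalid.decode("latin-1")

# An explicit error handler trades correctness for robustness.
lossy = invalid.decode("utf-8", errors="replace")

print(decoded_ok, as_latin1, lossy)
```

The error handler is opt-in, so lossy decoding has to be chosen deliberately instead of happening silently.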

Doxin Jan 6, 2023

In python2 you can easily mix encoded and unencoded strings. This doesn't actually work, but at worst it'll produce mangled strings. You used to see this sort of thing online quite a lot with webpages having all sorts of weird characters and question marks showing up. Python3 enforces the difference between encoded and unencoded strings. It forces you to deal correctly with unicode. If your code base was already handling unicode correctly it wasn't much hassle to m
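The enforcement described in this comment can be illustrated with a small sketch, assuming made-up sample values: in Python 3, concatenating str and bytes raises TypeError instead of producing mangled output.

```python
text = "price: "                  # decoded text (str)
data = "10 €".encode("utf-8")     # encoded bytes, e.g. as read from a socket

# Mixing the two types is rejected outright in Python 3.
try:
    combined = text + data
    mixed_ok = True
except TypeError:
    mixed_ok = False

# The fix is to decode at the boundary, naming the expected encoding.
combined = text + data.decode("utf-8")

print(mixed_ok, combined)
```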

beagle3 Sep 14, 2011

I thought Python3 sort-of did fix it, by forcing strings to be (abstractly) unicode and forcing you to explicitly convert them to bytes with whichever codec you want (e.g. utf-8) when you need to. What are you missing?
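The explicit conversion mentioned here might look like the following minimal sketch. The sample text and the choice of codecs are assumptions for illustration only.

```python
# Python 3: str holds Unicode code points; bytes holds raw octets.
text = "naïve café"

# Encoding is an explicit step with a named codec.
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin-1")

# The same text yields different byte sequences under different codecs.
print(len(text), len(utf8_bytes), len(latin1_bytes))  # prints: 10 12 10

# Decoding with the matching codec round-trips the text.
round_trip = utf8_bytes.decode("utf-8")
```

Because the str value is abstract, the byte length only becomes defined once a codec is chosen.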

flohofwoe Jul 4, 2025

The problem I have with python2 vs python3 wasn't breaking backward compatibility, but that their "solution" for UNICODE strings is such a weird mess (treating string- and byte-streams as something completely separate instead of treating strings as UTF-8 encoded views on byte streams). The only string encoding that matters today is UTF-8, all others are relics from the early 90s and the sooner we get rid of those the better - e.g. Python caused a whole lot of pain for a solution th