Python 3 Unicode Handling

This cluster debates the changes to string and Unicode handling in Python 3 compared to Python 2, focusing on the separation of the str (Unicode text) and bytes types, the backward compatibility problems that separation caused, and whether the change fixed or worsened text processing.
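As a rough illustration of the separation being debated, the sketch below shows Python 3 refusing to mix str and bytes implicitly. The helper name `try_concat` and the sample word are made up for this example; they do not come from any of the comments.

```python
# Minimal sketch of the Python 3 str/bytes split (illustrative names only).

text = "na\u00efve"            # str: a sequence of Unicode code points
data = text.encode("utf-8")    # bytes: the UTF-8 encoding of that text


def try_concat(a, b):
    """Hypothetical helper: try a + b and report a TypeError instead of raising."""
    try:
        return a + b
    except TypeError as exc:
        return f"TypeError: {exc}"


print(len(text))                # counts code points
print(len(data))                # counts bytes (the accented letter takes two)
print(try_concat(text, data))   # str + bytes is rejected in Python 3
print(try_concat(text, data.decode("utf-8")))  # works once both sides are str
```

In Python 2 the equivalent concatenation would attempt an implicit ASCII conversion, which is where many of the surprises described in the comments came from.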

📉 Falling 0.2x Programming Languages

Comments: 1,987

Topic ID: #3150

Activity Over Time (chart, 2008 to 2025). Counts labeled on the chart, in the order they appear: 148, 199, 206, 265, 188, 174, 210, 109.
Top Contributors

takeda (38) masklinn (33) the_mitsuhiko (33) int_19h (32) ubernostrum (32)

Keywords

e.g pocoo.org TIL UTF8 IMO XML ASCII UNICODE GCC RSS unicode python bytes strings string utf encoding python3 handling str

Sample Comments

msellout • Jan 2, 2016

Yes, you are missing that the str as bytes to str as Unicode transition was a tough problem. That's the major incompatibility. Once that decision (mistake?) was made, the break was irrevocable. Perhaps not in theory, but in the practice of how Python 2 folks have used strs.

PeterisP • Apr 20, 2018

Most builtins can't use bytes instead of string safely for all data, and the problem is that it propagates - in Python 2 you often have the case that library A uses library B that uses a builtin that treats a piece of text like a string of bytes, and so you can't use library A because it will give you broken results in certain conditions. We've spent time updating some third party open source libraries to support python 3, and that was well worth the time to avoid the waste

aeturnum • Oct 6, 2018

My main criticism of Python 3's changes to strings is that it has become much more specific about strings. In Python 2, if you have a series of bytes -or- a "string", the language has no opinion about the encoding. It just passes around the bytes. If that set of bytes enters and exits Python without being changed, its format is of no concern. Interactions do not force you to define an encoding. This is not correct, but it is often functional. Python 3, on the oth

kpil • Nov 23, 2016

The "unicode" string changes in python 3 are enough for me to avoid it, as they somehow managed to go from broken to brain dead. Mandatory utf-8 strings would have been a reasonably nice solution I think.

Siecje • Mar 6, 2018

What about Python3's unicode handling is botched?

Spivak • Oct 19, 2023

Hard disagree, there's plenty to complain about with python strings but drawing a formal distinction between str and bytes is one of the smartest things they did for the language. It made the transition from 2->3 a huge PITA but it's one of the things that forces you to write better code. You have to actually acknowledge when you're doing an encoding/decoding step and what encoding you expect. Python3 caught a programming error for you and you're mad about it. The

kpcyrd • Jun 6, 2019

If you're referring to the python2->python3 change, this was a pretty important one. Having a string type that isn't required to have valid encoding is an awful idea.
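One way to picture the point about valid encoding: in Python 3, turning bytes into str requires a decode step, and that step fails on byte sequences that are not valid in the chosen encoding. The sketch below is only an illustration; `decode_or_report` and the sample bytes are hypothetical and not taken from the comment.

```python
# Minimal sketch: decoding bytes validates them against the chosen encoding.

raw = b"caf\xe9"  # "café" encoded as Latin-1; this is not valid UTF-8


def decode_or_report(payload, encoding="utf-8"):
    """Hypothetical helper: decode bytes, or describe why decoding failed."""
    try:
        return payload.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"cannot decode as {encoding}: {exc.reason}"


print(decode_or_report(raw))                   # reports a decoding failure
print(decode_or_report(raw, "latin-1"))        # succeeds with the right codec
print(raw.decode("utf-8", errors="replace"))   # opt-in lossy decoding
```

The strict default means bad input is reported at the boundary where bytes become text, while `errors="replace"` remains available when lossy handling is acceptable.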

Doxin • Jan 6, 2023

In python2 you can easily mix encoded and unencoded strings. This doesn't actually work, but at worst it'll produce mangled strings. You used to see this sort of thing online quite a lot with webpages having all sorts of weird characters and question marks showing up. Python3 enforces the difference between encoded and unencoded strings. It forces you to deal correctly with unicode. If your code base was already handling unicode correctly it wasn't much hassle to m

beagle3 • Sep 14, 2011

I thought Python3 sort-of did fix it, by forcing strings to be (abstractly) unicode and forcing you to explicitly convert them to bytes with whichever codec you want (e.g. utf-8) when you need to. What are you missing?
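The explicit conversion described in this comment can be sketched as follows. The sample string and variable names are chosen for illustration only; the codecs shown are standard ones, but nothing here is from the original thread.

```python
# Minimal sketch: one abstract str, several concrete byte encodings.

text = "\u20ac5"                     # the euro sign followed by the digit 5

utf8 = text.encode("utf-8")          # three bytes for the euro sign, one for "5"
utf16 = text.encode("utf-16-le")     # two bytes per code point here

print(utf8)
print(utf16)

# Decoding with the matching codec recovers the same abstract text.
print(utf8.decode("utf-8") == text)
print(utf16.decode("utf-16-le") == text)
```

The str value itself carries no encoding; the codec is chosen only at the point where text is turned into bytes or back.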

flohofwoe • Jul 4, 2025

The problem I have with python2 vs python3 wasn't breaking backward compatibility, but that their "solution" for UNICODE strings is such a weird mess (treating string- and byte-streams as something completely separate instead of treating strings as UTF-8 encoded views on byte streams). The only string encoding that matters today is UTF-8, all others are relics from the early 90s and the sooner we get rid of those the better - e.g. Python caused a whole lot of pain for a solution th