LLM Tokenization
The cluster focuses on discussions about tokenization in large language models (LLMs), including why LLMs use subword tokens instead of characters, the implications for performance and training, and comparisons between different tokenization approaches.
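For concreteness, here is a minimal sketch of subword tokenization. It assumes the open-source tiktoken library and its cl100k_base encoding as a stand-in for whatever tokenizer a given model actually uses; other models split text differently. It shows that short phrases become a handful of multi-character tokens rather than sequences of letters.

```python
# A minimal sketch of subword tokenization, assuming the open-source
# tiktoken library and its cl100k_base encoding; other models use
# different vocabularies and will split text differently.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["the simplest of phrases", "rabbit"]:
    token_ids = enc.encode(text)
    # Decode each token id back to the text it covers, to show that
    # tokens are multi-character chunks, not individual letters.
    pieces = [
        enc.decode_single_token_bytes(t).decode("utf-8", errors="replace")
        for t in token_ids
    ]
    print(f"{text!r} -> {len(token_ids)} tokens: {pieces}")
```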
Sample Comments
Yes, you are missing that the tokens aren't words; they are 2-3 letter groups, or chunks of arbitrary size depending on the model.
Tokenization is not the issue - these LLMs can all break a word into letters if you ask them.
He's saying the LLM will figure out how many letters are in each token.
I've often wanted to talk with an LLM about its tokenization (e.g., how many tokens are there in "the simplest of phrases"?). I wonder, if you fed it information about its tokenization (text like "rabbit is spelled r, a, b, b, i, t"), whether it could talk about it.
The palette for an LLM is tokens, not characters.
But LLMs have little concept of tokens, don't they? Or at least they don't know what their tokenizer is like.
Wouldn't an LLM that just tokenized by character be good at it?
...which LLMs don't use, since they use tokens instead.
Would an LLM using character-level tokens perform better (ignoring computational cost)?
They are trained on tokens, not characters.
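Several of the comments above ask whether a character-level tokenizer would make these tasks easier. A rough sketch of the trade-off they allude to, again assuming tiktoken's cl100k_base encoding as the subword baseline: character tokens make every letter directly visible to the model, but the same text becomes a much longer token sequence, which is the main cost behind "ignoring performance."

```python
# A rough comparison of sequence length under subword vs. character-level
# tokenization, assuming tiktoken's cl100k_base encoding as the subword
# example; character tokens expose letters but multiply sequence length.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "rabbit is spelled r, a, b, b, i, t"

subword_len = len(enc.encode(text))  # number of BPE subword tokens
char_len = len(text)                 # tokens if each character were a token

print(f"subword tokens:   {subword_len}")
print(f"character tokens: {char_len}")
```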