LLM Tokenization

The cluster focuses on discussions about tokenization in large language models (LLMs), including why LLMs use subword tokens instead of characters, the implications for performance and training, and comparisons between different tokenization approaches.
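The contrast between character-level and subword views is easy to see with a tokenizer library. The sketch below is illustrative only; it assumes the open-source tiktoken package and its cl100k_base encoding (used by GPT-4-era models), but any BPE tokenizer shows the same effect of splitting text into variable-length pieces rather than characters or whole words.

```python
# Illustrative only: assumes the tiktoken library and its cl100k_base encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "Tokenization is not characters"
ids = enc.encode(text)
pieces = [enc.decode([i]) for i in ids]  # the subword chunks the model actually sees

print("characters:", len(text))
print("tokens:    ", len(ids))
print("pieces:    ", pieces)  # variable-length chunks whose boundaries depend on the learned vocabulary
```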

➡️ Stable (1.0x) · AI & Machine Learning
Comments: 3,452
Years Active: 18
Top Authors: 5
Topic ID: #540

Activity Over Time

2008: 1
2009: 1
2011: 1
2012: 1
2013: 3
2014: 1
2015: 1
2016: 2
2017: 3
2018: 4
2019: 20
2020: 27
2021: 11
2022: 108
2023: 1,092
2024: 974
2025: 1,114
2026: 94

Keywords

AFAIK, e.g, DO, LLM, BOS, i.e, NOT, simonwillison.net, github.com, GPT, tokens, token, llm, letter, llms, trained, 1024, model, str, input

Sample Comments

azulster Sep 18, 2024

yes, you are missing that the tokens aren't words, they are 2-3 letter groups, or any number of arbitrary sizes depending on the model

HarHarVeryFunny May 15, 2024

tokenization is not the issue - these LLMs can all break a word into letters if you ask them.

pests Mar 1, 2024

He's saying the LLM will figure out how many letters are in each token.

PaulHoule Dec 14, 2024

I've often wanted to talk with an LLM about its tokenization (e.g. how many tokens are there in "the simplest of phrases"). I wonder, if you fed it information about its tokenization (text like "rabbit is spelled r, a, b, b, i, t"), whether it could talk about it.
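A minimal sketch of how that kind of side information could be produced outside the model, assuming the tiktoken library (an assumption; the comment names no particular tool):

```python
# Sketch: generate tokenization facts a model could be told about itself.
# Assumes tiktoken's cl100k_base encoding; the thread doesn't specify a tokenizer.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

phrase = "the simplest of phrases"
ids = enc.encode(phrase)
print(f'"{phrase}" has {len(ids)} tokens: {[enc.decode([i]) for i in ids]}')

# Spelling facts of the kind the comment imagines feeding in as text.
word = "rabbit"
print(f'{word} is spelled ' + ", ".join(word))
```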

hgsgm Mar 10, 2023

The palette for an LLM is tokens, not characters.

psychphysic May 20, 2023

But LLMs have little concept of tokens, don't they? Or at least they won't know what their tokenizer is like.

IncreasePosts Oct 14, 2025

Wouldn't an LLM that just tokenized by character be good at it?

nprateem Jun 24, 2025

...which LLMs don't use as they use tokens instead.

twobitshifter Sep 19, 2024

Would an LLM using character tokens perform better (ignoring performance)?
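Several comments ask what a character-level alternative would look like. Below is a purely hypothetical sketch (not any real model's tokenizer): every character gets its own ID, so letter-level structure is directly visible to the model, at the cost of sequences several times longer than with subword tokens.

```python
# Hypothetical character-level tokenizer, for illustration only.
def build_char_vocab(corpus: str) -> dict[str, int]:
    # One ID per distinct character seen in the corpus.
    return {ch: i for i, ch in enumerate(sorted(set(corpus)))}

def encode(text: str, vocab: dict[str, int]) -> list[int]:
    return [vocab[ch] for ch in text]

def decode(ids: list[int], vocab: dict[str, int]) -> str:
    inv = {i: ch for ch, i in vocab.items()}
    return "".join(inv[i] for i in ids)

corpus = "how many letters are in each token?"
vocab = build_char_vocab(corpus)
ids = encode(corpus, vocab)
assert decode(ids, vocab) == corpus
print(len(corpus), "characters ->", len(ids), "tokens (one per character)")
```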

fassssst Oct 22, 2024

They are trained on tokens not characters.