LLM Tokenization
The cluster focuses on discussions about tokenization in large language models (LLMs), including why LLMs use subword tokens instead of characters, the implications for performance and training, and comparisons between different tokenization approaches.
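For concreteness, here is a minimal sketch of subword tokenization. It assumes the open-source tiktoken library and its cl100k_base encoding as a stand-in for whatever tokenizer a given model actually uses; other models split text differently. It shows that short phrases become a handful of multi-character tokens rather than sequences of letters.

```python
# A minimal sketch of subword tokenization, assuming the open-source
# tiktoken library and its cl100k_base encoding; other models use
# different vocabularies and will split text differently.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["the simplest of phrases", "rabbit"]:
    token_ids = enc.encode(text)
    # Decode each token id back to the text it covers, to show that
    # tokens are multi-character chunks, not individual letters.
    pieces = [
        enc.decode_single_token_bytes(t).decode("utf-8", errors="replace")
        for t in token_ids
    ]
    print(f"{text!r} -> {len(token_ids)} tokens: {pieces}")
```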
Sample Comments
Yes, you are missing that the tokens aren't words; they are 2-3 letter groups, or chunks of arbitrary size depending on the model.
Tokenization is not the issue - these LLMs can all break a word into letters if you ask them.
He's saying the LLM will figure out how many letters are in each token.
I've often wanted to talk with an LLM about its tokenization (e.g., how many tokens are there in "the simplest of phrases"?). I wonder, if you fed it information about its tokenization (text like "rabbit is spelled r, a, b, b, i, t"), whether it could talk about it.
The palette for an LLM is tokens, not characters.
But LLMs have little concept of tokens, don't they? Or at least they don't know what their tokenizer is like.
Wouldn't an LLM that just tokenized by character be good at it?
...which LLMs don't use, since they use tokens instead.
Would an LLM using character-level tokens perform better (ignoring computational cost)?
They are trained on tokens, not characters.
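Several of the comments above ask whether a character-level tokenizer would make these tasks easier. A rough sketch of the trade-off they allude to, again assuming tiktoken's cl100k_base encoding as the subword baseline: character tokens make every letter directly visible to the model, but the same text becomes a much longer token sequence, which is the main cost behind "ignoring performance."

```python
# A rough comparison of sequence length under subword vs. character-level
# tokenization, assuming tiktoken's cl100k_base encoding as the subword
# example; character tokens expose letters but multiply sequence length.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "rabbit is spelled r, a, b, b, i, t"

subword_len = len(enc.encode(text))  # number of BPE subword tokens
char_len = len(text)                 # tokens if each character were a token

print(f"subword tokens:   {subword_len}")
print(f"character tokens: {char_len}")
```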