LLMs Training Data Debate
The cluster debates whether LLMs can generate novel knowledge and generalize beyond their training data, or whether they merely regurgitate, approximate, or statistically predict from existing training material.
[Charts not shown: Activity Over Time, Top Contributors, Keywords]
Sample Comments
Consider: you might be wrong about LLMs regurgitating their training data.
LLMs are trained on text, only some of which includes facts. It's a coincidence when the output includes new facts not explicitly present in the training data.
You're vastly overestimating the capability of LLMs to create new knowledge not already contained in their training material.
In most cases, LLMs have the knowledge (data). They just can't generalize it the way humans do. They can only reflect explicit things that are already there.
Isn't this proof that LLMs still don't really generalize beyond their training data?
Is it really certain that those problems and their answers were not in the training data for the tested LLMs? Presumably somebody on the internet wrote about them...
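One way to ground this contamination concern is an n-gram overlap check between a benchmark item and the documents of a training corpus. The sketch below is a minimal illustration only: the corpus and question strings are placeholders, and real contamination audits work over far larger corpora with more robust matching.

```python
def ngrams(text: str, n: int = 8) -> set:
    """Word-level n-grams of a lowercased text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(test_item: str, corpus_docs: list, n: int = 8) -> float:
    """Fraction of the test item's n-grams that also occur somewhere in the corpus."""
    test_grams = ngrams(test_item, n)
    if not test_grams:
        return 0.0
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(test_grams & corpus_grams) / len(test_grams)


if __name__ == "__main__":
    corpus_docs = ["..."]   # placeholder: documents from the model's training set
    question = "..."        # placeholder: a benchmark problem statement
    print(f"n-gram overlap: {overlap_ratio(question, corpus_docs):.2%}")
```

A high overlap ratio suggests the benchmark item (or a close paraphrase) was likely seen during training, which is the worry the comment raises.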
How do you trust what the LLM was trained on?
I keep seeing this comment all over the place. Just because something exists once in the training data doesn't mean the model can simply regurgitate it. That's not how training works. An LLM is not a knowledge database.
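The distinction this comment draws can be made concrete with a toy contrast between exact retrieval and statistical prediction. The sketch below uses a made-up corpus and a simple bigram counter rather than a neural network, purely to illustrate why a pattern seen once tends to lose out to a pattern seen many times.

```python
from collections import Counter, defaultdict

# Made-up corpus: one "niche fact" plus a much more frequent competing pattern.
corpus = (
    "the capital of atlantis is foo . "
    + "the capital of france is paris . " * 50
)

# 1) Knowledge database: exact storage, exact retrieval.
database = {"capital of atlantis": "foo"}
print("database lookup:", database["capital of atlantis"])

# 2) Statistical model: count which word follows each word (a bigram model).
tokens = corpus.split()
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(tokens, tokens[1:]):
    bigram_counts[prev][nxt] += 1

def predict_next(word: str) -> str:
    # Always return the most frequent continuation seen in the corpus.
    return bigram_counts[word].most_common(1)[0][0]

# The bigram model continues "... is" with the majority pattern ("paris"),
# not the single-occurrence fact ("foo").
print("bigram prediction after 'is':", predict_next("is"))
```

Real LLMs are far more capable than bigram counts, but the underlying objective is still prediction over the whole corpus, not record-by-record storage, which is why a single occurrence is not guaranteed to be reproduced.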
If humans don't understand it well enough to write about it in the data the LLM is trained on, how will the LLM be able to learn it?
LLMs do not reliably reproduce their training data. This is quite easy to demonstrate: every LLM has been trained on all of Wikipedia (at minimum), and yet if you ask it about a niche fact mentioned once on Wikipedia, it is highly likely to get it wrong.
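A rough version of that spot-check can be scripted, as sketched below. The model name (gpt2), the prompt format, and the substring-match scoring are illustrative assumptions rather than a real evaluation, and the placeholder question/answer pairs would need to be filled with facts verified against Wikipedia.

```python
from transformers import pipeline

# Any causal LM would do; gpt2 is only a small, convenient placeholder here.
generator = pipeline("text-generation", model="gpt2")

probes = [
    # Placeholder items: replace with questions whose answers appear
    # only rarely in the training data (e.g. single-mention Wikipedia facts).
    {"prompt": "Q: In what year was ...? A:", "answer": "..."},
]

hits = 0
for probe in probes:
    output = generator(probe["prompt"], max_new_tokens=20, do_sample=False)
    completion = output[0]["generated_text"][len(probe["prompt"]):]
    if probe["answer"].lower() in completion.lower():
        hits += 1

print(f"substring-match hits on niche facts: {hits}/{len(probes)}")
```

A low hit rate on facts that are known to be in the training corpus is the kind of evidence this comment appeals to when arguing that LLMs do not simply reproduce what they were trained on.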