LLMs Training Data Debate

The cluster debates whether LLMs can generate novel knowledge and generalize beyond their training data, or whether they merely regurgitate, approximate, or statistically predict from existing training material.

Trend: Stable (1.2x) · Category: AI & Machine Learning
Comments: 4,926
Years Active: 11
Top Authors: 5
Topic ID: #847

Activity Over Time (comments per year)

2016: 1 · 2017: 1 · 2018: 1 · 2019: 3 · 2020: 22 · 2021: 9 · 2022: 107 · 2023: 1,549 · 2024: 1,282 · 2025: 1,789 · 2026: 162

Keywords

AI, AGI, LLM, GPT, training data, output, model, knowledge, anthropic.com, thegradient.pub, wikipedia.org

Sample Comments

FeepingCreature · May 12, 2024

Consider: you might be wrong about LLMs regurgitating their training data.

jstrieb · Sep 21, 2025

LLMs are trained on text, only some of which includes facts. It's a coincidence when the output includes new facts not explicitly present in the training data.

wavemode · Jul 24, 2024

You're vastly overestimating the capability of LLMs to create new knowledge not already contained in their training material.

feverzsj · Dec 20, 2025

In most cases, LLMs have the knowledge (data). They just can't generalize it the way humans do. They can only reflect explicit things that are already there.

irthomasthomas · Dec 5, 2025

Isn't this proof that LLMs still don't really generalize beyond their training data?

sega_sai · Feb 9, 2025

Is it really certain that those problems and their answers were not in the training data for the tested LLMs? Presumably somebody on the internet wrote about them...

amelius · Dec 4, 2025

How do you trust what the LLM was trained on?

bongodongobob · Jun 18, 2024

I keep seeing this comment all over the place. Just because something exists 1 time in the training data doesn't mean it can just regurgitate that. That's not how training works. An LLM is not a knowledge database.
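A toy sketch can illustrate this point: in a simple frequency-based model (a stand-in for statistical next-token prediction; real LLMs are far more complex), a pattern seen exactly once is drowned out by a conflicting pattern seen many times. The corpus, tokens, and the "mauve" fact below are all invented for illustration.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count next-token frequencies for each token (a crude 'training' pass)."""
    model = defaultdict(Counter)
    for a, b in zip(tokens, tokens[1:]):
        model[a][b] += 1
    return model

def greedy_continue(model, token):
    """Predict the statistically most likely next token."""
    return model[token].most_common(1)[0][0]

# Corpus: the common pattern "the sky is blue" appears nine times;
# the niche "fact" "the sky is mauve" appears exactly once.
corpus = ("the sky is blue . " * 9 + "the sky is mauve . ").split()
model = train_bigram(corpus)

# The greedy prediction after "is" follows the majority continuation,
# not the one-off fact -- it was "in the training data" but isn't reproduced.
print(greedy_continue(model, "is"))  # -> blue
```

The single-occurrence fact is present in the counts (model["is"]["mauve"] == 1) yet never surfaces under greedy decoding, which is the gap between "stored somewhere in the statistics" and "retrievable like a database row".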

mostlysimilar · Jun 19, 2025

If humans don't understand it to write the data the LLM is trained on, how will the LLM be able to learn it?

sebzim4500 · Feb 21, 2025

LLMs do not reliably reproduce their training data. This is quite easy to demonstrate: every LLM has been trained on all of Wikipedia (at minimum), and yet if you ask it about a niche fact mentioned once on Wikipedia, it is highly likely to get it wrong.