LLMs Training Data Debate
The cluster debates whether LLMs can generate novel knowledge and generalize beyond their training data, or whether they merely regurgitate, approximate, or statistically predict from existing training material.
[Charts not shown: Activity Over Time, Top Contributors, Keywords]
Sample Comments
Consider: you might be wrong about LLMs regurgitating their training data.
LLMs are trained on text, only some of which includes facts. It's a coincidence when the output includes new facts not explicitly present in the training data.
You're vastly overestimating the capability of LLMs to create new knowledge not already contained in their training material.
In most cases, LLMs have the knowledge (data). They just can't generalize it the way humans do. They can only reflect explicit things that are already there.
Isn't this proof that LLMs still don't really generalize beyond their training data?
Is it really certain that those problems and their answers were not in the training data for the tested LLMs? Presumably somebody on the internet wrote about them...
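One way to ground this contamination concern is an n-gram overlap check between a benchmark item and the documents of a training corpus. The sketch below is a minimal illustration only: the corpus and question strings are placeholders, and real contamination audits work over far larger corpora with more robust matching.

```python
def ngrams(text: str, n: int = 8) -> set:
    """Word-level n-grams of a lowercased text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(test_item: str, corpus_docs: list, n: int = 8) -> float:
    """Fraction of the test item's n-grams that also occur somewhere in the corpus."""
    test_grams = ngrams(test_item, n)
    if not test_grams:
        return 0.0
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(test_grams & corpus_grams) / len(test_grams)


if __name__ == "__main__":
    corpus_docs = ["..."]   # placeholder: documents from the model's training set
    question = "..."        # placeholder: a benchmark problem statement
    print(f"n-gram overlap: {overlap_ratio(question, corpus_docs):.2%}")
```

A high overlap ratio suggests the benchmark item (or a close paraphrase) was likely seen during training, which is the worry the comment raises.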
How do you trust what the LLM was trained on?
I keep seeing this comment all over the place. Just because something exists once in the training data doesn't mean the model can simply regurgitate it. That's not how training works. An LLM is not a knowledge database.
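The distinction this comment draws can be made concrete with a toy contrast between exact retrieval and statistical prediction. The sketch below uses a made-up corpus and a simple bigram counter rather than a neural network, purely to illustrate why a pattern seen once tends to lose out to a pattern seen many times.

```python
from collections import Counter, defaultdict

# Made-up corpus: one "niche fact" plus a much more frequent competing pattern.
corpus = (
    "the capital of atlantis is foo . "
    + "the capital of france is paris . " * 50
)

# 1) Knowledge database: exact storage, exact retrieval.
database = {"capital of atlantis": "foo"}
print("database lookup:", database["capital of atlantis"])

# 2) Statistical model: count which word follows each word (a bigram model).
tokens = corpus.split()
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(tokens, tokens[1:]):
    bigram_counts[prev][nxt] += 1

def predict_next(word: str) -> str:
    # Always return the most frequent continuation seen in the corpus.
    return bigram_counts[word].most_common(1)[0][0]

# The bigram model continues "... is" with the majority pattern ("paris"),
# not the single-occurrence fact ("foo").
print("bigram prediction after 'is':", predict_next("is"))
```

Real LLMs are far more capable than bigram counts, but the underlying objective is still prediction over the whole corpus, not record-by-record storage, which is why a single occurrence is not guaranteed to be reproduced.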
If humans don't understand it well enough to write about it in the data the LLM is trained on, how will the LLM be able to learn it?
LLMs do not reliably reproduce their training data. This is quite easy to demonstrate: every LLM has been trained on all of Wikipedia (at minimum), and yet if you ask it about a niche fact mentioned once on Wikipedia, it is highly likely to get it wrong.
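A rough version of that spot-check can be scripted, as sketched below. The model name (gpt2), the prompt format, and the substring-match scoring are illustrative assumptions rather than a real evaluation, and the placeholder question/answer pairs would need to be filled with facts verified against Wikipedia.

```python
from transformers import pipeline

# Any causal LM would do; gpt2 is only a small, convenient placeholder here.
generator = pipeline("text-generation", model="gpt2")

probes = [
    # Placeholder items: replace with questions whose answers appear
    # only rarely in the training data (e.g. single-mention Wikipedia facts).
    {"prompt": "Q: In what year was ...? A:", "answer": "..."},
]

hits = 0
for probe in probes:
    output = generator(probe["prompt"], max_new_tokens=20, do_sample=False)
    completion = output[0]["generated_text"][len(probe["prompt"]):]
    if probe["answer"].lower() in completion.lower():
        hits += 1

print(f"substring-match hits on niche facts: {hits}/{len(probes)}")
```

A low hit rate on facts that are known to be in the training corpus is the kind of evidence this comment appeals to when arguing that LLMs do not simply reproduce what they were trained on.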