AI Training Data Contamination

The cluster focuses on debates about whether AI models' behaviors or benchmark performances stem from genuine learning or memorization due to test questions or related content leaking into training data. Commenters question proofs of data absence in training sets and risks of overfitting or contamination.

➡️ Stable 0.7x AI & Machine Learning

3,675

Comments

Years Active

Top Authors

#3368

Topic ID

Activity Over Time

2008

2009

2010

2011

2012

2013

2014

2015

2016

2017

177

2018

171

2019

175

2020

206

2021

260

2022

306

2023

759

2024

618

2025

717

2026

Top Contributors

YeGoblynQueenne (45) godelski (27) famouswaffles (20) visarga (19) yorwba (16)

Keywords

e.g AI LLM stackexchange.com twitter.com ML AXIOM ycombinator.com IE training training data data model models trained training set dataset test benchmark

Sample Comments

benterix • Nov 18, 2025 • View on HN

What makes you think it wouldn't end up in the training set anyway?

wnkrshm • Mar 1, 2023 • View on HN

How do you prove that it happens and is not an artifact of the training data?

zahlman • Nov 15, 2025 • View on HN

Are you sure it isn't just a case of a write-up of the project appearing in the training data?

Sebguer • Jul 16, 2025 • View on HN

Models do not possess awareness of their training data. Also you are taking at face value that it is "accurate".

neatze • May 13, 2023 • View on HN

Probably, it is best training data you can get, since model was "tricked" to contradict prompt's request.

heavyset_go • Jul 7, 2021 • View on HN

I'm familiar with training and using ML models. I'm also familiar with the ways such models can encode their training data in the models themselves, hence my criticism.

yonatan8070 • Dec 12, 2024 • View on HN

Could it be that since the question became a big benchmark, it (along with the correct answer) slipped into the training data?

lionkor • Sep 26, 2024 • View on HN

Fun little counterpoint: How can you _prove_ that this exact question was not in the training set?

lexandstuff • Dec 6, 2024 • View on HN

My guess is that it isn't available because the training data they stole occasionally leaks into the outputs.

cmpb • Jul 9, 2021 • View on HN

Has that been proven? It's possible it's only doing that due to a small training set