AI Training Data Contamination

The cluster focuses on debates about whether AI models' behaviors or benchmark performances stem from genuine learning or memorization due to test questions or related content leaking into training data. Commenters question proofs of data absence in training sets and risks of overfitting or contamination.

➡️ Stable 0.7x AI & Machine Learning
3,675
Comments
19
Years Active
5
Top Authors
#3368
Topic ID

Activity Over Time

2008
2
2009
5
2010
14
2011
12
2012
24
2013
17
2014
35
2015
49
2016
89
2017
177
2018
171
2019
175
2020
206
2021
260
2022
306
2023
759
2024
618
2025
717
2026
39

Keywords

e.g AI LLM stackexchange.com twitter.com ML AXIOM ycombinator.com IE training training data data model models trained training set dataset test benchmark

Sample Comments

benterix Nov 18, 2025 View on HN

What makes you think it wouldn't end up in the training set anyway?

wnkrshm Mar 1, 2023 View on HN

How do you prove that it happens and is not an artifact of the training data?

zahlman Nov 15, 2025 View on HN

Are you sure it isn't just a case of a write-up of the project appearing in the training data?

Sebguer Jul 16, 2025 View on HN

Models do not possess awareness of their training data. Also you are taking at face value that it is "accurate".

neatze May 13, 2023 View on HN

Probably, it is best training data you can get, since model was "tricked" to contradict prompt's request.

heavyset_go Jul 7, 2021 View on HN

I'm familiar with training and using ML models. I'm also familiar with the ways such models can encode their training data in the models themselves, hence my criticism.

yonatan8070 Dec 12, 2024 View on HN

Could it be that since the question became a big benchmark, it (along with the correct answer) slipped into the training data?

lionkor Sep 26, 2024 View on HN

Fun little counterpoint: How can you _prove_ that this exact question was not in the training set?

lexandstuff Dec 6, 2024 View on HN

My guess is that it isn't available because the training data they stole occasionally leaks into the outputs.

cmpb Jul 9, 2021 View on HN

Has that been proven? It's possible it's only doing that due to a small training set