AI Training Data Contamination
This cluster centers on debates about whether AI models' behaviors and benchmark scores reflect genuine learning or merely memorization of test questions and related content that leaked into training data. Commenters ask how anyone could prove that specific material is absent from a training set, and point to the risks of overfitting and contamination.
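As a minimal sketch of what such a contamination check might look like (the function names, 8-gram window, and threshold below are illustrative assumptions, not any lab's actual pipeline), one common approach is to flag benchmark items whose word n-grams substantially overlap a crawled training document:

import re

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Lowercase, strip punctuation, and return the set of word n-grams."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(benchmark_item: str, corpus_doc: str, n: int = 8) -> float:
    """Fraction of the benchmark item's n-grams that also appear in the corpus document."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(corpus_doc, n)) / len(item_grams)

# Hypothetical usage: treat an overlap above some threshold (e.g. 0.5) as
# evidence that the benchmark question likely appeared in the training corpus.
if __name__ == "__main__":
    question = "What is the capital of France? Answer: Paris."
    document = "Quiz night recap: what is the capital of France? Answer: Paris, of course."
    print(f"overlap: {overlap_ratio(question, document):.2f}")

A check like this can only demonstrate presence, not absence: a near-zero overlap against the documents you happen to have does not prove the item never appeared in the full training corpus, which is precisely the asymmetry the commenters below keep raising.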
Activity Over Time
Top Contributors
Keywords
Sample Comments
What makes you think it wouldn't end up in the training set anyway?
How do you prove that it happens and is not an artifact of the training data?
Are you sure it isn't just a case of a write-up of the project appearing in the training data?
Models do not possess awareness of their training data. Also you are taking at face value that it is "accurate".
Probably, it is the best training data you can get, since the model was "tricked" into contradicting the prompt's request.
I'm familiar with training and using ML models. I'm also familiar with the ways such models can encode their training data in the models themselves, hence my criticism.
Could it be that since the question became a big benchmark, it (along with the correct answer) slipped into the training data?
Fun little counterpoint: How can you _prove_ that this exact question was not in the training set?
My guess is that it isn't available because the training data they stole occasionally leaks into the outputs.
Has that been proven? It's possible it's only doing that due to a small training set.