LLM Benchmark Skepticism

The cluster centers on distrust of LLM benchmarks: suspected data contamination, overfitting, and outright cheating by training on test data, along with calls for real-world or private evaluations instead. Users highlight issues like Goodhart's Law and self-grading, and express a preference for metrics like LMSYS Elo.
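
One concrete technique behind the contamination worry, shown below as a minimal Python sketch, is an n-gram overlap check between benchmark items and training text. The 13-token window loosely follows the decontamination heuristic described in the GPT-3 paper; the function names and toy data are illustrative, not any lab's actual pipeline.

```python
# Minimal sketch of an n-gram overlap check for benchmark contamination.
# The 13-gram rule loosely follows the GPT-3 paper's decontamination
# heuristic; everything here is illustrative, not a real eval pipeline.

def ngrams(tokens, n=13):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def flag_contaminated(benchmark_items, training_docs, n=13):
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc.lower().split(), n)
    # An item is suspect if it shares any long n-gram with training text.
    return [item for item in benchmark_items
            if ngrams(item.lower().split(), n) & train_grams]

train = ["the quick brown fox jumps over the lazy dog near the old mill stream"]
items = [
    "the quick brown fox jumps over the lazy dog near the old mill stream today",
    "a fresh question about matrix multiplication over finite fields",
]
print(flag_contaminated(items, train))  # flags only the first item
```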

🚀 Rising 2.3x · AI & Machine Learning
Comments: 3,531 · Years Active: 20 · Top Authors: 5 · Topic ID: #5841

Activity Over Time (comments per year)

2007: 2     2008: 3     2009: 9     2010: 5     2011: 4
2012: 25    2013: 6     2014: 14    2015: 52    2016: 46
2017: 60    2018: 67    2019: 113   2020: 126   2021: 94
2022: 140   2023: 516   2024: 760   2025: 1,407 2026: 84

Keywords

CS, MISSION, AI, AGI, LLM, psu.edu, YOUR, MisguidedAttention, citeseerx.ist, APPS, benchmarks, benchmark, models, llm, model, o1, scores, performance, training, trained

Sample Comments

retinaros (Apr 20, 2025)

It's just BS benchmarks. They are all cheating at this point, feeding the data into the training set. It doesn't mean the LLMs aren't becoming better, but when they all lie...

ranyume (Dec 14, 2025)

Careful with that benchmark. It's LLMs grading other LLMs.
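
For context on what this warning means in practice: a common partial mitigation in LLM-as-judge pipelines is to present the two answers in both orders and discard verdicts that flip. The sketch below is a hypothetical illustration; judge() stands in for a real judge-model call and is not any actual API.

```python
# Sketch of a position-swap check for "LLMs grading LLMs" setups. Judge
# models often favor whichever answer is shown first, so we ask twice
# with the order swapped and keep only verdicts that survive the swap.
# judge() is a placeholder for a real judge-model call.

def judge(question: str, answer_1: str, answer_2: str) -> str:
    raise NotImplementedError("call a judge model; return '1', '2', or 'tie'")

def position_debiased_verdict(question: str, a: str, b: str) -> str:
    first = judge(question, a, b)    # a presented first
    second = judge(question, b, a)   # b presented first
    swapped = {"1": "2", "2": "1", "tie": "tie"}[second]
    # Only trust the judge when it picks the same answer in both orders.
    return first if first == swapped else "inconsistent"
```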

up6w6 (Aug 7, 2025)

Crazy how they only show benchmark results against their own models.

stavros (Dec 8, 2023)

Aren't LLM benchmarks at best irrelevant, at worst lying, at this point?

jdefr89 (Dec 20, 2024)

Uhhhh… It was trained on ARC data? So they targeted a specific benchmark and are surprised and blown away that the LLM performed well on it? What's that law again? When a benchmark is targeted by some system, the benchmark becomes useless?

nickpsecurity (Mar 10, 2024)

Anytime you see that, we should assume the newer models might have been trained on either the benchmarks themselves or something similar to them. If I were an evaluator, I'd keep a secret pile of tests that I know aren't in any LLM's training data, do the evaluations privately, and not publish the scores either: just the rank, plus how far apart they are. The best tests of these models are people who want to use AI to solve real problems attempting to do that with various models. If they work, report that they worked.
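
The scheme nickpsecurity describes fits in a few lines. Below is a hedged sketch assuming you already have some private scoring harness; score_model() is a placeholder, and the scores in the usage line are made up.

```python
# Sketch of the private-evaluation idea above: keep the test set secret,
# run it yourself, and publish only rank order and score gaps, never the
# raw scores or the items. score_model() is a placeholder for whatever
# held-out harness you actually run.

def score_model(model_name: str, private_items: list) -> float:
    raise NotImplementedError("run your own secret test set here")

def rank_report(scores: dict) -> list:
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    report = []
    for i, (name, score) in enumerate(ranked):
        gap = ranked[i - 1][1] - score if i else 0.0
        report.append(f"#{i + 1} {name}" + (f" (gap: {gap:.3f})" if i else ""))
    return report

# Made-up internal scores; only the ranking below ever leaves the lab.
print("\n".join(rank_report({"model-a": 0.81, "model-b": 0.74, "model-c": 0.73})))
```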

PeterisP (May 31, 2023)

I would distrust the currently available benchmarks, as recent research (gah, can't remember the paper title) indicates that for many benchmarks at least some of the data splits have leaked into model training data; and there's some experience with open-source models that match an OpenAI model on benchmark scores but subjectively feel much worse than that model on random questions.

czk (Feb 11, 2025)

I'm very naive here, but does anyone trust these benchmarks? Do they mean anything to you? They seem far too easy to game, and it doesn't feel like an accurate way to tell how these models compare to one another. It seems like benchmark performance declines quite a bit if you introduce a problem that's similar to those in benchmarks but one the model hasn't seen before.
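
The pattern czk describes, scores dropping on unseen lookalike problems, is itself a usable contamination probe: measure the gap between accuracy on the verbatim items and on perturbed variants. Below is a hedged sketch with a toy "model" that has memorized a single item; nothing here quotes a real benchmark or API.

```python
# Sketch of a memorization probe matching the observation above: compare
# accuracy on verbatim benchmark items against lightly perturbed variants
# (renamed entities, shifted numbers). A large gap suggests the model has
# seen the originals. The toy "model" below just memorizes one item.

def accuracy(model, items):
    # items: list of (prompt, expected_answer); model: callable prompt -> str
    return sum(model(p).strip() == a for p, a in items) / len(items)

def contamination_gap(model, originals, variants):
    # Positive gap = the model does better on the exact published wording.
    return accuracy(model, originals) - accuracy(model, variants)

memorized = {"What is 12 * 12?": "144"}
toy_model = lambda p: memorized.get(p, "not sure")
originals = [("What is 12 * 12?", "144")]
variants = [("What is 13 * 13?", "169")]
print(contamination_gap(toy_model, originals, variants))  # 1.0: pure recall
```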

sujay1844 (Jul 24, 2024)

These days, LMSYS Elo is the only thing I trust. The other benchmark scores mean nothing to me at this point.
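
For readers unfamiliar with how those ratings are produced: LMSYS collects pairwise human votes and fits a rating per model. The sketch below shows the classic online Elo update as a simplified stand-in; current Chatbot Arena leaderboards actually fit a Bradley-Terry model over all votes, and K=32 is just a conventional choice.

```python
# Simplified sketch of arena-style Elo ratings computed from pairwise
# votes. Real LMSYS/Chatbot Arena leaderboards fit a Bradley-Terry model;
# this online chess-style update is the classic approximation.

def expected_win(r_a: float, r_b: float) -> float:
    # Modeled probability that the model rated r_a beats the one rated r_b.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(ratings: dict, winner: str, loser: str, k: float = 32.0):
    surprise = 1.0 - expected_win(ratings[winner], ratings[loser])
    ratings[winner] += k * surprise  # bigger gain for upset wins
    ratings[loser] -= k * surprise   # zero-sum update

ratings = {"model-a": 1000.0, "model-b": 1000.0}
for w, l in [("model-a", "model-b")] * 3 + [("model-b", "model-a")]:
    elo_update(ratings, w, l)
print(ratings)  # model-a ends ahead after winning 3 of 4 head-to-heads
```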

xigoi (Jul 20, 2025)

It is not a big deal, because OpenAI has been known to cheat on LLM benchmarks before, and I have no reason to believe that the AI actually solved the problems by itself without training on the solutions. I'll be more impressed if similar performance is obtained by an open-source model that can be independently verified.