LLM Benchmark Skepticism
The cluster centers on distrust of LLM benchmarks, driven by suspected data contamination, overfitting, and outright cheating by training on test data, along with calls for real-world or private evaluations instead. Users invoke Goodhart's Law, object to LLMs grading other LLMs, and express a preference for metrics like the LMSYS Elo leaderboard.
Activity Over Time
Top Contributors
Keywords
Sample Comments
It's just BS benchmarks. They are all cheating at this point, feeding the test data into the training set. That doesn't mean the LLMs aren't becoming better, but when they all lie...
Careful with that benchmark. It's LLMs grading other LLMs.
Crazy how they only show benchmark results against their own models.
Aren't LLM benchmarks at best irrelevant, at worst lying, at this point?
Uhhhh… It was trained on ARC data? So they targeted a specific benchmark and are surprised and blown away that the LLM performed well on it? What's that law again? When a benchmark is targeted by some system, the benchmark becomes useless?
Any time you see that, we should assume the newer models might have been trained on either the benchmarks themselves or something similar to them. If I were an evaluator, I'd keep a secret pile of tests that I know aren't in any LLM's training data, do the evaluations privately, and not publish the scores either: just the ranking plus how far apart the models are. The best test of these models is people who want to use AI to solve real problems attempting to do that with various models. If they work, report that they worked.
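The reporting scheme this comment describes (score models on a hidden test pile, then publish only the ordering and relative gaps, never the raw scores) can be illustrated roughly as below. Everything in the sketch, including the model names and the `score_fn` hook, is a hypothetical stand-in rather than any real evaluator's tooling.

```python
# Sketch of "rank plus gap" reporting for a private benchmark suite.
# score_fn stands in for whatever hidden evaluation the evaluator runs;
# the raw scores never leave this function.

def publish_rank_and_gaps(model_names, score_fn):
    scores = {name: score_fn(name) for name in model_names}
    ranked = sorted(scores, key=scores.get, reverse=True)
    top = scores[ranked[0]]
    # Report only the ordering and the distance to the leader.
    return [{"rank": i + 1,
             "model": name,
             "gap_to_leader": round(top - scores[name], 3)}
            for i, name in enumerate(ranked)]

# Hypothetical usage with a stand-in scorer:
if __name__ == "__main__":
    fake_scores = {"model_a": 0.81, "model_b": 0.78, "model_c": 0.66}
    print(publish_rank_and_gaps(fake_scores, fake_scores.get))
```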
I would distrust the currently available benchmarks: recent research (gah, can't remember the paper title) indicates that for many benchmarks at least some of the data splits have leaked into model training data, and there is some experience with open-source models that match an OpenAI model on benchmark scores but subjectively feel much worse than that model on random questions.
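For context, the leakage concern raised here is often probed with long n-gram overlap checks between benchmark items and the training corpus. The sketch below assumes you can see both as plain text, which is rarely possible for closed models, so treat it as illustrative only.

```python
# Minimal n-gram overlap check for benchmark contamination.
# Assumes benchmark items and (a sample of) the training corpus are
# available as plain strings, which is usually not true for closed models.

def ngrams(text: str, n: int = 13):
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(benchmark_items, corpus_docs, n: int = 13) -> float:
    corpus_ngrams = set()
    for doc in corpus_docs:
        corpus_ngrams |= ngrams(doc, n)
    # Flag any benchmark item sharing at least one long n-gram with the corpus.
    flagged = sum(1 for item in benchmark_items
                  if ngrams(item, n) & corpus_ngrams)
    return flagged / max(len(benchmark_items), 1)
```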
I'm very naive here, but does anyone trust these benchmarks? Do they mean anything to you? They seem far too easy to game, and it doesn't feel like it's an accurate way to really tell how these models compare to one another. It seems like benchmark performance declines quite a bit if you introduce a problem that's similar to those in benchmarks but one the model hasn't seen before.
These days, the LMSYS Elo leaderboard is the only thing I trust. The other benchmark scores mean nothing to me at this point.
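For reference, the LMSYS leaderboard mentioned here ranks models from head-to-head human preference votes rather than fixed test questions. A rough sketch of an Elo-style update over such pairwise battles follows; LMSYS has since moved to a Bradley-Terry style fit, so this illustrates the idea, not their actual pipeline.

```python
# Illustrative Elo update from pairwise (winner, loser) battles between models.
# A sketch of the idea behind preference-based leaderboards, not LMSYS's code.
from collections import defaultdict

def elo_ratings(battles, k: float = 32.0, base: float = 1000.0):
    ratings = defaultdict(lambda: base)
    for winner, loser in battles:
        # Expected win probability for the winner under the Elo model.
        expected_win = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
        ratings[winner] += k * (1.0 - expected_win)
        ratings[loser] -= k * (1.0 - expected_win)
    return dict(ratings)

# Example with three anonymized head-to-head outcomes:
print(elo_ratings([("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")]))
```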
It is not a big deal, because OpenAI has been known to cheat on LLM benchmarks before, and I have no reason to believe that the AI actually solved the problems by itself without training on the solutions. I'll be more impressed if similar performance is obtained by an open-source model that can be independently verified.