LLM Benchmark Comparisons

This cluster covers debates about how new language models perform against state-of-the-art ones such as GPT-4, Llama, and Mistral, questioning missing benchmarks, benchmark overfitting, and real-world validity.

Trend: ➡️ Stable (1.0x)
Category: AI & Machine Learning
Comments: 4,463
Years Active: 12
Top Authors: 5
Topic ID: #9732

Activity Over Time

2015: 3
2016: 1
2017: 4
2018: 3
2019: 31
2020: 58
2021: 55
2022: 95
2023: 1,400
2024: 1,319
2025: 1,422
2026: 74
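The "Stable (1.0x)" trend label presumably derives from the yearly counts above, though the page does not say how. As a minimal sketch, assuming the multiplier simply compares the latest full year to the year before it (the function name and method are this sketch's assumptions, not the site's):

```python
# Yearly comment counts, copied from the activity table above.
counts = {
    2015: 3, 2016: 1, 2017: 4, 2018: 3, 2019: 31,
    2020: 58, 2021: 55, 2022: 95, 2023: 1400,
    2024: 1319, 2025: 1422, 2026: 74,
}

def trend_multiplier(counts, latest_full_year=2025):
    """Hypothetical trend metric: latest full year vs. the year before.

    2026 is excluded as a partial year.
    """
    return counts[latest_full_year] / counts[latest_full_year - 1]

m = trend_multiplier(counts)
print(f"{m:.1f}x")
```

This yields roughly 1.08 (which rounds to 1.1x, not the page's 1.0x), so the site likely uses a different window or rounding; the calculation here is only an illustration of the idea.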

Keywords

CPU, CBRN, MythoMax, LLM, BLOOMZ, AI, GP, stanford.edu, GPT4, AIME, models, benchmarks, gpt, openai, llama, model, open, chatgpt, models like grok

Sample Comments

yousif_123123 (Dec 11, 2025)

Why doesn't OpenAI include comparisons to other models anymore?

aubanel (Mar 17, 2025)

No, it's not: their model is only at GPT-4.5 level on a few saturated, cherry-picked benchmarks.

prime312 (Aug 12, 2025)

Any reason why open-source models (like Llama) weren't considered here?

robrenaud (Apr 8, 2024)

Model quality matters a ton too. They aren't serving OpenAI or Anthropic models, which are state of the art.

transformi (Sep 27, 2023)

But why didn't they compare it to SOTA fine-tuned models (like Vicuna, Platypus)? ... smells a bit strange.

syntaxing (Aug 5, 2025)

Interesting, these models are better than the new Qwen releases?

dcreater (Sep 5, 2025)

Thank you! Why are the comparisons to llama3.1 era models?

saberience (Mar 12, 2025)

Seems like it's tuned for benchmarks to me; in the real world it seems worse than Mistral and Llama.

Tepix (Mar 31, 2023)

Have you tried bigger models? Llama-65B can indeed compete with GPT-3 according to various benchmarks. The next thing would be to get the fine-tuning as good as OpenAI's.

wanderingmind (May 5, 2024)

How does it compare to GGML? That is what they must be comparing, and yet I don't see any comparison made.