LLM Benchmark Comparisons
This cluster centers on debates about how newly released language models perform relative to state-of-the-art ones such as GPT-4, Llama, and Mistral, with commenters questioning missing comparisons, benchmark overfitting, and real-world validity.
Sample Comments
Why doesn't OpenAI include comparisons to other models anymore?
No, it's not: their model is only at GPT-4.5 level on a few saturated/cherry-picked benchmarks.
Any reason why open-source models (like Llama) weren't considered here?
Model quality matters a ton too. They aren't serving OpenAI or Anthropic models, which are state of the art.
But why didn't they compare it to SOTA fine-tuned models (like Vicuna or Platypus)? Smells a bit strange.
Interesting, these models are better than the new Qwen releases?
Thank you! Why are the comparisons to Llama 3.1-era models?
Seems like it's tuned for benchmarks to me; in real-world use it seems worse than Mistral and Llama.
Have you tried bigger models? Llama-65B can indeed compete with GPT-3 according to various benchmarks. The next thing would be to get the fine-tuning as good as OpenAI's.
How does it compare to GGML? That's what they must be comparing, and yet I don't see any comparison made.
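For context on what the apples-to-apples comparisons requested above involve, here is a minimal sketch of likelihood-based multiple-choice evaluation, the scoring approach used by common harnesses such as lm-evaluation-harness. The two-item inline "benchmark" and the checkpoint names (gpt2, distilgpt2) are placeholders for illustration, not the models under discussion.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical two-item benchmark; real suites (MMLU, HellaSwag, ...) have
# thousands of items and standardized prompts.
BENCHMARK = [
    {"question": "The capital of France is",
     "choices": ["Paris", "London", "Berlin"], "answer": 0},
    {"question": "Two plus two equals",
     "choices": ["four", "five"], "answer": 0},
]

def choice_logprob(model, tokenizer, prompt, choice):
    # Score log P(choice | prompt) by summing token log-probs of the choice.
    # Assumes the prompt's tokenization is a prefix of the full sequence's,
    # which holds for typical BPE tokenizers at a space boundary.
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + " " + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    # Positions whose *next* token belongs to the choice.
    start = prompt_ids.shape[1] - 1
    return sum(log_probs[pos, full_ids[0, pos + 1]].item()
               for pos in range(start, full_ids.shape[1] - 1))

def accuracy(model_name):
    # Summed log-probs favor shorter choices; harnesses also report
    # length-normalized variants.
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()
    correct = 0
    for item in BENCHMARK:
        scores = [choice_logprob(model, tokenizer, item["question"], c)
                  for c in item["choices"]]
        correct += int(scores.index(max(scores)) == item["answer"])
    return correct / len(BENCHMARK)

# Placeholder checkpoints; substitute the actual pair being compared.
for name in ["gpt2", "distilgpt2"]:
    print(f"{name}: {accuracy(name):.2%}")

The overfitting concern raised in the comments arises precisely because a model can be tuned against a fixed set of such items, which is why commenters ask for held-out or real-world comparisons rather than saturated benchmarks.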