LLM Inference Performance

This cluster covers LLM inference speed: common bottlenecks such as memory bandwidth and I/O, hardware suitability (GPUs, CPUs, TPUs), quantization, batching, and how different devices compare on token generation throughput.

Trend: Stable (1.1x)
Category: AI & Machine Learning
Comments: 5,173
Years Active: 16
Top Authors: 5
Topic ID: #3655

Activity Over Time

2010: 1
2012: 1
2013: 1
2014: 2
2015: 9
2016: 39
2017: 50
2018: 72
2019: 80
2020: 124
2021: 110
2022: 199
2023: 1,409
2024: 1,314
2025: 1,669
2026: 93

Keywords

RAM, edge, CPU, RNN, LLM, ANE, H100, HBM, TPU, GPU, inference, tokens, memory, speed, token, llms, gpu, bandwidth, llm, batch

Sample Comments

pama (Jun 4, 2024)

Can one run fast LLM inference on these?

EvgeniyZh (May 18, 2023)

Talks about throughput but doesn't mention memory I/O speed, which should be a bottleneck for LLMs

mrob (Dec 11, 2023)

LLM inference is bottlenecked by memory bandwidth. You'll probably get identical speed with cheaper CPUs.
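
A rough way to sanity-check the bandwidth argument in the comment above is to divide memory bandwidth by the bytes that must be streamed per generated token (roughly the whole model at batch size 1). The sketch below uses assumed, illustrative figures for model size and bandwidth, not measurements.

```python
# Minimal sketch of a bandwidth-bound decode estimate (illustrative figures only).
# At batch size 1, each new token requires reading roughly all model weights once,
# so tokens/s is capped at memory_bandwidth / model_size_in_bytes.

def bandwidth_bound_tokens_per_s(params_billion: float,
                                 bytes_per_param: float,
                                 bandwidth_gb_s: float) -> float:
    """Upper bound on single-stream decode speed set by memory bandwidth alone."""
    model_bytes = params_billion * 1e9 * bytes_per_param
    return (bandwidth_gb_s * 1e9) / model_bytes

# Assumed example: a 7B-parameter model in fp16 (2 bytes per weight).
for label, bw in [("dual-channel desktop DRAM (~80 GB/s)", 80),
                  ("workstation/server DRAM (~300 GB/s)", 300),
                  ("GPU HBM/GDDR (~900 GB/s)", 900)]:
    print(f"{label}: ceiling ~{bandwidth_bound_tokens_per_s(7, 2, bw):.1f} tokens/s")
```

On these assumptions the ceiling tracks bandwidth almost linearly, which is why extra CPU cores or clock speed change little once memory bandwidth is the binding constraint.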

Mockapapella (Oct 3, 2023)

I want this for LLMs. Having that much less of a memory footprint would allow us to put more models on a GPU at a time, and assuming the clock could keep up it could more than make up for the loss in inference speed per individual model
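
To illustrate the footprint argument in the comment above, here is a sketch of how weight precision changes memory use and how many model copies could fit in a fixed VRAM budget. It counts weights only (no KV cache, activations, or framework overhead), and the model size and VRAM figure are assumptions.

```python
# Sketch: weight precision vs. memory footprint and models-per-GPU.
# Weights only; KV cache, activations, and runtime overhead are ignored.

def weight_footprint_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate size of the weight tensor alone, in GB."""
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

gpu_vram_gb = 24   # assumed 24 GB card
params_b = 7       # hypothetical 7B-parameter model
for bits in (16, 8, 4):
    gb = weight_footprint_gb(params_b, bits)
    copies = int(gpu_vram_gb // gb)
    print(f"{bits}-bit weights: ~{gb:.1f} GB per copy -> ~{copies} copies in {gpu_vram_gb} GB")
```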

sbierwagen (Oct 31, 2023)

LLM inference speed is bandwidth limited.

lxe (Mar 13, 2023)

How are you getting 20 tokens/second? I'm getting 2.6 tokens/s on 3090 with int4 prequantized model. Is 4090 so much faster?
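
For context on numbers like those in the comment above, the same bandwidth-ceiling arithmetic can be applied with int4 weights; the model size is assumed and the bandwidth figures are approximate, so treat this only as an order-of-magnitude check.

```python
# Order-of-magnitude check (assumed figures): bandwidth ceiling with int4 weights.
# A measured rate far below this ceiling would suggest a software or CPU-offload
# bottleneck rather than the GPU's raw memory bandwidth.

def ceiling_tokens_per_s(params_billion: float, bits_per_param: int,
                         bandwidth_gb_s: float) -> float:
    model_bytes = params_billion * 1e9 * bits_per_param / 8
    return bandwidth_gb_s * 1e9 / model_bytes

params_b = 13  # hypothetical 13B model quantized to int4
for card, bw in [("3090-class (~936 GB/s)", 936), ("4090-class (~1008 GB/s)", 1008)]:
    print(f"{card}: ceiling ~{ceiling_tokens_per_s(params_b, 4, bw):.0f} tokens/s")
```

On these assumptions the two cards have similar bandwidth ceilings, so a large gap observed in practice is more likely to come from the software path than from raw bandwidth.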

christkv (Nov 13, 2023)

Is the limit on the speed on inference a memory bandwidth issue or compute?

Havoc (Dec 3, 2023)

Wouldn't that be about as slow as CPU inference?

andrewstuart (Oct 26, 2024)

What if fast new LLMs don’t need a GPU?

volodia (Feb 27, 2025)

Great question! The model can more efficiently leverage existing GPU hardware---it performs more computation per unit of memory transferred; this means that on older hardware one should be able to get similar inference speeds as one would get on recent hardware with a classical LLM. This is actually interesting commercially, since it opens new ways of reducing AI inference costs.
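
The "computation per unit of memory transferred" idea in the comment above is arithmetic intensity in roofline terms: decoding stays bandwidth-bound while intensity is below the hardware's compute-to-bandwidth ratio, and batching raises intensity because the same streamed weights serve every request in the batch. The sketch below uses assumed hardware figures purely for illustration.

```python
# Roofline-style sketch with assumed hardware figures.
# Decode is bandwidth-bound while FLOPs-per-byte < hardware FLOPs/byte ratio;
# batching raises FLOPs-per-byte because streamed weights are reused across requests.

peak_tflops = 300        # assumed peak compute, TFLOP/s
bandwidth_gb_s = 1000    # assumed memory bandwidth, GB/s
hw_flops_per_byte = peak_tflops * 1e12 / (bandwidth_gb_s * 1e9)

# Weight-dominated decode: ~2 FLOPs (multiply-add) per weight per token,
# and each fp16 weight (2 bytes) is streamed once per decode step.
flops_per_byte_per_request = 2 / 2

for batch in (1, 8, 64, 512):
    intensity = flops_per_byte_per_request * batch
    regime = "bandwidth-bound" if intensity < hw_flops_per_byte else "compute-bound"
    print(f"batch {batch:4d}: ~{intensity:.0f} FLOPs/byte -> {regime}")
```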