LLM Inference Performance
This cluster discusses LLM inference speed: bottlenecks such as memory bandwidth and I/O, hardware suitability (GPUs, CPUs, TPUs), quantization, batching, and cross-device comparisons for faster token generation.
Sample Comments
Can one run fast LLM inference on these?
Talks about throughput but doesn't mention memory I/O speed, which should be a bottleneck for LLMs
LLM inference is bottlenecked by memory bandwidth. You'll probably get identical speed with cheaper CPUs.
I want this for LLMs. Having that much smaller a memory footprint would let us put more models on a GPU at a time, and, assuming the clock could keep up, it could more than make up for the loss in inference speed per individual model.
LLM inference speed is bandwidth limited.
How are you getting 20 tokens/second? I'm getting 2.6 tokens/s on a 3090 with an int4 prequantized model. Is the 4090 so much faster?
Is the limit on inference speed a memory bandwidth issue or a compute issue?
Wouldn't that be about as slow as CPU inference?
What if fast new LLMs don’t need a GPU?
Great question! The model can leverage existing GPU hardware more efficiently: it performs more computation per unit of memory transferred, so on older hardware one should be able to get inference speeds similar to those of a classical LLM on recent hardware. This is also interesting commercially, since it opens new ways of reducing AI inference costs.
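A rough back-of-the-envelope check of the bandwidth-bound claim in the comments above (and of the 3090 vs. 4090 question): during single-stream decoding, each generated token requires streaming roughly the full set of weights from memory, so tokens/s is bounded by achievable memory bandwidth divided by model size. The sketch below is illustrative only; the model size (13B parameters), int4 assumption (~0.5 bytes/param), nominal bandwidth figures, and 60% efficiency factor are assumptions, not measurements, and batching, KV-cache traffic, and kernel overhead will shift the numbers.

```python
# Back-of-the-envelope estimate of decode speed for a memory-bandwidth-bound LLM.
# All numbers are illustrative assumptions, not benchmark results.

def est_tokens_per_s(params_billion: float, bytes_per_param: float,
                     mem_bandwidth_gb_s: float, efficiency: float = 0.6) -> float:
    """Each generated token streams roughly all weights once, so
    tokens/s ~= achievable bandwidth / model size in bytes."""
    model_size_gb = params_billion * bytes_per_param
    return efficiency * mem_bandwidth_gb_s / model_size_gb

# Hypothetical case: a 13B model quantized to int4 (~0.5 bytes/param)
# on GPUs with nominal bandwidths of ~936 GB/s (RTX 3090) and ~1008 GB/s (RTX 4090).
for name, bandwidth in [("RTX 3090", 936), ("RTX 4090", 1008)]:
    print(f"{name}: ~{est_tokens_per_s(13, 0.5, bandwidth):.0f} tokens/s (rough upper bound)")
```

Under this model, two GPUs with similar memory bandwidth land at similar decode speeds, which suggests a result like 2.6 tokens/s on a 3090 is more likely a software or offloading issue than a hardware limit.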