LLM Inference Performance
This cluster discusses LLM inference speed: bottlenecks such as memory bandwidth and I/O, hardware suitability (GPUs, CPUs, TPUs), quantization, batching, and cross-device comparisons for faster token generation.
Sample Comments
Can one run fast LLM inference on these?
Talks about throughput but doesn't mention memory I/O speed, which should be a bottleneck for LLMs
LLM inference is bottlenecked by memory bandwidth. You'll probably get identical speed with cheaper CPUs.
I want this for LLMs. Having that much smaller a memory footprint would let us put more models on a GPU at a time, and, assuming the clock could keep up, it could more than make up for the loss in inference speed per individual model.
LLM inference speed is bandwidth limited.
How are you getting 20 tokens/second? I'm getting 2.6 tokens/s on a 3090 with an int4 prequantized model. Is the 4090 so much faster?
Is the limit on inference speed a memory bandwidth issue or a compute issue?
Wouldn't that be about as slow as CPU inference?
What if fast new LLMs don’t need a GPU?
Great question! The model can leverage existing GPU hardware more efficiently: it performs more computation per unit of memory transferred, so on older hardware one should be able to get inference speeds similar to those of a classical LLM on recent hardware. This is also interesting commercially, since it opens new ways of reducing AI inference costs.
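A rough back-of-the-envelope check of the bandwidth-bound claim in the comments above (and of the 3090 vs. 4090 question): during single-stream decoding, each generated token requires streaming roughly the full set of weights from memory, so tokens/s is bounded by achievable memory bandwidth divided by model size. The sketch below is illustrative only; the model size (13B parameters), int4 assumption (~0.5 bytes/param), nominal bandwidth figures, and 60% efficiency factor are assumptions, not measurements, and batching, KV-cache traffic, and kernel overhead will shift the numbers.

```python
# Back-of-the-envelope estimate of decode speed for a memory-bandwidth-bound LLM.
# All numbers are illustrative assumptions, not benchmark results.

def est_tokens_per_s(params_billion: float, bytes_per_param: float,
                     mem_bandwidth_gb_s: float, efficiency: float = 0.6) -> float:
    """Each generated token streams roughly all weights once, so
    tokens/s ~= achievable bandwidth / model size in bytes."""
    model_size_gb = params_billion * bytes_per_param
    return efficiency * mem_bandwidth_gb_s / model_size_gb

# Hypothetical case: a 13B model quantized to int4 (~0.5 bytes/param)
# on GPUs with nominal bandwidths of ~936 GB/s (RTX 3090) and ~1008 GB/s (RTX 4090).
for name, bandwidth in [("RTX 3090", 936), ("RTX 4090", 1008)]:
    print(f"{name}: ~{est_tokens_per_s(13, 0.5, bandwidth):.0f} tokens/s (rough upper bound)")
```

Under this model, two GPUs with similar memory bandwidth land at similar decode speeds, which suggests a result like 2.6 tokens/s on a 3090 is more likely a software or offloading issue than a hardware limit.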