How We Estimate LLM Inference Speed
TL;DR: LLM inference has two speeds: prefill (how fast it reads your prompt) and generation (how fast text appears). We estimate both using GPU specs.
The Two Phases of LLM Inference
Large Language Model inference has two distinct phases with different bottlenecks:
- Prefill (p/s): Prompt processing. Compute-bound — limited by GPU TFLOPS.
- Generation (t/s): Token output. Memory-bound — limited by memory bandwidth.
For short prompts, generation dominates. For long prompts (RAG, documents, code), prefill becomes the bottleneck — that frustrating wait before the first word appears.
Generation Speed Formula (t/s)
Why memory bandwidth? During generation, the GPU must read the entire set of model weights for each token produced. A 14B Q4 model is ~10GB, so generating one token requires streaming ~10GB from VRAM.
Generation speed (t/s) ≈ Memory Bandwidth (GB/s) ÷ Model Size (GB) × 0.75 (real-world efficiency factor)
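This bandwidth-based estimate can be sketched in a few lines; the 0.75 efficiency factor is the one used in this article's worked examples:

```python
def generation_tps(bandwidth_gbps: float, model_size_gb: float,
                   efficiency: float = 0.75) -> float:
    """Tokens/second ~ how many times per second the GPU can
    stream the full model weights out of VRAM."""
    return bandwidth_gbps / model_size_gb * efficiency

# A 936 GB/s card running a ~10 GB 14B Q4 model:
print(round(generation_tps(936, 10)))  # -> 70
```

Note that quantizing the model smaller raises this estimate directly: halve the weights on disk and you roughly double tokens per second.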
Prefill Speed Formula (p/s)
Why TFLOPS? Prefill is a massive parallel matrix multiplication: the GPU processes all prompt tokens at once, limited only by compute power. We use FP16 Tensor TFLOPS where available.
Prefill speed (p/s) ≈ TFLOPS × 1000 ÷ GFLOPs per token × 0.5 (real-world efficiency factor), where GFLOPs per token ≈ 2 × parameters in billions, i.e. ~28 for a 14B model.
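The compute-based estimate looks like this; the 28 GFLOPs/token figure (2 FLOPs per parameter × 14B parameters) and the 0.5 efficiency factor match the worked examples below:

```python
def prefill_pps(tflops: float, params_b: float = 14,
                efficiency: float = 0.5) -> float:
    """Prompt tokens/second ~ compute throughput divided by the
    FLOPs needed per token (~2 FLOPs per model parameter)."""
    gflops_per_token = 2 * params_b  # ~28 GFLOPs for a 14B model
    return tflops * 1000 / gflops_per_token * efficiency

# A card with 142 FP16 Tensor TFLOPS:
print(round(prefill_pps(142)))  # -> 2536
```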
Example Calculations
RTX 3090 (936 GB/s, 142 FP16 Tensor TFLOPS):
- Generation: 936 ÷ 10 × 0.75 = ~70 t/s
- Prefill: 142 × 1000 ÷ 28 × 0.5 = ~2,536 p/s
- Verdict: excellent all-around performance
Tesla P40 (347 GB/s, 12 FP32 TFLOPS):
- Generation: 347 ÷ 10 × 0.75 = ~26 t/s
- Prefill: 12 × 1000 ÷ 28 × 0.5 = ~214 p/s
- Verdict: great $/GB value, but slow prefill — noticeable delay on long prompts
Budget 24GB card (288 GB/s, 7 FP32 TFLOPS):
- Generation: 288 ÷ 10 × 0.75 = ~22 t/s
- Prefill: 7 × 1000 ÷ 28 × 0.5 = ~125 p/s
- Verdict: budget champion — cheap 24GB VRAM, but expect 8+ second waits on 1000-token prompts
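The prefill numbers translate directly into time-to-first-token, the wait before the first word appears. A quick sketch using the prefill speeds estimated above:

```python
def ttft_seconds(prompt_tokens: int, prefill_pps: float) -> float:
    """Time-to-first-token: how long the card chews on the prompt
    before generation can begin."""
    return prompt_tokens / prefill_pps

# A 1000-token prompt at each card's estimated prefill speed:
for pps in (2536, 214, 125):
    print(f"{pps:>5} p/s -> {ttft_seconds(1000, pps):.1f} s wait")
```

At ~125 p/s, a 1000-token prompt means an 8-second pause before the first token, which is where the "8+ second waits" figure above comes from.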
Why This Matters for GPU Shopping
Gaming benchmarks measure FPS in Cyberpunk 2077. We measure something different: how useful is this card for running local AI?
A Tesla P40 from 2016 has zero gaming value but offers 24GB of VRAM at ~$6/GB with decent generation speed. The tradeoff? Slow prefill. If you paste long documents, you'll wait. If you chat with short prompts, it's fine.
Modern cards like the RTX 3090 excel at both — fast prefill AND fast generation. You pay more, but get instant responses even with huge context.
Caveats
- Estimates assume full GPU offload (the model fits entirely in VRAM)
- Actual speed varies by quantization method (GGUF, GPTQ, AWQ, EXL2)
- Context length affects memory usage and can reduce generation speed
- Multi-GPU setups, CPU offload, and hybrid inference have different characteristics
- TFLOPS values use FP16 Tensor where available; older cards use FP32
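To make the context-length caveat concrete, here is a rough KV-cache size calculation. The layer and head counts below are illustrative assumptions for a 14B-class model, not any specific model's configuration:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    """VRAM consumed by the KV cache: two tensors (K and V) per
    layer, cached for every token in the context, at FP16 by default."""
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * context_len * bytes_per_elem)
    return total_bytes / 1024**3

# Hypothetical config: 48 layers, 8 KV heads (GQA), head_dim 128,
# FP16 cache, 4096-token context:
print(f"{kv_cache_gb(48, 8, 128, 4096):.2f} GB")  # -> 0.75 GB
```

That memory comes out of the same VRAM budget as the weights, and the cache itself must be read every token, which is why long contexts slow generation as well as eating capacity.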
Further Reading
- r/LocalLLaMA - Community discussions on local inference
- llama.cpp - Popular inference engine