How We Estimate LLM Inference Speed
TL;DR: LLM inference has two speeds: prefill (how fast it reads your prompt) and generation (how fast text appears). We estimate both using GPU specs.
The Two Phases of LLM Inference
Large Language Model inference has two distinct phases with different bottlenecks:
- Prefill (p/s): Prompt processing. Compute-bound — limited by GPU TFLOPS.
- Generation (t/s): Token output. Memory-bound — limited by memory bandwidth.
For short prompts, generation dominates. For long prompts (RAG, documents, code), prefill becomes the bottleneck — that frustrating wait before the first word appears.
Generation Speed Formula (t/s)
Why memory bandwidth? During generation, the GPU must read the entire set of model weights for each token produced. A 14B Q4 model is ~10GB, so generating one token requires streaming ~10GB from VRAM.
Generation speed (t/s) ≈ Memory Bandwidth (GB/s) ÷ Model Size (GB) × 0.75 (real-world efficiency factor)
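This bandwidth-based estimate can be sketched in a few lines; the 0.75 efficiency factor is the one used in this article's worked examples:

```python
def generation_tps(bandwidth_gbps: float, model_size_gb: float,
                   efficiency: float = 0.75) -> float:
    """Tokens/second ~ how many times per second the GPU can
    stream the full model weights out of VRAM."""
    return bandwidth_gbps / model_size_gb * efficiency

# A 936 GB/s card running a ~10 GB 14B Q4 model:
print(round(generation_tps(936, 10)))  # -> 70
```

Note that quantizing the model smaller raises this estimate directly: halve the weights on disk and you roughly double tokens per second.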
Prefill Speed Formula (p/s)
Why TFLOPS? Prefill is a massive parallel matrix multiplication: the GPU processes all prompt tokens at once, limited only by compute power. We use FP16 Tensor TFLOPS where available.
Prefill speed (p/s) ≈ TFLOPS × 1000 ÷ GFLOPs per token × 0.5 (real-world efficiency factor), where GFLOPs per token ≈ 2 × parameters in billions, i.e. ~28 for a 14B model.
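The compute-based estimate looks like this; the 28 GFLOPs/token figure (2 FLOPs per parameter × 14B parameters) and the 0.5 efficiency factor match the worked examples below:

```python
def prefill_pps(tflops: float, params_b: float = 14,
                efficiency: float = 0.5) -> float:
    """Prompt tokens/second ~ compute throughput divided by the
    FLOPs needed per token (~2 FLOPs per model parameter)."""
    gflops_per_token = 2 * params_b  # ~28 GFLOPs for a 14B model
    return tflops * 1000 / gflops_per_token * efficiency

# A card with 142 FP16 Tensor TFLOPS:
print(round(prefill_pps(142)))  # -> 2536
```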
Example Calculations
RTX 3090 (936 GB/s, 142 FP16 Tensor TFLOPS):
- Generation: 936 ÷ 10 × 0.75 = ~70 t/s
- Prefill: 142 × 1000 ÷ 28 × 0.5 = ~2,536 p/s
- Verdict: excellent all-around performance
Tesla P40 (347 GB/s, 12 FP32 TFLOPS):
- Generation: 347 ÷ 10 × 0.75 = ~26 t/s
- Prefill: 12 × 1000 ÷ 28 × 0.5 = ~214 p/s
- Verdict: great $/GB value, but slow prefill — noticeable delay on long prompts
Budget 24GB card (288 GB/s, 7 FP32 TFLOPS):
- Generation: 288 ÷ 10 × 0.75 = ~22 t/s
- Prefill: 7 × 1000 ÷ 28 × 0.5 = ~125 p/s
- Verdict: budget champion — cheap 24GB VRAM, but expect 8+ second waits on 1000-token prompts
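The prefill numbers translate directly into time-to-first-token, the wait before the first word appears. A quick sketch using the prefill speeds estimated above:

```python
def ttft_seconds(prompt_tokens: int, prefill_pps: float) -> float:
    """Time-to-first-token: how long the card chews on the prompt
    before generation can begin."""
    return prompt_tokens / prefill_pps

# A 1000-token prompt at each card's estimated prefill speed:
for pps in (2536, 214, 125):
    print(f"{pps:>5} p/s -> {ttft_seconds(1000, pps):.1f} s wait")
```

At ~125 p/s, a 1000-token prompt means an 8-second pause before the first token, which is where the "8+ second waits" figure above comes from.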
Why This Matters for GPU Shopping
Gaming benchmarks measure FPS in Cyberpunk 2077. We measure something different: how useful is this card for running local AI?
A Tesla P40 from 2016 has zero gaming value but offers 24GB of VRAM at ~$6/GB with decent generation speed. The tradeoff? Slow prefill. If you paste long documents, you'll wait. If you chat with short prompts, it's fine.
Modern cards like the RTX 3090 excel at both — fast prefill AND fast generation. You pay more, but get instant responses even with huge context.
Caveats
- Estimates assume full GPU offload (the model fits entirely in VRAM)
- Actual speed varies by quantization method (GGUF, GPTQ, AWQ, EXL2)
- Context length affects memory usage and can reduce generation speed
- Multi-GPU setups, CPU offload, and hybrid inference have different characteristics
- TFLOPS values use FP16 Tensor where available; older cards use FP32
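To make the context-length caveat concrete, here is a rough KV-cache size calculation. The layer and head counts below are illustrative assumptions for a 14B-class model, not any specific model's configuration:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    """VRAM consumed by the KV cache: two tensors (K and V) per
    layer, cached for every token in the context, at FP16 by default."""
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * context_len * bytes_per_elem)
    return total_bytes / 1024**3

# Hypothetical config: 48 layers, 8 KV heads (GQA), head_dim 128,
# FP16 cache, 4096-token context:
print(f"{kv_cache_gb(48, 8, 128, 4096):.2f} GB")  # -> 0.75 GB
```

That memory comes out of the same VRAM budget as the weights, and the cache itself must be read every token, which is why long contexts slow generation as well as eating capacity.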
Further Reading
- r/LocalLLaMA - Community discussions on local inference
- llama.cpp - Popular inference engine