How LLM Inference Works
When you type a prompt into ChatGPT and hit enter, something fascinating happens behind the scenes. The model doesn't generate the entire response at once—it produces text token by token, in a carefully orchestrated process called inference.
Understanding how LLM inference works is critical for anyone building AI applications. Whether you're optimizing latency, managing costs, or scaling to millions of users, knowing the difference between prefill and decode can save you thousands of dollars and countless hours of debugging.
The Two Phases of Inference
LLM inference operates in two distinct phases, each with fundamentally different performance characteristics:
| Phase | What It Does | Bottleneck | Parallelization |
|---|---|---|---|
| Prefill | Process input prompt | Compute | Full parallel |
| Decode | Generate output tokens | Memory bandwidth | Sequential |
Phase 1: Prefill
When you send a prompt like "Explain quantum computing in simple terms", the model first needs to "understand" this input. This is the prefill phase.
During prefill, the entire input sequence is processed in parallel. For each layer in the transformer, the model computes:
- Q (Query) matrices
- K (Key) matrices
- V (Value) matrices
The K and V matrices are then stored in what's called the KV Cache. This cache is crucial—it means we don't have to recompute these values for every new token we generate.
Simplified prefill pseudocode:
```python
def prefill(prompt_tokens):
    kv_cache = {}
    for layer in model.layers:
        # Process ALL tokens in parallel
        Q = layer.compute_queries(prompt_tokens)
        K = layer.compute_keys(prompt_tokens)
        V = layer.compute_values(prompt_tokens)
        # Store K and V for later use
        kv_cache[layer] = (K, V)
    return kv_cache
```
Because this happens in parallel across all input tokens, prefill is compute-bound and highly efficient on modern GPUs.
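One way to see why the two phases behave so differently is arithmetic intensity: how many FLOPs the GPU performs per byte of weights it reads. A rough back-of-envelope sketch, using the common ~2 FLOPs per parameter per token approximation (the 70B parameter count and token counts are illustrative assumptions):

```python
def arithmetic_intensity(params: int, tokens: int, bytes_per_param: int = 2) -> float:
    """FLOPs performed per byte of weights read in one forward pass (FP16 weights)."""
    flops = 2 * params * tokens            # ~2 FLOPs per parameter per token (matmuls)
    bytes_read = params * bytes_per_param  # weights must be streamed at least once
    return flops / bytes_read

params = 70_000_000_000  # e.g. a 70B-parameter model (illustrative)
print(arithmetic_intensity(params, 2048))  # prefill, 2048-token prompt → 2048.0 FLOPs/byte
print(arithmetic_intensity(params, 1))     # decode, one token at a time → 1.0 FLOP/byte
```

At ~1 FLOP per byte, decode is far below what a modern GPU can sustain computationally, so it sits waiting on memory; prefill's thousands of FLOPs per byte keep the compute units busy.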
Phase 2: Decode
Once the prompt is processed, the model starts generating the response. This is the decode phase, and it's where things get interesting (and slower).
The model generates one token at a time. Here's the process:
- Take all previous tokens (prompt + generated tokens)
- Predict the next most likely token
- Append it to the sequence
- Feed it back into the model
- Repeat until done
Crucially, instead of recomputing K and V matrices for the entire history every time, the model reuses the KV Cache from prefill and only updates it with the new token's information.
```python
def decode(kv_cache, max_tokens=100):
    generated = []
    for _ in range(max_tokens):
        # Only process the LAST token (on the first step, the final prompt token)
        new_token = model.predict_next(kv_cache, generated[-1] if generated else None)
        generated.append(new_token)
        # Append the new token's K and V to the cache
        for layer in model.layers:
            K_new = layer.compute_keys([new_token])
            V_new = layer.compute_values([new_token])
            kv_cache[layer] = concat(kv_cache[layer], (K_new, V_new))
        if new_token == END_TOKEN:
            break
    return generated
```
> The fundamental difference: prefill is compute-bound and parallel, while decode is memory-bound and sequential. This is why generating long responses feels slower than processing long prompts.
Why Decode is Memory-Bound
You might notice that generating text feels slower than processing the prompt. This is because decode is memory-bound, not compute-bound.
For every single token generated, the GPU has to:
- Load the entire model weights from VRAM (~140 GB for Llama 2 70B in FP16, or ~70 GB at 8-bit)
- Load the KV cache for all previous tokens
- Perform a relatively small amount of computation
- Write the result back to memory
Since the computation for just one token is tiny compared to the size of the weights, the GPU spends most of its time waiting for data to move from memory rather than computing.
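This gives a simple ceiling on decode speed: if every generated token must stream all the weights from VRAM once, memory bandwidth divided by model size bounds tokens per second. A rough sketch (the ~70 GB weight size, e.g. an 8-bit 70B model, and ~2 TB/s of HBM bandwidth are illustrative assumptions):

```python
def max_decode_tokens_per_sec(model_bytes: float, bandwidth_bytes_per_sec: float) -> float:
    """Upper bound: each decoded token requires streaming all weights once."""
    return bandwidth_bytes_per_sec / model_bytes

# ~70 GB of weights on a GPU with ~2 TB/s memory bandwidth
print(round(max_decode_tokens_per_sec(70e9, 2e12), 1))  # → 28.6 tokens/s ceiling
```

Real throughput is lower still (the KV cache must also be read), which is why batching multiple requests — amortizing each weight read across many tokens — is the main lever for throughput.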
Optimization Techniques
Engineers use several techniques to speed up inference:
1. KV Caching
As mentioned, storing attention keys and values avoids redundant computation. Without KV caching, inference would be 10-20x slower.
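The cache isn't free, though: it grows linearly with context length and batch size. A quick sizing sketch, assuming a Llama-2-70B-like shape (80 layers, 8 KV heads of dimension 128 under grouped-query attention, FP16 — illustrative figures for this estimate):

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    # 2 tensors (K and V) per layer; one (kv_heads * head_dim) vector per token each
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

print(kv_cache_bytes(80, 8, 128, 1))            # → 327680 bytes, ~320 KB per token
print(kv_cache_bytes(80, 8, 128, 4096) / 2**30) # → 1.25 GiB for a 4K-token context
```

Multiply that by the batch size and it becomes clear why serving systems work hard to manage KV cache memory.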
2. Quantization
Reducing the precision of weights (e.g., from FP16 to INT4) reduces memory usage and bandwidth requirements:
| Precision | Memory Reduction | Speed Improvement | Quality Impact |
|---|---|---|---|
| FP16 (baseline) | - | - | - |
| INT8 | 50% | 2-3x | Minimal |
| INT4 | 75% | 3-4x | Slight |
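The memory-reduction column follows directly from bits per weight. A quick check for a 70B-parameter model (the parameter count is illustrative):

```python
def weight_bytes(params: int, bits: int) -> int:
    """Total bytes needed to store the weights at a given precision."""
    return params * bits // 8

params = 70_000_000_000
for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_bytes(params, bits) / 1e9:.0f} GB")
# FP16: 140 GB → INT8: 70 GB (−50%) → INT4: 35 GB (−75%)
```

Because decode is bandwidth-bound, halving the bytes per weight roughly halves the data moved per token, which is where the speedup comes from.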
3. Speculative Decoding
Using a smaller "draft" model to guess several future tokens at once, which the larger model then verifies in parallel. This can provide 2-3x speedup with no quality loss.
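A toy sketch of the propose-and-verify loop, with greedy integer-token stand-ins for both models (`target_next` and `draft_next` are hypothetical functions for illustration, not a real decoding API):

```python
def speculative_step(target_next, draft_next, context, k=4):
    """One round: the draft proposes k tokens; the target verifies them.
    target_next/draft_next map a token sequence to the next token (greedy)."""
    proposal, ctx = [], list(context)
    for _ in range(k):
        t = draft_next(ctx)
        proposal.append(t)
        ctx.append(t)
    # A real system scores all k positions in ONE batched target pass;
    # here we emulate that by checking the target's choice at each prefix.
    accepted, ctx = [], list(context)
    for t in proposal:
        if target_next(ctx) != t:
            break
        accepted.append(t)
        ctx.append(t)
    accepted.append(target_next(ctx))  # the target always contributes one token
    return accepted

# Toy models over integer tokens: the draft disagrees once (at context length 2)
target = lambda ctx: len(ctx)
draft = lambda ctx: 99 if len(ctx) == 2 else len(ctx)
print(speculative_step(target, draft, [0]))  # → [1, 2]: one draft token kept + one from the target
```

When the draft agrees often, several tokens are accepted per target pass, amortizing the expensive weight reads — and because the target verifies everything, the output distribution is unchanged.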
4. Flash Attention
An optimized attention algorithm that reduces memory reads/writes by reordering operations. See the FlashAttention-2 paper (Dao et al., 2023) for details.
Common Misconceptions
- Bigger GPUs always mean faster inference
- Not necessarily. What matters is memory bandwidth, not just capacity. An H100 with faster memory can outperform an A100 with more but slower VRAM.
- Batch size should be maximized
- For inference, larger batches increase latency (unlike training, where larger batches are generally better). You want to balance throughput and latency based on your use case.
- Quantization always degrades quality
- Modern quantization techniques like GPTQ and AWQ can maintain near-original quality even at INT4 precision. The key is calibration—using representative data to minimize quantization error.
- All tokens cost the same to generate
- False! The first token is produced by the compute-bound prefill phase, while every subsequent token comes from the memory-bound decode phase, so their costs differ. This is why time to first token (TTFT) and time per output token (TPOT) are measured separately.
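The two metrics combine into a simple end-to-end latency model: TTFT covers prefill plus the first token, and each additional output token adds one TPOT. A sketch with made-up timings (the 200 ms and 30 ms figures are illustrative, not benchmarks):

```python
def total_latency_ms(ttft_ms: float, tpot_ms: float, output_tokens: int) -> float:
    """End-to-end latency: TTFT for the first token, one TPOT per token after it."""
    return ttft_ms + tpot_ms * (output_tokens - 1)

# e.g. 200 ms to first token, 30 ms per subsequent token, 100-token response
print(total_latency_ms(200, 30, 100))  # → 3170.0 ms, dominated by decode
```

Note how quickly TPOT dominates: for long responses, decode speed matters far more than prefill speed.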
Key Takeaways
When optimizing LLM inference, remember these critical points:
- Memory bandwidth is the bottleneck, not compute capacity
- Prefill processes all tokens in parallel (fast), decode is sequential (slow)
- KV caching is essential—without it, inference would be unusably slow
- Always profile before optimizing—measure where the actual bottleneck is
- Quantization can provide 3-4x speedup with minimal quality loss
The most important thing to remember: inference optimization is about moving less data, not doing more computation.
If you're deploying LLMs in production, focus on:
- Reducing model size through quantization
- Maximizing memory bandwidth (choose GPUs wisely)
- Implementing KV caching efficiently
- Batching requests intelligently
Last updated: November 2025 • Reading time: 12 minutes