How LLM Inference Works
When you type a prompt into ChatGPT and hit enter, something fascinating happens behind the scenes. The model doesn't generate the entire response at once—it produces text token by token, in a carefully orchestrated process called inference.
Understanding how LLM inference works is critical for anyone building AI applications. Whether you're optimizing latency, managing costs, or scaling to millions of users, knowing the difference between prefill and decode can save you thousands of dollars and countless hours of debugging.
The Two Phases of Inference
LLM inference operates in two distinct phases, each with fundamentally different performance characteristics:
| Phase | What It Does | Bottleneck | Parallelization |
|---|---|---|---|
| Prefill | Process input prompt | Compute | Full parallel |
| Decode | Generate output tokens | Memory bandwidth | Sequential |
Phase 1: Prefill
When you send a prompt like "Explain quantum computing in simple terms", the model first needs to "understand" this input. This is the prefill phase.
During prefill, the entire input sequence is processed in parallel. For each layer in the transformer, the model computes:
- Q (Query) matrices
- K (Key) matrices
- V (Value) matrices
The K and V matrices are then stored in what's called the KV Cache. This cache is crucial—it means we don't have to recompute these values for every new token we generate.
Simplified prefill pseudocode:
```python
def prefill(prompt_tokens):
    kv_cache = {}
    for layer in model.layers:
        # Process ALL tokens in parallel
        Q = layer.compute_queries(prompt_tokens)
        K = layer.compute_keys(prompt_tokens)
        V = layer.compute_values(prompt_tokens)
        # Store K and V for later use
        kv_cache[layer] = (K, V)
    return kv_cache
```
Because this happens in parallel across all input tokens, prefill is compute-bound and highly efficient on modern GPUs.
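One way to see why the two phases behave so differently is arithmetic intensity: how many FLOPs the GPU performs per byte of weights it reads. A rough back-of-envelope sketch, using the common ~2 FLOPs per parameter per token approximation (the 70B parameter count and token counts are illustrative assumptions):

```python
def arithmetic_intensity(params: int, tokens: int, bytes_per_param: int = 2) -> float:
    """FLOPs performed per byte of weights read in one forward pass (FP16 weights)."""
    flops = 2 * params * tokens            # ~2 FLOPs per parameter per token (matmuls)
    bytes_read = params * bytes_per_param  # weights must be streamed at least once
    return flops / bytes_read

params = 70_000_000_000  # e.g. a 70B-parameter model (illustrative)
print(arithmetic_intensity(params, 2048))  # prefill, 2048-token prompt → 2048.0 FLOPs/byte
print(arithmetic_intensity(params, 1))     # decode, one token at a time → 1.0 FLOP/byte
```

At ~1 FLOP per byte, decode is far below what a modern GPU can sustain computationally, so it sits waiting on memory; prefill's thousands of FLOPs per byte keep the compute units busy.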
Phase 2: Decode
Once the prompt is processed, the model starts generating the response. This is the decode phase, and it's where things get interesting (and slower).
The model generates one token at a time. Here's the process:
- Take all previous tokens (prompt + generated tokens)
- Predict the next most likely token
- Append it to the sequence
- Feed it back into the model
- Repeat until done
Crucially, instead of recomputing K and V matrices for the entire history every time, the model reuses the KV Cache from prefill and only updates it with the new token's information.
```python
def decode(kv_cache, max_tokens=100):
    generated = []
    for _ in range(max_tokens):
        # Only process the LAST token (on the first step, the final prompt token)
        new_token = model.predict_next(kv_cache, generated[-1] if generated else None)
        generated.append(new_token)
        # Append the new token's K and V to the cache
        for layer in model.layers:
            K_new = layer.compute_keys([new_token])
            V_new = layer.compute_values([new_token])
            kv_cache[layer] = concat(kv_cache[layer], (K_new, V_new))
        if new_token == END_TOKEN:
            break
    return generated
```
> The fundamental difference: prefill is compute-bound and parallel, while decode is memory-bound and sequential. This is why generating long responses feels slower than processing long prompts.
Why Decode is Memory-Bound
You might notice that generating text feels slower than processing the prompt. This is because decode is memory-bound, not compute-bound.
For every single token generated, the GPU has to:
- Load the entire model weights from VRAM (~140 GB for Llama 2 70B in FP16, or ~70 GB at 8-bit)
- Load the KV cache for all previous tokens
- Perform a relatively small amount of computation
- Write the result back to memory
Since the computation for just one token is tiny compared to the size of the weights, the GPU spends most of its time waiting for data to move from memory rather than computing.
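This gives a simple ceiling on decode speed: if every generated token must stream all the weights from VRAM once, memory bandwidth divided by model size bounds tokens per second. A rough sketch (the ~70 GB weight size, e.g. an 8-bit 70B model, and ~2 TB/s of HBM bandwidth are illustrative assumptions):

```python
def max_decode_tokens_per_sec(model_bytes: float, bandwidth_bytes_per_sec: float) -> float:
    """Upper bound: each decoded token requires streaming all weights once."""
    return bandwidth_bytes_per_sec / model_bytes

# ~70 GB of weights on a GPU with ~2 TB/s memory bandwidth
print(round(max_decode_tokens_per_sec(70e9, 2e12), 1))  # → 28.6 tokens/s ceiling
```

Real throughput is lower still (the KV cache must also be read), which is why batching multiple requests — amortizing each weight read across many tokens — is the main lever for throughput.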
Optimization Techniques
Engineers use several techniques to speed up inference:
1. KV Caching
As mentioned, storing attention keys and values avoids redundant computation. Without KV caching, inference would be 10-20x slower.
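The cache isn't free, though: it grows linearly with context length and batch size. A quick sizing sketch, assuming a Llama-2-70B-like shape (80 layers, 8 KV heads of dimension 128 under grouped-query attention, FP16 — illustrative figures for this estimate):

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    # 2 tensors (K and V) per layer; one (kv_heads * head_dim) vector per token each
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

print(kv_cache_bytes(80, 8, 128, 1))            # → 327680 bytes, ~320 KB per token
print(kv_cache_bytes(80, 8, 128, 4096) / 2**30) # → 1.25 GiB for a 4K-token context
```

Multiply that by the batch size and it becomes clear why serving systems work hard to manage KV cache memory.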
2. Quantization
Reducing the precision of weights (e.g., from FP16 to INT4) reduces memory usage and bandwidth requirements:
| Precision | Memory Reduction | Speed Improvement | Quality Impact |
|---|---|---|---|
| FP16 (baseline) | - | - | - |
| INT8 | 50% | 2-3x | Minimal |
| INT4 | 75% | 3-4x | Slight |
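The memory-reduction column follows directly from bits per weight. A quick check for a 70B-parameter model (the parameter count is illustrative):

```python
def weight_bytes(params: int, bits: int) -> int:
    """Total bytes needed to store the weights at a given precision."""
    return params * bits // 8

params = 70_000_000_000
for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_bytes(params, bits) / 1e9:.0f} GB")
# FP16: 140 GB → INT8: 70 GB (−50%) → INT4: 35 GB (−75%)
```

Because decode is bandwidth-bound, halving the bytes per weight roughly halves the data moved per token, which is where the speedup comes from.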
3. Speculative Decoding
Using a smaller "draft" model to guess several future tokens at once, which the larger model then verifies in parallel. This can provide 2-3x speedup with no quality loss.
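A toy sketch of the propose-and-verify loop, with greedy integer-token stand-ins for both models (`target_next` and `draft_next` are hypothetical functions for illustration, not a real decoding API):

```python
def speculative_step(target_next, draft_next, context, k=4):
    """One round: the draft proposes k tokens; the target verifies them.
    target_next/draft_next map a token sequence to the next token (greedy)."""
    proposal, ctx = [], list(context)
    for _ in range(k):
        t = draft_next(ctx)
        proposal.append(t)
        ctx.append(t)
    # A real system scores all k positions in ONE batched target pass;
    # here we emulate that by checking the target's choice at each prefix.
    accepted, ctx = [], list(context)
    for t in proposal:
        if target_next(ctx) != t:
            break
        accepted.append(t)
        ctx.append(t)
    accepted.append(target_next(ctx))  # the target always contributes one token
    return accepted

# Toy models over integer tokens: the draft disagrees once (at context length 2)
target = lambda ctx: len(ctx)
draft = lambda ctx: 99 if len(ctx) == 2 else len(ctx)
print(speculative_step(target, draft, [0]))  # → [1, 2]: one draft token kept + one from the target
```

When the draft agrees often, several tokens are accepted per target pass, amortizing the expensive weight reads — and because the target verifies everything, the output distribution is unchanged.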
4. Flash Attention
An optimized attention algorithm that reduces memory reads/writes by reordering operations. See the FlashAttention-2 paper (Dao et al., 2023) for details.
Common Misconceptions
- Bigger GPUs always mean faster inference
- Not necessarily. What matters is memory bandwidth, not just capacity. An H100 with faster memory can outperform an A100 with more but slower VRAM.
- Batch size should be maximized
- For inference, larger batches increase latency (unlike training, where larger batches are generally better). You want to balance throughput and latency based on your use case.
- Quantization always degrades quality
- Modern quantization techniques like GPTQ and AWQ can maintain near-original quality even at INT4 precision. The key is calibration—using representative data to minimize quantization error.
- All tokens cost the same to generate
- False! The first token is produced by the compute-bound prefill phase, while every subsequent token comes from the memory-bound decode phase, so their costs differ. This is why time to first token (TTFT) and time per output token (TPOT) are measured separately.
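The two metrics combine into a simple end-to-end latency model: TTFT covers prefill plus the first token, and each additional output token adds one TPOT. A sketch with made-up timings (the 200 ms and 30 ms figures are illustrative, not benchmarks):

```python
def total_latency_ms(ttft_ms: float, tpot_ms: float, output_tokens: int) -> float:
    """End-to-end latency: TTFT for the first token, one TPOT per token after it."""
    return ttft_ms + tpot_ms * (output_tokens - 1)

# e.g. 200 ms to first token, 30 ms per subsequent token, 100-token response
print(total_latency_ms(200, 30, 100))  # → 3170.0 ms, dominated by decode
```

Note how quickly TPOT dominates: for long responses, decode speed matters far more than prefill speed.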
Key Takeaways
When optimizing LLM inference, remember these critical points:
- Memory bandwidth is the bottleneck, not compute capacity
- Prefill processes all tokens in parallel (fast), decode is sequential (slow)
- KV caching is essential—without it, inference would be unusably slow
- Always profile before optimizing—measure where the actual bottleneck is
- Quantization can provide 3-4x speedup with minimal quality loss
The most important thing to remember: inference optimization is about moving less data, not doing more computation.
If you're deploying LLMs in production, focus on:
- Reducing model size through quantization
- Maximizing memory bandwidth (choose GPUs wisely)
- Implementing KV caching efficiently
- Batching requests intelligently
Last updated: November 2025 • Reading time: 12 minutes