
how llm inference works

Nov 22, 2025 • 12 min read

When you type a prompt into ChatGPT and hit enter, something fascinating happens behind the scenes. The model doesn't generate the entire response at once—it produces text token by token, in a carefully orchestrated process called inference.

Understanding how LLM inference works is critical for anyone building AI applications. Whether you're optimizing latency, managing costs, or scaling to millions of users, knowing the difference between prefill and decode can save you thousands of dollars and countless hours of debugging.

The Two Phases of Inference

LLM inference operates in two distinct phases, each with fundamentally different performance characteristics:

Phase   | What It Does           | Bottleneck       | Parallelization
--------|------------------------|------------------|----------------
Prefill | Process input prompt   | Compute          | Fully parallel
Decode  | Generate output tokens | Memory bandwidth | Sequential

Phase 1: Prefill

When you send a prompt like "Explain quantum computing in simple terms", the model first needs to "understand" this input. This is the prefill phase.

During prefill, the entire input sequence is processed in parallel. For each layer in the transformer, the model computes:

  • Q (Query) matrices
  • K (Key) matrices
  • V (Value) matrices

The K and V matrices are then stored in what's called the KV Cache. This cache is crucial—it means we don't have to recompute these values for every new token we generate.
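To get a feel for how big this cache actually is, here is a back-of-the-envelope size calculation. The helper function and the FP16 assumption are mine; the shape figures match Llama 2 70B's published configuration (80 layers, 8 KV heads via grouped-query attention, head dimension 128):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """KV cache size: 2 tensors (K and V) per layer, FP16 by default."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Llama 2 70B: 80 layers, 8 KV heads (GQA), head_dim 128
per_token = kv_cache_bytes(80, 8, 128, seq_len=1)        # 327,680 bytes (~320 KiB/token)
full_ctx = kv_cache_bytes(80, 8, 128, seq_len=4096)      # ~1.25 GiB at full context
```

So a single 4K-context sequence adds over a gigabyte of cache on top of the weights, which is why serving many concurrent users is as much a memory problem as a compute one.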

Simplified prefill pseudocode:
def prefill(prompt_tokens):
    kv_cache = {}
    hidden = model.embed(prompt_tokens)

    for layer in model.layers:
        # Process ALL tokens in parallel.
        # Q, K, V come from the layer's input hidden states,
        # not from the raw tokens.
        Q = layer.compute_queries(hidden)
        K = layer.compute_keys(hidden)
        V = layer.compute_values(hidden)

        # Store K and V for later use
        kv_cache[layer] = (K, V)

        hidden = layer.forward(hidden, Q, K, V)

    return kv_cache

Because this happens in parallel across all input tokens, prefill is compute-bound and highly efficient on modern GPUs.

Phase 2: Decode

Once the prompt is processed, the model starts generating the response. This is the decode phase, and it's where things get interesting (and slower).

The model generates one token at a time. Here's the process:

  1. Take all previous tokens (prompt + generated tokens)
  2. Predict the next most likely token
  3. Append it to the sequence
  4. Feed it back into the model
  5. Repeat until done

Crucially, instead of recomputing K and V matrices for the entire history every time, the model reuses the KV Cache from prefill and only updates it with the new token's information.

def decode(kv_cache, last_prompt_token, max_tokens=100):
    generated = []
    prev_token = last_prompt_token

    for _ in range(max_tokens):
        # Only process the LAST token; attention over the full
        # history comes from the KV cache, not recomputation
        new_token = model.predict_next(kv_cache, prev_token)
        generated.append(new_token)

        # Append the new token's K and V to each layer's cache
        for layer in model.layers:
            K_new = layer.compute_keys([new_token])
            V_new = layer.compute_values([new_token])
            kv_cache[layer] = concat(kv_cache[layer], (K_new, V_new))

        if new_token == END_TOKEN:
            break
        prev_token = new_token

    return generated

"The fundamental difference: prefill is compute-bound and parallel, while decode is memory-bound and sequential. This is why generating long responses feels slower than processing long prompts."

Why Decode is Memory-Bound

You might notice that generating text feels slower than processing the prompt. This is because decode is memory-bound, not compute-bound.

For every single token generated, the GPU has to:

  • Load the entire model weights from VRAM (~70GB for Llama 2 70B)
  • Load the KV cache for all previous tokens
  • Perform a relatively small amount of computation
  • Write the result back to memory

Since the computation for just one token is tiny compared to the size of the weights, the GPU spends most of its time waiting for data to move from memory, not computing.
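This gives a simple upper bound on single-stream decode speed: each generated token streams all the weights through the GPU once, so tokens/s can't exceed memory bandwidth divided by model size. A rough sketch, using approximate spec-sheet bandwidth numbers:

```python
def decode_tokens_per_sec_bound(model_bytes, mem_bandwidth_bytes_per_s):
    """Roofline upper bound: every decode step reads all weights once."""
    return mem_bandwidth_bytes_per_s / model_bytes

llama2_70b_fp16 = 70e9 * 2  # ~140 GB of FP16 weights

print(decode_tokens_per_sec_bound(llama2_70b_fp16, 2.0e12))   # A100 (~2 TB/s): ~14 tok/s
print(decode_tokens_per_sec_bound(llama2_70b_fp16, 3.35e12))  # H100 SXM (~3.35 TB/s): ~24 tok/s
```

Real systems recover throughput by batching: the same weight read is amortized across many sequences decoding in parallel.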

Optimization Techniques

Engineers use several techniques to speed up inference:

1. KV Caching

As mentioned, storing attention keys and values avoids redundant computation. Without KV caching, inference would be 10-20x slower.
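A rough way to see where that speedup comes from is to count how many token positions run through the model with and without the cache. This is a toy count of work, not wall-clock time; the measured speedup is smaller than the raw ratio because per-step overheads don't shrink proportionally:

```python
def positions_processed(prompt_len, gen_len, use_cache):
    """Count token positions pushed through the model during generation."""
    if use_cache:
        # Prefill touches each prompt token once; each decode step adds one.
        return prompt_len + gen_len
    # Without a cache, every step reprocesses the entire growing sequence.
    return sum(prompt_len + i for i in range(gen_len))

print(positions_processed(512, 512, use_cache=True))   # 1024
print(positions_processed(512, 512, use_cache=False))  # 392960
```

The uncached cost grows quadratically in sequence length, which is why the gap widens dramatically for long generations.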

2. Quantization

Reducing the precision of weights (e.g., from FP16 to INT4) reduces memory usage and bandwidth requirements:

Precision       | Memory Reduction | Speed Improvement | Quality Impact
----------------|------------------|-------------------|---------------
FP16 (baseline) | -                | -                 | -
INT8            | 50%              | 2-3x              | Minimal
INT4            | 75%              | 3-4x              | Slight
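The core idea fits in a few lines: map each weight to a small integer plus a shared floating-point scale. This is a minimal symmetric absmax scheme for illustration; production methods like GPTQ and AWQ are considerably more sophisticated about where the error goes:

```python
def quantize_int8(weights):
    """Symmetric absmax quantization: one FP scale shared by the group."""
    scale = max(abs(w) for w in weights) / 127
    return [round(w / scale) for w in weights], scale

def dequantize(quantized, scale):
    return [q * scale for q in quantized]

q, s = quantize_int8([0.5, -1.27, 0.03])
print(q)                 # [50, -127, 3] — 1 byte each instead of 2 (FP16)
print(dequantize(q, s))  # close to the original values
```

The reconstruction error is bounded by half the scale per weight, which is why calibrating the scale on representative data matters so much at lower bit widths.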

3. Speculative Decoding

Using a smaller "draft" model to guess several future tokens at once, which the larger model then verifies in parallel. This can provide 2-3x speedup with no quality loss.
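The accept/reject loop can be sketched with greedy (deterministic) toy models, so verification is a simple equality check. In a real system the target model verifies all k draft positions in one batched forward pass, and sampling uses a probabilistic acceptance rule; both are simplified away here:

```python
def speculative_step(target_next, draft_next, context, k=4):
    """One round: the draft proposes k tokens; the target keeps the agreeing prefix."""
    # Draft model proposes k tokens autoregressively (cheap).
    ctx, proposals = list(context), []
    for _ in range(k):
        t = draft_next(ctx)
        proposals.append(t)
        ctx.append(t)

    # Target model checks each proposal (one parallel pass in practice).
    ctx, accepted = list(context), []
    for t in proposals:
        if target_next(ctx) != t:
            break
        accepted.append(t)
        ctx.append(t)

    # On a mismatch (or empty acceptance), emit the target's own token,
    # so every round makes progress and output quality is unchanged.
    if len(accepted) < k:
        accepted.append(target_next(ctx))
    return accepted
```

When draft and target agree, one round yields k tokens for a single batched target pass; when they diverge, you still get the target's own next token, which is why there is no quality loss.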

4. Flash Attention

An optimized attention algorithm that reduces memory reads/writes by reordering operations. See the FlashAttention-2 paper (Dao et al., 2023) for details.
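The full algorithm is too involved to reproduce here, but the numerical trick it builds on, online (one-pass) softmax, fits in a few lines. This sketch computes a softmax-weighted sum of scalar values in a single streaming pass, never materializing the full score vector; in FlashAttention the same rescaling recurrence runs over tiles of the attention matrix held in fast on-chip SRAM:

```python
import math

def online_softmax_weighted_sum(scores, values):
    """Streaming softmax(scores) · values, with a running max for stability."""
    m, denom, acc = float("-inf"), 0.0, 0.0
    for s, v in zip(scores, values):
        m_new = max(m, s)
        correction = math.exp(m - m_new)   # rescale old terms to the new max
        denom = denom * correction + math.exp(s - m_new)
        acc = acc * correction + math.exp(s - m_new) * v
        m = m_new
    return acc / denom
```

Because the running max keeps every exponent non-positive, this stays stable even for scores that would overflow a naive softmax.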

Common Misconceptions

Myth: Bigger GPUs always mean faster inference
Not necessarily, and more VRAM isn't automatically better: what matters is memory bandwidth, not just capacity. An H100 with faster memory can outperform an A100 with more but slower VRAM.

Myth: Batch size should be maximized
For inference, larger batches increase per-request latency (unlike training, where larger batches are generally better). You want to balance throughput and latency based on your use case.

Myth: Quantization always degrades quality
Modern quantization techniques like GPTQ and AWQ can maintain near-original quality even at INT4 precision. The key is calibration: using representative data to minimize quantization error.

Myth: All tokens cost the same to generate
False! The first token's latency is dominated by prefill (processing the whole prompt in parallel), while every subsequent token pays the sequential decode cost. This is why TTFT (time to first token) and TPOT (time per output token) are measured separately.

Key Takeaways

When optimizing LLM inference, remember these critical points:

  • Memory bandwidth is the bottleneck, not compute capacity
  • Prefill processes all tokens in parallel (fast), decode is sequential (slow)
  • KV caching is essential—without it, inference would be unusably slow
  • Always profile before optimizing—measure where the actual bottleneck is
  • Quantization can provide 3-4x speedup with minimal quality loss

The most important thing to remember: inference optimization is about moving less data, not doing more computation.

If you're deploying LLMs in production, focus on:

  1. Reducing model size through quantization
  2. Maximizing memory bandwidth (choose GPUs wisely)
  3. Implementing KV caching efficiently
  4. Batching requests intelligently
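For point 4, the key serving-side idea is continuous (token-level) batching: finished sequences leave the batch after any step and queued requests join immediately, instead of the whole batch draining before new work starts. A toy scheduler under those assumptions (real servers such as vLLM add paged KV-cache memory management on top):

```python
from collections import deque

def continuous_batching(requests, step_fn, max_batch=8):
    """requests: iterable of (request_id, tokens_to_generate)."""
    queue = deque(requests)
    active = {}       # request_id -> tokens still needed
    finished = []

    while queue or active:
        # Admit new requests as soon as slots free up,
        # not only between full batches.
        while queue and len(active) < max_batch:
            rid, need = queue.popleft()
            active[rid] = need
        # One decode step generates one token for every active sequence.
        step_fn(list(active))
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]
                finished.append(rid)
    return finished
```

With static batching, a short request stuck behind a long one in the same batch wastes GPU steps; here the short request's slot is recycled the moment it finishes.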
