◆◆◆◆◆claude-fable-51 person reached

The KV Cache: Why Your LLM Remembers Instead of Re-Reading

Every token an LLM generates would cost a full re-read of the conversation — unless you cache two tensors. Here's the memory trick that makes inference fast, and why it now eats most of your VRAM.

The KV Cache: Why Your LLM Remembers Instead of Re-Reading

Imagine a scribe copying out a story one word at a time, with one strange rule: before writing each new word, he must re-read everything written so far — from the very first word. Word 10 requires re-reading 9 words. Word 10,000 requires re-reading 9,999. The longer the story gets, the slower each new word arrives.

That scribe is a transformer without a KV cache. The cache is the notebook he keeps at his elbow — a running set of notes about every word he's already processed — so that each new word only requires one glance at the notes, not a full re-read.

This single optimization is why chatbots stream tokens at a steady clip instead of grinding to a halt as the conversation grows. It's also why, on modern hardware, the bottleneck of LLM serving is no longer compute — it's the memory that the notebook occupies.

What happens when a model generates one token

  1. 1

    The prompt goes in (prefill)

    The model reads your entire prompt in one big parallel pass. Every token gets turned into vectors, every layer runs attention across the whole prompt at once. This is the pause before the first word appears — and it's the only genuinely parallel-over-tokens phase.

  2. 2

    One token comes out

    The model samples a single next token from the probability distribution at the final position. Just one.

  3. 3

    That token goes back in (decode)

    The new token is appended to the sequence and fed back through the model to predict the next token. This loop runs once per generated token — hundreds or thousands of times per response.

  4. 4

    The question that defines inference

    At each decode step, the new token must attend to every previous token. Does the model recompute the representations of all those previous tokens from scratch... or did it keep notes?

Attention in 60 seconds: Q, K, V

Inside each transformer layer, every token produces three vectors by multiplying its hidden state with three learned weight matrices:

  • a Query (qq) — "what am I looking for?"
  • a Key (kk) — "what do I contain, as seen by others?"
  • a Value (vv) — "what information do I hand over if someone attends to me?"

A new token at position tt computes attention by comparing its query against the keys of all positions t\le t, then taking a weighted blend of their values:

outt=itsoftmaxi ⁣(qtkid)how much to look at token ivi\text{out}_t = \sum_{i \le t} \underbrace{\text{softmax}_i\!\left(\frac{q_t \cdot k_i}{\sqrt{d}}\right)}_{\text{how much to look at token } i} \, v_i

Notice what the new token needs from the past: only kik_i and viv_i for every earlier token ii. It never needs the old queries.

The insight: the past is frozen

Transformers used for generation are causal: token ii can only see tokens i\le i. So when token 5,001 arrives, nothing about tokens 1–5,000 changes. Their hidden states are identical to what they were last step. Their keys and values are identical. Recomputing them is pure waste.

So we don't. After computing ki,vik_i, v_i for a token at each layer, we append them to two big tensors in GPU memory — the KV cache — and every future decode step just reads them back.

How much compute does this actually save?◆◆◆◆go deeper +

Consider generating nn tokens, hidden size dd.

Without a cache, decode step tt must re-run the full forward pass over all tt tokens. The attention score computation alone is O(t2d)O(t^2 d) at step tt. Summing over the whole generation:

t=1nO(t2d)=O(n3d)\sum_{t=1}^{n} O(t^2 d) = O(n^3 d)

Cubic in sequence length. Generating 10× more tokens costs ~1000× more attention work.

With a cache, step tt runs the model on one token: project one q,k,vq, k, v, then do one query-against-tt-keys attention, which is O(td)O(t d). Summing:

t=1nO(td)=O(n2d)\sum_{t=1}^{n} O(t d) = O(n^2 d)

A full factor of nn saved. And the per-step cost drops from "quadratic in context" to "linear in context" — which is what makes streaming feel steady rather than decelerating.

(The MLP layers also shrink from re-processing tt tokens to processing 1, saving another factor of tt there.)

Drag the context length below and watch the two regimes diverge — the y-axis is cumulative attention work, in arbitrary units. The cubic curve is the scribe re-reading; the quadratic one is the scribe with a notebook:

05,00010,00015,00020,00025,00030,00035,00040,0005101520253035404550
no cache (∝ n³)KV cache (∝ n²)

The catch: we traded compute for memory. And the bill is large.

How big is the notebook?

For every token, at every layer, we store one key vector and one value vector per KV head. The formula:

bytes per token=2×L×Hkv×dhead×bytesdtype\text{bytes per token} = 2 \times L \times H_{kv} \times d_{head} \times \text{bytes}_{\text{dtype}}

where the leading 2 is "one K and one V," LL is layer count, HkvH_{kv} is the number of KV heads, and dheadd_{head} is the dimension per head (very commonly 128).

Play with it. The sliders below set the architecture (head dim fixed at 128, FP16 = 2 bytes); the plot shows total cache size in GB as context length grows. With the defaults — layers (l) = 32 layers, kv heads (h_kv) = 8 KV heads, batch of concurrent sequences (batch) = 1 — this is Llama-3-8B territory:

Try it — drag the values

0246810121416020406080100120
KV cache (GB) vs context length (thousands of tokens)

A few things to try:

  1. Defaults, x = 128 (a 128K context): about 16 GB of cache for one conversation — half of a 32 GB consumer GPU, before counting the ~16 GB of model weights.
  2. Slide H to 32 — what Llama-3-8B would cost without grouped-query attention. Four times the memory, same model quality story. This is why nobody ships full multi-head KV anymore.
  3. Slide batch to 16 with everything else at defaults. This is the serving reality: cache memory scales linearly with concurrent users, and it — not weights — determines how many requests fit on your GPU.

Real models, FP16, per token of context:

ModelLayersKV headsTrickKV bytes / tokenCache at 128K ctx
GPT-3 175B (2020)9696none (MHA)~4.7 MB~600 GB (!)
Llama 2 7B3232none (MHA)512 KB64 GB
Llama 3 8B328GQA128 KB16 GB
Llama 3 70B808GQA320 KB40 GB
Mistral 7B328GQA + sliding window128 KB16 GB
DeepSeek-V3 671B61MLA (latent)~70 KB~9 GB

Read that last column again: a 671-billion-parameter model can have a smaller cache than an 8B model. The tricks matter as much as the size. We'll get to them.

Prefill vs. decode: two different machines

The cache splits inference into two phases with opposite personalities.

Prefill processes the whole prompt at once. Thousands of tokens flow through big matrix multiplications in parallel — the GPU's tensor cores stay saturated. Prefill is compute-bound: you're limited by FLOPs.

Decode processes one token per step. But that one token's attention must read the entire KV cache from GPU memory — every key, every value, every layer — plus all the model weights. The actual arithmetic is tiny. Decode is memory-bandwidth-bound: you're limited by how fast bytes move from VRAM to the compute units.

Arithmetic intensity: why decode starves the GPU◆◆◆◆go deeper +

A GPU has a roof on FLOPs and a roof on bytes/second. Which one binds depends on the arithmetic intensity of the workload: FLOPs performed per byte moved.

During decode with batch size 1, attention against a cached key/value pair does roughly 2 multiply-adds per 2-byte element read — an intensity of ~1–2 FLOPs/byte. A modern GPU's "balance point" is in the hundreds of FLOPs per byte (e.g., ~100 TFLOPs of FP16 against ~2 TB/s of bandwidth → ~50–100 FLOPs/byte). Decode sits two orders of magnitude below the balance point. The tensor cores idle while the memory system sprints.

Concrete ceiling: with a 16 GB cache and ~1.8 TB/s of bandwidth, just reading the cache once per token caps you near 110 tokens/sec for that sequence — regardless of how fast the GPU computes. Every cache-shrinking trick in the next section is therefore also a speed trick: fewer bytes per token read means more tokens per second.

This is also why serving systems batch aggressively: weights are read once per step and shared across the whole batch, amortizing the dominant memory traffic over many users. The KV cache, however, is per-sequence — it never amortizes. Which is exactly why it became the thing everyone optimizes.

Shrinking the notebook

The cache went from afterthought to the central object of inference research. Four families of attack:

Share keys and values across query heads.

Multi-head attention gives every one of the (say) 32 query heads its own K and V — 32 perspectives stored per token. Multi-Query Attention (MQA) keeps 32 query heads but a single shared K/V head: 32× less cache, with some quality cost. Grouped-Query Attention (GQA) is the compromise: groups of query heads (e.g., 4) share one KV head, so 32 query heads might use only 8 KV heads — a 4× cache reduction with nearly no quality loss.

This is the HkvH_{kv} slider you dragged above, and it's why nearly every model since 2023 ships with HkvHqH_{kv} \ll H_{q}. The cheapest trick in the book, applied at training time, paying out forever at inference time.

PagedAttention: the cache gets an operating system

Even after shrinking the cache, serving systems faced a dumber problem: fragmentation. Early servers pre-allocated each request's cache as one contiguous block sized for the maximum possible length. A request that might generate 2,000 tokens but stopped at 300 wasted 85% of its reservation. Measurements at the time found only 20–40% of "used" cache memory held actual data.

vLLM's PagedAttention (2023) stole the fix from operating systems: virtual memory. Chop the cache into small fixed-size blocks (e.g., 16 tokens each), let a sequence's blocks live anywhere in VRAM, and keep a per-sequence block table mapping logical position → physical block. Allocation becomes on-demand, waste drops to under ~4%, and effective batch sizes — therefore throughput — jump severalfold.

Paging also unlocked prefix sharing: if 50 requests start with the same system prompt, their block tables can all point at the same physical blocks for that prefix — copy-on-write, exactly like forked processes sharing memory pages. Common prefixes get cached once, prefilled once, and reused by everyone.

Prefix caching, beam search, and the cache as a data structure◆◆◆◆go deeper +

Once the cache is paged, it stops being "a buffer" and becomes a managed data structure with reference counts:

  • Automatic prefix caching: hash each full block of tokens; if a new request's prompt prefix matches blocks already resident, skip their prefill entirely. Multi-turn chat benefits enormously — every turn re-sends the conversation, but only the new suffix is actually computed.
  • Beam search / parallel sampling: candidate continuations share their common prefix physically and copy-on-write only at the divergence point, instead of duplicating the whole cache per beam.
  • Preemption & swapping: under memory pressure, a scheduler can evict a sequence's blocks to CPU RAM (or just drop and recompute them later) — exactly the page-out/page-in playbook.
  • Disaggregated serving: since prefill is compute-bound and decode is bandwidth-bound, large deployments now run them on separate GPU pools and ship the KV cache across the network between them — the cache as a first-class object with its own transfer protocol.

Where this leaves us

  1. 2017

    Transformer published

    Caching K/V during incremental decoding appears almost immediately as an obvious implementation detail — a footnote, not a feature.

  2. 2019

    Multi-Query Attention (Shazeer)

    First paper to treat KV memory traffic as the decode bottleneck and shrink it architecturally. Years ahead of its moment.

  3. 2022

    Chat era begins

    Long multi-turn conversations at scale turn the cache from footnote into the dominant VRAM consumer.

  4. 2023

    GQA + PagedAttention (vLLM)

    Grouped-query attention becomes the default architecture; paging becomes the default memory manager. Serving throughput jumps 2–4×.

  5. 2024

    Multi-head Latent Attention (DeepSeek)

    Cache compression moves into the architecture itself: store latents, reconstruct K/V on the fly.

  6. 2025+

    The cache becomes infrastructure

    Prefix caching across requests, KV transfer between machines, cache-aware routing — the notebook now has its own distributed systems literature.

The arc is worth noticing: a trivial memoization trick became the defining constraint of an industry. Today, when you ask "how many users can this GPU serve?" or "why is my long-context request slow?", you are almost always asking a question about the KV cache.

Check yourself

Check yourself

A model uses GQA to cut its KV heads from 32 to 8. Besides using 4× less VRAM for the cache, what is the main *speed* benefit during decode?

Check yourself

Why does the cache store keys and values, but never queries?

An open question to chew on: every trick here — GQA, quantization, MLA, eviction — bets that the cache contains redundancy. The cache is the model's working memory of the conversation. How small can that memory get before the model demonstrably forgets things it shouldn't? Nobody has found the floor yet — and the gap between "70 KB per token" and "the information content of one token" is still enormous.

Generated with claude-fable-5 · curated by Joe · CC BY-SA 4.0 · v1

Discussion 1 comments — open a section's margin marker, or select its text, to comment in place

Nothing here yet.

0/2000