ML Systems · CUDA · LLM Inference

PageForge

Paged KV-Cache Memory Manager for LLM Inference

From-scratch implementation of the SOSP '23 vLLM PagedAttention paper: Rust allocator, hand-written CuPy CUDA kernels, DLPack zero-copy bridge to HuggingFace Transformers.

8-32×More concurrent sequences
424Sequences / GB VRAM
0.0Max logit error (bit-exact)
RustCUDAPythonPyO3CuPyDLPackHuggingFace
0×

Max VRAM savings

At decode step 0 vs naive pre-allocation

0

Sequences / GB VRAM

vs 53 naive (max=512) - 8× more throughput

~0%

Kernel bandwidth utilization

157-227 GB/s on RTX 4070 (250 GB/s peak)

0/75

Tests passing

Unit → kernel → GPT-2 integration → benchmarks

The Problem

90% of Your GPU
Is Idle

Standard HuggingFace KV-cache allocates a full contiguous tensor per sequence at the first forward pass, sized to max_seq_len:

# shape per sequence, GPT-2 max_seq_len=512 shape = (batch, n_layers, 2, n_heads, max_seq_len, d_head) # = 18.9 MB reserved, regardless of tokens generated

A request generating 60 tokens holds a 512-token VRAM slot for its entire lifetime. At 32 concurrent sequences, over 90% of allocated VRAM is idle before decode step 100. This is the primary GPU memory bottleneck in LLM serving.

VRAM at Decode Step 50 · 32 Concurrent Sequences

Naive pre-allocation603.98 MB
90%+ idle VRAM
PageForge paged75.50 MB
only what exists
more efficient at decode step 50
32× at sequence start

System Architecture

Four-Layer Component Stack

Each layer has a single bounded responsibility. Click any layer to see its internals.

L0
Python Cache LayerPython

pageforge/cache.py

L1
Rust AllocatorRust

pageforge-rs/ · PyO3 0.22 · maturin

L2
CUDA Kernel LayerCUDA

pageforge/kernels/kv_cache.py · CuPy RawKernel (NVRTC)

L3
GPU Memory PoolCuPy

pageforge/pool.py · CuPy fp16

One Decode Step - Annotated

1
alloc_for_seq(seq_id, 1)Rust O(1) pop_front
2
gather_kv(pool, page_ids)2 CUDA kernel launches
3
model.forward(token_n)HuggingFace attention
4
scatter_kv_layer(new_kv)12 scatter dispatches
5
free_seq(seq_id)O(1) push_back on finish

Total kernel launches: 2 gather + 12×2 scatter = 26
After fused kernel (P0 roadmap): 2 total

Key Technical Insight

From 24 Kernel Calls to 2

The combined K+V pool layout is the single most impactful optimization - it reduces gather operations per decode step from 24 to 2 for GPT-2.

NAIVE LAYOUT
# Separate tensors per layer pool_k[0], pool_v[0] # layer 0 pool_k[1], pool_v[1] # layer 1 ... pool_k[11], pool_v[11] # layer 11
24

gather calls per decode step

2 per layer × 12 layers (GPT-2)

PAGEFORGE LAYOUT
# All layers combined in one tensor pool shape: (N_pages, page_size, n_layers × n_heads, d_head) # = (512, 16, 144, 64) for GPT-2
2

gather calls per decode step

1 for all K layers · 1 for all V layers

How it works:

K layers 0-11 occupy head indices [0, 144). V layers 0-11 occupy [144, 288). Each layer reads its slice via layer_idx × n_heads offset. One gather reads all K. One gather reads all V.

Hardware-Verified Results

Benchmarks

RTX 4070 Laptop · sm_89 · CUDA 12.8 · GPT-2 124M fp16 · PyTorch 2.11

VRAM Efficiency

32 concurrent sequences · GPT-2 124M fp16

Step 0 (seq_len=9)32× savings
Naive603.98 MB
PageForge18.87 MB
Step 10 (seq_len=19)16× savings
Naive603.98 MB
PageForge37.75 MB
Step 25 (seq_len=34) savings
Naive603.98 MB
PageForge75.5 MB
Step 50 (seq_len=59) savings
Naive603.98 MB
PageForge75.5 MB
Step 100 (seq_len=109) savings
Naive603.98 MB
PageForge150.99 MB

Concurrent Sequences / GB VRAM

At decode step 50

Naive (max=1024)27
Naive (max=512)53
PageForge paged424

Decode Latency

Single sequence · 50 iterations · 5 warmup

HF DynamicCache

P50

7.5 ms

P99

10.3 ms

PageForge paged

P50

10.0 ms (+33%)

P99

11.9 ms

Root cause of +33% P50: 24 Python dispatch hops per step (1 scatter per layer × 12 layers × K+V). Planned fix: fused scatter-attention kernel - target ≤8 ms P50.

Kernel Bandwidth

gather_kv + scatter_kv_layer

157-227

GB/s on sm_89

~89% of RTX 4070 Laptop theoretical peak (250 GB/s) · purely memory-bandwidth-bound

Correctness Guarantee

0.0
max(abs(hf_logits - paged_logits)) = 0.0

Over 50 decode steps across all test prompts, in fp16. Not approximately equal - bit-for-bit identical to HuggingFace DynamicCache. This is the only acceptable correctness bar when replacing a production system.

19 unit tests - Rust allocator + PyO3 bindings
10 kernel tests - round-trip bit-exact correctness
7 integration tests - GPT-2 50-step logit parity
20 benchmark tests - VRAM, latency, efficiency
6 Rust cargo tests - OOM, double-free, invariants

Engineering Decisions

Five Choices Worth Defending

01

Rust VecDeque free-list over Python list

VecDeque.pop_front() is O(1). Python list.pop(0) shifts the entire array - O(n). At 1.5M page alloc/free operations per second, this gap is measurable and compounds under concurrent load.

1.5M pages/sec allocator throughput
02

Combined K+V pool tensor across all layers

Naive layout needs 2×n_layers = 24 separate gather calls per decode step for GPT-2. Combined layout (n_layers×n_heads as one axis) lets a single gather read all K across all layers, and one more read all V. 24 → 2 kernel launches per step.

12× fewer kernel launches per decode step
03

CuPy RawKernel over torch.utils.cpp_extension

On Windows + PyTorch 2.11 + MSVC 14.41, the torch C++ extension build chain is broken. CuPy NVRTC compiles CUDA C++ at import time with no build system dependency. This is a deliberate design choice - full kernel control, zero setup friction, Windows-compatible.

Zero-dependency CUDA compilation
04

DLPack zero-copy bridge for every decode step

Without DLPack: PyTorch CUDA tensor → .cpu().numpy() → cupy.array() touches host DRAM twice per tensor per step. DLPack exports a capsule pointing to the existing GPU allocation - no data moves, no host memory touched. Verified bit-exact across implementations.

No host round-trips on hot path
05

DynamicCache subclass as the user API

HuggingFace Transformers calls cache.update(key, value, layer_idx) inside attention. Overriding DynamicCache internals rather than replacing the full interface means PagedKVCache works with every model that uses the standard cache API - GPT-2 today, any future model tomorrow.

One-line swap, full inference compatibility

The Story Behind This Project

When the Experiment Told the Truth

PageForge started as a completely different project - SymboLR, a genetic programming engine designed to discover symbolic learning rate schedules of the formlr = f(t, g, Δl)using MAP-Elites evolutionary search in Rust, a torch.func.vmap batched evaluator, and 107 passing tests.

After several weeks, empirical reality intervened: the proxy-to-real transfer gap turned out to be intractable at individual compute scale. The synthetic proxy task was too easy for any LR in [0.01, 10.0], so the GP selected for fast convergence (high LR), not gradient-health awareness. The discovered formula ranked last of five candidates on real evaluation.

The decision: redirect to a problem with verifiable, hardware-grounded results within the same Rust/PyTorch/CUDA engineering domain. PageForge was that problem. The SymboLR infrastructure was not wasted - it demonstrated exactly what rigorous experimentation looks like: building something substantial, testing it honestly, and being willing to act on what the data says.

The defining moment of PageForge came after assembling all four layers and runningmax(abs(hf_logits - paged_logits))for the first time - and seeing exactly 0.0. The DLPack bridge, the gather/scatter kernels, and the HuggingFace interface all composed correctly to produce bit-exact outputs. That is the only acceptable result when replacing a well-tested production system.

What Is Next

Engineering Roadmap

P0

Fused scatter-attention CUDA kernel

P50: 10.0 → ≤8.0 ms (−20%)

Single kernel: gather → attention → scatter. Eliminates 24 Python dispatch hops per step.

P1

Prefix KV sharing

VRAM reduction ∝ shared prefix length

Copy-on-write block table - sequences with identical prompt prefixes share physical pages until divergence.

P2

Dynamic pool resize

Handle load spikes without restart

Rust PageAllocator.grow(n) - allocate new CuPy slab and extend free_list at runtime.

P3

CUDA Graph capture

−30% total decode latency (est.)

Stable page-ID layout per step enables torch.cuda.CUDAGraph - near-zero kernel launch overhead.

P4

Multi-GPU sharding

70B+ model serving with tensor parallelism

BlockTable indexed by (device_id, page_id). Cross-device gather via NVLink or PCIe.

Back to all projects