Max VRAM savings
At decode step 0 vs naive pre-allocation
Sequences / GB VRAM
vs. 53 with naive pre-allocation (max_seq_len = 512) - 8× more sequences per GB
Kernel bandwidth utilization
157-227 GB/s on RTX 4070 Laptop (250 GB/s peak)
Tests passing
Unit → kernel → GPT-2 integration → benchmarks
The Problem
90% of Your GPU Is Idle
The standard HuggingFace KV cache allocates one full contiguous tensor per sequence at the first forward pass, sized to max_seq_len.
A request generating 60 tokens holds a 512-token VRAM slot for its entire lifetime. At 32 concurrent sequences, over 90% of allocated VRAM is idle before decode step 100. This is the primary GPU memory bottleneck in LLM serving.
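The waste described above can be checked with a few lines of arithmetic. This is a rough sketch rather than the project's benchmark code: the GPT-2 124M shape (12 layers, 12 heads, head dimension 64) and fp16 storage come from the benchmark setup, while the 16-token page size, the 1 GB = 10^9 bytes convention, and the function names are assumptions made for illustration.

```python
# GPT-2 124M shape (12 layers, 12 heads, head_dim 64) stored in fp16.
N_LAYERS, N_HEADS, HEAD_DIM, BYTES_FP16 = 12, 12, 64, 2

# K and V for one token across all layers: 36,864 bytes.
BYTES_PER_TOKEN = 2 * N_LAYERS * N_HEADS * HEAD_DIM * BYTES_FP16


def naive_bytes(max_seq_len: int) -> int:
    """Contiguous slot reserved up front, independent of actual length."""
    return max_seq_len * BYTES_PER_TOKEN


def paged_bytes(tokens: int, page_size: int) -> int:
    """Only whole pages that hold at least one token are allocated."""
    pages = -(-tokens // page_size)  # ceiling division
    return pages * page_size * BYTES_PER_TOKEN


def idle_fraction(tokens: int, max_seq_len: int) -> float:
    """Share of a pre-allocated slot that holds no tokens yet."""
    return 1.0 - tokens / max_seq_len


GB = 10**9  # assumption: decimal gigabytes
PAGE_SIZE = 16  # assumption: hypothetical page size in tokens

naive_per_gb = GB / naive_bytes(512)
paged_per_gb = GB / paged_bytes(60, PAGE_SIZE)

print(f"naive: {naive_per_gb:.0f} sequences/GB")
print(f"paged: {paged_per_gb:.0f} sequences/GB")
print(f"idle at step 50: {idle_fraction(50, 512):.1%}")
```

Under these assumptions the script prints 53 sequences/GB for the naive layout and 424 for the paged one (an 8× ratio), and a single-token sequence occupies 32× less memory than a 512-token slot.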
Figure: VRAM at decode step 50 with 32 concurrent sequences (32× savings at sequence start)
Key Technical Insight
From 24 Kernel Calls to 2
The combined K+V pool layout is the single most impactful optimization - it reduces gather operations per decode step from 24 to 2 for GPT-2.
Naive layout: 24 gather calls per decode step (2 per layer × 12 layers, GPT-2)
Combined layout: 2 gather calls per decode step (1 for all K layers · 1 for all V layers)
K layers 0-11 occupy head indices [0, 144). V layers 0-11 occupy [144, 288). Each layer reads its slice via layer_idx × n_heads offset. One gather reads all K. One gather reads all V.
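The offset rule above can be written out as a small indexing sketch. The helper names `k_slice` and `v_slice` are hypothetical, and a plain list stands in for the combined head axis of the pool tensor; only the index arithmetic is meant to mirror the layout described.

```python
N_LAYERS, N_HEADS = 12, 12  # GPT-2 124M

K_BASE = 0                   # K heads start at index 0
V_BASE = N_LAYERS * N_HEADS  # V heads start at index 144


def k_slice(layer_idx: int) -> slice:
    """Head-axis range holding the K heads of one layer."""
    start = K_BASE + layer_idx * N_HEADS
    return slice(start, start + N_HEADS)


def v_slice(layer_idx: int) -> slice:
    """Head-axis range holding the V heads of one layer."""
    start = V_BASE + layer_idx * N_HEADS
    return slice(start, start + N_HEADS)


# Stand-in for the combined head axis: 288 entries, one per (K|V, layer, head).
heads = list(range(2 * N_LAYERS * N_HEADS))

# Walking K layers 0-11, then V layers 0-11, visits every index exactly once.
covered = []
for layer in range(N_LAYERS):
    covered += heads[k_slice(layer)]
for layer in range(N_LAYERS):
    covered += heads[v_slice(layer)]
assert covered == heads

print(k_slice(3), v_slice(3))
```

Because each layer's heads are contiguous on one axis, a single gather over [0, 144) returns all K data and each layer then takes a view of its own slice, with no further kernel launches.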
Hardware-Verified Results
Benchmarks
RTX 4070 Laptop · sm_89 · CUDA 12.8 · GPT-2 124M fp16 · PyTorch 2.11
Kernel Bandwidth
gather_kv + scatter_kv_layer
157-227 GB/s on sm_89
Up to ~89% of the RTX 4070 Laptop theoretical peak (250 GB/s) · memory-bandwidth-bound
Engineering Decisions
Five Choices Worth Defending
Rust VecDeque free-list over Python list
VecDeque.pop_front() is O(1). Python list.pop(0) shifts the entire array - O(n). At 1.5M page alloc/free operations per second, this gap is measurable and compounds under concurrent load.
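For readers more at home in Python, the same free-list idea can be sketched with `collections.deque`, whose `popleft()` is O(1) like `VecDeque::pop_front()`. This is an analog for illustration, not the project's Rust allocator; the `alloc` and `free` method names are assumptions.

```python
from collections import deque


class PageAllocator:
    """FIFO free-list of page IDs with O(1) alloc and free."""

    def __init__(self, n_pages: int) -> None:
        self.free_list = deque(range(n_pages))

    def alloc(self) -> int:
        """Take the oldest free page; deque.popleft() does not shift elements."""
        if not self.free_list:
            raise MemoryError("page pool exhausted")
        return self.free_list.popleft()

    def free(self, page_id: int) -> None:
        """Return a page to the back of the free-list."""
        self.free_list.append(page_id)


allocator = PageAllocator(4)
first = allocator.alloc()   # page 0
second = allocator.alloc()  # page 1
allocator.free(first)       # page 0 goes to the back of the queue
remaining = [allocator.alloc() for _ in range(3)]
print(first, second, remaining)
```

Swapping the deque for a list and calling `pop(0)` would keep the behavior identical but make every allocation cost time proportional to the number of free pages.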
Combined K+V pool tensor across all layers
Naive layout needs 2×n_layers = 24 separate gather calls per decode step for GPT-2. Combined layout (n_layers×n_heads as one axis) lets a single gather read all K across all layers, and one more read all V. 24 → 2 kernel launches per step.
CuPy RawKernel over torch.utils.cpp_extension
On Windows + PyTorch 2.11 + MSVC 14.41, the torch C++ extension build chain is broken. CuPy NVRTC compiles CUDA C++ at import time with no build system dependency. This is a deliberate design choice - full kernel control, zero setup friction, Windows-compatible.
DLPack zero-copy bridge for every decode step
Without DLPack: PyTorch CUDA tensor → .cpu().numpy() → cupy.array() touches host DRAM twice per tensor per step. DLPack exports a capsule pointing to the existing GPU allocation - no data moves, no host memory touched. Verified bit-exact across implementations.
DynamicCache subclass as the user API
HuggingFace Transformers calls cache.update(key, value, layer_idx) inside attention. Overriding DynamicCache internals rather than replacing the full interface means PagedKVCache works with every model that uses the standard cache API - GPT-2 today, any future model tomorrow.
The Story Behind This Project
When the Experiment Told the Truth
PageForge started as a completely different project - SymboLR, a genetic programming engine designed to discover symbolic learning rate schedules of the form lr = f(t, g, Δl) using MAP-Elites evolutionary search in Rust, a torch.func.vmap batched evaluator, and 107 passing tests.
After several weeks, empirical reality intervened: the proxy-to-real transfer gap turned out to be intractable at individual compute scale. The synthetic proxy task was too easy for any LR in [0.01, 10.0], so the GP selected for fast convergence (high LR), not gradient-health awareness. The discovered formula ranked last of five candidates on real evaluation.
The decision: redirect to a problem with verifiable, hardware-grounded results within the same Rust/PyTorch/CUDA engineering domain. PageForge was that problem. The SymboLR infrastructure was not wasted - it demonstrated exactly what rigorous experimentation looks like: building something substantial, testing it honestly, and being willing to act on what the data says.
The defining moment of PageForge came after assembling all four layers and running max(abs(hf_logits - paged_logits)) for the first time - and seeing exactly 0.0. The DLPack bridge, the gather/scatter kernels, and the HuggingFace interface all composed correctly to produce bit-exact outputs. That is the only acceptable result when replacing a well-tested production system.
What Is Next
Engineering Roadmap
Fused scatter-attention CUDA kernel
P50: 10.0 → ≤8.0 ms (−20%)
Single kernel: gather → attention → scatter. Eliminates 24 Python dispatch hops per step.
Prefix KV sharing
VRAM reduction ∝ shared prefix length
Copy-on-write block table - sequences with identical prompt prefixes share physical pages until divergence.
Dynamic pool resize
Handle load spikes without restart
Rust PageAllocator.grow(n) - allocate new CuPy slab and extend free_list at runtime.
CUDA Graph capture
−30% total decode latency (est.)
Stable page-ID layout per step enables torch.cuda.CUDAGraph - near-zero kernel launch overhead.
Multi-GPU sharding
70B+ model serving with tensor parallelism
BlockTable indexed by (device_id, page_id). Cross-device gather via NVLink or PCIe.
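Of the roadmap items, prefix KV sharing is the easiest to illustrate without a GPU. The sketch below shows one possible shape for a copy-on-write block table using per-page reference counts. Every name in it (`SharedPagePool`, `share`, `writable`, `fork`) is hypothetical, and Python lists stand in for the KV payload of a page; it is not a description of how PageForge will implement the feature.

```python
from collections import deque


class SharedPagePool:
    """Hypothetical reference-counted page pool with copy-on-write."""

    def __init__(self, n_pages: int) -> None:
        self.free = deque(range(n_pages))
        self.refcount = {}  # page_id -> number of block tables using it
        self.data = {}      # page_id -> list standing in for KV payload

    def alloc(self) -> int:
        page = self.free.popleft()
        self.refcount[page] = 1
        self.data[page] = []
        return page

    def share(self, page: int) -> int:
        """Add another owner to an existing page."""
        self.refcount[page] += 1
        return page

    def release(self, page: int) -> None:
        """Drop one owner; recycle the page when nobody uses it."""
        self.refcount[page] -= 1
        if self.refcount[page] == 0:
            del self.refcount[page], self.data[page]
            self.free.append(page)

    def writable(self, page: int) -> int:
        """Return a page that is safe to write: the same page if it has a
        single owner, otherwise a private copy."""
        if self.refcount[page] == 1:
            return page
        copy = self.alloc()
        self.data[copy] = list(self.data[page])
        self.release(page)
        return copy


def fork(pool: SharedPagePool, block_table: list) -> list:
    """New sequence that shares every physical page of its parent."""
    return [pool.share(page) for page in block_table]


pool = SharedPagePool(8)
parent = [pool.alloc(), pool.alloc()]
pool.data[parent[0]] = ["tok"] * 16  # full prefix page
pool.data[parent[1]] = ["tok"] * 4   # partially filled page

child = fork(pool, parent)                # no data copied yet
child[-1] = pool.writable(child[-1])      # divergence: copy the last page
pool.data[child[-1]].append("new")

print(parent, child)
```

After the fork the two sequences still share the full prefix page, and only the page being written is duplicated, which is why the saving scales with the shared prefix length.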