Uniform INT4 Ignores Layer Heterogeneity
The naive approach treats a transformer like a homogeneous blob. It can't know which layers are load-bearing - or even that the question matters.
Layer 2 at 2-bit introduces a Δloss of 0.454. Layer 7 at 4-bit introduces a Δloss of 0.024. Uniform INT4 applies 4-bit everywhere, compressing load-bearing attention blocks (L7, L12) exactly as hard as safely redundant layers (L5, L6, where Δloss ≈ 0). The result is measurably higher perplexity at nearly the same memory footprint.
4 bit-width choices across 22 layers = 4^22 ≈ 17.6 trillion configurations. Brute-force evaluation at 0.5s per config would take 278,000 GPU-years. Manual expert tuning typically takes weeks per model. The only tractable approach is intelligent search guided by pre-computed sensitivity data.
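The figures above can be reproduced with a few lines of arithmetic. This is a minimal sketch; the bit-width set (2/4/8/16) is the one named later under Engineering Decisions, and the 365.25-day year is an assumption made here for the conversion.

```python
# Size of the exhaustive search space and the cost of brute-forcing it.
BIT_WIDTHS = (2, 4, 8, 16)          # candidate precisions per layer
NUM_LAYERS = 22
SECONDS_PER_CONFIG = 0.5            # evaluation cost quoted in the text
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumption: Julian year

num_configs = len(BIT_WIDTHS) ** NUM_LAYERS
gpu_years = num_configs * SECONDS_PER_CONFIG / SECONDS_PER_YEAR

print(f"{num_configs:,} configurations")
print(f"{gpu_years:,.0f} GPU-years of brute-force evaluation")
```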
Reducing the problem to "minimize loss" ignores memory constraints. Reducing to "minimize VRAM" ignores accuracy. Real deployment engineers face a tradeoff: an 8GB Jetson Orin needs different precision decisions than a 40GB A100. The correct output is a Pareto frontier - not a single answer.
Offline Profiling → Search → Export → Serve
The heavy compute (sensitivity profiling) runs once per target model. The search, export, and serving phases are fast and require no GPU. Zero NAS overhead at inference time.
The O(1) Proxy Evaluator
The entire 5,000-candidate multi-objective search completes in 11.7 seconds because the fitness function is a pre-computed table lookup - not a forward pass.
The GA Rediscovered Expert Intuition
Without any prior knowledge of transformer architecture, the evolutionary algorithm converged on Layers 7 and 12 as the precision-critical blocks - exactly the mid-stack attention layers human experts would identify.
[4,4,4,4,4,4,4,8,4,4,4,4,8,4,4,4,4,4,4,4,4,4]
20 of 22 layers at 4-bit. Only L7 and L12 kept at 8-bit. The GA spent its entire precision budget on the two layers that matter most - and nothing else.
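A quick way to read a genome like this one is to summarize it programmatically. The sketch below assumes zero-based layer indexing, which is inferred from the 8-bit entries sitting at positions 7 and 12 of the list.

```python
# Summarize the genome reported above: average precision and the layers
# that were kept above 4-bit. Indexing is assumed to be zero-based.
genome = [4, 4, 4, 4, 4, 4, 4, 8, 4, 4, 4, 4, 8, 4, 4, 4, 4, 4, 4, 4, 4, 4]

avg_bits = sum(genome) / len(genome)
high_precision_layers = [i for i, bits in enumerate(genome) if bits > 4]

print(f"layers: {len(genome)}, average bits: {avg_bits:.2f}")
print(f"layers above 4-bit: {high_precision_layers}")
```

The average works out to roughly 4.4 bits, matching the "Avg Bits" value reported for the balanced configuration.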
Three Deployable Archetypes
The knee-point algorithm automatically identifies the balanced solution by minimizing normalized Euclidean distance to the origin on the loss-VRAM tradeoff plane. No subjective weighting required.
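One way to sketch the knee-point rule is shown below. It assumes min-max normalization of each objective before measuring distance to the origin; the source does not say which normalization is used. The three candidate points are the rows of the results table, standing in for a full Pareto front, and the `knee_point` helper is a hypothetical name.

```python
import math

def knee_point(candidates):
    """Return the name of the candidate closest to the origin after
    min-max normalizing both objectives (assumed normalization)."""
    losses = [loss for loss, _ in candidates.values()]
    vrams = [vram for _, vram in candidates.values()]

    def norm(x, lo, hi):
        # Guard against a degenerate front where all values are equal.
        return 0.0 if hi == lo else (x - lo) / (hi - lo)

    def distance(name):
        loss, vram = candidates[name]
        return math.hypot(
            norm(loss, min(losses), max(losses)),
            norm(vram, min(vrams), max(vrams)),
        )

    return min(candidates, key=distance)

# (loss, VRAM in MB) taken from the results table.
candidates = {
    "FP16 Baseline": (2.3765, 2872),
    "Naive INT4": (2.4685, 1486),
    "EMPAS Balanced": (2.4521, 1528),
}
knee = knee_point(candidates)
print(knee)
```

Because both objectives are rescaled to the same range, neither loss nor VRAM dominates the distance purely because of its units.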
| Strategy | Loss | VRAM | Avg Bits | vs FP16 |
|---|---|---|---|---|
| FP16 Baseline | 2.3765 | 2,872 MB | 16.0 | - |
| Naive INT4 | 2.4685 | 1,486 MB | 4.0 | −48% |
| EMPAS Balanced | 2.4521 | 1,528 MB | 4.4 | −47% |
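The "vs FP16" column and the loss gap between rows follow directly from the table values. The short sketch below recomputes them; the variable names are illustrative only.

```python
# Recompute VRAM savings and loss increase relative to the FP16 baseline,
# using the (loss, VRAM in MB) values from the table above.
rows = {
    "FP16 Baseline": (2.3765, 2872),
    "Naive INT4": (2.4685, 1486),
    "EMPAS Balanced": (2.4521, 1528),
}
base_loss, base_vram = rows["FP16 Baseline"]

report = {}
for name, (loss, vram) in rows.items():
    delta_loss = loss - base_loss
    vram_saved_pct = 100.0 * (1.0 - vram / base_vram)
    report[name] = (delta_loss, vram_saved_pct)
    print(f"{name:15s} Δloss={delta_loss:+.4f}  VRAM saved={vram_saved_pct:.1f}%")
```

Read this way, the balanced configuration gives back 42 MB relative to naive INT4 in exchange for a smaller loss increase over the baseline.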
Engineering Decisions
Five decisions with specific technical justifications - not preferences.
DARTS requires continuous relaxation of discrete bit-widths (2/4/8/16), which introduces rounding artifacts that break the gradient signal. RL-based NAS (ENAS) has severe sample inefficiency on 4^22 combinatorial spaces - thousands of full model evaluations needed. NSGA-II natively handles discrete categorical variables, maintains Pareto-diverse populations, and makes no differentiability assumptions.
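The two properties that make NSGA-II a natural fit here, discrete genes and Pareto-based selection, can be illustrated in a few lines. This is not the project's implementation: it omits NSGA-II's front ranking and crowding distance, the function names are hypothetical, and the objective values in `toy_points` are made up purely to exercise the filter.

```python
import random

BIT_WIDTHS = (2, 4, 8, 16)

def dominates(a, b):
    """True if a is no worse than b on every objective and strictly
    better on at least one (all objectives are minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def non_dominated(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

def mutate(genome, rate, rng):
    """Categorical mutation: each gene jumps to a different legal
    bit-width with probability `rate`. No rounding or relaxation."""
    return [
        rng.choice([b for b in BIT_WIDTHS if b != gene]) if rng.random() < rate else gene
        for gene in genome
    ]

# Toy (loss, VRAM) pairs, illustrative values only.
toy_points = [(2.40, 1600), (2.45, 1500), (2.46, 1550), (2.50, 1700)]
front = non_dominated(toy_points)

child = mutate([4] * 22, rate=0.1, rng=random.Random(0))
print(front)
print(child)
```

Every mutated gene stays inside the legal bit-width set, which is the property a continuous relaxation cannot guarantee without a rounding step.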
Evaluating 5,000 genomes with real inference on TinyLlama-1.1B would take 8-42 GPU hours. The additive independence of layer sensitivities (quantizing layer 7 does not change how much quantizing layer 3 hurts) enables the proxy: predicted_loss = baseline + Σ sensitivity[i][b_i]. This turns the entire search into a dict lookup + arithmetic, validated with fixed seed calibration.
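The additive proxy above can be sketched as a lookup table plus a sum. Only two sensitivity values are quoted in this document (layer 2 at 2-bit and layer 7 at 4-bit), so the table below contains just those; every missing entry is treated as 0.0, which is a placeholder assumption for illustration and not measured data. The `predicted_loss` signature is hypothetical.

```python
BASELINE_LOSS = 2.3765  # FP16 loss from the results table

# (layer index, bit-width) -> Δloss. Only the two values quoted in the
# text are filled in; all other entries default to 0.0 (placeholder).
SENSITIVITY = {
    (2, 2): 0.454,
    (7, 4): 0.024,
}

def predicted_loss(genome, sensitivity=SENSITIVITY, baseline=BASELINE_LOSS):
    """predicted_loss = baseline + sum of per-layer Δloss lookups."""
    return baseline + sum(
        sensitivity.get((layer, bits), 0.0) for layer, bits in enumerate(genome)
    )

# All layers at 16-bit except layer 7 at 4-bit.
genome_a = [16] * 22
genome_a[7] = 4

# Same, but additionally push layer 2 down to 2-bit.
genome_b = list(genome_a)
genome_b[2] = 2

print(f"{predicted_loss(genome_a):.4f}")
print(f"{predicted_loss(genome_b):.4f}")
```

No model is loaded and no forward pass runs, which is why scoring thousands of genomes takes seconds rather than GPU hours.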
Sub-layer (per-head or per-projection) granularity would expand the search space from 4^22 to 4^(22×7) = 4^154 ≈ 5×10^92 - computationally and representationally intractable. Layer-level granularity still captures the macro heterogeneity between attention blocks and FFN layers while keeping the search space tractable. The results confirm this resolution is sufficient to find the load-bearing L7/L12 pattern.
Using real bitsandbytes or GPTQ quantization in the profiling loop would require CUDA kernel compilation and hardware-specific setup for every one of the 88 experiments (22 layers × 4 bit-widths). Fake quantization (quantize → dequantize, with weights kept in floating point) is hardware-agnostic, avoids CUDA build chain fragility, and produces sensitivity measurements that remain accurate enough for search purposes without the overhead.
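A dependency-free sketch of the fake-quantization idea follows. It assumes symmetric, per-tensor uniform quantization; the source does not specify the scheme, so the rounding rule, the scale choice, and the `fake_quantize` name are all assumptions. A real profiling loop would apply this to a layer's weight tensors rather than a Python list.

```python
def fake_quantize(weights, bits):
    """Round weights onto a symmetric integer grid, then map them back
    to floats. The output has the same type and shape as the input, so
    no low-bit kernels are needed (assumed scheme, see lead-in)."""
    qmax = 2 ** (bits - 1) - 1           # e.g. 7 for 4-bit, 127 for 8-bit
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0:
        return list(weights)             # nothing to quantize
    scale = max_abs / qmax
    # Quantize: scale, round, clamp. Dequantize: multiply by scale.
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]

def mean_abs_error(original, approx):
    return sum(abs(a - b) for a, b in zip(original, approx)) / len(original)

weights = [0.9, -0.31, 0.05, -0.72]      # illustrative values only
errors = {
    bits: mean_abs_error(weights, fake_quantize(weights, bits))
    for bits in (2, 4, 8)
}
for bits, err in errors.items():
    print(f"{bits}-bit mean abs error: {err:.4f}")
```

The reconstruction error shrinks as the bit-width grows, and it is this error, propagated through the model, that a sensitivity profile measures as Δloss.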
A single argmin on one objective collapses the multi-objective problem and forces the framework to make a deployment decision it shouldn't make. Engineers on different hardware (Jetson Orin 8GB vs cloud inference server) have different constraints. The knee-point selection automatically identifies the balanced solution without subjective weighting, while preserving all three operating points.