Neural Architecture Search · LLM Quantization · Multi-Objective Optimization

EMPAS

Multi-objective genetic algorithm that discovers Pareto-optimal per-layer quantization for LLMs. A one-time sensitivity profiling pass enables O(1) fitness evaluation, so a 5,000-evaluation search over a 17.6-trillion-configuration space finishes in 11.7 seconds.

Full search completed in
11.7s
5,000 evaluations · 4^22 ≈ 17.6T configuration space
VRAM reduction vs FP16
47%
Balanced archetype · 1528 MB vs 2872 MB
PyTorch · NSGA-II · HuggingFace · WandB · Hydra · FastAPI · Streamlit · CUDA
11.7s · Search Completed In · 5,000 evaluations, 50 generations
17.6T · Configuration Space · 4^22 possible bit-width assignments
47% · VRAM Reduction · vs FP16 at Balanced archetype
17 · Pareto-Optimal Solutions · Found in final population of 100

Uniform INT4 Ignores Layer Heterogeneity

The naive approach treats a transformer like a homogeneous blob. It can't know which layers are load-bearing - or even that the question matters.

01
Load-Bearing Layers Get Crushed

Layer 2 at 2-bit introduces a Δloss of 0.454. Layer 7 at 4-bit introduces a Δloss of 0.024. Uniform INT4 applies 4-bit everywhere, hitting load-bearing attention blocks (L7, L12) with the same compression as safely redundant layers (L5, L6, where Δloss ≈ 0). The result is measurably higher perplexity for nearly the same memory footprint.

02
Search Space is 17.6 Trillion

4 bit-width choices across 22 layers = 4^22 ≈ 17.6 trillion configurations. Brute-force evaluation at 0.5s per config would take roughly 279,000 GPU-years. Manual expert tuning typically takes weeks per model. The only tractable approach is intelligent search guided by pre-computed sensitivity data.
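
The arithmetic behind those figures can be checked in a few lines. This is only a sanity check of the numbers quoted above; the constant names are illustrative and not taken from the EMPAS codebase.

```python
# Size of the per-layer search space and the cost of exhaustive evaluation.
N_LAYERS = 22
BIT_CHOICES = (2, 4, 8, 16)
SECONDS_PER_CONFIG = 0.5               # per-config cost quoted in the text
SECONDS_PER_YEAR = 365.25 * 24 * 3600

n_configs = len(BIT_CHOICES) ** N_LAYERS
gpu_years = n_configs * SECONDS_PER_CONFIG / SECONDS_PER_YEAR

print(f"{n_configs:,} configurations")
print(f"{gpu_years:,.0f} GPU-years at {SECONDS_PER_CONFIG}s per configuration")
```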

03
Multi-Objective, Not Single-Scalar

Reducing the problem to "minimize loss" ignores memory constraints. Reducing to "minimize VRAM" ignores accuracy. Real deployment engineers face a tradeoff: an 8GB Jetson Orin needs different precision decisions than a 40GB A100. The correct output is a Pareto frontier - not a single answer.
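
To make "Pareto frontier" concrete, here is a minimal sketch of a non-dominated filter over (loss, VRAM) pairs. The first three points reuse figures from the baseline comparison further down the page; the fourth point and the function names are hypothetical, added only to show a dominated candidate being dropped.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (loss, vram_mb); both objectives are minimized


def dominates(a: Point, b: Point) -> bool:
    """True if a is no worse than b on every objective and strictly better on one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points: List[Point]) -> List[Point]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


candidates = [
    (2.3765, 2872.0),  # FP16 baseline
    (2.4685, 1486.0),  # naive INT4
    (2.4521, 1528.0),  # EMPAS balanced
    (2.4700, 1600.0),  # hypothetical candidate, worse on both axes than balanced
]
front = pareto_front(candidates)
print(front)
```

None of the first three points beats another on both axes, so all three survive; which one to deploy depends on the hardware budget.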

Four-phase system

Offline Profiling → Search → Export → Serve

The heavy compute (sensitivity profiling) runs once per target model. The search, export, and serving phases are fast and require no GPU. Zero NAS overhead at inference time.

P1 · Offline Sensitivity Profiling (PyTorch)
tinyllama_sensitivity.json - 88 layer×bit Δloss scores
P2 · Evolutionary Search, NSGA-II (Python)
checkpoint_gen_50.json - 100 genomes, 17 Pareto elites
P3 · Pareto Artifact Export (Python)
balanced.json · max_accuracy.json · max_compression.json
P4 · Serving Layer (FastAPI + Streamlit)
POST /generate · Interactive Streamlit dashboard
Core engineering innovation

The O(1) Proxy Evaluator

The entire 5,000-candidate multi-objective search completes in 11.7 seconds because the fitness function is a pre-computed table lookup - not a forward pass.

proxy_evaluator.py - O(1) fitness function
predicted_loss = baseline_loss + Σᵢ sensitivity[i][bᵢ]
predicted_VRAM (MB) = (Σᵢ params[i] × bᵢ / 8) / 1024² + 1024
latency_proxy = (Σᵢ bᵢ) / (16 × n_layers) × 100
Without the Proxy: Real GPU Forward Pass
Load quantized TinyLlama-1.1B to GPU
Run forward pass on calibration tokens
~0.1-0.5 seconds per genome × 5,000 evaluations
= 8-42 GPU minutes per full search run

With the O(1) Proxy: Table Lookup + Arithmetic
Load sensitivity.json once at startup
Look up sensitivity[i][b_i] per layer
22 dict lookups + 22 additions × 5,000 evaluations
= 11.7 seconds, confirmed via WandB runtime
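
The three proxy formulas above translate almost directly into code. The sketch below is not the project's proxy_evaluator.py. It makes two assumptions: every layer holds 44,040,192 parameters (a value chosen because it reproduces the 2,872 MB and 1,528 MB figures on this page), and the sensitivity table is zero everywhere except the two Δloss values quoted earlier. The real table would come from the profiling JSON.

```python
from typing import Dict, List, Tuple

N_LAYERS = 22
BASELINE_LOSS = 2.3765          # FP16 loss from the comparison table
PARAMS_PER_LAYER = 44_040_192   # assumption: equal-sized layers
OVERHEAD_MB = 1024              # constant term in the VRAM formula

# Hypothetical table: sensitivity[layer][bits] -> delta loss.
sensitivity: List[Dict[int, float]] = [
    {2: 0.0, 4: 0.0, 8: 0.0, 16: 0.0} for _ in range(N_LAYERS)
]
sensitivity[2][2] = 0.454   # Layer 2 at 2-bit (quoted in the text)
sensitivity[7][4] = 0.024   # Layer 7 at 4-bit (quoted in the text)


def evaluate(genome: List[int]) -> Tuple[float, float, float]:
    """Return (predicted_loss, predicted_vram_mb, latency_proxy) for one genome."""
    loss = BASELINE_LOSS + sum(sensitivity[i][b] for i, b in enumerate(genome))
    vram_mb = sum(PARAMS_PER_LAYER * b / 8 for b in genome) / 1024**2 + OVERHEAD_MB
    latency = sum(genome) / (16 * len(genome)) * 100
    return loss, vram_mb, latency


balanced = [4] * N_LAYERS
balanced[7] = balanced[12] = 8
print(evaluate(balanced))
print(evaluate([16] * N_LAYERS))
```

Each call does one lookup and one addition per layer, so the cost per genome is fixed by the layer count and does not depend on the model's size.
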
Genome encoding - 22 layers × {2,4,8,16}

The GA Rediscovered Expert Intuition

Without any prior knowledge of transformer architecture, the evolutionary algorithm converged on Layers 7 and 12 as the precision-critical blocks - exactly the mid-stack attention layers human experts would identify.

[Figure: per-layer bit-width map. Layer index axis with ticks 0-20; legend: 2-bit, 4-bit, 8-bit, 16-bit; L7 and L12 highlighted.]
BALANCED GENOME insight
[4,4,4,4,4,4,4,8,4,4,4,4,8,4,4,4,4,4,4,4,4,4]

20 of 22 layers at 4-bit. Only L7 and L12 kept at 8-bit. The GA spent its entire precision budget on the two layers that matter most - and nothing else.

MAX COMPRESSION insight
Even the max_compression genome - which aggressively compresses surrounding layers to 2-bit - still keeps Layer 7 at 8-bit. Independent confirmation that L7 is genuinely load-bearing, regardless of the compression objective.
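
The balanced genome listed above can be summarized programmatically. This short sketch only decodes the list as printed, treating positions as 0-indexed layer numbers, which is the reading under which the two 8-bit entries land on L7 and L12.

```python
from collections import Counter

# Balanced genome exactly as listed above; index = layer number.
balanced = [4, 4, 4, 4, 4, 4, 4, 8, 4, 4, 4, 4, 8, 4, 4, 4, 4, 4, 4, 4, 4, 4]

counts = Counter(balanced)                      # how many layers use each bit-width
avg_bits = sum(balanced) / len(balanced)        # average bit-width across layers
high_precision = [i for i, b in enumerate(balanced) if b > 4]

print(dict(counts), round(avg_bits, 1), high_precision)
```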

Three Deployable Archetypes

The knee-point algorithm automatically identifies the balanced solution by minimizing normalized Euclidean distance to the origin on the loss-VRAM tradeoff plane. No subjective weighting required.
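
A minimal sketch of that knee-point rule follows. The page says only that the distance is "normalized"; min-max scaling of each objective to [0, 1] is an assumption here, and the function name is illustrative. The three points are the archetype figures reported below.

```python
import math
from typing import List, Tuple


def knee_point(front: List[Tuple[float, float]]) -> int:
    """Return the index of the point closest to the origin after min-max scaling."""
    def scaled(values: List[float]) -> List[float]:
        lo, hi = min(values), max(values)
        span = hi - lo
        return [(v - lo) / span if span else 0.0 for v in values]

    losses = scaled([p[0] for p in front])
    vrams = scaled([p[1] for p in front])
    distances = [math.hypot(l, v) for l, v in zip(losses, vrams)]
    return distances.index(min(distances))


# (validation loss, VRAM in MB) for the three archetypes reported on this page.
front = [(3.3059, 2032.0), (3.3804, 1528.0), (4.2714, 1423.0)]
names = ["max_accuracy", "balanced", "max_compression"]
print(names[knee_point(front)])  # prints: balanced
```

Under this scaling the two extreme points each sit at distance 1.0 from the origin, while the balanced point sits at roughly 0.19.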

Max Accuracy · argmin(loss)
Validation Loss: 3.3059
VRAM (MB): 2,032
Avg Bit-Width: 9.8-bit
VRAM vs FP16: 29% reduction
Genome pattern: L1, L6, L7, L8 = 16-bit; most other layers 8-bit

Balanced · Pareto knee-point · RECOMMENDED
Validation Loss: 3.3804
VRAM (MB): 1,528
Avg Bit-Width: 4.4-bit
VRAM vs FP16: 47% reduction
Genome pattern: L7, L12 = 8-bit; the other 20 layers = 4-bit

Max Compression · argmin(VRAM)
Validation Loss: 4.2714
VRAM (MB): 1,423
Avg Bit-Width: 3.2-bit
VRAM vs FP16: 50% reduction
Genome pattern: Several 2-bit layers; L7 preserved at 8-bit

Comparison vs. Baselines
Strategy         Loss     VRAM       Avg Bits   vs FP16
FP16 Baseline    2.3765   2,872 MB   16.0       -
Naive INT4       2.4685   1,486 MB   4.0        −48%
EMPAS Balanced   2.4521   1,528 MB   4.4        −47%

Engineering Decisions

Five decisions with specific technical justifications - not preferences.

01
NSGA-II over DARTS and RL-based NAS

DARTS requires continuous relaxation of discrete bit-widths (2/4/8/16), which introduces rounding artifacts that break the gradient signal. RL-based NAS (ENAS) has severe sample inefficiency on 4^22 combinatorial spaces - thousands of full model evaluations needed. NSGA-II natively handles discrete categorical variables, maintains Pareto-diverse populations, and makes no differentiability assumptions.
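
To illustrate why categorical genomes need no relaxation, here is a sketch of two variation operators that work directly on bit-width lists. Uniform crossover, per-gene resampling mutation, and the 1/22 mutation rate are assumptions chosen for illustration; the page does not say which operators EMPAS pairs with NSGA-II.

```python
import random
from typing import List

BIT_CHOICES = (2, 4, 8, 16)


def uniform_crossover(a: List[int], b: List[int], rng: random.Random) -> List[int]:
    """Take each gene from parent a or parent b with equal probability."""
    return [x if rng.random() < 0.5 else y for x, y in zip(a, b)]


def mutate(genome: List[int], rng: random.Random, rate: float = 1 / 22) -> List[int]:
    """With probability `rate` per gene, resample that gene from BIT_CHOICES."""
    return [rng.choice(BIT_CHOICES) if rng.random() < rate else g for g in genome]


rng = random.Random(0)   # fixed seed so runs are repeatable
parent_a = [4] * 22
parent_b = [8] * 22
child = mutate(uniform_crossover(parent_a, parent_b, rng), rng)
print(child)
```

Every child is a valid genome by construction: no rounding step is needed because values never leave the set {2, 4, 8, 16}.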

Correct optimization for discrete spaces
02
O(1) proxy over real forward passes

Evaluating 5,000 genomes with real inference on TinyLlama-1.1B at 0.1-0.5 seconds each would take roughly 8-42 GPU minutes per search run. The proxy assumes layer sensitivities are additive and independent (quantizing layer 7 does not change how much quantizing layer 3 hurts), which gives predicted_loss = baseline + Σ sensitivity[i][b_i]. This turns the entire search into dict lookups plus arithmetic, validated with fixed-seed calibration.

8-42 minutes → 11.7 seconds
03
Layer-level not sub-layer granularity

Sub-layer (per-head or per-projection) granularity would expand the search space from 4^22 to 4^(22×7) = 4^154 ≈ 5 × 10^92 - computationally and representationally intractable. Layer-level granularity still captures the macro heterogeneity between attention blocks and FFN layers while keeping the search space tractable. The results confirm this resolution is sufficient to find the load-bearing L7/L12 pattern.
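
The exponent arithmetic for the two granularities can be checked directly. The factor of 7 sub-layer units per layer is taken from the text; the variable names are illustrative.

```python
import math

BIT_CHOICES = 4
N_LAYERS = 22
UNITS_PER_LAYER = 7   # sub-layer units per layer, as stated in the text

layer_level = BIT_CHOICES ** N_LAYERS
sub_layer = BIT_CHOICES ** (N_LAYERS * UNITS_PER_LAYER)

print(f"layer-level: {layer_level:.3e} configurations")
print(f"sub-layer:   about 10^{math.log10(sub_layer):.1f} configurations")
```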

Tractable 4^22 vs intractable 4^154 (≈ 5 × 10^92)
04
Fake quantization over real bitsandbytes quantization

Using real bitsandbytes or GPTQ quantization in the profiling loop would require CUDA kernel compilation and hardware-specific setup for every one of the 88 experiments. Fake quantization (quantize, then dequantize back to floating point) is hardware-agnostic, avoids CUDA build chain fragility, and produces sensitivity measurements that remain accurate enough for search purposes without the overhead.
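
The quantize-then-dequantize idea can be sketched without any GPU dependency. The version below uses plain Python lists and symmetric per-tensor round-to-nearest, both of which are assumptions made to keep the example dependency-free. The project itself profiles with PyTorch, and its exact scheme (including how the 16-bit setting is handled) is not described on this page.

```python
from typing import List


def fake_quantize(weights: List[float], bits: int) -> List[float]:
    """Snap weights to a signed integer grid of the given width, then map back to floats."""
    qmax = 2 ** (bits - 1) - 1             # e.g. 7 for 4-bit, 1 for 2-bit
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)               # all-zero tensor: nothing to quantize
    scale = max_abs / qmax
    out = []
    for w in weights:
        q = round(w / scale)               # quantize (round to nearest integer level)
        q = max(-qmax - 1, min(qmax, q))   # clamp to the representable range
        out.append(q * scale)              # dequantize
    return out


def max_error(weights: List[float], bits: int) -> float:
    """Largest absolute difference between original and fake-quantized weights."""
    return max(abs(w - q) for w, q in zip(weights, fake_quantize(weights, bits)))


weights = [-1.0, -0.3, 0.0, 0.4, 1.0]     # toy values, not real model weights
for bits in (2, 4, 8, 16):
    print(bits, max_error(weights, bits))
```

The output stays in floating point, so the model runs through its ordinary forward pass while carrying the rounding error of the chosen bit-width; that error is what a Δloss measurement captures.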

Hardware-agnostic profiling, zero build deps
05
Three-archetype export over single "best" solution

A single argmin on one objective collapses the multi-objective problem and forces the framework to make a deployment decision it shouldn't make. Engineers on different hardware (Jetson Orin 8GB vs cloud inference server) have different constraints. The knee-point selection automatically identifies the balanced solution without subjective weighting, while preserving all three operating points.

One framework, every hardware target