Neural Architecture Search · LLM Quantization · Multi-Objective Optimization

EMPAS

Multi-objective genetic algorithm that discovers optimal per-layer quantization for LLMs. A one-time sensitivity profiling pass enables O(1) fitness evaluation - reducing a 17.6 trillion configuration search to 11.7 seconds.

Full search completed in

11.7s

5,000 evaluations · 4^22 ≈ 17.6T configuration space

VRAM reduction vs FP16

47%

Balanced archetype · 1528 MB vs 2872 MB

PyTorch · NSGA-II · HuggingFace · WandB · Hydra · FastAPI · Streamlit · CUDA

11.7s

Search Completed In

5,000 evaluations, 50 generations

17.6T

Configuration Space

4^22 possible bit-width assignments

47%

VRAM Reduction

vs FP16 at Balanced archetype

17

Pareto-Optimal Solutions

Found in final population of 100

Uniform INT4 Ignores Layer Heterogeneity

The naive approach treats a transformer like a homogeneous blob. It can't know which layers are load-bearing - or even that the question matters.

Load-Bearing Layers Get Crushed

Layer 2 at 2-bit introduces a Δloss of 0.454, while Layer 7 at 4-bit introduces a Δloss of 0.024. Uniform INT4 applies the same 4-bit precision everywhere, so load-bearing attention blocks (L7, L12) are compressed exactly as hard as safely redundant layers (L5, L6, where Δloss ≈ 0). The result is measurably higher perplexity for a comparable memory footprint.

Search Space is 17.6 Trillion

4 bit-width choices across 22 layers = 4^22 ≈ 17.6 trillion configurations. Brute-force evaluation at 0.5s per config would take 278,000 GPU-years. Manual expert tuning typically takes weeks per model. The only tractable approach is intelligent search guided by pre-computed sensitivity data.
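The arithmetic behind those figures can be checked in a few lines. This is only a sanity check: the 0.5 s per-configuration cost is the figure quoted above, and the year length (365.25 days) is an assumption.

```python
# Size of the per-layer bit-width search space and the brute-force cost.
BIT_CHOICES = 4            # one of {2, 4, 8, 16} per layer
N_LAYERS = 22
SECONDS_PER_CONFIG = 0.5   # per-configuration cost quoted in the text
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumption: Julian year

n_configs = BIT_CHOICES ** N_LAYERS
gpu_years = n_configs * SECONDS_PER_CONFIG / SECONDS_PER_YEAR

print(f"{n_configs:,} configurations")
print(f"~{gpu_years:,.0f} GPU-years at {SECONDS_PER_CONFIG}s each")
```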

Multi-Objective, Not Single-Scalar

Reducing the problem to "minimize loss" ignores memory constraints. Reducing to "minimize VRAM" ignores accuracy. Real deployment engineers face a tradeoff: an 8GB Jetson Orin needs different precision decisions than a 40GB A100. The correct output is a Pareto frontier - not a single answer.
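To make "Pareto frontier" concrete, here is a minimal sketch of a dominance check over (loss, VRAM) pairs, both minimized. The candidate values are made up for illustration and are not EMPAS results; `dominates` and `pareto_front` are hypothetical helper names.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (loss, vram_mb), both to be minimized


def dominates(a: Point, b: Point) -> bool:
    """True if a is no worse than b on every objective and strictly better on one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points: List[Point]) -> List[Point]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Illustrative candidates only (not measured values).
candidates = [
    (2.40, 2000.0),
    (2.45, 1528.0),
    (2.47, 1600.0),  # worse than (2.45, 1528.0) on both objectives
    (2.90, 1423.0),
]
front = pareto_front(candidates)
print(front)
```

Every point left on the front is a defensible answer for some hardware budget, which is why the output is a set rather than a single configuration.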

Four-phase system

Offline Profiling → Search → Export → Serve

The heavy compute (sensitivity profiling) runs once per target model. The search, export, and serving phases are fast and require no GPU. Zero NAS overhead at inference time.

Offline Sensitivity Profiling · PyTorch

tinyllama_sensitivity.json - 88 layer×bit Δloss scores

Evolutionary Search (NSGA-II) · Python

checkpoint_gen_50.json - 100 genomes, 17 Pareto elites

Pareto Artifact Export · Python

balanced.json · max_accuracy.json · max_compression.json

Serving Layer · FastAPI + Streamlit

POST /generate · Interactive Streamlit dashboard

Core engineering innovation

The O(1) Proxy Evaluator

The entire 5,000-candidate multi-objective search completes in 11.7 seconds because the fitness function is a pre-computed table lookup - not a forward pass.

proxy_evaluator.py - O(1) fitness function

predicted_loss = baseline_loss + Σᵢ sensitivity[i][bᵢ]

predicted_VRAM (MB) = (Σᵢ params[i] × bᵢ / 8) / 1024² + 1024

latency_proxy = (Σᵢ bᵢ) / (16 × n_layers) × 100
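A minimal sketch of how these three formulas could be evaluated for one genome. The function name `evaluate`, the list-of-dicts layout of the sensitivity table, and the toy numbers are assumptions for illustration; the actual `proxy_evaluator.py` and the JSON schema may differ.

```python
from typing import Dict, List, Tuple

BASE_OVERHEAD_MB = 1024  # constant term in the VRAM formula


def evaluate(
    genome: List[int],                    # bit-width per layer, e.g. [4, 8, 16]
    sensitivity: List[Dict[int, float]],  # assumed layout: sensitivity[i][bits] -> delta loss
    params: List[int],                    # parameter count per layer
    baseline_loss: float,
) -> Tuple[float, float, float]:
    """Return (predicted_loss, predicted_vram_mb, latency_proxy) via lookups only."""
    predicted_loss = baseline_loss + sum(
        sensitivity[i][b] for i, b in enumerate(genome)
    )
    weight_bytes = sum(p * b / 8 for p, b in zip(params, genome))
    predicted_vram_mb = weight_bytes / 1024**2 + BASE_OVERHEAD_MB
    latency_proxy = sum(genome) / (16 * len(genome)) * 100
    return predicted_loss, predicted_vram_mb, latency_proxy


# Toy 3-layer model with made-up sensitivities (not TinyLlama data).
toy_sensitivity = [{2: 0.5, 4: 0.25, 8: 0.125, 16: 0.0} for _ in range(3)]
toy_params = [8 * 1024**2] * 3  # 8 Mi parameters per layer
loss, vram_mb, latency = evaluate([4, 8, 16], toy_sensitivity, toy_params, 2.0)
print(loss, vram_mb, round(latency, 2))
```

No model is loaded and no forward pass runs: the cost per genome is one lookup and a few multiplications per layer, independent of model size.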

Real GPU Forward Pass

Without the Proxy

→ Load quantized TinyLlama-1.1B to GPU

→ Run forward pass on calibration tokens

→ ~0.1-0.5 seconds per genome

→ × 5,000 evaluations

8-42 GPU minutes

per full search run

Table Lookup + Arithmetic

With the O(1) Proxy

→ Load sensitivity.json once at startup

→ Lookup sensitivity[i][b_i] per layer

→ 22 dict lookups + 22 additions

→ × 5,000 evaluations

11.7 seconds

confirmed via WandB runtime

Genome encoding - 22 layers × {2,4,8,16}

The GA Rediscovered Expert Intuition

Without any prior knowledge of transformer architecture, the evolutionary algorithm converged on Layers 7 and 12 as the precision-critical blocks - exactly the mid-stack attention layers human experts would identify.

[Genome heatmap, one cell per layer. Legend: 2-bit · 4-bit · 8-bit · 16-bit]

BALANCED GENOME insight

[4,4,4,4,4,4,4,8,4,4,4,4,8,4,4,4,4,4,4,4,4,4]

20 of 22 layers at 4-bit. Only L7 and L12 kept at 8-bit. The GA spent its entire precision budget on the two layers that matter most - and nothing else.
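The balanced genome can be checked against the reported figures with the VRAM formula. One assumption is needed: every layer is taken to have the same parameter count, back-derived here from the 2,872 MB FP16 figure and the 1,024 MB constant term.

```python
balanced = [4, 4, 4, 4, 4, 4, 4, 8, 4, 4, 4, 4, 8, 4, 4, 4, 4, 4, 4, 4, 4, 4]

FP16_VRAM_MB = 2872
BASE_OVERHEAD_MB = 1024

# Assumption: uniform per-layer parameter count, inferred from the FP16 footprint
# (2 bytes per parameter at 16-bit).
params_per_layer = (FP16_VRAM_MB - BASE_OVERHEAD_MB) * 1024**2 / 2 / len(balanced)

eight_bit_layers = [i for i, b in enumerate(balanced) if b == 8]  # 0-indexed
avg_bits = sum(balanced) / len(balanced)
vram_mb = sum(params_per_layer * b / 8 for b in balanced) / 1024**2 + BASE_OVERHEAD_MB
reduction = 1 - vram_mb / FP16_VRAM_MB

print(eight_bit_layers, round(avg_bits, 1), vram_mb, f"{reduction:.0%}")
```

Under that assumption the genome reproduces the 4.4-bit average, the 1,528 MB footprint, and the 47% reduction quoted for the Balanced archetype.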

MAX COMPRESSION insight

Even the max_compression genome - which aggressively compresses surrounding layers to 2-bit - still keeps Layer 7 at 8-bit. Independent confirmation that L7 is genuinely load-bearing, regardless of the compression objective.

Three Deployable Archetypes

The knee-point algorithm automatically identifies the balanced solution by minimizing normalized Euclidean distance to the origin on the loss-VRAM tradeoff plane. No subjective weighting required.

argmin(loss)

Max Accuracy

Validation Loss: 3.3059

VRAM (MB): 2,032

Avg Bit-Width: 9.8-bit

VRAM vs FP16: 29% reduction

Genome pattern

L1, L6, L7, L8 = 16-bit; most other layers 8-bit

RECOMMENDED

Pareto knee-point

Balanced

Validation Loss: 3.3804

VRAM (MB): 1,528

Avg Bit-Width: 4.4-bit

VRAM vs FP16: 47% reduction

Genome pattern

L7, L12 = 8-bit; all other 20 layers = 4-bit

argmin(VRAM)

Max Compression

Validation Loss: 4.2714

VRAM (MB): 1,423

Avg Bit-Width: 3.2-bit

VRAM vs FP16: 50% reduction

Genome pattern

Several 2-bit layers; L7 preserved at 8-bit
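The knee-point rule described above can be sketched as follows. Min-max normalization is an assumption (the text says only "normalized"), `knee_point` is a hypothetical name, and the three archetype cards stand in for the full 17-point front.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]  # (loss, vram_mb)


def knee_point(front: List[Point]) -> Point:
    """Pick the point closest to the origin after min-max normalizing each objective."""
    losses = [p[0] for p in front]
    vrams = [p[1] for p in front]

    def norm(value: float, lo: float, hi: float) -> float:
        return 0.0 if hi == lo else (value - lo) / (hi - lo)

    def distance(p: Point) -> float:
        return math.hypot(
            norm(p[0], min(losses), max(losses)),
            norm(p[1], min(vrams), max(vrams)),
        )

    return min(front, key=distance)


# Loss and VRAM values taken from the three archetype cards above.
archetypes = {
    "max_accuracy": (3.3059, 2032.0),
    "balanced": (3.3804, 1528.0),
    "max_compression": (4.2714, 1423.0),
}
knee = knee_point(list(archetypes.values()))
print(knee)
```

The two extremes each sit at normalized distance 1.0, while the balanced point is near 0.19, so it wins without any hand-set weights.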

Comparison vs. Baselines

Strategy	Loss	VRAM	Avg Bits	VRAM vs FP16
FP16 Baseline	2.3765	2,872 MB	16.0	-
Naive INT4	2.4685	1,486 MB	4.0	−48%
EMPAS Balanced	2.4521	1,528 MB	4.4	−47%

Engineering Decisions

Five decisions with specific technical justifications - not preferences.

NSGA-II over DARTS and RL-based NAS

DARTS requires continuous relaxation of discrete bit-widths (2/4/8/16), which introduces rounding artifacts that break the gradient signal. RL-based NAS (ENAS) has severe sample inefficiency on 4^22 combinatorial spaces - thousands of full model evaluations needed. NSGA-II natively handles discrete categorical variables, maintains Pareto-diverse populations, and makes no differentiability assumptions.
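Handling discrete variables natively comes down to variation operators that can only emit valid bit-widths. The sketch below shows one possible pair (uniform crossover and per-gene resampling mutation); these operators and the mutation rate are assumptions, not necessarily the ones EMPAS uses, and NSGA-II's non-dominated sorting and crowding-distance selection are omitted.

```python
import random
from typing import List

BIT_CHOICES = (2, 4, 8, 16)
N_LAYERS = 22


def random_genome(rng: random.Random) -> List[int]:
    """One bit-width per layer, drawn uniformly from the allowed set."""
    return [rng.choice(BIT_CHOICES) for _ in range(N_LAYERS)]


def uniform_crossover(a: List[int], b: List[int], rng: random.Random) -> List[int]:
    """Each gene is inherited from one parent or the other with equal probability."""
    return [x if rng.random() < 0.5 else y for x, y in zip(a, b)]


def mutate(genome: List[int], rng: random.Random, rate: float = 1 / N_LAYERS) -> List[int]:
    """Resample a gene from BIT_CHOICES with probability `rate` (assumed value)."""
    return [rng.choice(BIT_CHOICES) if rng.random() < rate else g for g in genome]


rng = random.Random(0)  # fixed seed for reproducibility
parent_a, parent_b = random_genome(rng), random_genome(rng)
child = mutate(uniform_crossover(parent_a, parent_b, rng), rng)
print(child)
```

Every child is a valid configuration by construction, so no relaxation or rounding step is needed.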

Correct optimization for discrete spaces

O(1) proxy over real forward passes

Evaluating 5,000 genomes with real inference on TinyLlama-1.1B at 0.1-0.5 seconds each would take roughly 8-42 GPU minutes per search run. The proxy rests on the assumption that layer sensitivities are additively independent (quantizing layer 7 does not change how much quantizing layer 3 hurts), which gives predicted_loss = baseline_loss + Σ sensitivity[i][b_i]. This turns the entire search into dict lookups plus arithmetic, validated with fixed-seed calibration.

8-42 minutes → 11.7 seconds

Layer-level not sub-layer granularity

Sub-layer (per-head or per-projection) granularity would expand the search space from 4^22 to 4^(22×7) = 4^154 ≈ 5 × 10^92 - computationally and representationally intractable. Layer-level granularity still captures the macro heterogeneity between attention blocks and FFN layers while keeping the search space tractable. The results confirm this resolution is sufficient to find the load-bearing L7/L12 pattern.

Tractable 4^22 vs intractable 4^154 ≈ 5 × 10^92

Fake quantization over real bitsandbytes quantization

Using real bitsandbytes or GPTQ quantization in the profiling loop would require CUDA kernel compilation and hardware-specific setup for every one of the 88 experiments. Fake quantization (quantize → dequantize, keeping the weights in floating point) is hardware-agnostic, avoids CUDA build-chain fragility, and produces sensitivity measurements that remain accurate enough for search purposes without the overhead.
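As a rough illustration of the idea, the sketch below rounds weights onto a low-bit grid and immediately maps them back to floats. The symmetric per-tensor scheme, the treatment of 16-bit as a pass-through, and the name `fake_quantize` are all assumptions; the project's profiling code operates on PyTorch tensors and may use a different scheme.

```python
from typing import List


def fake_quantize(weights: List[float], bits: int) -> List[float]:
    """Symmetric per-tensor quantize -> dequantize; the result stays in floating point."""
    if bits >= 16:
        return list(weights)  # assumption: 16-bit is treated as the unquantized baseline
    qmax = 2 ** (bits - 1) - 1            # largest signed integer level, e.g. 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0:
        return list(weights)
    scale = max_abs / qmax
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def mean_abs_error(original: List[float], approx: List[float]) -> float:
    return sum(abs(a - b) for a, b in zip(original, approx)) / len(original)


# Made-up weights for illustration.
weights = [0.9, -0.5, 0.1, 0.02, -0.33, 0.7]
errors = {bits: mean_abs_error(weights, fake_quantize(weights, bits)) for bits in (2, 4, 8, 16)}
print(errors)
```

The rounding error grows as the bit-width shrinks, which is the effect the sensitivity profile measures per layer, and nothing here depends on a quantization kernel or a specific GPU.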

Hardware-agnostic profiling, zero build deps

Three-archetype export over single "best" solution

A single argmin on one objective collapses the multi-objective problem and forces the framework to make a deployment decision it shouldn't make. Engineers on different hardware (Jetson Orin 8GB vs cloud inference server) have different constraints. The knee-point selection automatically identifies the balanced solution without subjective weighting, while preserving all three operating points.

One framework, every hardware target

EMPAS

11.7s · 17.6T configurations · NSGA-II · 47% VRAM reduction
