Causal Inference · X-Learner · LinUCB Bandits · Knowledge Distillation

Dynamic Experimentation Engine

A standard A/B test showed +59.4% conversion lift - and a −$0.05/user net loss. An end-to-end causal inference pipeline on 14M ad records identifies the Persuadable sub-population, turning the loss into +$0.09 profit per user.

Economic turnaround per user (−$0.05 → +$0.09): +$0.14
+280% improvement · 14M Criteo records · Causal Bandit
Distilled inference latency (RTB compatible): 45µs
vs 120ms X-Learner · 2,667× speedup · 5.5 KB pkl
Python · LightGBM · Polars · scikit-learn · SciPy · NumPy · Streamlit · Plotly · Pytest
Profit per User: +$0.09 (vs −$0.05 naive A/B · +280% improvement)
Dataset Scale: 14M (Criteo Uplift benchmark · 225 MB Parquet)
Inference Speedup: 2,667× (X-Learner 120ms → Decision Tree 45µs)
Test Coverage: 11 pytest files · 7 core tests verified PASS
The Core Problem

The Profitability Trap - when A/B testing destroys value

Standard A/B testing measures the Average Treatment Effect (ATE), a population-level aggregate that hides the distribution of individual effects. A +59.4% lift at the aggregate level conceals four distinct user groups with radically different economic value.

Statistical significance ≠ economic profit

The Criteo A/B test shows a +59.4% conversion lift with p < 0.001, so the result is statistically valid. The unit economics say otherwise: a treatment conversion rate of 0.31% at $10 LTV yields $0.031 of revenue per user against an ad cost of $0.10, a net of −$0.069 per treated user. A vanity-metric win masks a unit-economics disaster: at 100M impressions/month, the −$0.05/user baseline comes to −$5M/month.
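The arithmetic behind that claim fits in a few lines. This is a minimal sketch using only the figures quoted above (0.31% conversion rate, $10 LTV, $0.10 ad cost); the function and constant names are illustrative, not taken from the project code.

```python
# Per-user unit economics for a broadcast (bid-on-everyone) rollout.
# All inputs are the figures quoted in the text; names are hypothetical.
TREATMENT_CR = 0.0031   # 0.31% conversion rate among treated users
LTV_USD = 10.00         # value of one conversion
AD_COST_USD = 0.10      # cost of showing one ad


def net_profit_per_user(conversion_rate: float, ltv: float, ad_cost: float) -> float:
    """Expected revenue per user minus the cost of the impression."""
    return conversion_rate * ltv - ad_cost


revenue = TREATMENT_CR * LTV_USD
net = net_profit_per_user(TREATMENT_CR, LTV_USD, AD_COST_USD)

# Conversion rate at which an impression pays for itself.
break_even_cr = AD_COST_USD / LTV_USD

print(f"revenue per user: {revenue:.3f} USD")
print(f"net per user:     {net:.3f} USD")
print(f"break-even CR:    {break_even_cr:.2%}")
```

The break-even conversion rate of 1% is roughly three times the observed 0.31%, which is why a statistically significant lift can still lose money on every impression.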

ATE hides four distinct user populations

Averaging over 14M users collapses four fundamentally different groups, distinguished by their conditional average treatment effect (CATE): Persuadables (CATE > 0, ~20%), Sure Things (convert anyway, ~15%), Lost Causes (never convert, ~60%), and Sleeping Dogs (CATE < 0, ~5%). The average tells you nothing about who to bid on.

T-Learner fails on 85/15 splits

85% of users saw ads; only 15% did not. A T-Learner trains a separate model per group, so the control model trains on only ~2.1M rows with a 0.19% conversion rate: severe data starvation. The X-Learner cross-pollinates the two groups via propensity-weighted counterfactual imputation, an approach designed for imbalanced experimental designs.

System Architecture

5-subsystem end-to-end causal pipeline

From raw 225 MB Parquet to a 5.5 KB RTB artifact - each subsystem has a dedicated, tested module.

S1 · DataLoader + Integrity Gate (Polars · SciPy · SRM)
Polars DataFrame ~850 MB RAM · SRM p=0.9989 · max SMD=0.0488 ✓
S2 · FrequentistEngine - ATE + CUPED (Welch · OLS · Polars)
ATE = +59.45% lift · p < 0.001 · CUPED reduces SE · baseline profit = −$0.05/user
S3 · X-Learner - 5 LightGBM CATE Models (LightGBM · Propensity · CATE)
CATE scores ∈ ℝ^{N_test} - individual uplift estimates for every user
S4 · LinUCB Profit-Aware Bandit (LinUCB · Replay · Off-Policy)
Policy: BID if CATE×$10 > $0.10 · Avg profit = +$0.09/user · +$0.14 swing
S5 · DistillationEngine - Teacher → Student (Decision Tree · 5.5 KB · 45µs)
5.5 KB .pkl Decision Tree · ~45µs RTB inference · R² ≥ 0.80 · 2,667× speedup
LightGBM Models: 5 (propensity + 2 response + 2 effect)
Pytest Files: 11 (7 core tests PASS verified)
Memory Saved: ~50% (Float64→Float32 type downcasting)
Production Artifact: 5.5 KB (distilled Decision Tree pkl)
Causal Segmentation

The four populations

Explore each causal user segment identified by the X-Learner. The surrogate Decision Tree (depth=3) translates CATE scores into human-readable IF/THEN rules for the marketing team.

Only ~20% of users should ever be targeted; the ~80% of impressions a naive broadcast wastes is the entire business case.

CATE Score: +0.07 · BID
Population share: ~20% of users
Surrogate tree rule: IF f4 > 11.77 AND f3 ≤ 3.15 AND f2 ≤ 8.34
f4 (key discriminator): > 11.77 (strong signal)
f3: ≤ 3.15
f6: −7.40 (vs −3.58 avg)
Economic Outcome - per user
Naive A/B (broadcast all): −$0.05
Causal targeting - this segment: +$0.09

Persuadables are the only segment worth targeting. These users have a genuine incremental response to advertising: they convert because of the ad, not in spite of it. The X-Learner isolates this via propensity-weighted counterfactual imputation: CATE = g(x)·τ₀(x) + (1−g(x))·τ₁(x), where g(x) is the propensity score and τ₀(x), τ₁(x) are the two effect models.

LinUCB Bandit Policy
BID - CATE × $10 > $0.10 ✓
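The bid policy and the surrogate rule on the card can both be written as plain predicates. The sketch below copies the thresholds from the card; the function names are hypothetical, and treating the surrogate rule as a direct "is Persuadable" test is an assumption about how the tree leaf is used.

```python
LTV_USD = 10.00      # value of one conversion (from the text)
AD_COST_USD = 0.10   # cost of one impression (from the text)


def should_bid(cate: float, ltv: float = LTV_USD, ad_cost: float = AD_COST_USD) -> bool:
    """BID only when expected incremental revenue exceeds the ad cost."""
    return cate * ltv > ad_cost


def persuadable_rule(f2: float, f3: float, f4: float) -> bool:
    """Surrogate-tree rule from the card, written as an IF/THEN predicate.

    Assumption: the rule is read as a membership test for the Persuadable leaf.
    """
    return f4 > 11.77 and f3 <= 3.15 and f2 <= 8.34


# A CATE of +0.07 is worth $0.70 of incremental revenue, well above $0.10.
print(should_bid(0.07))
# Break-even sits at CATE = ad_cost / ltv = 0.01; anything below is skipped.
print(should_bid(0.005))
# Hypothetical feature values, chosen only to exercise the rule.
print(persuadable_rule(f2=5.0, f3=2.0, f4=12.0))
```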
Persuadables 20% · Sure Things 15% · Lost Causes 60% · Sleeping Dogs 5%
At 100M impressions / month
BEFORE (Naive A/B): −$5M/mo (100M × −$0.05/user)
AFTER (Causal Bandit): +$1.8M/mo (20M targeted × +$0.09/user)
NET SWING: +$6.8M/mo (+$81.6M annualized)
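The monthly figures follow directly from the per-user numbers. A short sketch of the arithmetic, assuming (as the "20M targeted" line implies) that the 80M untargeted impressions contribute $0 because no ad is bought for them:

```python
IMPRESSIONS_PER_MONTH = 100_000_000
NAIVE_PROFIT_PER_USER = -0.05      # broadcast to everyone
TARGETED_SHARE = 0.20              # ~20% Persuadables
TARGETED_PROFIT_PER_USER = 0.09    # causal bandit, targeted users only

before = IMPRESSIONS_PER_MONTH * NAIVE_PROFIT_PER_USER
targeted_users = IMPRESSIONS_PER_MONTH * TARGETED_SHARE
# Assumption: untargeted users cost nothing and earn nothing attributable.
after = targeted_users * TARGETED_PROFIT_PER_USER

net_swing = after - before
annualized = net_swing * 12

print(f"before:     {before / 1e6:+.1f}M USD/mo")
print(f"after:      {after / 1e6:+.1f}M USD/mo")
print(f"net swing:  {net_swing / 1e6:+.1f}M USD/mo")
print(f"annualized: {annualized / 1e6:+.1f}M USD/yr")
```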
Results

Three strategies, one winner

The LinUCB Contextual Bandit was simulated over 1M historical impressions via off-policy replay; the reward signal is net profit, not CTR.

Naive A/B Rollout: −$0.05 profit / user · −$5M/mo
Broadcast all · +59.4% lift · economically wrong
Greedy Uplift: +$0.08 profit / user · +$1.6M/mo
Top CATE decile only · no exploration
Causal Bandit (LinUCB), OPTIMAL: +$0.09 profit / user · +$1.8M/mo
Exploration-exploitation · optimal policy
UPLIFT BY CATE DECILE - top vs bottom rank-ordering
The X-Learner correctly rank-orders users: top decile achieves +350% lift vs naive +59.4%
D10 (top): +350% · D9: +180% · D8: +95% · D7: +48% · D6: +22% · D5: +8% · D4: +2% · D3 (bot.): −1%
Top Decile Lift: +350%
Naive A/B Lift: +59.4%
Student Model R²: ≥ 0.80
SRM p-value: 0.9989
Engineering Decisions

Five non-obvious choices

01

X-Learner over T-Learner

The 85/15 treatment/control split is fatal for T-Learners. The control group model (M0) trains on only ~2.1M rows with 0.19% conversion rate - severe data starvation. X-Learner solves this by cross-pollinating: it uses the treatment model to impute what would have happened to control users if treated, and vice versa. The propensity score then weights these imputed effects according to the probability each unit was in its observed group.

Handles 85/15 imbalance - T-Learner cannot
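The cross-imputation step is easier to follow in code. The sketch below is not the project's implementation: it swaps the LightGBM learners for plain least-squares fits, uses a continuous synthetic outcome instead of binary conversions, and replaces the propensity model with the known 85% assignment rate. Only the three-stage structure and the final blend CATE = g·τ₀ + (1−g)·τ₁ are the point.

```python
import numpy as np


def fit_linear(X, y):
    """Least-squares fit with intercept; returns a predict function.

    Stand-in base learner (the project uses LightGBM).
    """
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return lambda Z: np.column_stack([np.ones(len(Z)), Z]) @ coef


def x_learner(X, t, y, propensity):
    treated, control = t == 1, t == 0
    # Stage 1: one response model per group.
    mu1 = fit_linear(X[treated], y[treated])
    mu0 = fit_linear(X[control], y[control])
    # Stage 2: impute each unit's missing counterfactual with the OTHER
    # group's model, then fit an effect model on the imputed differences.
    d1 = y[treated] - mu0(X[treated])   # treated: observed - imputed control
    d0 = mu1(X[control]) - y[control]   # control: imputed treated - observed
    tau1 = fit_linear(X[treated], d1)
    tau0 = fit_linear(X[control], d0)
    # Stage 3: propensity-weighted blend of the two effect models.
    g = propensity
    return lambda Z: g * tau0(Z) + (1 - g) * tau1(Z)


# Synthetic data with a known effect (all values hypothetical).
rng = np.random.default_rng(0)
n = 20_000
X = rng.normal(size=(n, 2))
t = (rng.random(n) < 0.85).astype(int)          # 85/15 split as in the text
true_cate = 0.5 + 1.0 * X[:, 0]
y = X[:, 1] + t * true_cate + rng.normal(scale=0.1, size=n)

cate_model = x_learner(X, t, y, propensity=0.85)
cate_hat = cate_model(X)
print("mean abs error vs true CATE:", float(np.mean(np.abs(cate_hat - true_cate))))
```

With g = 0.85, most of the weight lands on τ₀, the effect model whose imputations come from the response model trained on the large treated group. That is the mechanism that protects the estimate from the starved control model.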
02

Polars over Pandas for 14M rows

Pandas runs GroupBy and multi-aggregate operations on 14M rows largely single-threaded, with Python-level overhead on each step. Polars executes columnar operations in Rust with SIMD vectorization and multi-threading, which makes it 5-10× faster on large-N aggregations with no GIL contention. Combined with type downcasting (Float64→Float32, int64→int8), RAM drops ~50%, from ~1.7 GB to ~850 MB. The LightGBM native booster API is used directly instead of the sklearn wrapper for a similar reason: less Python overhead and cleaner serialization.

5-10× faster · ~50% RAM reduction on 14M rows
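The memory effect of downcasting can be checked in isolation. This sketch uses NumPy arrays as a stand-in for Polars columns (the project itself uses Polars); the column contents are synthetic, and only the byte counts matter.

```python
import numpy as np

n_rows = 100_000
rng = np.random.default_rng(0)

# One float feature column and one 0/1 flag column at default 64-bit width.
feature64 = rng.normal(size=n_rows)                          # float64
flag64 = rng.integers(0, 2, size=n_rows, dtype=np.int64)     # int64

# Downcast: Float64 -> Float32 halves the column, int64 -> int8 cuts it 8x.
feature32 = feature64.astype(np.float32)
flag8 = flag64.astype(np.int8)

print("float column bytes:", feature64.nbytes, "->", feature32.nbytes)
print("flag column bytes: ", flag64.nbytes, "->", flag8.nbytes)

# The flag column loses nothing; the float column keeps ~7 significant digits.
print("flags identical:", bool(np.array_equal(flag64, flag8)))
print("max float rounding error:", float(np.max(np.abs(feature64 - feature32))))
```

A table dominated by float columns therefore shrinks by roughly half, which is consistent with the ~1.7 GB to ~850 MB figure quoted above.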
03

Knowledge Distillation for RTB

The X-Learner runs 5 LightGBM models sequentially at 120ms inference latency, which is incompatible with Real-Time Bidding's <1ms constraint. Knowledge distillation is typically a deep learning technique, so applying it to a causal meta-learner ensemble is non-obvious. The key insight: the X-Learner's CATE predictions are smooth continuous scores rather than noisy raw labels, which makes them ideal "soft labels". The Decision Tree student learns to approximate the ensemble's output, not the noisy binary outcomes.

2,667× speedup: 120ms → 45µs · RTB-compatible
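A minimal sketch of the teacher-to-student step, under loud assumptions: the "teacher" here is a synthetic score function standing in for the X-Learner output, the three feature columns are random stand-ins rather than real Criteo features, and the thresholds are borrowed from the surrogate rule only to give the tree something to find. Only the pattern (fit a shallow scikit-learn tree on the teacher's continuous scores, check R² on held-out rows, pickle the result) reflects the text.

```python
import pickle

import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
n = 20_000

# Synthetic stand-ins for three features (NOT the real Criteo columns).
X = np.column_stack([
    rng.normal(5.0, 3.0, n),    # "f2"
    rng.normal(3.0, 2.0, n),    # "f3"
    rng.normal(10.0, 3.0, n),   # "f4"
])

# Stand-in teacher scores: high CATE inside one region, near zero elsewhere.
in_segment = (X[:, 2] > 11.77) & (X[:, 1] <= 3.15)
teacher_cate = np.where(in_segment, 0.07, 0.0) + rng.normal(0.0, 0.005, n)

# Hold out the last 5,000 rows to score the student.
X_train, X_test = X[:15_000], X[15_000:]
y_train, y_test = teacher_cate[:15_000], teacher_cate[15_000:]

# Student: a depth-3 regression tree trained on the teacher's soft labels.
student = DecisionTreeRegressor(max_depth=3, random_state=0)
student.fit(X_train, y_train)

r2 = student.score(X_test, y_test)   # R^2 of student against teacher scores
artifact = pickle.dumps(student)

print(f"student R^2 vs teacher (held-out): {r2:.3f}")
print(f"pickled size: {len(artifact)} bytes")
print(export_text(student, feature_names=["f2", "f3", "f4"]))
```

The printed rules are the human-readable IF/THEN form of the student. Latency is not measured here; the 45µs figure depends on how the tree is served.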
04

Profit reward signal in LinUCB

Reward = net profit: (conversion × $10) − ($0.10 if ad shown). Not CTR, not conversion probability, not lift. This is a consequential distinction: a CTR-maximizing bandit would still bid on Sure Things (they convert, reward looks good) and Lost Causes (rare conversions still appear). Only when the reward directly encodes economics - uplift × LTV minus cost - does the policy learn to target exclusively the profitable Persuadables.

+$0.09/user (Bandit) vs +$0.08/user (greedy uplift)
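A compact sketch of a profit-rewarded LinUCB with off-policy replay. It is an illustration, not the project's code: it uses a two-arm disjoint LinUCB (NO_BID / BID), a two-dimensional context, and a synthetic log whose conversion rates (0.2% baseline, +7 points for Persuadables who see the ad) are invented for the example. The replay loop learns only from logged events where the logged action matches the action the policy would have taken.

```python
import numpy as np

LTV_USD, AD_COST_USD = 10.0, 0.10
NO_BID, BID = 0, 1


class LinUCB:
    """Disjoint LinUCB: one ridge-regression reward model per arm."""

    def __init__(self, n_arms: int, dim: int, alpha: float = 1.0):
        self.alpha = alpha
        self.A = [np.eye(dim) for _ in range(n_arms)]     # I + sum of x x^T
        self.b = [np.zeros(dim) for _ in range(n_arms)]   # sum of reward * x

    def select(self, x: np.ndarray) -> int:
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                                # ridge estimate
            bonus = self.alpha * np.sqrt(x @ A_inv @ x)      # exploration term
            scores.append(theta @ x + bonus)
        return int(np.argmax(scores))

    def update(self, arm: int, x: np.ndarray, reward: float) -> None:
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x


def profit_reward(converted: bool, ad_shown: bool) -> float:
    """Net profit: (conversion x $10) - ($0.10 if ad shown)."""
    return LTV_USD * converted - (AD_COST_USD if ad_shown else 0.0)


# Off-policy replay over a synthetic log (all rates below are hypothetical).
rng = np.random.default_rng(0)
policy = LinUCB(n_arms=2, dim=2, alpha=0.5)
matched, total_profit = 0, 0.0
for _ in range(20_000):
    persuadable = rng.random() < 0.20
    x = np.array([1.0, float(persuadable)])
    logged_arm = BID if rng.random() < 0.85 else NO_BID     # logging policy
    uplift = 0.07 if (persuadable and logged_arm == BID) else 0.0
    converted = rng.random() < 0.002 + uplift
    reward = profit_reward(converted, logged_arm == BID)
    if policy.select(x) == logged_arm:                       # replay match
        policy.update(logged_arm, x, reward)
        matched += 1
        total_profit += reward

print("matched events:", matched)
print("avg profit per matched event:", total_profit / max(matched, 1))
```

Because NO_BID also earns $10 when a user converts anyway, the policy compares the two arms on uplift × LTV minus cost, which is what keeps Sure Things and Lost Causes out of the bid set.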
05

α=0.001 for SRM instead of α=0.05

With 14M rows, statistical power is so high that even a 0.05% deviation in traffic split reaches p<0.05 trivially, which is a false alarm. Setting α=0.001 prevents nuisance alerts from harmless rounding in the traffic allocation system. This is documented explicitly in validation.py with a comment: "large N makes 0.05 too sensitive." The observed p=0.9989 sits far above either threshold, so the SRM check is a genuine pass rather than an artifact of the chosen α.

Robust SRM: p=0.9989 is a real pass, not noise
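The sensitivity argument can be reproduced with a one-degree-of-freedom chi-square test. This sketch uses only the standard library (the project lists SciPy for this step); the function name and the 3,000-user deviation are hypothetical, chosen to show a split that trips α=0.05 but not α=0.001.

```python
import math


def srm_p_value(n_treatment: int, n_control: int, expected_treatment_share: float) -> float:
    """Chi-square goodness-of-fit p-value for a two-arm traffic split."""
    total = n_treatment + n_control
    exp_t = total * expected_treatment_share
    exp_c = total - exp_t
    chi2 = (n_treatment - exp_t) ** 2 / exp_t + (n_control - exp_c) ** 2 / exp_c
    # For 1 degree of freedom, the chi-square tail equals erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))


# 14M users, 85/15 design, treatment arm over-filled by 3,000 users
# (about 0.02 percentage points; a hypothetical deviation).
p = srm_p_value(11_903_000, 2_097_000, expected_treatment_share=0.85)

print(f"SRM p-value: {p:.4f}")
print("flagged at alpha=0.05: ", p < 0.05)
print("flagged at alpha=0.001:", p < 0.001)
```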
