
Causal Inference · X-Learner · LinUCB Bandits · Knowledge Distillation

Dynamic Experimentation Engine

A standard A/B test showed +59.4% conversion lift - and a −$0.05/user net loss. An end-to-end causal inference pipeline on 14M ad records identifies the Persuadable sub-population, turning the loss into +$0.09 profit per user.

Economic turnaround per user (−$0.05 → +$0.09)

+$0.14

+280% improvement · 14M Criteo records · Causal Bandit

Distilled inference latency - RTB compatible

45µs

vs 120ms X-Learner · 2,667× speedup · 5.5 KB pkl

Python · LightGBM · Polars · scikit-learn · SciPy · NumPy · Streamlit · Plotly · Pytest

$0.09

Profit per User

vs −$0.05 naive A/B · +280% improvement

14M

Dataset Scale

Criteo Uplift benchmark · 225 MB Parquet

2,667×

Inference Speedup

X-Learner 120ms → Decision Tree 45µs

Test Coverage

pytest files · 7 core tests verified PASS

The Core Problem

The Profitability Trap - when A/B testing destroys value

Standard A/B testing measures Average Treatment Effect - a population-level aggregate that hides the distribution of individual effects. A +59.4% lift at the aggregate level conceals four distinct user groups with radically different economic value.

Statistical significance ≠ economic profit

The Criteo A/B test shows a +59.4% conversion lift at p < 0.001, which is statistically valid. The unit economics say otherwise: a treatment CR of 0.31% × $10 LTV = $0.031 revenue per treated user, against an ad cost of $0.10, for a net of −$0.069 per treated user. It is a vanity-metric win masking a unit-economics disaster: at the headline −$0.05/user and 100M impressions/month, that is −$5M/month.
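The arithmetic above can be reproduced in a few lines. This is a minimal sketch: the function and constant names are illustrative and not taken from the project code, and the inputs are the figures quoted in this section.

```python
def net_profit_per_user(conversion_rate: float, ltv: float, ad_cost: float) -> float:
    """Expected revenue per exposed user minus the cost of showing the ad."""
    return conversion_rate * ltv - ad_cost


TREATMENT_CR = 0.0031  # 0.31% conversion rate in the treated group
LTV = 10.00            # value of one conversion, in dollars
AD_COST = 0.10         # cost of one ad impression, in dollars

net = net_profit_per_user(TREATMENT_CR, LTV, AD_COST)
print(f"net profit per treated user: {net:+.3f}")  # -0.069
```

The lift is real, but the conversion rate would have to exceed AD_COST / LTV = 1% before broadcasting the ad to everyone breaks even.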

ATE hides four distinct user populations

Averaging over 14M users collapses four fundamentally different groups: Persuadables (CATE > 0, ~20%), Sure Things (convert anyway, ~15%), Lost Causes (never convert, ~60%), and Sleeping Dogs (CATE < 0, ~5%). The average tells you nothing about who to bid on.

T-Learner fails on 85/15 splits

85% of users saw ads, only 15% did not. A T-Learner trains a separate model per group: the control model trains on ~2.1M rows with 0.19% conversion rate - severe data starvation. The X-Learner cross-pollinates via propensity-weighted counterfactual imputation, specifically designed for imbalanced experimental designs.

System Architecture

5-subsystem end-to-end causal pipeline

From raw 225 MB Parquet to a 5.5 KB RTB artifact - each subsystem has a dedicated, tested module.

DataLoader + Integrity Gate

Polars · SciPy · SRM

Polars DataFrame ~850 MB RAM · SRM p=0.9989 · max SMD=0.0488 ✓

FrequentistEngine - ATE + CUPED

Welch · OLS · Polars

ATE = +59.45% lift · p < 0.001 · CUPED reduces SE · baseline profit = −$0.05/user

X-Learner - 5 LightGBM CATE Models

LightGBM · Propensity · CATE

CATE scores ∈ ℝ^{N_test} - individual uplift estimates for every user

LinUCB Profit-Aware Bandit

LinUCB · Replay · Off-Policy

Policy: BID if CATE×$10 > $0.10 · Avg profit = +$0.09/user · +$0.14 swing

DistillationEngine - Teacher → Student

Decision Tree · 5.5 KB · 45µs

5.5 KB .pkl Decision Tree · ~45µs RTB inference · R² ≥ 0.80 · 2,667× speedup

5

LightGBM Models

propensity + 2 response + 2 effect

Pytest Files

7 core tests PASS verified

~50%

Memory Saved

Float64→Float32 type downcasting

5.5 KB

Production Artifact

distilled Decision Tree pkl

Causal Segmentation

The four populations

Explore each causal user segment identified by the X-Learner. The surrogate Decision Tree (depth=3) translates CATE scores into human-readable IF/THEN rules for the marketing team.

Only ~20% of users should ever be targeted - the 80% naive waste is the entire business case.

+0.07

CATE Score

BID

Population share: ~20% of users

Surrogate tree rule

IF f4 > 11.77 AND f3 ≤ 3.15 AND f2 ≤ 8.34

f4 (key discriminator): > 11.77 (strong signal)

f3: ≤ 3.15

−7.40 (vs −3.58 avg)

Economic Outcome - per user

Naive A/B (broadcast all): −$0.05

Causal targeting - this segment: +$0.09

The only segment worth targeting. These users have genuine incremental response to advertising - they convert because of the ad, not in spite of it. The X-Learner isolates this via propensity-weighted counterfactual imputation: CATE = g(x)·τ₀(x) + (1−g(x))·τ₁(x).
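For reference, the standard X-Learner construction behind that formula can be written out as follows. The notation (μ for the response models, D for the imputed effects) is the usual textbook one and is assumed here rather than taken from the project code; g(x) is the propensity score P(T = 1 | X = x). The five fitted functions correspond to the propensity, two response, and two effect models.

```latex
\begin{align*}
\hat\mu_0(x) &\approx \mathbb{E}[Y \mid X = x,\, T = 0], &
\hat\mu_1(x) &\approx \mathbb{E}[Y \mid X = x,\, T = 1] \\
D_i^{1} &= Y_i - \hat\mu_0(X_i) \quad (T_i = 1), &
D_i^{0} &= \hat\mu_1(X_i) - Y_i \quad (T_i = 0) \\
\hat\tau_1(x) &\approx \mathbb{E}[D^{1} \mid X = x], &
\hat\tau_0(x) &\approx \mathbb{E}[D^{0} \mid X = x] \\
\hat\tau(x) &= g(x)\,\hat\tau_0(x) + \bigl(1 - g(x)\bigr)\,\hat\tau_1(x)
\end{align*}
```

With g(x) near 0.85, most of the weight falls on τ₀, whose imputed effects rely on the response model trained on the large treatment group.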

LinUCB Bandit Policy

BID - CATE × $10 > $0.10 ✓

Persuadables ~20% · Sure Things ~15% · Lost Causes ~60% · Sleeping Dogs ~5%

At 100M impressions / month

BEFORE (Naive A/B)

−$5M/mo

100M × −$0.05/user

→

AFTER (Causal Bandit)

+$1.8M/mo

20M targeted × +$0.09/user

NET SWING

+$6.8M/mo

+$81.6M annualized

Results

Three strategies, one winner

The LinUCB Contextual Bandit was simulated over 1M historical impressions via off-policy replay - the reward signal is net profit, not CTR.

Naive A/B Rollout

−$0.05

profit / user

−$5M/mo

Broadcast all · +59.4% lift · economically wrong

Greedy Uplift

+$0.08

profit / user

+$1.6M/mo

Top CATE decile only · no exploration

OPTIMAL

Causal Bandit (LinUCB)

+$0.09

profit / user

+$1.8M/mo

Exploration-exploitation · optimal policy
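The replay evaluation behind the bandit figure can be sketched as follows. This is a generic disjoint LinUCB with a replay loop, not the project's implementation: the class, the function, the synthetic log, and the reward model are all assumptions made for illustration. Replay of this simple form is unbiased when the logged actions were chosen uniformly at random; other logging distributions need reweighting.

```python
import numpy as np


class LinUCB:
    """Disjoint LinUCB: one ridge-regression model per arm plus a confidence bonus."""

    def __init__(self, n_arms: int, dim: int, alpha: float = 1.0):
        self.alpha = alpha
        self.A = [np.eye(dim) for _ in range(n_arms)]    # X^T X + I, per arm
        self.b = [np.zeros(dim) for _ in range(n_arms)]  # X^T r, per arm

    def select(self, x: np.ndarray) -> int:
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b
            # Estimated reward plus an upper-confidence exploration bonus
            scores.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))

    def update(self, arm: int, x: np.ndarray, reward: float) -> None:
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x


def replay(policy, contexts, logged_arms, rewards):
    """Off-policy replay: learn only from rows where the policy agrees with the log."""
    total, matched = 0.0, 0
    for x, a, r in zip(contexts, logged_arms, rewards):
        if policy.select(x) == a:
            policy.update(a, x, r)
            total += r
            matched += 1
    return total / max(matched, 1), matched


# Synthetic log (illustrative only): arm 0 = no bid, arm 1 = bid.
rng = np.random.default_rng(0)
n, dim = 2000, 3
contexts = rng.normal(size=(n, dim))
contexts[:, 0] = 1.0                      # intercept term
logged_arms = rng.integers(0, 2, size=n)  # uniformly random logging policy
# Bidding earns a net profit that grows with feature 1; not bidding earns nothing.
rewards = np.where(
    logged_arms == 1,
    0.5 * contexts[:, 1] - 0.10 + rng.normal(scale=0.05, size=n),
    0.0,
)

policy = LinUCB(n_arms=2, dim=dim, alpha=1.0)
avg_profit, matched = replay(policy, contexts, logged_arms, rewards)
print(f"matched {matched} of {n} logged rows, avg profit {avg_profit:+.3f}")
```

Because the reward is net profit, the learned policy bids on contexts where the expected profit is positive and stays out otherwise.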

UPLIFT BY CATE DECILE - top vs bottom rank-ordering

The X-Learner correctly rank-orders users: top decile achieves +350% lift vs naive +59.4%

D10 (top) +350% · D9 +180% · D8 +95% · D7 +48% · D6 +22% · D5 +8% · D4 +2% · D3 (bot.) −1%

Top Decile Lift

+350%

Naive A/B Lift

+59.4%

Student Model R²

≥ 0.80

SRM p-value

0.9989

Engineering Decisions

Five non-obvious choices

X-Learner over T-Learner

The 85/15 treatment/control split is fatal for T-Learners. The control group model (M0) trains on only ~2.1M rows with 0.19% conversion rate - severe data starvation. X-Learner solves this by cross-pollinating: it uses the treatment model to impute what would have happened to control users if treated, and vice versa. The propensity score then weights these imputed effects according to the probability each unit was in its observed group.

Handles 85/15 imbalance - T-Learner cannot
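The cross-imputation described above can be sketched in NumPy. Two things are swapped in to keep the example dependency-free, and neither reflects the project's code: ordinary least squares stands in for the LightGBM base learners, and the propensity is taken as the constant treated share, which is a reasonable assumption only for a randomized split. All names and the synthetic data are hypothetical.

```python
import numpy as np


def fit_linear(X, y):
    """Least-squares fit with an intercept; returns a predict function."""
    Xb = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return lambda Z: np.column_stack([np.ones(len(Z)), Z]) @ coef


def x_learner_cate(X, t, y, X_new):
    treated, control = t == 1, t == 0
    # Stage 1: one response model per arm
    mu0 = fit_linear(X[control], y[control])
    mu1 = fit_linear(X[treated], y[treated])
    # Stage 2: impute individual effects using the other arm's model
    d1 = y[treated] - mu0(X[treated])   # treated: observed minus predicted control outcome
    d0 = mu1(X[control]) - y[control]   # control: predicted treated outcome minus observed
    tau1 = fit_linear(X[treated], d1)
    tau0 = fit_linear(X[control], d0)
    # Stage 3: propensity-weighted blend (constant propensity assumed here)
    g = treated.mean()
    return g * tau0(X_new) + (1 - g) * tau1(X_new)


# Synthetic data with an 85/15 split and a known effect of 0.5 + x1
rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 2))
t = (rng.random(n) < 0.85).astype(int)
true_cate = 0.5 + X[:, 1]
y = 1.0 + X[:, 0] + t * true_cate + rng.normal(scale=0.1, size=n)

X_new = np.array([[0.0, -1.0], [0.0, 0.0], [0.0, 1.0]])
est = x_learner_cate(X, t, y, X_new)
print(est)  # true effects at these points are -0.5, 0.5, 1.5
```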

Polars over Pandas for 14M rows

Pandas runs GroupBy + multi-aggregate on 14M rows largely single-threaded, with Python-level overhead on every operation and no way around the GIL. Polars executes columnar operations in Rust, multi-threaded and with SIMD vectorization - 5-10× faster on large-N aggregations with no GIL contention. Combined with type downcasting (Float64→Float32, Int64→Int8), RAM drops ~50%, from ~1.7 GB to ~850 MB. The LightGBM native Booster API is used directly instead of the sklearn wrapper for a similar reason: less Python overhead and cleaner serialization.

5-10× faster · ~50% RAM reduction on 14M rows

Knowledge Distillation for RTB

The X-Learner runs 5 LightGBM models sequentially at 120ms - completely incompatible with Real-Time Bidding's <1ms constraint. Knowledge distillation is typically a deep learning technique; applying it to a causal meta-learner ensemble is non-obvious. The key insight: the X-Learner's CATE predictions are smooth continuous scores (not noisy raw labels), making them ideal "soft labels" for the student to learn. The Decision Tree learns the ensemble's soft approximation, not noisy binary outcomes.

2,667× speedup: 120ms → 45µs · RTB-compatible
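The teacher-to-student step can be sketched with scikit-learn. The teacher below is a stand-in function rather than the actual X-Learner ensemble, and the feature names, sample size, and depth are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 3))


def teacher_cate(X):
    """Stand-in for the slow teacher ensemble (hypothetical): a smooth CATE score."""
    return 0.05 * np.tanh(X[:, 0])


# The student is trained on the teacher's continuous scores, not on raw binary outcomes.
soft_labels = teacher_cate(X)
student = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, soft_labels)

r2 = student.score(X, soft_labels)  # fidelity of the student to the teacher
print(f"student R^2 vs teacher: {r2:.3f}")
print(export_text(student, feature_names=["f0", "f1", "f2"]))
```

A shallow tree reduces inference to a handful of threshold comparisons, and `export_text` yields the IF/THEN rules directly; fidelity should be checked on held-out data before the student replaces the teacher.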

Profit reward signal in LinUCB

Reward = net profit: (conversion × $10) − ($0.10 if ad shown). Not CTR, not conversion probability, not lift. This is a consequential distinction: a CTR-maximizing bandit would still bid on Sure Things (they convert, reward looks good) and Lost Causes (rare conversions still appear). Only when the reward directly encodes economics - uplift × LTV minus cost - does the policy learn to target exclusively the profitable Persuadables.

+$0.09/user (Bandit) vs +$0.08/user (greedy uplift)
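The reward definition and the resulting bid rule are small enough to write down directly. This is a minimal sketch using the $10 LTV and $0.10 ad cost stated above; the function names are illustrative.

```python
LTV = 10.00     # value of one conversion, in dollars
AD_COST = 0.10  # cost of showing one ad, in dollars


def profit_reward(converted: bool, ad_shown: bool) -> float:
    """Realized net profit for one impression."""
    return (LTV if converted else 0.0) - (AD_COST if ad_shown else 0.0)


def should_bid(cate: float) -> bool:
    """Bid only when expected incremental revenue exceeds the ad cost."""
    return cate * LTV > AD_COST


print(should_bid(0.07))   # True: 0.07 * $10 = $0.70 > $0.10
print(should_bid(0.005))  # False: 0.005 * $10 = $0.05 < $0.10
```

Under this reward a Sure Thing earns $10.00 without the ad and only $9.90 with it, so bidding on users who convert anyway is strictly worse than staying out.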

α=0.001 for SRM instead of α=0.05

With 14M rows, statistical power is so high that even a 0.05% deviation in traffic split achieves p<0.05 trivially - a false alarm. Setting α=0.001 prevents nuisance alerts from harmless rounding in the traffic allocation system. This is documented explicitly in validation.py with a comment: "large N makes 0.05 too sensitive." The resulting SRM test is robust: our p=0.9989 is a genuine pass, not a pass-by-default from a lenient threshold.

Robust SRM: p=0.9989 is a real pass, not noise
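The sensitivity argument can be checked with a chi-square goodness-of-fit test written out by hand. This is a sketch, not the contents of validation.py: the function name and the example counts are hypothetical, and the expected 85/15 split is the one quoted above.

```python
import math


def srm_p_value(n_treatment: int, n_control: int, expected_treatment_share: float) -> float:
    """Chi-square goodness-of-fit test (1 degree of freedom) for sample ratio mismatch."""
    n = n_treatment + n_control
    exp_t = n * expected_treatment_share
    exp_c = n - exp_t
    chi2 = (n_treatment - exp_t) ** 2 / exp_t + (n_control - exp_c) ** 2 / exp_c
    # Survival function of a chi-square variable with 1 degree of freedom
    return math.erfc(math.sqrt(chi2 / 2.0))


N = 14_000_000
# Illustrative drift of 0.02 percentage points away from the planned 85% share
n_t = int(N * 0.8502)
p = srm_p_value(n_t, N - n_t, 0.85)
print(f"p = {p:.4f}")
print("alert at alpha=0.05: ", p < 0.05)
print("alert at alpha=0.001:", p < 0.001)
```

In this example the drift is flagged at α=0.05 but not at α=0.001, which is the behavior the stricter threshold is meant to buy.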

