The Profitability Trap - when A/B testing destroys value
Standard A/B testing measures the Average Treatment Effect (ATE) - a population-level aggregate that hides the distribution of individual effects. A +59.4% lift at the aggregate level conceals four distinct user groups with radically different economic value.
Statistical significance ≠ economic profit
The Criteo A/B test shows a +59.4% conversion lift at p < 0.001, which is statistically valid. The unit economics tell a different story: treatment CR = 0.31% × $10 LTV = $0.031 revenue per user, against an ad cost of $0.10. Net: −$0.069/user. At 14M-impression scale, this vanity-metric win masks a unit-economics disaster of −$5M/month.
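The arithmetic above can be checked with a few lines of Python. This is a minimal sketch that uses only the figures quoted in the text; the function and constant names are illustrative and not taken from the project.

```python
# Figures quoted in the text; all names here are illustrative.
TREATMENT_CR = 0.0031   # 0.31% conversion rate in the treatment group
LTV_USD = 10.0          # value of one conversion
AD_COST_USD = 0.10      # cost of showing one ad


def net_profit_per_user(conversion_rate: float, ltv: float, ad_cost: float) -> float:
    """Expected revenue per treated user minus the cost of showing the ad."""
    return conversion_rate * ltv - ad_cost


def break_even_conversion_rate(ltv: float, ad_cost: float) -> float:
    """Conversion rate at which expected revenue exactly covers the ad cost."""
    return ad_cost / ltv


net = net_profit_per_user(TREATMENT_CR, LTV_USD, AD_COST_USD)
break_even = break_even_conversion_rate(LTV_USD, AD_COST_USD)

print(f"net per user:  {net:+.3f} USD")
print(f"break-even CR: {break_even:.1%}")
```

Under these numbers the treatment group would need a conversion rate of 1%, roughly three times the observed 0.31%, just to break even.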
ATE hides four distinct user populations
Averaging over 14M users collapses four fundamentally different groups, distinguished by their Conditional Average Treatment Effect (CATE): Persuadables (CATE > 0, ~20%), Sure Things (convert anyway, ~15%), Lost Causes (never convert, ~60%), and Sleeping Dogs (CATE < 0, ~5%). The average tells you nothing about who to bid on.
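One way to see why the average is uninformative is to write the ATE as a share-weighted mix of per-segment effects. The decomposition below is a sketch using the approximate shares quoted above; treating Sure Things and Lost Causes as having exactly zero effect is an idealisation, and the symbols \pi_s (segment share) and \tau_s (segment-level effect) are introduced here for illustration.

```latex
\mathrm{ATE} \;=\; \sum_{s} \pi_s \,\tau_s
\;\approx\; 0.20\,\tau_{\text{Persuadable}}
          + 0.15 \cdot 0
          + 0.60 \cdot 0
          + 0.05\,\tau_{\text{Sleeping Dog}},
\qquad \tau_{\text{Persuadable}} > 0,\quad \tau_{\text{Sleeping Dog}} < 0 .
```

A single positive ATE is therefore compatible with 75% of users contributing nothing and 5% contributing a negative effect.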
T-Learner fails on 85/15 splits
85% of users saw ads, only 15% did not. A T-Learner trains a separate model per group: the control model trains on ~2.1M rows with a 0.19% conversion rate, which leaves only about 4,000 converting users to learn from - severe data starvation. The X-Learner, designed for imbalanced experimental designs, cross-pollinates between the two groups via propensity-weighted counterfactual imputation.
5-subsystem end-to-end causal pipeline
From raw 225 MB Parquet to a 5.5 KB RTB artifact - each subsystem has a dedicated, tested module.
The four populations
The X-Learner identifies four causal user segments. A surrogate Decision Tree (depth=3) translates CATE scores into human-readable IF/THEN rules for the marketing team.
Only ~20% of users should ever be targeted - the 80% of impressions wasted by naive blanket targeting is the entire business case.
Three strategies, one winner
The LinUCB contextual bandit was simulated over 1M historical impressions via off-policy replay - the reward signal is net profit, not CTR.
Five non-obvious choices
X-Learner over T-Learner
The 85/15 treatment/control split is fatal for T-Learners. The control group model (M0) trains on only ~2.1M rows with a 0.19% conversion rate - severe data starvation. The X-Learner solves this by cross-pollinating: it uses the treatment model to impute what would have happened to control users if treated, and vice versa. The propensity score then blends the two resulting effect estimates, so that with 85% of users treated most of the weight falls on the estimate that relies on the well-trained treatment model.
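The stages described above can be sketched on synthetic data. This is an illustration under loud assumptions, not the project's code: the data is simulated, a least-squares line stands in for each LightGBM model, and the propensity is taken as the constant treated share because assignment is randomised. All function and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: one feature, 85/15 split, true effect = 0.5 * x.
n = 20_000
x = rng.uniform(0.0, 1.0, size=n)
t = rng.random(n) < 0.85                       # True = treated
y = 1.0 + 2.0 * x + t * (0.5 * x) + rng.normal(0.0, 0.1, size=n)


def fit_linear(features: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Least-squares fit of target ~ intercept + slope * feature (stand-in base learner)."""
    design = np.column_stack([np.ones_like(features), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef


def predict_linear(coef: np.ndarray, features: np.ndarray) -> np.ndarray:
    """Evaluate the fitted line at the given feature values."""
    return coef[0] + coef[1] * features


# Stage 1: one outcome model per arm (M0 on control, M1 on treated).
m0 = fit_linear(x[~t], y[~t])
m1 = fit_linear(x[t], y[t])

# Stage 2: impute individual effects using the *other* arm's model.
d_treated = y[t] - predict_linear(m0, x[t])      # observed minus imputed control outcome
d_control = predict_linear(m1, x[~t]) - y[~t]    # imputed treated outcome minus observed

# Stage 3: one effect model per arm, fitted on the imputed effects.
tau1 = fit_linear(x[t], d_treated)
tau0 = fit_linear(x[~t], d_control)

# Stage 4: blend with the propensity g = P(T = 1); constant here by assumption.
g = t.mean()


def cate(features: np.ndarray) -> np.ndarray:
    """X-Learner estimate: g * tau0(x) + (1 - g) * tau1(x)."""
    return g * predict_linear(tau0, features) + (1.0 - g) * predict_linear(tau1, features)


estimates = cate(np.array([0.0, 0.5, 1.0]))
print(estimates)  # should land near the true effects 0.0, 0.25, 0.5
```

With g near 0.85, most of the weight goes to tau0, whose imputed effects depend on M1, the model trained on the large treated group.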
Polars over Pandas for 14M rows
Pandas GroupBy + multi-aggregate on 14M rows is bottlenecked by single-threaded execution and the Python GIL. Polars executes columnar operations in Rust with SIMD vectorization - 5-10× faster on large-N aggregations with no GIL contention. Combined with type downcasting (Float64→Float32, Int64→Int8), RAM drops ~50%, from ~1.7 GB to ~850 MB. The LightGBM native booster API is used directly instead of the sklearn wrapper for a similar reason: less Python overhead and cleaner serialization.
Knowledge Distillation for RTB
The X-Learner runs 5 LightGBM models sequentially, with a latency of 120 ms - completely incompatible with Real-Time Bidding's <1 ms constraint. Knowledge distillation is typically a deep learning technique; applying it to a causal meta-learner ensemble is non-obvious. The key insight: the X-Learner's CATE predictions are smooth continuous scores (not noisy raw labels), making them ideal "soft labels" for a student model. The Decision Tree student therefore fits the ensemble's smoothed output rather than raw binary outcomes.
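The distillation idea can be sketched as follows. Everything here is an assumption made for illustration: `teacher_cate` is a hypothetical smooth function standing in for the X-Learner's scores, the feature names `f0` and `f1` are placeholders, and the tiny hand-rolled regression tree exists only to keep the example dependency-light (a library implementation such as scikit-learn's `DecisionTreeRegressor` would be the usual choice).

```python
import numpy as np


def fit_tree(x: np.ndarray, target: np.ndarray, depth: int, min_leaf: int = 50) -> dict:
    """Greedy regression tree: at each node keep the split with the lowest squared error."""
    node = {"value": float(target.mean())}
    if depth == 0 or len(target) < 2 * min_leaf:
        return node
    best_sse = float(((target - target.mean()) ** 2).sum())
    for feature in range(x.shape[1]):
        for threshold in np.quantile(x[:, feature], np.linspace(0.1, 0.9, 9)):
            left = x[:, feature] <= threshold
            if left.sum() < min_leaf or (~left).sum() < min_leaf:
                continue
            sse = (((target[left] - target[left].mean()) ** 2).sum()
                   + ((target[~left] - target[~left].mean()) ** 2).sum())
            if sse < best_sse:
                best_sse = sse
                node.update(feature=feature, threshold=float(threshold))
    if "feature" in node:
        left = x[:, node["feature"]] <= node["threshold"]
        node["left"] = fit_tree(x[left], target[left], depth - 1, min_leaf)
        node["right"] = fit_tree(x[~left], target[~left], depth - 1, min_leaf)
    return node


def predict_one(node: dict, row: np.ndarray) -> float:
    """Inference is at most `depth` comparisons, which is what makes the student fast."""
    while "feature" in node:
        node = node["left"] if row[node["feature"]] <= node["threshold"] else node["right"]
    return node["value"]


def to_rules(node: dict, names: list, conditions: tuple = ()) -> list:
    """Flatten the tree into human-readable IF/THEN rules, one per leaf."""
    if "feature" not in node:
        return [f"IF {' AND '.join(conditions) or 'TRUE'} THEN cate = {node['value']:+.4f}"]
    name, thr = names[node["feature"]], node["threshold"]
    return (to_rules(node["left"], names, conditions + (f"{name} <= {thr:.2f}",))
            + to_rules(node["right"], names, conditions + (f"{name} > {thr:.2f}",)))


def teacher_cate(x: np.ndarray) -> np.ndarray:
    """Hypothetical teacher: a smooth score standing in for the X-Learner's CATE output."""
    return 0.02 * x[:, 0] - 0.01 * x[:, 1]


rng = np.random.default_rng(0)
features = rng.uniform(size=(5_000, 2))
teacher_scores = teacher_cate(features)           # soft labels, not binary outcomes

student = fit_tree(features, teacher_scores, depth=3)
student_scores = np.array([predict_one(student, row) for row in features])

residual = ((teacher_scores - student_scores) ** 2).sum()
total = ((teacher_scores - teacher_scores.mean()) ** 2).sum()
r2 = 1.0 - residual / total
rules = to_rules(student, ["f0", "f1"])

print(f"student R^2 against teacher: {r2:.2f}")
print("\n".join(rules))
```

The student is scored against the teacher's output, not against conversions: the goal is fidelity to the ensemble, with the rule list as a by-product.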
Profit reward signal in LinUCB
Reward = net profit: (conversion × $10) − ($0.10 if ad shown). Not CTR, not conversion probability, not lift. This is a consequential distinction: a CTR-maximizing bandit would still bid on Sure Things (they convert, reward looks good) and Lost Causes (rare conversions still appear). Only when the reward directly encodes economics - uplift × LTV minus cost - does the policy learn to target exclusively the profitable Persuadables.
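A compact sketch of the reward and of replay-style evaluation follows. The reward formula is the one stated above; the rest is assumed for illustration: a two-arm disjoint LinUCB (0 = no ad, 1 = show ad), synthetic logged data with an inflated conversion rate, and hypothetical names throughout. It is not the project's implementation.

```python
import numpy as np

LTV_USD = 10.0
AD_COST_USD = 0.10


def profit_reward(converted: bool, ad_shown: bool) -> float:
    """Net profit of one impression: conversion value minus ad spend."""
    return LTV_USD * converted - AD_COST_USD * ad_shown


class LinUCBArm:
    """Disjoint LinUCB arm: ridge regression plus an upper-confidence bonus."""

    def __init__(self, dim: int, alpha: float) -> None:
        self.alpha = alpha
        self.A = np.eye(dim)       # identity plus the sum of context outer products
        self.b = np.zeros(dim)     # sum of reward-weighted contexts

    def ucb(self, context: np.ndarray) -> float:
        a_inv = np.linalg.inv(self.A)
        theta = a_inv @ self.b
        return float(theta @ context + self.alpha * np.sqrt(context @ a_inv @ context))

    def update(self, context: np.ndarray, reward: float) -> None:
        self.A += np.outer(context, context)
        self.b += reward * context


def replay(contexts, logged_arms, converted, alpha: float = 1.0):
    """Replay evaluation: learn only from events where the policy agrees with the log."""
    arms = [LinUCBArm(contexts.shape[1], alpha) for _ in range(2)]
    total, matched = 0.0, 0
    for ctx, logged, conv in zip(contexts, logged_arms, converted):
        chosen = int(np.argmax([arm.ucb(ctx) for arm in arms]))
        if chosen != logged:
            continue               # counterfactual outcome is unknown, so skip the event
        reward = profit_reward(bool(conv), bool(logged))
        arms[chosen].update(ctx, reward)
        total += reward
        matched += 1
    return total, matched


# Synthetic log: intercept plus one feature, 85% of events treated at random.
rng = np.random.default_rng(0)
n = 2_000
contexts = np.column_stack([np.ones(n), rng.uniform(size=n)])
logged_arms = (rng.random(n) < 0.85).astype(int)
converted = rng.random(n) < 0.05   # hypothetical rate, far above the real one

total_profit, matched_events = replay(contexts, logged_arms, converted)
print(matched_events, round(total_profit, 2))
```

Note how the reward treats a Sure Thing: converting without an ad earns $10.00, converting with one earns $9.90, so showing the ad can only lose money on that user.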
α=0.001 for SRM instead of α=0.05
With 14M rows, statistical power is so high that even a 0.05% deviation in traffic split achieves p<0.05 trivially - a false alarm. Setting α=0.001 prevents nuisance alerts from harmless rounding in the traffic allocation system. This is documented explicitly in validation.py with a comment: "large N makes 0.05 too sensitive." The resulting SRM test is robust: our p=0.9989 is a genuine pass, not a pass-by-default from a lenient threshold.
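The effect of the threshold can be illustrated with a chi-square goodness-of-fit test. This is a sketch, not the contents of validation.py: the function name is hypothetical, and the user counts are made-up round numbers chosen to match the 14M total and 85/15 design rather than the dataset's actual counts.

```python
import math

ALPHA_SRM = 0.001       # threshold quoted in the text
ALPHA_DEFAULT = 0.05    # conventional threshold, for comparison


def srm_p_value(n_treatment: int, n_control: int, expected_treatment_share: float) -> float:
    """Chi-square goodness-of-fit test (1 degree of freedom) for sample ratio mismatch."""
    total = n_treatment + n_control
    expected_t = total * expected_treatment_share
    expected_c = total - expected_t
    chi2 = ((n_treatment - expected_t) ** 2 / expected_t
            + (n_control - expected_c) ** 2 / expected_c)
    # Survival function of a chi-square variable with 1 df: erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))


# Illustrative counts: a perfect split, and one that is off by 3,500 users.
p_exact = srm_p_value(11_900_000, 2_100_000, 0.85)
p_small_drift = srm_p_value(11_903_500, 2_096_500, 0.85)

print(f"exact split:  p = {p_exact:.4f}")
print(f"3,500 users off: flagged at 0.05? {p_small_drift < ALPHA_DEFAULT}, "
      f"flagged at 0.001? {p_small_drift < ALPHA_SRM}")
```

In this made-up case an imbalance of 3,500 users out of 14M (0.025 percentage points) would raise an alert at α=0.05 but not at α=0.001.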
Dynamic Experimentation Engine
Causal uplift pipeline on 14M Criteo records - X-Learner CATE estimation, LinUCB Bandit, 2,667× knowledge distillation.