Three ways class imbalance defeats the naive approach
In fraud detection, rare disease diagnosis, and ad-click prediction, minority classes are rare and hidden behind highly non-linear boundaries. Standard approaches fail at the architecture level.
The Numb Model
- Training on raw 92:8 data: model learns "always predict majority"
- Reports 96%+ accuracy, a completely misleading metric (always predicting the majority on a 92:8 split already scores 92%)
- Minority recall drops to ~0%: every fraud case missed
- AUPRC reveals the truth: 0.098 on concept-drifted holdout
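The failure mode in the list above can be reproduced in a few lines. This is a minimal sketch in plain Python, assuming a hypothetical 1,000-row label set at the 92:8 ratio and a degenerate model that always predicts the majority class; the figures are illustrative, not the project's measurements.

```python
# Hypothetical labels at a 92:8 ratio: 0 = majority, 1 = minority (e.g. fraud).
y_true = [0] * 920 + [1] * 80

# A "numb" model that always predicts the majority class.
y_pred = [0] * len(y_true)

# Accuracy: share of rows where the prediction matches the label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Minority recall: share of actual minority rows that were caught.
true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
actual_pos = sum(y_true)
minority_recall = true_pos / actual_pos

print(f"accuracy={accuracy:.2f}, minority recall={minority_recall:.2f}")
```

Accuracy stays high purely because of the class ratio, while recall on the class that matters is zero.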
SMOTE No-Man's-Land
- k-NN interpolates between minority samples geometrically
- Two separated minority clusters: SMOTE draws a line between them
- Generates physically impossible samples in majority-dominated space
- Latent space QA proves: synthetic points fall between clusters, not in them
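The bridging effect can be illustrated with a toy example. The sketch below is not a full SMOTE implementation: it assumes two hypothetical 2D minority clusters and applies only the interpolation step, to a pair of points whose neighbour relation happens to cross clusters (which can occur when k exceeds the size of a cluster).

```python
import math
import random

random.seed(0)

# Two hypothetical, well-separated minority clusters in 2D.
cluster_a = [(random.gauss(0, 0.5), random.gauss(0, 0.5)) for _ in range(20)]
cluster_b = [(random.gauss(10, 0.5), random.gauss(10, 0.5)) for _ in range(20)]
minority = cluster_a + cluster_b

def interpolate(p, q, lam):
    """SMOTE-style synthetic point on the segment between p and q."""
    return tuple(pi + lam * (qi - pi) for pi, qi in zip(p, q))

def nearest_distance(point, points):
    """Euclidean distance from point to its closest neighbour in points."""
    return min(math.dist(point, other) for other in points)

# When a sample's chosen neighbour lies in the other cluster, the
# synthetic point lands on the line between the two clusters.
synthetic = interpolate(cluster_a[0], cluster_b[0], lam=0.5)
gap = nearest_distance(synthetic, minority)
print(f"synthetic={synthetic}, distance to nearest real minority point={gap:.2f}")
```

The synthetic point sits several cluster widths away from every real minority sample, which is the "no-man's-land" the list describes.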
O(N log N) Bottleneck
- SMOTE k-NN search requires global distance computation
- At 1M rows: 1M × log2(1M) ≈ 1M × 20 = ~20M distance operations
- Cannot be sharded across distributed clusters (global state dependency)
- At production enterprise scale, SMOTE becomes computationally prohibitive
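The growth rates in the list above can be tabulated directly. This is a rough sketch that assumes a base-2 logarithm and ignores constant factors, so the numbers indicate scaling only, not measured operation counts.

```python
import math

def nlogn_steps(n):
    """Idealised N * log2(N) step count for a tree-based neighbour search."""
    return n * math.log2(n)

for n in (10_000, 1_000_000, 100_000_000):
    # Compare against a single linear pass over the same N rows.
    print(f"N={n:>11,}  N*log2(N) ~ {nlogn_steps(n):>16,.0f}  linear pass ~ {n:>11,}")
```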
Five-stage rejection sampling pipeline
Three parallel swimlanes (Baseline, SMOTE, Model-Driven) converge at the evaluation suite.
The architectural insight: data generation is a rejection sampling problem, not a geometry problem. By decoupling vectorized candidate generation (stateless, O(N)) from Oracle validation (inference, O(N)), the pipeline achieves embarrassingly parallel execution with zero cross-row state. Each Spark worker can independently generate, filter, and append - no shuffle required.
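The generate-then-filter split can be sketched in NumPy. Everything here is an assumption made for illustration: the proposal scheme (random convex combinations of minority pairs), the `oracle_proba` stand-in (a hand-written function in place of a trained Oracle), and the two-cluster toy data. Spark distribution is not shown; the point is only that each batch is processed without reference to any other batch.

```python
import numpy as np

def generate_candidates(minority, n_candidates, rng):
    """Vectorized, stateless proposal step: random convex combinations
    of minority pairs (an assumed proposal scheme)."""
    i = rng.integers(0, len(minority), size=n_candidates)
    j = rng.integers(0, len(minority), size=n_candidates)
    lam = rng.random((n_candidates, 1))
    return minority[i] + lam * (minority[j] - minority[i])

def oracle_proba(x):
    """Hypothetical stand-in for the trained Oracle, returning P(minority | x).
    It scores points highly only near two assumed cluster centres."""
    centres = np.array([[0.0, 0.0], [10.0, 10.0]])
    d2 = ((x[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2.min(axis=1) / 2.0)

def rejection_sample(minority, n_candidates, threshold, rng):
    """Keep only candidates the Oracle assigns to the minority class
    with probability above the threshold."""
    candidates = generate_candidates(minority, n_candidates, rng)
    keep = oracle_proba(candidates) > threshold
    return candidates[keep]

rng = np.random.default_rng(0)
minority = np.vstack([
    rng.normal(0.0, 0.5, size=(50, 2)),
    rng.normal(10.0, 0.5, size=(50, 2)),
])
accepted = rejection_sample(minority, 10_000, threshold=0.75, rng=rng)
print(f"accepted {len(accepted)} of 10000 candidates")
```

Candidates that fall in the gap between the two clusters receive a low Oracle probability and are discarded, so no neighbour search is needed to keep them out.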
Three swimlanes, one winner
Compare the algorithm, pseudocode, and performance profile for each approach. The AUPRC is measured on a completely unseen concept-drifted holdout, a deliberately hard evaluation.
Final scorecard across all dimensions
AUPRC is the correct metric for highly imbalanced data. Accuracy and ROC-AUC mask majority-class bias. The model-driven approach wins on every axis that matters for production.
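A small constructed ranking shows how ROC-AUC can look strong while AUPRC stays low. The sketch below uses plain Python with hand-written metric functions (a pairwise ROC-AUC and a step-wise average precision, one common way to estimate AUPRC); the labels and scores are synthetic and chosen so that 50 negatives outrank all 10 positives among 1,000 negatives.

```python
def roc_auc(y_true, scores):
    """Probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(y_true, scores):
    """Mean of precision values taken at the rank of each positive."""
    ranked = sorted(zip(scores, y_true), reverse=True)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits

# Synthetic ranking: 50 false alarms first, then the 10 positives,
# then the remaining 950 negatives.
labels = [0] * 50 + [1] * 10 + [0] * 950
scores = [1 - rank / len(labels) for rank in range(len(labels))]

auc = roc_auc(labels, scores)
ap = average_precision(labels, scores)
print(f"ROC-AUC={auc:.3f}, average precision={ap:.3f}")
```

Fifty false alarms barely dent ROC-AUC because they are a small share of the 1,000 negatives, but they dominate the top of the ranking, which is what precision measures.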
Five deliberate architectural choices
Each decision is grounded in a specific tradeoff between data quality, compute cost, and production readiness at enterprise scale.
Rejection Sampling over Geometry
SMOTE requires global k-NN distance computation: O(N log N), blocking distributed execution. Rejection sampling reduces this to a per-batch Oracle inference pass: O(N) with zero cross-row state. This single change transforms an inherently sequential algorithm into an embarrassingly parallel one shardable to a trillion-row Spark cluster.
H2O AutoML as Oracle
A single LightGBM model might miss subtle boundary details. H2O AutoML evaluates GBM, XGBoost, DeepLearning, and Stacked Ensembles, automatically selecting the best boundary approximation. The richer the Oracle, the tighter the quality gate. Computational overhead is paid once at Oracle training, not at generation time.
P > 0.75 Confidence Threshold
A lower threshold (e.g., 0.5) accepts more candidates but admits samples near the decision boundary with uncertain class assignment. A higher threshold (e.g., 0.90) produces tight manifold samples but generates far fewer, potentially starving the training set. 0.75 was tuned to maximize generalization AUPRC on the concept-drift holdout.
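The yield side of this trade-off can be illustrated with synthetic scores. The sketch below assumes hypothetical Oracle probabilities drawn from a Beta(2, 2) distribution, which is a placeholder and not the project's score distribution; it shows only how the acceptance rate shrinks as the threshold rises, not the effect on AUPRC.

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical Oracle probabilities for 100,000 candidates (synthetic, not real outputs).
scores = rng.beta(2.0, 2.0, size=100_000)

target = 10_000  # accepted samples we would like to end up with
acceptance = {}
for threshold in (0.50, 0.75, 0.90):
    # Fraction of candidates that clear this confidence threshold.
    rate = float((scores > threshold).mean())
    acceptance[threshold] = rate
    # Candidates that must be generated to reach the target at this rate.
    needed = int(np.ceil(target / rate))
    print(f"P > {threshold:.2f}: accept {rate:.1%}, "
          f"~{needed:,} candidates for {target:,} samples")
```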
PyTorch Autoencoder for Manifold QA
F1 and AUPRC can improve even when synthetic data is structurally flawed. The PyTorch Autoencoder (40D → 256 → 128 → 64 → 8D) provides a structural check that does not depend on the downstream classifier: if a t-SNE projection of the 8D latent codes shows synthetic samples filling the correct clusters rather than bridging the gaps between them, the data is structurally consistent with the real minority manifold, whatever the downstream metrics say.
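The layer widths described above can be sketched as a PyTorch module. Only the encoder widths (40, 256, 128, 64, 8) come from the text; the class name `ManifoldAutoencoder`, the ReLU activations, the mirrored decoder, and the MSE reconstruction loss are assumptions made for illustration. Training and the t-SNE step are omitted.

```python
import torch
from torch import nn

class ManifoldAutoencoder(nn.Module):
    """Hypothetical autoencoder with a 40 -> 256 -> 128 -> 64 -> 8 encoder."""

    def __init__(self, in_dim=40, latent_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, latent_dim),
        )
        # Mirrored decoder: an assumption, the text lists only encoder widths.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, 256), nn.ReLU(),
            nn.Linear(256, in_dim),
        )

    def forward(self, x):
        z = self.encoder(x)          # 8D latent code used for the QA projection
        return self.decoder(z), z    # reconstruction and latent code

torch.manual_seed(0)
model = ManifoldAutoencoder()
batch = torch.randn(16, 40)  # placeholder features, not real data
reconstruction, latent = model(batch)
loss = nn.functional.mse_loss(reconstruction, batch)
print(latent.shape, reconstruction.shape)
```

In a QA workflow along these lines, real and synthetic rows would be passed through the same trained encoder and their latent codes projected together for visual comparison.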
Vectorized NumPy Batch Generation
Naive candidate generation using a Python for-loop over minority pairs is dramatically slower due to interpreter overhead and poor cache locality. Vectorized NumPy batch operations process all candidate pairs simultaneously in native C, maximizing CPU cache utilization. Combined with the embarrassingly parallel architecture, this scales linearly to any dataset size.
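The two styles can be compared side by side. The sketch below assumes the same interpolation-based proposal for both versions and a hypothetical 500 × 40 minority matrix; it checks only that the loop and the vectorized form produce identical candidates, and makes no timing claim.

```python
import numpy as np

def candidates_loop(minority, i, j, lam):
    """Row-by-row generation in a Python for-loop."""
    out = np.empty((len(i), minority.shape[1]))
    for row in range(len(i)):
        a, b = minority[i[row]], minority[j[row]]
        out[row] = a + lam[row] * (b - a)
    return out

def candidates_vectorized(minority, i, j, lam):
    """Same computation as one batched expression via fancy indexing
    and broadcasting."""
    a, b = minority[i], minority[j]
    return a + lam[:, None] * (b - a)

rng = np.random.default_rng(0)
minority = rng.normal(size=(500, 40))   # placeholder minority rows
n = 20_000
i = rng.integers(0, len(minority), size=n)
j = rng.integers(0, len(minority), size=n)
lam = rng.random(n)

slow = candidates_loop(minority, i, j, lam)
fast = candidates_vectorized(minority, i, j, lam)
print("identical output:", np.allclose(slow, fast))
```

Because the vectorized version has no per-row state, the same function can be applied unchanged to each partition of a larger dataset.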
O(N log N) to O(N). The geometry replaced by inference.
The latent space visualizations show manifold alignment that F1 scores alone cannot.