Three unsolved problems at once
A cosmetics retailer has 109M behavioral events but lacks the infrastructure to act on them. No existing solution fits the constraints: commodity hardware, zero cloud budget, production-quality analytics.
OOM Crash on Load
- pd.read_csv() on 14.7 GB CSV crashes immediately
- Default Pandas types: 120 GB in-memory requirement
- UUID session strings: ~80 bytes per Python object
- 16 GB laptop cannot load the raw data at all
Random Targeting Waste
- Baseline conversion rate: 8.03% (random targeting)
- Marketing budget spent on non-buyers uniformly
- No propensity scoring or behavioral signal used
- Every user treated as equal conversion probability
No Analytics Infrastructure
- Cloud warehouse: $5-50 per TB processed (BigQuery)
- No RFM segmentation to identify high-value users
- No churn early warning (65% Week-1 churn undetected)
- No product recommendation or market basket analysis
Five-stage data pipeline
Raw CSV to live analytics dashboard: each stage produces a versioned artifact that feeds the next. Click any stage to expand the implementation detail.
Full pipeline runtime: ~15 minutes on a single 16 GB machine. Market basket: 10M+ product pairs in 90 seconds via DuckDB self-join. Dashboard queries: under 1 second via pre-aggregated dimensional tables. Total infrastructure cost: $0.
From 120 GB crash to 3.7 GB
Step through the four optimization layers that turned a guaranteed MemoryError into sub-second interactive analytics. Each step is quantified with before/after numbers.
4.5x lift from temporal propensity scoring
LightGBM trained on October behavior to predict November purchases. Temporal split prevents leakage. Top-5% threshold driven by campaign budget constraints, not accuracy optimization.
Can't Lose Them: R=1 (long inactive) but F>=4 (historically high-frequency buyers, avg spend over $900). Without behavioral analytics, these users are indistinguishable from churned users. RFM surfaces them explicitly for high-priority reactivation campaigns.
Five deliberate choices with documented tradeoffs
Each decision is explicitly justified in the codebase. Engineers who explain tradeoffs are more hireable than engineers who only explain what they built.
DuckDB over PySpark
PySpark adds 30-second JVM startup and cluster overhead for zero performance gain at 100M rows. DuckDB matches Spark on single-node workloads via vectorized SIMD execution, costs $0, and ships as a single pip install. The decision is documented inside the dashboard itself.
Polars Lazy Evaluation
Polars lazy API pushes query predicates down to the scan layer. scan_parquet reads only the columns needed. sink_parquet writes row groups without holding the full output in RAM. The 109M-row transformation fits in a single-pass stream on a 16 GB machine.
UUID Categorical Encoding
UUID strings as Python objects cost ~80 bytes each. Cast to pl.Categorical, they become 4-byte integer indices into a shared dictionary. With 15 million unique sessions across 109M rows, this single transformation saves 1.14 GB and is the difference between crashing and running.
Temporal Train/Test Split
A random split would leak November behavior (future data) into October training features, inflating AUC artificially. Temporal split mirrors real deployment: train on historical behavior, predict future actions. Harder evaluation, but the resulting 4.5x lift is genuine and production-realistic.
Top-5% Threshold over 0.5
A 0.5 probability threshold optimizes accuracy, but ignores marketing budget reality. Targeting the top 5% of users by propensity score is a business constraint: limited campaign budget. At this threshold, conversion jumps to 36.6% vs 8% baseline, and every user contacted has high expected return.
109.9M events. $0 infrastructure.
Explore the full dashboard or read the source code. All analytics queries run in under one second.