Data Engineering + ML

Customer IntelligencePlatform

109.9M behavioral events processed on a 16 GB laptop. 97% memory reduction via layered optimization. 4.5x conversion lift from a LightGBM propensity model. Zero cloud infrastructure.

0.0M
events processed
on a 16 GB laptop
0.0x
conversion lift
8% to 36.6% top-5%
0.0M
Events Processed
Oct + Nov 2019
0%
Memory Reduction
120 GB to 3.7 GB
0.0x
Conversion Lift
Top-5% propensity segment
$0
Infrastructure Cost
Single machine, no cloud
The Problem

Three unsolved problems at once

A cosmetics retailer has 109M behavioral events but lacks the infrastructure to act on them. No existing solution fits the constraints: commodity hardware, zero cloud budget, production-quality analytics.

OOM Crash on Load

  • pd.read_csv() on 14.7 GB CSV crashes immediately
  • Default Pandas types: 120 GB in-memory requirement
  • UUID session strings: ~80 bytes per Python object
  • 16 GB laptop cannot load the raw data at all

Random Targeting Waste

  • Baseline conversion rate: 8.03% (random targeting)
  • Marketing budget spent on non-buyers uniformly
  • No propensity scoring or behavioral signal used
  • Every user treated as equal conversion probability

No Analytics Infrastructure

  • Cloud warehouse: $5-50 per TB processed (BigQuery)
  • No RFM segmentation to identify high-value users
  • No churn early warning (65% Week-1 churn undetected)
  • No product recommendation or market basket analysis
Architecture

Five-stage data pipeline

Raw CSV to live analytics dashboard: each stage produces a versioned artifact that feeds the next. Click any stage to expand the implementation detail.

S1
Polars Optimization PipelinePolars
Optimized Parquet: 14.7 GB CSV to 1.9 GB (87% compression)
S2
DuckDB Ingestion + OLAP ConfigurationDuckDB SQL
behavior.duckdb: 5.44 GB persistent columnar database, 109.9M rows
S3
Dimensional Modeling + RFM SegmentationSQL
dim_users (3M), dim_products (100K), fact_sessions (15M), fact_daily_kpis (61), rfm_segments (700K)
S4
ML Engine: Propensity + Market BasketLightGBM
propensity_model.pkl + 10M+ product affinity rules
S5
Streamlit Dashboard (7 Pages)Streamlit
Live analytics app with cloud/local auto-detection, Streamlit Cloud deployed

Full pipeline runtime: ~15 minutes on a single 16 GB machine. Market basket: 10M+ product pairs in 90 seconds via DuckDB self-join. Dashboard queries: under 1 second via pre-aggregated dimensional tables. Total infrastructure cost: $0.

Memory Engineering

From 120 GB crash to 3.7 GB

Step through the four optimization layers that turned a guaranteed MemoryError into sub-second interactive analytics. Each step is quantified with before/after numbers.

Step 0: The Problem
120 GB
in-memory footprint
df = pd.read_csv("2019-Oct.csv")
# MemoryError: Unable to allocate ~120 GB
#
# Default types:
#   Int64:   8 bytes / value
#   Float64: 8 bytes / value
#   str UUID: ~80 bytes / object

Default Pandas types are profligate. Int64 and Float64 use 8 bytes per value when 4 bytes suffice. UUID strings stored as Python objects cost ~80 bytes each. With 109M rows across 9 columns, the memory footprint blows past 120 GB before a single query runs. The process crashes with MemoryError.

Memory Footprint
Naive Pandas120 GB
Type Casting60 GB
UUID Categorical3.7 GB
DuckDB + ZSTD3.7 GB
BEFORE
109M rows, 9 cols, default types
AFTER
Process crash: MemoryError on read
Cumulative:
120 GB
Naive Pandas
60 GB
Type casting
3.7 GB
UUID categorical
<1s queries
DuckDB OLAP
ML Engine + Segmentation

4.5x lift from temporal propensity scoring

LightGBM trained on October behavior to predict November purchases. Temporal split prevents leakage. Top-5% threshold driven by campaign budget constraints, not accuracy optimization.

Random targeting baseline
8.03%
conversion rate
AI top-5% segment
36.62%
conversion rate
4.5x
per-1000-users: 80 conversions vs 366
Feature Importance (LightGBM gain, top 5)
oct_eventsTotal behavioral intensity in October
631,380
active_span_daysDuration of engagement window
436,977
oct_cartsAdd-to-cart count (strong intent signal)
411,642
oct_viewsProduct view frequency
222,948
recency_octDays since last event (freshness)
155,719
RFM Segmentation: 700K Buyers, 8 Strategic Clusters
Champions
R >= 4, F >= 4
Cross-sell
Highest CLV
Loyal Customers
R >= 3, F >= 3
Loyalty program
Consistent buyers
New Customers
R >= 4, F = 1
Onboarding
Recent first buy
Promising
R >= 3, F = 1
Nurture
Recent browsers
Need Attention
R = 2, F >= 2
Win-back
Declining signal
Can't Lose Them
R = 1, F >= 4
Reactivation
Avg spend >$900
Hibernating
R = 1, F <= 2
Low investment
Long inactive
At Risk
else
Targeted discount
Mixed signals
!

Can't Lose Them: R=1 (long inactive) but F>=4 (historically high-frequency buyers, avg spend over $900). Without behavioral analytics, these users are indistinguishable from churned users. RFM surfaces them explicitly for high-priority reactivation campaigns.

Engineering Decisions

Five deliberate choices with documented tradeoffs

Each decision is explicitly justified in the codebase. Engineers who explain tradeoffs are more hireable than engineers who only explain what they built.

DuckDB over PySpark

In-process columnar OLAP at zero cost

PySpark adds 30-second JVM startup and cluster overhead for zero performance gain at 100M rows. DuckDB matches Spark on single-node workloads via vectorized SIMD execution, costs $0, and ships as a single pip install. The decision is documented inside the dashboard itself.

Polars Lazy Evaluation

scan_parquet + sink_parquet = streaming with no materialization

Polars lazy API pushes query predicates down to the scan layer. scan_parquet reads only the columns needed. sink_parquet writes row groups without holding the full output in RAM. The 109M-row transformation fits in a single-pass stream on a 16 GB machine.

UUID Categorical Encoding

The key insight: 80 bytes to 4 bytes per session ID

UUID strings as Python objects cost ~80 bytes each. Cast to pl.Categorical, they become 4-byte integer indices into a shared dictionary. With 15 million unique sessions across 109M rows, this single transformation saves 1.14 GB and is the difference between crashing and running.

Temporal Train/Test Split

October features predict November purchases

A random split would leak November behavior (future data) into October training features, inflating AUC artificially. Temporal split mirrors real deployment: train on historical behavior, predict future actions. Harder evaluation, but the resulting 4.5x lift is genuine and production-realistic.

Top-5% Threshold over 0.5

Business constraint drives threshold choice

A 0.5 probability threshold optimizes accuracy, but ignores marketing budget reality. Targeting the top 5% of users by propensity score is a business constraint: limited campaign budget. At this threshold, conversion jumps to 36.6% vs 8% baseline, and every user contacted has high expected return.

Live on Streamlit Cloud

109.9M events. $0 infrastructure.

Explore the full dashboard or read the source code. All analytics queries run in under one second.