Data Engineering + ML

Customer IntelligencePlatform

109.9M behavioral events processed on a 16 GB laptop. 97% memory reduction via layered optimization. 4.5x conversion lift from a LightGBM propensity model. Zero cloud infrastructure.

0.0M

events processed

on a 16 GB laptop

0.0x

conversion lift

8% to 36.6% top-5%

GitHub Live Demo

0.0M

Events Processed

Oct + Nov 2019

Memory Reduction

120 GB to 3.7 GB

0.0x

Conversion Lift

Top-5% propensity segment

Infrastructure Cost

Single machine, no cloud

The Problem

Three unsolved problems at once

A cosmetics retailer has 109M behavioral events but lacks the infrastructure to act on them. No existing solution fits the constraints: commodity hardware, zero cloud budget, production-quality analytics.

OOM Crash on Load

pd.read_csv() on 14.7 GB CSV crashes immediately
Default Pandas types: 120 GB in-memory requirement
UUID session strings: ~80 bytes per Python object
16 GB laptop cannot load the raw data at all

Random Targeting Waste

Baseline conversion rate: 8.03% (random targeting)
Marketing budget spent on non-buyers uniformly
No propensity scoring or behavioral signal used
Every user treated as equal conversion probability

No Analytics Infrastructure

Cloud warehouse: $5-50 per TB processed (BigQuery)
No RFM segmentation to identify high-value users
No churn early warning (65% Week-1 churn undetected)
No product recommendation or market basket analysis

Architecture

Five-stage data pipeline

Raw CSV to live analytics dashboard: each stage produces a versioned artifact that feeds the next. Click any stage to expand the implementation detail.

Polars Optimization PipelinePolars

Optimized Parquet: 14.7 GB CSV to 1.9 GB (87% compression)

DuckDB Ingestion + OLAP ConfigurationDuckDB SQL

behavior.duckdb: 5.44 GB persistent columnar database, 109.9M rows

Dimensional Modeling + RFM SegmentationSQL

dim_users (3M), dim_products (100K), fact_sessions (15M), fact_daily_kpis (61), rfm_segments (700K)

ML Engine: Propensity + Market BasketLightGBM

propensity_model.pkl + 10M+ product affinity rules

Streamlit Dashboard (7 Pages)Streamlit

Live analytics app with cloud/local auto-detection, Streamlit Cloud deployed

Full pipeline runtime: ~15 minutes on a single 16 GB machine. Market basket: 10M+ product pairs in 90 seconds via DuckDB self-join. Dashboard queries: under 1 second via pre-aggregated dimensional tables. Total infrastructure cost: $0.

Memory Engineering

From 120 GB crash to 3.7 GB

Step through the four optimization layers that turned a guaranteed MemoryError into sub-second interactive analytics. Each step is quantified with before/after numbers.

Step 0: The Problem

120 GB

in-memory footprint

df = pd.read_csv("2019-Oct.csv")
# MemoryError: Unable to allocate ~120 GB
#
# Default types:
#   Int64:   8 bytes / value
#   Float64: 8 bytes / value
#   str UUID: ~80 bytes / object

Default Pandas types are profligate. Int64 and Float64 use 8 bytes per value when 4 bytes suffice. UUID strings stored as Python objects cost ~80 bytes each. With 109M rows across 9 columns, the memory footprint blows past 120 GB before a single query runs. The process crashes with MemoryError.

Memory Footprint

Naive Pandas120 GB

Type Casting60 GB

UUID Categorical3.7 GB

DuckDB + ZSTD3.7 GB

BEFORE

109M rows, 9 cols, default types

AFTER

Process crash: MemoryError on read

Cumulative:

120 GB

Naive Pandas

›

60 GB

Type casting

›

3.7 GB

UUID categorical

›

<1s queries

DuckDB OLAP

ML Engine + Segmentation

4.5x lift from temporal propensity scoring

LightGBM trained on October behavior to predict November purchases. Temporal split prevents leakage. Top-5% threshold driven by campaign budget constraints, not accuracy optimization.

Random targeting baseline

8.03%

conversion rate

›

AI top-5% segment

36.62%

conversion rate

4.5x

per-1000-users: 80 conversions vs 366

Feature Importance (LightGBM gain, top 5)

oct_eventsTotal behavioral intensity in October

631,380

active_span_daysDuration of engagement window

436,977

oct_cartsAdd-to-cart count (strong intent signal)

411,642

oct_viewsProduct view frequency

222,948

recency_octDays since last event (freshness)

155,719

RFM Segmentation: 700K Buyers, 8 Strategic Clusters

Champions

R >= 4, F >= 4

Cross-sell

Highest CLV

Loyal Customers

R >= 3, F >= 3

Loyalty program

Consistent buyers

New Customers

R >= 4, F = 1

Onboarding

Recent first buy

Promising

R >= 3, F = 1

Nurture

Recent browsers

Need Attention

R = 2, F >= 2

Win-back

Declining signal

Can't Lose Them

R = 1, F >= 4

Reactivation

Avg spend >$900

Hibernating

R = 1, F <= 2

Low investment

Long inactive

At Risk

else

Targeted discount

Mixed signals

Can't Lose Them: R=1 (long inactive) but F>=4 (historically high-frequency buyers, avg spend over $900). Without behavioral analytics, these users are indistinguishable from churned users. RFM surfaces them explicitly for high-priority reactivation campaigns.

Engineering Decisions

Five deliberate choices with documented tradeoffs

Each decision is explicitly justified in the codebase. Engineers who explain tradeoffs are more hireable than engineers who only explain what they built.

DuckDB over PySpark

In-process columnar OLAP at zero cost

PySpark adds 30-second JVM startup and cluster overhead for zero performance gain at 100M rows. DuckDB matches Spark on single-node workloads via vectorized SIMD execution, costs $0, and ships as a single pip install. The decision is documented inside the dashboard itself.

Polars Lazy Evaluation

scan_parquet + sink_parquet = streaming with no materialization

Polars lazy API pushes query predicates down to the scan layer. scan_parquet reads only the columns needed. sink_parquet writes row groups without holding the full output in RAM. The 109M-row transformation fits in a single-pass stream on a 16 GB machine.

UUID Categorical Encoding

The key insight: 80 bytes to 4 bytes per session ID

UUID strings as Python objects cost ~80 bytes each. Cast to pl.Categorical, they become 4-byte integer indices into a shared dictionary. With 15 million unique sessions across 109M rows, this single transformation saves 1.14 GB and is the difference between crashing and running.

Temporal Train/Test Split

October features predict November purchases

A random split would leak November behavior (future data) into October training features, inflating AUC artificially. Temporal split mirrors real deployment: train on historical behavior, predict future actions. Harder evaluation, but the resulting 4.5x lift is genuine and production-realistic.

Top-5% Threshold over 0.5

Business constraint drives threshold choice

A 0.5 probability threshold optimizes accuracy, but ignores marketing budget reality. Targeting the top 5% of users by propensity score is a business constraint: limited campaign budget. At this threshold, conversion jumps to 36.6% vs 8% baseline, and every user contacted has high expected return.

Live on Streamlit Cloud

109.9M events. $0 infrastructure.

Explore the full dashboard or read the source code. All analytics queries run in under one second.

Open Live Demo View Source All Projects