Data Engineering + MLOps

Fitness Analytics Platform

PySpark ETL pipeline processing 358K+ fitness records into a partitioned Parquet data lake. Three Scikit-Learn models for user segmentation, activity classification, and calorie prediction. A 3 GB model compressed to under 50 MB for serverless deployment.

358K+
Records Processed
PySpark ETL, 1,959 users, 6 months
84%
Classification Accuracy
7 activity types via Random Forest
0.91
R² Calorie Regression
RMSE = 131 calories
98%
Model Size Reduction
3 GB (OOM crash) to <50 MB deployed via pruning and compression, R²=0.91 maintained
The Problem

Three hard constraints at once

Raw fitness tracker telemetry is useless without infrastructure to process it, models to interpret it, and a deployment strategy that fits within the memory ceiling of serverless cloud platforms.

OOM Deployment Crash

  • Unoptimized RF Regressor serializes to ~3 GB
  • Streamlit Cloud enforces a hard 1 GB RAM ceiling
  • Direct deployment results in immediate MemoryError
  • GitHub blocks files over 100 MB without LFS overhead

Pandas Does Not Scale

  • 358K records across 6 months and 1,959 users
  • Pandas loads entire dataset into memory for each operation
  • No partition pruning: every query scans all records
  • ETL scripts become brittle and environment-dependent

Generic Health Insights

  • Raw step counts and heart rates are not actionable
  • No user segmentation: every user receives identical advice
  • No activity recognition: calorie estimates cannot account for activity type
  • No behavioral pattern extraction for product personalization
Architecture

Five-stage production pipeline

Data generation through containerized cloud deployment: each stage outputs a versioned artifact that feeds the next.

S1: Synthetic Data Generation (Python)
358,497 records across 1,959 users (6-month simulation)
S2: PySpark ETL Pipeline + Feature Engineering (PySpark)
Partitioned Parquet data lake (year/month) with derived features
S3: Model Training, 3 Models (sklearn)
K-Means (k=5 clusters), RF Classifier (84%), RF Regressor (R²=0.91)
S4: Cloud Model Optimization (joblib)
<50 MB model artifacts (from 3 GB baseline): 98% size reduction
S5: Streamlit Serving Layer, Docker + CI (Docker)
Multi-page Streamlit dashboard, containerized, tested, deployed to Streamlit Cloud

Production discipline: the PySpark ETL uses full-overwrite semantics on partitions, making every run idempotent. GitHub Actions runs Pytest on every commit. Docker locks the JVM and Python environment. The entire pipeline from raw data to deployed dashboard can be reproduced by any engineer with a single docker run command.

MLOps Engineering

From 3 GB OOM crash to 50 MB deployment

Step through the compression pipeline that turned an undeployable model into a production artifact. Each optimization is quantified with size and accuracy impact.

Step 0: The Deployment Crash
3.0 GB
serialized model size
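# Default Random Forest Regressor
import joblib
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(
    n_estimators=100,  # default
    max_depth=None,    # unbounded: each tree grows until its leaves are pure
)
rf.fit(X_train, y_train)  # training split of the 358K records (prepared elsewhere)
joblib.dump(rf, "model.pkl")

# Serialized size: ~3 GB
# Streamlit Cloud RAM limit: 1 GB
# Result: MemoryError on deployment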

An unoptimized Random Forest on 358K records with unlimited tree depth and 100 estimators serializes to approximately 3 GB. With no depth limit, each tree keeps splitting until its leaves are nearly pure, which on a continuous calorie target can mean hundreds of thousands of nodes per tree, and every internal node and leaf is written to the serialized file. Streamlit Cloud enforces a hard 1 GB RAM ceiling, so the first deployment attempt crashed with a MemoryError before the dashboard could render.

Model Footprint

  • Unoptimized RF: 3.0 GB (~3,072 MB serialized)
  • Hyperparameter Pruning: 200 MB
  • Joblib + Memory Routing: 45 MB
  • Streamlit Cloud hard limit: 1,024 MB

Reduction path: ~3 GB baseline (OOM crash on deployment, 1 GB limit exceeded) → ~200 MB after pruning → <50 MB after compression (deployed, R²=0.91 maintained)
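The percentages quoted on this page follow directly from the sizes in the footprint list. A small arithmetic check, assuming 3 GB is taken as 3,072 MB and using the 45 MB chart figure for the compressed artifact:

```python
# Sizes quoted on this page, in MB (3 GB taken as 3,072 MB).
BASELINE_MB = 3072
AFTER_PRUNING_MB = 200
AFTER_COMPRESSION_MB = 45
STREAMLIT_LIMIT_MB = 1024


def reduction_pct(before_mb: float, after_mb: float) -> float:
    """Percentage of the original size that was removed."""
    return 100.0 * (before_mb - after_mb) / before_mb


pruning_pct = reduction_pct(BASELINE_MB, AFTER_PRUNING_MB)
total_pct = reduction_pct(BASELINE_MB, AFTER_COMPRESSION_MB)

print(f"pruning alone:         {pruning_pct:.1f}%")  # 93.5%
print(f"pruning + compression: {total_pct:.1f}%")    # 98.5%
print(f"fits under 1 GB limit: {AFTER_COMPRESSION_MB < STREAMLIT_LIMIT_MB}")
```

This is where the roughly 93% (pruning only) and 98% (pruning plus compression) figures come from.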
ML Engine

Three models, three different problems

Each model solves a distinct business problem using Scikit-Learn pipelines that bundle all preprocessing steps, eliminating training-serving skew at the architecture level.

User Segmentation
K-Means Clustering
k=5 user clusters
Input: aggregated behavioral metrics (avg steps, heart rate, calories)
Validated via silhouette score + elbow-method inertia plots
Output: 5 distinct health personas for targeted UI/UX
3D interactive scatter in dashboard: Recency vs Frequency vs Monetary-analog
Activity Detection
Random Forest Classifier
84% accuracy across 7 activities
Pipeline: SimpleImputer → StandardScaler → RandomForestClassifier
Detects: Swimming, Cycling, Gym Workout + 4 other activity types
Chosen for robustness to non-linear biometric interactions
No feature scaling sensitivity: unlike scale-sensitive SVM/KNN
Calorie Prediction
Random Forest Regressor
0.91 R² score · RMSE = 131 cal
Pipeline: ColumnTransformer (numeric + OneHotEncoder) → RF Regressor
Captures complex heart rate × activity type × step interactions
Optimized for cloud: max_depth=12, n_estimators=30, compress=3
Zero training-serving skew: preprocessing bundled inside Pipeline object
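
The silhouette and elbow validation named for the segmentation model can be sketched as follows. This is a minimal illustration on synthetic data, not the project's code: the feature ranges, the StandardScaler step, and the k range of 2 to 8 are all assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in for per-user aggregates: avg steps, avg heart rate, avg calories.
# Five loose groups are generated so the sketch has structure to find.
centers = rng.uniform([2000, 60, 1500], [15000, 110, 3500], size=(5, 3))
users = np.vstack([c + rng.normal(0, [500, 3, 100], size=(100, 3)) for c in centers])

# K-Means is distance-based, so features are standardized first.
X = StandardScaler().fit_transform(users)

results = {}
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    # inertia_ feeds the elbow plot; silhouette_score measures cluster separation.
    results[k] = (km.inertia_, silhouette_score(X, km.labels_))

for k, (inertia, sil) in results.items():
    print(f"k={k}  inertia={inertia:10.1f}  silhouette={sil:.3f}")
```

Plotting inertia against k gives the elbow curve, and the silhouette column offers a second opinion on which k separates the users most cleanly.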

Zero training-serving skew: all preprocessing steps (SimpleImputer, StandardScaler, OneHotEncoder) are bundled inside Scikit-Learn Pipeline objects. The exact transformation graph applied at training is structurally guaranteed to be applied at inference. No manual re-implementation of preprocessing in the serving layer.
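
A minimal sketch of this bundling for the calorie regressor, assuming a hypothetical three-column schema and synthetic data. The column names, the median imputer on the numeric columns, and the target formula are illustrative; only max_depth=12, n_estimators=30, and compress=3 come from this page.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
n = 500

# Hypothetical schema: the column names are illustrative only.
df = pd.DataFrame({
    "steps": rng.integers(0, 30000, n).astype(float),
    "heart_rate": rng.uniform(60, 200, n),
    "activity_type": rng.choice(["Swimming", "Cycling", "Gym Workout"], n),
})
y = 0.04 * df["steps"] + 3.0 * df["heart_rate"] + rng.normal(0, 20, n)

preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["steps", "heart_rate"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["activity_type"]),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("rf", RandomForestRegressor(n_estimators=30, max_depth=12, random_state=0)),
])
model.fit(df, y)

# Preprocessing and model are serialized as one artifact and reloaded together.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "calorie_pipeline.pkl")
    joblib.dump(model, path, compress=3)
    served = joblib.load(path)

# The serving side passes raw columns; the pipeline applies the fitted transforms.
raw_row = pd.DataFrame([{"steps": 8000.0, "heart_rate": 120.0, "activity_type": "Cycling"}])
print(served.predict(raw_row))
```

Because the serving layer only ever calls predict on the reloaded Pipeline, there is no second copy of the encoding logic that could drift from training.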

Engineering Decisions

Five deliberate choices with documented tradeoffs

Each decision is grounded in the specific constraints of the system: memory limits, scale requirements, and production deployment reliability.

PySpark over Pandas

Predicate pushdown + year/month partition pruning

Pandas holds the full dataset in memory for every operation. PySpark with Parquet partitioning allows downstream ML loaders and analytics queries to scan only the required year/month slices. At 358K records this is a latency optimization; at terabyte scale it becomes a necessity. Parquet's columnar format also typically compresses far better than CSV.
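
A sketch of the partitioned write and a pruned read, assuming a Spark DataFrame that already carries integer `year` and `month` columns. The function names and column names are hypothetical; the functions take the DataFrame and SparkSession as arguments, so the snippet itself needs no imports.

```python
def write_partitioned(df, lake_path: str) -> None:
    """Write a Spark DataFrame as Parquet partitioned by year/month.

    Assumes `df` is a pyspark.sql.DataFrame with integer `year` and
    `month` columns (hypothetical column names).
    """
    (
        df.write
        .mode("overwrite")             # full overwrite keeps reruns idempotent
        .partitionBy("year", "month")  # one directory per year=/month= pair
        .parquet(lake_path)
    )


def read_month(spark, lake_path: str, year: int, month: int):
    """Read one year/month slice.

    Filtering on the partition columns lets Spark skip the other
    partition directories instead of scanning every record.
    """
    return (
        spark.read.parquet(lake_path)
        .where(f"year = {int(year)} AND month = {int(month)}")
    )
```

A downstream loader would call something like `read_month(spark, lake_path, 2024, 3)` and only touch the files under that one partition directory.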

Random Forest Selection

Non-linear boundary capture, immune to feature scale

Fitness metrics have fundamentally non-linear interactions: 500 calories over 2,000 steps is categorically different from 500 calories over 10,000 steps. Random Forests capture these interaction effects naturally without requiring polynomial feature expansion. Unlike scale-sensitive algorithms (SVM, KNN), RF splits on one feature threshold at a time, so it is unaffected by scale discrepancies between heart rate (60-200 bpm) and step counts (0-30,000).
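
The scale claim can be checked with a small experiment: fit the same forest on raw features and on a copy where step counts are expressed in a much smaller unit, then compare predictions. The data here is synthetic and the hyperparameters are arbitrary choices for the sketch.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(7)
n = 400

# Synthetic stand-ins: heart rate in bpm and integer daily step counts.
heart_rate = rng.uniform(60, 200, n)
steps = rng.integers(0, 30000, n).astype(float)
X = np.column_stack([heart_rate, steps])
y = 0.002 * heart_rate * steps + rng.normal(0, 10, n)  # interaction term

# Same data with steps rescaled to a very different unit.
# The factor is a power of two so the rescaling is exact in floating point.
X_rescaled = X * np.array([1.0, 1 / 1024])

params = dict(n_estimators=20, max_depth=8, random_state=0)
pred_raw = RandomForestRegressor(**params).fit(X, y).predict(X)
pred_rescaled = RandomForestRegressor(**params).fit(X_rescaled, y).predict(X_rescaled)

print("max prediction difference:", np.abs(pred_raw - pred_rescaled).max())
```

Tree splits depend only on the ordering of values within each feature, so rescaling a column moves the thresholds but not the partitions. A KNN or SVM model given the same two inputs would generally change its predictions.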

Hyperparameter Compression Strategy

max_depth + n_estimators + min_samples_leaf → 93% size reduction

Rather than switching to a smaller model family (trading accuracy for size), the strategy attacks the Random Forest hyperparameters directly. max_depth caps the node count per tree, which can otherwise double with each additional level. Lowering n_estimators shrinks the model in proportion to the number of trees removed. min_samples_leaf prevents noise-fitting that inflates tree size without adding predictive value. Together these reduce the 3 GB model to ~200 MB before compression is even applied.
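
The effect of these three hyperparameters on serialized size can be measured on a toy problem. The sketch below uses synthetic data, and `min_samples_leaf=5` is an assumed value (this page does not state the one used); the absolute sizes will be far smaller than the project's, but the direction of the change is the point.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(3000, 6))
y = X[:, 0] * X[:, 1] + rng.normal(0, 0.5, 3000)  # noisy continuous target


def serialized_mb(model, compress=0) -> float:
    """Dump the model with joblib and return the file size in MB."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "model.pkl")
        joblib.dump(model, path, compress=compress)
        return os.path.getsize(path) / 1e6


baseline = RandomForestRegressor(
    n_estimators=100, max_depth=None, random_state=0
).fit(X, y)
pruned = RandomForestRegressor(
    n_estimators=30,
    max_depth=12,
    min_samples_leaf=5,  # assumed value for illustration
    random_state=0,
).fit(X, y)

sizes = {
    "baseline": serialized_mb(baseline),
    "pruned": serialized_mb(pruned),
    "pruned + compress=3": serialized_mb(pruned, compress=3),
}
for name, mb in sizes.items():
    print(f"{name:>20}: {mb:6.2f} MB")
```

Comparing a held-out score for both models alongside the sizes is the natural next step, since the size reduction only counts if accuracy holds.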

Google Drive + Memory-Aware Loading

Bypass GitHub LFS, sequential load with forced GC

GitHub blocks files over 100 MB without LFS, and LFS adds commit complexity and storage costs. Google Drive provides a simple public URL that gdown can download from. The loading sequence matters: K-Means and Classifier are loaded first (smaller), then gc.collect() forces Python's garbage collector to reclaim memory left over from initialization, then the Regressor loads into the now-cleared headroom. Without this sequence, accumulated framework initialization state pushes total usage over the cloud limit.
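
The load ordering can be sketched as below. This is a generalization rather than the project's loader: it sorts artifacts by file size and collects garbage before every load, and the `load_models_in_order` helper and the stand-in objects are hypothetical. The gdown download step is left out so the example runs offline.

```python
import gc
import os
import tempfile

import joblib


def load_models_in_order(paths):
    """Load artifacts one at a time, smallest file first.

    gc.collect() runs before each load so unreachable objects are
    released before the next artifact is deserialized.
    """
    models = {}
    by_size = sorted(paths.items(), key=lambda item: os.path.getsize(item[1]))
    for name, path in by_size:
        gc.collect()
        models[name] = joblib.load(path)
    return models


# Demo with small stand-in objects in place of the real models.
with tempfile.TemporaryDirectory() as tmp:
    stand_ins = {
        "regressor": [0] * 100_000,
        "kmeans": [0] * 10,
        "classifier": [0] * 1_000,
    }
    paths = {}
    for name, obj in stand_ins.items():
        paths[name] = os.path.join(tmp, f"{name}.pkl")
        joblib.dump(obj, paths[name])
    models = load_models_in_order(paths)

print(list(models))  # load order, smallest artifact first
```

In a Streamlit app a loader like this would usually sit behind a cached function so the sequence runs once per process rather than on every rerun.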

Docker + GitHub Actions CI

Reproducible environments + automated quality gates

PySpark requires a specific JVM version (OpenJDK 11) and Spark binary distribution. Without containerization, "works on my machine" failures are common in CI. The multi-stage Docker build (openjdk:11-jre-slim base) locks the execution environment. GitHub Actions triggers Pytest on every commit: atomic ETL function tests (including zero-division handling in calorie ratios) and integration tests verifying the end-to-end Spark job produces the expected partition structure.
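
An atomic test of the zero-division case might look like the following. The `calories_per_step` helper is a hypothetical plain-Python analog; in the Spark job the same guard would more likely be written as a column expression using `when(...).otherwise(...)`.

```python
from typing import Optional


def calories_per_step(calories: Optional[float], steps: Optional[int]) -> Optional[float]:
    """Return calories burned per step, or None when the ratio is undefined."""
    if calories is None or steps is None or steps <= 0:
        return None
    return calories / steps


def test_calories_per_step_handles_zero_steps():
    # A record with no steps must not raise ZeroDivisionError.
    assert calories_per_step(500.0, 0) is None


def test_calories_per_step_handles_missing_values():
    assert calories_per_step(None, 2000) is None


def test_calories_per_step_regular_case():
    assert calories_per_step(500.0, 2000) == 0.25
```

Pytest collects the `test_` functions automatically, so a CI step that runs `pytest` is enough to turn these into a gate on every commit.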

Live on Streamlit Cloud

358K records. 3 models. One Docker command.

Explore the full interactive dashboard or read the source. The entire pipeline is reproducible locally via Docker.