Data Engineering + MLOps

Fitness Analytics Platform

PySpark ETL pipeline processing 358K+ fitness records into a partitioned Parquet data lake. Three Scikit-Learn models for user segmentation, activity classification, and calorie prediction. A 3 GB model compressed to under 50 MB for serverless deployment.

358K+ records processed

via PySpark across 1,959 users

98% model size reduction

3 GB to <50 MB with R² held at 0.91

3 GB model (OOM crash) › <50 MB deployed (R²=0.91)


358K+

Records Processed

PySpark ETL, 1,959 users, 6 months

84%

Classification Accuracy

7 activity types via Random Forest

0.91

R² Calorie Regression

RMSE = 131 calories

98%

Model Size Reduction

3 GB to <50 MB via pruning and compression

The Problem

Three hard constraints at once

Raw fitness tracker telemetry is useless without infrastructure to process it, models to interpret it, and a deployment strategy that fits within the memory ceiling of serverless cloud platforms.

OOM Deployment Crash

Unoptimized RF Regressor serializes to ~3 GB
Streamlit Cloud enforces a hard 1 GB RAM ceiling
Direct deployment results in immediate MemoryError
GitHub rejects files over 100 MB unless Git LFS is set up

Pandas Does Not Scale

358K records across 6 months and 1,959 users
Pandas holds the entire dataset in memory for every operation
No partition pruning: every query scans all records
ETL scripts become brittle and environment-dependent

Generic Health Insights

Raw step counts and heart rates are not actionable
No user segmentation: every user receives identical advice
No activity recognition: calorie estimates ignore what the user was actually doing
No behavioral pattern extraction for product personalization

Architecture

Five-stage production pipeline

Data generation through containerized cloud deployment: each stage outputs a versioned artifact that feeds the next.

Synthetic Data Generation (Python)

358,497 records across 1,959 users (6-month simulation)

PySpark ETL Pipeline + Feature Engineering (PySpark)

Partitioned Parquet data lake (year/month) with derived features

Model Training, 3 Models (sklearn)

K-Means (k=5 clusters), RF Classifier (84% accuracy), RF Regressor (R²=0.91)

Cloud Model Optimization (joblib)

<50 MB model artifacts (from 3 GB baseline): 98% size reduction

Streamlit Serving Layer, Docker + CI (Docker)

Multi-page Streamlit dashboard, containerized, tested, deployed to Streamlit Cloud

Production discipline: the PySpark ETL uses full-overwrite semantics on partitions, making every run idempotent. GitHub Actions runs Pytest on every commit. Docker locks the JVM and Python environment. The entire pipeline from raw data to deployed dashboard can be reproduced by any engineer with a single docker run command.
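The idempotency claim can be illustrated without a Spark cluster. The sketch below is a plain-Python stand-in, not the project's PySpark code: it mimics full-overwrite semantics on a Hive-style year=/month= directory (the layout Spark's partitionBy writes). The `overwrite_partition` helper, the CSV file name, and the sample rows are all hypothetical.

```python
import csv
import shutil
import tempfile
from pathlib import Path


def overwrite_partition(root, year, month, rows):
    """Replace one year/month partition wholesale (hypothetical helper)."""
    part_dir = root / f"year={year}" / f"month={month}"
    # Full overwrite: drop whatever an earlier run left behind, then rewrite.
    shutil.rmtree(part_dir, ignore_errors=True)
    part_dir.mkdir(parents=True)
    with open(part_dir / "part-0000.csv", "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=["user_id", "steps"])
        writer.writeheader()
        writer.writerows(rows)
    return part_dir


def snapshot(root):
    """Map each data file's relative path to its contents."""
    return {
        str(p.relative_to(root)): p.read_text()
        for p in sorted(root.rglob("*.csv"))
    }


lake = Path(tempfile.mkdtemp())
rows = [{"user_id": 1, "steps": 8200}, {"user_id": 2, "steps": 4100}]  # made-up records

overwrite_partition(lake, 2024, 1, rows)
first_run = snapshot(lake)
overwrite_partition(lake, 2024, 1, rows)  # rerun the same job
second_run = snapshot(lake)

# Identical snapshots mean the rerun neither duplicated nor altered data.
print(first_run == second_run)
```

An append-mode writer would fail this check, because the second run would add a second data file to the same partition.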

MLOps Engineering

From a 3 GB OOM crash to a <50 MB deployment

Step through the compression pipeline that turned an undeployable model into a production artifact. Each optimization is quantified with size and accuracy impact.

Step 0: The Deployment Crash

3.0 GB

serialized model size

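import joblib
from sklearn.ensemble import RandomForestRegressor

# Default Random Forest Regressor
rf = RandomForestRegressor(
    n_estimators=100,  # default
    max_depth=None,    # unbounded: trees split until leaves are pure
)
# Assumed step: X_train and y_train are the training features and
# calorie targets, prepared elsewhere.
rf.fit(X_train, y_train)
joblib.dump(rf, "model.pkl")

# Serialized size: ~3 GB
# Streamlit Cloud RAM limit: 1 GB
# Result: MemoryError on deployment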

An unoptimized Random Forest on 358K records with unlimited tree depth and 100 estimators serializes to approximately 3 GB. With no depth limit, each tree keeps splitting until its leaves are pure, so a single tree can hold hundreds of thousands of nodes, and every internal node and leaf is stored in the serialized artifact. Streamlit Cloud enforces a hard 1 GB RAM ceiling. The first deployment attempt crashed immediately with a MemoryError before the dashboard could render.
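The link between tree depth and artifact size can be checked on a small scale. The sketch below is an illustration on synthetic data, not the project's training script: it fits one unbounded and one depth-limited forest, then compares total node counts and pickled sizes. The data, the `footprint` helper, and the small settings (10 trees, max_depth=6) are assumptions chosen to keep the run fast.

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 4))                       # synthetic features
y = X[:, 0] * X[:, 1] + rng.normal(scale=0.1, size=5000)


def footprint(model):
    """Return (total node count, pickled size in bytes) for a fitted forest."""
    nodes = sum(est.tree_.node_count for est in model.estimators_)
    return nodes, len(pickle.dumps(model))


unbounded = RandomForestRegressor(
    n_estimators=10, max_depth=None, random_state=0
).fit(X, y)
bounded = RandomForestRegressor(
    n_estimators=10, max_depth=6, random_state=0
).fit(X, y)

for name, model in [("unbounded", unbounded), ("max_depth=6", bounded)]:
    nodes, size = footprint(model)
    print(f"{name}: {nodes} nodes, {size / 1e6:.2f} MB pickled")
```

The serialized size tracks the node count, which is why bounding depth is the first lever to pull.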

Model Footprint

Unoptimized RF: 3.0 GB

Hyperparameter Pruning: 200 MB

Joblib + Memory Routing: 45 MB

Streamlit Cloud hard limit: 1,024 MB

BEFORE

~3,072 MB serialized

AFTER

OOM crash on deployment (1 GB limit exceeded)

Reduction path: ~3 GB baseline (OOM) › ~200 MB after pruning › <50 MB after compression › deployed with R²=0.91 maintained
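The two percentages quoted on this page (93% and 98%) refer to different legs of this path. A short arithmetic check, taking 3 GB as 3,072 MB and treating the ~200 MB and <50 MB figures as round numbers:

```python
def reduction_pct(before_mb, after_mb):
    """Percentage size reduction going from before_mb to after_mb."""
    return 100.0 * (1.0 - after_mb / before_mb)


BASELINE_MB = 3072   # ~3 GB unoptimized model
PRUNED_MB = 200      # after hyperparameter pruning
DEPLOYED_MB = 50     # upper bound on the deployed artifact

print(f"pruning alone: {reduction_pct(BASELINE_MB, PRUNED_MB):.1f}%")            # 93.5%
print(f"end to end: at least {reduction_pct(BASELINE_MB, DEPLOYED_MB):.1f}%")   # 98.4%
```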

ML Engine

Three models, three different problems

Each model solves a distinct business problem using Scikit-Learn pipelines that bundle all preprocessing steps, eliminating training-serving skew at the architecture level.

User Segmentation

K-Means Clustering

k=5 user clusters

Input: aggregated behavioral metrics (avg steps, heart rate, calories)

Validated via silhouette score + elbow-method inertia plots

Output: 5 distinct health personas for targeted UI/UX

3D interactive scatter in dashboard: Recency vs Frequency vs Monetary-analog
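The validation step listed above (silhouette score plus elbow-method inertia) might look roughly like the following. This is a sketch on synthetic blobs standing in for the per-user aggregates; the data, the range of k, and the variable names are assumptions, not the project's actual notebook.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for per-user aggregates (avg steps, heart rate, calories).
X, _ = make_blobs(n_samples=600, centers=5, n_features=3, random_state=42)
X_scaled = StandardScaler().fit_transform(X)  # K-Means is distance-based, so scale first

results = {}
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
    results[k] = {
        "inertia": km.inertia_,                                 # elbow-plot y value
        "silhouette": silhouette_score(X_scaled, km.labels_),   # cluster separation
    }

for k, r in results.items():
    print(f"k={k}: inertia={r['inertia']:.1f}, silhouette={r['silhouette']:.3f}")

best_k = max(results, key=lambda k: results[k]["silhouette"])
print("highest silhouette at k =", best_k)
```

Plotting inertia against k gives the elbow curve; the silhouette score is a second opinion when the elbow is ambiguous.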

Activity Detection

Random Forest Classifier

84% accuracy across 7 activities

Pipeline: SimpleImputer → StandardScaler → RandomForestClassifier

Detects: Swimming, Cycling, Gym Workout + 4 other activity types

Chosen for robustness to non-linear biometric interactions

Insensitive to feature scaling, unlike scale-sensitive SVM/KNN

Calorie Prediction

Random Forest Regressor

0.91 R² score · RMSE = 131 cal

Pipeline: ColumnTransformer (numeric + OneHotEncoder) → RF Regressor

Captures complex heart rate × activity type × step interactions

Optimized for cloud: max_depth=12, n_estimators=30, compress=3

Zero training-serving skew: preprocessing bundled inside Pipeline object

Zero training-serving skew: all preprocessing steps (SimpleImputer, StandardScaler, OneHotEncoder) are bundled inside Scikit-Learn Pipeline objects. The exact transformation graph applied at training is structurally guaranteed to be applied at inference, so the serving layer never re-implements preprocessing by hand.
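A minimal sketch of that bundling, using the classifier's step order (SimpleImputer → StandardScaler → RandomForestClassifier) on synthetic data. The feature matrix, the file name, and the hyperparameters here are illustrative assumptions; only the step order and compress=3 come from this page.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))                  # synthetic biometric features
y = (X[:, 0] + X[:, 1] > 0).astype(int)        # synthetic activity label
X[rng.random(X.shape) < 0.05] = np.nan         # inject missing sensor readings

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("clf", RandomForestClassifier(n_estimators=20, max_depth=8, random_state=0)),
])
pipe.fit(X, y)

# Persist the whole pipeline, not just the estimator.
path = os.path.join(tempfile.mkdtemp(), "classifier.joblib")
joblib.dump(pipe, path, compress=3)

# The serving side loads one object and passes raw rows straight in.
served = joblib.load(path)
raw_batch = X[:10]
same = np.array_equal(pipe.predict(raw_batch), served.predict(raw_batch))
print(same)
```

Because imputation medians and scaling statistics travel inside the artifact, the dashboard only ever calls `predict` on raw inputs.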

Engineering Decisions

Five deliberate choices with documented tradeoffs

Each decision is grounded in the specific constraints of the system: memory limits, scale requirements, and production deployment reliability.

PySpark over Pandas

Predicate pushdown + year/month partition pruning

Pandas holds the full dataset in memory, so every query pays for every record. PySpark with Parquet partitioning allows downstream ML loaders and analytics queries to scan only the required year/month slices. At 358K records this is a latency optimization; at terabyte scale it becomes a necessity. Parquet's columnar format also typically compresses far better than row-oriented CSV.
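Partition pruning is easiest to see at the file-system level. The sketch below is a plain-Python illustration rather than Spark code: it builds a Hive-style year=/month= directory tree and counts how many files a reader must open with and without a partition filter. The `build_lake` and `files_to_scan` helpers and the CSV placeholder files are hypothetical; in the real pipeline Spark performs the equivalent directory selection when a query filters on the partition columns.

```python
import tempfile
from pathlib import Path


def build_lake(root, months):
    """Create one placeholder data file per (year, month) partition."""
    for year, month in months:
        part = root / f"year={year}" / f"month={month}"
        part.mkdir(parents=True)
        (part / "part-0000.csv").write_text("user_id,steps\n1,8000\n")


def files_to_scan(root, year=None, month=None):
    """List files a reader must open, skipping partitions ruled out by path."""
    year_glob = f"year={year}" if year is not None else "year=*"
    month_glob = f"month={month}" if month is not None else "month=*"
    return sorted(root.glob(f"{year_glob}/{month_glob}/*.csv"))


lake = Path(tempfile.mkdtemp())
build_lake(lake, [(2024, m) for m in range(1, 7)])  # six monthly partitions

full_scan = files_to_scan(lake)
pruned = files_to_scan(lake, year=2024, month=3)
print(len(full_scan), len(pruned))  # 6 1
```

The filter is resolved from directory names alone, so the five excluded months are never read.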

Random Forest Selection

Non-linear boundary capture, immune to feature scale

Fitness metrics have fundamentally non-linear interactions: 500 calories over 2,000 steps is categorically different from 500 calories over 10,000 steps. Random Forests capture these interaction effects naturally without requiring polynomial feature expansion. Unlike scale-sensitive algorithms such as SVM and KNN, tree splits depend only on the ordering of values within a feature, so RF is unaffected by the scale gap between heart rate (60-200 bpm) and step counts (0-30,000).
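The scale-invariance claim can be checked directly. In this sketch, on synthetic heart-rate and step data with an invented target, the same forest is trained twice: once on raw features and once with heart rate multiplied by 1024. The factor is a power of two, an assumption made so the rescaling is exact in floating point. The two sets of predictions should agree.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
heart_rate = rng.integers(60, 201, size=500).astype(float)   # bpm
steps = rng.integers(0, 30001, size=500).astype(float)       # daily steps
X = np.column_stack([heart_rate, steps])
y = heart_rate * np.sqrt(steps + 1.0)                        # invented non-linear target

X_rescaled = X.copy()
X_rescaled[:, 0] *= 1024.0   # change the units of heart rate only

params = dict(n_estimators=20, max_depth=8, random_state=0)
pred_raw = RandomForestRegressor(**params).fit(X, y).predict(X)
pred_rescaled = RandomForestRegressor(**params).fit(X_rescaled, y).predict(X_rescaled)

# Splits depend on the ordering of values, which rescaling preserves.
print(np.allclose(pred_raw, pred_rescaled))
```

Running the same experiment with KNN or an RBF-kernel SVM would generally change the predictions, because distances between rows change with the units.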

Hyperparameter Compression Strategy

max_depth + n_estimators + min_samples_leaf → 93% size reduction

Rather than switching to a smaller model family (trading accuracy for size), the strategy attacks the Random Forest hyperparameters directly. max_depth bounds exponential node growth. n_estimators reduces tree count proportionally. min_samples_leaf prevents noise-fitting that inflates tree size without adding predictive value. Together these reduce the 3 GB model to ~200 MB before compression is even applied.
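The effect of the first two knobs can be bounded with simple arithmetic. A binary tree of depth d has at most 2^(d+1) − 1 nodes, so the deployed settings quoted on this page (max_depth=12, n_estimators=30) cap the whole forest at a fixed node budget. The helper names below are illustrative.

```python
def max_nodes_per_tree(max_depth):
    """Upper bound on nodes in a binary tree whose root sits at depth 0."""
    return 2 ** (max_depth + 1) - 1


def forest_node_budget(n_estimators, max_depth):
    """Upper bound on total nodes across every tree in the forest."""
    return n_estimators * max_nodes_per_tree(max_depth)


print(max_nodes_per_tree(12))        # 8191
print(forest_node_budget(30, 12))    # 245730
```

This is only a ceiling: min_samples_leaf stops many branches before they reach the depth limit, so real trees are usually smaller than the bound.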

Google Drive + Memory-Aware Loading

Bypass GitHub LFS, sequential load with forced GC

GitHub rejects files over 100 MB unless Git LFS is used, and LFS adds commit complexity and storage costs. Google Drive provides a simple public URL that gdown can download. The loading sequence matters: K-Means and Classifier are loaded first (smaller), then gc.collect() forces Python's garbage collector to reclaim memory from initialization, then the Regressor loads into the now-cleared headroom. Without this sequence, accumulated framework initialization state pushes total usage over the cloud limit.
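The loading order described above might be sketched as follows. The `load_models` function and file names are hypothetical, the gdown download step is omitted, and tiny placeholder objects stand in for the real artifacts so the snippet runs anywhere.

```python
import gc
import os
import tempfile

import joblib


def load_models(kmeans_path, classifier_path, regressor_path):
    """Load the smaller artifacts first, collect garbage, then load the largest."""
    models = {
        "kmeans": joblib.load(kmeans_path),
        "classifier": joblib.load(classifier_path),
    }
    gc.collect()  # reclaim temporaries before the biggest allocation
    models["regressor"] = joblib.load(regressor_path)
    return models


# Demo with placeholder objects standing in for the trained models.
tmp = tempfile.mkdtemp()
paths = {}
for name in ("kmeans", "classifier", "regressor"):
    paths[name] = os.path.join(tmp, f"{name}.joblib")
    joblib.dump({"name": name}, paths[name], compress=3)

models = load_models(paths["kmeans"], paths["classifier"], paths["regressor"])
print(list(models))  # ['kmeans', 'classifier', 'regressor']
```

In a Streamlit app a function like this would typically be wrapped in a resource cache so the models load once per process rather than once per rerun.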

Docker + GitHub Actions CI

Reproducible environments + automated quality gates

PySpark requires a specific JVM version (OpenJDK 11) and Spark binary distribution. Without containerization, "works on my machine" failures are common in CI. The multi-stage Docker build (openjdk:11-jre-slim base) locks the execution environment. GitHub Actions triggers Pytest on every commit: atomic ETL function tests (including zero-division handling in calorie ratios) and integration tests verifying the end-to-end Spark job produces the expected partition structure.
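An atomic test of the kind mentioned above could look like this. It is written in plain Python rather than as a Spark column expression, and `calories_per_step` is a hypothetical stand-in for the project's ratio feature; the point is the shape of the zero-division check that Pytest would run on every commit.

```python
def calories_per_step(calories, steps):
    """Return calories burned per step, or None when the ratio is undefined."""
    if calories is None or steps is None or steps <= 0:
        return None
    return calories / steps


def test_ratio_for_normal_record():
    assert calories_per_step(500.0, 10_000) == 0.05


def test_ratio_handles_zero_steps():
    # A sedentary record must not raise ZeroDivisionError.
    assert calories_per_step(500.0, 0) is None


def test_ratio_handles_missing_values():
    assert calories_per_step(None, 1200) is None
    assert calories_per_step(300.0, None) is None
```

Returning None mirrors how a null would propagate through a Spark column, leaving imputation to the downstream pipeline.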

Live on Streamlit Cloud

358K records. 3 models. One Docker command.

Explore the full interactive dashboard or read the source. The entire pipeline is reproducible locally via Docker.
