Geometric Deep Learning + Classical ML

Melting Point Prediction Engine

Predicting thermophysical melting points from raw SMILES strings via a Late-Fusion Hybrid Graph Neural Network and a 2-level Stacking Ensemble. 24.59 K MAE from 2,662 molecules, deployed at sub-50ms inference latency.

24.59 K

MAE (Stacking Ensemble)

20% better than RF baseline of 30.69 K

<50ms

Inference latency

per SMILES molecule, Streamlit serving

Vanilla GCN: 56.5 K MAE (no global context) › Hybrid Late-Fusion GNN: 30.27 K (46% reduction)

RF Baseline: 30.69 K MAE › Stacking Ensemble: 24.59 K (SOTA, 20% reduction)


24.59 K

MAE (Stacking Ensemble)

approaches experimental variation limits (~20-25 K)


R² Score

validated via 5-fold cross-validation

<50ms

Inference Latency

per SMILES molecule on Streamlit serving

20%

MAE Reduction vs Baseline

vs Random Forest regressor (30.69 K)

The Problem

Three hard constraints at once

Predicting macroscopic thermodynamic properties from a 1D molecular string is a notoriously difficult problem. The provided data was unusable, the obvious deep learning approach failed, and the target variable has an unusual statistical structure.

Extreme Feature Sparsity (99% Zeros)

419 binary Group features representing molecular substructures
319 columns have near-zero variance: almost all zeros
Tree-based models waste split evaluations on constant columns
Neural networks cannot learn from features with no signal variance
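The filtering step implied by this list can be sketched in a few lines. This is a minimal illustration in plain NumPy, assuming a variance cutoff of 0.01 and a toy six-column matrix; neither the cutoff nor the data comes from the project, and `drop_low_variance` is a hypothetical helper that mirrors the keep-if-variance-exceeds-threshold rule of a variance filter.

```python
import numpy as np

def drop_low_variance(X, threshold=0.01):
    """Keep only columns whose variance exceeds `threshold` (assumed cutoff)."""
    variances = X.var(axis=0)
    keep = variances > threshold
    return X[:, keep], keep

rng = np.random.default_rng(0)
n_rows = 1000

# Toy stand-in for the Group features (not the project's data):
# three informative binary columns, two all-zero columns, one column with a single 1.
informative = (rng.random((n_rows, 3)) < 0.3).astype(float)
all_zero = np.zeros((n_rows, 2))
single_hit = np.zeros((n_rows, 1))
single_hit[0, 0] = 1.0

X = np.hstack([informative, all_zero, single_hit])
X_kept, keep = drop_low_variance(X, threshold=0.01)

print(keep)          # boolean mask over the six columns
print(X_kept.shape)  # rows unchanged, low-variance columns removed
```

A binary column that is 1 in only one row out of a thousand has a variance near 0.001, so it falls below the assumed cutoff together with the constant columns.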

Standard GCN Fails on Thermodynamics

GCN aggregates local atomic neighborhoods (bonds, hybridization)
No mechanism to capture global properties like MolWt or TPSA
Phase transitions are driven by macroscopic descriptors, not topology alone
Vanilla GCN at 56.5 K MAE, worse than a Random Forest baseline

Thermodynamic Complexity + Outliers

Target range: 53.54 K to 897.15 K (17x spread)
Right-skewed distribution with physically valid high-Tm outliers
RMSE training would sacrifice median accuracy for rare extremes
Approaching experimental limits (~20-25 K) from 1D strings alone

Architecture

Five-stage research-to-production pipeline

From raw sparse molecular data to deployed dual-architecture inference. Each stage resolves a specific technical constraint.

EDA + Variance Thresholding (pandas)

319 low-variance features dropped; 108 columns retained from 427

RDKit Cheminformatics Feature Engineering (RDKit)

217 dense continuous physicochemical descriptors per molecule (MolWt, TPSA, MolLogP...)

Optuna Hyperparameter Optimization, 100 Trials (Optuna)

Tuned LightGBM/XGBoost configs from 100 automated trials, ~12 minutes total

2-Level Stacking Ensemble (sklearn)

24.59 K MAE, SOTA on this dataset (20% reduction vs RF baseline of 30.69 K)

Late-Fusion Hybrid GNN + Streamlit Serving (PyTorch Geometric)

30.27 K MAE, <50ms per molecule, live covariate shift monitoring

The key insight from the pipeline: the provided sparse binary features were nearly useless (319 of 427 dropped). The entire predictive signal came from RDKit-generated physicochemical descriptors extracted from the SMILES strings, a domain enrichment step that both the Stacking Ensemble and the Hybrid GNN depend on. Feature importance from LightGBM empirically confirmed this: virtually 100% of the top predictive features are RDKit-generated, not the original Group indicators.

Architecture Evolution

Three architectures, three lessons

Step through the model evolution from the architecture that failed, to the classical ML approach that achieved SOTA accuracy, to the neural network design that solved the representational gap. Each iteration informed the next.

Step 0: The Architectural Failure

56.5 K

Mean Absolute Error (MAE)

# Standard GCN: captures local topology only
import torch
import torch.nn.functional as F
from torch.nn import Linear
from torch_geometric.nn import GCNConv, global_mean_pool

class VanillaGCN(torch.nn.Module):
    def __init__(self, node_dim=6):
        super().__init__()
        self.conv1 = GCNConv(node_dim, 128)
        self.conv2 = GCNConv(128, 256)
        self.head  = Linear(256, 1)

    def forward(self, data):
        # Two rounds of neighborhood aggregation over per-atom features
        x = F.relu(self.conv1(data.x, data.edge_index))
        x = F.relu(self.conv2(x, data.edge_index))
        # Average atom embeddings into one 256-dim vector per molecule
        x = global_mean_pool(x, data.batch)
        # Missing: MolWt, TPSA, MolLogP, H-donors...
        # These global descriptors drive phase transitions
        return self.head(x)

# MAE: ~56.5 K. Worse than Random Forest (30.69 K)

Standard GCN message-passing aggregates local atomic neighborhoods: which atoms are bonded to what, and what hybridization they carry. This captures 2D topology well but has no mechanism to encode global macroscopic properties like total Molecular Weight (MolWt) or Topological Polar Surface Area (TPSA). These global thermodynamic descriptors are the primary drivers of phase transition temperature: a molecule with high TPSA and high MolWt tends to have a higher melting point. A GCN trained without this information is architecturally blind to the most predictive signals for Tm.
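The blindness to molecular size can be reproduced on a toy case. The sketch below is plain NumPy rather than PyTorch Geometric, and it assumes an idealized setup that is not from the project: two rings of identical atoms (6 and 12 members) with random, untrained weights and small hidden sizes. It applies the standard symmetric-normalized GCN update followed by mean pooling.

```python
import numpy as np

def cycle_adjacency(n):
    """Adjacency matrix of an n-membered ring (every atom has two neighbors)."""
    A = np.zeros((n, n))
    for i in range(n):
        j = (i + 1) % n
        A[i, j] = A[j, i] = 1.0
    return A

def gcn_layer(A, H, W):
    """One GCN update: ReLU(D^-1/2 (A + I) D^-1/2 @ H @ W)."""
    A_hat = A + np.eye(len(A))                      # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))   # inverse sqrt of node degrees
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ H @ W, 0.0)

def embed(n_atoms, W1, W2):
    """Two GCN layers plus mean pooling over a ring of identical atoms."""
    A = cycle_adjacency(n_atoms)
    X = np.ones((n_atoms, 6))   # 6 features per atom, all atoms identical (toy assumption)
    H = gcn_layer(A, gcn_layer(A, X, W1), W2)
    return H.mean(axis=0)       # the equivalent of global mean pooling for one graph

rng = np.random.default_rng(0)
W1 = rng.normal(size=(6, 16))   # hidden sizes chosen for illustration only
W2 = rng.normal(size=(16, 8))

small_ring = embed(6, W1, W2)
large_ring = embed(12, W1, W2)
print(np.allclose(small_ring, large_ring))
```

The 12-membered ring has twice the mass of the 6-membered one, yet both pool to the same vector, because every atom sees an identical neighborhood and averaging removes the atom count. Real molecules are less symmetric, so this is an extreme case, but it shows why size-dependent descriptors have to enter the model through another path.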

Model MAE Comparison (lower is better)

Vanilla GCN: 56.5 K MAE

Stacking Ensemble: 24.59 K MAE

Hybrid GNN: 30.27 K MAE

RF Baseline: 30.69 K MAE

Experimental limit (target): ~20-25 K MAE
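The percentage reductions quoted on this page follow directly from these MAE values. A short arithmetic check, where `relative_reduction` is a hypothetical helper name:

```python
def relative_reduction(baseline_mae, new_mae):
    """Percentage by which new_mae improves on baseline_mae."""
    return 100.0 * (baseline_mae - new_mae) / baseline_mae

stacking_vs_rf = relative_reduction(30.69, 24.59)   # Stacking Ensemble vs RF baseline
hybrid_vs_gcn = relative_reduction(56.5, 30.27)     # Hybrid GNN vs vanilla GCN
hybrid_vs_rf = relative_reduction(30.69, 30.27)     # Hybrid GNN vs RF baseline

print(round(stacking_vs_rf, 1))  # 19.9, reported as 20%
print(round(hybrid_vs_gcn, 1))   # 46.4, reported as 46%
print(round(hybrid_vs_rf, 1))    # 1.4
```

The last figure puts the Hybrid GNN roughly level with the Random Forest baseline on MAE, with the large gain being relative to the vanilla GCN.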

BEFORE

RF Baseline: 30.69 K MAE (classical ML)

AFTER

Vanilla GCN: 56.5 K MAE, worse than the baseline, architecture failure

Architecture evolution:

Vanilla GCN (fails): 56.5 K MAE › RF Baseline: 30.69 K MAE › Hybrid GNN: 30.27 K MAE › Stacking (SOTA): 24.59 K MAE

ML Engine

Two model paths, one key prerequisite

Both the Stacking Ensemble and the Hybrid GNN depend critically on RDKit feature engineering. The models are complements, not substitutes: Stacking maximizes accuracy on known scaffolds, the GNN generalizes better to novel ones.

Peak Accuracy

Stacking Ensemble

24.59 K MAE (SOTA result)

L0: Optuna-tuned LightGBM + XGBoost + HistGradientBoosting (100 trials each)

L1: RidgeCV meta-learner (100 alpha candidates, log scale)

OOF stacking with 5-fold CV: zero data leakage at the meta-layer

20% improvement over RF baseline of 30.69 K MAE
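The out-of-fold mechanics behind this card can be sketched without any boosting library. In the NumPy example below, three least-squares models that each see a different pair of columns stand in for LightGBM, XGBoost and HistGradientBoosting, a fixed-alpha ridge stands in for RidgeCV, and the data is synthetic. All helper names are hypothetical; only the 5-fold OOF structure is taken from the description above.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit with intercept; returns a predict function."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    coef, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return lambda Z: np.hstack([Z, np.ones((len(Z), 1))]) @ coef

def linear_on(columns):
    """Stand-in base learner that only sees the given feature columns."""
    def fitter(X, y):
        predict = fit_linear(X[:, columns], y)
        return lambda Z: predict(Z[:, columns])
    return fitter

def oof_predictions(base_fitters, X, y, n_splits=5, seed=0):
    """Each row is predicted only by models that never trained on it."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_splits)
    oof = np.zeros((len(X), len(base_fitters)))
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(X)), held_out)
        for j, fitter in enumerate(base_fitters):
            model = fitter(X[train], y[train])
            oof[held_out, j] = model(X[held_out])
    return oof

def fit_ridge(M, y, alpha=1.0):
    """Closed-form ridge on centered meta-features (intercept unpenalized)."""
    mu_M, mu_y = M.mean(axis=0), y.mean()
    Mc = M - mu_M
    w = np.linalg.solve(Mc.T @ Mc + alpha * np.eye(M.shape[1]), Mc.T @ (y - mu_y))
    return lambda Z: (Z - mu_M) @ w + mu_y

# Synthetic regression data, not the melting point dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
y = 300.0 + 20.0 * X.sum(axis=1) + rng.normal(scale=5.0, size=500)

base_fitters = [linear_on([0, 1]), linear_on([2, 3]), linear_on([4, 5])]
oof = oof_predictions(base_fitters, X, y)        # Level 0: (500, 3) meta-features
meta = fit_ridge(oof, y)                         # Level 1: linear blend of base outputs

base_maes = np.abs(oof - y[:, None]).mean(axis=0)
stacked_mae = np.abs(meta(oof) - y).mean()       # in-sample for the meta-learner
print(base_maes, stacked_mae)
```

Because the meta-learner is scored here on the rows it was fitted on, the stacked figure is optimistic; an honest estimate would use a held-out set or nested cross-validation.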

Architectural Innovation

Late-Fusion Hybrid GNN

30.27 K MAE, best generalization

Dual-stream: GCNConv(6→128→256) + global_mean_pool (topology stream)

Parallel: 208 RDKit descriptors → StandardScaler (thermodynamics stream)

Late fusion: 464-dim concat → 3-layer MLP head → log(Tm) → Kelvin

150 epochs, L1 loss, CUDA training, 46% error reduction vs vanilla GCN

The Critical Prerequisite

RDKit Feature Engineering

217 physicochemical descriptors

319 near-zero-variance binary Group features dropped via VarianceThreshold

MolWt, TPSA, MolLogP, NumHDonors, NumHAcceptors, ring counts...

StandardScaler fitted on training folds only: zero leakage at scaler level

LightGBM feature importance: ~100% of top features are RDKit-generated

Engineering pragmatism: the Stacking Ensemble outperforms the Hybrid GNN in absolute MAE (24.59 K vs 30.27 K). Rather than hiding this, the project uses both in parallel: Stacking for best accuracy on in-distribution molecules, the Hybrid GNN for better generalization to novel scaffolds. Acknowledging that classical ML beat deep learning here, and shipping both, demonstrates a results-first mindset over architecture hype.

Engineering Decisions

Five choices grounded in the problem constraints

Each decision was made in response to a specific constraint in the data, the model architecture, or the deployment environment. Most are non-obvious without the full context.

Log-Transform Target Variable

np.log1p() on Tm during training, np.expm1() post-inference

Tm ranges from 53 to 897 K (a 17x spread) with a right-skewed distribution. When the network trains on raw Kelvin values, absolute errors on high-Tm outliers dominate the L1 (MAE) loss. np.log1p() compresses the dynamic range, so a given error in log space corresponds to a similar relative error across the full Kelvin range. The transformation is inverted at inference via np.expm1(), the exact inverse of np.log1p(); np.exp() alone would leave every prediction 1 K too high. This single change was critical for GNN training convergence and stability over 150 epochs.
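A round trip through the transform makes the inverse explicit. This is a minimal NumPy sketch using the dataset's stated minimum and maximum plus one illustrative mid-range value; the variable names are hypothetical.

```python
import numpy as np

# Stated min and max of the target, plus an illustrative room-temperature value.
tm_kelvin = np.array([53.54, 300.0, 897.15])

y_log = np.log1p(tm_kelvin)     # training target: log(1 + Tm)
recovered = np.expm1(y_log)     # exact inverse: exp(y) - 1
off_by_one = np.exp(y_log)      # exp alone returns Tm + 1

print(tm_kelvin.max() / tm_kelvin.min())   # about 17x spread in Kelvin
print(y_log.max() - y_log.min())           # under 3 units in log space
print(off_by_one - tm_kelvin)              # constant 1 K bias
```

`np.log1p` pairs with `np.expm1`, and `np.log` pairs with `np.exp`; mixing the two pairs introduces a constant 1 K offset, small next to a 24.59 K MAE but a systematic bias.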

MAE (L1) over RMSE / MSE

L1 loss for both neural network training and evaluation metric

Melting point datasets contain physically valid extreme outliers: organic crystals with Tm approaching 900 K. RMSE penalizes large errors quadratically: a 100 K error contributes 100x more squared error than a 10 K error. This forces models to sacrifice median accuracy in order to reduce errors on rare high-Tm compounds. MAE treats all errors linearly, producing a more robust model for the typical organic chemistry use case where most compounds lie in the 200-400 K range.
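The trade-off shows up even for the simplest possible model, a single constant prediction. The NumPy sketch below uses five made-up melting points (four typical values and one 900 K outlier, not taken from the dataset) and scans candidate constants to find the one each metric prefers.

```python
import numpy as np

# Made-up targets in Kelvin: four typical compounds and one high-Tm outlier.
targets = np.array([300.0, 310.0, 320.0, 330.0, 900.0])

# Every constant prediction from 250 K to 950 K in 1 K steps.
candidates = np.arange(250.0, 951.0, 1.0)
errors = targets[None, :] - candidates[:, None]

mae = np.abs(errors).mean(axis=1)
rmse = np.sqrt((errors ** 2).mean(axis=1))

best_under_mae = candidates[mae.argmin()]    # the median of the targets
best_under_rmse = candidates[rmse.argmin()]  # the mean of the targets

print(best_under_mae)    # 320.0
print(best_under_rmse)   # 432.0
print((100 / 10) ** 2)   # 100.0: squared-error ratio of a 100 K vs a 10 K miss
```

Under MAE the best constant sits inside the cluster of typical compounds. Under RMSE the single outlier drags it more than 100 K above all four of them.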

Late-Fusion over Early Fusion

Concatenate global descriptors after GCN processing, not before

Early fusion (concatenating RDKit descriptors to node-level features before the GCN layers) corrupts the graph message-passing semantics. GCN layers perform neighborhood aggregation over per-atom features; mixing in global molecular descriptors (which represent the entire molecule, not individual atoms) breaks the locality assumption the GCN architecture relies on. Late fusion allows the GCN to specialize in topology and the parallel MLP to specialize in thermodynamics, with the regression head performing the joint optimization.
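The data flow of the late-fusion head can be traced with shapes alone. The NumPy sketch below is a forward pass under several assumptions: random arrays stand in for the GCN's per-atom output and for the scaled descriptors, the weights are untrained, and the hidden sizes 128 and 64 are invented for illustration. Only the 256-dim pooled embedding, the 208 descriptors, the 464-dim concatenation, the 3-layer head and the log-space output come from the description above.

```python
import numpy as np

GRAPH_DIM, DESC_DIM = 256, 208   # topology stream and thermodynamics stream widths

def mlp_head(z, weights):
    """ReLU MLP; the last layer is linear and emits one value per molecule."""
    for W, b in weights[:-1]:
        z = np.maximum(z @ W + b, 0.0)
    W, b = weights[-1]
    return z @ W + b

def late_fusion_forward(node_embeddings, batch_index, descriptors, weights):
    """Pool atoms per molecule, THEN concatenate molecule-level descriptors."""
    n_graphs = descriptors.shape[0]
    pooled = np.zeros((n_graphs, node_embeddings.shape[1]))
    np.add.at(pooled, batch_index, node_embeddings)            # sum atoms per molecule
    counts = np.bincount(batch_index, minlength=n_graphs)[:, None]
    pooled /= counts                                           # mean pooling
    fused = np.concatenate([pooled, descriptors], axis=1)      # 256 + 208 = 464
    log_tm = mlp_head(fused, weights)[:, 0]
    return fused, np.expm1(log_tm)                             # log space back to Kelvin

rng = np.random.default_rng(0)
# Two molecules with 5 and 7 atoms; random stand-ins for real GCN outputs.
node_embeddings = rng.normal(size=(12, GRAPH_DIM))
batch_index = np.array([0] * 5 + [1] * 7)
descriptors = rng.normal(size=(2, DESC_DIM))

# Untrained weights; hidden sizes 128 and 64 are assumptions.
sizes = [GRAPH_DIM + DESC_DIM, 128, 64, 1]
weights = [(rng.normal(scale=0.01, size=(a, b)), np.zeros(b))
           for a, b in zip(sizes[:-1], sizes[1:])]

fused, tm_kelvin = late_fusion_forward(node_embeddings, batch_index, descriptors, weights)
print(fused.shape, tm_kelvin.shape)
```

The descriptors enter only after pooling, so they are never passed along bonds as if they were atom properties, which is the distinction from early fusion drawn above.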

Ridge as Meta-Learner over Tree Ensemble

RidgeCV with 100 alpha candidates at the stacking meta-layer

A gradient boosting meta-learner at Level-1 would overfit the OOF meta-feature matrix: at N=2662 training rows and only 3 meta-features (one per base learner), trees have too few samples per leaf to generalize. RidgeCV learns the optimal linear combination of base learner predictions with cross-validated alpha regularization. The linear model assumption is appropriate here because the meta-features (OOF predictions) are already in prediction space, and the relationship between their linear combination and the true target is approximately linear.
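What the meta-layer learns can be illustrated with a hand-rolled ridge and a log-spaced alpha grid. In this NumPy sketch the three meta-features are simulated as the true target plus noise of different sizes, the grid bounds (1e-3 to 1e3) are assumed, and plain 5-fold validation stands in for RidgeCV's built-in cross-validation. Only the 100 candidates, the log scale, the 3 meta-features and N=2662 come from the text.

```python
import numpy as np

def ridge_fit(M, y, alpha):
    """Closed-form ridge with an unpenalized intercept."""
    mu_M, mu_y = M.mean(axis=0), y.mean()
    Mc = M - mu_M
    w = np.linalg.solve(Mc.T @ Mc + alpha * np.eye(M.shape[1]), Mc.T @ (y - mu_y))
    return w, mu_y - mu_M @ w

def select_alpha(M, y, alphas, n_splits=5, seed=0):
    """Pick the alpha with the lowest mean validation MSE across folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(M)), n_splits)
    cv_mse = []
    for alpha in alphas:
        fold_errors = []
        for held_out in folds:
            train = np.setdiff1d(np.arange(len(M)), held_out)
            w, b = ridge_fit(M[train], y[train], alpha)
            fold_errors.append(np.mean((M[held_out] @ w + b - y[held_out]) ** 2))
        cv_mse.append(np.mean(fold_errors))
    return alphas[int(np.argmin(cv_mse))]

rng = np.random.default_rng(0)
n = 2662
y = rng.normal(loc=300.0, scale=80.0, size=n)   # simulated targets in Kelvin

# Simulated OOF predictions: truth plus noise, one column per base learner.
noise_scales = np.array([20.0, 25.0, 30.0])
M = y[:, None] + rng.normal(size=(n, 3)) * noise_scales

alphas = np.logspace(-3, 3, 100)                # 100 candidates; bounds are assumed
best_alpha = select_alpha(M, y, alphas)
weights, intercept = ridge_fit(M, y, best_alpha)
print(best_alpha, weights, weights.sum())
```

The fitted weights come out positive and sum to slightly under 1, with the least noisy column weighted most: the meta-learner is a regularized weighted average of the base predictions. That is a three-parameter model, which N=2662 rows can estimate stably.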

@st.cache_resource + .eval() Mode

Single-load model weights + inference-only PyTorch execution

PyTorch state dictionaries are deserialized from disk into RAM on every model load. Without @st.cache_resource, every Streamlit script rerun (triggered by any UI interaction) triggers a fresh deserialization, adding 2-5 seconds of latency and risking OOM under concurrent sessions. @st.cache_resource persists the loaded PyTorch module across all sessions and reruns in the same server process. .eval() mode switches BatchNorm from batch statistics to running statistics and disables dropout sampling, which eliminates non-determinism in predictions and trims a little per-inference compute.
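The load-once behavior can be demonstrated without Streamlit or PyTorch installed. In the sketch below, `functools.lru_cache` from the standard library stands in for `@st.cache_resource`, a dictionary stands in for the deserialized model, and the file name is hypothetical; in the real app the decorated function would load the weights and call `.eval()` before returning the module.

```python
import functools

load_count = 0   # counts how many times the expensive load actually runs

@functools.lru_cache(maxsize=None)   # stand-in for @st.cache_resource
def load_model(weights_path):
    """Pretend to deserialize weights; a real loader would return a torch module."""
    global load_count
    load_count += 1
    return {"weights_path": weights_path, "mode": "eval"}

def rerun_script():
    """Simulates one Streamlit rerun triggered by a UI interaction."""
    return load_model("hybrid_gnn.pt")   # hypothetical file name

first = rerun_script()
second = rerun_script()

print(load_count)        # 1: the second rerun hit the cache
print(first is second)   # True: both reruns share the same object
```

Because every rerun receives the same object, anything cached this way should be treated as read-only at inference time.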

Live on Streamlit

2,662 molecules. Two architectures. One SMILES input.

Paste any SMILES string and get a melting point prediction under 50ms. Both the Stacking Ensemble and Hybrid GNN paths are available in the dashboard.
