Geometric Deep Learning + Classical ML

Melting Point Prediction Engine

Predicting thermophysical melting points from raw SMILES strings via a Late-Fusion Hybrid Graph Neural Network and a 2-level Stacking Ensemble. 24.59 K MAE from 2,662 molecules, deployed at sub-50ms inference latency.

24.59 K
MAE (Stacking Ensemble)
20% better than RF baseline of 30.69 K
<50ms
Inference latency
per SMILES molecule, Streamlit serving
Vanilla GCN: 56.5 K MAE (no global context)
Hybrid Late-Fusion GNN: 30.27 K (46% reduction)
RF Baseline: 30.69 K MAE
Stacking Ensemble: 24.59 K (SOTA, 20% reduction)
24.59 K
MAE (Stacking Ensemble)
approaches experimental variation limits (~20-25 K)
0.0
R² Score
validated via 5-fold cross-validation
<50ms
Inference Latency
per SMILES molecule on Streamlit serving
20%
MAE Reduction vs Baseline
vs Random Forest regressor (30.69 K)
The Problem

Three hard constraints at once

Predicting macroscopic thermodynamic properties from a 1D molecular string is a notoriously difficult problem. The provided data was unusable, the obvious deep learning approach failed, and the target variable has an unusual statistical structure.

Extreme Feature Sparsity (99% Zeros)

  • 419 binary Group features representing molecular substructures
  • 319 columns have near-zero variance (almost all zeros)
  • Tree-based models waste split evaluations on constant columns
  • Neural networks cannot learn from features with no signal variance
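The variance filter described above can be sketched with scikit-learn's VarianceThreshold. This is a minimal illustration on a synthetic binary matrix, not the project's actual code: the matrix, its column counts, and the 0.01 cutoff are all assumptions chosen for the demo.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
n_rows = 200

# Hypothetical stand-in for the binary Group matrix:
# 3 informative columns (roughly half ones) and 5 near-constant columns.
dense = rng.integers(0, 2, size=(n_rows, 3))
sparse = np.zeros((n_rows, 5), dtype=int)
sparse[0, 0] = 1  # a single non-zero entry: variance = 0.005 * 0.995

X = np.hstack([dense, sparse])

# Assumed cutoff; columns with variance at or below it are dropped.
selector = VarianceThreshold(threshold=0.01)
X_kept = selector.fit_transform(X)
kept_mask = selector.get_support()

print(X.shape, "->", X_kept.shape)
print("kept columns:", np.flatnonzero(kept_mask).tolist())
```

Fitting the selector on the training split only, then reusing it at inference, keeps the retained column set fixed across environments.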

Standard GCN Fails on Thermodynamics

  • GCN aggregates local atomic neighborhoods (bonds, hybridization)
  • No mechanism to capture global properties like MolWt or TPSA
  • Phase transitions are driven by macroscopic descriptors, not topology alone
  • Vanilla GCN reaches 56.5 K MAE, worse than the Random Forest baseline (30.69 K)

Thermodynamic Complexity + Outliers

  • Target range: 53.54 K to 897.15 K (17x spread)
  • Right-skewed distribution with physically valid high-Tm outliers
  • RMSE training would sacrifice median accuracy for rare extremes
  • Approaching experimental limits (~20-25 K) from 1D strings alone
Architecture

Five-stage research-to-production pipeline

From raw sparse molecular data to deployed dual-architecture inference. Each stage resolves a specific technical constraint.

S1
EDA + Variance Thresholding (pandas)
319 low-variance features dropped; 108 columns retained from 427
S2
RDKit Cheminformatics Feature Engineering (RDKit)
217 dense continuous physicochemical descriptors per molecule (MolWt, TPSA, MolLogP...)
S3
Optuna Hyperparameter Optimization (100 Trials) (Optuna)
Tuned LightGBM/XGBoost configs from 100 automated trials, ~12 minutes total
S4
2-Level Stacking Ensemble (sklearn)
24.59 K MAE, SOTA on this dataset (20% reduction vs RF baseline of 30.69 K)
S5
Late-Fusion Hybrid GNN + Streamlit Serving (PyTorch Geometric)
30.27 K MAE, <50ms per molecule, live covariate shift monitoring

The key insight from the pipeline: the provided sparse binary features were nearly useless (319 of 427 dropped). Nearly all of the predictive signal came from RDKit-generated physicochemical descriptors extracted from the SMILES strings, a domain enrichment step that both the Stacking Ensemble and the Hybrid GNN depend on. LightGBM feature importance confirmed this empirically: virtually 100% of the top predictive features are RDKit-generated, not the original Group indicators.
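A minimal sketch of the enrichment step, assuming RDKit is installed. The `describe` helper is hypothetical and computes only five of the descriptors named in this write-up; the full pipeline would compute the complete descriptor set.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors


def describe(smiles):
    """Return a few global physicochemical descriptors for one SMILES string.

    Hypothetical helper for illustration; not the project's actual function.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # RDKit returns None for unparseable SMILES
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    return {
        "MolWt": Descriptors.MolWt(mol),
        "TPSA": Descriptors.TPSA(mol),
        "MolLogP": Descriptors.MolLogP(mol),
        "NumHDonors": Descriptors.NumHDonors(mol),
        "NumHAcceptors": Descriptors.NumHAcceptors(mol),
    }


features = describe("CCO")  # ethanol
for name, value in features.items():
    print(f"{name}: {value:.2f}")
```

Each value is a single number per molecule, which is what makes these descriptors dense and continuous compared with the binary Group columns.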

Architecture Evolution

Three architectures, three lessons

Step through the model evolution from the architecture that failed, to the classical ML approach that achieved SOTA accuracy, to the neural network design that solved the representational gap. Each iteration informed the next.

Step 0: The Architectural Failure
56.5 K
Mean Absolute Error (MAE)
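# Standard GCN - captures local topology only
import torch
import torch.nn.functional as F
from torch.nn import Linear
from torch_geometric.nn import GCNConv, global_mean_pool

class VanillaGCN(torch.nn.Module):
    def __init__(self, node_dim=6):
        super().__init__()
        self.conv1 = GCNConv(node_dim, 128)  # per-atom features -> 128
        self.conv2 = GCNConv(128, 256)
        self.head  = Linear(256, 1)          # regression head on pooled embedding

    def forward(self, data):
        # Two rounds of message passing over bonded neighbors
        x = F.relu(self.conv1(data.x, data.edge_index))
        x = F.relu(self.conv2(x, data.edge_index))
        # Average atom embeddings into one 256-dim vector per molecule
        x = global_mean_pool(x, data.batch)
        # Missing: MolWt, TPSA, MolLogP, H-donors...
        # These global descriptors drive phase transitions
        return self.head(x)

# MAE: ~56.5 K. Worse than Random Forest (30.69 K)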

Standard GCN message-passing aggregates local atomic neighborhoods: which atoms are bonded to what, and what hybridization they carry. This captures 2D topology well but has no mechanism to encode global macroscopic properties like total Molecular Weight (MolWt) or Topological Polar Surface Area (TPSA). These global descriptors are the primary drivers of phase transition temperature: a molecule with high TPSA and high MolWt tends to have a higher melting point. A GCN trained without this information is architecturally blind to the most predictive signals for Tm.

Model MAE Comparison (lower is better)
Vanilla GCN: 56.5 K MAE
RF Baseline: 30.69 K MAE
Hybrid GNN: 30.27 K MAE
Stacking Ensemble: 24.59 K MAE
Experimental limit (target): ~20-25 K MAE
BEFORE
RF Baseline: 30.69 K MAE (classical ML)
AFTER
Vanilla GCN: 56.5 K MAE, worse than the baseline (architecture failure)
Architecture evolution:
Vanilla GCN (fails): 56.5 K MAE → RF Baseline: 30.69 K MAE → Hybrid GNN: 30.27 K MAE → Stacking (SOTA): 24.59 K MAE
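The percentage figures quoted throughout follow directly from these MAE values. A short worked check, where `relative_reduction` is a helper name introduced here for illustration:

```python
def relative_reduction(baseline_mae, new_mae):
    """Percentage by which new_mae improves on baseline_mae."""
    return 100.0 * (baseline_mae - new_mae) / baseline_mae


stacking_vs_rf = relative_reduction(30.69, 24.59)
hybrid_vs_gcn = relative_reduction(56.5, 30.27)
hybrid_vs_rf = relative_reduction(30.69, 30.27)

print(f"Stacking vs RF baseline:  {stacking_vs_rf:.1f}%")  # 19.9%
print(f"Hybrid GNN vs vanilla GCN: {hybrid_vs_gcn:.1f}%")  # 46.4%
print(f"Hybrid GNN vs RF baseline: {hybrid_vs_rf:.1f}%")   # 1.4%
```

The last line makes the ranking explicit: the Hybrid GNN only narrowly beats the Random Forest baseline, while the Stacking Ensemble accounts for the 20% headline gain.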
ML Engine

Two model paths, one key prerequisite

Both the Stacking Ensemble and the Hybrid GNN depend critically on RDKit feature engineering. The models are complements, not substitutes: Stacking maximizes accuracy on known scaffolds, the GNN generalizes better to novel ones.

Peak Accuracy
Stacking Ensemble
24.59 K MAE (SOTA result)
L0: Optuna-tuned LightGBM + XGBoost + HistGradientBoosting (100 trials each)
L1: RidgeCV meta-learner (100 alpha candidates, log scale)
OOF stacking with 5-fold CV: zero data leakage at the meta-layer
20% improvement over RF baseline of 30.69 K MAE
Architectural Innovation
Late-Fusion Hybrid GNN
30.27 K MAE, best generalization
Dual-stream: GCNConv(6→128→256) + global_mean_pool (topology stream)
Parallel: 208 RDKit descriptors → StandardScaler (thermodynamics stream)
Late fusion: 464-dim concat → 3-layer MLP head → log(Tm) → Kelvin
150 epochs, L1 loss, CUDA training, 46% error reduction vs vanilla GCN
The Critical Prerequisite
RDKit Feature Engineering
217 physicochemical descriptors
319 near-zero-variance binary Group features dropped via VarianceThreshold
MolWt, TPSA, MolLogP, NumHDonors, NumHAcceptors, ring counts...
StandardScaler fitted on training folds only: zero leakage at scaler level
LightGBM feature importance: ~100% of top features are RDKit-generated

Engineering pragmatism: the Stacking Ensemble outperforms the Hybrid GNN in absolute MAE (24.59 K vs 30.27 K). Rather than hiding this, the project uses both in parallel: Stacking for best accuracy on in-distribution molecules, the Hybrid GNN for better generalization to novel scaffolds. Acknowledging that classical ML beat deep learning here, and shipping both, demonstrates a results-first mindset over architecture hype.

Engineering Decisions

Five choices grounded in the problem constraints

Each decision was made in response to a specific constraint in the data, the model architecture, or the deployment environment. Most are non-obvious without the full context.

Log-Transform Target Variable

np.log1p() on Tm during training, np.expm1() post-inference

Tm ranges from 53 to 897 K (a 17x spread) with a right-skewed distribution. Training on raw Kelvin values with L1 (MAE) loss causes the neural network to allocate gradient budget disproportionately toward high-Tm outliers. np.log1p() compresses the dynamic range, making prediction errors comparable in magnitude across the full Kelvin range. The transformation is inverted at inference via np.expm1(), the exact inverse of np.log1p(); a plain np.exp() would return values 1 K too high. This single change was critical for GNN training convergence and stability over 150 epochs.
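A minimal sketch of the target round trip, using illustrative temperatures that span the reported range. It also shows why the inverse matters: np.expm1 undoes np.log1p exactly, whereas np.exp leaves a constant +1 K offset.

```python
import numpy as np

# Illustrative melting points in Kelvin, spanning the reported range.
tm_kelvin = np.array([53.54, 250.0, 400.0, 897.15])

# Forward transform applied to the training target.
y_log = np.log1p(tm_kelvin)

# Inverse transform applied to model outputs at inference.
recovered = np.expm1(y_log)

# Mismatched inverse: exp(log(1 + x)) = 1 + x, so every value is 1 K too high.
mismatched = np.exp(y_log)

print("log-space spread:", y_log.max() - y_log.min())
print("round trip exact:", np.allclose(recovered, tm_kelvin))
print("offset with np.exp:", mismatched - tm_kelvin)
```

In log space the targets span fewer than 3 units instead of more than 840 K, which is the compression the training loop benefits from.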

MAE (L1) over RMSE / MSE

L1 loss for both neural network training and evaluation metric

Melting point datasets contain physically valid extreme outliers: organic crystals with Tm approaching 900 K. RMSE penalizes large errors quadratically, so a 100 K error counts 100x more than a 10 K error. This forces models to sacrifice median accuracy in order to reduce errors on rare high-Tm compounds. MAE treats all errors linearly, producing a more robust model for the typical organic chemistry use case where most compounds lie in the 200-400 K range.
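The arithmetic behind this choice can be checked in a few lines. The six temperatures below are hypothetical, picked so that one high-Tm outlier sits far from the rest; the example uses the standard result that a constant predictor minimizes MAE at the median and MSE at the mean.

```python
import numpy as np

# Penalty ratio between a 100 K error and a 10 K error.
errors = np.array([10.0, 100.0])
abs_ratio = abs(errors[1]) / abs(errors[0])  # linear penalty (MAE)
sq_ratio = errors[1] ** 2 / errors[0] ** 2   # quadratic penalty (MSE/RMSE)

# Hypothetical targets: five typical compounds and one high-Tm outlier.
tm = np.array([250.0, 280.0, 300.0, 320.0, 350.0, 890.0])

best_l1 = np.median(tm)  # constant prediction minimizing MAE
best_l2 = np.mean(tm)    # constant prediction minimizing MSE

print(f"penalty ratio: MAE {abs_ratio:.0f}x, MSE {sq_ratio:.0f}x")
print(f"MAE-optimal constant: {best_l1:.1f} K")
print(f"MSE-optimal constant: {best_l2:.1f} K")
```

The MSE-optimal constant lands near 398 K, above every typical compound in the sample, while the MAE-optimal constant stays at 310 K inside the bulk of the data.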

Late-Fusion over Early Fusion

Concatenate global descriptors after GCN processing, not before

Early fusion (concatenating RDKit descriptors to node-level features before the GCN layers) corrupts the graph message-passing semantics. GCN layers perform neighborhood aggregation over per-atom features; mixing in global molecular descriptors (which represent the entire molecule, not individual atoms) breaks the locality assumption the GCN architecture relies on. Late fusion allows the GCN to specialize in topology and the parallel MLP to specialize in thermodynamics, with the regression head performing the joint optimization.
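A sketch of the fusion point only, in plain PyTorch. The 256-dim graph embedding and 208 descriptors match the dimensions quoted in this write-up; the `LateFusionHead` name, the hidden sizes, and the random tensors standing in for the pooled GCN output and the scaled descriptors are assumptions for illustration.

```python
import torch
from torch import nn


class LateFusionHead(nn.Module):
    """Concatenate a pooled graph embedding with global descriptors, then regress.

    Hidden sizes are assumed; only the 256 + 208 = 464 input width is from the text.
    """

    def __init__(self, graph_dim=256, desc_dim=208, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(graph_dim + desc_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden // 2),
            nn.ReLU(),
            nn.Linear(hidden // 2, 1),
        )

    def forward(self, graph_emb, descriptors):
        # Fusion happens after message passing, at the molecule level.
        fused = torch.cat([graph_emb, descriptors], dim=1)
        return self.mlp(fused).squeeze(-1)  # one log-space prediction per molecule


torch.manual_seed(0)
batch = 4
graph_emb = torch.randn(batch, 256)    # stand-in for global_mean_pool output
descriptors = torch.randn(batch, 208)  # stand-in for scaled RDKit descriptors

head = LateFusionHead()
log_tm = head(graph_emb, descriptors)
print(log_tm.shape)
```

Because the descriptors enter only after pooling, the GCN layers never see molecule-level values as if they were atom features.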

Ridge as Meta-Learner over Tree Ensemble

RidgeCV with 100 alpha candidates at the stacking meta-layer

A gradient boosting meta-learner at Level-1 would overfit the OOF meta-feature matrix: at N=2662 training rows and only 3 meta-features (one per base learner), trees have too few samples per leaf to generalize. RidgeCV learns the optimal linear combination of base learner predictions with cross-validated alpha regularization. The linear model assumption is appropriate here because the meta-features (OOF predictions) are already in prediction space, and the relationship between their linear combination and the true target is approximately linear.

@st.cache_resource + .eval() Mode

Single-load model weights + inference-only PyTorch execution

PyTorch state dictionaries are deserialized from disk into RAM on every model load. Without @st.cache_resource, every Streamlit script re-run (triggered by any UI interaction) triggers a fresh deserialization, adding 2-5 seconds of latency and risking OOM under concurrent sessions. @st.cache_resource persists the loaded PyTorch module across all sessions and reruns in the same server process. .eval() mode switches BatchNorm from batch statistics to running statistics and disables dropout sampling, reducing per-inference compute and eliminating non-determinism in predictions.
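The inference-mode half of this decision can be demonstrated without Streamlit. The sketch below uses a small hypothetical model with a dropout layer; the loader function that @st.cache_resource would wrap is omitted. Note that .eval() alone does not turn off autograd, so torch.no_grad() is added as a separate step.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-in for the deployed network; includes dropout on purpose.
model = nn.Sequential(
    nn.Linear(8, 16),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(16, 1),
)
x = torch.randn(1, 8)

# Inference mode: dropout becomes a no-op, so repeated calls agree exactly.
model.eval()
with torch.no_grad():  # separate from eval(): skips building the autograd graph
    first = model(x)
    second = model(x)

deterministic = torch.equal(first, second)
tracks_grad = first.requires_grad

print("same prediction twice:", deterministic)
print("output tracks gradients:", tracks_grad)
```

In a Streamlit app, the model construction and weight loading would sit inside the cached loader so that this setup runs once per server process rather than on every rerun.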

Live on Streamlit

2,662 molecules. Two architectures. One SMILES input.

Paste any SMILES string and get a melting point prediction in under 50ms. Both the Stacking Ensemble and Hybrid GNN paths are available in the dashboard.