# Generalization in high-D molecule space

---

## Contents

1. Walden+Polynomial XGBoost model
2. Previous curse of dimensionality experiments (Scott, Krishna, Bhanushee)
3. Curse of dimensionality background
4. Possible improvements

---

## 1. Polynomial & Walden feature generation

![](2024-12-17-images/walden-feature-subsets-flowchart.drawio.png)
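----

### Sketch: polynomial feature generation

A minimal sketch of the polynomial expansion step, assuming a pandas DataFrame `walden` of per-molecule Walden descriptors (the column names here are hypothetical); the flowchart above shows how these feature subsets feed the model.

```python
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical Walden descriptor table: one row per molecule.
walden = pd.DataFrame(
    {"logp": [1.2, 0.4, 2.8], "tpsa": [45.1, 80.3, 20.9], "mw": [210.0, 305.5, 180.2]}
)

# Degree-2 expansion adds squares and pairwise interaction terms
# (logp*tpsa, logp*mw, ...) on top of the raw descriptors.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = pd.DataFrame(
    poly.fit_transform(walden),
    columns=poly.get_feature_names_out(walden.columns),
)
print(X_poly.columns.tolist())
```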
----

### XGBoost

[Interactive booster tree (1/73)](2024-12-17-images/pkasupertree.html)

----

### XGBoost tree 0

![](2024-12-17-images/tree0table.png)
----

### XGBoost node status

![](2024-12-17-images/tree0table.png)
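----

### Sketch: dumping tree tables

A per-node table like the one above can be pulled from any fitted booster; a minimal sketch, with toy stand-in data in place of the real descriptor matrix.

```python
import numpy as np
from xgboost import XGBClassifier

# Illustrative stand-in for the real descriptor matrix and labels.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 10)), rng.integers(0, 2, size=200)

model = XGBClassifier(n_estimators=73, max_depth=3).fit(X, y)

# One row per node: feature, split threshold, yes/no/missing children,
# gain, and cover.
trees = model.get_booster().trees_to_dataframe()
print(trees[trees["Tree"] == 0])
```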
----

## Ridge polynomial model viable in high-D

![](2024-12-17-images/pka-a1-best-krishna-feature-batches.png)
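----

### Sketch: ridge on polynomial features

A minimal sketch of a ridge-plus-polynomial pipeline, assuming a descriptor matrix `X` and targets `y` (random stand-ins below); the L2 penalty is what keeps the expanded feature space tractable in high-D.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
X, y = rng.normal(size=(300, 20)), rng.normal(size=300)

# Scale, expand to degree-2 terms, then shrink coefficients with alpha.
ridge_poly = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(degree=2, include_bias=False),
    Ridge(alpha=10.0),
)
print(cross_val_score(ridge_poly, X, y, cv=5).mean())
```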
----

### t-SNE - 270-D -> 2-D

- Explorable 2-D projection of molecule space?
- Predict and display accuracy/confidence "volume" as area?
- Predict unlabeled molecules as part of a confidence volume map?

![](2024-12-17-images/tsne_pka_20241119.png)
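----

### Sketch: t-SNE projection

A minimal t-SNE sketch, assuming a 270-column descriptor matrix `X` (random stand-in below); note that t-SNE is visualization-only and cannot place new points without refitting.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 270))  # stand-in for the 270-D features

# Project to 2-D; perplexity trades off local vs. global structure.
xy = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(xy.shape)  # (500, 2) coordinates for the scatter plot
```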
----

### Generalization

Is it possible to generalize to **more** unseen data?

Can we measure how much better we are doing?

---

## 2. Curse of dimensionality

- Scott, Krishna research
- Everything in molecule space is far apart
- **65%** of molecules are singletons
- Impractical to create combinatoric features
- "Library of Mendel": a vast space with vanishing differences between high-D vectors
- Physics/chemistry creates latent patterns in high-dimensional space that are difficult to predict

----

### Scott's Singleton Overfitting Plot

![](2024-12-17-images/scott-singleton-regression-r2-score.png)
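----

### Sketch: flagging singletons

A minimal sketch of a nearest-neighbor isolation check, assuming descriptor vectors `X` and a distance cutoff `r` (both stand-ins, not the definition used in Scott's analysis); molecules whose closest peer is farther than `r` get flagged as singletons.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 270))  # stand-in descriptor matrix
r = 20.0                          # hypothetical distance cutoff

# Distance to the closest *other* point (k=2: self plus nearest peer).
dist, _ = NearestNeighbors(n_neighbors=2).fit(X).kneighbors(X)
singleton = dist[:, 1] > r
print(f"{singleton.mean():.0%} singletons")
```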
----

### Krishna Elbow Plot

Macro F1 score vs. number of features (XGBC, 10-fold shuffled CV)

![](2024-12-17-images/pka-a1-best-krishna-rainbow-10x180.png)
----

### High variance - 10% sample CV

![](2024-12-17-images/pka-a1-best-krishna-rainbow-5x135.png)
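----

### Sketch: reproducing the elbow curve

A minimal sketch of the macro-F1-vs-feature-count sweep behind the elbow plot, assuming the columns of `X` are already ranked best-first by importance (random stand-in data below).

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, cross_val_score
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X, y = rng.normal(size=(400, 180)), rng.integers(0, 2, size=400)

cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
for k in (5, 10, 20, 40, 80, 180):
    # Train on the top-k ranked features only.
    scores = cross_val_score(
        XGBClassifier(n_estimators=100), X[:, :k], y, cv=cv, scoring="f1_macro"
    )
    print(f"{k:>3} features: F1={scores.mean():.3f} +/- {scores.std():.3f}")
```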
----

### Feature selection

Is it possible to anticipate important features for unseen data?

- Will zero-importance variables ever become important?
- Does it matter? (The model has zero weights for them.)

---

### Curse of dimensionality

Even 15 dimensions is problematic.

![](2024-12-17-images/hyperspace-curse-1-15-D.drawio.png)
----

### Random 2-D vectors

78% of random vectors are "on target" (a reasonable distance from their closest peer).

![](2024-12-17-images/hyperspace-curse-2-D-circle-22pct.png)
----

### Random 3-D vectors

52% of random 3-D vectors are "on target" (48% land in the corners).

![](2024-12-17-images/hyperspace-curse-3-D-sphere-corners-48pct.png)
----

### Random vectors in high-D space

A vanishingly small fraction of vectors are "on target" in 12+ dimensional space.

35% of our data is "on target" for our 270-D features (most of these from public data?).

![](2024-12-17-images/hyperspace-curse-1-15-D.png)
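----

### Sketch: the shrinking hypersphere

A minimal Monte Carlo sketch of the "on target" fraction, taken here as the share of random points in the hypercube that fall inside the inscribed hypersphere; this is the geometry behind the 78% (2-D) and 52% (3-D) figures above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

for d in (2, 3, 6, 9, 12, 15):
    # Points in [-1, 1]^d; "on target" = inside the inscribed unit sphere.
    pts = rng.uniform(-1, 1, size=(n, d))
    on_target = (np.linalg.norm(pts, axis=1) <= 1.0).mean()
    print(f"{d:>2}-D: {on_target:.1%} on target")
```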
----

## 3. Possible improvements

- Metrics
- Cross validation
- Accuracy
- Confidence
- Infrastructure

----

### Metrics

What you want to measure:

- Measure chemist satisfaction
- Measure business performance (hit/miss percentages)
- Prediction confidence profile
- Show progress (no glass ceiling) in announcements & internal reports
- Improve promotion reliability

----

## Metrics - sample weights

Sample weights to improve both training and validation:

- Time series recency
- "Active validation" (important compounds flagged by chemists)
- Batch assay target std deviation
- Improve confidence estimates

----

### Cross validation

Grouped K-Fold CV

![](2024-12-17-images/plot_cv_indices_GroupKfold.png)

---

### Accuracy volume

- Measure learned manifold volume (accuracy times volume)
- Generalization to training set neighbors
- Better predictions on future compounds
- Better predictions on *subsets* of future compounds

----

### Confidence

- Lower confidence on false positives
- Lower confidence on false negatives
- Increase confidence on true positives and negatives
- Add a uniqueness/isolation feature (distance to the closest training example)

----

### Infrastructure

Improved DX (developer experience) accelerates progress.

- Deploy joblibs directly and immediately
- Latency
- Reliability

Could chemists deploy new models with a web form?

---

## Model improvement

- Feature selection
- Active learning
- Feature extraction

----

### Feature selection

- Aggregate importances across models and folds
- RL or Bayesian feature selection model (train the trainer)
- Improve model evaluation metrics
- Chemist-suggested feature combinations
- Chemist-suggested new features
- Gradient descent (HyperOpt) on each added interaction/polynomial feature
- Include Bhanushee fingerprints among feature subsets

----

## Feature selection -- active learning

Feedback from chemists to improve the model and metrics:

- Target molecule-space regions of interest to chemists
- RLHF of the feature selection model (train the trainer)
- Model to predict ambiguity (std dev in batch assays)

---

## Feature extraction/embedding

- Transfer learning: sharing feature embeddings across target classes
- Neural networks: CNN, RNN, transformer, GNN
- SMILES feature embedding (variational autoencoder CNN/RNN)
- Steered embeddings and distance metrics (polynomial model, all targets)
- Examine XGBoost decision trees to design interaction and polynomial features

----

## Metrics for feature selection

Reduce noise in promotion decision metrics and the Krishna curve with CV:

- 10-fold CV
- Leave-one-singleton-out
- CV on clusters by project
- CV on clusters by distance (Tanimoto, Levenshtein, cosine, Euclidean, Manhattan)
- CV on time series

----

## Baseline

Currently a single 20% hold-out split. Minimal baseline: 5-10 random-split CV.

![](2024-12-17-images/plot_cv_indices_ShuffleSplit.png)
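----

### Sketch: baseline ShuffleSplit CV

A minimal sketch of the proposed baseline, replacing the single hold-out split with 10 random 80/20 splits via scikit-learn's `ShuffleSplit` (stand-in data below); the score spread across splits is what the single split cannot show.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, cross_val_score
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X, y = rng.normal(size=(400, 50)), rng.integers(0, 2, size=400)

# 10 independent random 80/20 splits instead of one.
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
scores = cross_val_score(XGBClassifier(), X, y, cv=cv, scoring="f1_macro")
print(f"macro F1: {scores.mean():.3f} +/- {scores.std():.3f}")
```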
---

## Option 1. Grouped Shuffle Split

Group according to desired and undesired class.

![](2024-12-17-images/plot_cv_indices_GroupShuffleSplit.png)
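----

### Sketch: GroupShuffleSplit

A minimal sketch of grouped shuffle splitting, assuming a per-sample `groups` array (hypothetical cluster/project IDs); whole groups stay together, so no group leaks across the train/test boundary.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
groups = rng.integers(0, 20, size=100)  # hypothetical cluster/project IDs

gss = GroupShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
for train_idx, test_idx in gss.split(X, groups=groups):
    # Train and test group sets are disjoint by construction.
    assert not set(groups[train_idx]) & set(groups[test_idx])
```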
---

## Option 2. Stratified Shuffle Split

![](2024-12-17-images/plot_cv_indices_StratifiedShuffleSplit.png)
---

## Option 3. Time Series Split

![](2024-12-17-images/TimeSeriesSplit.drawio.png)
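----

### Sketch: TimeSeriesSplit

A minimal sketch of time-ordered CV, assuming compounds are sorted oldest-first (stand-in data below); each fold trains only on compounds older than its test set, mimicking deployment.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)  # stand-in, sorted oldest to newest

for train_idx, test_idx in TimeSeriesSplit(n_splits=4).split(X):
    # Training indices always precede test indices.
    print(f"train {train_idx.min()}-{train_idx.max()}, "
          f"test {test_idx.min()}-{test_idx.max()}")
```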
---

## Stratified K-Fold

Folding won't detect anomalous (unlucky) splits.

![](2024-12-17-images/plot_cv_indices_StratifiedKFold.png)
---

## pKa A1 Walden Confusion Matrix

![](2024-12-17-images/pka-a1-best-krishna-confusion.png)
---

![](2024-12-17-images/plot_hyperspace_curse.png)

---

![](2024-12-17-images/plot_cv_indices_KFold.png)

---

![](2024-12-17-images/pka-a1-best-krishna-rainbow-10x180.png)

---

![](2024-12-17-images/hyperspace-curse-1-15-D.drawio.png)

---

![](2024-12-17-images/plot_cv_indices_StratifiedGroupKFold.png)
---

- [^1]: Grid vs. random hyperparameter search - https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/grid_random.png?ssl=1
- [^2]: IDY-1609 - https://utiliware.atlassian.net/browse/IDY-1609
- [^3]: Scott's report on overfitting and feature selection - https://utiliware.atlassian.net/wiki/spaces/IDB/pages/1291976705/Overfitting+Clustering+and+Test+Set+Selection

---