Datasets Reference

Overview

TorchDrug provides 40+ curated datasets across multiple domains: molecular property prediction, protein modeling, knowledge graph reasoning, and retrosynthesis. All datasets support lazy loading, automatic downloading, and customizable feature extraction.

Molecular Property Prediction Datasets

Drug Discovery Classification

| Dataset | Size | Task | Classes | Description |
|---|---|---|---|---|
| BACE | 1,513 | Binary | 2 | β-secretase inhibition for Alzheimer's |
| BBBP | 2,039 | Binary | 2 | Blood-brain barrier penetration |
| HIV | 41,127 | Binary | 2 | Inhibition of HIV replication |
| ClinTox | 1,478 | Multi-label | 2 | Clinical trial toxicity |
| SIDER | 1,427 | Multi-label | 27 | Side effects by system organ class |
| Tox21 | 7,831 | Multi-label | 12 | Toxicity across 12 targets |
| ToxCast | 8,576 | Multi-label | 617 | High-throughput toxicology |
| MUV | 93,087 | Multi-label | 17 | Unbiased validation for screening |

Key Features:

  • Scaffold splits are recommended for realistic evaluation
  • Binary classification metrics: AUROC, AUPRC
  • Multi-label tasks handle missing labels by excluding them from the loss and metrics

Use Cases:

  • Drug safety prediction
  • Virtual screening
  • ADMET property prediction
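
The key features above mention AUROC and missing labels in multi-label datasets. As a minimal sketch of how the two interact, the following standard-library example computes a per-task AUROC while skipping unmeasured labels. The function names and toy data are hypothetical and are not part of the TorchDrug API.

```python
import math

def auroc(labels, scores):
    """Rank-based AUROC (Mann-Whitney U); tied scores count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return math.nan  # undefined when only one class is present
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def masked_multilabel_auroc(label_rows, score_rows):
    """Per-task AUROC, skipping molecules whose label for that task is None."""
    per_task = []
    for t in range(len(label_rows[0])):
        kept = [(ys[t], ss[t]) for ys, ss in zip(label_rows, score_rows)
                if ys[t] is not None]
        per_task.append(auroc([y for y, _ in kept], [s for _, s in kept]))
    return per_task

# Toy data: 4 molecules x 2 tasks; None marks an unmeasured label.
labels = [[1, None], [0, 1], [1, 0], [0, None]]
scores = [[0.9, 0.2], [0.7, 0.8], [0.6, 0.1], [0.4, 0.5]]
per_task_auroc = masked_multilabel_auroc(labels, scores)
```

Averaging the per-task values (ignoring undefined ones) gives a single benchmark number; whether a given benchmark averages this way is an assumption to check against its own protocol.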

Drug Discovery Regression

| Dataset | Size | Property | Units | Description |
|---|---|---|---|---|
| ESOL | 1,128 | Solubility | log(mol/L) | Water solubility |
| FreeSolv | 642 | Hydration | kcal/mol | Hydration free energy |
| Lipophilicity | 4,200 | LogD | - | Octanol/water distribution |
| SAMPL | 643 | Solvation | kcal/mol | Solvation free energies |

Metrics: MAE, RMSE, R²

Use Cases: ADME optimization, lead optimization
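
For reference, the three regression metrics named above can be computed as in this small standard-library sketch. The helper name and the toy solubility values are hypothetical.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, and R² for paired targets and predictions."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)              # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(ss_res / n),
        "r2": 1.0 - ss_res / ss_tot,
    }

# Toy values in log(mol/L), in the style of ESOL targets.
y_true = [-1.0, -2.0, -3.0, -4.0]
y_pred = [-1.5, -2.0, -2.5, -4.0]
metrics = regression_metrics(y_true, y_pred)
```

MAE and RMSE are in the units of the target (for example kcal/mol for FreeSolv), while R² is unitless, so it is the easier one to compare across datasets.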

Quantum Chemistry

| Dataset | Size | Properties | Description |
|---|---|---|---|
| QM7 | 7,165 | 1 | Atomization energy |
| QM8 | 21,786 | 12 | Electronic spectra, excited states |
| QM9 | 133,885 | 12 | Geometric, energetic, electronic, thermodynamic |
| PCQM4M | 3.8M | 1 | Large-scale HOMO-LUMO gap |

Properties (QM9):

  • Dipole moment
  • Isotropic polarizability
  • HOMO/LUMO energies
  • Internal energy, enthalpy, free energy
  • Heat capacity
  • Electronic spatial extent

Use Cases:

  • Quantum property prediction
  • Method development benchmarking
  • Pre-training molecular models

Large Molecule Databases

| Dataset | Size | Description | Use Case |
|---|---|---|---|
| ZINC250k | 250,000 | Drug-like molecules | Generative model training |
| ZINC2M | 2,000,000 | Drug-like molecules | Large-scale pre-training |
| ChEMBL | Millions | Bioactive molecules | Property prediction, generation |

Protein Datasets

Function Prediction

| Dataset | Size | Task | Classes | Description |
|---|---|---|---|---|
| EnzymeCommission | 17,562 | Multi-class | 7 levels | EC number classification |
| GeneOntology | 46,796 | Multi-label | 489 | GO term prediction (BP/MF/CC) |
| BetaLactamase | 5,864 | Regression | - | Enzyme activity levels |
| Fluorescence | 54,025 | Regression | - | GFP fluorescence intensity |
| Stability | 53,614 | Regression | - | Thermostability (ΔΔG) |

Features:

  • Sequence and/or structure input
  • Evolutionary information available
  • Multiple train/test splits

Use Cases:

  • Protein engineering
  • Function annotation
  • Enzyme design

Localization and Solubility

| Dataset | Size | Task | Classes | Description |
|---|---|---|---|---|
| Solubility | 62,478 | Binary | 2 | Protein solubility |
| BinaryLocalization | 22,168 | Binary | 2 | Membrane vs soluble |
| SubcellularLocalization | 8,943 | Multi-class | 10 | Subcellular compartment |

Use Cases:

  • Protein expression optimization
  • Target identification
  • Cell biology

Structure Prediction

| Dataset | Size | Task | Description |
|---|---|---|---|
| Fold | 16,712 | Multi-class (1,195) | Structural fold recognition |
| SecondaryStructure | 8,678 | Sequence labeling | 3-state or 8-state prediction |
| ProteinNet | Varied | Contact prediction | Residue-residue contacts |

Use Cases:

  • Structure prediction pipelines
  • Fold recognition
  • Contact map generation
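
The SecondaryStructure entry above mentions 3-state or 8-state labels. A minimal sketch of one common way to reduce 8-state DSSP-style labels to 3 states follows. The exact mapping is an assumption (other reductions exist) and the helper is hypothetical, not TorchDrug code.

```python
# One common reduction of 8-state labels to helix (H), strand (E), coil (C).
# Assumption: G and I count as helix, B counts as strand; conventions vary.
EIGHT_TO_THREE = {
    "H": "H", "G": "H", "I": "H",  # helices
    "E": "E", "B": "E",            # strands and bridges
    "T": "C", "S": "C", "-": "C",  # turns, bends, and coil
}

def to_three_state(labels8):
    """Map a per-residue 8-state label string to a 3-state label string."""
    return "".join(EIGHT_TO_THREE[c] for c in labels8)

three_state = to_three_state("-HHHGGTEEB-S")
```

Because the mapping is many-to-one, 3-state accuracy is generally higher than 8-state accuracy for the same model, so the two should not be compared directly.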

Protein Interactions

| Dataset | Size | Positives | Negatives | Description |
|---|---|---|---|---|
| HumanPPI | 1,412 proteins | 6,584 | - | Human protein interactions |
| YeastPPI | 2,018 proteins | 6,451 | - | Yeast protein interactions |
| PPIAffinity | 2,156 pairs | - | - | Binding affinity values |

Use Cases:

  • PPI prediction
  • Network biology
  • Drug target identification

Protein-Ligand Binding

| Dataset | Size | Type | Description |
|---|---|---|---|
| BindingDB | ~1.5M | Affinity | Comprehensive binding data |
| PDBBind | 20,000+ | 3D complexes | Structure-based binding |
| - Refined Set | 5,316 | High quality | Curated crystal structures |
| - Core Set | 285 | Benchmark | Diverse test set |

Use Cases:

  • Binding affinity prediction
  • Structure-based drug design
  • Scoring function development

Large Protein Databases

| Dataset | Size | Description |
|---|---|---|
| AlphaFoldDB | 200M+ | Predicted structures for most known proteins |
| UniProt | Integration | Sequence and annotation data |

Knowledge Graph Datasets

General Knowledge

| Dataset | Entities | Relations | Triples | Domain |
|---|---|---|---|---|
| FB15k | 14,951 | 1,345 | 592,213 | Freebase (general knowledge) |
| FB15k-237 | 14,541 | 237 | 310,116 | Filtered Freebase |
| WN18 | 40,943 | 18 | 151,442 | WordNet (lexical) |
| WN18RR | 40,943 | 11 | 93,003 | Filtered WordNet |

Relation Types (FB15k-237):

  • /people/person/nationality
  • /film/film/genre
  • /location/location/contains
  • /business/company/founders
  • Many more...

Use Cases:

  • Link prediction
  • Relation extraction
  • Knowledge base completion
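
To make the Entities, Relations, and Triples columns concrete, here is a small sketch of a knowledge graph stored as (head, relation, tail) triples, along with the candidate answers to a link prediction query. The entity names are hypothetical; only the relation names follow the FB15k-237 style shown above.

```python
from collections import defaultdict

# Hypothetical toy triples: (head, relation, tail).
triples = [
    ("person_a", "/people/person/nationality", "country_x"),
    ("person_b", "/people/person/nationality", "country_x"),
    ("film_a", "/film/film/genre", "genre_drama"),
    ("country_x", "/location/location/contains", "city_y"),
]

entities = {e for h, _, t in triples for e in (h, t)}
relations = {r for _, r, _ in triples}

# Index the known tails for every (head, relation) query.
known_tails = defaultdict(set)
for h, r, t in triples:
    known_tails[(h, r)].add(t)

def candidate_tails(head, relation):
    """Entities that are not already known answers to (head, relation, ?)."""
    return sorted(entities - known_tails[(head, relation)] - {head})

candidates = candidate_tails("person_a", "/people/person/nationality")
```

A link prediction model would score each candidate tail; excluding already-known answers, as done here, is the idea behind the "filtered" evaluation setting.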

Biomedical Knowledge

| Dataset | Entities | Relations | Triples | Description |
|---|---|---|---|---|
| Hetionet | 45,158 | 24 | 2,250,197 | Integrates 29 biomedical databases |

Entity Types in Hetionet:

  • Genes (20,945)
  • Compounds (1,552)
  • Diseases (137)
  • Anatomy (400)
  • Pathways (1,822)
  • Pharmacologic classes
  • Side effects
  • Symptoms
  • Molecular functions
  • Biological processes
  • Cellular components

Relation Types:

  • Compound-binds-Gene
  • Gene-associates-Disease
  • Disease-presents-Symptom
  • Compound-treats-Disease
  • Compound-causes-Side effect
  • Gene-participates-Pathway
  • And 18 more...

Use Cases:

  • Drug repurposing
  • Disease mechanism discovery
  • Target identification
  • Multi-hop reasoning in biomedicine
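
As an illustration of multi-hop reasoning for drug repurposing, the sketch below follows Compound-binds-Gene and then Gene-associates-Disease edges, keeping compound-disease pairs that have no direct "treats" edge yet. The graph is a hypothetical toy, and a real system would score such paths with a learned model and not just enumerate them.

```python
from collections import defaultdict

# Hypothetical toy edges using relation types in the style of the list above.
edges = [
    ("compound_1", "binds", "gene_a"),
    ("compound_2", "binds", "gene_b"),
    ("gene_a", "associates", "disease_x"),
    ("gene_a", "associates", "disease_y"),
    ("gene_b", "associates", "disease_y"),
    ("compound_1", "treats", "disease_x"),
]

# Adjacency index: (head, relation) -> set of tails.
out = defaultdict(set)
for h, r, t in edges:
    out[(h, r)].add(t)

def two_hop(start, rel1, rel2):
    """Nodes reachable from start via a rel1 edge followed by a rel2 edge."""
    return {end
            for mid in out.get((start, rel1), ())
            for end in out.get((mid, rel2), ())}

compounds = sorted({h for h, r, _ in edges if r == "binds"})
repurposing_candidates = sorted(
    (c, d)
    for c in compounds
    for d in two_hop(c, "binds", "associates")
    if d not in out.get((c, "treats"), ())  # skip known treatments
)
```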

Citation Network Datasets

| Dataset | Nodes | Edges | Classes | Description |
|---|---|---|---|---|
| Cora | 2,708 | 5,429 | 7 | Machine learning papers |
| CiteSeer | 3,327 | 4,732 | 6 | Computer science papers |
| PubMed | 19,717 | 44,338 | 3 | Biomedical papers |

Use Cases:

  • Node classification
  • GNN baseline comparisons
  • Method development

Retrosynthesis Datasets

| Dataset | Size | Description |
|---|---|---|
| USPTO-50k | 50,017 | Curated patent reactions, single-step |

Features:

  • Product → Reactants mapping
  • Atom mapping for reaction centers
  • Canonicalized SMILES
  • Balanced across reaction types
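
The product → reactants mapping can be sketched as below, assuming each record is a reaction SMILES string of the form `reactants>agents>product`. The helper names and the toy esterification reaction (without atom mapping) are hypothetical, and the string comparison is only meaningful if all SMILES are canonicalized the same way.

```python
def to_retro_sample(reaction_smiles):
    """Turn 'reactants>agents>product' into a product -> reactants pair."""
    reactants, _agents, product = reaction_smiles.split(">")
    # Sort the reactants so the target does not depend on their written order.
    return {"product": product, "reactants": sorted(reactants.split("."))}

def exact_match(predicted_reactants, sample):
    """Order-independent comparison of a predicted reactant set to the target."""
    return sorted(predicted_reactants.split(".")) == sample["reactants"]

# Toy reaction: acetic acid + ethanol -> ethyl acetate.
sample = to_retro_sample("CC(=O)O.CCO>>CCOC(C)=O")
is_correct = exact_match("CCO.CC(=O)O", sample)
```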

Splits:

  • Train: ~40,000
  • Validation: ~5,000
  • Test: ~5,000

Use Cases:

  • Retrosynthesis prediction
  • Reaction type classification
  • Synthetic route planning

Dataset Usage Patterns

Loading Datasets

```python
from torchdrug import datasets, transforms

# Basic loading
dataset = datasets.BBBP("~/molecule-datasets/")

# With transforms
transform = transforms.VirtualNode()
dataset = datasets.BBBP("~/molecule-datasets/", transform=transform)

# Protein dataset
dataset = datasets.EnzymeCommission("~/protein-datasets/")

# Knowledge graph
dataset = datasets.FB15k237("~/kg-datasets/")
```

Data Splitting

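```python
import torch
from torchdrug import data

# Random split: convert fractions to integer lengths first
lengths = [int(0.8 * len(dataset)), int(0.1 * len(dataset))]
lengths += [len(dataset) - sum(lengths)]
train, valid, test = torch.utils.data.random_split(dataset, lengths)

# Scaffold split (for molecules); takes the same integer lengths
train, valid, test = data.scaffold_split(dataset, lengths)

# Predefined splits (only datasets that ship with one, e.g. protein and knowledge graph datasets)
train, valid, test = dataset.split()
```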
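
To show what a scaffold split does conceptually, here is a standard-library sketch that assumes each molecule already has a scaffold key (for example a Bemis-Murcko scaffold SMILES computed elsewhere). It is a greedy, deterministic variant written for illustration; TorchDrug's own implementation may differ, and the function name is hypothetical.

```python
from collections import defaultdict

def scaffold_group_split(scaffolds, fractions=(0.8, 0.1, 0.1)):
    """Split molecule indices so that no scaffold appears in two subsets."""
    groups = defaultdict(list)
    for index, scaffold in enumerate(scaffolds):
        groups[scaffold].append(index)
    # Largest groups first, so rare scaffolds tend to land in valid/test.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))
    n = len(scaffolds)
    train_cap = fractions[0] * n
    valid_cap = (fractions[0] + fractions[1]) * n
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= train_cap:
            train.extend(group)
        elif len(train) + len(valid) + len(group) <= valid_cap:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test

# Toy scaffold keys for 10 molecules.
scaffolds = ["benzene"] * 6 + ["pyridine"] * 2 + ["indole", "furan"]
train_idx, valid_idx, test_idx = scaffold_group_split(scaffolds)
```

Because whole scaffold groups are assigned at once, the resulting subset sizes only approximate the requested fractions.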

Feature Extraction

Node Features (Molecules):

  • Atom type (one-hot or embedding)
  • Formal charge
  • Hybridization
  • Aromaticity
  • Number of hydrogens
  • Chirality

Edge Features (Molecules):

  • Bond type (single, double, triple, aromatic)
  • Stereochemistry
  • Conjugation
  • Ring membership

Node Features (Proteins):

  • Amino acid type (one-hot)
  • Physicochemical properties
  • Position in sequence
  • Secondary structure
  • Solvent accessibility

Edge Features (Proteins):

  • Edge type (sequential, spatial, contact)
  • Distance
  • Angles and dihedrals
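
As a rough sketch of how some of the molecular node features listed above could be assembled into a vector, the example below one-hot encodes the atom type and appends a few scalar features. The vocabulary, the feature order, and the function names are hypothetical and do not reproduce TorchDrug's built-in feature functions.

```python
ATOM_TYPES = ["C", "N", "O", "F", "S", "Cl"]  # hypothetical vocabulary

def one_hot(value, vocabulary):
    """One-hot vector with an extra trailing slot for unknown values."""
    vec = [0] * (len(vocabulary) + 1)
    index = vocabulary.index(value) if value in vocabulary else len(vocabulary)
    vec[index] = 1
    return vec

def atom_features(symbol, formal_charge, is_aromatic, num_hydrogens):
    """Concatenate the atom-type one-hot with scalar atom features."""
    return one_hot(symbol, ATOM_TYPES) + [formal_charge, int(is_aromatic), num_hydrogens]

# An aromatic nitrogen carrying one hydrogen (as in pyrrole).
features = atom_features("N", 0, True, 1)
```

The extra "unknown" slot keeps the vector length fixed when a molecule contains an element outside the vocabulary.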

Choosing Datasets

By Task

Molecular Property Prediction:

  • Start with BBBP or HIV (manageable size, clear binary task)
  • Use QM9 for quantum properties
  • ESOL/FreeSolv for regression

Protein Function:

  • EnzymeCommission (well-defined classes)
  • GeneOntology (comprehensive annotations)

Drug Safety:

  • Tox21 (standard benchmark)
  • ClinTox (clinical relevance)

Structure-Based:

  • PDBBind (protein-ligand)
  • ProteinNet (structure prediction)

Knowledge Graph:

  • FB15k-237 (standard benchmark)
  • Hetionet (biomedical applications)

Generation:

  • ZINC250k (training)
  • QM9 (with properties)

Retrosynthesis:

  • USPTO-50k (only choice)

By Size and Resources

Small (<5k, for testing):

  • BACE, BBBP, ClinTox, ESOL, FreeSolv
  • Core set of PDBBind

Medium (5k-100k):

  • HIV, Tox21, MUV
  • EnzymeCommission, Fold, GeneOntology
  • FB15k-237, WN18RR

Large (>100k):

  • QM9, PCQM4M
  • AlphaFoldDB
  • ZINC2M, BindingDB
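
The size thresholds above can be written as a small helper, shown here with sizes taken from the tables earlier in this reference. The function name is hypothetical.

```python
def size_tier(num_samples):
    """Classify a dataset by the thresholds above: <5k, 5k-100k, >100k."""
    if num_samples < 5_000:
        return "small"
    if num_samples <= 100_000:
        return "medium"
    return "large"

# Sizes as listed in the dataset tables.
sizes = {"BACE": 1_513, "HIV": 41_127, "QM9": 133_885}
tiers = {name: size_tier(n) for name, n in sizes.items()}
```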

By Domain

  • Drug Discovery: BBBP, HIV, Tox21, ESOL, ZINC
  • Quantum Chemistry: QM7, QM8, QM9, PCQM4M
  • Protein Engineering: Fluorescence, Stability, Solubility
  • Structural Biology: Fold, PDBBind, ProteinNet, AlphaFoldDB
  • Biomedical: Hetionet, GeneOntology, EnzymeCommission
  • Retrosynthesis: USPTO-50k

Best Practices

  1. Start Small: Test the pipeline on a small dataset before scaling up
  2. Scaffold Split: Use scaffold splits for realistic drug discovery evaluation
  3. Balanced Metrics: Report AUROC together with AUPRC on imbalanced data
  4. Multiple Runs: Report mean ± std over multiple random seeds
  5. Data Leakage: Check that the data used to pre-train a model does not overlap with your test set
  6. Domain Knowledge: Understand the property you are predicting and how it was measured
  7. Validation: Tune on the validation set and report final results on a held-out test set
  8. Preprocessing: Standardize features and handle missing values
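
For the "Multiple Runs" practice, a result over several seeds might be summarized as in this sketch. The per-seed AUROC values are hypothetical placeholders, not measured results.

```python
import statistics

def summarize(scores):
    """Format scores from repeated runs as 'mean ± std' (sample std, n - 1)."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)
    return f"{mean:.3f} ± {std:.3f}"

# Hypothetical test AUROC from three runs with different random seeds.
auroc_by_seed = {0: 0.90, 1: 0.92, 2: 0.88}
summary = summarize(list(auroc_by_seed.values()))
```

Reporting the number of seeds alongside the summary makes the spread easier to interpret.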