scientific-skills/deepchem/references/api_reference.md
This document provides a comprehensive reference for DeepChem's core APIs, organized by functionality.
Best Practice: For drug discovery tasks, use ScaffoldSplitter so that molecules sharing a core structure do not end up in both the training and test sets, which would otherwise inflate performance estimates.
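The idea behind a scaffold split can be sketched without any chemistry toolkit. The following is a minimal illustration, not DeepChem's implementation: it assumes each molecule already has a scaffold key (in practice something like a Bemis-Murcko scaffold SMILES computed with RDKit), and the function name `scaffold_split` and the sample scaffold strings are hypothetical. Whole scaffold groups are assigned to a single subset, largest groups first, so no scaffold is shared between subsets.

```python
from collections import defaultdict

def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1):
    """Split sample indices so that each scaffold group lands in one subset.

    scaffolds: list of hashable scaffold keys, one per sample (assumed precomputed).
    Returns (train, valid, test) lists of indices.
    """
    groups = defaultdict(list)
    for idx, key in enumerate(scaffolds):
        groups[key].append(idx)

    # Largest groups first; ties broken by first index so the result is deterministic.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n = len(scaffolds)
    train_cutoff = frac_train * n
    valid_cutoff = (frac_train + frac_valid) * n
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= train_cutoff:
            train.extend(group)
        elif len(train) + len(valid) + len(group) <= valid_cutoff:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test

# Hypothetical scaffold keys for ten molecules.
scaffolds = ["benzene"] * 5 + ["pyridine"] * 3 + ["indole", "furan"]
train, valid, test = scaffold_split(scaffolds)
```

Because the small, rare scaffolds fall into the validation and test subsets, the held-out molecules tend to be structurally unlike the training data, which is what makes this kind of split a harder and more realistic check than a random one.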
Use these with graph neural networks (GCNs, MPNNs, etc.):
Use these with traditional ML (Random Forest, SVM, XGBoost):
radius (default 2), size (default 2048), chiral (default False)

Calculate molecular properties directly:
For recurrent networks and transformers:
| Use Case | Recommended Featurizer | Model Type |
|---|---|---|
| Graph neural networks | ConvMolFeaturizer, MolGraphConvFeaturizer | GCN, MPNN, GAT |
| Traditional ML | CircularFingerprint, RDKitDescriptors | Random Forest, XGBoost, SVM |
| Deep learning (non-graph) | CircularFingerprint, Mol2VecFingerprint | Dense networks, CNN |
| Sequence models | SmilesToSeq | LSTM, GRU, Transformer |
| 3D molecular structures | CoulombMatrix | Specialized 3D models |
| Quick baseline | RDKitDescriptors | Linear, Ridge, Lasso |
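For the sequence-model row, it may help to see what turning SMILES into a sequence involves. This is a plain-Python sketch of character-level encoding with padding, not the `SmilesToSeq` implementation; the helper names `build_vocab` and `encode`, the special tokens, and the example molecules are all assumptions made for illustration.

```python
def build_vocab(smiles_list, pad_token="<pad>", unk_token="<unk>"):
    """Map every character seen in the SMILES strings to an integer index."""
    chars = sorted({ch for smi in smiles_list for ch in smi})
    return {tok: i for i, tok in enumerate([pad_token, unk_token] + chars)}

def encode(smiles, vocab, max_len, pad_token="<pad>", unk_token="<unk>"):
    """Convert one SMILES string to a fixed-length list of indices."""
    # Truncate to max_len, map unknown characters to the <unk> index.
    ids = [vocab.get(ch, vocab[unk_token]) for ch in smiles[:max_len]]
    # Right-pad with the <pad> index so every sequence has the same length.
    return ids + [vocab[pad_token]] * (max_len - len(ids))

smiles_list = ["CCO", "c1ccccc1", "CC(=O)O"]
vocab = build_vocab(smiles_list)
encoded = [encode(smi, vocab, max_len=10) for smi in smiles_list]
```

Splitting by single characters breaks two-letter atoms such as Cl and Br into separate tokens, so a production tokenizer would need to handle those cases.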
SklearnModel(model=RandomForestRegressor())

| Task | Recommended Model | Featurizer |
|---|---|---|
| Small dataset (<1000 samples) | SklearnModel (Random Forest) | CircularFingerprint |
| Medium dataset (1K-100K) | GBDTModel or MultitaskRegressor | CircularFingerprint or ConvMolFeaturizer |
| Large dataset (>100K) | GCNModel, AttentiveFPModel, or DMPNN | MolGraphConvFeaturizer |
| Transfer learning | GroverModel, Chemberta, MoLFormer | Model-specific |
| Materials properties | CGCNNModel, MEGNetModel | Structure-based |
| Molecule generation | BasicMolGANModel, LSTMGenerator | SmilesToSeq |
| Protein sequences | ProtBERT | Sequence-based |
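The dataset-size rows of the table can be read as a simple decision rule. The sketch below encodes only those three rows; the function name `recommend_by_size` is hypothetical, and the handling of exactly 1,000 and 100,000 samples is an assumption because the table leaves those boundaries open.

```python
def recommend_by_size(n_samples):
    """Return (model, featurizer) suggestions from the size rows of the table."""
    if n_samples < 1_000:
        return ("SklearnModel (Random Forest)", "CircularFingerprint")
    if n_samples <= 100_000:
        return ("GBDTModel or MultitaskRegressor",
                "CircularFingerprint or ConvMolFeaturizer")
    return ("GCNModel, AttentiveFPModel, or DMPNN", "MolGraphConvFeaturizer")

model, featurizer = recommend_by_size(5_000)
```

The task-specific rows (transfer learning, materials, generation, proteins) depend on the problem type rather than the sample count, so they are left out of this rule.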
Quick access to 30+ benchmark datasets via dc.molnet.load_*() functions.
import deepchem as dc

tasks, datasets, transformers = dc.molnet.load_bbbp(
    featurizer='GraphConv',  # or 'ECFP', 'Weave', etc.
    splitter='scaffold',     # or 'random', 'stratified', etc.
    reload=False             # set True to cache the featurized data on disk and reuse it
)
train, valid, test = datasets
Common evaluation metrics available in dc.metrics:
Most metrics support multi-task evaluation by averaging over tasks.
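As an illustration of what averaging over tasks means, here is a plain-Python sketch; it is not the `dc.metrics` implementation. It assumes labels and predictions are stored as rows of samples with one column per task, and the helper names `accuracy` and `multitask_mean` are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction matches the label."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

def multitask_mean(metric, y_true, y_pred):
    """Apply `metric` to each task column, then average the per-task scores."""
    n_tasks = len(y_true[0])
    per_task = []
    for task in range(n_tasks):
        true_col = [row[task] for row in y_true]
        pred_col = [row[task] for row in y_pred]
        per_task.append(metric(true_col, pred_col))
    return sum(per_task) / n_tasks, per_task

# Four samples, two tasks (made-up labels and predictions).
y_true = [[1, 0], [0, 0], [1, 1], [0, 1]]
y_pred = [[1, 0], [0, 1], [1, 1], [0, 1]]
mean_score, per_task_scores = multitask_mean(accuracy, y_true, y_pred)
```

Keeping the per-task scores alongside the mean is useful because a single weak task can be hidden by the average.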
Standard DeepChem workflow:
import deepchem as dc

# 1. Load data
loader = dc.data.CSVLoader(tasks=['task1'], feature_field='smiles',
                           featurizer=dc.feat.CircularFingerprint())
dataset = loader.create_dataset('data.csv')
# 2. Split data
splitter = dc.splits.ScaffoldSplitter()
train, valid, test = splitter.train_valid_test_split(dataset)
# 3. Transform data (optional)
# Fit the normalization on the training labels only, then apply it to every split
transformers = [dc.trans.NormalizationTransformer(transform_y=True, dataset=train)]
for transformer in transformers:
    train = transformer.transform(train)
    valid = transformer.transform(valid)
    test = transformer.transform(test)
# 4. Create and train model
model = dc.models.MultitaskRegressor(n_tasks=1, n_features=2048, layer_sizes=[1000])
model.fit(train, nb_epoch=50)
# 5. Evaluate
metric = dc.metrics.Metric(dc.metrics.r2_score)
train_score = model.evaluate(train, [metric])
test_score = model.evaluate(test, [metric])
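Step 3 fits the normalization on the training split and then applies it to all three splits, which keeps validation and test statistics from leaking into training. A minimal plain-Python sketch of that idea follows; it is not DeepChem's `NormalizationTransformer`, and the helper names `fit_normalizer` and `apply_normalizer` and the sample values are hypothetical.

```python
from statistics import mean, pstdev

def fit_normalizer(train_values):
    """Compute the mean and population standard deviation on training data only."""
    return mean(train_values), pstdev(train_values)

def apply_normalizer(values, mu, sigma):
    """Z-score values with statistics learned from the training split."""
    return [(v - mu) / sigma for v in values]

train_y = [2.0, 4.0, 6.0, 8.0]
test_y = [5.0, 11.0]

mu, sigma = fit_normalizer(train_y)
train_norm = apply_normalizer(train_y, mu, sigma)
test_norm = apply_normalizer(test_y, mu, sigma)
```

The normalized test values need not have zero mean or unit variance; they are only expressed on the scale defined by the training data.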
tasks, datasets, transformers = dc.molnet.load_tox21(featurizer='ECFP')
train, valid, test = datasets
model = dc.models.MultitaskClassifier(n_tasks=len(tasks), n_features=1024)
model.fit(train)
featurizer = dc.feat.MolGraphConvFeaturizer()
loader = dc.data.CSVLoader(tasks=['activity'], feature_field='smiles',
                           featurizer=featurizer)
dataset = loader.create_dataset('my_data.csv')
train, test = dc.splits.RandomSplitter().train_test_split(dataset)
model = dc.models.GCNModel(mode='classification', n_tasks=1)
model.fit(train)
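# Assumes train_dataset and test_dataset were featurized for Grover beforehand.
# Constructor arguments vary across DeepChem releases; check the GroverModel
# signature in your installed version before running.
model = dc.models.GroverModel(task='classification', n_tasks=1)
model.fit(train_dataset)
predictions = model.predict(test_dataset)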