scientific-skills/pytdc/references/utilities.md
This document provides comprehensive documentation for TDC's data processing, evaluation, and utility functions.
TDC provides utilities organized into four main categories:
Dataset splitting is crucial for evaluating model generalization. TDC provides multiple splitting strategies designed for therapeutic ML.
from tdc.single_pred import ADME
data = ADME(name='Caco2_Wang')
# Get split with default parameters
split = data.get_split()
# Returns: {'train': DataFrame, 'valid': DataFrame, 'test': DataFrame}
# Customize split parameters
split = data.get_split(
    method='scaffold',
    seed=42,
    frac=[0.7, 0.1, 0.2]
)
Random shuffling of data - suitable for general ML tasks.
split = data.get_split(method='random', seed=1)
When to use:
Not recommended for:
Splits based on molecular scaffolds (Bemis-Murcko scaffolds) - ensures test molecules are structurally distinct from training.
split = data.get_split(method='scaffold', seed=1)
When to use:
How it works:
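The grouping logic can be sketched in plain Python. This is an illustrative sketch, not TDC's implementation: the `get_scaffold` argument stands in for a real Bemis-Murcko scaffold function (TDC uses RDKit for this), and the greedy largest-group-first assignment is one common heuristic.

```python
from collections import defaultdict

def scaffold_split(smiles_list, get_scaffold, frac=(0.8, 0.1, 0.1)):
    """Group molecules by scaffold, then assign whole groups to splits,
    largest groups first, so no scaffold spans two splits."""
    groups = defaultdict(list)
    for idx, smi in enumerate(smiles_list):
        groups[get_scaffold(smi)].append(idx)
    # Largest scaffold groups are placed first (a common heuristic)
    ordered = sorted(groups.values(), key=len, reverse=True)
    n = len(smiles_list)
    caps = [frac[0] * n, (frac[0] + frac[1]) * n]
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= caps[0]:
            train.extend(group)
        elif len(train) + len(valid) + len(group) <= caps[1]:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test
```

Because assignment happens at the group level, every scaffold ends up entirely inside one split, which is what makes the test set structurally novel.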
For multi-instance prediction, cold splits ensure the test set contains unseen drugs, targets, or both.
Cold Drug Split:
from tdc.multi_pred import DTI
data = DTI(name='BindingDB_Kd')
split = data.get_split(method='cold_drug', seed=1)
Cold Target Split:
split = data.get_split(method='cold_target', seed=1)
Cold Drug-Target Split:
split = data.get_split(method='cold_drug_target', seed=1)
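The core idea behind a cold split can be sketched as follows. This is a simplified illustration (not TDC's code): partition the unique drugs first, then route each interaction row to the split its drug belongs to, so no test drug ever appears in training.

```python
import random

def cold_drug_split(pairs, frac=(0.7, 0.1, 0.2), seed=1):
    """Partition unique drugs into train/valid/test, then assign each
    (drug, target, label) row to the split its drug was placed in."""
    drugs = sorted({d for d, _, _ in pairs})
    rng = random.Random(seed)
    rng.shuffle(drugs)
    n = len(drugs)
    n_train = int(frac[0] * n)
    n_valid = int(frac[1] * n)
    train_d = set(drugs[:n_train])
    valid_d = set(drugs[n_train:n_train + n_valid])
    split = {'train': [], 'valid': [], 'test': []}
    for row in pairs:
        if row[0] in train_d:
            split['train'].append(row)
        elif row[0] in valid_d:
            split['valid'].append(row)
        else:
            split['test'].append(row)
    return split
```

A cold target split swaps the roles of drug and target; cold drug-target partitions both entity sets.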
For datasets with temporal information - ensures test data is from later time points.
split = data.get_split(method='temporal', seed=1)
When to use:
# 80% train, 10% valid, 10% test
split = data.get_split(method='scaffold', frac=[0.8, 0.1, 0.1])
# 70% train, 15% valid, 15% test
split = data.get_split(method='scaffold', frac=[0.7, 0.15, 0.15])
For classification tasks with imbalanced labels:
split = data.get_split(method='scaffold', stratified=True)
Maintains label distribution across train/valid/test sets.
TDC provides standardized evaluation metrics for different task types.
from tdc import Evaluator
# Initialize evaluator
evaluator = Evaluator(name='ROC-AUC')
# Evaluate predictions
score = evaluator(y_true, y_pred)
Receiver Operating Characteristic - Area Under Curve
evaluator = Evaluator(name='ROC-AUC')
score = evaluator(y_true, y_pred_proba)
Best for:
Range: 0-1 (higher is better, 0.5 is random)
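ROC-AUC has a useful probabilistic reading: it is the probability that a randomly chosen positive is scored above a randomly chosen negative. A minimal pure-Python sketch of that definition (quadratic in the number of examples, so for illustration only):

```python
def roc_auc(y_true, y_score):
    """ROC-AUC as the Mann-Whitney U statistic: the probability that a
    random positive outranks a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

This is why 0.5 corresponds to random ranking: a random scorer wins exactly half of the positive-negative comparisons.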
Precision-Recall Area Under Curve
evaluator = Evaluator(name='PR-AUC')
score = evaluator(y_true, y_pred_proba)
Best for:
Range: 0-1 (higher is better)
Harmonic mean of precision and recall
evaluator = Evaluator(name='F1')
score = evaluator(y_true, y_pred_binary)
Best for:
Range: 0-1 (higher is better)
Fraction of correct predictions
evaluator = Evaluator(name='Accuracy')
score = evaluator(y_true, y_pred_binary)
Best for:
Not recommended for: Imbalanced datasets
Agreement between predictions and ground truth, accounting for chance
evaluator = Evaluator(name='Kappa')
score = evaluator(y_true, y_pred_binary)
Range: -1 to 1 (higher is better, 0 is random)
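The chance correction is what distinguishes kappa from plain accuracy: observed agreement is discounted by the agreement expected from the marginal label frequencies alone. A small sketch of the formula:

```python
def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: (p_obs - p_exp) / (1 - p_exp), where p_exp is the
    agreement expected by chance from the marginal label frequencies."""
    n = len(y_true)
    labels = set(y_true) | set(y_pred)
    p_obs = sum(t == p for t, p in zip(y_true, y_pred)) / n
    p_exp = sum(
        (y_true.count(c) / n) * (y_pred.count(c) / n) for c in labels
    )
    return (p_obs - p_exp) / (1 - p_exp)
```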
evaluator = Evaluator(name='RMSE')
score = evaluator(y_true, y_pred)
Best for:
Range: 0-∞ (lower is better)
evaluator = Evaluator(name='MAE')
score = evaluator(y_true, y_pred)
Best for:
Range: 0-∞ (lower is better)
evaluator = Evaluator(name='R2')
score = evaluator(y_true, y_pred)
Best for:
Range: -∞ to 1 (higher is better, 1 is perfect)
evaluator = Evaluator(name='MSE')
score = evaluator(y_true, y_pred)
Range: 0-∞ (lower is better)
Rank correlation coefficient
evaluator = Evaluator(name='Spearman')
score = evaluator(y_true, y_pred)
Best for:
Range: -1 to 1 (higher is better)
Linear correlation coefficient
evaluator = Evaluator(name='Pearson')
score = evaluator(y_true, y_pred)
Best for:
Range: -1 to 1 (higher is better)
evaluator = Evaluator(name='Micro-F1')
score = evaluator(y_true_multilabel, y_pred_multilabel)
Available: Micro-F1, Macro-F1, Micro-AUPR, Macro-AUPR
For benchmark groups, evaluation requires multiple seeds:
from tdc.benchmark_group import admet_group
group = admet_group(path='data/')
benchmark = group.get('Caco2_Wang')
# Predictions must be dict with seeds as keys
predictions = {}
for seed in [1, 2, 3, 4, 5]:
    # Train model and predict
    predictions[seed] = model_predictions
# Evaluate with mean and std across seeds
results = group.evaluate(predictions)
print(results) # {'Caco2_Wang': [mean_score, std_score]}
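Conceptually, the aggregation collapses the per-seed scores into a [mean, std] pair per dataset. A sketch of that step (illustrative only; whether TDC uses population or sample standard deviation is not asserted here, population std is assumed):

```python
import statistics

def aggregate_seeds(scores_by_seed):
    """Summarize per-seed scores as [mean, std] across seeds."""
    scores = list(scores_by_seed.values())
    return [statistics.mean(scores), statistics.pstdev(scores)]
```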
TDC provides 11 comprehensive data processing utilities.
Convert between ~15 molecular representations.
from tdc.chem_utils import MolConvert
# SMILES to PyTorch Geometric
converter = MolConvert(src='SMILES', dst='PyG')
pyg_graph = converter('CC(C)Cc1ccc(cc1)C(C)C(O)=O')
# SMILES to DGL
converter = MolConvert(src='SMILES', dst='DGL')
dgl_graph = converter('CC(C)Cc1ccc(cc1)C(C)C(O)=O')
# SMILES to Morgan Fingerprint (ECFP)
converter = MolConvert(src='SMILES', dst='ECFP')
fingerprint = converter('CC(C)Cc1ccc(cc1)C(C)C(O)=O')
Available formats:
Batch conversion:
converter = MolConvert(src='SMILES', dst='PyG')
graphs = converter(['SMILES1', 'SMILES2', 'SMILES3'])
Remove non-drug-like molecules using curated chemical rules.
from tdc.chem_utils import MolFilter
# Initialize filter with rules
mol_filter = MolFilter(
    rules=['PAINS', 'BMS'],  # Chemical filter rules
    property_filters_dict={
        'MW': (150, 500),    # Molecular weight range
        'LogP': (-0.4, 5.6), # Lipophilicity range
        'HBD': (0, 5),       # H-bond donors
        'HBA': (0, 10)       # H-bond acceptors
    }
)
# Filter molecules
filtered_smiles = mol_filter(smiles_list)
Available filter rules:
PAINS - Pan-Assay Interference Compounds
BMS - Bristol-Myers Squibb HTS deck filters
Glaxo - GlaxoSmithKline filters
Dundee - University of Dundee filters
Inpharmatica - Inpharmatica filters
LINT - Pfizer LINT filters
# Visualize label distribution
data.label_distribution()
# Print statistics
data.print_stats()
Displays histogram and computes mean, median, std for continuous labels.
Convert continuous labels to binary using threshold.
from tdc.utils import binarize
# Binarize with threshold
binary_labels = binarize(y_continuous, threshold=5.0, order='ascending')
# order='ascending': values >= threshold become 1
# order='descending': values <= threshold become 1
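The order semantics are easy to get backwards, so here is a minimal pure-Python sketch of the thresholding rule described above (an illustration of the logic, not TDC's implementation):

```python
def binarize(values, threshold, order='ascending'):
    """order='ascending': values >= threshold -> 1;
    order='descending': values <= threshold -> 1."""
    if order == 'ascending':
        return [int(v >= threshold) for v in values]
    return [int(v <= threshold) for v in values]
```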
Transform between measurement units.
from tdc.chem_utils import label_transform
# Convert nM to pKd
y_pkd = label_transform(y_nM, from_unit='nM', to_unit='p')
# Convert μM to nM
y_nM = label_transform(y_uM, from_unit='uM', to_unit='nM')
Available conversions:
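The underlying math for the two conversions shown above is straightforward. pKd is the negative base-10 log of Kd in molar, and 1 nM = 1e-9 M, so pKd = 9 - log10(Kd in nM); the micromolar-to-nanomolar step is a factor of 1000. A sketch (helper names here are illustrative, not TDC's):

```python
import math

def nm_to_p(kd_nm):
    """pKd = -log10(Kd in M); with Kd in nM: pKd = 9 - log10(Kd_nM)."""
    return 9 - math.log10(kd_nm)

def um_to_nm(kd_um):
    """1 uM = 1000 nM."""
    return kd_um * 1000.0
```

So a 1 nM binder has pKd 9, and a 100 nM binder has pKd 7.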
Get interpretable descriptions for labels.
# Get label mapping
label_map = data.get_label_map(name='DrugBank')
print(label_map)
# {0: 'No interaction', 1: 'Increased effect', 2: 'Decreased effect', ...}
Handle class imbalance via over/under-sampling.
from tdc.utils import balance
# Oversample minority class
X_balanced, y_balanced = balance(X, y, method='oversample')
# Undersample majority class
X_balanced, y_balanced = balance(X, y, method='undersample')
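Oversampling simply duplicates randomly chosen minority-class rows until every class matches the majority count. A self-contained sketch of that idea (illustrative, not TDC's implementation):

```python
import random

def oversample(X, y, seed=42):
    """Duplicate random minority-class rows until every class has as
    many rows as the majority class."""
    rng = random.Random(seed)
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    target = max(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for label, rows in by_class.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for xi in rows + extra:
            X_out.append(xi)
            y_out.append(label)
    return X_out, y_out
```

Undersampling is the mirror image: randomly drop majority-class rows down to the minority count, trading data volume for balance.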
Convert paired data to graph representations.
from tdc.utils import create_graph_from_pairs
# Create graph from drug-drug pairs
graph = create_graph_from_pairs(
    pairs=ddi_pairs,    # [(drug1, drug2, label), ...]
    format='edge_list'  # or 'PyG', 'DGL'
)
Generate negative samples for binary tasks.
from tdc.utils import negative_sample
# Generate negative samples for DTI
negative_pairs = negative_sample(
    positive_pairs=known_interactions,
    all_drugs=drug_list,
    all_targets=target_list,
    ratio=1.0  # Negative:positive ratio
)
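The typical strategy is rejection sampling: draw random (drug, target) pairs and keep only those not in the known-positive set. A minimal sketch, assuming enough non-positive pairs exist to satisfy the requested ratio (not TDC's implementation):

```python
import random

def negative_sample(positive_pairs, all_drugs, all_targets, ratio=1.0, seed=1):
    """Draw random (drug, target) pairs not in the positive set until
    ratio * len(positives) unique negatives are collected."""
    rng = random.Random(seed)
    positives = set(positive_pairs)
    n_needed = int(ratio * len(positives))
    negatives = set()
    while len(negatives) < n_needed:
        pair = (rng.choice(all_drugs), rng.choice(all_targets))
        if pair not in positives:
            negatives.add(pair)
    return list(negatives)
```

The assumption baked into this is that an unobserved pair is a true negative, which is only approximately correct for sparsely assayed interaction data.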
Use cases:
Convert between database identifiers.
from tdc.utils import cid2smiles
smiles = cid2smiles(2244) # Aspirin
# Returns: 'CC(=O)Oc1ccccc1C(=O)O'
from tdc.utils import uniprot2seq
sequence = uniprot2seq('P12345')
# Returns: 'MVKVYAPASS...'
# Multiple CIDs
smiles_list = [cid2smiles(cid) for cid in [2244, 5090, 6323]]
# Multiple UniProt IDs
sequences = [uniprot2seq(uid) for uid in ['P12345', 'Q9Y5S9']]
from tdc.utils import retrieve_dataset_names
# Get all datasets for a task
adme_datasets = retrieve_dataset_names('ADME')
dti_datasets = retrieve_dataset_names('DTI')
tox_datasets = retrieve_dataset_names('Tox')
print(f"ADME datasets: {adme_datasets}")
TDC supports fuzzy matching for dataset names:
from tdc.single_pred import ADME
# These all work (typo-tolerant)
data = ADME(name='Caco2_Wang')
data = ADME(name='caco2_wang')
data = ADME(name='Caco2') # Partial match
# Pandas DataFrame (default)
df = data.get_data(format='df')
# Dictionary
data_dict = data.get_data(format='dict')
# DeepPurpose format (for DeepPurpose library)
dp_format = data.get_data(format='DeepPurpose')
# PyG/DGL graphs (if applicable)
graphs = data.get_data(format='PyG')
from tdc.utils import create_fold
# Create cross-validation folds
folds = create_fold(data, fold=5, seed=42)
# Returns list of (train_idx, test_idx) tuples
# Iterate through folds
for i, (train_idx, test_idx) in enumerate(folds):
    train_data = data.iloc[train_idx]
    test_data = data.iloc[test_idx]
    # Train and evaluate
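The fold construction itself can be sketched without any dependencies: shuffle the indices once, slice them into k chunks, and let each chunk serve as the test set exactly once (an illustration of standard k-fold logic, not TDC's code):

```python
import random

def make_folds(n_samples, n_folds=5, seed=42):
    """Return n_folds (train_idx, test_idx) tuples; each index appears
    in exactly one test set."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    chunks = [idx[i::n_folds] for i in range(n_folds)]
    return [
        ([j for k, c in enumerate(chunks) if k != i for j in c], chunks[i])
        for i in range(n_folds)
    ]
```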
from tdc.single_pred import ADME
from tdc import Evaluator
from tdc.chem_utils import MolConvert, MolFilter
# 1. Load data
data = ADME(name='Caco2_Wang')
# 2. Filter molecules
mol_filter = MolFilter(rules=['PAINS'])
filtered_data = data.get_data()
filtered_data = filtered_data[
    filtered_data['Drug'].apply(lambda x: len(mol_filter([x])) > 0)
]
# 3. Split data
split = data.get_split(method='scaffold', seed=42)
train, valid, test = split['train'], split['valid'], split['test']
# 4. Convert to graph representations
converter = MolConvert(src='SMILES', dst='PyG')
train_graphs = converter(train['Drug'].tolist())
# 5. Train model (user implements)
# model.fit(train_graphs, train['Y'])
# 6. Evaluate
evaluator = Evaluator(name='MAE')
# score = evaluator(test['Y'], predictions)
from tdc.benchmark_group import admet_group
from tdc.chem_utils import MolConvert
# Load benchmark group
group = admet_group(path='data/')
# Get multiple datasets
datasets = ['Caco2_Wang', 'HIA_Hou', 'Bioavailability_Ma']
all_data = {}
for dataset_name in datasets:
benchmark = group.get(dataset_name)
all_data[dataset_name] = benchmark
# Prepare for multi-task learning
converter = MolConvert(src='SMILES', dst='ECFP')
# Process each dataset...
from tdc.multi_pred import DTI
from tdc import Evaluator
# Load DTI data
data = DTI(name='BindingDB_Kd')
# Cold drug split
split = data.get_split(method='cold_drug', seed=42)
train, test = split['train'], split['test']
# Verify no drug overlap
train_drugs = set(train['Drug_ID'])
test_drugs = set(test['Drug_ID'])
assert len(train_drugs & test_drugs) == 0, "Drug leakage detected!"
# Train and evaluate
# model.fit(train)
evaluator = Evaluator(name='RMSE')
# score = evaluator(test['Y'], predictions)