# Molfeat API Reference
Molfeat is organized into several key modules that provide different aspects of molecular featurization:

- `molfeat.store` - Manages model loading, listing, and registration
- `molfeat.calc` - Provides calculators for single-molecule featurization
- `molfeat.trans` - Offers scikit-learn compatible transformers for batch processing
- `molfeat.utils` - Utility functions for data handling
- `molfeat.viz` - Visualization tools for molecular features

Calculators are callable objects that convert individual molecules into feature vectors. They accept either RDKit `Chem.Mol` objects or SMILES strings as input.
Base abstract class for all calculators. When subclassing, you must implement:

- `__call__()` - Required method for featurization
- `__len__()` - Optional, returns output length
- `columns` - Optional property, returns feature names
- `batch_compute()` - Optional, for efficient batch processing

State Management Methods:

- `to_state_json()` - Save calculator state as JSON
- `to_state_yaml()` - Save calculator state as YAML
- `from_state_dict()` - Load a calculator from a state dictionary
- `to_state_dict()` - Export calculator state as a dictionary

**FPCalculator** - Computes molecular fingerprints. Supports 15+ fingerprint methods.
Supported Fingerprint Types:
Structural Fingerprints:
- `ecfp` - Extended-connectivity fingerprints (circular)
- `fcfp` - Functional-class fingerprints
- `rdkit` - RDKit topological fingerprints
- `maccs` - MACCS keys (166-bit structural keys)
- `avalon` - Avalon fingerprints
- `pattern` - Pattern fingerprints
- `layered` - Layered fingerprints

Atom-based Fingerprints:

- `atompair` - Atom pair fingerprints
- `atompair-count` - Counted atom pairs
- `topological` - Topological torsion fingerprints
- `topological-count` - Counted topological torsions

Specialized Fingerprints:

- `map4` - MinHashed atom-pair fingerprint up to 4 bonds
- `secfp` - SMILES extended connectivity fingerprint
- `erg` - Extended reduced graphs
- `estate` - Electrotopological state indices

Parameters:

- `method` (str) - Fingerprint type name
- `radius` (int) - Radius for circular fingerprints (default: 3)
- `fpSize` (int) - Fingerprint size (default: 2048)
- `includeChirality` (bool) - Include chirality information
- `counting` (bool) - Use count vectors instead of binary

Usage:
```python
from molfeat.calc import FPCalculator

# Create a fingerprint calculator
calc = FPCalculator("ecfp", radius=3, fpSize=2048)

# Compute the fingerprint for a single molecule
fp = calc("CCO")  # Returns a numpy array

# Get the fingerprint length
length = len(calc)  # 2048

# Get feature names
names = calc.columns
```
**RDKitDescriptors2D** - Computes 2D molecular descriptors using RDKit.

```python
from molfeat.calc import RDKitDescriptors2D

calc = RDKitDescriptors2D()
descriptors = calc("CCO")  # Returns 200+ descriptors
```
**RDKitDescriptors3D** - Computes 3D molecular descriptors (requires conformer generation).
**MordredDescriptors** - Calculates over 1800 molecular descriptors using Mordred.

```python
from molfeat.calc import MordredDescriptors

calc = MordredDescriptors()
descriptors = calc("CCO")
```
**Pharmacophore2D** - RDKit's 2D pharmacophore fingerprint generation.
**Pharmacophore3D** - Consensus pharmacophore fingerprints from multiple conformers.
**CATSCalculator** - Computes Chemically Advanced Template Search (CATS) descriptors: pharmacophore point pair distributions.
Parameters:
- `mode` - "2D" or "3D" distance calculations
- `dist_bins` - Distance bins for pair distributions
- `scale` - Scaling mode: "raw", "num", or "count"

```python
from molfeat.calc import CATSCalculator

calc = CATSCalculator(mode="2D", scale="raw")
cats = calc("CCO")  # Returns 21 descriptors by default
```
**USRDescriptors** - Ultrafast shape recognition descriptors (multiple variants).

**ElectroShapeDescriptors** - Electrostatic shape descriptors combining shape, chirality, and electrostatics.

**ScaffoldKeyCalculator** - Computes 40+ scaffold-based molecular properties.

**AtomCalculator** - Atom-level featurization for graph neural networks.

**BondCalculator** - Bond-level featurization for graph neural networks.
**get_calculator()** - Factory function to instantiate calculators by name.

```python
from molfeat.calc import get_calculator

# Instantiate any calculator by name
calc = get_calculator("ecfp", radius=3)
calc = get_calculator("maccs")
calc = get_calculator("desc2D")
```
Raises ValueError for unsupported featurizers.
Transformers wrap calculators into complete featurization pipelines for batch processing.
**MoleculeTransformer** - Scikit-learn compatible transformer for batch molecular featurization.
Key Parameters:
- `featurizer` - Calculator or featurizer to use
- `n_jobs` (int) - Number of parallel jobs (-1 for all cores)
- `dtype` - Output data type (numpy float32/64, torch tensors)
- `verbose` (bool) - Enable verbose logging
- `ignore_errors` (bool) - Continue on failures (returns None for failed molecules)

Essential Methods:

- `transform(mols)` - Processes batches and returns representations
- `_transform(mol)` - Handles individual molecule featurization
- `__call__(mols)` - Convenience wrapper around transform()
- `preprocess(mol)` - Prepares input molecules (not automatically applied)
- `to_state_yaml_file(path)` - Save transformer configuration
- `from_state_yaml_file(path)` - Load transformer configuration

Usage:
```python
from molfeat.calc import FPCalculator
from molfeat.trans import MoleculeTransformer
import datamol as dm

# Load molecules
smiles = dm.data.freesolv().sample(100).smiles.values

# Create the transformer
calc = FPCalculator("ecfp")
transformer = MoleculeTransformer(calc, n_jobs=-1)

# Featurize the batch
features = transformer(smiles)  # Returns a numpy array of shape (100, 2048)

# Save the configuration
transformer.to_state_yaml_file("ecfp_config.yml")

# Reload it
transformer = MoleculeTransformer.from_state_yaml_file("ecfp_config.yml")
```
Performance: Testing on 642 molecules showed 3.4x speedup using 4 parallel jobs versus single-threaded processing.
**FeatConcat** - Concatenates multiple featurizers into unified representations.

```python
from molfeat.trans import FeatConcat, MoleculeTransformer
from molfeat.calc import FPCalculator

# Combine multiple fingerprints
concat = FeatConcat([
    FPCalculator("maccs"),  # 167 dimensions
    FPCalculator("ecfp"),   # 2048 dimensions
])

# Result: 2215-dimensional features
transformer = MoleculeTransformer(concat, n_jobs=-1)
features = transformer(smiles)
```
**PretrainedMolTransformer** - Subclass of MoleculeTransformer for pre-trained deep learning models.

Unique Features:

- `_embed()` - Batched inference for neural networks
- `_convert()` - Transforms SMILES/molecules into model-compatible formats

Usage:

```python
from molfeat.trans.pretrained import PretrainedMolTransformer

# Load a pretrained model
transformer = PretrainedMolTransformer("ChemBERTa-77M-MLM", n_jobs=-1)

# Generate embeddings
embeddings = transformer(smiles)
```
**PrecomputedMolTransformer** - Transformer for cached/precomputed features.
**ModelStore** - Manages featurizer discovery, loading, and registration; the central hub for accessing available featurizers.

Key Methods:

- `available_models` - Property listing all available featurizers
- `search(name=None, **kwargs)` - Search for specific featurizers
- `load(name, **kwargs)` - Load a featurizer by name
- `register(name, card)` - Register a custom featurizer

Usage:
```python
from molfeat.store.modelstore import ModelStore

# Initialize the store
store = ModelStore()

# List all available models
all_models = store.available_models
print(f"Found {len(all_models)} featurizers")

# Search for a specific model
results = store.search(name="ChemBERTa-77M-MLM")
if results:
    model_card = results[0]
    # View usage information
    model_card.usage()
    # Load the model
    transformer = model_card.load()

# Direct loading
transformer = store.load("ChemBERTa-77M-MLM")
```
ModelCard Attributes:

- `name` - Model identifier
- `description` - Model description
- `version` - Model version
- `authors` - Model authors
- `tags` - Categorization tags
- `usage()` - Display usage examples
- `load(**kwargs)` - Load the model

```python
# Enable error tolerance
featurizer = MoleculeTransformer(
    calc,
    n_jobs=-1,
    verbose=True,
    ignore_errors=True,
)

# Failed molecules return None
features = featurizer(smiles_with_errors)
```
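Since failed molecules come back as None, downstream code usually filters them out while keeping inputs and features aligned (illustrated here with a hand-made feature list standing in for transformer output):

```python
# Hypothetical transformer output with ignore_errors=True:
smiles_with_errors = ["CCO", "not_a_smiles", "c1ccccc1"]
features = [[0.1, 0.2], None, [0.3, 0.4]]  # None marks a failed molecule

# Keep only molecules that featurized successfully
valid = [(s, f) for s, f in zip(smiles_with_errors, features) if f is not None]
kept_smiles = [s for s, _ in valid]
kept_features = [f for _, f in valid]
print(kept_smiles)  # ['CCO', 'c1ccccc1']
```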
```python
# NumPy float32 (default)
features = transformer(smiles, enforce_dtype=True)

# PyTorch tensors
import torch

transformer = MoleculeTransformer(calc, dtype=torch.float32)
features = transformer(smiles)
```
```python
# Save the transformer state
transformer.to_state_yaml_file("config.yml")
transformer.to_state_json_file("config.json")

# Load from a saved state
transformer = MoleculeTransformer.from_state_yaml_file("config.yml")
transformer = MoleculeTransformer.from_state_json_file("config.json")
```
```python
# Manual preprocessing
mol = transformer.preprocess("CCO")

# Transform with preprocessing
features = transformer.transform(smiles_list)
```
```python
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from molfeat.trans import MoleculeTransformer
from molfeat.calc import FPCalculator

# Create the pipeline
pipeline = Pipeline([
    ("featurizer", MoleculeTransformer(FPCalculator("ecfp"))),
    ("classifier", RandomForestClassifier()),
])

# Fit and predict
pipeline.fit(smiles_train, y_train)
predictions = pipeline.predict(smiles_test)
```
```python
import torch
from torch.utils.data import Dataset, DataLoader
from molfeat.calc import FPCalculator
from molfeat.trans import MoleculeTransformer

class MoleculeDataset(Dataset):
    def __init__(self, smiles, labels, transformer):
        self.smiles = smiles
        self.labels = labels
        self.transformer = transformer

    def __len__(self):
        return len(self.smiles)

    def __getitem__(self, idx):
        # The transformer expects a batch, so wrap the single SMILES in a list
        features = self.transformer([self.smiles[idx]])[0]
        return torch.tensor(features), torch.tensor(self.labels[idx])

# Create the dataset and dataloader
transformer = MoleculeTransformer(FPCalculator("ecfp"))
dataset = MoleculeDataset(smiles, labels, transformer)
loader = DataLoader(dataset, batch_size=32)
```
Best practices:

- Use `n_jobs=-1` to utilize all CPU cores
- Use `ignore_errors=True` for large datasets with potential invalid molecules