skills/molfeat/SKILL.md
Molfeat is a comprehensive Python library for molecular featurization that unifies 100+ pre-trained embeddings and hand-crafted featurizers. Convert chemical structures (SMILES strings or RDKit molecules) into numerical representations for machine learning tasks including QSAR modeling, virtual screening, similarity searching, and deep learning applications. Features fast parallel processing, scikit-learn compatible transformers, and built-in caching.
Version note: Examples target molfeat 0.11.0 (PyPI stable, May 2025). Requires Python 3.9–3.10 (requires-python caps below 3.11). Depends on datamol ≥0.8.0 and PyTorch ≥1.13. Since 0.8.7, prefer datamol Mol objects over raw rdkit.Chem.Mol. Since 0.10.1, fingerprint calculators use RDKit's rdFingerprintGenerator API internally. Since 0.11.0, pretrained models load in memory and base models are set to PyTorch evaluation mode automatically.
This skill should be used when working with:
Use a Python 3.9 or 3.10 environment (molfeat does not install on 3.11+ as of 0.11.0):
uv pip install "molfeat==0.11.0"
# With all pip-installable optional dependencies
uv pip install "molfeat[all]==0.11.0"
Optional dependency extras (PyPI):
molfeat[dgl] — GNN models (GIN variants); upstream recommends dgl<=2.0 (graphbolt issues in newer DGL)molfeat[graphormer] — Graphormer modelsmolfeat[transformer] — ChemBERTa, ChemGPT, MolT5molfeat[fcd] — FCD descriptorsmolfeat[pyg] — PyTorch Geometric featurizersmolfeat[viz] — NGLView visualization widgetsExternal featurizers: MAP4 is not bundled in molfeat extras — install from reymond-group/map4 separately. Some heavy deps (DGL, dgllife, graphormer-pretrained) are easier via conda-forge; see optional dependencies.
Molfeat organizes featurization into three hierarchical classes:
molfeat.calc)Callable objects that convert individual molecules into feature vectors. Accept RDKit Chem.Mol objects or SMILES strings.
Use calculators for:
Example:
from molfeat.calc import FPCalculator
calc = FPCalculator("ecfp", radius=3, fpSize=2048)
features = calc("CCO") # Returns numpy array (2048,)
molfeat.trans)Scikit-learn compatible transformers that wrap calculators for batch processing with parallelization.
Use transformers for:
Example:
from molfeat.trans import MoleculeTransformer
from molfeat.calc import FPCalculator
transformer = MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1)
features = transformer(smiles_list) # Parallel processing
molfeat.trans.pretrained)Specialized transformers for deep learning models with batched inference and caching.
Use pretrained transformers for:
Example:
from molfeat.trans.pretrained import PretrainedMolTransformer
transformer = PretrainedMolTransformer("ChemBERTa-77M-MLM", n_jobs=-1)
embeddings = transformer(smiles_list) # Deep learning embeddings
import datamol as dm
from molfeat.calc import FPCalculator
from molfeat.trans import MoleculeTransformer
# Load molecular data
smiles = ["CCO", "CC(=O)O", "c1ccccc1", "CC(C)O"]
# Create calculator and transformer
calc = FPCalculator("ecfp", radius=3)
transformer = MoleculeTransformer(calc, n_jobs=-1)
# Featurize molecules
features = transformer(smiles)
print(f"Shape: {features.shape}") # (4, 2048)
# Save featurizer configuration for reproducibility
transformer.to_state_yaml_file("featurizer_config.yml")
# Reload exact configuration
loaded = MoleculeTransformer.from_state_yaml_file("featurizer_config.yml")
# Process dataset with potentially invalid SMILES
transformer = MoleculeTransformer(
calc,
n_jobs=-1,
ignore_errors=True, # Continue on failures
verbose=True # Log error details
)
features = transformer(smiles_with_errors)
# Returns None for failed molecules
Start with fingerprints:
# ECFP - Most popular, general-purpose
FPCalculator("ecfp", radius=3, fpSize=2048)
# MACCS - Fast, good for scaffold hopping
FPCalculator("maccs")
# MAP4 - Efficient for large-scale screening
FPCalculator("map4")
For interpretable models:
# RDKit 2D descriptors (200+ named properties)
from molfeat.calc import RDKitDescriptors2D
RDKitDescriptors2D()
# Mordred (1800+ comprehensive descriptors)
from molfeat.calc import MordredDescriptors
MordredDescriptors()
Combine multiple featurizers:
from molfeat.trans import FeatConcat
concat = FeatConcat([
FPCalculator("maccs"), # 167 dimensions
FPCalculator("ecfp") # 2048 dimensions
]) # Result: 2215-dimensional combined features
Transformer-based embeddings:
# ChemBERTa - Pre-trained on 77M PubChem compounds
PretrainedMolTransformer("ChemBERTa-77M-MLM")
# ChemGPT - Autoregressive language model
PretrainedMolTransformer("ChemGPT-1.2B")
Graph neural networks:
# GIN models with different pre-training objectives
PretrainedMolTransformer("gin-supervised-masking")
PretrainedMolTransformer("gin-supervised-infomax")
# Graphormer for quantum chemistry
PretrainedMolTransformer("Graphormer-pcqm4mv2")
# ECFP - General purpose, most widely used
FPCalculator("ecfp")
# MACCS - Fast, scaffold-based similarity
FPCalculator("maccs")
# MAP4 - Efficient for large databases
FPCalculator("map4")
# USR/USRCAT - 3D shape similarity
from molfeat.calc import USRDescriptors
USRDescriptors()
# FCFP - Functional group based
FPCalculator("fcfp")
# CATS - Pharmacophore pair distributions
from molfeat.calc import CATSCalculator
CATSCalculator(mode="2D")
# Gobbi - Explicit pharmacophore features
FPCalculator("gobbi2D")
from molfeat.trans import MoleculeTransformer
from molfeat.calc import FPCalculator
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
# Featurize molecules
transformer = MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1)
X = transformer(smiles_train)
# Train model
model = RandomForestRegressor(n_estimators=100)
scores = cross_val_score(model, X, y_train, cv=5)
print(f"R² = {scores.mean():.3f}")
# Save configuration for deployment
transformer.to_state_yaml_file("production_featurizer.yml")
from sklearn.ensemble import RandomForestClassifier
# Train on known actives/inactives
transformer = MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1)
X_train = transformer(train_smiles)
clf = RandomForestClassifier(n_estimators=500)
clf.fit(X_train, train_labels)
# Screen large library
X_screen = transformer(screening_library) # e.g., 1M compounds
predictions = clf.predict_proba(X_screen)[:, 1]
# Rank and select top hits
top_indices = predictions.argsort()[::-1][:1000]
top_hits = [screening_library[i] for i in top_indices]
from sklearn.metrics.pairwise import cosine_similarity
# Query molecule
calc = FPCalculator("ecfp")
query_fp = calc(query_smiles).reshape(1, -1)
# Database fingerprints
transformer = MoleculeTransformer(calc, n_jobs=-1)
database_fps = transformer(database_smiles)
# Compute similarity
similarities = cosine_similarity(query_fp, database_fps)[0]
top_similar = similarities.argsort()[-10:][::-1]
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
# Create end-to-end pipeline
pipeline = Pipeline([
('featurizer', MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1)),
('classifier', RandomForestClassifier(n_estimators=100))
])
# Train and predict directly on SMILES
pipeline.fit(smiles_train, y_train)
predictions = pipeline.predict(smiles_test)
featurizers = {
'ECFP': FPCalculator("ecfp"),
'MACCS': FPCalculator("maccs"),
'Descriptors': RDKitDescriptors2D(),
'ChemBERTa': PretrainedMolTransformer("ChemBERTa-77M-MLM")
}
results = {}
for name, feat in featurizers.items():
transformer = MoleculeTransformer(feat, n_jobs=-1)
X = transformer(smiles)
# Evaluate with your ML model
score = score_model(X, y)
results[name] = score
Use the ModelStore to explore all available featurizers:
from molfeat.store.modelstore import ModelStore
store = ModelStore()
# List all available models
all_models = store.available_models
print(f"Total featurizers: {len(all_models)}")
# Search for specific models
chemberta_models = store.search(name="ChemBERTa")
for model in chemberta_models:
print(f"- {model.name}: {model.description}")
# Get usage information
model_card = store.search(name="ChemBERTa-77M-MLM")[0]
model_card.usage() # Display usage examples
# Load model
transformer = store.load("ChemBERTa-77M-MLM")
class CustomTransformer(MoleculeTransformer):
def preprocess(self, mol):
"""Custom preprocessing pipeline"""
if isinstance(mol, str):
mol = dm.to_mol(mol)
mol = dm.standardize_mol(mol)
mol = dm.remove_salts(mol)
return mol
transformer = CustomTransformer(FPCalculator("ecfp"), n_jobs=-1)
import numpy as np
def featurize_in_chunks(smiles_list, transformer, chunk_size=10000):
"""Process large datasets in chunks to manage memory"""
all_features = []
for i in range(0, len(smiles_list), chunk_size):
chunk = smiles_list[i:i+chunk_size]
features = transformer(chunk)
all_features.append(features)
return np.vstack(all_features)
Prefer molfeat's built-in pretrained-model cache when possible. For custom embedding caches, use NumPy arrays instead of pickle (pickle can execute arbitrary code when loading untrusted files):
import numpy as np
from pathlib import Path
cache_file = Path("embeddings_cache.npz") # fixed path under your project
transformer = PretrainedMolTransformer("ChemBERTa-77M-MLM", n_jobs=-1)
if cache_file.exists():
embeddings = np.load(cache_file)["embeddings"]
else:
embeddings = transformer(smiles_list)
np.savez(cache_file, embeddings=embeddings)
n_jobs=-1 to utilize all CPU coresdtype=np.float32 when precision allowsignore_errors=True for large datasetsQuick reference for frequently used featurizers:
| Featurizer | Type | Dimensions | Speed | Use Case |
|---|---|---|---|---|
ecfp | Fingerprint | 2048 | Fast | General purpose |
maccs | Fingerprint | 167 | Very fast | Scaffold similarity |
desc2D | Descriptors | 200+ | Fast | Interpretable models |
mordred | Descriptors | 1800+ | Medium | Comprehensive features |
map4 | Fingerprint | 1024 | Fast | Large-scale screening |
ChemBERTa-77M-MLM | Deep learning | 768 | Slow* | Transfer learning |
gin-supervised-masking | GNN | Variable | Slow* | Graph-based models |
*First run is slow; subsequent runs benefit from caching
This skill includes comprehensive reference documentation:
Complete API documentation covering:
molfeat.calc - All calculator classes and parametersmolfeat.trans - Transformer classes and methodsmolfeat.store - ModelStore usageWhen to load: Reference when implementing specific calculators, understanding transformer parameters, or integrating with scikit-learn/PyTorch.
Comprehensive catalog of all 100+ featurizers organized by category:
When to load: Reference when selecting the optimal featurizer for a specific task, exploring available options, or understanding featurizer characteristics.
Search tip: Use grep to find specific featurizer types:
grep -i "chembert" references/available_featurizers.md
grep -i "pharmacophore" references/available_featurizers.md
Practical code examples for common scenarios:
When to load: Reference when implementing specific workflows, troubleshooting issues, or learning molfeat patterns.
Enable error handling to skip invalid SMILES:
transformer = MoleculeTransformer(
calc,
ignore_errors=True,
verbose=True
)
Process in chunks or use streaming approaches for datasets > 100K molecules.
Some models require additional packages. Install specific extras (pin version for reproducibility):
uv pip install "molfeat[transformer]==0.11.0" # For ChemBERTa/ChemGPT
uv pip install "molfeat[dgl]==0.11.0" # For GIN models
uv pip install "molfeat[graphormer]==0.11.0" # For Graphormer
Save exact configurations and document versions:
transformer.to_state_yaml_file("config.yml")
import molfeat
print(f"molfeat version: {molfeat.__version__}")