scientific-skills/datamol/references/reactions_data.md
Reactions module (datamol.reactions)

The reactions module enables programmatic application of chemical transformations using SMARTS reaction patterns.
dm.reactions.apply_reaction(rxn, reactants, as_smiles=False, sanitize=True, single_product_group=True, rm_attach=True, product_index=0)

Apply a chemical reaction to reactant molecules.

Parameters:
- rxn: Reaction object (created from a SMARTS pattern)
- reactants: Tuple of reactant molecules
- as_smiles: Return SMILES strings (True) or molecule objects (False)
- sanitize: Sanitize product molecules
- single_product_group: Return a single product (True) or all product groups (False)
- rm_attach: Remove attachment point markers
- product_index: Which product to return from the reaction

import datamol as dm
from rdkit import Chem
# Define reaction: alcohol + carboxylic acid → ester
rxn = Chem.rdChemReactions.ReactionFromSmarts(
'[C:1][OH:2].[C:3](=[O:4])[OH:5]>>[C:1][O:2][C:3](=[O:4])'
)
# Apply to reactants
alcohol = dm.to_mol("CCO")
acid = dm.to_mol("CC(=O)O")
product = dm.reactions.apply_reaction(rxn, (alcohol, acid))
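For orientation, the sketch below runs the same esterification with RDKit's own `RunReactants`, which returns every product group as a tuple of tuples. Options such as `single_product_group` and `product_index` can be read as selecting from output of this shape. This is an illustration built only on RDKit calls, not datamol's implementation, and the helper name `run_reaction_smiles` is hypothetical.

```python
from rdkit import Chem
from rdkit.Chem import rdChemReactions


def run_reaction_smiles(rxn_smarts, reactant_smiles):
    """Return one list of product SMILES per product group (hypothetical helper)."""
    rxn = rdChemReactions.ReactionFromSmarts(rxn_smarts)
    reactants = tuple(Chem.MolFromSmiles(smi) for smi in reactant_smiles)
    groups = []
    # RunReactants yields one tuple of products per way the templates match
    for product_group in rxn.RunReactants(reactants):
        smiles_group = []
        for product in product_group:
            Chem.SanitizeMol(product)  # raw reaction products are unsanitized
            smiles_group.append(Chem.MolToSmiles(product))
        groups.append(smiles_group)
    return groups


ester_groups = run_reaction_smiles(
    "[C:1][OH:2].[C:3](=[O:4])[OH:5]>>[C:1][O:2][C:3](=[O:4])",
    ["CCO", "CC(=O)O"],
)
print(ester_groups)
```

Ethanol and acetic acid each match their template in only one way, so a single product group containing ethyl acetate is expected here. Reactants with several matching sites produce several groups.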
Reactions are typically created from SMARTS patterns using RDKit:
from rdkit.Chem import rdChemReactions
# Reaction pattern: [reactant1].[reactant2]>>[product]
rxn = rdChemReactions.ReactionFromSmarts(
'[1*][*:1].[1*][*:2]>>[*:1][*:2]'
)
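To make the `reactants>>products` layout concrete, here is a small standard-library sketch that splits a reaction SMARTS string into its templates and checks that every atom map number used in the products also appears in the reactants. The helper names `split_reaction_smarts` and `atom_map_numbers` are hypothetical, and the parsing is deliberately naive. RDKit remains the authority on whether a pattern is valid.

```python
import re


def split_reaction_smarts(rxn_smarts):
    """Split 'reactants>>products' into two lists of templates (hypothetical helper).

    Simplification: templates are split on every '.', so SMARTS that use
    parentheses to group dot-separated fragments are not handled, and an
    agents section between the two '>' characters is not supported.
    """
    reactant_part, product_part = rxn_smarts.split(">>")
    return reactant_part.split("."), product_part.split(".")


def atom_map_numbers(templates):
    """Collect atom map numbers such as the 1 in '[C:1]'."""
    return {int(n) for t in templates for n in re.findall(r":(\d+)\]", t)}


reactants, products = split_reaction_smarts("[1*][*:1].[1*][*:2]>>[*:1][*:2]")
# Map numbers present in the products but missing from the reactants
unmatched = atom_map_numbers(products) - atom_map_numbers(reactants)
print(reactants, products, unmatched)
```

An empty `unmatched` set means each mapped product atom can be traced back to a reactant atom, which is a quick sanity check before handing the pattern to RDKit.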
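The module also includes helper functions for working with reactions beyond apply_reaction. Common reaction patterns include the following.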
Amide formation:
# Amine + carboxylic acid → amide
amide_rxn = rdChemReactions.ReactionFromSmarts(
'[N:1].[C:2](=[O:3])[OH]>>[N:1][C:2](=[O:3])'
)
Suzuki coupling:
# Aryl halide + boronic acid → biaryl
suzuki_rxn = rdChemReactions.ReactionFromSmarts(
'[c:1][Br].[c:2][B]([OH])[OH]>>[c:1][c:2]'
)
Functional group transformations:
# Alcohol + acid chloride → ester
esterification = rdChemReactions.ReactionFromSmarts(
'[C:1][OH:2].[C:3](=[O:4])[Cl]>>[C:1][O:2][C:3](=[O:4])'
)
import datamol as dm
from rdkit.Chem import rdChemReactions
# 1. Define reaction
rxn_smarts = '[C:1](=[O:2])[OH:3]>>[C:1](=[O:2])[Cl:3]' # Acid → acid chloride
rxn = rdChemReactions.ReactionFromSmarts(rxn_smarts)
# 2. Apply to molecule library
acid_smiles_list = ["CC(=O)O", "OC(=O)c1ccccc1"]  # Example inputs: acetic acid, benzoic acid
acids = [dm.to_mol(smi) for smi in acid_smiles_list]
acid_chlorides = []
for acid in acids:
    try:
        product = dm.reactions.apply_reaction(
            rxn,
            (acid,),  # Single reactant as tuple
            sanitize=True
        )
        acid_chlorides.append(product)
    except Exception as e:
        print(f"Reaction failed: {e}")
# 3. Validate products
valid_products = [p for p in acid_chlorides if p is not None]
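The loop-and-filter pattern above can be wrapped in a reusable helper that keeps track of which inputs failed instead of only printing the error. The sketch below uses the standard library only. `apply_to_library` and `fake_reaction` are hypothetical names, and `fake_reaction` is a string-replacing stand-in for a real reaction call so the bookkeeping can be shown without any chemistry dependency.

```python
def apply_to_library(apply_fn, inputs):
    """Apply apply_fn to each input, separating successes from failures."""
    products, failures = [], []
    for index, item in enumerate(inputs):
        try:
            result = apply_fn(item)
        except Exception as exc:  # keep going, but remember what failed
            failures.append((index, str(exc)))
            continue
        if result is None:
            failures.append((index, "no product"))
        else:
            products.append(result)
    return products, failures


# Stand-in for a real reaction call, used here only to show the bookkeeping
def fake_reaction(smiles):
    if not smiles:
        raise ValueError("empty input")
    return smiles.replace("C(=O)O", "C(=O)Cl") if "C(=O)O" in smiles else None


products, failures = apply_to_library(fake_reaction, ["CC(=O)O", "", "CCO"])
print(products)
print(failures)
```

With datamol, the callable could be something like `lambda acid: dm.reactions.apply_reaction(rxn, (acid,), sanitize=True)`, so the failure list points back to the problematic entries in the input library.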
Data module (datamol.data)

The data module provides convenient access to curated molecular datasets for testing and learning.
dm.data.cdk2(as_df=True, mol_column='mol')

RDKit CDK2 dataset - kinase inhibitor data.

Parameters:
- as_df: Return as DataFrame (True) or list of molecules (False)
- mol_column: Name for molecule column

cdk2_df = dm.data.cdk2(as_df=True)
print(cdk2_df.shape)
print(cdk2_df.columns)
dm.data.freesolv()

FreeSolv dataset - experimental and calculated hydration free energies.

freesolv_df = dm.data.freesolv()
# Columns: iupac, smiles, expt (kcal/mol), calc (kcal/mol)
dm.data.solubility(as_df=True, mol_column='mol')

RDKit solubility dataset with train/test splits.

sol_df = dm.data.solubility(as_df=True)
# Split into train/test
train_df = sol_df[sol_df['split'] == 'train']
test_df = sol_df[sol_df['split'] == 'test']
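# Use for model development: dm.to_fp converts one molecule at a time
X_train = [dm.to_fp(mol) for mol in train_df['mol']]
# Target column as named in this guide; confirm the actual name with
# sol_df.columns (the bundled RDKit SDF may expose it as 'SOL')
y_train = train_df['solubility']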
For testing and tutorials:
# Quick dataset for testing code
df = dm.data.cdk2()
mols = df['mol'].tolist()
# Test descriptor calculation
descriptors_df = dm.descriptors.batch_compute_many_descriptors(mols)
# Test clustering (returns cluster indices and the clustered molecules)
clusters, mol_clusters = dm.cluster_mols(mols, cutoff=0.3)
For learning workflows:
# Complete ML pipeline example
sol_df = dm.data.solubility()
# Preprocessing
train = sol_df[sol_df['split'] == 'train']
test = sol_df[sol_df['split'] == 'test']
# Featurization: dm.to_fp converts one molecule at a time
X_train = [dm.to_fp(mol) for mol in train['mol']]
X_test = [dm.to_fp(mol) for mol in test['mol']]
# Model training (example)
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
model.fit(X_train, train['solubility'])
predictions = model.predict(X_test)
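The pipeline above stops at `predictions`. A natural next step is to score them against the held-out targets. The sketch below computes the root-mean-square error with the standard library only. The function name `rmse` is chosen here for illustration, and the numbers are made up to exercise the function, not taken from the solubility dataset.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up numbers purely to exercise the function, not real solubility data
example_true = [1.0, 2.0, 3.0]
example_pred = [1.0, 2.0, 5.0]
example_score = rmse(example_true, example_pred)
print(example_score)
```

In the pipeline this could be called as `rmse(list(test['solubility']), list(predictions))`, assuming the target column carries that name in the loaded DataFrame.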