Datamol Fragments and Scaffolds Reference

Scaffolds Module (datamol.scaffold)

Scaffolds represent the core structure of molecules, useful for identifying structural families and analyzing structure-activity relationships (SAR).

Murcko Scaffolds

dm.to_scaffold_murcko(mol)

Extract Bemis-Murcko scaffold (molecular framework).

  • Method: Removes side chains, retaining ring systems and the linkers that connect them
  • Returns: Molecule object representing the scaffold
  • Use case: Identify core structures across compound series
  • Example:
    ```python
    import datamol as dm

    mol = dm.to_mol("c1ccc(cc1)CCN")  # Phenethylamine
    scaffold = dm.to_scaffold_murcko(mol)
    scaffold_smiles = dm.to_smiles(scaffold)
    # Returns: 'c1ccccc1' (the aminoethyl chain is a side chain, so only the benzene ring remains)
    ```

Workflow for scaffold analysis:

```python
from collections import Counter

import datamol as dm

# Extract scaffolds from a compound library (mols is a list of molecule objects)
scaffolds = [dm.to_scaffold_murcko(mol) for mol in mols]
scaffold_smiles = [dm.to_smiles(s) for s in scaffolds]

# Count scaffold frequency
scaffold_counts = Counter(scaffold_smiles)
most_common = scaffold_counts.most_common(10)
```

Fuzzy Scaffolds

dm.scaffold.fuzzy_scaffolding(mols, ...)

Generate fuzzy scaffolds for a list of molecules, optionally enforcing groups that must appear in the core.

  • Purpose: More flexible scaffold definition that can keep specified functional groups in the core
  • Use case: Custom scaffold definitions beyond Murcko rules

Applications

Scaffold-based splitting (for ML model validation):

```python
from collections import defaultdict

# Group compounds by scaffold SMILES
scaffold_to_mols = defaultdict(list)
for mol, scaffold in zip(mols, scaffolds):
    scaffold_to_mols[dm.to_smiles(scaffold)].append(mol)

# Assign whole scaffold groups to either train or test,
# so that no scaffold appears in both sets
```
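
One way to turn a scaffold grouping into a split is to assign whole groups to one side or the other. The sketch below is a minimal greedy version under stated assumptions, not a datamol function: `scaffold_split` and `test_fraction` are hypothetical names, and plain compound identifiers stand in for molecules so the example runs without any chemistry library.

```python
def scaffold_split(scaffold_to_items, test_fraction=0.2):
    """Greedy scaffold split: each scaffold group goes wholly to train or test."""
    total = sum(len(items) for items in scaffold_to_items.values())
    test_target = test_fraction * total
    train, test = [], []
    # Visit the smallest groups first so rare scaffolds fill the test set.
    # Ties are broken by scaffold SMILES to keep the split deterministic.
    groups = sorted(scaffold_to_items.items(), key=lambda kv: (len(kv[1]), kv[0]))
    for _scaffold, items in groups:
        if len(test) + len(items) <= test_target:
            test.extend(items)
        else:
            train.extend(items)
    return train, test


# Toy input: scaffold SMILES -> compound identifiers (illustrative only)
toy_groups = {
    "c1ccccc1": ["cpd1", "cpd2", "cpd3", "cpd4", "cpd5", "cpd6"],
    "c1ccncc1": ["cpd7", "cpd8"],
    "C1CCCCC1": ["cpd9"],
    "c1ccoc1": ["cpd10"],
}
train_ids, test_ids = scaffold_split(toy_groups, test_fraction=0.2)
```

Because groups are never divided, the test set only contains scaffolds the model has not seen during training. The achieved test fraction can fall below the target when the remaining groups are too large to fit.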

SAR analysis:

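```python
import numpy as np

# Group by scaffold and analyze activity.
# get_activity is a user-supplied lookup (not part of datamol) that returns a numeric activity.
for scaffold_smi, molecules in scaffold_to_mols.items():
    activities = [get_activity(mol) for mol in molecules]
    print(f"Scaffold: {scaffold_smi}, n={len(activities)}, mean activity: {np.mean(activities):.2f}")
```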

Fragments Module (datamol.fragment)

Molecular fragmentation breaks molecules into smaller pieces based on chemical rules, useful for fragment-based drug design and substructure analysis.

BRICS Fragmentation

dm.fragment.brics(mol, ...)

Fragment molecule using BRICS (Breaking Retrosynthetically Interesting Chemical Substructures).

  • Method: Dissects based on 16 chemically meaningful bond types
  • Consideration: Considers chemical environment and surrounding substructures
  • Returns: Fragments as molecule objects (in current datamol releases); convert them with dm.to_smiles to get fragment SMILES strings
  • Use case: Retrosynthetic analysis, fragment-based design
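  • Example:
    ```python
    mol = dm.to_mol("c1ccccc1CCN")
    fragments = dm.fragment.brics(mol)
    fragment_smiles = [dm.to_smiles(frag) for frag in fragments]
    # Fragment SMILES look like '[8*]CCN' and '[16*]c1ccccc1'
    # [n*] marks an attachment point; n is the BRICS environment label of the cut bond
    ```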

RECAP Fragmentation

dm.fragment.recap(mol, ...)

Fragment molecule using RECAP (Retrosynthetic Combinatorial Analysis Procedure).

  • Method: Dissects based on 11 predefined bond types
  • Rules:
    • Leaves alkyl groups smaller than 5 carbons intact
    • Preserves cyclic bonds
  • Returns: Fragments as molecule objects (in current datamol releases); convert them with dm.to_smiles to get fragment SMILES strings
  • Use case: Combinatorial library design
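  • Example:
    ```python
    mol = dm.to_mol("CCCCCc1ccccc1")  # Pentylbenzene
    fragments = dm.fragment.recap(mol)
    # Pentylbenzene has no amide, ester, amine, or ether bond for RECAP to cut,
    # so little or no fragmentation is expected for this input
    ```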

MMPA Fragmentation

dm.fragment.mmpa_frag(mol, ...)

Fragment for Matched Molecular Pair Analysis.

  • Purpose: Generate fragments suitable for identifying molecular pairs
  • Use case: Analyzing how small structural changes affect properties
  • Example:
    ```python
    fragments = dm.fragment.mmpa_frag(mol)
    # Used to find pairs of molecules that differ by a single transformation
    ```
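
The pairing step that follows fragmentation can be sketched independently of datamol. The example below assumes each molecule has already been cut once into a core and a substituent. The records are hand-written for illustration and are not the output format of `dm.fragment.mmpa_frag`, and `find_matched_pairs` is a hypothetical helper name.

```python
from collections import defaultdict
from itertools import combinations


def find_matched_pairs(records):
    """Pair up molecules that share a core but differ in one substituent."""
    by_core = defaultdict(list)
    for mol_smiles, core, substituent in records:
        by_core[core].append((mol_smiles, substituent))

    pairs = []
    for members in by_core.values():
        for (mol_a, sub_a), (mol_b, sub_b) in combinations(members, 2):
            if sub_a != sub_b:
                # Record the transformation as "from>>to"
                pairs.append((mol_a, mol_b, f"{sub_a}>>{sub_b}"))
    return pairs


# Hand-written single-cut fragmentations: (molecule, core, substituent)
records = [
    ("Cc1ccccc1", "[*]c1ccccc1", "[*]C"),    # toluene
    ("Clc1ccccc1", "[*]c1ccccc1", "[*]Cl"),  # chlorobenzene
    ("CC1CCCCC1", "[*]C1CCCCC1", "[*]C"),    # methylcyclohexane
]
matched_pairs = find_matched_pairs(records)
```

Toluene and chlorobenzene share the phenyl core and differ only in the substituent, so they form the single matched pair. Methylcyclohexane has a different core and is left unpaired.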
    

Comparison of Methods

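| Method | Bond Types | Preserves Cycles | Best For |
|--------|------------|------------------|----------|
| BRICS | 16 | Yes | Retrosynthetic analysis, fragment recombination |
| RECAP | 11 | Yes | Combinatorial library design |
| MMPA | Variable | Depends | Structure-activity relationship analysis |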

Fragmentation Workflow

```python
import re
from collections import Counter

import datamol as dm

# 1. Fragment a molecule
aspirin = dm.to_mol("CC(=O)Oc1ccccc1C(=O)O")  # Aspirin
brics_frags = dm.fragment.brics(aspirin)
recap_frags = dm.fragment.recap(aspirin)

# 2. Collect fragment SMILES across a library
# (fragments are converted to SMILES so that identical fragments are counted together)
all_fragments = []
for mol in molecule_library:
    frags = dm.fragment.brics(mol)
    all_fragments.extend(dm.to_smiles(frag) for frag in frags)

# 3. Identify common fragments
fragment_counts = Counter(all_fragments)
common_fragments = fragment_counts.most_common(20)

# 4. Convert fragment SMILES back to molecules (cap attachment points with hydrogen)
def clean_fragment(frag_smiles):
    # Replace every attachment point marker ([1*], [2*], ..., [16*]), not only [1*]
    clean = re.sub(r"\[\d+\*\]", "[H]", frag_smiles)
    return dm.to_mol(clean)
```

Advanced: Fragment-Based Virtual Screening

```python
def brics_smiles(mol):
    # SMILES of the BRICS fragments, collected in a set for fast overlap checks
    return {dm.to_smiles(frag) for frag in dm.fragment.brics(mol)}

# Build fragment library from known actives
active_fragments = set()
for active_mol in active_compounds:
    active_fragments.update(brics_smiles(active_mol))

# Score a compound by the fraction of its fragments that occur in the actives
def score_by_fragments(mol, fragment_set):
    mol_frags = brics_smiles(mol)
    if not mol_frags:  # no fragments produced: avoid division by zero
        return 0.0
    overlap = mol_frags & fragment_set
    return len(overlap) / len(mol_frags)

# Score screening library
scores = [score_by_fragments(mol, active_fragments) for mol in screening_lib]
```
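
To make the score concrete, the sketch below applies the same overlap formula to hand-written fragment SMILES sets, so it runs without datamol. The fragment strings and candidate names are illustrative, not computed, and `fragment_overlap_score` is a hypothetical helper.

```python
def fragment_overlap_score(mol_fragments, reference_fragments):
    """Fraction of a compound's fragments that also occur in the reference set."""
    if not mol_fragments:
        return 0.0
    return len(mol_fragments & reference_fragments) / len(mol_fragments)


# Illustrative fragment sets (hand-written, not computed)
reference = {"[16*]c1ccccc1", "[8*]CCN", "[5*]N[5*]"}
candidates = {
    "cand_A": {"[16*]c1ccccc1", "[8*]CCN"},                           # 2 of 2 match
    "cand_B": {"[16*]c1ccccc1", "[3*]O[3*]", "[4*]CC", "[1*]C(C)=O"}, # 1 of 4 match
    "cand_C": set(),                                                  # nothing to compare
}
scores = {name: fragment_overlap_score(frags, reference) for name, frags in candidates.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
```

The score is normalised by the candidate's own fragment count, so it is asymmetric: a small compound with a single matching fragment scores 1.0. If that bias matters, a symmetric measure such as the Jaccard index over the two sets is a possible alternative.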

Key Concepts

  • Attachment Points: Marked with [1*], [2*], etc. in fragment SMILES
  • Retrosynthetic: Fragmentation mimics synthetic disconnections
  • Chemically Meaningful: Breaks occur at typical synthetic bonds
  • Recombination: Fragments can theoretically be recombined into valid molecules
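
Attachment point labels can be read straight from a fragment SMILES string. The sketch below is a minimal regex-based helper, assuming labels are written in the `[n*]` form shown above. `attachment_labels` is a hypothetical name and not part of datamol.

```python
import re

# Matches labelled dummy atoms such as [1*] or [16*] and captures the number
ATTACHMENT_POINT = re.compile(r"\[(\d+)\*\]")


def attachment_labels(frag_smiles):
    """Return the numeric labels of all attachment points, in order of appearance."""
    return [int(label) for label in ATTACHMENT_POINT.findall(frag_smiles)]


terminal = attachment_labels("[16*]c1ccccc1")  # one attachment point
linker = attachment_labels("[3*]O[3*]")        # two attachment points
plain = attachment_labels("c1ccccc1")          # no attachment points
```

The number of labels separates terminal fragments (one attachment point) from linker fragments (two or more), which is useful when choosing pieces for recombination.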