DiffDock Workflows and Examples

This document provides practical workflows and usage examples for common DiffDock tasks.

Installation and Setup

Conda Installation (Recommended)

bash

# Clone repository
git clone https://github.com/gcorso/DiffDock.git
cd DiffDock

# Create conda environment
conda env create --file environment.yml
conda activate diffdock

Docker Installation

bash

# Pull Docker image
docker pull rbgcsail/diffdock

# Run container with GPU support
docker run -it --gpus all --entrypoint /bin/bash rbgcsail/diffdock

# Inside container, activate environment
micromamba activate diffdock

First Run

The first execution pre-computes SO(2) and SO(3) lookup tables, taking a few minutes. Subsequent runs start immediately.

Workflow 1: Single Protein-Ligand Docking

Using PDB File and SMILES String

bash

python -m inference \
  --config default_inference_args.yaml \
  --protein_path examples/protein.pdb \
  --ligand_description "COc1ccc(C(=O)Nc2ccccc2)cc1" \
  --out_dir results/single_docking/

Output Structure:

results/single_docking/
└── complex_0/
    ├── rank1.sdf                    # Convenience copy of top-ranked prediction
    ├── rank1_confidence0.87.sdf     # Top-ranked prediction with confidence
    ├── rank2_confidence0.42.sdf     # Second-ranked prediction
    ├── ...
    └── rank10_confidence-1.23.sdf   # 10th prediction (if samples_per_complex=10)

Current upstream inference.py uses --ligand_description; avoid the older --ligand spelling unless your local checkout has added an alias.

Using Ligand Structure File

bash

python -m inference \
  --config default_inference_args.yaml \
  --protein_path protein.pdb \
  --ligand_description ligand.sdf \
  --out_dir results/ligand_file/

Supported ligand formats: SDF, MOL2, or any format readable by RDKit

Workflow 2: Protein Sequence to Structure Docking

Using ESMFold for Protein Folding

bash

python -m inference \
  --config default_inference_args.yaml \
  --protein_sequence "MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK" \
  --ligand_description "CC(C)Cc1ccc(cc1)C(C)C(=O)O" \
  --out_dir results/sequence_docking/

Use Cases:

Protein structure not available in PDB
Modeling mutations or variants
De novo protein design validation

Note: ESMFold folding adds computation time (30s-5min depending on sequence length)

Workflow 3: Batch Processing Multiple Complexes

Prepare CSV File

Create complexes.csv with required columns:

csv

complex_name,protein_path,ligand_description,protein_sequence
complex1,proteins/protein1.pdb,CC(=O)Oc1ccccc1C(=O)O,
complex2,,COc1ccc(C#N)cc1,MSKGEELFTGVVPILVELDGDVNGHKF...
complex3,proteins/protein3.pdb,ligands/ligand3.sdf,

Column Descriptions:

complex_name: Unique identifier for the complex
protein_path: Path to PDB file (leave empty if using sequence)
ligand_description: SMILES string or path to ligand file
protein_sequence: Amino acid sequence (leave empty if using PDB)

Run Batch Docking

bash

python -m inference \
  --config default_inference_args.yaml \
  --protein_ligand_csv complexes.csv \
  --out_dir results/batch_predictions/ \
  --batch_size 10

Output Structure:

results/batch_predictions/
├── complex1/
│   ├── rank1.sdf
│   ├── rank1_confidence0.87.sdf
│   ├── rank2_confidence0.42.sdf
│   └── ...
├── complex2/
│   ├── rank1.sdf
│   └── ...
└── complex3/
    └── ...

Workflow 4: High-Throughput Virtual Screening

Setup for Screening Large Ligand Libraries

python

# generate_screening_csv.py
import pandas as pd

# Load ligand library
ligands = pd.read_csv("ligand_library.csv")  # Contains SMILES

# Create DiffDock input
screening_data = {
    "complex_name": [f"screen_{i}" for i in range(len(ligands))],
    "protein_path": ["target_protein.pdb"] * len(ligands),
    "ligand_description": ligands["smiles"].tolist(),
    "protein_sequence": [""] * len(ligands)
}

df = pd.DataFrame(screening_data)
df.to_csv("screening_input.csv", index=False)

Run Screening

bash

# Pre-compute ESM embeddings for faster screening
python datasets/esm_embedding_preparation.py \
  --protein_ligand_csv screening_input.csv \
  --out_file protein_embeddings.pt

# Run docking with pre-computed embeddings
python -m inference \
  --config default_inference_args.yaml \
  --protein_ligand_csv screening_input.csv \
  --esm_embeddings_path protein_embeddings.pt \
  --out_dir results/virtual_screening/ \
  --batch_size 32

Post-Processing: Extract Top Hits

python

# analyze_screening_results.py
import pandas as pd
import re
from pathlib import Path

results = []
results_dir = Path("results/virtual_screening/")

for complex_dir in results_dir.iterdir():
    if not complex_dir.is_dir():
        continue

    scores = []
    for sdf_file in complex_dir.glob("rank*_confidence*.sdf"):
        match = re.search(r"confidence(-?\d+(?:\.\d+)?)", sdf_file.name)
        if match:
            scores.append(float(match.group(1)))

    if scores:
        results.append({"complex": complex_dir.name, "top_confidence": max(scores)})

# Sort by confidence
df = pd.DataFrame(results)
df_sorted = df.sort_values("top_confidence", ascending=False)

# Get top 100 hits
top_hits = df_sorted.head(100)
top_hits.to_csv("top_hits.csv", index=False)

Workflow 5: Ensemble Docking with Protein Flexibility

Prepare Protein Ensemble

python

# For proteins with known flexibility, use multiple conformations
# Example: Using MD snapshots or crystal structures

# create_ensemble_csv.py
import pandas as pd

conformations = [
    "protein_conf1.pdb",
    "protein_conf2.pdb",
    "protein_conf3.pdb",
    "protein_conf4.pdb"
]

ligand = "CC(C)Cc1ccc(cc1)C(C)C(=O)O"

data = {
    "complex_name": [f"ensemble_{i}" for i in range(len(conformations))],
    "protein_path": conformations,
    "ligand_description": [ligand] * len(conformations),
    "protein_sequence": [""] * len(conformations)
}

pd.DataFrame(data).to_csv("ensemble_input.csv", index=False)

Run Ensemble Docking

bash

python -m inference \
  --config default_inference_args.yaml \
  --protein_ligand_csv ensemble_input.csv \
  --out_dir results/ensemble_docking/ \
  --samples_per_complex 20  # More samples per conformation

Workflow 6: Integration with Downstream Analysis

Example: DiffDock + GNINA Rescoring

bash

# 1. Run DiffDock
python -m inference \
  --config default_inference_args.yaml \
  --protein_path protein.pdb \
  --ligand_description "CC(=O)OC1=CC=CC=C1C(=O)O" \
  --out_dir results/diffdock_poses/ \
  --save_visualisation

# 2. Rescore with GNINA
for pose in results/diffdock_poses/complex_0/*confidence*.sdf; do
    gnina -r protein.pdb -l "$pose" --score_only -o "${pose%.sdf}_gnina.sdf"
done

Example: DiffDock + OpenMM Energy Minimization

python

# minimize_poses.py
from openmm import app, LangevinIntegrator, Platform
from openmm.app import ForceField, Modeller, PDBFile
from rdkit import Chem
from pathlib import Path

# Load protein
protein = PDBFile('protein.pdb')
forcefield = ForceField('amber14-all.xml', 'amber14/tip3pfb.xml')

# Process each DiffDock pose
pose_dir = Path('results/diffdock_poses/complex_0')
for pose_path in pose_dir.glob('*confidence*.sdf'):
    # Load ligand
    mol = Chem.SDMolSupplier(str(pose_path))[0]

    # Combine protein + ligand
    modeller = Modeller(protein.topology, protein.positions)
    # ... add ligand to modeller ...

    # Create system and minimize
    system = forcefield.createSystem(modeller.topology)
    integrator = LangevinIntegrator(300, 1.0, 0.002)
    simulation = app.Simulation(modeller.topology, system, integrator)
    simulation.minimizeEnergy(maxIterations=1000)

    # Save minimized structure
    positions = simulation.context.getState(getPositions=True).getPositions()
    PDBFile.writeFile(simulation.topology, positions,
                      open(f"minimized_{pose_path.stem}.pdb", 'w'))

Workflow 7: Using the Graphical Interface

Launch Web Interface

bash

python app/main.py

Access Interface

Navigate to http://localhost:7860 in web browser

Features

Upload protein PDB or enter sequence
Input ligand SMILES or upload structure
Adjust inference parameters via GUI
Visualize results interactively
Download predictions directly

Online Alternative

Use the Hugging Face Spaces demo without local installation:

URL: https://huggingface.co/spaces/reginabarzilaygroup/DiffDock-Web

Advanced Configuration

Custom Inference Settings

Create custom YAML configuration:

yaml

# custom_inference.yaml
# Model settings
model_dir: ./workdir/v1.1/score_model
confidence_model_dir: ./workdir/v1.1/confidence_model

# Sampling parameters
samples_per_complex: 20  # More samples for better coverage
inference_steps: 25      # More steps for accuracy

# Temperature adjustments (increase for more diversity)
temp_sampling_tr: 1.3
temp_sampling_rot: 2.2
temp_sampling_tor: 7.5

# Output
save_visualisation: true

Use custom configuration:

bash

python -m inference \
  --config custom_inference.yaml \
  --protein_path protein.pdb \
  --ligand_description "CC(=O)OC1=CC=CC=C1C(=O)O" \
  --out_dir results/custom_config/

Troubleshooting Common Issues

Issue: Out of Memory Errors

Solution: Reduce batch size

bash

python -m inference ... --batch_size 2

Issue: Slow Performance

Solution: Ensure GPU usage

python

import torch
print(torch.cuda.is_available())  # Should return True

Issue: Poor Predictions for Large Ligands

Solution: Increase sampling diversity

bash

python -m inference ... --samples_per_complex 40 --temp_sampling_tor 9.0

Issue: Protein with Many Chains

Solution: Limit chains or isolate binding site

bash

python -m inference ... --chain_cutoff 4

Or pre-process PDB to include only relevant chains.

Best Practices Summary

Start Simple: Test with single complex before batch processing
GPU Essential: Use GPU for reasonable performance
Multiple Samples: Generate 10-40 samples for robust predictions
Validate Results: Use molecular visualization and complementary scoring
Consider Confidence: Use confidence scores for initial ranking, not final decisions
Iterate Parameters: Adjust temperature/steps for specific systems
Pre-compute Embeddings: For repeated use of same protein
Combine Tools: Integrate with scoring functions and energy minimization