ESM3 is a frontier multimodal generative language model that reasons over the sequence, structure, and function of proteins. It uses iterative masked language modeling to simultaneously generate across these three modalities.
ESM3 Family Models:
| Model ID | Parameters | Availability | Best For |
|---|---|---|---|
| esm3-sm-open-v1 | 1.4B | Open weights (local) | Development, testing, learning |
| esm3-medium-2024-08 | 7B | Forge API only | Production, balanced quality/speed |
| esm3-large-2024-03 | 98B | Forge API only | Maximum quality, research |
| esm3-medium-multimer-2024-09 | 7B | Forge API only | Protein complexes (experimental) |
Key Features:
`ESMProtein` is the central data structure. It represents a protein with optional sequence, structure, and function information.
Constructor:
from esm.sdk.api import ESMProtein
protein = ESMProtein(
    sequence="MPRTKEINDAGLIVHSP",             # Amino acid sequence (optional)
    coordinates=coordinates_array,            # 3D structure (optional)
    function_annotations=[...],               # Function labels (optional)
    secondary_structure="CCHHHHHHEEEEECCCC",  # SS annotations, one per residue (optional)
    sasa=sasa_array                           # Solvent accessibility (optional)
)
Key Methods:
# Load from PDB file
protein = ESMProtein.from_pdb("protein.pdb")
# Export to a PDB-format string
pdb_string = protein.to_pdb_string()
# Save to file
with open("output.pdb", "w") as f:
    f.write(protein.to_pdb_string())
Masking Conventions:
Use _ (underscore) to represent masked positions for generation:
# Mask positions 5-10 for generation
protein = ESMProtein(sequence="MPRT______AGLIVHSP")
# Fully masked sequence (generate from scratch)
protein = ESMProtein(sequence="_" * 200)
# Partial structure (some coordinates None)
protein = ESMProtein(
    sequence="MPRTKEIND",
    coordinates=partial_coords  # Some positions can be None
)
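Writing masked strings by hand gets error-prone for longer sequences. The helper below is a minimal sketch of the masking convention described above; the name `mask_span` is hypothetical and not part of the ESM SDK. It replaces a 0-based, half-open range with the `_` mask character, and the result can be passed as `sequence=` to `ESMProtein`.

```python
MASK = "_"  # mask character used in ESMProtein sequences, per the convention above


def mask_span(sequence: str, start: int, end: int) -> str:
    """Return `sequence` with positions [start, end) replaced by the mask character.

    Indices are 0-based and the end is exclusive, like Python slices.
    """
    if not 0 <= start <= end <= len(sequence):
        raise ValueError(
            f"span [{start}, {end}) is outside a sequence of length {len(sequence)}"
        )
    return sequence[:start] + MASK * (end - start) + sequence[end:]


# Mask residues K, E, I, N, D (0-based indices 4 to 8)
masked = mask_span("MPRTKEINDAGLIVHSP", 4, 9)
print(masked)
```

Because the masked string always has the same length as the input, it stays aligned with any per-residue tracks supplied alongside it.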
`GenerationConfig` controls generation behavior and parameters.
Basic Configuration:
from esm.sdk.api import GenerationConfig
config = GenerationConfig(
    track="sequence",                    # Track to generate: "sequence", "structure", or "function"
    num_steps=8,                         # Number of demasking steps
    temperature=0.7,                     # Sampling temperature (0.0-1.0)
    top_p=None,                          # Nucleus sampling threshold
    condition_on_coordinates_only=False  # For structure conditioning
)
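The `temperature` and `top_p` arguments in the configuration above correspond to two standard sampling controls. The toy sketch below shows what each does to a probability distribution over candidate tokens. It illustrates the general technique only and is not ESM3's internal sampling code; the function names are made up for this example, and `temperature` must be greater than zero here.

```python
import math


def apply_temperature(logits, temperature):
    """Softmax over logits / temperature; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely entries whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [probs[i] if i in kept else 0.0 for i in range(len(probs))]
    total = sum(filtered)
    return [p / total for p in filtered]


logits = [2.0, 1.0, 0.5, -1.0]  # toy scores for four candidate residues
for t in (1.0, 0.7, 0.2):
    print(t, [round(p, 3) for p in apply_temperature(logits, t)])
print([round(p, 3) for p in top_p_filter(apply_temperature(logits, 1.0), 0.9)])
```

Lower temperatures concentrate probability on the top candidate, while a smaller `top_p` removes the low-probability tail before sampling.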
Parameter Details:
- track: Which modality to generate
  - "sequence": Generate amino acid sequence
  - "structure": Generate 3D coordinates
  - "function": Generate function annotations
- num_steps: Number of iterative demasking steps
- temperature: Controls randomness
- top_p: Nucleus sampling parameter
- condition_on_coordinates_only: Structure conditioning mode
  - True: Condition only on backbone coordinates (ignore sequence)

`ESM3InferenceClient` is the unified interface for both local and remote inference.
Local Model Loading:
from esm.models.esm3 import ESM3
# Load the open-weights model onto a GPU
model = ESM3.from_pretrained("esm3-sm-open-v1").to("cuda")
# Or run on CPU
model = ESM3.from_pretrained("esm3-sm-open-v1").to("cpu")
Generation Method:
# Basic generation
protein_output = model.generate(protein_input, config)
# With explicit track specification
protein_output = model.generate(
    protein_input,
    GenerationConfig(track="sequence", num_steps=16, temperature=0.6)
)
Forward Pass (Advanced):
# Get raw model logits for custom sampling
from esm.sdk.api import LogitsConfig
protein_tensor = model.encode(protein)
logits_output = model.logits(protein_tensor, LogitsConfig(sequence=True))
sequence_logits = logits_output.logits.sequence
Fill in masked regions of a protein sequence:
# Define partial sequence
protein = ESMProtein(sequence="MPRTK____LIVHSP____END")
# Generate missing positions
config = GenerationConfig(track="sequence", num_steps=12, temperature=0.5)
completed = model.generate(protein, config)
print(f"Original: {protein.sequence}")
print(f"Completed: {completed.sequence}")
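After a completion run it is often useful to see exactly which residues were filled in and to confirm that the unmasked input was left alone. The sketch below works on plain strings, so it can be applied to `protein.sequence` and `completed.sequence`; both function names are hypothetical, and the "completed" string in the example is made up rather than real model output.

```python
def filled_positions(masked: str, completed: str, mask: str = "_"):
    """List (index, residue) pairs for positions that were masked in the input (0-based)."""
    if len(masked) != len(completed):
        raise ValueError("input and output sequences must have the same length")
    return [(i, completed[i]) for i, ch in enumerate(masked) if ch == mask]


def unmasked_preserved(masked: str, completed: str, mask: str = "_") -> bool:
    """True if every residue that was given in the input is unchanged in the output."""
    return len(masked) == len(completed) and all(
        m == c for m, c in zip(masked, completed) if m != mask
    )


masked_input = "MPRTK____LIVHSP____END"
made_up_output = "MPRTKAAAALIVHSPGGGGEND"  # illustrative only, not a model prediction
print(filled_positions(masked_input, made_up_output))
print(unmasked_preserved(masked_input, made_up_output))
```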
Predict 3D structure from sequence:
# Input: sequence only
protein = ESMProtein(sequence="MPRTKEINDAGLIVHSPQWFYK")
# Generate structure
config = GenerationConfig(track="structure", num_steps=len(protein.sequence))
protein_with_structure = model.generate(protein, config)
# Save as PDB
with open("predicted_structure.pdb", "w") as f:
    f.write(protein_with_structure.to_pdb_string())
Design sequence for a target structure:
# Load target structure
target = ESMProtein.from_pdb("target.pdb")
# Remove sequence, keep structure
target.sequence = None
# Generate sequence that folds to this structure
config = GenerationConfig(
    track="sequence",
    num_steps=50,
    temperature=0.7,
    condition_on_coordinates_only=True
)
designed = model.generate(target, config)
print(f"Designed sequence: {designed.sequence}")
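A common sanity check for inverse folding is sequence recovery: the fraction of positions where the designed sequence matches the native one. The sketch below assumes you saved the native sequence before clearing `target.sequence`; the function name `sequence_recovery` and the example strings are illustrative, not part of the ESM SDK.

```python
def sequence_recovery(designed: str, native: str) -> float:
    """Fraction of aligned positions where the designed residue equals the native residue."""
    if len(designed) != len(native):
        raise ValueError("designed and native sequences must have the same length")
    if not native:
        raise ValueError("sequences must not be empty")
    matches = sum(d == n for d, n in zip(designed, native))
    return matches / len(native)


# Made-up sequences for illustration; three of four positions agree
print(sequence_recovery("MKTA", "MKTV"))
```

Recovery alone does not show that a design folds to the target; re-predicting the structure of the designed sequence is a complementary check.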
Generate protein with specific function:
from esm.sdk.api import FunctionAnnotation
# Specify desired function
protein = ESMProtein(
    sequence="_" * 150,
    function_annotations=[
        FunctionAnnotation(
            label="enzymatic_activity",
            start=30,
            end=90
        )
    ]
)
# Generate sequence with this function
config = GenerationConfig(track="sequence", num_steps=75, temperature=0.6)
functional_protein = model.generate(protein, config)
Iteratively generate across multiple tracks:
# Start with partial sequence
protein = ESMProtein(sequence="MPRT" + "_" * 100)
# Step 1: Complete sequence
protein = model.generate(
    protein,
    GenerationConfig(track="sequence", num_steps=50, temperature=0.6)
)
# Step 2: Predict structure for completed sequence
protein = model.generate(
    protein,
    GenerationConfig(track="structure", num_steps=50)
)
# Step 3: Predict function
protein = model.generate(
    protein,
    GenerationConfig(track="function", num_steps=20)
)
print(f"Final sequence: {protein.sequence}")
print(f"Functions: {protein.function_annotations}")
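The three steps above follow one pattern: feed the output of each generation call into the next with a different track. A small driver can make that pattern reusable. In the sketch below, `run_stages` and `fake_step` are hypothetical names; `fake_step` is a stand-in so the example runs without a model.

```python
from typing import Any, Callable, Iterable, Tuple


def run_stages(
    protein: Any,
    stages: Iterable[Tuple[str, int]],
    generate_step: Callable[[Any, str, int], Any],
) -> Any:
    """Apply generate_step once per (track, num_steps) stage, feeding each output forward."""
    for track, num_steps in stages:
        protein = generate_step(protein, track, num_steps)
    return protein


# Stand-in step that only records what was requested, so the sketch runs without a model.
def fake_step(protein, track, num_steps):
    return protein + [(track, num_steps)]


history = run_stages([], [("sequence", 50), ("structure", 50), ("function", 20)], fake_step)
print(history)

# With a loaded model, the step could instead wrap the calls shown above (not run here):
# step = lambda p, track, n: model.generate(p, GenerationConfig(track=track, num_steps=n))
```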
Generate multiple variants of a protein:
import numpy as np
base_sequence = "MPRTKEINDAGLIVHSPQWFYK"
variants = []
for i in range(10):
    # Mask random positions
    seq_list = list(base_sequence)
    mask_indices = np.random.choice(len(seq_list), size=5, replace=False)
    for idx in mask_indices:
        seq_list[idx] = '_'
    protein = ESMProtein(sequence=''.join(seq_list))
    # Generate variant
    variant = model.generate(
        protein,
        GenerationConfig(track="sequence", num_steps=8, temperature=0.8)
    )
    variants.append(variant.sequence)
print(f"Generated {len(variants)} variants")
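Because masked positions can be refilled with the original residue, some of the generated variants may be duplicates or identical to the base sequence. The sketch below deduplicates a list of variants and counts substitutions relative to the base. The helper names are hypothetical and the example variants are made up, not model outputs.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def summarize_variants(base: str, variants):
    """Map each distinct variant that differs from base to its number of substitutions."""
    return {v: hamming(base, v) for v in sorted(set(variants)) if v != base}


base = "MPRTKEINDAGLIVHSPQWFYK"
# Made-up variants for illustration; these are not model outputs.
example_variants = [
    "MPRTKEINDAGLIVHSPQWFYK",  # identical to base
    "MPRTREINDAGLIVHSPQWFYK",  # K5R
    "MPRTREINDAGLIVHSPQWFYK",  # duplicate of the previous variant
    "MARTKEINDAGLVVHSPQWFYK",  # P2A and I13V
]
print(summarize_variants(base, example_variants))
```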
Vary temperature during generation for better control:
def generate_with_temperature_schedule(model, protein, temperatures):
    """Generate with decreasing temperature for annealing."""
    current = protein
    steps_per_temp = 10
    for temp in temperatures:
        config = GenerationConfig(
            track="sequence",
            num_steps=steps_per_temp,
            temperature=temp
        )
        current = model.generate(current, config)
    return current
# Example: Start diverse, end deterministic
result = generate_with_temperature_schedule(
    model,
    protein,
    temperatures=[1.0, 0.8, 0.6, 0.4, 0.2]
)
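The temperature list in the example above is evenly spaced, so it can be computed rather than typed out. The helper below is a small sketch; `linear_schedule` is a hypothetical name, and rounding to six decimals is an arbitrary choice to keep floating-point noise out of the values.

```python
def linear_schedule(start: float, end: float, n: int):
    """Return n evenly spaced temperatures from start to end, inclusive."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if n == 1:
        return [start]
    step = (end - start) / (n - 1)
    return [round(start + i * step, 6) for i in range(n)]


temperatures = linear_schedule(1.0, 0.2, 5)
print(temperatures)
```

The returned list can be passed directly as the `temperatures` argument of `generate_with_temperature_schedule`.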
Preserve specific regions during generation:
# Keep active site residues fixed
def mask_except_active_site(sequence, active_site_positions):
    """Mask everything except specified positions."""
    seq_list = ['_'] * len(sequence)
    for pos in active_site_positions:
        seq_list[pos] = sequence[pos]
    return ''.join(seq_list)
# Define active site
active_site = [23, 24, 25, 45, 46, 89]
constrained_seq = mask_except_active_site(original_sequence, active_site)
protein = ESMProtein(sequence=constrained_seq)
result = model.generate(protein, GenerationConfig(track="sequence", num_steps=50))
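It is worth verifying after generation that the constrained residues actually survived. The check below is a sketch on plain strings, using the same 0-based positions as `mask_except_active_site`. The function name `changed_fixed_positions` is hypothetical and the sequences are made up for illustration.

```python
def changed_fixed_positions(original: str, generated: str, fixed_positions):
    """Return the fixed positions (0-based) whose residue differs between the two sequences."""
    if len(original) != len(generated):
        raise ValueError("sequences must have the same length")
    return [p for p in fixed_positions if original[p] != generated[p]]


original = "MKTAYIAKQR"
fixed = [2, 3, 7]  # residues T, A and K in the original
print(changed_fixed_positions(original, "GGTAGGGKGG", fixed))  # all fixed residues kept
print(changed_fixed_positions(original, "GGTAGGGAGG", fixed))  # position 7 was altered
```

An empty list means every constrained position was preserved.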
Use secondary structure information in generation:
# Define secondary structure (H=helix, E=sheet, C=coil)
protein = ESMProtein(
    sequence="_" * 80,
    secondary_structure="CCHHHHHHHEEEEECCCHHHHHHCC" + "C" * 55
)
# Generate sequence with this structure
result = model.generate(
    protein,
    GenerationConfig(track="sequence", num_steps=40, temperature=0.6)
)
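The secondary-structure string has to be exactly as long as the sequence, which is easy to get wrong when typing runs of letters by hand. The sketch below builds the string from `(code, count)` segments and pads it with coil to the target length. `build_secondary_structure` is a hypothetical helper, and it assumes the three-letter alphabet (H, E, C) used in the example above.

```python
def build_secondary_structure(segments, length: int, filler: str = "C") -> str:
    """Concatenate (code, count) segments and pad with `filler` up to `length`."""
    ss = "".join(code * count for code, count in segments)
    unknown = set(ss + filler) - {"H", "E", "C"}
    if unknown:
        raise ValueError(f"unsupported secondary-structure codes: {sorted(unknown)}")
    if len(ss) > length:
        raise ValueError(f"segments cover {len(ss)} residues, more than length {length}")
    return ss + filler * (length - len(ss))


ss = build_secondary_structure(
    [("C", 2), ("H", 7), ("E", 5), ("C", 3), ("H", 6), ("C", 2)],
    length=80,
)
print(len(ss), ss[:25])
```

The result can be passed as `secondary_structure=` together with `sequence="_" * 80`.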
For large proteins or batch processing:
import torch
# Clear CUDA cache between generations
torch.cuda.empty_cache()
# Use half precision for memory efficiency
model = ESM3.from_pretrained("esm3-sm-open-v1").to("cuda").half()
# Process in chunks for very long sequences
def chunk_generate(model, long_sequence, chunk_size=500):
    chunks = [long_sequence[i:i+chunk_size]
              for i in range(0, len(long_sequence), chunk_size)]
    results = []
    for chunk in chunks:
        protein = ESMProtein(sequence=chunk)
        result = model.generate(protein, GenerationConfig(track="sequence"))
        results.append(result.sequence)
    return ''.join(results)
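When chunking, it helps to keep track of where each chunk sits in the full sequence so results can be mapped back to the original residue numbering. The sketch below computes the same boundaries as the slicing in `chunk_generate`; the name `chunk_spans` is hypothetical. Note that chunks processed independently share no context across their boundaries.

```python
def chunk_spans(length: int, chunk_size: int = 500):
    """Return 0-based, half-open (start, end) spans that tile a sequence of `length` residues."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [(start, min(start + chunk_size, length)) for start in range(0, length, chunk_size)]


for start, end in chunk_spans(1200):
    print(f"chunk covers residues {start}..{end - 1} ({end - start} residues)")
```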
When processing multiple proteins, see forge-api.md.
try:
    protein = model.generate(protein_input, config)
except ValueError as e:
    print(f"Invalid input: {e}")
    # Handle invalid sequence or structure
except torch.cuda.OutOfMemoryError:
    # Listed before RuntimeError because OutOfMemoryError is a RuntimeError subclass
    print("GPU out of memory - try smaller model or CPU")
    # Fallback to CPU or smaller model
except RuntimeError as e:
    print(f"Generation failed: {e}")
    # Handle model errors
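The fallback mentioned in the comments above can be written as an ordered list of attempts, for example GPU first and CPU second. The sketch below is one possible shape for that logic and not an SDK feature: `generate_with_fallback` is a hypothetical helper, and the two attempt functions are stand-ins that simulate a failure and a success so the example runs without a model or GPU. It retries only on `RuntimeError`, which also covers `torch.cuda.OutOfMemoryError`; a `ValueError` from invalid input is left to propagate because retrying would not help.

```python
def generate_with_fallback(attempts):
    """Try each (name, zero-argument callable) in order; return (name, result) of the first success."""
    errors = []
    for name, attempt in attempts:
        try:
            return name, attempt()
        except RuntimeError as exc:  # includes CUDA out-of-memory errors
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all attempts failed: " + "; ".join(errors))


# Stand-ins for real calls such as `lambda: gpu_model.generate(protein_input, config)`.
def simulated_gpu_attempt():
    raise RuntimeError("simulated out-of-memory")


def simulated_cpu_attempt():
    return "MPRT"


used, result = generate_with_fallback(
    [("gpu", simulated_gpu_attempt), ("cpu", simulated_cpu_attempt)]
)
print(used, result)
```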
esm3-sm-open-v1:
esm3-medium-2024-08:
esm3-large-2024-03:
If using ESM3 in research, cite:
Hayes, T. et al. (2025). Simulating 500 million years of evolution with a language model.
Science. DOI: 10.1126/science.ads0018