scientific-skills/transformers/references/models.md
The transformers library provides flexible model loading with automatic architecture detection, device management, and configuration control.
Use AutoModel classes for automatic architecture selection:
from transformers import AutoModel, AutoModelForSequenceClassification, AutoModelForCausalLM
# Base model (no task head)
model = AutoModel.from_pretrained("bert-base-uncased")
# Sequence classification
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")
# Causal language modeling (GPT-style)
model = AutoModelForCausalLM.from_pretrained("gpt2")
# Masked language modeling (BERT-style)
from transformers import AutoModelForMaskedLM
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
# Sequence-to-sequence (T5-style)
from transformers import AutoModelForSeq2SeqLM
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
NLP Tasks:
- AutoModelForSequenceClassification: Text classification, sentiment analysis
- AutoModelForTokenClassification: NER, POS tagging
- AutoModelForQuestionAnswering: Extractive QA
- AutoModelForCausalLM: Text generation (GPT, Llama)
- AutoModelForMaskedLM: Masked language modeling (BERT)
- AutoModelForSeq2SeqLM: Translation, summarization (T5, BART)

Vision Tasks:
- AutoModelForImageClassification: Image classification
- AutoModelForObjectDetection: Object detection
- AutoModelForImageSegmentation: Image segmentation

Audio Tasks:
- AutoModelForAudioClassification: Audio classification
- AutoModelForSpeechSeq2Seq: Speech recognition

Multimodal:
- AutoModelForVision2Seq: Image captioning, VQA

pretrained_model_name_or_path: Model identifier or local path
model = AutoModel.from_pretrained("bert-base-uncased") # From Hub
model = AutoModel.from_pretrained("./local/model/path") # From disk
num_labels: Number of output labels for classification
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=3
)
cache_dir: Custom cache location
model = AutoModel.from_pretrained("model-id", cache_dir="./my_cache")
device_map: Automatic device allocation for large models
# Automatically distribute across GPUs and CPU
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    device_map="auto"
)
# Sequential placement
model = AutoModelForCausalLM.from_pretrained(
    "model-id",
    device_map="sequential"
)
# Custom device map
device_map = {
    "transformer.layers.0": 0,      # GPU 0
    "transformer.layers.1": 1,      # GPU 1
    "transformer.layers.2": "cpu",  # CPU
}
model = AutoModel.from_pretrained("model-id", device_map=device_map)
Manual device placement:
import torch
model = AutoModel.from_pretrained("model-id")
model.to("cuda:0") # Move to GPU 0
model.to(torch.device("cuda" if torch.cuda.is_available() else "cpu"))
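Inputs must live on the same device as the model; a minimal sketch, assuming a tokenizer exists for the same checkpoint:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("model-id")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModel.from_pretrained("model-id").to(device)
# Tokenizer output supports .to(device), just like a tensor
inputs = tokenizer("Sample text", return_tensors="pt").to(device)
outputs = model(**inputs)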
torch_dtype: Set model precision
import torch
# Float16 (half precision)
model = AutoModel.from_pretrained("model-id", torch_dtype=torch.float16)
# BFloat16 (better range than float16)
model = AutoModel.from_pretrained("model-id", torch_dtype=torch.bfloat16)
# Auto (use original dtype)
model = AutoModel.from_pretrained("model-id", torch_dtype="auto")
attn_implementation: Choose attention mechanism
# Scaled Dot Product Attention (built into PyTorch 2.0+)
model = AutoModel.from_pretrained("model-id", attn_implementation="sdpa")
# Flash Attention 2 (requires flash-attn package)
model = AutoModel.from_pretrained("model-id", attn_implementation="flash_attention_2")
# Eager (most compatible; the fallback when optimized implementations are unavailable)
model = AutoModel.from_pretrained("model-id", attn_implementation="eager")
low_cpu_mem_usage: Reduce CPU memory during loading
model = AutoModelForCausalLM.from_pretrained(
    "large-model-id",
    low_cpu_mem_usage=True,
    device_map="auto"
)
load_in_8bit: 8-bit quantization (requires bitsandbytes)
model = AutoModelForCausalLM.from_pretrained(
    "model-id",
    load_in_8bit=True,
    device_map="auto"
)
load_in_4bit: 4-bit quantization
import torch
from transformers import BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)
model = AutoModelForCausalLM.from_pretrained(
    "model-id",
    quantization_config=quantization_config,
    device_map="auto"
)
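BitsAndBytesConfig exposes further 4-bit options; a sketch showing NF4 quantization combined with nested (double) quantization:
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # NF4 quantization type
    bnb_4bit_use_double_quant=True,       # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16
)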
Load with a custom configuration:
from transformers import AutoConfig, AutoModel
# Load and modify config
config = AutoConfig.from_pretrained("bert-base-uncased")
config.hidden_dropout_prob = 0.2
config.attention_probs_dropout_prob = 0.2
# Initialize model with custom config
model = AutoModel.from_pretrained("bert-base-uncased", config=config)
config = AutoConfig.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_config(config) # Random weights
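from_config is useful for training from scratch; a sketch that shrinks the architecture before initializing random weights (GPT-2 configs expose n_layer, n_head, and n_embd):
config = AutoConfig.from_pretrained("gpt2")
config.n_layer = 6  # smaller model, for illustration only
model = AutoModelForCausalLM.from_config(config)
print(model.num_parameters())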
Models load in evaluation mode by default:
model = AutoModel.from_pretrained("model-id")
print(model.training) # False
# Switch to training mode
model.train()
# Switch back to evaluation mode
model.eval()
Evaluation mode disables dropout and makes batch normalization layers use their stored running statistics.
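Note that eval() does not turn off gradient tracking; wrap pure inference in torch.no_grad() to avoid storing activations for backpropagation. A minimal sketch:
import torch
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")  # loads in eval mode
with torch.no_grad():
    outputs = model(**tokenizer("Sample text", return_tensors="pt"))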
Save a model locally:
model.save_pretrained("./my_model")
This creates:
- config.json: Model configuration
- pytorch_model.bin or model.safetensors: Model weights

Push a model to the Hub:
model.push_to_hub("username/model-name")
# With custom commit message
model.push_to_hub("username/model-name", commit_message="Update model")
# Private repository
model.push_to_hub("username/model-name", private=True)
Count parameters:
# Total parameters
total_params = model.num_parameters()
# Trainable parameters only
trainable_params = model.num_parameters(only_trainable=True)
print(f"Total: {total_params:,}")
print(f"Trainable: {trainable_params:,}")
Check memory footprint:
memory_bytes = model.get_memory_footprint()
memory_mb = memory_bytes / 1024**2
print(f"Memory: {memory_mb:.2f} MB")
Inspect the architecture:
print(model)  # Print full architecture
# Access specific components
print(model.config)
print(model.base_model)
Basic inference:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("model-id")
model = AutoModelForSequenceClassification.from_pretrained("model-id")
inputs = tokenizer("Sample text", return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits
predictions = logits.argmax(dim=-1)
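To turn the logits into probabilities and a readable label, a short sketch (id2label comes from the model config and defaults to LABEL_0, LABEL_1, ... unless the checkpoint defines label names):
import torch
probs = torch.softmax(logits, dim=-1)
predicted_label = model.config.id2label[predictions.item()]
print(predicted_label, probs.max().item())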
The safetensors format loads faster and is safer (it stores only tensors, with no pickled code):
# Save as safetensors (recommended)
model.save_pretrained("./model", safe_serialization=True)
# Load either format automatically
model = AutoModel.from_pretrained("./model")
Export for optimized inference:
from pathlib import Path
from transformers import AutoConfig, AutoModel, AutoTokenizer
from transformers.models.distilbert import DistilBertOnnxConfig
from transformers.onnx import export
model_ckpt = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModel.from_pretrained(model_ckpt)
# Each architecture has a matching OnnxConfig describing its inputs/outputs
onnx_config = DistilBertOnnxConfig(AutoConfig.from_pretrained(model_ckpt))
# Export to ONNX
onnx_inputs, onnx_outputs = export(
    tokenizer, model, onnx_config, onnx_config.default_onnx_opset, Path("model.onnx")
)
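The exported file can then be run with ONNX Runtime; a sketch that assumes the onnxruntime package is installed and the export above succeeded:
import onnxruntime as ort
session = ort.InferenceSession("model.onnx")
# Input names must match the exported graph (e.g. input_ids, attention_mask)
onnx_inputs = dict(tokenizer("Sample text", return_tensors="np"))
outputs = session.run(None, onnx_inputs)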
CUDA out of memory:
# Use smaller precision
model = AutoModel.from_pretrained("model-id", torch_dtype=torch.float16)
# Or use quantization
model = AutoModel.from_pretrained("model-id", load_in_8bit=True)
# Or use CPU
model = AutoModel.from_pretrained("model-id", device_map="cpu")
Slow loading:
# Enable low CPU memory mode
model = AutoModel.from_pretrained("model-id", low_cpu_mem_usage=True)
Model not found:
# Verify the model ID exists on huggingface.co
# Check authentication for private models
from huggingface_hub import login
login()
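An access token can also be passed directly to from_pretrained; a sketch with placeholder values:
model = AutoModel.from_pretrained("username/private-model", token="hf_xxx")  # placeholder token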