Model Loading and Management

Overview

The transformers library provides flexible model loading with automatic architecture detection, device management, and configuration control.

Loading Models

AutoModel Classes

Use AutoModel classes for automatic architecture selection:

python
from transformers import AutoModel, AutoModelForSequenceClassification, AutoModelForCausalLM

# Base model (no task head)
model = AutoModel.from_pretrained("bert-base-uncased")

# Sequence classification
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")

# Causal language modeling (GPT-style)
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Masked language modeling (BERT-style)
from transformers import AutoModelForMaskedLM
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Sequence-to-sequence (T5-style)
from transformers import AutoModelForSeq2SeqLM
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

Common AutoModel Classes

NLP Tasks:

  • AutoModelForSequenceClassification: Text classification, sentiment analysis
  • AutoModelForTokenClassification: NER, POS tagging
  • AutoModelForQuestionAnswering: Extractive QA
  • AutoModelForCausalLM: Text generation (GPT, Llama)
  • AutoModelForMaskedLM: Masked language modeling (BERT)
  • AutoModelForSeq2SeqLM: Translation, summarization (T5, BART)

Vision Tasks:

  • AutoModelForImageClassification: Image classification
  • AutoModelForObjectDetection: Object detection
  • AutoModelForImageSegmentation: Image segmentation

Audio Tasks:

  • AutoModelForAudioClassification: Audio classification
  • AutoModelForSpeechSeq2Seq: Speech recognition

Multimodal:

  • AutoModelForVision2Seq: Image captioning, VQA
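
All of these classes share the same from_pretrained interface, so switching tasks mostly means switching the class. A brief sketch (the model IDs are illustrative Hub checkpoints):

python
from transformers import AutoModelForTokenClassification, AutoModelForImageClassification

# Token classification (NER) head on a BERT encoder
ner_model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")

# ViT with an image-classification head
vit_model = AutoModelForImageClassification.from_pretrained("google/vit-base-patch16-224")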

Loading Parameters

Basic Parameters

pretrained_model_name_or_path: Model identifier or local path

python
model = AutoModel.from_pretrained("bert-base-uncased")  # From Hub
model = AutoModel.from_pretrained("./local/model/path")  # From disk

num_labels: Number of output labels for classification

python
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=3
)

cache_dir: Custom cache location

python
model = AutoModel.from_pretrained("model-id", cache_dir="./my_cache")

Device Management

device_map: Automatic device allocation for large models

python
# Automatically distribute across GPUs and CPU
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    device_map="auto"
)

# Sequential placement
model = AutoModelForCausalLM.from_pretrained(
    "model-id",
    device_map="sequential"
)

# Custom device map
device_map = {
    "transformer.layers.0": 0,      # GPU 0
    "transformer.layers.1": 1,      # GPU 1
    "transformer.layers.2": "cpu",  # CPU
}
model = AutoModel.from_pretrained("model-id", device_map=device_map)

Manual device placement:

python
import torch
model = AutoModel.from_pretrained("model-id")
model.to("cuda:0")  # Move to GPU 0
model.to(torch.device("cuda" if torch.cuda.is_available() else "cpu"))

Precision Control

torch_dtype: Set model precision

python
import torch

# Float16 (half precision)
model = AutoModel.from_pretrained("model-id", torch_dtype=torch.float16)

# BFloat16 (better range than float16)
model = AutoModel.from_pretrained("model-id", torch_dtype=torch.bfloat16)

# Auto (use original dtype)
model = AutoModel.from_pretrained("model-id", torch_dtype="auto")
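
The dtype choice translates directly into memory: float16 and bfloat16 store 2 bytes per parameter versus 4 for float32, so the footprint roughly halves. A quick check (a minimal sketch; exact numbers vary by model):

python
import torch
from transformers import AutoModel

model_fp32 = AutoModel.from_pretrained("bert-base-uncased")
model_fp16 = AutoModel.from_pretrained("bert-base-uncased", torch_dtype=torch.float16)

print(f"fp32: {model_fp32.get_memory_footprint() / 1024**2:.0f} MB")
print(f"fp16: {model_fp16.get_memory_footprint() / 1024**2:.0f} MB")  # roughly half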

Attention Implementation

attn_implementation: Choose attention mechanism

python
# Scaled Dot Product Attention (PyTorch 2.0+; the default in recent
# transformers releases when it is available)
model = AutoModel.from_pretrained("model-id", attn_implementation="sdpa")

# Flash Attention 2 (requires the flash-attn package and a supported GPU)
model = AutoModel.from_pretrained("model-id", attn_implementation="flash_attention_2")

# Eager (plain PyTorch implementation, most compatible)
model = AutoModel.from_pretrained("model-id", attn_implementation="eager")
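
Flash Attention 2 only works when the flash-attn package is installed and the GPU supports it, so a guarded fallback is a common pattern (a sketch, not a library utility; note that flash_attention_2 also requires half precision):

python
import torch
from transformers import AutoModelForCausalLM

try:
    model = AutoModelForCausalLM.from_pretrained(
        "model-id",
        attn_implementation="flash_attention_2",
        torch_dtype=torch.bfloat16,  # flash-attn needs fp16/bf16
    )
except (ImportError, ValueError):
    # fall back to PyTorch's built-in scaled dot product attention
    model = AutoModelForCausalLM.from_pretrained("model-id", attn_implementation="sdpa")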

Memory Optimization

low_cpu_mem_usage: Reduce CPU memory during loading

python
model = AutoModelForCausalLM.from_pretrained(
    "large-model-id",
    low_cpu_mem_usage=True,
    device_map="auto"
)

load_in_8bit: 8-bit quantization (requires the bitsandbytes package; recent transformers versions prefer passing it via BitsAndBytesConfig)

python
model = AutoModelForCausalLM.from_pretrained(
    "model-id",
    load_in_8bit=True,
    device_map="auto"
)

load_in_4bit: 4-bit quantization

python
import torch
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)

model = AutoModelForCausalLM.from_pretrained(
    "model-id",
    quantization_config=quantization_config,
    device_map="auto"
)
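
BitsAndBytesConfig exposes further 4-bit options; NF4 with double quantization is a common combination (parameter names from the bitsandbytes integration):

python
import torch
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4 instead of plain FP4
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16, # dtype used for matmuls
)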

Model Configuration

Loading with Custom Config

python
from transformers import AutoConfig, AutoModel

# Load and modify config
config = AutoConfig.from_pretrained("bert-base-uncased")
config.hidden_dropout_prob = 0.2
config.attention_probs_dropout_prob = 0.2

# Initialize model with custom config
model = AutoModel.from_pretrained("bert-base-uncased", config=config)

Initializing from Config Only

python
config = AutoConfig.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_config(config)  # Random weights
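
Because from_config builds an untrained model, it is useful for training from scratch or for making tiny models in tests. A sketch shrinking GPT-2 (the attribute names n_layer, n_head, n_embd are specific to the GPT-2 config):

python
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("gpt2")
config.n_layer = 2    # 2 transformer blocks instead of 12
config.n_head = 4
config.n_embd = 128   # must remain divisible by n_head

tiny_model = AutoModelForCausalLM.from_config(config)  # random weights
print(tiny_model.num_parameters())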

Model Modes

Training vs Evaluation Mode

Models load in evaluation mode by default:

python
model = AutoModel.from_pretrained("model-id")
print(model.training)  # False

# Switch to training mode
model.train()

# Switch back to evaluation mode
model.eval()

Evaluation mode disables dropout and makes normalization layers such as batch norm use their stored running statistics instead of per-batch statistics.
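
The effect is easy to observe: with dropout active in training mode, two forward passes on the same input differ; in eval mode they match. A minimal check:

python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
inputs = tokenizer("Same input twice", return_tensors="pt")

model.train()
with torch.no_grad():
    a = model(**inputs).last_hidden_state
    b = model(**inputs).last_hidden_state
print(torch.allclose(a, b))  # False: dropout randomizes activations

model.eval()
with torch.no_grad():
    a = model(**inputs).last_hidden_state
    b = model(**inputs).last_hidden_state
print(torch.allclose(a, b))  # True: deterministic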

Saving Models

Save Locally

python
model.save_pretrained("./my_model")

This creates:

  • config.json: Model configuration
  • pytorch_model.bin or model.safetensors: Model weights
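
To confirm what was written, list the directory (recent transformers versions default to safetensors):

python
import os

print(os.listdir("./my_model"))
# e.g. ['config.json', 'model.safetensors']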

Save to Hugging Face Hub

python
model.push_to_hub("username/model-name")

# With custom commit message
model.push_to_hub("username/model-name", commit_message="Update model")

# Private repository
model.push_to_hub("username/model-name", private=True)
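
push_to_hub requires authentication with a write-scoped token, either interactively or via the HF_TOKEN environment variable:

python
from huggingface_hub import login

login()  # prompts for a token; alternatively set HF_TOKEN in the environment
model.push_to_hub("username/model-name")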

Model Inspection

Parameter Count

python
# Total parameters
total_params = model.num_parameters()

# Trainable parameters only
trainable_params = model.num_parameters(only_trainable=True)

print(f"Total: {total_params:,}")
print(f"Trainable: {trainable_params:,}")

Memory Footprint

python
memory_bytes = model.get_memory_footprint()
memory_mb = memory_bytes / 1024**2
print(f"Memory: {memory_mb:.2f} MB")

Model Architecture

python
print(model)  # Print full architecture

# Access specific components
print(model.config)
print(model.base_model)

Forward Pass

Basic inference:

python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("model-id")
model = AutoModelForSequenceClassification.from_pretrained("model-id")

inputs = tokenizer("Sample text", return_tensors="pt")
outputs = model(**inputs)

logits = outputs.logits
predictions = logits.argmax(dim=-1)
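
For pure inference, wrap the forward pass in torch.no_grad() (or torch.inference_mode()) to skip gradient tracking, and use the config's id2label mapping to turn class indices into names (the mapping is only meaningful if the checkpoint defines it):

python
import torch

with torch.no_grad():
    outputs = model(**inputs)

predictions = outputs.logits.argmax(dim=-1)
labels = [model.config.id2label[i.item()] for i in predictions]
print(labels)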

Model Formats

SafeTensors vs PyTorch

SafeTensors loads faster and, unlike pickle-based .bin checkpoints, cannot execute arbitrary code when deserialized:

python
# Save as safetensors (recommended)
model.save_pretrained("./model", safe_serialization=True)

# Load either format automatically
model = AutoModel.from_pretrained("./model")

ONNX Export

Export for optimized inference. The transformers.onnx module is the legacy export path (the optimum library is now the recommended route); it takes an OnnxConfig for the architecture rather than the model's regular config:

python
from pathlib import Path
from transformers import AutoModel, AutoTokenizer
from transformers.onnx import export
from transformers.onnx.features import FeaturesManager

model_id = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

# Resolve an ONNX config for this architecture
_, onnx_config_cls = FeaturesManager.check_supported_model_or_raise(model, feature="default")
onnx_config = onnx_config_cls(model.config)

export(
    preprocessor=tokenizer,
    model=model,
    config=onnx_config,
    opset=onnx_config.default_onnx_opset,
    output=Path("model.onnx"),
)
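
The exported graph can then be run with onnxruntime (assuming pip install onnxruntime; the input names come from the ONNX config above):

python
import onnxruntime as ort

session = ort.InferenceSession("model.onnx")
onnx_inputs = dict(tokenizer("Sample text", return_tensors="np"))
outputs = session.run(None, onnx_inputs)
print(outputs[0].shape)  # last_hidden_state for a base model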

Best Practices

  1. Use AutoModel classes: Automatic architecture detection
  2. Specify dtype explicitly: Control precision and memory
  3. Use device_map="auto": For large models
  4. Enable low_cpu_mem_usage: When loading large models
  5. Use safetensors format: Faster and safer serialization
  6. Check model.training: Ensure correct mode for task
  7. Consider quantization: For deployment on resource-constrained devices
  8. Cache models locally: Set the HF_HOME (or legacy TRANSFORMERS_CACHE) environment variable
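
Several of these practices combine naturally in a single loading call (a sketch; the model ID and dtype are placeholders to adapt):

python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "model-id",
    torch_dtype=torch.bfloat16,   # explicit precision
    device_map="auto",            # spread weights across available devices
    low_cpu_mem_usage=True,       # avoid a full fp32 copy in CPU RAM
    attn_implementation="sdpa",   # fast built-in attention
)
model.eval()  # make sure dropout is off for inference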

Common Issues

CUDA out of memory:

python
import torch

# Use smaller precision
model = AutoModel.from_pretrained("model-id", torch_dtype=torch.float16)

# Or use quantization (requires bitsandbytes)
model = AutoModel.from_pretrained("model-id", load_in_8bit=True, device_map="auto")

# Or keep the model on CPU
model = AutoModel.from_pretrained("model-id", device_map="cpu")

Slow loading:

python
# Enable low CPU memory mode
model = AutoModel.from_pretrained("model-id", low_cpu_mem_usage=True)

Model not found:

python
# Verify the model ID on huggingface.co
# Check authentication for private models
from huggingface_hub import login
login()