scientific-skills/transformers/references/models.md
The transformers library provides flexible model loading with automatic architecture detection, device management, and configuration control.
Use AutoModel classes for automatic architecture selection:
from transformers import AutoModel, AutoModelForSequenceClassification, AutoModelForCausalLM
# Base model (no task head)
model = AutoModel.from_pretrained("bert-base-uncased")
# Sequence classification
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")
# Causal language modeling (GPT-style)
model = AutoModelForCausalLM.from_pretrained("gpt2")
# Masked language modeling (BERT-style)
from transformers import AutoModelForMaskedLM
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
# Sequence-to-sequence (T5-style)
from transformers import AutoModelForSeq2SeqLM
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
NLP Tasks:
- AutoModelForSequenceClassification: Text classification, sentiment analysis
- AutoModelForTokenClassification: NER, POS tagging
- AutoModelForQuestionAnswering: Extractive QA
- AutoModelForCausalLM: Text generation (GPT, Llama)
- AutoModelForMaskedLM: Masked language modeling (BERT)
- AutoModelForSeq2SeqLM: Translation, summarization (T5, BART)

Vision Tasks:
- AutoModelForImageClassification: Image classification
- AutoModelForObjectDetection: Object detection
- AutoModelForImageSegmentation: Image segmentation

Audio Tasks:
- AutoModelForAudioClassification: Audio classification
- AutoModelForSpeechSeq2Seq: Speech recognition

Multimodal:
- AutoModelForVision2Seq: Image captioning, VQA

pretrained_model_name_or_path: Model identifier or local path
model = AutoModel.from_pretrained("bert-base-uncased") # From Hub
model = AutoModel.from_pretrained("./local/model/path") # From disk
num_labels: Number of output labels for classification
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=3
)
cache_dir: Custom cache location
model = AutoModel.from_pretrained("model-id", cache_dir="./my_cache")
device_map: Automatic device allocation for large models
# Automatically distribute across GPUs and CPU
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    device_map="auto"
)
# Sequential placement
model = AutoModelForCausalLM.from_pretrained(
    "model-id",
    device_map="sequential"
)
# Custom device map
device_map = {
    "transformer.layers.0": 0,      # GPU 0
    "transformer.layers.1": 1,      # GPU 1
    "transformer.layers.2": "cpu",  # CPU
}
model = AutoModel.from_pretrained("model-id", device_map=device_map)
Manual device placement:
import torch
model = AutoModel.from_pretrained("model-id")
model.to("cuda:0") # Move to GPU 0
model.to(torch.device("cuda" if torch.cuda.is_available() else "cpu"))
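Inputs must live on the same device as the model; a minimal sketch, assuming a tokenizer exists for the same checkpoint:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("model-id")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModel.from_pretrained("model-id").to(device)
# Tokenizer output supports .to(device), just like a tensor
inputs = tokenizer("Sample text", return_tensors="pt").to(device)
outputs = model(**inputs)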
torch_dtype: Set model precision
import torch
# Float16 (half precision)
model = AutoModel.from_pretrained("model-id", torch_dtype=torch.float16)
# BFloat16 (better range than float16)
model = AutoModel.from_pretrained("model-id", torch_dtype=torch.bfloat16)
# Auto (use original dtype)
model = AutoModel.from_pretrained("model-id", torch_dtype="auto")
attn_implementation: Choose attention mechanism
# Scaled Dot Product Attention (built into PyTorch 2.0+)
model = AutoModel.from_pretrained("model-id", attn_implementation="sdpa")
# Flash Attention 2 (requires flash-attn package)
model = AutoModel.from_pretrained("model-id", attn_implementation="flash_attention_2")
# Eager (most compatible; the fallback when optimized implementations are unavailable)
model = AutoModel.from_pretrained("model-id", attn_implementation="eager")
low_cpu_mem_usage: Reduce CPU memory during loading
model = AutoModelForCausalLM.from_pretrained(
    "large-model-id",
    low_cpu_mem_usage=True,
    device_map="auto"
)
load_in_8bit: 8-bit quantization (requires bitsandbytes)
model = AutoModelForCausalLM.from_pretrained(
    "model-id",
    load_in_8bit=True,
    device_map="auto"
)
load_in_4bit: 4-bit quantization
import torch
from transformers import BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)
model = AutoModelForCausalLM.from_pretrained(
    "model-id",
    quantization_config=quantization_config,
    device_map="auto"
)
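BitsAndBytesConfig exposes further 4-bit options; a sketch showing NF4 quantization combined with nested (double) quantization:
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # NF4 quantization type
    bnb_4bit_use_double_quant=True,       # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16
)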
Load with a custom configuration:
from transformers import AutoConfig, AutoModel
# Load and modify config
config = AutoConfig.from_pretrained("bert-base-uncased")
config.hidden_dropout_prob = 0.2
config.attention_probs_dropout_prob = 0.2
# Initialize model with custom config
model = AutoModel.from_pretrained("bert-base-uncased", config=config)
config = AutoConfig.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_config(config) # Random weights
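from_config is useful for training from scratch; a sketch that shrinks the architecture before initializing random weights (GPT-2 configs expose n_layer, n_head, and n_embd):
config = AutoConfig.from_pretrained("gpt2")
config.n_layer = 6  # smaller model, for illustration only
model = AutoModelForCausalLM.from_config(config)
print(model.num_parameters())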
Models load in evaluation mode by default:
model = AutoModel.from_pretrained("model-id")
print(model.training) # False
# Switch to training mode
model.train()
# Switch back to evaluation mode
model.eval()
Evaluation mode disables dropout and makes batch normalization layers use their stored running statistics.
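Note that eval() does not turn off gradient tracking; wrap pure inference in torch.no_grad() to avoid storing activations for backpropagation. A minimal sketch:
import torch
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")  # loads in eval mode
with torch.no_grad():
    outputs = model(**tokenizer("Sample text", return_tensors="pt"))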
Save a model locally:
model.save_pretrained("./my_model")
This creates:
- config.json: Model configuration
- pytorch_model.bin or model.safetensors: Model weights

Push a model to the Hub:
model.push_to_hub("username/model-name")
# With custom commit message
model.push_to_hub("username/model-name", commit_message="Update model")
# Private repository
model.push_to_hub("username/model-name", private=True)
Count parameters:
# Total parameters
total_params = model.num_parameters()
# Trainable parameters only
trainable_params = model.num_parameters(only_trainable=True)
print(f"Total: {total_params:,}")
print(f"Trainable: {trainable_params:,}")
Check memory footprint:
memory_bytes = model.get_memory_footprint()
memory_mb = memory_bytes / 1024**2
print(f"Memory: {memory_mb:.2f} MB")
Inspect the architecture:
print(model)  # Print full architecture
# Access specific components
print(model.config)
print(model.base_model)
Basic inference:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("model-id")
model = AutoModelForSequenceClassification.from_pretrained("model-id")
inputs = tokenizer("Sample text", return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits
predictions = logits.argmax(dim=-1)
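To turn the logits into probabilities and a readable label, a short sketch (id2label comes from the model config and defaults to LABEL_0, LABEL_1, ... unless the checkpoint defines label names):
import torch
probs = torch.softmax(logits, dim=-1)
predicted_label = model.config.id2label[predictions.item()]
print(predicted_label, probs.max().item())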
The safetensors format loads faster and is safer (it stores only tensors, with no pickled code):
# Save as safetensors (recommended)
model.save_pretrained("./model", safe_serialization=True)
# Load either format automatically
model = AutoModel.from_pretrained("./model")
Export for optimized inference:
from pathlib import Path
from transformers import AutoConfig, AutoModel, AutoTokenizer
from transformers.models.distilbert import DistilBertOnnxConfig
from transformers.onnx import export
model_ckpt = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModel.from_pretrained(model_ckpt)
# Each architecture has a matching OnnxConfig describing its inputs/outputs
onnx_config = DistilBertOnnxConfig(AutoConfig.from_pretrained(model_ckpt))
# Export to ONNX
onnx_inputs, onnx_outputs = export(
    tokenizer, model, onnx_config, onnx_config.default_onnx_opset, Path("model.onnx")
)
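The exported file can then be run with ONNX Runtime; a sketch that assumes the onnxruntime package is installed and the export above succeeded:
import onnxruntime as ort
session = ort.InferenceSession("model.onnx")
# Input names must match the exported graph (e.g. input_ids, attention_mask)
onnx_inputs = dict(tokenizer("Sample text", return_tensors="np"))
outputs = session.run(None, onnx_inputs)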
CUDA out of memory:
# Use smaller precision
model = AutoModel.from_pretrained("model-id", torch_dtype=torch.float16)
# Or use quantization
model = AutoModel.from_pretrained("model-id", load_in_8bit=True)
# Or use CPU
model = AutoModel.from_pretrained("model-id", device_map="cpu")
Slow loading:
# Enable low CPU memory mode
model = AutoModel.from_pretrained("model-id", low_cpu_mem_usage=True)
Model not found:
# Verify the model ID exists on huggingface.co
# Check authentication for private models
from huggingface_hub import login
login()
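An access token can also be passed directly to from_pretrained; a sketch with placeholder values:
model = AutoModel.from_pretrained("username/private-model", token="hf_xxx")  # placeholder token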