MIGRATION_GUIDE_V5.md
We're removing the TensorFlow and JAX parts of the library. This will let us focus fully on torch going forward and will greatly reduce the maintenance cost of models. We are still working with tools from the JAX ecosystem (such as MaxText) to see how we can remain compatible with their tooling while keeping torch as the only backend for now.
Linked PR: https://github.com/huggingface/transformers/pull/40760
We introduce a new weight loading API in transformers, which significantly improves on the previous API. This
weight loading API is designed to apply operations to the checkpoints loaded by transformers.
Instead of loading the checkpoint exactly as it is serialized within the model, these operations can reshape, merge, and split the layers according to how they're defined in this new API. These operations are often a necessity when working with quantization or parallelism algorithms.
This new API is centered around the new WeightConverter class:
class WeightConverter(WeightTransform):
    operations: list[ConversionOps]
    source_keys: Union[str, list[str]]
    target_keys: Union[str, list[str]]
The weight converter is designed to apply a list of operations on the source keys, resulting in the target keys. A common operation on the attention layers is to fuse the query, key, and value layers. Doing so with this API amounts to defining the following conversion:
conversion = WeightConverter(
    ["self_attn.q_proj", "self_attn.k_proj", "self_attn.v_proj"],  # The input layers
    "self_attn.qkv_proj",  # The single layer as output
    operations=[Concatenate(dim=0)],
)
In this situation, we apply the Concatenate operation, which accepts a list of layers as input and returns a single
layer.
This allows us to define a mapping from each architecture to a list of weight conversions, which can apply arbitrary transformations to the layers themselves. This significantly simplified the from_pretrained method and helped us remove a lot of technical debt accumulated over the past few years.
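As an illustration, such a per-architecture mapping could look like the sketch below. The variable name and the second converter are hypothetical; only WeightConverter and Concatenate are taken from the API described above, and they are assumed to be importable from transformers.

# Hypothetical sketch: a per-architecture list of weight conversions.
# `WeightConverter` and `Concatenate` are the classes described above;
# the mapping name and the MLP entry are illustrative only.
LLAMA_CONVERSIONS = [
    WeightConverter(
        ["self_attn.q_proj", "self_attn.k_proj", "self_attn.v_proj"],
        "self_attn.qkv_proj",
        operations=[Concatenate(dim=0)],
    ),
    WeightConverter(
        ["mlp.gate_proj", "mlp.up_proj"],
        "mlp.gate_up_proj",
        operations=[Concatenate(dim=0)],
    ),
]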
This results in several improvements:
While this is being implemented, expect varying levels of support across different release candidates.
Linked PR: https://github.com/huggingface/transformers/pull/41580
Just as we moved towards a single backend library for model definition, we want our tokenizers, and the Tokenizer object, to be a lot more intuitive. With v5, tokenizer definition is much simpler: you can now initialize an empty LlamaTokenizer and train it directly on your corpus.
Defining a new tokenizer object should be as simple as this:
from transformers import TokenizersBackend, generate_merges
from tokenizers import pre_tokenizers, Tokenizer
from tokenizers.models import BPE
class Llama5Tokenizer(TokenizersBackend):
    def __init__(self, unk_token="<unk>", bos_token="<s>", eos_token="</s>", vocab=None, merges=None):
        if vocab is None:
            self._vocab = {
                str(unk_token): 0,
                str(bos_token): 1,
                str(eos_token): 2,
            }
        else:
            self._vocab = vocab

        self._merges = merges or []
        self._tokenizer = Tokenizer(
            BPE(vocab=self._vocab, merges=self._merges, fuse_unk=True)
        )
        self._tokenizer.pre_tokenizer = pre_tokenizers.Metaspace(
            replacement="▁", prepend_scheme=_get_prepend_scheme(self.add_prefix_space, self), split=False
        )
        super().__init__(
            tokenizer_object=self._tokenizer,
            unk_token=unk_token,
            bos_token=bos_token,
            eos_token=eos_token,
        )
Once the tokenizer is defined as above, you can load it with the following: Llama5Tokenizer(). Doing this returns an empty, trainable tokenizer that follows the definition of the authors of Llama5 (it does not exist yet :wink:).
The above is the main motivation towards refactoring tokenization: we want tokenizers to behave similarly to models: trained or empty, and with exactly what is defined in their class definition.
Up to now, transformers maintained two parallel implementations for many tokenizers:
- Slow tokenizers (tokenization_<model>.py) - Python-based implementations, often using SentencePiece as the backend.
- Fast tokenizers (tokenization_<model>_fast.py) - Rust-based implementations using the 🤗 tokenizers library.

In v5, we consolidate to a single tokenizer file per model: tokenization_<model>.py. This file will use the most appropriate backend available:

- A SentencePiece-based backend, using the sentencepiece library. It inherits from PythonBackend.
- A tokenizers-based backend, using 🤗 tokenizers. Basically allows adding tokens.
- A backend using MistralCommon's tokenization library. (Previously known as the MistralCommonTokenizer.)

The AutoTokenizer automatically selects the appropriate backend based on available files and dependencies. This is transparent: you continue to use AutoTokenizer.from_pretrained() as before. This allows transformers to be future-proof and modular, easily supporting future backends.
We enable users and tokenizer builders to define their own tokenizers from top to bottom. Tokenizers are usually defined using a backend such as tokenizers, sentencepiece or mistral-common, but we offer the possibility to design the tokenizer at a higher-level, without relying on those backends.
To do so, you can import the PythonBackend (which was previously known as PreTrainedTokenizer). This class encapsulates all the logic related to added tokens, encoding, and decoding.
If you want something even higher up the stack, then PreTrainedTokenizerBase is what PythonBackend inherits from. It contains the very basic tokenizer API features:
- encode
- decode
- vocab_size
- get_vocab
- convert_tokens_to_ids
- convert_ids_to_tokens
- from_pretrained
- save_pretrained

Note for implementing new tokenizers: When creating a tokenizer class that loads from SentencePiece files, you can override the convert_from_spm class method in your converter to customize the vocabulary structure, normalizers, regexes, and anything else that you would want to be passed to the tokenizers you are converting.
This is useful if the model requires specific token ordering or special split regex patterns. See existing converter classes in convert_slow_tokenizer.py for examples.
Starting with v5, we now enable initializing blank, untrained tokenizers-backed tokenizers:
from transformers import LlamaTokenizer
tokenizer = LlamaTokenizer()
This tokenizer will therefore follow the definition of the LlamaTokenizer as defined in its class definition. It can then be trained on a corpus as can be seen in the tokenizers documentation.
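For example, a minimal training sketch, assuming the underlying tokenizers.Tokenizer object is reachable from the wrapper (here via the private _tokenizer attribute, as in the Llama5Tokenizer sketch above; the corpus is a placeholder):

from tokenizers.trainers import BpeTrainer
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer()  # blank, untrained tokenizer
corpus = ["hello world", "hello there"]  # placeholder corpus

# Train the underlying `tokenizers.Tokenizer` object on the corpus
trainer = BpeTrainer(vocab_size=1000, special_tokens=["<unk>", "<s>", "</s>"])
tokenizer._tokenizer.train_from_iterator(corpus, trainer=trainer)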
These tokenizers can also be initialized from vocab and merges (if necessary), like the previous "slow" tokenizers:
from transformers import LlamaTokenizer
vocab = {"<unk>": 0, "<s>": 1, "</s>": 2, "hello": 3, "world": 4}
merges = [("h", "e"), ("l", "l"), ("o", " ")]
tokenizer = LlamaTokenizer(vocab=vocab, merges=merges)
This tokenizer will behave as a Llama-like tokenizer, with an updated vocabulary. This allows comparing different tokenizer classes with the same vocab; therefore enabling the comparison of different pre-tokenizers, normalizers, etc.
Simplified file loading: support is added for passing vocab and merges as file paths directly to tokenizer initialization. The tokenizer will automatically detect the format (SentencePiece .model, Tekken tekken.json, or plain vocab/merges files) when loading. For BPE tokenizers, if a vocab is provided but no merges, merges will be automatically generated (excluding special tokens).
Note: the primary goal of loading from file paths with vocab="<path_to_a_file>" is to allow quick testing; for BPE models, for example, we don't check whether you properly passed the merges or not.
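For instance (the file path is a placeholder, and which file formats a given tokenizer class accepts this way depends on its backend):

from transformers import LlamaTokenizer

# Quick testing only: the format (SentencePiece .model, tekken.json, or plain vocab/merges)
# is auto-detected, and merges are generated automatically for BPE when not provided.
tokenizer = LlamaTokenizer(vocab="path/to/tokenizer.model")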
The batch_decode and decode methods have been unified to reflect behavior of the encode method. Both single and batch decoding now use the same decode method. See an example of the new behavior below:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("t5-small")
inputs = ["hey how are you?", "fine"]
tokenizer.decode(tokenizer.encode(inputs))
Gives:
- 'hey how are you?</s> fine</s>'
+ ['hey how are you?</s>', 'fine</s>']
We expect encode and decode to behave as two sides of the same coin: encode, process, decode should just work.
[!NOTE] A common use-case would be:
encode, model.generate, decode. However, using generate would return list[list[int]], which would then be incompatible with decode.
The encode_plus method is deprecated in favor of the single __call__ method.
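Migrating is typically a drop-in change, for example:

# v4
encoded = tokenizer.encode_plus("hey how are you?", return_tensors="pt")

# v5
encoded = tokenizer("hey how are you?", return_tensors="pt")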
apply_chat_template returns BatchEncoding

Previously, apply_chat_template returned input_ids for backward compatibility. Starting with v5, it now consistently returns a BatchEncoding dict like other tokenizer methods.
# v5
messages = [
{"role": "user", "content": "Hello!"},
{"role": "assistant", "content": "Hi there!"}
]
# Now returns BatchEncoding with input_ids, attention_mask, etc.
outputs = tokenizer.apply_chat_template(messages, return_tensors="pt")
print(outputs.keys()) # dict_keys(['input_ids', 'attention_mask'])
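Since the output is a BatchEncoding, it can be unpacked directly into generate (a sketch; the model and tokenizer are assumed to be already loaded):

# The returned BatchEncoding can be passed straight to the model
generated_ids = model.generate(**outputs, max_new_tokens=32)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))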
We simplify the serialization of tokenization attributes:
- special_tokens_map.json - special tokens are now stored in tokenizer_config.json.
- added_tokens.json - added tokens are now stored in tokenizer.json.
- added_tokens_decoder is only stored when there is no tokenizer.json.
- add_bos_token and add_eos_token - these are no longer saved in tokenizer_config.json. When a tokenizer.json file exists, these settings are defined in the tokenizer class or tokenizer.json itself.

Backend synchronization removed: the automatic synchronization logic that updated backend tokenizer settings (like add_prefix_space, do_lower_case, strip_accents, tokenize_chinese_chars) after initialization has been removed. Tokenizer behavior is now fully determined by the tokenizer.json file or class definition at initialization time.
When loading older tokenizers, these files are still read for backward compatibility, but new saves use the consolidated format. We're gradually moving towards consolidating attributes to fewer files so that other libraries and implementations may depend on them more reliably.
Several models that had identical tokenizers now import from their base implementation:
Removed T5-specific workarounds
The internal _eventually_correct_t5_max_length method has been removed. T5 tokenizers now handle max length consistently with other models.
A few testing changes specific to tokenizers have been applied:
- Tests for common behaviors (add_tokens, encode, decode) are now centralized and automatically applied across all tokenizers. This reduces test duplication and ensures consistent behavior.
- For legacy implementations, the original BERT Python tokenizer code (including WhitespaceTokenizer, BasicTokenizer, etc.) is preserved in bert_legacy.py for reference purposes.
Special Tokens Structure:
- SpecialTokensMixin: Merged into PreTrainedTokenizerBase to simplify the tokenizer architecture.
- special_tokens_map: Now only stores named special token attributes (e.g., bos_token, eos_token). Use extra_special_tokens for additional special tokens (formerly additional_special_tokens). all_special_tokens includes both named and extra tokens.

# v4
tokenizer.special_tokens_map # Included 'additional_special_tokens'
# v5
tokenizer.special_tokens_map # Only named tokens
tokenizer.extra_special_tokens # Additional tokens
- special_tokens_map_extended and all_special_tokens_extended: Removed. Access AddedToken objects directly from _special_tokens_map or _extra_special_tokens if needed.
- additional_special_tokens: Automatically converted to extra_special_tokens during initialization.
- additional_special_tokens_ids: Removed. Use extra_special_tokens_ids instead.
- extra_special_tokens: Only accepts list/tuple format and is intended for use during tokenizer initialization. For model-specific named tokens (e.g., image_token), pass directly as keyword arguments instead.

Deprecated Methods:

- sanitize_special_tokens(): Already deprecated in v4, removed in v5.
- prepare_seq2seq_batch(): Deprecated; use __call__() with the text_target parameter instead.

# v4
model_inputs = tokenizer.prepare_seq2seq_batch(src_texts, tgt_texts, max_length=128)
# v5
model_inputs = tokenizer(src_texts, text_target=tgt_texts, max_length=128, return_tensors="pt")
model_inputs["labels"] = model_inputs.pop("input_ids_target")
- BatchEncoding.words(): Deprecated; use word_ids() instead.

Removed Methods:

- create_token_type_ids_from_sequences(): Removed from base class. Subclasses that need custom token type ID creation should implement this method directly.
- prepare_for_model(), build_inputs_with_special_tokens(), truncate_sequences(): Moved from tokenization_utils_base.py to tokenization_python.py for PythonBackend tokenizers. TokenizersBackend provides model-ready input via tokenize() and encode(), so these methods are no longer needed in the base class.
- _switch_to_input_mode(), _switch_to_target_mode(), as_target_tokenizer(): Removed from base class. Use __call__() with the text_target parameter instead.

# v4
with tokenizer.as_target_tokenizer():
    labels = tokenizer(tgt_texts, ...)
# v5
labels = tokenizer(text_target=tgt_texts, ...)
- parse_response(): Removed from base class.

Because we are switching away from the naive MoE implementation (an nn.ModuleList of experts), we currently have an issue with MoEs that have adapters. For more details see https://github.com/huggingface/transformers/issues/42491#issuecomment-3591485649.
We aim for this to be fixed and released in a following release candidate in the week that follows RC0.
We are streamlining the MoE support with vLLM; while this is being implemented, tensor parallelism and expert parallelism aren't working as expected. This is known and actively being worked on.
We aim for this to be fixed and released in a following release candidate in the week that follows RC0.
A number of import paths were removed or reworked; for example, transformers.tokenization_utils and transformers.tokenization_utils_fast no longer exist as such.
They now redirect to transformers.tokenization_utils_sentencepiece and transformers.tokenization_utils_tokenizers respectively; please update your imports accordingly.
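For example, a sketch of updating an import (the module names follow the note above; the exact class locations in v5 are an assumption):

# v4
from transformers.tokenization_utils_fast import PreTrainedTokenizerFast

# v5 (assumed location of the new backend class)
from transformers.tokenization_utils_tokenizers import TokenizersBackend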
We aim for this to be fixed and released in a following release candidate in the week that follows RC0.
For anyone inheriting from a transformers PreTrainedModel, the weights are automatically initialized with the common scheme:
@torch.no_grad()
def _init_weights(self, module):
    """
    Initialize the weights. This is quite general on purpose, in the spirit of what we usually do. For more complex
    initialization scheme, it should be overridden by the derived `PreTrainedModel` class. In case a model adds an explicit
    `nn.Parameter`, this method should also be overridden in order to initialize it correctly.
    """
    if hasattr(self.config, "initializer_range"):
        std = self.config.initializer_range or 0.02
    elif hasattr(self.config, "init_std"):
        std = self.config.init_std
    elif hasattr(self.config, "initializer_factor"):
        std = self.config.initializer_factor
    else:
        # 0.02 is the standard default value across the library
        std = getattr(self.config.get_text_config(), "initializer_range", 0.02)

    if isinstance(module, (nn.Linear, nn.Conv1d, nn.Conv2d, nn.Conv3d, nn.ConvTranspose1d, nn.ConvTranspose2d)):
        if getattr(module, "weight", None) is not None:
            init.normal_(module.weight, mean=0.0, std=std)
        if getattr(module, "bias", None) is not None:
            init.zeros_(module.bias)
    elif isinstance(module, nn.Embedding):
        if getattr(module, "weight", None) is not None:
            init.normal_(module.weight, mean=0.0, std=std)
            # Here we need the check explicitly, as we slice the weight in the `zeros_` call, so it loses the flag
            if module.padding_idx is not None and not getattr(module.weight, "_is_hf_initialized", False):
                init.zeros_(module.weight[module.padding_idx])
    elif isinstance(module, nn.MultiheadAttention):
        # This uses torch's original init
        module._reset_parameters()
    # We cannot use `isinstance` on the RMSNorms or LayerNorms, as they usually are custom modules which change names
    # between modelings (because they are prefixed with the model name)
    elif (
        isinstance(module, (nn.GroupNorm, nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d))
        or "LayerNorm" in module.__class__.__name__
        or "RMSNorm" in module.__class__.__name__
    ):
        # Norms can exist without weights (in which case they are None from torch primitives)
        if hasattr(module, "weight") and module.weight is not None:
            init.ones_(module.weight)
        if hasattr(module, "bias") and module.bias is not None:
            init.zeros_(module.bias)
If you want to avoid that, for now you should just do:
class CustomModel(Qwen3VLForConditionalGeneration):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.action_head = nn.Linear(1024, 7)
        self.positional_embedding = nn.Parameter(torch.randn(16, 1152))
        self.post_init()

    def _init_weights(self, module):
        pass
There is a tracker for that here: https://github.com/huggingface/transformers/issues/42418.
safe_serialization=False

Safetensors is a simple format for storing tensors safely (as opposed to pickle) that is still fast (zero-copy). It is the preferred file format for storing transformers weights. Prior to transformers v5, it was still possible to pass safe_serialization=False to fall back to torch's default (and unsafe) file format. This is no longer possible in v5. The safe_serialization parameter has been removed from all save_pretrained and push_to_hub methods.
If you really want to export weights to another file format, you must save the model.state_dict() yourself.
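For example, using plain torch serialization (note that this reintroduces pickle-based files, with the usual safety caveats):

import torch

# Export the raw weights yourself if you need a non-safetensors format
torch.save(model.state_dict(), "pytorch_model.bin")

# ... and load them back into a freshly-initialized model
state_dict = torch.load("pytorch_model.bin", weights_only=True)
model.load_state_dict(state_dict)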
Linked PR: https://github.com/huggingface/transformers/issues/42556
The default shard size went up from 5GB to 50GB. The main benefit is avoiding tens or hundreds of weight files for large models. This change was made possible by the Xet backend, which allows us to efficiently serve very large files. Increasing the default shard size was a decision taken only after very careful consideration of optimizations and load speed. Check out the linked PR for benchmark details.
Linked PR: https://github.com/huggingface/transformers/issues/42556
use_auth_token

The use_auth_token argument/parameter is deprecated in favor of token everywhere.
You should be able to search and replace use_auth_token with token and get the same logic.
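For example:

from transformers import AutoModelForCausalLM

# v4
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B", use_auth_token="hf_...")

# v5
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B", token="hf_...")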
Linked PR: https://github.com/huggingface/transformers/pull/41666
We decided to remove some features for the upcoming v5 as they are currently only supported in a few old models and are no longer integrated in current model additions. It's recommended to stick to v4.x in case you need them. The following features are affected:
We dropped support for two torch APIs:
- torchscript in https://github.com/huggingface/transformers/pull/41688
- torch.fx in https://github.com/huggingface/transformers/pull/41683

Those APIs were deprecated by the PyTorch team, and we're instead focusing on the supported APIs, dynamo and export.
get_*_features

Many multi-modal models expose convenience methods such as get_text_features, get_image_features, get_audio_features, and get_video_features to run inference on a single modality without calling model(**inputs) directly.
Starting with v5, these 4 helper methods now return a BaseModelOutputWithPooling (or a subclass) instead of only a pooled embedding tensor:
- last_hidden_state: unpooled token/patch/frame embeddings for the requested modality.
- pooler_output: pooled representation (what most models previously returned from get_*_features).
- hidden_states: full hidden states for all layers when output_hidden_states=True is passed.
- attentions: attention maps when output_attentions=True is passed.

[!IMPORTANT] There is no single universal shape for last_hidden_state or pooler_output. It's recommended to inspect a small forward pass before making assumptions about shapes or semantics.
If your code previously did something like this:
text_embeddings = model.get_text_features(**inputs)
and you used text_embeddings as a tensor, you should now explicitly use return_dict=True and take the pooler_output field from the returned BaseModelOutputWithPooling:
outputs = model.get_text_features(**inputs, return_dict=True)
text_embeddings = outputs.pooler_output
This will match the previous behavior in the large majority of cases. If your model-specific implementation returned a tuple of results before, those values should now be accessible as fields on the corresponding BaseModelOutputWithPooling subclass.
Linked PR: https://github.com/huggingface/transformers/pull/42564
We clean up the quantization API in transformers, and significantly refactor the weight loading as highlighted above.
We drop support for two quantization arguments that have been deprecated for some time:
- load_in_4bit
- load_in_8bit

We remove them in favor of the quantization_config argument, which is much more complete. As an example, here is how you would load a 4-bit bitsandbytes model using this argument:
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(load_in_4bit=True)
model_4bit = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.2-3B",
device_map="auto",
quantization_config=quantization_config
)
- AutoModelWithLMHead is removed in favor of AutoModelForCausalLM for causal language models, AutoModelForMaskedLM for masked language models, and AutoModelForSeq2SeqLM for encoder-decoder models.
- AutoModelForVision2Seq is removed in favor of AutoModelForImageTextToText.
- Config methods named from_xxx_config are deleted. Configs can be initialized from the __init__ method in the same way. See #41314.
- RoPE parameters are now stored in config.rope_parameters, including rope_theta and rope_type. A model's config.rope_parameters is a simple dictionary in most cases, and can also be a nested dict in special cases (i.e. Gemma3 and ModernBert) with a different RoPE parameterization for each layer type. Trying to get config.rope_theta will throw an attribute error from now on. See #39847 and #42255.
- Composite configs no longer forward attributes from their sub-configs (e.g. config.vocab_size). Users are expected to access keys from the respective sub-configs (config.text_config.vocab_size).
- Models that cannot generate (i.e. cannot call model.generate()) will no longer have a generation_config, and model.config.generation_config will throw an attribute error.
- Slow tokenizer files (tokenization_<model>.py) will be removed in favor of the fast tokenizer files tokenization_<model>_fast.py, which will be renamed to tokenization_<model>.py. As fast tokenizers are :hugs:tokenizers-backed, they include a wider range of features that are maintainable and reliable.
- encode_plus --> __call__
- batch_decode --> decode
- apply_chat_template by default returned naked input_ids rather than a BatchEncoding dict. This was inconvenient - it should return a BatchEncoding dict like tokenizer.__call__(), but we were stuck with it for backward compatibility. The method now returns a BatchEncoding.
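For example, a sketch of updating config access for a composite (multimodal) model (attribute availability varies per model):

# v4
vocab_size = model.config.vocab_size
rope_theta = model.config.rope_theta

# v5
vocab_size = model.config.text_config.vocab_size
rope_theta = model.config.rope_parameters["rope_theta"]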
Linked PRs:
- Processors now serialize all of their attributes in processor_config.json as a nested dict, instead of serializing attributes in their own config files. Loading will be supported for all old-format processors (https://github.com/huggingface/transformers/pull/41474).
- XXXFeatureExtractor classes are completely removed in favor of the XXXImageProcessor class for all vision models (https://github.com/huggingface/transformers/pull/41174).
- XXXFastImageProcessorKwargs is removed in favor of XXXImageProcessorKwargs, which will be shared between fast and slow processors (https://github.com/huggingface/transformers/pull/40931).

The old slow/fast dual-file design has been replaced with a named-backend architecture. Each model previously had a PIL-based image_processing_<model>.py and a torchvision-based image_processing_<model>_fast.py. The new layout is:
- image_processing_<model>.py → torchvision backend (default; was previously FooImageProcessorFast)
- image_processing_pil_<model>.py → PIL backend (was previously FooImageProcessor)

Processor classes now inherit from TorchvisionBackend or PilBackend (defined in image_processing_backends.py), which provide ready-made implementations of all standard operations (resize, rescale, normalize, center_crop, pad) and a default _preprocess pipeline. BaseImageProcessor (in image_processing_utils) handles shared preprocessing boilerplate: kwargs validation, default-filling from class attributes, and input preparation. Model-specific processors contain only what is unique to the model. Most processors inherit from a backend and declare class-attribute defaults. Only those with custom logic (e.g. patch tiling) need to override _preprocess.
The image_processing_utils_fast module has been removed; all shared logic now lives in image_processing_utils.
use_fast is replaced by backend

The use_fast parameter is deprecated. Use backend instead:
# v4
processor = AutoImageProcessor.from_pretrained("...", use_fast=True) # torchvision
processor = AutoImageProcessor.from_pretrained("...", use_fast=False) # PIL
# v5
processor = AutoImageProcessor.from_pretrained("...", backend="torchvision")
processor = AutoImageProcessor.from_pretrained("...", backend="pil")
When backend is not specified, the default is "torchvision" if torchvision is installed, otherwise "pil". If the requested backend is unavailable, loading falls back to another available backend with a warning.
FooImageProcessorFast class names are deprecated

FooImageProcessor now refers to the torchvision-backed class (what was previously FooImageProcessorFast), and FooImageProcessorPil is the PIL-backed class (what was previously FooImageProcessor). Importing a *Fast class name still resolves correctly but emits a deprecation warning.
is_fast property is deprecated

Use processor.backend == "torchvision" instead of processor.is_fast.
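For example:

# v4
if processor.is_fast:
    ...

# v5
if processor.backend == "torchvision":
    ...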
AutoImageProcessor.register() API change

slow_image_processor_class and fast_image_processor_class are deprecated in favor of an image_processor_classes dict:
# v4
AutoImageProcessor.register(MyConfig, slow_image_processor_class=MyPilProcessor, fast_image_processor_class=MyTorchvisionProcessor)
# v5
AutoImageProcessor.register(MyConfig, image_processor_classes={"pil": MyPilProcessor, "torchvision": MyTorchvisionProcessor})
The backend key space is open-ended. Any string (e.g. "mlx", "onnx") can be registered by subclassing BaseImageProcessor, implementing process_image and _preprocess, and calling register_backend on the processor class:
LlavaNextImageProcessor.register_backend(name="mlx", backend_class=LlavaNextMlxProcessor, availability_check=lambda: is_mlx_available())
processor = LlavaNextImageProcessor.from_pretrained("...", backend="mlx")
Linked PR: https://github.com/huggingface/transformers/pull/43514
- RotaryEmbeddings layers will start returning a dict of tuples when a model uses several RoPE configurations (Gemma2, ModernBert). Each value will be a tuple of "cos, sin" per RoPE type.
- The RotaryEmbeddings layer will be unified and accessed via config.rope_parameters. The config attribute rope_theta might not be accessible anymore for some models, and will instead be in config.rope_parameters['rope_theta']. BC will be supported for a while as much as possible, and in the near future we'll gradually move to the new RoPE format (https://github.com/huggingface/transformers/pull/39847).
- The model.language_model shortcut is removed. It is recommended to either access the module with model.model.language_model or model.get_decoder(). See #42156.
- Generation output classes specific to each decoding strategy are removed (e.g. GreedySearchEncoderDecoderOutput). We now only have 4 output classes built from the following matrix: decoder-only vs encoder-decoder, uses beams vs doesn't use beams (https://github.com/huggingface/transformers/pull/40998).
- If generate doesn't receive any KV Cache argument, the default cache class used is now defined by the model (as opposed to always being DynamicCache) (https://github.com/huggingface/transformers/pull/41505).
- If a generation config is stored inside config.json for any old model, it will be loaded back into the model's generation config. Users are expected to access or modify generation parameters only with model.generation_config, e.g. model.generation_config.do_sample = True.

The following arguments were removed from TrainingArguments due to low usage:

- mp_parameters -> legacy param that was later on added to the sagemaker trainer
- _n_gpu -> not intended for users to set; we will initialize it correctly instead of putting it in the TrainingArguments
- overwrite_output_dir -> replaced by resume_from_checkpoint; it was only used in the examples scripts, no impact on Trainer
- logging_dir -> only used for tensorboard, set the TENSORBOARD_LOGGING_DIR env var instead
- jit_mode_eval -> use use_torch_compile instead as torchscript is not recommended anymore
- tpu_num_cores -> it is not recommended to set the number of cores; by default, all TPU cores are used. Set the TPU_NUM_CORES env var instead
- past_index -> it was only used for a very small number of models with special architectures like Transformer-XL, and it was not documented at all how to train those models
- ray_scope -> only a minor arg for the Ray integration. Set the RAY_SCOPE env var instead
- warmup_ratio -> use warmup_step instead. We combined both args by allowing float values in warmup_step.

The following arguments were renamed in TrainingArguments:

- fsdp_min_num_params and fsdp_transformer_layer_cls_to_wrap -> use fsdp_config
- tpu_metrics_debug -> debug
- push_to_hub_token -> hub_token
- push_to_hub_model_id and push_to_hub_organization -> hub_model_id
- include_inputs_for_metrics -> include_for_metrics
- per_gpu_train_batch_size -> per_device_train_batch_size
- per_gpu_eval_batch_size -> per_device_eval_batch_size
- use_mps_device -> mps will be used by default if detected
- fp16_backend and half_precision_backend -> we will only rely on torch.amp as everything has been upstreamed to torch
- no_cuda -> use_cpu
- include_tokens_per_second -> include_num_input_tokens_seen
- use_legacy_prediction_loop -> we only use the evaluation_loop function from now on

Changes in Trainer:

- tokenizer in initialization -> processing_class
- model_path in train() -> resume_from_checkpoint
- During training, use_cache in the model config will be set to False. You can still change the cache value through the TrainingArguments use_cache argument if needed.

The question-answering pipelines and the Text2TextGenerationPipeline, including its related SummarizationPipeline and TranslationPipeline, were deprecated and will now be removed.
pipeline classes are intended as a high-level beginner-friendly API,
but for almost all text-to-text or question-answering tasks a modern chat model and TextGenerationPipeline will provide much higher quality output.
As a result, we felt it was misleading for beginners to offer the older pipelines.
If you were using these pipelines before, try using TextGenerationPipeline with a chat model instead. For example, for summarization:
import torch
from transformers import pipeline
# Any other chat model will also work - if you're low on memory you can use a smaller one
summarizer = pipeline("text-generation", model="Qwen/Qwen3-4B-Instruct-2507")
message_history = [
    {
        "role": "user",
        "content": "Summarize the following text:\n\n[TEXT_TO_SUMMARIZE]"
    }
]
print(summarizer(message_history)[0]["generated_text"][-1]["content"])
The above example can be adapted for other tasks, e.g. translation or question answering, simply by changing the prompt.
Similarly, the image-to-text and visual-question-answering pipelines have been removed. For image captioning or question answering
tasks we recommend using a modern vision-language chat model via the image-text-to-text pipeline. For example:
import torch
from transformers import pipeline
# Any other VLM will also work - if you're low on memory you can use a smaller one
captioner = pipeline("image-text-to-text", model="Qwen/Qwen3-VL-4B-Instruct")
message_history = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "[IMAGE_URL_HERE]",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
print(captioner(message_history)[0]["generated_text"][-1]["content"])
The above example can be adapted for visual question answering simply by asking the question in the prompt.
The image-to-image pipeline has been removed, as it was rarely updated or used. For most image generation tasks, you
probably want 🤗 Diffusers instead!
- Removed organization and repo_url from PushToHubMixin. You must pass a repo_id instead.
- Removed ignore_metadata_errors from PushToHubMixin. In practice, if we ignore errors while loading the model card, we won't be able to push the card back to the Hub, so it's better to fail early and not provide the option to fail later.
- The push_to_hub methods do not accept **kwargs anymore. All accepted parameters are explicitly documented.
- The push_to_hub arguments are now keyword-only to avoid confusion. Only repo_id can be positional since it's the main arg.
- Removed the use_temp_dir argument from push_to_hub. We now use a tmp dir in all cases.

Linked PR: https://github.com/huggingface/transformers/pull/42391.
The previously deprecated transformers-cli ... command has been removed; transformers ... is now the only CLI entry point.
The transformers CLI has been migrated to Typer, making it easier to maintain and adding some nice features out of
the box (improved --help section, autocompletion).
Biggest breaking change is in transformers chat. This command starts a terminal UI to interact with a chat model.
It used to also be able to start a Chat Completion server powered by transformers and chat with it. In this revamped
version, this feature has been removed in favor of transformers serve. The goal of splitting transformers chat
and transformers serve is to define clear boundaries between client and server code. It helps with maintenance
but also makes the commands less bloated. The new signature of transformers chat is:
Usage: transformers chat [OPTIONS] BASE_URL MODEL_ID [GENERATE_FLAGS]...
Chat with a model from the command line.
It works hand in hand with transformers serve, which means that if transformers serve is running on its default endpoint, transformers chat can be launched as follows:
transformers chat HuggingFaceTB/SmolLM3-3B
It can, however, use any OpenAI API-compatible HTTP endpoint:
transformers chat HuggingFaceTB/SmolLM3-3B https://router.huggingface.co/v1
Linked PRs:
run method

The transformers run command (previously transformers-cli run) is an artefact of the past: it was never tested and isn't part of any public documentation. We're removing it for now and ask you to please let us know in case this is a command you are using, in which case we should bring it back with better support.
Linked PR: https://github.com/huggingface/transformers/pull/42447
- TRANSFORMERS_CACHE, PYTORCH_TRANSFORMERS_CACHE, and PYTORCH_PRETRAINED_BERT_CACHE have been removed. Please use HF_HOME instead.
- HUGGINGFACE_CO_EXAMPLES_TELEMETRY, HUGGINGFACE_CO_PREFIX, and HUGGINGFACE_CO_RESOLVE_ENDPOINT have been removed. Please use huggingface_hub.constants.ENDPOINT instead.

Linked PR: https://github.com/huggingface/transformers/pull/42391.
transformers v5 pins the huggingface_hub version to >=1.0.0. See its migration guide to learn more about this major release. Here are the main aspects to know about:

- huggingface_hub switched from requests to httpx. This change was made to improve performance and to support both synchronous and asynchronous requests the same way. If you are currently catching requests.HTTPError errors in your codebase, you'll need to switch to httpx.HTTPError.
- HTTP_PROXY / HTTPS_PROXY environment variables
- hf_transfer and therefore HF_HUB_ENABLE_HF_TRANSFER have been completely dropped in favor of hf_xet. This should be transparent for most users. Please let us know if you notice any downside!
- typer-slim has been added as a required dependency, used to implement both the hf and transformers CLIs.
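For example, a sketch of updating error handling around Hub calls (the surrounding download call is illustrative):

import httpx
from huggingface_hub import hf_hub_download

try:
    path = hf_hub_download("HuggingFaceTB/SmolLM3-3B", "config.json")
except httpx.HTTPError as err:  # previously: requests.HTTPError
    print(f"Hub request failed: {err}")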