docs/source/en/testing.md
The Transformers test suite uses a mixin-based architecture to auto-generate 100+ tests from minimal code. You write a small amount of model-specific code, and the mixins handle save/load, generation, pipelines, training, and tensor parallelism.
Run your model's tests with the following commands.
```bash
# run your model's tests
pytest tests/models/mymodel/test_modeling_mymodel.py -v

# run a specific test
pytest tests/models/mymodel/test_modeling_mymodel.py::MyModelTest::test_model

# run tests matching a keyword pattern (useful to run all integration tests)
pytest tests/models/mymodel/ -k integration -v

# include slow integration tests
RUN_SLOW=1 pytest tests/models/mymodel/ -v
```
The Hugging Face CI runs model tests without `@slow` on every pull request, and slow tests run on a nightly schedule (see Pull request checks for what the CI validates).
`CausalLMModelTest` is the recommended base class for testing causal language models. It inherits from five test mixins and auto-generates tests for save/load, generation, pipelines, training, and tensor parallelism.
```python
import unittest

from transformers.testing_utils import require_torch
from transformers import is_torch_available

from ...causal_lm_tester import CausalLMModelTest, CausalLMModelTester

if is_torch_available():
    from transformers import MyModel


class MyModelTester(CausalLMModelTester):
    if is_torch_available():
        base_model_class = MyModel


@require_torch
class MyModelTest(CausalLMModelTest, unittest.TestCase):
    model_tester_class = MyModelTester
```
These two classes give full test coverage for `MyModel` and all its head classes (`MyModelForCausalLM`, `MyModelForSequenceClassification`, etc.). See tests/models/llama/test_modeling_llama.py for a real example.
`CausalLMModelTester` only requires `base_model_class`. The tester strips the `Model` suffix to get a base name (`LlamaModel` becomes `Llama`), then appends suffixes like `Config` or `ForCausalLM` to discover related classes. If a class doesn't exist in the module, the attribute stays `None` and the tester skips the corresponding tests.
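The discovery logic can be sketched in plain Python. This is a simplified illustration of the naming convention, not the actual tester code; the module contents and function name below are hypothetical.

```python
import types

# Hypothetical module with some, but not all, related classes defined.
fake_module = types.SimpleNamespace(
    LlamaModel=object,
    LlamaConfig=object,
    LlamaForCausalLM=object,
    # LlamaForSequenceClassification is intentionally missing
)


def discover_related(base_class_name, module):
    # strip the "Model" suffix: "LlamaModel" -> "Llama"
    base = base_class_name.removesuffix("Model")
    suffixes = ("Config", "ForCausalLM", "ForSequenceClassification")
    # missing classes resolve to None, and their tests are skipped
    return {s: getattr(module, base + s, None) for s in suffixes}


related = discover_related("LlamaModel", fake_module)
print(related["ForCausalLM"] is not None)    # True
print(related["ForSequenceClassification"])  # None
```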
If your model doesn't follow standard naming, or you need to customize behavior, override attributes on the tester or test class.
```python
class MyModelTester(CausalLMModelTester):
    if is_torch_available():
        base_model_class = MyModel
        # override if the class name doesn't follow the convention
        causal_lm_class = MyCustomCausalLM


@require_torch
class MyModelTest(CausalLMModelTest, unittest.TestCase):
    model_tester_class = MyModelTester
    # disable embedding resize tests
    test_resize_embeddings = False
```
For models that need custom constructor parameters on the tester, override `__init__` and call `super().__init__(parent=parent)` before setting extra attributes. See tests/models/youtu/test_modeling_youtu.py for a real example.
```python
class YoutuModelTester(CausalLMModelTester):
    if is_torch_available():
        base_model_class = YoutuModel

    def __init__(self, parent, kv_lora_rank=16, q_lora_rank=32):
        super().__init__(parent=parent)
        self.kv_lora_rank = kv_lora_rank
        self.q_lora_rank = q_lora_rank
```
`VLMModelTest` is the base class for vision-language models. It inherits from three mixins (`ModelTesterMixin`, `GenerationTesterMixin`, `PipelineTesterMixin`) and sets `_is_composite = True` to handle multiple sub-models.
```python
import unittest

from transformers.testing_utils import require_torch
from transformers import is_torch_available

from ...vlm_tester import VLMModelTest, VLMModelTester

if is_torch_available():
    from transformers import (
        MyVLMConfig,
        MyVLMModel,
        MyVLMTextConfig,
        MyVLMVisionConfig,
        MyVLMForConditionalGeneration,
    )


class MyVLMTester(VLMModelTester):
    if is_torch_available():
        base_model_class = MyVLMModel
        config_class = MyVLMConfig
        text_config_class = MyVLMTextConfig
        vision_config_class = MyVLMVisionConfig
        conditional_generation_class = MyVLMForConditionalGeneration


@require_torch
class MyVLMTest(VLMModelTest, unittest.TestCase):
    model_tester_class = MyVLMTester
```
When the VLM needs custom vision parameters or non-default config values, override `__init__`. Set defaults with `setdefault` before calling `super().__init__(parent, **kwargs)`. The example below shows the first few defaults from tests/models/qianfan_ocr/test_modeling_qianfan_ocr.py.
```python
class QianfanOCRVisionText2TextModelTester(VLMModelTester):
    base_model_class = QianfanOCRModel
    config_class = QianfanOCRConfig
    text_config_class = Qwen3Config
    vision_config_class = QianfanOCRVisionConfig
    conditional_generation_class = QianfanOCRForConditionalGeneration

    def __init__(self, parent, **kwargs):
        kwargs.setdefault("image_token_id", 1)
        kwargs.setdefault("image_size", 32)
        kwargs.setdefault("patch_size", 4)
        kwargs.setdefault("num_channels", 3)
        # ... more defaults
        super().__init__(parent, **kwargs)
```
VLM tests differ from `CausalLMModelTest` in a few ways.

- The tester sets `config_class`, `text_config_class`, `vision_config_class`, and `conditional_generation_class` in addition to `base_model_class`.
- `VLMModelTest` doesn't include `TrainingTesterMixin` or `TensorParallelTesterMixin`.
- `__init__` accepts vision parameters (`image_size`, `patch_size`, `num_channels`, `num_image_tokens`) from `**kwargs` and `setdefault()`.
- `ConfigTester` uses `has_text_modality=False` because the top-level config is a composite config rather than a text model config.

For encoder-only, encoder-decoder, audio, or other non-standard architectures, build the test infrastructure directly from the two-class pattern and test mixins described below.
Every model test file follows the same structure.

- `ModelTester` (plain class) creates tiny configs and dummy inputs for testing, and can also hold small regression tests specific to the model.
- `ModelTest` (`unittest.TestCase` + mixins) inherits auto-generated tests and runs them against every model variant.

`ModelTest` calls `prepare_config_and_inputs_for_common()` on the tester to get a `(config, inputs_dict)` tuple. All mixins rely on `prepare_config_and_inputs_for_common()` for test data.
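Stripped of framework details, the contract between the two classes looks roughly like this. This is a framework-free sketch with illustrative names; real testers return a config object and torch tensors.

```python
class TinyConfig:
    """Stand-in for a real model config with tiny dimensions."""

    def __init__(self, hidden_size=32, num_hidden_layers=2, vocab_size=99):
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.vocab_size = vocab_size


class MyModelTester:
    def __init__(self, parent, batch_size=2, seq_length=7):
        self.parent = parent  # the unittest.TestCase, used for assertions
        self.batch_size = batch_size
        self.seq_length = seq_length

    def get_config(self):
        return TinyConfig()

    def prepare_config_and_inputs_for_common(self):
        config = self.get_config()
        # real testers build torch tensors here (ids_tensor, etc.)
        inputs_dict = {
            "input_ids": [[1] * self.seq_length for _ in range(self.batch_size)]
        }
        return config, inputs_dict


config, inputs_dict = MyModelTester(parent=None).prepare_config_and_inputs_for_common()
print(config.hidden_size, len(inputs_dict["input_ids"]))  # 32 2
```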
Pick the mixins your model needs.
| Mixin | Source file | What it tests |
|---|---|---|
| `ModelTesterMixin` | tests/test_modeling_common.py | Save/load, gradient checkpointing, forward signature, common attributes |
| `GenerationTesterMixin` | tests/generation/test_utils.py | Greedy, sampling, beam search, assisted decoding |
| `PipelineTesterMixin` | tests/test_pipeline_mixin.py | One test per pipeline task |
| `TrainingTesterMixin` | tests/test_training_mixin.py | Overfitting on a small batch |
| `TensorParallelTesterMixin` | tests/test_tensor_parallel_mixin.py | Distributed tensor parallelism |
See tests/models/modernbert/test_modeling_modernbert.py for a complete working example. The key steps are outlined below.
The `ModelTester` class builds tiny configs and dummy inputs. Keep dimensions small so tests finish in seconds on CPU. Use the three tensor helpers below to build inputs.

- `ids_tensor(shape, vocab_size)`: Random integer tensor in `[0, vocab_size)`. Use for `input_ids`, `token_type_ids`, and label tensors.
- `random_attention_mask(shape)`: Binary tensor (0s and 1s) where the first token is always 1. Use for `attention_mask`.
- `floats_tensor(shape, scale=1.0)`: Random float tensor. Use for continuous inputs like `pixel_values` or `inputs_embeds`.

The tester must implement `get_config()`, `prepare_config_and_inputs()`, and `prepare_config_and_inputs_for_common()`. Add `create_and_check_*` methods for each task head (base model, sequence classification, token classification, etc.).
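The real helpers live in tests/test_modeling_common.py and return torch tensors. The pure-Python sketch below only mimics their shapes and value ranges to show what each one produces; it is not the actual implementation.

```python
import random


def ids_tensor(shape, vocab_size):
    # random integers in [0, vocab_size), shaped (batch, seq_len)
    batch, seq_len = shape
    return [[random.randrange(vocab_size) for _ in range(seq_len)] for _ in range(batch)]


def random_attention_mask(shape):
    # random 0/1 mask with the first position forced to 1
    batch, seq_len = shape
    mask = [[random.randint(0, 1) for _ in range(seq_len)] for _ in range(batch)]
    for row in mask:
        row[0] = 1
    return mask


def floats_tensor(shape, scale=1.0):
    # random floats in [0, scale)
    batch, seq_len = shape
    return [[random.random() * scale for _ in range(seq_len)] for _ in range(batch)]


input_ids = ids_tensor((2, 7), vocab_size=99)
attention_mask = random_attention_mask((2, 7))
pixel_like = floats_tensor((2, 7), scale=2.0)
```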
Inherit from the mixins your model needs, set `all_model_classes` and `pipeline_model_mapping`, and define `setUp()`. Write `test_*` methods that delegate to the tester's `create_and_check_*` methods.
For each task head, add a `create_and_check_*` method on the tester that instantiates the model, runs a forward pass, and asserts output shapes. Then add a corresponding `test_*` method on the test class.
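The shape of that pattern can be shown with a toy stand-in instead of a real transformers model. Everything here is illustrative; a real tester would build a config with `get_config()` and instantiate an actual model class.

```python
import unittest


class ToyModel:
    """Stand-in model: returns one hidden vector per input token."""

    def __init__(self, config):
        self.hidden_size = config["hidden_size"]

    def forward(self, input_ids):
        return [[[0.0] * self.hidden_size for _ in row] for row in input_ids]


class MyModelTester:
    def __init__(self, parent, batch_size=2, seq_length=7, hidden_size=32):
        self.parent = parent
        self.batch_size = batch_size
        self.seq_length = seq_length
        self.hidden_size = hidden_size

    def create_and_check_model(self, config, input_ids):
        # instantiate, run a forward pass, assert the output shape
        model = ToyModel(config)
        output = model.forward(input_ids)
        self.parent.assertEqual(
            (len(output), len(output[0]), len(output[0][0])),
            (self.batch_size, self.seq_length, self.hidden_size),
        )


class MyModelTest(unittest.TestCase):
    def setUp(self):
        self.model_tester = MyModelTester(self)

    def test_model(self):
        # a test_* method delegates to the tester's create_and_check_* method
        config = {"hidden_size": self.model_tester.hidden_size}
        input_ids = [
            [1] * self.model_tester.seq_length
            for _ in range(self.model_tester.batch_size)
        ]
        self.model_tester.create_and_check_model(config, input_ids)
```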
Test files live in tests/models/mymodel/ following the structure shown below.
```text
tests/models/mymodel/
├── __init__.py
├── test_modeling_mymodel.py             # model tests (required)
├── test_tokenization_mymodel.py         # tokenizer tests (if custom tokenizer)
├── test_image_processing_mymodel.py     # image processor tests (if vision model)
├── test_feature_extraction_mymodel.py   # feature extractor tests (if audio/speech model)
└── test_processing_mymodel.py           # processor tests (if multimodal)
```
Tokenizer tests follow the same pattern. Inherit `TokenizerTesterMixin` from tests/test_tokenization_common.py, set a few attributes, and get auto-generated tests. See tests/models/llama/test_tokenization_llama.py for an example.
`ConfigTester` verifies that a config class handles serialization, save/load, and standard properties correctly. `CausalLMModelTest` and `VLMModelTest` include config tests automatically. For the general path with `ModelTester` and `ModelTest`, define the config tester manually in `setUp()`.
```python
from tests.test_configuration_common import ConfigTester


def setUp(self):
    self.config_tester = ConfigTester(self, config_class=MyModelConfig, hidden_size=32)


def test_config(self):
    self.config_tester.run_common_tests()
```
`run_common_tests()` runs several checks.

- Checks that common properties like `hidden_size`, `num_attention_heads`, and `num_hidden_layers` exist (and `vocab_size` if `has_text_modality=True`).
- Round-trips the config through `to_json_string()` and `to_json_file()`.
- Round-trips the config through `save_pretrained()` and `from_pretrained()`.
- Checks `id2label` and `label2id` consistency.
- Sets arguments like `output_hidden_states` and confirms they're stored correctly.

Pass `has_text_modality=False` for vision-only models that lack `vocab_size`, and pass extra `**kwargs` to override config defaults.
```python
self.config_tester = ConfigTester(
    self, config_class=MyVisionConfig, has_text_modality=False, hidden_size=64
)
```
Mixin tests use tiny configs with random weights to verify model behavior quickly. Integration tests run inference with real pretrained weights to validate output correctness. Tiny models on the Hub are small enough for fast CI, but structured like real checkpoints.
Place integration tests in a separate test class and mark them with `@slow`. Each test downloads real weights, runs inference, and checks outputs against expected values. Call `cleanup(torch_device, gc_collect=False)` in `setUp` and `tearDown` to avoid memory residuals.
```python
import unittest

import torch

from transformers import AutoTokenizer
from transformers.testing_utils import cleanup, require_torch, slow, torch_device


class MyModelIntegrationTest(unittest.TestCase):
    def setUp(self):
        cleanup(torch_device, gc_collect=False)

    def tearDown(self):
        cleanup(torch_device, gc_collect=False)

    @slow
    @require_torch
    def test_inference(self):
        model = MyModelForCausalLM.from_pretrained("myorg/mymodel-base").to(torch_device)
        tokenizer = AutoTokenizer.from_pretrained("myorg/mymodel-base")
        inputs = tokenizer("Hello, world", return_tensors="pt").to(torch_device)

        with torch.no_grad():
            outputs = model(**inputs)

        # check against expected values
        expected_slice = torch.tensor([[-0.1234, 0.5678, -0.9012]])
        torch.testing.assert_close(outputs.logits[0, :1, :3], expected_slice, atol=1e-4, rtol=1e-4)
```
Mark any test with `@slow` if it downloads weights, loads a large dataset, or takes more than a few seconds. The pull request CI skips slow tests, but the nightly schedule runs them.
Use `do_sample=False` in generation tests so the output is deterministic across runs and hardware. For Mixture-of-Experts models, also call `model.set_experts_implementation("eager")` before generating to force a stable expert dispatch path. Without it, small numerical differences in the router can flip which expert handles a token and change the output.
```python
@slow
@require_torch
def test_generate(self):
    model = MyModelForCausalLM.from_pretrained("myorg/mymodel-base").to(torch_device)
    tokenizer = AutoTokenizer.from_pretrained("myorg/mymodel-base")
    inputs = tokenizer("Hello, world", return_tensors="pt").to(torch_device)

    # model.set_experts_implementation("eager")  # uncomment for MoE models
    generated_ids = model.generate(**inputs, max_new_tokens=20, do_sample=False)
    output = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
    self.assertEqual(output, ["Hello, world! This is the expected continuation..."])
```
Transformers CI runs slow tests on an NVIDIA A10. Numerical results can vary slightly across GPU generations, so integration tests use the `Expectations` class to register per-device expected values. `Expectations` picks the best match for the current hardware based on `(device_type, (major, minor))` SM keys, and falls back to a default when nothing matches.
Run `torch.cuda.get_device_capability()` to print your local SM version (e.g. `(8, 6)` for an A10, `(9, 0)` for an H100).
```python
from transformers.testing_utils import Expectations

expected_texts = Expectations(
    {
        ("cuda", (8, 6)): ["Hello, world! This is the A10 continuation..."],
        ("cuda", (9, 0)): ["Hello, world! This is the H100 continuation..."],
    }
).get_expectation()
self.assertEqual(output, expected_texts)
```
Tiny models with random weights live on the Hub under the `hf-internal-testing` organization. Pipeline tests rely on tiny models when they need a Hub-hosted checkpoint but don't care about output quality. Fast smoke tests also load tiny models to verify forward pass shapes without downloading large checkpoints.
Tiny models are a last resort for integration tests. Only use them when the smallest available checkpoint exceeds ~24 GB of VRAM. Use original pretrained weights when possible to catch real numerical regressions.
The utils/create_dummy_models.py script generates tiny models from `ModelTester.get_config()`. The script extracts tiny hyperparameters from your tester, builds a model with random weights, and uploads the result to the Hub.
Generate tiny models locally.

```bash
python utils/create_dummy_models.py output_dir -m your_model_type
```

Upload them to the Hub.

```bash
python utils/create_dummy_models.py output_dir -m your_model_type --upload --organization hf-internal-testing
```
Each model uses the name `hf-internal-testing/tiny-random-{ModelClassName}` and gets recorded in tests/utils/tiny_model_summary.json. A CI workflow (.github/workflows/check_tiny_models.yml) regenerates tiny models daily.
Boolean flags on `ModelTesterMixin` toggle auto-generated tests. Override any flag on your test class to enable or disable specific checks.
```python
class MyModelTest(CausalLMModelTest, unittest.TestCase):
    model_tester_class = MyModelTester
    test_resize_embeddings = False
    test_all_params_have_gradient = False  # when not all parameters are activated in every forward pass
```
| Flag | Default | What it controls |
|---|---|---|
| `test_resize_embeddings` | `True` | Embedding layer resizing |
| `test_resize_position_embeddings` | `False` | Position embedding resizing |
| `test_mismatched_shapes` | `True` | Mismatched input/output shape handling |
| `test_missing_keys` | `True` | Missing key warnings on load |
| `test_torch_exportable` | `True` | `torch.export` compatibility |
| `test_all_params_have_gradient` | `True` | All parameters receive gradients (set `False` when not all parameters are activated in every forward pass, such as MoE experts) |
| `is_encoder_decoder` | `False` | Encoder-decoder specific tests |
| `has_attentions` | `True` | Attention output tests |
| `_is_composite` | `False` | Composite/multimodal model handling |
| `model_split_percents` | `[0.5, 0.7, 0.9]` | Split percentages for model parallelism tests |