Matformer allows you to dynamically resize transformer models at runtime, trading quality for reduced compute and memory. This lets you deploy the same model across devices with different resource constraints, from edge devices to powerful GPUs.
```bash
# Run Gemma 3n with the E2.49B configuration (2.49B params instead of 3.98B)
mistralrs run -m google/gemma-3n-E4B-it \
  --matformer-config-path matformer_configs/gemma3n.csv \
  --matformer-slice-name "Config for E2.49B (block-level)"
```
```python
from mistralrs import Runner, Which, MultimodalArchitecture

runner = Runner(
    which=Which.MultimodalPlain(
        model_id="google/gemma-3n-E4B-it",
        arch=MultimodalArchitecture.Gemma3n,
        matformer_config_path="matformer_configs/gemma3n.csv",
        matformer_slice_name="Config for E2.49B (block-level)",
    ),
)
```
```rust
use mistralrs::MultimodalModelBuilder;
use std::path::PathBuf;

let model = MultimodalModelBuilder::new("google/gemma-3n-E4B-it")
    .with_matformer_config_path(PathBuf::from("matformer_configs/gemma3n.csv"))
    .with_matformer_slice_name("Config for E2.49B (block-level)".to_string())
    .build()
    .await?;
```
Matformer models are pre-trained with a special architecture that allows certain layers to be skipped at inference time while maintaining reasonable quality. When you select a "slice", the runtime skips the layers listed in the configuration and uses the (possibly reduced) FFN hidden dimensions it specifies for the remaining layers. For example, the Gemma 3n E2.49B (block-level) slice keeps all 35 layers but shrinks the FFN hidden dimension in a subset of them, cutting the effective parameter count from 3.98B to 2.49B.
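As a mental model only (this is not how mistral.rs is implemented internally), applying a slice amounts to dropping the skipped layers and overriding the per-layer FFN width. The hypothetical `apply_slice` below sketches that idea:

```python
# Conceptual sketch of what selecting a Matformer slice does.
# `LayerCfg` and `apply_slice` are illustrative, not mistral.rs APIs.
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class LayerCfg:
    index: int
    ffn_hidden_dim: int

def apply_slice(layers, skipped, ffn_dims):
    """Drop skipped layers and override FFN widths for the kept ones."""
    kept = [layer for layer in layers if layer.index not in set(skipped)]
    return [replace(layer, ffn_hidden_dim=d) for layer, d in zip(kept, ffn_dims)]

full = [LayerCfg(i, 16384) for i in range(35)]                 # the full E4B stack
sliced = apply_slice(full, skipped=[], ffn_dims=[8192] * 35)   # illustrative block-level slice
print(len(sliced), sliced[0].ffn_hidden_dim)                   # 35 layers, narrower FFNs
```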
Matformer configurations are CSV files with these columns:
```csv
name,# Layers,# Effective Params (B),MMLU PT accuracy,FFN Hidden Dims,Layers Skipped
Main model,35,3.98,62.30%,"[16384, 16384, ...]",
Config for E2.49B (block-level),35,2.49,54.50%,"[8192, 8192, ..., 16384, 16384, ..., 8192, 8192, ...]",
```
The `name` column is the value you pass as `matformer_slice_name`.

Currently supported:
- Gemma 3n (`google/gemma-3n-E4B-it`) - multimodal model with vision and audio

See `matformer_configs/` for available configurations.
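Because `matformer_slice_name` must match a `name` entry exactly, it can be handy to list the names a config file defines. A minimal sketch, assuming only the CSV layout shown above:

```python
# List the slice names defined in a Matformer config CSV.
# Adjust the path to whichever config file you are using.
import csv

with open("matformer_configs/gemma3n.csv", newline="") as f:
    for row in csv.DictReader(f):
        print(repr(row["name"]), "-", row["# Effective Params (B)"], "B effective params")
```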
Memory usage scales approximately with the effective parameter count, and speed improves roughly in line with the number of active layers. Accuracy drops as the slice shrinks; on the MMLU benchmark, for example, the full 3.98B configuration scores 62.3% versus 54.5% for the E2.49B slice (see the CSV above). Choose a slice based on your memory budget, latency target, and how much quality loss you can tolerate.
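As a rough illustration of the memory scaling (weights only, ignoring KV cache, activations, and framework overhead), multiply the effective parameter count by the bytes per parameter:

```python
# Back-of-the-envelope weight memory for the two Gemma 3n configurations.
BYTES_PER_PARAM = 2  # bf16/f16 weights

for name, params_billion in [("Main model (3.98B)", 3.98), ("E2.49B slice", 2.49)]:
    weight_gb = params_billion * BYTES_PER_PARAM  # billions of params * bytes/param = GB
    print(f"{name}: ~{weight_gb:.1f} GB of weights")
```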
Combine Matformer with ISQ for maximum efficiency:
```python
from mistralrs import Runner, Which, MultimodalArchitecture

runner = Runner(
    which=Which.MultimodalPlain(
        model_id="google/gemma-3n-E4B-it",
        arch=MultimodalArchitecture.Gemma3n,
        matformer_config_path="matformer_configs/gemma3n.csv",
        matformer_slice_name="Config for E2.49B (block-level)",
    ),
    in_situ_quant="Q4K",  # 4-bit quantization
)
```
Matformer works seamlessly with automatic device mapping:
```rust
use mistralrs::{MultimodalModelBuilder, DeviceMapSetting, AutoDeviceMapParams};
use std::path::PathBuf;

let model = MultimodalModelBuilder::new("google/gemma-3n-E4B-it")
    .with_matformer_config_path(PathBuf::from("matformer_configs/gemma3n.csv"))
    .with_matformer_slice_name("Config for E2.49B (block-level)".to_string())
    .with_device_mapping(DeviceMapSetting::Auto(
        AutoDeviceMapParams::default_multimodal()
    ))
    .build()
    .await?;
```
Only active layers are loaded to GPU, saving memory.
To create your own Matformer configuration, write a CSV with the columns shown above, choosing which layers to skip and the FFN hidden dimension to use for each remaining layer. Example minimal configuration:
```csv
name,# Layers,# Effective Params (B),FFN Hidden Dims,Layers Skipped
Tiny,15,0.8,"[4096, 4096, ...]","[5,6,7,10,11,12,15,16,17,20,21,22,25,26,27,30,31,32,33,34]"
```
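If you generate configurations from a script, writing the CSV with the standard library keeps the quoting of the list-valued columns correct. A sketch, mirroring the example row above (the values are illustrative only):

```python
# Write a custom Matformer configuration CSV with the column names used above.
import csv

fields = ["name", "# Layers", "# Effective Params (B)", "FFN Hidden Dims", "Layers Skipped"]
rows = [{
    "name": "Tiny",
    "# Layers": 15,
    "# Effective Params (B)": 0.8,
    # "..." stands for the remaining per-layer values, as in the example above.
    "FFN Hidden Dims": "[4096, 4096, ...]",
    "Layers Skipped": "[5,6,7,10,11,12,15,16,17,20,21,22,25,26,27,30,31,32,33,34]",
}]

with open("matformer_configs/custom.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fields)
    writer.writeheader()
    writer.writerows(rows)
```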
CLI options:
- `--matformer-config-path PATH`: Path to the CSV configuration file
- `--matformer-slice-name NAME`: Exact name of the slice from the CSV

Python API:

```python
Which.MultimodalPlain(
    model_id: str,
    arch: MultimodalArchitecture,
    matformer_config_path: str = None,  # Path to CSV
    matformer_slice_name: str = None,   # Slice name
    # ... other parameters
)
```
```rust
// For MultimodalModelBuilder
.with_matformer_config_path(path: PathBuf)
.with_matformer_slice_name(name: String)

// For TextModelBuilder (when supported)
.with_matformer_config_path(path: PathBuf)
.with_matformer_slice_name(name: String)
```
"Matformer slice 'X' not found"
"Layers X and Y are reserved and cannot be skipped"
Memory not reduced as expected
Enable logging to see Matformer details:
```bash
RUST_LOG=mistralrs_core=info mistralrs ...
```

This prints the Matformer configuration details as the model loads.