docs/source/en/quantization/bitnet.md
BitNet replaces traditional linear layers in Multi-Head Attention and feed-forward networks with specialized BitLinear layers. The BitLinear layers quantize the weights using ternary precision (with values of -1, 0, and 1) and quantize the activations to 8-bit precision.
<figure style="text-align: center;">
  <figcaption>The architecture of BitNet with BitLinear layers.</figcaption>
</figure>

BitNet models can't be quantized on the fly. They must be quantized during pretraining or fine-tuning because BitNet is a Quantization-Aware Training (QAT) technique. During training, the weights are quantized to ternary values with symmetric per-tensor quantization.
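The quantization described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the library's implementation: the function names are hypothetical, and the choice of the mean absolute value as the weight scale and a per-row absmax scale for activations are assumptions that the text above does not spell out.

```python
import numpy as np

def quantize_weights_ternary(w, eps=1e-5):
    # Symmetric per-tensor quantization: one scale for the whole matrix.
    # Assumption: the scale is the mean absolute value of the weights.
    scale = np.mean(np.abs(w)) + eps
    w_q = np.clip(np.round(w / scale), -1, 1).astype(np.int8)
    return w_q, scale

def quantize_activations_int8(x, eps=1e-5):
    # Assumption: absmax scaling per row, mapping the largest magnitude to 127.
    absmax = np.maximum(np.max(np.abs(x), axis=-1, keepdims=True), eps)
    scale = 127.0 / absmax
    x_q = np.clip(np.round(x * scale), -128, 127).astype(np.int8)
    return x_q, scale

def bitlinear_forward(x, w):
    # Hypothetical BitLinear-style forward pass: integer matmul, then rescale.
    w_q, w_scale = quantize_weights_ternary(w)
    x_q, x_scale = quantize_activations_int8(x)
    y = x_q.astype(np.int32) @ w_q.astype(np.int32).T
    return y * w_scale / x_scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)).astype(np.float32)  # (out_features, in_features)
x = rng.normal(size=(2, 8)).astype(np.float32)  # (batch, in_features)

w_q, w_scale = quantize_weights_ternary(w)
x_q, x_scale = quantize_activations_int8(x)
y = bitlinear_forward(x, w)
```

Every entry of `w_q` is -1, 0, or 1, and `y` approximates the full-precision product `x @ w.T` only coarsely, which is why the model has to be trained with the quantization in place.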
Refer to this PR to pretrain or fine-tune a 1.58-bit model with Nanotron. For fine-tuning, first convert the model from the Hugging Face format to the Nanotron format. Find the conversion steps in this PR.
Load a BitNet quantized model with [`~PreTrainedModel.from_pretrained`].
```py
from transformers import AutoModelForCausalLM

path = "/path/to/model"
model = AutoModelForCausalLM.from_pretrained(path, device_map="auto")
```
`@torch.compile` is used to unpack the weights and perform the forward pass. It is straightforward to implement and delivers significant speed improvements. Additional optimized kernels will be integrated in future versions.
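To make the unpacking step concrete, here is a minimal NumPy sketch of how ternary weights could be stored four to a byte and restored before the matrix multiplication. The function names and the exact bit layout (2 bits per weight, with the rows split into four chunks) are assumptions for illustration and may differ from the layout the library uses; in the library, a function of this kind is what gets compiled.

```python
import numpy as np

def pack_ternary(w_q):
    # Hypothetical layout: map {-1, 0, 1} to 2-bit codes {0, 1, 2},
    # then store four row-chunks in the four 2-bit slots of each byte.
    codes = (w_q + 1).astype(np.uint8)
    rows = codes.shape[0]
    assert rows % 4 == 0, "row count must be divisible by 4"
    chunk = rows // 4
    packed = np.zeros((chunk,) + codes.shape[1:], dtype=np.uint8)
    for i in range(4):
        packed |= codes[i * chunk:(i + 1) * chunk] << (2 * i)
    return packed

def unpack_ternary(packed):
    # Reverse of pack_ternary: extract each 2-bit slot and shift back to {-1, 0, 1}.
    chunk = packed.shape[0]
    out = np.empty((4 * chunk,) + packed.shape[1:], dtype=np.int8)
    for i in range(4):
        codes = (packed >> (2 * i)) & 0b11
        out[i * chunk:(i + 1) * chunk] = codes.astype(np.int8) - 1
    return out

rng = np.random.default_rng(0)
w_q = rng.integers(-1, 2, size=(8, 6)).astype(np.int8)  # ternary weights

packed = pack_ternary(w_q)
restored = unpack_ternary(packed)
```

Packing this way stores the weights in a quarter of the bytes an `int8` tensor would need, at the cost of an unpack step on every forward pass, which is the work that compilation speeds up.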
Read *Fine-tuning LLMs to 1.58bit: extreme quantization made easy* to learn more about how BitNet models are trained and fine-tuned.