packages/training/scripts/quantization/README.md
This directory holds the post-training quantization passes used to shrink the fine-tuned Qwen Eliza checkpoints before they leave the training rig. Each method is a self-contained CLI; they are independent and can be combined or compared on the same fine-tuned checkpoint.
Reference: Caio Vicentino, PolarQuant: Optimal Gaussian Weight
Quantization via Hadamard Rotation for LLM Compression, arXiv:2603.29078
(March 2026). The arXiv PDF was withdrawn for errata fixes; the
implementation we run against lives in
caiovicentino/eoq-quantization
@ commit 15a12160245d7d3015290c6c5b6dbb7f22094d5e. The two source
files we depend on are vendored under polarquant/ (see
polarquant/LICENSE.md for the upstream license situation — currently
no LICENSE file in the upstream repo; vendored under a documented
research-citation arrangement that must be revisited if the upstream
project publishes a restrictive license).
For each linear's weight tensor, group the values into power-of-two
blocks (default 128). Per block: extract the L2 norm and store it as
fp16; normalize the block to the unit hypersphere; multiply by a
Walsh–Hadamard matrix so that each coordinate is an i.i.d. draw from
roughly N(0, 1/√d); quantize each coordinate against the Lloyd–Max
optimal centroids for N(0, 1) (precomputed via the iterative
conditional-expectation algorithm in polarquant/polar_quant.py),
storing the centroid index as int8; optionally append a 1-bit QJL
residual sign per block as a cheap correction. At inference, undo
each step: lookup centroid → divide by √d → inverse Hadamard → rescale
by per-block norm. The Hadamard rotation alone accounts for ~98% of
the perplexity recovery vs absmax (paper, Ablation §4.2). PolarQuant
is data-free — no calibration set is required — because the
post-rotation distribution is analytically Gaussian.
nn.Linear level.nn.Linear matrix; the win shows up when the codes-only
payload is loaded by an INT4 inference kernel. The sidecar
polarquant_artifacts.safetensors we emit is the artifact for
that downstream path. The vendored compute kernel is a pure-Python
PyTorch loop over linears (no Triton, no fused kernel), so the
quantization step itself is slow on big models — it costs O(N)
Hadamard products at quantization time, then is free at inference.The vendored kernel runs on any model that exposes its weights as
nn.Linear modules. We have explicitly verified it on:
Qwen2ForCausalLM)Qwen3ForCausalLM)The active targets Qwen3.5-{0.8B, 2B, 4B} ship as multimodal
hybrid-attention checkpoints (Qwen3_5ForConditionalGeneration).
The text decoder mixes linear_attention and full_attention layers
and adds Mamba-style SSM state, mrope vision embeddings, and a separate
vision encoder. PolarQuant operates on nn.Linear weights, so it
will still quantize the Q/K/V/O and MLP projections inside both
layer types — but:
linear_attention layers carries a few
non-linear state buffers (A_log, D, dt_bias) that are not
nn.Linear. We deliberately skip those (they fall under the
min_numel cutoff or aren't nn.Linear instances) — quantizing
them through PolarQuant's Gaussian assumption would produce
garbage.AutoModelForCausalLM loader will refuse Qwen3.5 since
the architecture is ForConditionalGeneration. The orchestrator
needs to load it via AutoModelForVision2Seq (or the
Qwen3_5ForConditionalGeneration class directly) and walk the
text decoder. We have not done that integration yet — for the
0.8B validation we fall back to Qwen/Qwen3.5-0.8B.--min-numel cutoff, and must be deliberately skipped when that line is
reintroduced.If/when we need to ship PolarQuant on a Qwen3.5 checkpoint, the
fix is: load with the right AutoModel* class, expose the text
decoder via model.language_model (or whichever attribute carries the
text tower), then call quantize_checkpoint against that submodule.
The rest of the kernel does not need to change.
Quantize a fine-tuned 2B checkpoint:
uv run python scripts/quantization/polarquant_apply.py \
--model checkpoints/qwen35-2b-eliza/final \
--calibration data/final/val.jsonl \
--calibration-samples 128 \
--output checkpoints/qwen35-2b-eliza/final-polarquant
The --calibration* flags are accepted for parity with the rest of
the quantization pipeline but are unused — PolarQuant is data-free
per the paper. Passing a non-existent calibration path still errors so
that misconfigured callers fail loudly.
Useful knobs:
--bits {2,3,4,5,6} (default 4). Q5 is the paper's near-lossless
point; Q4 is the practical default for INT4 inference kernels.--block-size N (default 128, must be a power of 2).--no-qjl to drop the 1-bit residual correction (slightly worse
PPL, slightly smaller sidecar).--no-artifacts to skip writing the sidecar codes; useful when you
only want the reconstructed fp16 model.--include-lm-head / --include-embedding to override the default
skip behavior. Quantizing the LM head on weight-tied models is
redundant; quantizing embedding tables hurts logits because they're
looked up rather than multiplied.scripts/quantization/test_polarquant.py runs the round-trip on
Qwen/Qwen3.5-0.8B (the closest text-only causal-LM stand-in for
Qwen/Qwen3.5-0.8B — see caveat above), using 5 native JSON-shaped samples
from data/final/val.jsonl. It asserts (a) the codes-only payload is
at least 30% smaller than the fp16 baseline checkpoint and (b) the
quantized model produces non-degenerate text on every sample.
uv run python scripts/quantization/test_polarquant.py
Reference: Amir Zandieh, Majid Daliri, Majid Hadian, Vahab Mirrokni
(Google Research / Google DeepMind / NYU), TurboQuant: Online Vector
Quantization with Near-Optimal Distortion Rate, arXiv:2504.19874,
ICLR 2026. Implementation:
vivekvar-dl/turboquant
shipped to PyPI as turbokv v0.1.0 (import name: turboquant). Pinned
in pyproject.toml under [project.optional-dependencies] train.
TurboQuant is an online, data-oblivious KV-cache quantizer. It runs
at inference time and replaces the standard fp16 DynamicCache with
a TurboQuantCache that stores keys and values in 2- or 4-bit
packed form per coordinate.
It is not a weight quantizer. The model.safetensors file does
not change. turboquant_apply.py therefore writes the merged base
model unchanged and adds a turboquant.json sidecar that records the
quantizer config (nbits, base_seed, calibrated skip_layers,
residual_length) so a downstream loader can reconstruct the cache
deterministically.
For each cached vector x of length head_dim (per attention
head, per token): (1) extract and store ||x|| in bf16; (2) rotate
the unit vector by a fixed-per-layer random orthogonal matrix Π drawn
from SO(d) via QR-decomposition with sign-fix — after rotation each
coordinate is approximately Beta-distributed (≈ N(0, 1/d) for large
d); (3) bucketize each coordinate against the precomputed Lloyd-Max
optimal centroids for the Beta distribution at b bits; (4) bit-pack
the indices into uint8 (2 indices/byte at 4-bit, 4 indices/byte at
2-bit). At attention time, unpack → centroid lookup → inverse rotation
x ≈ ||x|| · ΠᵀC[i]. The recent residual_length tokens stay in
fp16 to keep the freshly-generated context lossless.
model.generate via
past_key_values=cache. Works across architectures (Qwen, Llama,
Gemma, Phi) without per-model code paths. Information-theoretic
near-optimal: paper proves the rate is within ~2.7× the per-channel
Shannon-Bennett lower bound.DynamicCache
(66.8 → 12.2 tok/s). The TurboQuant paper claims faster runtime than
the bf16 baseline because it ships Triton kernels; those are not
in the turbokv 0.1.0 PyPI release we depend on. Until upstream
ships Triton, this method is a memory win, not a speed win.num_hidden_layers × num_kv_heads × head_dim × context_length.TurboQuantCache materializes a TurboQuantLayer per full-attention
layer reported by the model config. Verified locally against:
Qwen3ForCausalLM) — single-mode full attention. Works.Should work, by structural inspection, on:
TurboQuantLayer already handles.The active targets Qwen3.5-{0.8B, 2B, 4B} are **hybrid linear-attention
Qwen3_5ForConditionalGeneration). The
text decoder declares per-layer layer_types: most layers are
linear_attention (Gated DeltaNet — recurrent state, no (B, H, T,
D) KV cache) and only a minority (typically every 4th layer) are
full_attention. TurboQuant is only meaningful for the full-attention
layers — there is nothing to quantize in a recurrent state. Concretely
on Qwen3.5-2B, 6 of 24 layers are full attention; the analytic ceiling
on KV reduction is therefore at most ~25% of the standard ratio
applied to those 6 layers. The kv_bytes_per_token_analytic helper in
test_turboquant.py honors layer_types so the reported reduction
factor is correct for hybrid models.The 0.8B target is also a vision-language checkpoint
(Qwen3_5ForConditionalGeneration with vision encoder), so loading it
needs AutoModelForVision2Seq (or the Qwen3_5ForConditionalGeneration
class directly), and TurboQuantCache(model.config, ...) must receive
the text decoder config — model.config.get_text_config(decoder=True).
The cache.py in turbokv 0.1.0 already calls get_text_config when
available, so this should work in principle, but we have not validated
it end-to-end on a Qwen3.5 checkpoint yet — for the 0.8B-class
validation we fall back to Qwen/Qwen3.5-0.8B.
For future dense/MoE variants, TurboQuant is orthogonal to expert routing when the KV cache shape is unchanged. Revalidate that separately before adding those tiers back to the active release line.
Apply to a fine-tuned 2B checkpoint (auto-merges if --model points
to a LoRA adapter):
uv run python scripts/quantization/turboquant_apply.py \
--model checkpoints/qwen35-2b-eliza/final \
--calibration data/final/val.jsonl \
--calibration-samples 128 \
--output checkpoints/qwen35-2b-eliza/final-turboquant
Useful knobs:
--nbits {2,4} (default 4). 4-bit hits ~3.5× KV reduction with
~zero quality loss; 2-bit hits ~6.4× with measurable but small
degradation per the paper.--residual-length N (default 128). Most-recent N tokens stay in
fp16 to keep freshly-generated context lossless.--base-seed N (default 42). Layer i uses seed = base_seed + i.
Pin this to the value recorded in turboquant.json at inference
time; otherwise the rotation matrices will not match and dequant
gives garbage.--norm-threshold f (default 5.0). Calibration skips layers whose
per-token key norms exceed f × median.import json
from turboquant import TurboQuantCache
from transformers import AutoModelForCausalLM, AutoTokenizer
side = json.load(open("checkpoints/qwen35-2b-eliza/final-turboquant/turboquant.json"))
model = AutoModelForCausalLM.from_pretrained(
"checkpoints/qwen35-2b-eliza/final-turboquant",
torch_dtype="bfloat16", device_map="cuda",
)
tok = AutoTokenizer.from_pretrained("checkpoints/qwen35-2b-eliza/final-turboquant")
cache = TurboQuantCache(
model.config,
nbits=side["nbits"],
residual_length=side["residual_length"],
base_seed=side["base_seed"],
skip_layers=set(side["skip_layers"]),
)
out = model.generate(**tok("...", return_tensors="pt").to("cuda"),
past_key_values=cache, max_new_tokens=256)
scripts/quantization/test_turboquant.py runs the round-trip on
Qwen/Qwen3.5-0.8B (closest text-only stand-in for Qwen/Qwen3.5-0.8B
— see caveat above), with 5 native JSON-shaped prompts from
data/final/val.jsonl and a 4096-token long-context probe. It asserts
(a) the per-token KV-cache size shrinks by at least 30% and (b) every
quantized output is non-empty and not degenerate.
uv run python scripts/quantization/test_turboquant.py
The full numeric report is written to
scripts/quantization/turboquant_report.json. Last measured run on
Qwen3.5-0.8B / 5080 (4-bit, skip={0}, 4096-token long context):
| metric | baseline (bf16 DynamicCache) | TurboQuant 4-bit | delta |
|---|---|---|---|
| KV bytes/token | 114,688 | 32,608 | 3.52× / -71.6% |
| Peak VRAM (4096-tok prefill) | 1.783 GB | 1.509 GB | -274 MB |
| Tok/s (5 short prompts × 128 new) | 66.8 | 12.2 | -82% (no Triton) |
Reference: same paper as TurboQuant above (arXiv:2504.19874). Implementation:
fused-turboquant v0.1.0 — a
re-implementation of the TurboQuant scheme in Triton kernels (encode,
decode, fused Q@K^T scoring directly from packed indices). The math is
identical to turbokv 0.1.0 above; the win is throughput.
Pinned in pyproject.toml under [project.optional-dependencies] train
alongside turbokv. The two are kept side-by-side because (a) turbokv is
the architecture-portable fallback when patch_model rejects a model
(non-power-of-2 head_dim, fused QKV, sliding window), and (b) fused-turboquant
needs a working Triton + system-dev-headers stack that turbokv does not.
Apply to a fine-tuned 27B checkpoint (auto-merges if --model points to a
LoRA adapter):
uv run python scripts/quantization/fused_turboquant_apply.py \
--model checkpoints/qwen35-4b-eliza/final \
--output checkpoints/qwen35-4b-eliza/final-fused-turboquant \
--bits 4
The script runs check_model_compatibility(model) first and refuses to
proceed if the architecture isn't supported (logged with a structured
report). After patching it discards the cache, unpatches the model, and
saves the unmodified base weights — fused-TurboQuant is a runtime
concern, the safetensors files are byte-identical to the input.
--calibration / --calibration-samples are accepted for parity with the
other quantizer CLIs but are unused: the Lloyd-Max codebooks and RHT seeds
are data-oblivious.
from quantization.fused_turboquant_vendored.hf import patch_model
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
"checkpoints/qwen35-4b-eliza/final-fused-turboquant",
torch_dtype="bfloat16", device_map="cuda",
)
tok = AutoTokenizer.from_pretrained("checkpoints/qwen35-4b-eliza/final-fused-turboquant")
cache = patch_model(model, bits=4, compress_v=True) # patches model.forward in-place
out = model.generate(**tok("...", return_tensors="pt").to("cuda"),
past_key_values=cache, use_cache=True, max_new_tokens=256)
The Triton-kernel path is more constrained than the pure-PyTorch
turbokv 0.1.0 path. The script's first step is
check_model_compatibility(model) — these are the failure modes:
head_dim must be a power of 2 ∈ {64, 128, 256}. The Randomized
Hadamard Transform in fused_turboquant.kernels.triton_rht is built
around butterfly operations and has no implementation for arbitrary
dims. Verified working on Qwen3 (head_dim=128) and Qwen3.5 text decoder
(head_dim=256).qkv_proj,
c_attn) are rejected — make_fused_attention_forward raises with a
clear error rather than producing garbage.sliding_window set on
the config or attention module are rejected (e.g., Mistral-Sliding,
Gemma-2). The kernel is causal-full-attention only.check_model_compatibility warns when RoPE isn't
detected in config.Qwen3.5*, ``): only the
full-attention layers (typically 1-in-4) are patched; the linear
layers keep their recurrent state. The compatibility checker reports
compatible=True because the Triton path will run, but the savings
scale only with the full-attention layer count. Note: the bonus
Qwen3.5-0.8B run in our local test failed at the baseline generate
step (HF DynamicCache is not the right cache for a hybrid model —
it raises has_previous_state can only be called on LinearAttention layers); fused-TurboQuant is orthogonal to that issue. Hybrid models
need a HybridCache wrapped around CompressedKVCache, which is
upstream work. We document the gap and skip the bonus run.The 5080 Laptop is sm_120 (Blackwell), CUDA 13.0, torch 2.11. Triton
3.6.0 supports this architecture. The build does, however, require Python
development headers to compile its cuda_utils.so shim on first use:
sudo apt install python3.12-dev
If the headers are missing the kernel raises a confusing error nested
inside the smoke test — see the "Last measured run" section below for the
exact failure mode. Without the headers the JIT cannot build and
patch_model(..., verify=True) fails on the very first forward pass.
scripts/quantization/test_fused_turboquant.py runs three paths back-to-
back on Qwen/Qwen3.5-0.8B with 5 prompts × 128 new tokens at a 4096-token
prompt. It writes the full report to
scripts/quantization/fused_turboquant_report.json and asserts:
uv run python scripts/quantization/test_fused_turboquant.py
| path | peak VRAM | tokens/sec | notes |
|---|---|---|---|
| baseline (bf16 DynamicCache) | 1.782 GB | 33.34 tok/s | HF default |
| pure-PyTorch turbokv 0.1.0 | 1.496 GB | 9.59 tok/s | -286 MB vs baseline; 3.48x slower (Python per-step quant/dequant) |
| fused-turboquant 0.1.0 | BLOCKED | BLOCKED | Triton JIT could not compile its cuda_utils.so helper |
The fused path failed at patch_model(..., verify=True) during the first
single-token forward pass. The exact error from
/usr/bin/gcc was:
fatal error: Python.h: No such file or directory
7 | #include <Python.h>
| ^~~~~~~~~~
Fix: install the matching Python development headers system-wide so Triton's CUDA utility helper can compile:
sudo apt install python3.12-dev
After installing, re-run the test — the same script will exercise the
Triton path without modification. The pure-PyTorch numbers above are real
and confirm that turbokv 0.1.0 still delivers the documented memory win
(-19.2% peak VRAM vs baseline at 4096-token prompt) but at a 3.48x
throughput cost; recovering that throughput is exactly what the Triton
kernels in fused-turboquant exist to do.
Reference: Amir Zandieh, Majid Daliri, Insu Han, QJL: 1-Bit Quantized JL
Transform for KV Cache Quantization with Zero Overhead,
arXiv:2406.03482, AAAI 2025
(acm dl).
Implementation: vendored from
amirzandieh/QJL @
648b3641f96b6e95e091217220b94e4739fd4d82 under
scripts/quantization/qjl/ (Apache 2.0 — see
scripts/quantization/qjl/LICENSE and qjl/NOTICE.md). There is no
pyqjl package on PyPI — the QJL kernel is shipped as a CUDA C++
extension that the user has to compile against their local nvcc.
pyproject.toml adds pybind11>=2.12.0 to the train extras so the
template-binding step has its dep; nvcc and the matching
pythonX.Y-dev headers come from the system package manager.
QJL is the K (keys) side companion to TurboQuant's V (values) side.
Both compressors were authored by Amir Zandieh; they are explicitly
designed to compose. The combination delivers ~10x KV-cache compression
at long context: 1-bit per JL-projected key coordinate + 4-bit per value
coordinate, plus a small per-token bf16 norm on each side and an
amortized outlier sketch shared across group_size consecutive tokens.
QJL is not a weight quantizer. model.safetensors is unchanged
after applying. qjl_apply.py writes the merged base model unchanged
plus a qjl_config.json sidecar that records the projection geometry
(projection_dim_per_head, projection_seed, outlier_count_*,
initial_layers_count, group_size, buffer_size, paired
value_bits) so the inference loader can deterministically reconstruct
the JL projection matrix Π and the value-side codebook.
For each cached key vector k of length head_dim (per attention
head, per token): (1) extract ||k|| and store it in bf16; (2)
multiply by a fixed-per-layer Johnson–Lindenstrauss matrix Π ∈
R^{head_dim × s} drawn from N(0, 1) and orthogonalized via a chunked
QR decomposition (QJLSketch.init_rot_dir), giving a sketch s = Πᵀ k of length s = projection_dim_per_head; (3) take the sign of
each sketch coordinate, packing 8 signs into one uint8. To recover an
unbiased estimator of inner products at attention time, the query is
projected through the same Π and the score is reconstructed in a
custom CUDA kernel (qjl_kernel/csrc/qjl_score_kernel.cu). The paper
proves the resulting cosine-similarity estimator has minimal relative
distortion at 1 bit. To handle outlier coordinates (a few head_dim
indices with disproportionately large norms — common on layer 0 in
Llama / Qwen models), the kernel additionally stores a top-k outlier
sketch per group of group_size consecutive tokens, with its own
larger JL projection of dimension dim_outlier (256 for general
layers, 128 for the first initial_layers_count layers). The recent
buffer_size tokens stay in fp16 to preserve freshly-generated
context losslessly.
projection_dim_per_head=256 the K-side ratio
head_dim*2 / (projection_dim/8 + 2) works out to 7.53x for
head_dim=128 (Qwen3.5-0.8B / Llama-3 / Qwen3.5-2B), not the
marketing-headline 16x (which would assume zero norm overhead).
Pushing to projection_dim_per_head=128 recovers ~14.2x at the
cost of attention-score quality. The kernel hard-codes
EMB_DIM 128 in csrc/qjl_quant_kernel.cu (line 7), so other
head_dim values need a kernel rebuild with a different define.
The CUDA kernel was written for Ampere/Hopper; Blackwell (sm_120)
requires the PTX-fallback build flag (see "Build" below).The vendored LlamaAttention_QJL and LlamaDecoderLayer_QJL
modules under scripts/quantization/qjl/ (mirrored from upstream
models/llama3_qjl.py) target the Llama-3 attention shape directly:
q_proj / k_proj / v_proj / o_proj linear projections
plus LlamaRotaryEmbedding and grouped-query attention (the GQA
score kernel cuda_qjl_gqa_score handles
num_attention_heads != num_key_value_heads). Verified upstream on:
run_longbench.py
evaluation set)Should work, by structural match, on:
test_qjl.pyThe active targets Qwen3.5-{0.8B, 2B, 4B} are **hybrid linear-attention
full_attention
layers — there is nothing to compress in a recurrent state. The
qjl_apply.py calibration step honors layer_types and silently
skips linear-attention layers. The on-disk config records
n_full_attention_layers so the inference loader knows which layers
to wrap.The 0.8B / 2B / 4B vision-language variants
(Qwen3_5ForConditionalGeneration) need to be loaded with
AutoModelForVision2Seq and the text decoder extracted via
model.language_model before patching the attention modules. We have
not done that integration yet — test_qjl.py falls back to
Qwen/Qwen3.5-0.8B for the 0.8B-class validation.
The QJL kernel is not pip-installable. Build it once, in place, inside the vendored directory:
cd scripts/quantization/qjl
# Default Ampere/Hopper build:
python setup.py build_ext --inplace
# Blackwell (RTX 50-series, sm_120) — the kernel was not authored
# against this arch; force PTX fallback so it compiles for compute_120
# via the JIT path:
TORCH_CUDA_ARCH_LIST="12.0+PTX" python setup.py build_ext --inplace
Prerequisites (verified missing on the local 5080 dev box at the time this was vendored):
sudo apt install nvidia-cuda-toolkit python3.12-dev
The first installs nvcc (PyTorch wheels ship ptxas and nvrtc but
not the full nvcc driver — the cu13 wheel directory only contains
include/, lib/, no bin/nvcc). The second installs Python.h so
the C++ extension's pybind11 bindings can compile.
Apply to a fine-tuned 2B checkpoint (auto-merges if --model points to
a LoRA adapter):
uv run python scripts/quantization/qjl_apply.py \
--model checkpoints/qwen35-2b-eliza/final \
--calibration data/final/val.jsonl \
--calibration-samples 128 \
--output checkpoints/qwen35-2b-eliza/final-qjl
Apply to a fine-tuned 27B checkpoint (same shape; calibration is single-pass forward and fits in 16 GB only with offload):
uv run python scripts/quantization/qjl_apply.py \
--model checkpoints/qwen35-4b-eliza/final \
--calibration data/final/val.jsonl \
--calibration-samples 128 \
--projection-dim-per-head 256 \
--projection-dim-per-head-initial 512 \
--initial-layers-count 15 \
--outlier-count-general 8 \
--value-bits 4 \
--output checkpoints/qwen35-4b-eliza/final-qjl
Useful knobs:
--key-bits 1 (only). QJL is a 1-bit-per-projected-coord scheme by
construction; the flag exists for forward-compat.--projection-dim-per-head N (default 256). The JL sketch dimension
per attention head. Smaller = more compression, lower attention
score fidelity. Must be byte-aligned (% 8 == 0).--projection-dim-per-head-initial N (default 512). The first
--initial-layers-count layers carry more attention mass; the paper
recommends a larger sketch budget there. Leaving this at 2× the
general dim matches upstream defaults exactly.--initial-layers-count N (default 15, paper's choice for Llama-7B
with 32 layers). For different layer counts, scale ~half.--outlier-count-general K / --outlier-count-initial-layers K
(default 8 each). Top-K head_dim coords promoted to the outlier
sketch per group. The calibration step measures per-layer outlier
norm ratio and records it in the sidecar so the inference loader
can validate this budget.--value-bits {2,4} (default 4). Companion TurboQuant V-side bits
recorded in the sidecar.--group-size N (default 32). Group of consecutive tokens that
share an outlier-coord index table. Larger = smaller per-token
outlier overhead, slightly worse outlier tracking on bursty
activations.--projection-seed N (default 42). PRNG seed for the JL matrix.
Pin this to the value recorded in qjl_config.json at inference
time; otherwise Π won't match and the dequantized cosine scores are
garbage.scripts/quantization/test_qjl.py runs on Qwen/Qwen3.5-0.8B (closest
text-only stand-in for Qwen/Qwen3.5-0.8B — see caveat above):
nvcc or
Python.h is missing it records the exact remediation command
(sudo apt install nvidia-cuda-toolkit python3.12-dev) and skips
the runtime-kernel path.projection_dim ∈ {128, 256, 512} so the report shows the
full size-vs-quality curve at the same K activations.qjl_apply.kv_bytes_per_token_analytic.head_dim*2 / (projection_dim/8 + 2) = 7.53× for head_dim=128, and (b) baseline outputs are
non-degenerate.uv run python scripts/quantization/test_qjl.py
The full numeric report is written to
scripts/quantization/qjl_report.json. Last measured run on
Qwen3.5-0.8B / 5080 (bf16 baseline cache, projection_dim=256, seed=42):
| metric | value | notes |
|---|---|---|
| nvcc present | No | system blocker; sudo apt install nvidia-cuda-toolkit |
| Python.h present | No | system blocker; sudo apt install python3.12-dev |
| QJL CUDA kernel built | No | both blockers above must be fixed first |
| K-side ratio (proj_dim=256, real activations) | 7.53× | head_dim=128, per-token bf16 norm |
| K-side ratio (proj_dim=128, real activations) | 14.22× | smaller sketch — quality tradeoff |
| K-side ratio (proj_dim=512, real activations) | 3.88× | larger sketch — higher fidelity |
| Analytic KV bytes/token (bf16 baseline) | 114,688 B | K + V over 28 full-attention layers |
| Analytic KV bytes/token (QJL-K + TurboQuant-V) | 27,608 B | proj_dim=256 + V 4-bit, group_size=32 |
| Whole-model KV reduction | 4.15× (-75.9%) | K + V combined; matches paper's headline range |
| Baseline tok/s (5 prompts × 128 new) | 28.93 | bf16 DynamicCache baseline |
| Baseline peak VRAM | 3.43 GB | single-prompt generation |
| Baseline output sample | "<think> Okay, let's see. The user provided the terminal output and wants me to check if the task is complete..." | non-degenerate |
nvcc is not on the system. PyTorch's bundled cu13 wheel ships
nvrtc and ptxas but no full nvcc driver, and the
nvidia-cuda-nvcc-cu12 PyPI wheel only ships ptxas (verified). Fix:
sudo apt install nvidia-cuda-toolkit.Python.h is not present (no python3.12-dev package
installed). Fix: sudo apt install python3.12-dev.apt installs above, the recommended build command is
TORCH_CUDA_ARCH_LIST="12.0+PTX" python setup.py build_ext --inplace, which forces the JIT-PTX fallback path; the kernel will
compile to compute_120 via PTX at first load. If the kernel still
fails at runtime on sm_120, the immediate workaround is to run the
pure-PyTorch reference path in test_qjl.py:qjl_pure_pytorch_quantize
for measurement and ship the validated checkpoint to a
Hopper/Ampere host for actual inference.EMB_DIM 128 in
qjl/csrc/qjl_quant_kernel.cu:7. The code only works for
head_dim == 128 out of the box. Qwen3.5-0.8B / Qwen3.5-2B / Llama-3
all match. If we later need to apply QJL to a model with
head_dim != 128, the #define EMB_DIM must be edited and the
kernel rebuilt; there is no runtime arg for it.Reference: Arditi et al., Refusal in LLMs is mediated by a single direction (arXiv:2406.11717). Practical writeup: Maxime Labonne, "Uncensor any LLM with abliteration".
Computes the rank-1 refusal direction
r = normalize(mean(harmful) - mean(harmless)) from residual-stream
activations at a configurable mid-stack layer, then projects r out of
every block's self_attn.o_proj and mlp.down_proj weights so the
model can never write the refusal direction back into the residual
stream. Destructive transform — save to a NEW directory.
uv run python scripts/quantization/abliteration_apply.py \
--checkpoint Qwen/Qwen3.5-0.8B \
--output checkpoints/qwen3.5-0.8b-abliterated \
--harmful-jsonl data/harmful.jsonl \
--harmless-jsonl data/harmless.jsonl
Without --harmful-jsonl / --harmless-jsonl the script falls back to
a small built-in pair list intended for smoke testing only — pass real
prompt corpora (e.g.
mlabonne/harmful_behaviors)
in production.