The CLI accepts 9 methods via --method: basic, advanced, aggressive, spectral_cascade, informed, surgical, optimized, inverted, nuclear. Four additional methods (failspy, gabliteration, heretic, rdo) are available only via the Python API.
Abliteration identifies a "refusal direction" — a vector in the model's activation space that corresponds to refusal behavior — and projects it out of the weight matrices.
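Mathematically: W_new = W_old - (W_old @ d @ d.T), where d is the refusal direction as a unit-norm column vector. Because d @ d.T is then the projector onto d, the result satisfies W_new @ d = 0.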
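A minimal NumPy sketch of this projection follows. It assumes W is a dense matrix whose second dimension matches the length of d, so that W @ d is defined. project_out is a hypothetical helper for illustration, not part of the OBLITERATUS API. The same formula extends to k directions by replacing d with a matrix Q of k orthonormal columns. The sketch obtains Q with a QR decomposition, which also handles the unit-norm requirement for the single-direction case.

```python
import numpy as np

def project_out(W: np.ndarray, D: np.ndarray) -> np.ndarray:
    """Return W - W @ Q @ Q.T, where Q is an orthonormal basis for the columns of D.

    W: weight matrix of shape (m, n).
    D: one direction of shape (n,), or k directions stacked as columns, shape (n, k).
    (Hypothetical helper; illustrates the formula, not the library's code.)
    """
    D = D.reshape(W.shape[1], -1)   # treat a 1-D direction as a single column
    Q, _ = np.linalg.qr(D)          # orthonormalize: the formula needs unit-norm directions
    return W - W @ Q @ Q.T          # the projection above, with d replaced by Q

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))
d = rng.standard_normal(16)          # stand-in for an extracted refusal direction
W_new = project_out(W, d)

# W_new no longer responds to inputs along d.
print(np.allclose(W_new @ (d / np.linalg.norm(d)), 0.0))
```

Inputs orthogonal to d pass through unchanged, and applying the projection a second time has no further effect. Only the component along d is removed.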
The key challenge is finding refusal directions precise enough that projecting them out does not damage other capabilities.
Before projecting, OBLITERATUS extracts refusal directions using one of three methods:
| Method | Flag | Description | Best For |
|---|---|---|---|
| Diff-in-Means | --direction-method diff_means | Difference between mean activations on refused vs. complied prompts | Default, fast, robust |
| SVD | --direction-method svd | Multi-direction extraction via Singular Value Decomposition | Complex alignment, multiple refusal mechanisms |
| LEACE | --direction-method leace | LEAst-squares Concept Erasure: closed-form linear erasure that makes the concept linearly undetectable while changing activations as little as possible | Maximum precision, research |
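The diff_means and svd rows can be sketched in NumPy. This sketch assumes that activations from one layer have already been collected into arrays, with one row per prompt. The function names are hypothetical. The SVD variant shown here takes the top singular vectors of refused activations centered on the complied mean. That is one plausible way to get several directions, and it is an assumption, not necessarily what OBLITERATUS does. LEACE is not sketched here.

```python
import numpy as np

def diff_in_means_direction(refused: np.ndarray, complied: np.ndarray) -> np.ndarray:
    """Unit-norm difference of mean activations (rows are prompts, columns are hidden units)."""
    d = refused.mean(axis=0) - complied.mean(axis=0)
    return d / np.linalg.norm(d)

def svd_directions(refused: np.ndarray, complied: np.ndarray, k: int) -> np.ndarray:
    """Top-k right singular vectors of refused activations minus the complied mean.

    Assumed variant for illustration. Returns a (hidden_dim, k) matrix with orthonormal columns.
    """
    diffs = refused - complied.mean(axis=0)
    _, _, Vt = np.linalg.svd(diffs, full_matrices=False)
    return Vt[:k].T

# Synthetic check: refused activations are shifted along a known direction.
rng = np.random.default_rng(0)
hidden = 32
true_dir = np.zeros(hidden)
true_dir[3] = 1.0
complied = rng.standard_normal((200, hidden))
refused = rng.standard_normal((200, hidden)) + 4.0 * true_dir

d = diff_in_means_direction(refused, complied)
Q = svd_directions(refused, complied, k=3)
print(abs(d @ true_dir), abs(Q[:, 0] @ true_dir))  # both close to 1 on this data
```

Diff-in-means yields a single direction cheaply. The SVD variant returns an orthonormal set that can be passed straight to a multi-direction projection.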
To choose a method, work through these questions in order:

1. Is this a quick test? YES → basic. NO → continue.
2. Is it an MoE model (Mixtral, DeepSeek-MoE)? YES → nuclear. NO → continue.
3. Is it a reasoning model (R1, QwQ, CoT-focused)? YES → surgical. NO → continue.
4. Do you need the absolute best quality and have time? YES → optimized. NO → advanced (recommended default).
5. Did advanced leave > 10% refusals? YES → aggressive (for stubborn models). If the model still refuses → nuclear.
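The selection questions can be encoded as a small function, which is useful when scripting runs over several models. choose_method and its parameters are hypothetical names for illustration. The function returns only the string to pass to --method and does not call OBLITERATUS.

```python
from typing import Optional

def choose_method(quick_test: bool, is_moe: bool, is_reasoning: bool,
                  want_best_quality: bool,
                  advanced_refusal_rate: Optional[float] = None,
                  aggressive_still_refuses: bool = False) -> str:
    """Walk the method-selection questions in order and return a --method value (hypothetical helper)."""
    if quick_test:
        return "basic"
    if is_moe:
        return "nuclear"       # expert-aware removal for MoE models
    if is_reasoning:
        return "surgical"      # protects CoT behavior
    if want_best_quality:
        return "optimized"
    # Default path: run advanced first; escalate only if it leaves > 10% refusals.
    if advanced_refusal_rate is None or advanced_refusal_rate <= 0.10:
        return "advanced"
    return "nuclear" if aggressive_still_refuses else "aggressive"

print(choose_method(False, False, False, False, advanced_refusal_rate=0.25))  # aggressive
```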
| Parameter | Range | Default | Effect |
|---|---|---|---|
| --n-directions | 1-32 | method-dependent | More directions = more complete removal, but higher damage risk |
| --regularization | 0.0-1.0 | 0.1 | Higher = more conservative (less removal, less damage) |
| --refinement-passes | 1-5 | 2 | More passes catch residual refusal, but diminishing returns |
| --quantization | 4bit, 8bit | none | Reduces VRAM usage; quality impact minimal for extraction |
| --verify-sample-size | 10-200 | 20 | More samples = more accurate refusal rate estimate |
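To see why --verify-sample-size matters, treat each verification prompt as an independent refused-or-not trial and put a confidence interval on the measured rate. The sketch below uses the standard Wilson score interval. The function name is hypothetical, and the independence assumption belongs to this illustration, not to OBLITERATUS's own reporting.

```python
import math

def refusal_rate_interval(refusals: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a refusal rate measured on n prompts."""
    p = refusals / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)

print(refusal_rate_interval(2, 20))    # 10% measured on the default 20 samples
print(refusal_rate_interval(20, 200))  # 10% measured on 200 samples
```

With the default of 20 samples, a measured 10% refusal rate is consistent with anything from roughly 3% to 30%. With 200 samples, the range narrows to roughly 7% to 15%, which is tight enough to judge reliably against the 10% escalation threshold.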
| Problem | Likely Cause | Fix |
|---|---|---|
| Refusal rate > 20% | Too few directions | Increase --n-directions, try aggressive |
| Refusal rate 5-20% | Residual refusal | Set --refinement-passes 3 (default 2), try --direction-method svd |
| Perplexity spike > 20% | Over-aggressive removal | Reduce --n-directions, increase --regularization |
| Repetitive output | Weight matrix damage | Use basic with fewer directions, check norm preservation |
| MoE model still refuses | Non-expert-aware method | Switch to nuclear |
| Reasoning degraded | CoT directions damaged | Use surgical method |
| OOM during extraction | Insufficient VRAM | Add --quantization 4bit and/or --large-model |
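The "Perplexity spike > 20%" row can be checked by scoring the same evaluation text with the base and modified models. The sketch below assumes the threshold means a relative increase over the base model's perplexity. That reading is an assumption. It also assumes per-token natural-log probabilities are already available from whatever evaluation harness you use. Both function names are hypothetical.

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood), from natural-log token probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def perplexity_increase(base_logprobs: list[float], new_logprobs: list[float]) -> float:
    """Relative perplexity change of the modified model versus the base model."""
    return perplexity(new_logprobs) / perplexity(base_logprobs) - 1.0

# Mean NLL rising from 2.00 to 2.25 nats/token multiplies perplexity by exp(0.25), about 1.28.
change = perplexity_increase([-2.0] * 4, [-2.25] * 4)
print(f"{change:.1%}")  # above the 20% threshold, so reduce --n-directions or raise --regularization
```

Perplexity is exponential in mean NLL, so a seemingly small shift of 0.18 nats per token is already enough to cross a 20% relative increase.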