packages/training/scripts/templates/model_card_uncensored.md
Research / red-teaming only. Safety alignment has been removed from this checkpoint via directional refusal ablation. Do not deploy as a public-facing assistant without re-aligning.
This model is the abliterated sibling of
{base_eliza_repo_id}. The
single-direction refusal mediator (Arditi et al., arXiv:2406.11717)
has been orthogonalized out of the residual-stream writers using
Heretic (TPE-optimized direction ablation
on self_attn.o_proj for full-attention layers + mlp.down_proj for all
layers). The result is a model that largely stops refusing the harmful requests
its safety-tuned sibling refuses; the residual refusal rate on a held-out probe
is reported in the table below.
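
As a rough illustration of the orthogonalization described above, the toy sketch below removes one direction from the output space of a single weight matrix, i.e. it computes W' = W - r̂ r̂ᵀ W. The matrix shape, the example values, and the helper names (`unit`, `ablate_direction`, `matvec`, `dot`) are assumptions made for illustration; this is not Heretic's code, and it assumes full-strength ablation rather than the per-layer parameters recorded in heretic_params.json.

```python
# Toy sketch of directional ablation on one weight matrix (pure Python).
# Assumption: rows of W index residual-stream dimensions, columns index the
# layer's input features, and r is a refusal direction in the residual stream.
import math


def unit(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [dot(row, x) for row in W]


def ablate_direction(W, r):
    """Return W' = W - r_hat (r_hat^T W); outputs of W' have no component along r."""
    r_hat = unit(r)
    n_rows, n_cols = len(W), len(W[0])
    # proj[j] is how strongly input feature j writes along r_hat.
    proj = [sum(r_hat[i] * W[i][j] for i in range(n_rows)) for j in range(n_cols)]
    return [[W[i][j] - r_hat[i] * proj[j] for j in range(n_cols)] for i in range(n_rows)]


# Hypothetical 3-dimensional residual stream and 2-dimensional layer input.
W = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
r = [1.0, 1.0, 0.0]

W_abl = ablate_direction(W, r)
y = matvec(W_abl, [0.7, -1.3])
# Whatever the input, the ablated layer writes nothing along r
# (up to floating-point error).
print(abs(dot(y, r)) < 1e-9)
```

Rows of the matrix that are already orthogonal to the direction (the third row here) are left unchanged, which is why the edit can leave behavior on harmless prompts largely intact.
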
This release exists for legitimate red-team / safety-research / interpretability
work. It is published in the same elizaos HuggingFace organization as the
safety-tuned line, distinguished by the -uncensored suffix in the repo name.
Use of this model in production-facing deployments without an additional
alignment pass is strongly discouraged.
By downloading these weights you agree that you will not use them for any of the categories prohibited by the upstream base model's acceptable-use policy. The abliteration removed Eliza's refusal capability; it did not relax the underlying license.
| field | value |
|---|---|
| Source SFT checkpoint | {base_eliza_repo_id} |
| Tool | heretic-llm ≥ v1.2 |
| Refusal direction layer | {abl_layer} (= int(0.6 * n_layers)) |
| Refusal rate (held-out 50-prompt probe) | {abl_refusal_rate} |
| KL divergence vs SFT (256 harmless prompts) | {abl_kl} |
| Calibration set (harmful) | mlabonne/harmful_behaviors train split |
| Calibration set (harmless) | mlabonne/harmless_alpaca (256 prompts) |
| TPE trials | {abl_tpe_trials} |
| heretic params | see heretic_params.json in this repo |
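
The sketch below shows one way the three numeric rows in the table above could be computed. Only the formula `int(0.6 * n_layers)` comes from this card. The refusal markers, the substring-matching rule, and the KL direction (SFT distribution as the reference) are assumptions for illustration; the actual probe and metric are whatever the Heretic run used.

```python
# Minimal sketch of the table's metrics, under the assumptions stated above.
import math


def direction_layer(n_layers):
    """Layer index for the refusal direction, per the table: int(0.6 * n_layers)."""
    return int(0.6 * n_layers)


def refusal_rate(responses, markers=("I can't", "I cannot", "I won't")):
    """Fraction of responses containing a refusal marker (hypothetical markers)."""
    refused = sum(any(m in resp for m in markers) for resp in responses)
    return refused / len(responses)


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical inputs: a 32-layer model, four probe responses, and toy
# next-token distributions for the SFT (p) and abliterated (q) models.
print(direction_layer(32))
print(refusal_rate([
    "I cannot help with that.",
    "Sure, here is an overview.",
    "Step 1: gather the materials.",
    "I can't assist with this request.",
]))
print(kl_divergence([0.5, 0.5], [0.25, 0.75]))
```

A low KL value on harmless prompts indicates the ablation changed little outside refusal behavior; in practice the per-token values would be averaged over the 256 harmless prompts.
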
Eval gates that must pass for this checkpoint to ship:
{eval_table}
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
tok = AutoTokenizer.from_pretrained("{repo_id}")
model = AutoModelForCausalLM.from_pretrained(
    "{repo_id}", torch_dtype=torch.bfloat16, device_map="auto",
)
Apache-2.0, inherited from the base checkpoint recorded in the sibling model metadata. Apache-2.0 explicitly permits derivative works including modified weights. The abliteration removed the safety-alignment behavior but did not change the license terms.
License carve-out: This release is distributed under Apache-2.0 with the explicit understanding that:
@misc{{eliza1_uncensored_{eliza_citation_key},
title = {{ {eliza_short_name}-uncensored: an abliterated Eliza checkpoint for safety research }},
author = {{ elizaOS team }},
year = {{ 2026 }},
url = {{ https://huggingface.co/{repo_id} }},
}}
@misc{{arditi2024refusal,
title = {{ Refusal in Language Models Is Mediated by a Single Direction }},
author = {{ Arditi, Andy and Obeso, Oscar and Syed, Aaquib and Paleka, Daniel and Panickssery, Nina and Gurnee, Wes and Nanda, Neel }},
year = {{ 2024 }},
eprint = {{ 2406.11717 }},
archivePrefix = {{ arXiv }},
primaryClass = {{ cs.LG }},
}}