<!--Copyright 2026 The Sapient AI Authors and the HuggingFace Inc. team. All rights reserved. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. ⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be rendered properly in your Markdown viewer. -->

This model was released on 2025-06-26 and added to Hugging Face Transformers on 2026-05-18.

HRM-Text

Overview

HRM-Text is the improved autoregressive language-modeling variant of the Hierarchical Reasoning Model (HRM) by the Sapient AI team. It is a base model that uses a hierarchical recurrent forward pass: two transformer stacks (H for slow, abstract planning, and L for fast, detailed computation) are reused inside a nested recurrence:

for h in range(H_cycles):
    for l in range(L_cycles):
        z_L = L(z_L + z_H)
    z_H = H(z_H + z_L)
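
As a rough illustration of how the two stacks are reused, the sketch below runs the same nested loop with scalar stand-ins for the H and L stacks and counts how often each one is applied. `toy_L`, `toy_H`, and the initial state values are hypothetical placeholders, not the real transformer stacks; only the loop structure and the defaults `H_cycles=2`, `L_cycles=3` come from this page.

```python
def run_recurrence(H_cycles=2, L_cycles=3):
    """Toy version of the nested recurrence; returns final states and call counts."""
    calls = {"L": 0, "H": 0}

    def toy_L(x):  # hypothetical stand-in for the L stack
        calls["L"] += 1
        return 0.5 * x

    def toy_H(x):  # hypothetical stand-in for the H stack
        calls["H"] += 1
        return 0.5 * x

    z_H, z_L = 1.0, 1.0  # arbitrary initial states, for illustration only
    for h in range(H_cycles):
        for l in range(L_cycles):
            z_L = toy_L(z_L + z_H)  # fast, low-level updates
        z_H = toy_H(z_H + z_L)      # one slow, high-level update per H cycle
    return z_H, z_L, calls


z_H, z_L, calls = run_recurrence()
print(calls)  # {'L': 6, 'H': 2}
```

With the defaults, one forward pass applies the L stack `H_cycles * L_cycles = 6` times and the H stack `H_cycles = 2` times, reusing the same weights on every application.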

Architectural traits:

  • PrefixLM attention: instruction tokens attend bidirectionally, response tokens attend causally. Controlled by config.prefix_lm (default True); see 4D-masks blog / FlexAttention blog for the canonical form.
  • Per-head sigmoid output gate applied to the attention output before o_proj (Qwen3-Next-style; see Qwen3NextAttention). Legacy checkpoints stored as a single fused gqkv_proj are split into gate_proj / q_proj / k_proj / v_proj at load time by the registered HRM-Text checkpoint conversion mapping.
  • Parameterless RMSNorm: F.rms_norm with no learnable scale.
  • L_bp_cycles — the k-step grad trick from HRM. At training time, only the trailing L_bp_cycles[i] of the L_cycles low-level iterations propagate gradients; earlier iterations run under torch.no_grad() so their activations are not stored. No effect at inference.
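
To make the L_bp_cycles item concrete, here is a minimal sketch of which low-level iterations would keep gradients during training. It assumes L_bp_cycles holds one entry per H cycle and that entry i counts the trailing L iterations of H cycle i; the helper name `grad_schedule` and the example values `[1, 2]` are hypothetical, and the actual training loop in the model may be organised differently.

```python
def grad_schedule(L_cycles, L_bp_cycles):
    """Per H cycle, list which L iterations keep gradients (True) or would run under no_grad (False).

    Assumption: L_bp_cycles[i] is the number of trailing L iterations of
    H cycle i that propagate gradients.
    """
    schedule = []
    for bp in L_bp_cycles:
        bp = max(0, min(bp, L_cycles))   # clamp to a valid range (illustrative choice)
        no_grad_steps = L_cycles - bp    # leading iterations whose activations are not stored
        schedule.append([step >= no_grad_steps for step in range(L_cycles)])
    return schedule


for h, row in enumerate(grad_schedule(L_cycles=3, L_bp_cycles=[1, 2])):
    print(h, row)
# 0 [False, False, True]
# 1 [False, True, True]
```

In a real training loop the `False` positions would correspond to iterations wrapped in `torch.no_grad()`, which is where the activation-memory saving comes from.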

Usage

HRM-Text-1B is a base language model. It does not ship a chat_template and apply_chat_template is intentionally not supported for this release — the prompt format used during pre-training is still evolving, and an instruction-tuned variant with a stable chat template will follow in a separate release. Drive the base model through plain AutoTokenizer + AutoModelForCausalLM.generate(...):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("sapientinc/HRM-Text-1B")
model = AutoModelForCausalLM.from_pretrained(
    "sapientinc/HRM-Text-1B", device_map="auto",
)

inputs = tokenizer("The quick brown fox", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=16, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

Attention backends

"sdpa" is the default and is the right choice for most workloads. "flex_attention" is supported and pays off at long context, but it carries a fixed BlockMask construction cost on every forward pass, and HRM-Text's recurrent stack reuse does not amortise that cost as much as you might expect. Indicative prefill latency on a single H100 with the released 1.2B base checkpoint and the default H_cycles=2, L_cycles=3:

| seq_len | sdpa | flex_attention | recommendation |
|---|---|---|---|
| 64 | 41 ms | 70 ms | sdpa |
| 256 | 41 ms | 70 ms | sdpa |
| 1024 | 42 ms | 69 ms | sdpa |
| 2048 | 85 ms | 78 ms | flex (≈ 1.1x) |

So pick the backend by the workload:

```python
from transformers import AutoModelForCausalLM

# Default: short / medium context
model = AutoModelForCausalLM.from_pretrained("sapientinc/HRM-Text-1B", device_map="auto")

# Long context (≥ 2K tokens): FlexAttention's per-block sparsity overtakes SDPA
model = AutoModelForCausalLM.from_pretrained(
    "sapientinc/HRM-Text-1B", device_map="auto", attn_implementation="flex_attention",
)
```

Both backends produce equivalent logits (verified top-1 100% match end-to-end against the torch reference). "eager" is supported and produces the same logits, but is rarely the fastest option on modern hardware. Its main use is output_attentions=True — SDPA / FlexAttention do not return per-head attention weights, so passes that need them for analysis or visualisation should run with attn_implementation="eager".

> [!WARNING]
> Any FlashAttention variant (FA 2/3/4, as well as HF Hub kernel implementations that may not follow the flash_attention_* naming convention) is rejected by [HrmTextModel] at init whenever config.prefix_lm=True (the default). FA backends only accept causal vs. non-causal masks and cannot represent the PrefixLM 4-D overlay. Use "sdpa" (default) or "flex_attention" for PrefixLM. Setting config.prefix_lm=False makes the mask purely causal and re-enables FA, which is useful for causal-only fine-tuning or inference paths where FA is the fastest option.
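
The backend guidance in this section can be condensed into a small decision helper. This is only a sketch: `pick_attn_implementation` is a hypothetical function, not part of Transformers, and the 2048-token threshold is taken from the indicative H100 numbers above, so it is worth re-measuring on your own hardware. The returned strings are the usual `attn_implementation` values.

```python
def pick_attn_implementation(seq_len, prefix_lm=True, need_attention_weights=False,
                             allow_flash=False, flex_threshold=2048):
    """Hypothetical helper: choose an attn_implementation string for HRM-Text."""
    if need_attention_weights:
        return "eager"              # per-head attention weights are only returned by eager
    if not prefix_lm and allow_flash:
        return "flash_attention_2"  # pure causal mask, so FlashAttention is usable
    if seq_len >= flex_threshold:
        return "flex_attention"     # long context, where FlexAttention starts to win
    return "sdpa"                   # default for short and medium context


print(pick_attn_implementation(256))                  # sdpa
print(pick_attn_implementation(4096))                 # flex_attention
print(pick_attn_implementation(512, prefix_lm=True, allow_flash=True))  # sdpa
```

The result can be passed as `attn_implementation=...` to `from_pretrained`. Note that with `prefix_lm=True` the helper never returns a FlashAttention backend, matching the restriction described in the warning.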

PrefixLM training

For supervised fine-tuning that respects the instruction / response boundary, emit token_type_ids from the data collator alongside input_ids — positions inside the instruction get 1, response and padding get 0. The model treats every position with token_type_ids == 1 as part of a single bidirectional block; everything else stays causal:

```python
import torch

def collate_prefixlm(batch, pad_token_id=0, ignore_label_id=-100):
    """`batch[i] = {"instruction_ids": [...], "response_ids": [...]}`."""
    full_ids = [b["instruction_ids"] + b["response_ids"] for b in batch]
    prefix_lens = [len(b["instruction_ids"]) for b in batch]
    max_len = max(len(ids) for ids in full_ids)

    input_ids = torch.full((len(batch), max_len), pad_token_id, dtype=torch.long)
    token_type_ids = torch.zeros_like(input_ids)
    labels = torch.full_like(input_ids, ignore_label_id)
    attention_mask = torch.zeros_like(input_ids)

    for i, (ids, plen) in enumerate(zip(full_ids, prefix_lens)):
        input_ids[i, : len(ids)] = torch.tensor(ids)
        token_type_ids[i, :plen] = 1                      # bidirectional prefix
        labels[i, plen : len(ids)] = input_ids[i, plen : len(ids)]  # loss on response only
        attention_mask[i, : len(ids)] = 1
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
        "labels": labels,
    }
```

See [HrmTextModel.forward] for the accepted shape.
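
To visualise what the token_type_ids convention implies, the sketch below builds the attention pattern for a single sequence in plain Python. It assumes the rule stated above (prefix positions see each other in both directions, everything else is causal) and additionally assumes padded keys are never attended to; `prefix_lm_mask` is a hypothetical helper for illustration and is not how the model constructs its 4-D mask internally.

```python
def prefix_lm_mask(token_type_ids, attention_mask):
    """Boolean [query][key] mask for one sequence; True means the query may attend to the key."""
    n = len(token_type_ids)
    mask = []
    for q in range(n):
        row = []
        for k in range(n):
            causal = k <= q  # standard causal rule
            both_prefix = token_type_ids[q] == 1 and token_type_ids[k] == 1  # bidirectional block
            row.append(bool(attention_mask[k]) and (causal or both_prefix))
        mask.append(row)
    return mask


# 2 instruction tokens, 2 response tokens, 1 padding token
mask = prefix_lm_mask([1, 1, 0, 0, 0], [1, 1, 1, 1, 0])
rows = ["".join("x" if allowed else "." for allowed in row) for row in mask]
print("\n".join(rows))
# xx...
# xx...
# xxx..
# xxxx.
# xxxx.
```

The first two rows show the instruction tokens attending to each other in both directions, while the response rows follow the usual lower-triangular causal pattern.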

HrmTextConfig

[[autodoc]] HrmTextConfig

HrmTextModel

[[autodoc]] HrmTextModel
    - forward

HrmTextForCausalLM

[[autodoc]] HrmTextForCausalLM
    - forward