This model was released on 2025-08-05 and added to Hugging Face Transformers on 2025-08-05.

GptOss

GptOss is a sparse mixture-of-experts (MoE) language model from OpenAI that routes each token to 4 experts per MoE layer, chosen from 128 experts in gpt-oss-120b and 32 in gpt-oss-20b. It uses attention sinks (a learnable per-head logit that takes part in the softmax normalization without contributing a value vector) and YaRN-scaled rotary embeddings for sequences of up to 131,072 tokens.
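
To make the routing step concrete, here is a minimal pure-Python sketch of top-k expert selection for a single token. It assumes the mixing weights are a softmax over only the selected router logits; the function name `route_top_k` and the example scores are hypothetical and are not part of the Transformers API.

```python
import math


def route_top_k(router_logits, k=4):
    """Pick the k highest-scoring experts and softmax-normalize their logits."""
    # Indices of the k largest logits, highest first.
    top = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only (subtract the max for numerical stability).
    m = router_logits[top[0]]
    exps = [math.exp(router_logits[i] - m) for i in top]
    total = sum(exps)
    return top, [e / total for e in exps]


# Hypothetical router scores for one token over 8 experts (assumption for illustration).
logits = [0.1, 2.0, -1.0, 0.5, 1.5, -0.3, 0.0, 1.0]
experts, weights = route_top_k(logits, k=4)
print(experts)  # [1, 4, 7, 3]
print(weights)  # four mixing weights that sum to 1
```

Only the selected experts run for that token, and their outputs are combined with these weights, which is what keeps the active parameter count well below the total.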

The example below demonstrates how to generate text with [Pipeline] or the [AutoModelForCausalLM] class.

<hfoptions id="usage"> <hfoption id="Pipeline">

python

from transformers import pipeline


pipe = pipeline(
    task="text-generation",
    model="openai/gpt-oss-20b",
)
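# The pipeline returns a list with one dict per generated sequence.
result = pipe("Plants create energy through a process known as")
print(result[0]["generated_text"])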

</hfoption> <hfoption id="AutoModelForCausalLM">

python

from transformers import AutoModelForCausalLM, AutoTokenizer


tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-20b")
model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b",
    device_map="auto",
)
# The tokenizer returns a dict-like batch (input_ids and attention_mask), not only the ids.
inputs = tokenizer("Plants create energy through a process known as", return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))

</hfoption> </hfoptions>

Notes

SDPA is not supported because attention sinks require direct access to the full attention logits before softmax. Use Flash Attention or Flex Attention instead.
When using Flex Attention, attention sinks require special handling. The score_mod function operates on individual score elements rather than the full attention matrix, so sink renormalization is applied after computation using the log-sum-exp (LSE) values returned by Flex Attention.
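
The renormalization described in the notes above can be checked with a small sketch. It is not the library's implementation: it handles a single query with scalar values for brevity, and the function names are hypothetical. The first function applies a softmax over the scores plus one sink logit. The second runs ordinary softmax attention and then rescales the result using the log-sum-exp of the scores.

```python
import math


def attend_with_sink(scores, values, sink):
    """Softmax over the scores plus one extra sink logit; the sink has no value."""
    m = max(max(scores), sink)
    exps = [math.exp(s - m) for s in scores]
    denom = sum(exps) + math.exp(sink - m)
    return sum(e / denom * v for e, v in zip(exps, values))


def attend_then_renormalize(scores, values, sink):
    """Plain softmax attention, then rescale using the log-sum-exp of the scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    out = sum(e / total * v for e, v in zip(exps, values))
    lse = m + math.log(total)
    # exp(lse) / (exp(lse) + exp(sink)) is the same as sigmoid(lse - sink).
    scale = 1.0 / (1.0 + math.exp(sink - lse))
    return out * scale


# Hypothetical scores, values, and sink logit (assumptions for illustration).
scores = [0.3, -1.2, 2.0, 0.7]
values = [1.0, -2.0, 0.5, 3.0]
direct = attend_with_sink(scores, values, sink=0.4)
rescaled = attend_then_renormalize(scores, values, sink=0.4)
print(abs(direct - rescaled) < 1e-12)
```

Because the sink only enlarges the softmax denominator, multiplying the sink-free output by `sigmoid(lse - sink)` recovers the same result, so the full attention matrix never has to be materialized.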

GptOssConfig

[[autodoc]] GptOssConfig

GptOssModel

[[autodoc]] GptOssModel
    - forward

GptOssForCausalLM

[[autodoc]] GptOssForCausalLM
    - forward

GptOssForSequenceClassification

[[autodoc]] GptOssForSequenceClassification
    - forward

GptOssForTokenClassification

[[autodoc]] GptOssForTokenClassification
    - forward