This model was released on 2024-12-13 and added to Hugging Face Transformers on 2024-12-13.

</div>

</div>

Cohere 2

Cohere Command R7B is an open weights research release of a 7B billion parameter model. It is a multilingual model trained on 23 languages and has a context window of 128k. The model features three layers with sliding window attention and ROPE for efficient local context modeling and relative positional encoding. A fourth layer uses global attention without positional embeddings, enabling unrestricted token interactions across the entire sequence.

This model is optimized for speed, cost-performance, and compute resources.

You can find all the original Command-R checkpoints under the Command Models collection.

[!TIP] Click on the Cohere models in the right sidebar for more examples of how to apply Cohere to different language tasks.

The example below demonstrates how to generate text with [Pipeline] or the [AutoModel] class, and from the command line.

python

from transformers import pipeline


pipeline = pipeline(
    task="text-generation",
    model="CohereLabs/c4ai-command-r7b-12-2024",
    device_map=0
)

messages = [
    {"role": "user", "content": "Hello, can you please help me book a hotel in Japan?"},
]
pipeline(messages)

</hfoption> <hfoption id="AutoModel">

python

from transformers import AutoModelForCausalLM, AutoTokenizer


tokenizer = AutoTokenizer.from_pretrained("CohereLabs/c4ai-command-r7b-12-2024")
model = AutoModelForCausalLM.from_pretrained(
    "CohereLabs/c4ai-command-r7b-12-2024",
    device_map="auto",
    attn_implementation="sdpa"
)

# format message with the Command-R chat template
messages = [{"role": "user", "content": "Hello, can you please help me book a hotel in Japan?"}]
input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)
output = model.generate(
    input_ids,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.3,
    cache_implementation="static",
)
print(tokenizer.decode(output[0], skip_special_tokens=True))

</hfoption> <hfoption id="transformers CLI">

bash

# pip install -U flash-attn --no-build-isolation
transformers chat CohereLabs/c4ai-command-r7b-12-2024 --dtype auto --attn_implementation flash_attention_2

</hfoption> </hfoptions>

Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the Quantization overview for more available quantization backends.

The example below uses bitsandbytes to quantize the weights to 4-bits.

python

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig


bnb_config = BitsAndBytesConfig(load_in_4bit=True)
tokenizer = AutoTokenizer.from_pretrained("CohereLabs/c4ai-command-r7b-12-2024")
model = AutoModelForCausalLM.from_pretrained(
    "CohereLabs/c4ai-command-r7b-12-2024",
    device_map="auto",
    quantization_config=bnb_config,
    attn_implementation="sdpa"
)

# format message with the Command-R chat template
messages = [{"role": "user", "content": "Hello, can you please help me book a hotel in Japan?"}]
input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)
output = model.generate(
    input_ids,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.3,
    cache_implementation="static",
)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Cohere2Config

[[autodoc]] Cohere2Config

Cohere2Model

[[autodoc]] Cohere2Model - forward

Cohere2ForCausalLM

[[autodoc]] Cohere2ForCausalLM - forward