This model was released on 2024-12-13 and added to Hugging Face Transformers on 2024-12-13.
Cohere Command R7B is an open-weights research release of a 7 billion parameter model. It is a multilingual model trained on 23 languages, with a context window of 128k tokens. The model features three layers with sliding window attention and RoPE for efficient local context modeling and relative positional encoding. A fourth layer uses global attention without positional embeddings, enabling unrestricted token interactions across the entire sequence.
This model is optimized for speed, cost-performance, and compute resources.
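The layer layout described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the Transformers implementation: the helper names `layer_types`, `sliding_window_mask`, and `global_mask` are hypothetical, the sketch assumes the three-plus-one pattern repeats through the whole layer stack, and the tiny window size is chosen only so the mask is readable.

```python
def layer_types(num_layers, pattern=4):
    """Label each layer, assuming every `pattern`-th layer is global (hypothetical helper)."""
    return [
        "global" if (i + 1) % pattern == 0 else "sliding"
        for i in range(num_layers)
    ]


def sliding_window_mask(seq_len, window):
    """Causal mask where query i may attend to key j only if i - window < j <= i."""
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def global_mask(seq_len):
    """Causal mask with no window: query i may attend to every key j <= i."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


print(layer_types(8))

# Illustrative sizes only; a real window spans far more than 3 tokens.
for row in sliding_window_mask(seq_len=5, window=3):
    print("".join("x" if allowed else "." for allowed in row))
```

In a sliding layer each token sees only the most recent `window` positions, while a global layer sees the full prefix, which is how the fourth layer can connect tokens that are far apart.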
You can find all the original Command-R checkpoints under the Command Models collection.
> [!TIP]
> Click on the Cohere models in the right sidebar for more examples of how to apply Cohere to different language tasks.
The example below demonstrates how to generate text with [Pipeline] or the [AutoModel] class, and from the command line.
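# Generate text with the Pipeline API
from transformers import pipeline

# use a distinct name so the imported `pipeline` function is not shadowed
pipe = pipeline(
    task="text-generation",
    model="CohereLabs/c4ai-command-r7b-12-2024",
    device_map=0,  # place the model on device 0
)

messages = [
    {"role": "user", "content": "Hello, can you please help me book a hotel in Japan?"},
]
print(pipe(messages))
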
# Generate text with AutoModelForCausalLM
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("CohereLabs/c4ai-command-r7b-12-2024")
model = AutoModelForCausalLM.from_pretrained(
    "CohereLabs/c4ai-command-r7b-12-2024",
    device_map="auto",
    attn_implementation="sdpa",
)

# format message with the Command-R chat template
messages = [{"role": "user", "content": "Hello, can you please help me book a hotel in Japan?"}]
input_ids = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(
    input_ids,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.3,
    cache_implementation="static",
)
print(tokenizer.decode(output[0], skip_special_tokens=True))

# pip install -U flash-attn --no-build-isolation
transformers chat CohereLabs/c4ai-command-r7b-12-2024 --dtype auto --attn_implementation flash_attention_2
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the Quantization overview for more available quantization backends.
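To make the memory saving concrete, the back-of-the-envelope sketch below estimates weight storage at different precisions. It assumes a nominal 7 billion parameters and counts only the raw weights; quantization metadata, activations, and the KV cache are ignored, so real usage will be higher. The function name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Raw weight storage in gigabytes (1 GB = 1e9 bytes), ignoring all overhead."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 7e9  # nominal parameter count; the exact figure is an assumption

for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

Under these assumptions, moving from 16-bit to 4-bit weights cuts raw weight storage to a quarter.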
The example below uses bitsandbytes to quantize the weights to 4 bits.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# load the weights in 4-bit precision with bitsandbytes
bnb_config = BitsAndBytesConfig(load_in_4bit=True)

tokenizer = AutoTokenizer.from_pretrained("CohereLabs/c4ai-command-r7b-12-2024")
model = AutoModelForCausalLM.from_pretrained(
    "CohereLabs/c4ai-command-r7b-12-2024",
    device_map="auto",
    quantization_config=bnb_config,
    attn_implementation="sdpa",
)

# format message with the Command-R chat template
messages = [{"role": "user", "content": "Hello, can you please help me book a hotel in Japan?"}]
input_ids = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(
    input_ids,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.3,
    cache_implementation="static",
)
print(tokenizer.decode(output[0], skip_special_tokens=True))

[[autodoc]] Cohere2Config
[[autodoc]] Cohere2Model - forward
[[autodoc]] Cohere2ForCausalLM - forward