scientific-skills/transformers/references/generation.md
Generate text with language models using the generate() method. Control output quality and style through generation strategies and parameters.
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
# Tokenize input
inputs = tokenizer("Once upon a time", return_tensors="pt")
# Generate
outputs = model.generate(**inputs, max_new_tokens=50)
# Decode
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(text)
Greedy decoding selects the highest-probability token at each step (deterministic):
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=False  # Greedy decoding (default)
)
Use for: Factual text, translations, where determinism is needed.
Sampling draws each next token at random from the model's probability distribution:
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    temperature=0.7,
    top_k=50,
    top_p=0.95
)
Use for: Creative writing, diverse outputs, open-ended generation.
Beam search explores multiple hypotheses in parallel and keeps the highest-scoring ones:
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    num_beams=5,
    early_stopping=True
)
Use for: Translations, summarization, where quality is critical.
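To see why keeping several hypotheses can beat greedy decoding, here is a minimal sketch of beam search over a tiny hand-made next-token table. The table `NEXT` and the helper `beam_search` are hypothetical and only illustrate the idea; this is not the library's implementation.

```python
import math

# Hypothetical next-token probabilities, keyed by the previous token.
NEXT = {
    "<s>": {"A": 0.6, "B": 0.4},
    "A": {"x": 0.55, "y": 0.45},
    "B": {"x": 0.9, "y": 0.1},
}

def beam_search(num_beams, steps=2):
    """Return the best token sequence found with the given beam width."""
    beams = [(0.0, ["<s>"])]  # (cumulative log-probability, tokens so far)
    for _ in range(steps):
        candidates = []
        for score, seq in beams:
            for tok, p in NEXT[seq[-1]].items():
                candidates.append((score + math.log(p), seq + [tok]))
        # Keep only the num_beams highest-scoring hypotheses
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:num_beams]
    return beams[0][1][1:]  # drop the start token

print(beam_search(num_beams=1))  # ['A', 'x']  (greedy: probability 0.6 * 0.55 = 0.33)
print(beam_search(num_beams=2))  # ['B', 'x']  (probability 0.4 * 0.9 = 0.36)
```

With a single beam the search commits to "A" because it is locally most likely. With two beams the "B" hypothesis survives the first step and wins overall.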
Contrastive search balances quality and diversity:
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    penalty_alpha=0.6,
    top_k=4
)
Use for: Long-form generation, reducing repetition.
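The role of `penalty_alpha` can be sketched with the scoring rule from the contrastive search paper: each of the `top_k` candidates is scored by its model probability minus a penalty for resembling earlier context. The two-dimensional vectors below are made-up stand-ins for hidden states, and the helper names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_score(prob, candidate_vec, context_vecs, alpha):
    """(1 - alpha) * model confidence - alpha * max similarity to the context."""
    degeneration_penalty = max(cosine(candidate_vec, h) for h in context_vecs)
    return (1 - alpha) * prob - alpha * degeneration_penalty

context = [[1.0, 0.0], [0.0, 1.0]]  # made-up representations of earlier tokens
repeat_like = contrastive_score(0.6, [1.0, 0.0], context, alpha=0.6)
novel = contrastive_score(0.4, [-1.0, 0.0], context, alpha=0.6)
print(novel > repeat_like)  # True
```

The less probable candidate wins here because it does not duplicate a representation already in the context. With `alpha=0` the rule reduces to picking the most probable candidate.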
max_new_tokens: Maximum tokens to generate
max_new_tokens=100 # Generate up to 100 new tokens
max_length: Maximum total length (input + output)
max_length=512 # Total sequence length
min_new_tokens: Minimum tokens to generate
min_new_tokens=50 # Force at least 50 tokens
min_length: Minimum total length
min_length=100
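The difference between the total-length and new-token limits is simple arithmetic. A minimal sketch, assuming a decoder-only model where the prompt counts toward `max_length`; the helper `new_token_budget` is hypothetical and not part of the library.

```python
def new_token_budget(prompt_len, max_length=None, max_new_tokens=None):
    """Return how many tokens may still be generated for a prompt of prompt_len tokens."""
    if max_new_tokens is not None:
        # max_new_tokens counts only generated tokens, regardless of prompt length
        return max_new_tokens
    if max_length is not None:
        # max_length counts prompt tokens plus generated tokens
        return max(max_length - prompt_len, 0)
    raise ValueError("set max_length or max_new_tokens")

print(new_token_budget(prompt_len=12, max_length=512))      # 500
print(new_token_budget(prompt_len=12, max_new_tokens=100))  # 100
```

Because the budget under `max_length` shrinks as the prompt grows, `max_new_tokens` is usually the more predictable choice.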
temperature: Controls randomness (only takes effect with sampling, i.e. do_sample=True):
temperature=1.0 # Default, balanced
temperature=0.7 # More focused, less random
temperature=1.5 # More creative, more random
Lower temperature → more deterministic.
Higher temperature → more random.
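The effect of temperature can be sketched in plain Python: logits are divided by the temperature before the softmax. The logits below are made up for illustration.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # made-up scores for three tokens
for t in (0.7, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At 0.7 the top token takes a larger share of the probability mass than at 1.0, and at 1.5 the distribution flattens.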
top_k: Consider only the K most likely tokens at each step:
do_sample=True,
top_k=50 # Sample from top 50 tokens
Common values: 40-100 for balanced output, 10-20 for focused output.
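A minimal sketch of top-k filtering on a made-up distribution: everything outside the K most likely tokens is zeroed and the rest is renormalized. The helper name is hypothetical, and the library applies the equivalent step to logits rather than probabilities.

```python
def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize; all others get probability 0."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

probs = [0.5, 0.3, 0.15, 0.05]
filtered = top_k_filter(probs, k=2)
print([round(p, 3) for p in filtered])  # [0.625, 0.375, 0.0, 0.0]
```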
top_p (nucleus sampling): Sample from the smallest set of tokens whose cumulative probability reaches P:
do_sample=True,
top_p=0.95 # Sample from smallest set with 95% cumulative probability
Common values: 0.9-0.95 for balanced, 0.7-0.85 for focused.
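Nucleus filtering can be sketched the same way: walk the tokens from most to least likely and stop once the cumulative probability reaches P. The helper name is hypothetical; the library's implementation works on logits and differs in detail.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

probs = [0.5, 0.3, 0.15, 0.05]
print([round(x, 3) for x in top_p_filter(probs, p=0.9)])  # three tokens survive
print([round(x, 3) for x in top_p_filter(probs, p=0.7)])  # two tokens survive
```

Unlike top-k, the number of surviving tokens adapts to the shape of the distribution: a peaked distribution keeps few tokens, a flat one keeps many.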
repetition_penalty: Discourage tokens that already appear in the sequence:
repetition_penalty=1.2 # Penalize repeated tokens
Values: 1.0 = no penalty, 1.2-1.5 = moderate, 2.0+ = strong penalty.
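A sketch of how such a penalty is commonly applied, following the formulation from the CTRL paper: the logit of every token already present is divided by the penalty if positive and multiplied by it if negative. That the library uses exactly this rule is an assumption here, and the helper name is hypothetical.

```python
def apply_repetition_penalty(logits, generated_ids, penalty):
    """Lower the scores of tokens that already appear in generated_ids."""
    out = list(logits)
    for tok in set(generated_ids):
        # Dividing a positive logit and multiplying a negative one both reduce the score
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out

logits = [2.0, -1.0, 0.5, 3.0]
print(apply_repetition_penalty(logits, generated_ids=[0, 1, 0], penalty=2.0))
# [1.0, -2.0, 0.5, 3.0]
```

Tokens 0 and 1 were already generated, so their scores drop. Tokens 2 and 3 are untouched, and a penalty of 1.0 leaves every score unchanged.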
num_beams: Number of beams
num_beams=5 # Keep 5 hypotheses
early_stopping: Stop beam search as soon as num_beams complete candidates have been found
early_stopping=True
no_repeat_ngram_size: Prevent n-gram repetition
no_repeat_ngram_size=3 # Don't repeat any 3-gram
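The n-gram constraint can be sketched as a lookup: given the tokens so far, find every token that would complete an n-gram already seen. The helper `banned_next_tokens` is hypothetical and uses strings in place of token ids for readability.

```python
def banned_next_tokens(generated, n):
    """Return tokens that would complete an n-gram already present in generated."""
    if n <= 0 or len(generated) < n:
        return set()
    # The last n-1 tokens form the prefix of the n-gram about to be completed
    prefix = tuple(generated[len(generated) - n + 1:])
    banned = set()
    for i in range(len(generated) - n + 1):
        ngram = generated[i:i + n]
        if tuple(ngram[:-1]) == prefix:
            banned.add(ngram[-1])
    return banned

tokens = ["the", "cat", "sat", "on", "the", "cat"]
print(banned_next_tokens(tokens, n=3))  # {'sat'}
```

Here "the cat sat" has already occurred, so after the second "the cat" the token "sat" is blocked.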
num_return_sequences: Generate multiple outputs
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    num_beams=5,
    num_return_sequences=3  # Return 3 different sequences (must be <= num_beams)
)
pad_token_id: Specify padding token
pad_token_id=tokenizer.eos_token_id
eos_token_id: Stop generation at specific token
eos_token_id=tokenizer.eos_token_id
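Generate for multiple prompts:
# GPT-2 defines no pad token, so reuse EOS; pad on the left for decoder-only models
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"
prompts = ["Hello, my name is", "Once upon a time"]
inputs = tokenizer(prompts, return_tensors="pt", padding=True)
outputs = model.generate(**inputs, max_new_tokens=50, pad_token_id=tokenizer.eos_token_id)
for i, output in enumerate(outputs):
    text = tokenizer.decode(output, skip_special_tokens=True)
    print(f"Prompt {i}: {text}\n")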
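Batches for decoder-only models are usually padded on the left, so that every prompt ends at the last position and generation continues directly from real tokens. A minimal sketch of what padding produces; the helper `pad_batch` and the token ids are made up.

```python
def pad_batch(sequences, pad_id, side="left"):
    """Pad token-id lists to equal length; return (input_ids, attention_mask)."""
    width = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        gap = width - len(seq)
        if side == "left":
            input_ids.append([pad_id] * gap + seq)
            attention_mask.append([0] * gap + [1] * len(seq))
        else:
            input_ids.append(seq + [pad_id] * gap)
            attention_mask.append([1] * len(seq) + [0] * gap)
    return input_ids, attention_mask

ids, mask = pad_batch([[11, 12, 13, 14], [21, 22]], pad_id=0, side="left")
print(ids)   # [[11, 12, 13, 14], [0, 0, 21, 22]]
print(mask)  # [[1, 1, 1, 1], [0, 0, 1, 1]]
```

The attention mask marks padding positions with 0 so the model ignores them.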
Stream tokens as generated:
from transformers import TextIteratorStreamer
from threading import Thread
streamer = TextIteratorStreamer(tokenizer, skip_special_tokens=True)
generation_kwargs = dict(
    inputs,
    streamer=streamer,
    max_new_tokens=100
)
# Run generate() in a background thread so the main thread can consume the stream
thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()
for text in streamer:
    print(text, end="", flush=True)
thread.join()
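The threading pattern above is a producer/consumer setup: generation pushes text into a queue while the main thread reads from it. A stdlib-only sketch of the same pattern, with a hypothetical `fake_generate` standing in for `model.generate`; it illustrates the idea and is not the streamer's actual code.

```python
from queue import Queue
from threading import Thread

_END = object()  # sentinel marking the end of the stream

def fake_generate(chunks, out_queue):
    """Stand-in for model.generate: pushes text chunks as they become available."""
    for chunk in chunks:
        out_queue.put(chunk)
    out_queue.put(_END)

def stream(out_queue):
    """Yield chunks until the producer signals completion."""
    while True:
        item = out_queue.get()  # blocks until the producer has something
        if item is _END:
            return
        yield item

q = Queue()
thread = Thread(target=fake_generate, args=(["Once ", "upon ", "a ", "time"], q))
thread.start()
received = list(stream(q))
thread.join()
print("".join(received))  # Once upon a time
```

Running the producer in its own thread is what lets the consumer print partial output before generation finishes.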
Force specific words to appear in the output (constrained beam search):
# Require these words to appear somewhere in the generated text (needs num_beams > 1)
force_words = ["Paris", "France"]
force_words_ids = [tokenizer.encode(word, add_special_tokens=False) for word in force_words]
outputs = model.generate(
    **inputs,
    force_words_ids=force_words_ids,
    num_beams=5
)
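Prevent bad words:
bad_words = ["offensive", "inappropriate"]
# BPE tokenizers such as GPT-2's encode mid-sentence words with a leading space,
# so ban that variant; other tokenizers may not need the added space
bad_words_ids = [tokenizer.encode(" " + word, add_special_tokens=False) for word in bad_words]
outputs = model.generate(
    **inputs,
    bad_words_ids=bad_words_ids
)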
Save and reuse generation parameters:
from transformers import GenerationConfig
# Create config
generation_config = GenerationConfig(
    max_new_tokens=100,
    temperature=0.7,
    top_k=50,
    top_p=0.95,
    do_sample=True
)
# Save
generation_config.save_pretrained("./my_generation_config")
# Load and use
generation_config = GenerationConfig.from_pretrained("./my_generation_config")
outputs = model.generate(**inputs, generation_config=generation_config)
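Use chat templates (requires a tokenizer that defines one, such as an instruction-tuned chat model; GPT-2 does not):
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"}
]
# add_generation_prompt appends the marker that starts the assistant's turn;
# return_dict=True returns input_ids and attention_mask ready for generate()
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
)
outputs = model.generate(**inputs, max_new_tokens=100)
# Decode only the newly generated tokens, not the prompt
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)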
For T5, BART, etc.:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
tokenizer = AutoTokenizer.from_pretrained("t5-small")
# T5 uses task prefixes
input_text = "translate English to French: Hello, how are you?"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
Enable KV cache for faster generation:
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    use_cache=True  # Default, faster generation
)
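For fixed sequence lengths, use a static (preallocated) KV cache. The StaticCache constructor arguments differ between library versions, so letting generate() build the cache is the more portable option (supported models only):
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    cache_implementation="static"  # generate() allocates the static cache itself
)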
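Use Flash Attention for speed (requires the flash-attn package, a supported GPU, and half precision):
import torch
model = AutoModelForCausalLM.from_pretrained(
    "model-id",  # Placeholder: use a model that supports Flash Attention 2
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2"
).to("cuda")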
Creative writing:
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.8,
    top_k=50,
    top_p=0.95,
    repetition_penalty=1.2
)
Factual generation:
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=False,  # Greedy
    repetition_penalty=1.1
)
Multiple diverse outputs:
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    num_beams=5,
    num_return_sequences=5,
    temperature=1.5,
    do_sample=True  # Sampling combined with beams (beam-search multinomial sampling)
)
Long-form generation:
outputs = model.generate(
    **inputs,
    max_new_tokens=1000,
    penalty_alpha=0.6,  # Contrastive search
    top_k=4,
    repetition_penalty=1.2
)
Translation and summarization:
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    num_beams=5,
    early_stopping=True,
    no_repeat_ngram_size=3
)
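Repetitive output: Raise repetition_penalty (1.2-1.5), set no_repeat_ngram_size, or try contrastive search.
Poor quality: Use beam search (num_beams=5), or lower temperature and top_p when sampling.
Too deterministic: Enable sampling (do_sample=True) and raise temperature or top_p.
Slow generation: Keep use_cache=True, reduce num_beams, and consider a static cache or Flash Attention.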