# GLM-OCR
*This model was released on {release_date} and added to Hugging Face Transformers on 2026-01-27.*
GLM-OCR is a multimodal OCR (Optical Character Recognition) model from Z.ai designed for complex document understanding. The model combines a CogViT visual encoder pre-trained on large-scale image-text data, a lightweight cross-modal connector with efficient token downsampling, and a GLM-0.5B language decoder.
Key features of GLM-OCR include its pre-trained visual encoder, efficient cross-modal token downsampling, and compact 0.5B-parameter language decoder.
This model was contributed by the zai-org team. The original code can be found here.
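
To see how these pieces map onto the configuration classes documented below, you can inspect the composite configuration. The snippet below is a minimal sketch: it assumes [`GlmOcrConfig`] nests `vision_config` and `text_config` sub-configurations mirroring [`GlmOcrVisionConfig`] and [`GlmOcrTextConfig`].

```python
from transformers import AutoConfig

# Composite config; the vision_config/text_config attribute names are assumed here
config = AutoConfig.from_pretrained("zai-org/GLM-OCR")
print(config.vision_config)  # CogViT visual encoder hyperparameters
print(config.text_config)    # GLM-0.5B language decoder hyperparameters
```

The example below shows how to run text recognition on a single image.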
```python
from transformers import AutoProcessor, GlmOcrForConditionalGeneration

model_id = "zai-org/GLM-OCR"
processor = AutoProcessor.from_pretrained(model_id)
model = GlmOcrForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"},
            {"type": "text", "text": "Text Recognition:"},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(output[0], skip_special_tokens=True))
```
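
The decoded string includes the prompt as well as the model's answer. A common pattern is to slice off the prompt tokens before decoding so only the newly generated text is printed:

```python
# Keep only the tokens generated after the prompt
generated = output[0][inputs["input_ids"].shape[1]:]
print(processor.decode(generated, skip_special_tokens=True))
```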
The model supports batching multiple images for efficient processing.
```python
from transformers import AutoProcessor, GlmOcrForConditionalGeneration

model_id = "zai-org/GLM-OCR"
processor = AutoProcessor.from_pretrained(model_id)
model = GlmOcrForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
)

# First document
message1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"},
            {"type": "text", "text": "Text Recognition:"},
        ],
    }
]

# Second document
message2 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "Text Recognition:"},
        ],
    }
]

messages = [message1, message2]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
    padding=True,
).to(model.device)

output = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output, skip_special_tokens=True))
```
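
Decoder-only models expect left-padded batches during generation so that each prompt sits directly next to its generated tokens. Processors usually default to this, but if the batched outputs look garbled, the padding side can be set explicitly before applying the chat template (a small sketch, assuming the processor exposes its tokenizer as `processor.tokenizer`):

```python
# Batched generation with decoder-only models requires left padding
processor.tokenizer.padding_side = "left"
```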
GLM-OCR supports Flash Attention 2 for faster inference. First, install the latest version of Flash Attention:
```bash
pip install -U flash-attn --no-build-isolation
```
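
With the package installed, the built-in implementation can be requested directly at load time. Flash Attention only runs in half precision, so the sketch below loads the model in bfloat16:

```python
import torch
from transformers import GlmOcrForConditionalGeneration

# Flash Attention kernels only support fp16/bf16 inputs
model = GlmOcrForConditionalGeneration.from_pretrained(
    "zai-org/GLM-OCR",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
```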
Alternatively, load the model with one of the supported attention kernels from the kernels-community organization on the Hub:
```python
from transformers import GlmOcrForConditionalGeneration

model = GlmOcrForConditionalGeneration.from_pretrained(
    "zai-org/GLM-OCR",
    # other options: kernels-community/vllm-flash-attn3, kernels-community/paged-attention
    attn_implementation="kernels-community/flash-attn2",
    device_map="auto",
)
```
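
Hub kernels are fetched at load time through the `kernels` package, so install it first with `pip install kernels` if it isn't already available.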
## GlmOcrConfig

[[autodoc]] GlmOcrConfig

## GlmOcrVisionConfig

[[autodoc]] GlmOcrVisionConfig

## GlmOcrTextConfig

[[autodoc]] GlmOcrTextConfig

## GlmOcrVisionModel

[[autodoc]] GlmOcrVisionModel

## GlmOcrTextModel

[[autodoc]] GlmOcrTextModel

## GlmOcrModel

[[autodoc]] GlmOcrModel

## GlmOcrForConditionalGeneration

[[autodoc]] GlmOcrForConditionalGeneration