GLM-OCR is a multimodal OCR model for complex document understanding, built on the GLM-V encoder–decoder architecture. It introduces Multi-Token Prediction (MTP) loss and stable full-task reinforcement learning to improve training efficiency, recognition accuracy, and generalization.
The model integrates the CogViT visual encoder pre-trained on large-scale image–text data, a lightweight cross-modal connector with efficient token downsampling, and a GLM-0.5B language decoder. Combined with a two-stage pipeline of layout analysis and parallel recognition based on PP-DocLayout-V3, GLM-OCR delivers robust and high-quality OCR performance across diverse document layouts.
Hardware Support: NVIDIA B200/H100/H200
For the full list of key features, please refer to the official GLM-OCR model card.
SGLang offers multiple installation methods; choose the one that best fits your hardware platform and requirements. Please refer to the official SGLang installation guide for instructions.
This section provides deployment configurations optimized for different hardware platforms and use cases.
Interactive Command Generator: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform and deployment options. You can optionally enable MTP (Multi-Token Prediction) for faster inference using EAGLE speculative decoding.
import { GLMOCRDeployment } from '/src/snippets/autoregressive/glm-ocr-deployment.jsx'
<GLMOCRDeployment />

Setting the `SGLANG_USE_CUDA_IPC_TRANSPORT=1` environment variable enables CUDA IPC for transferring multimodal features, which significantly improves TTFT (time to first token).

If you run into GPU out-of-memory errors, try lowering `--mem-fraction-static` and/or `--max-running-requests`.

For basic API usage and request examples, please refer to:
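As an illustrative sketch (use the generator above for the exact flags for your hardware; the `--mem-fraction-static 0.85` value here is an assumption, not an official recommendation), a launch command might look like:

```shell
# Launch the SGLang server for GLM-OCR (illustrative; adjust flags for your GPUs).
# SGLANG_USE_CUDA_IPC_TRANSPORT=1 enables CUDA IPC for multimodal feature transfer.
SGLANG_USE_CUDA_IPC_TRANSPORT=1 python3 -m sglang.launch_server \
  --model-path zai-org/GLM-OCR \
  --port 30000 \
  --mem-fraction-static 0.85
```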
GLM-OCR supports OCR tasks on various document types. Here's a basic example:
```python
import time

from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:30000/v1",
    timeout=3600,
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://ofasys-multimodal-wlcb-3-toshanghai.oss-accelerate.aliyuncs.com/wpf272043/keepme/image/receipt.png"
                },
            },
            {
                "type": "text",
                "text": "Please extract all text from this image.",
            },
        ],
    }
]

start = time.time()
response = client.chat.completions.create(
    model="zai-org/GLM-OCR",
    messages=messages,
    max_tokens=2048,
)
print(f"Response costs: {time.time() - start:.2f}s")
print(f"Generated text: {response.choices[0].message.content}")
```
Example Output:

```text
Response costs: 2.29s
Generated text: CINNAMON SUGAR
1 x 17,000 17,000
SUB TOTAL 17,000
GRAND TOTAL 17,000
CASH IDR 20,000
CHANGE DUE 3,000
```
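The example above fetches the image over HTTP. OpenAI-compatible servers also generally accept base64 data URLs in the `image_url` field, which is convenient for local files. A minimal sketch (the `image_to_data_url` helper is our own, not part of any library):

```python
import base64


def image_to_data_url(path: str, mime: str = "image/png") -> str:
    """Read a local image and return a base64 data URL that can be used
    in place of an HTTP URL in the `image_url` field."""
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{encoded}"


# Usage: pass the result as the image URL in the request payload, e.g.
# {"type": "image_url", "image_url": {"url": image_to_data_url("receipt.png")}}
```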
GLM-OCR excels at processing complex documents, such as those containing tables:
```python
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:30000/v1",
    timeout=3600,
)

# Example: processing a document with tables
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": "YOUR_DOCUMENT_IMAGE_URL"
                },
            },
            {
                "type": "text",
                "text": "Please extract the table content from this document and format it as markdown.",
            },
        ],
    }
]

response = client.chat.completions.create(
    model="zai-org/GLM-OCR",
    messages=messages,
    max_tokens=4096,
)
print(response.choices[0].message.content)
```
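Since the prompt above asks for a markdown table, you may want to post-process the response into structured rows. A minimal sketch (the `parse_markdown_table` helper is our own, and assumes a simple pipe-delimited table with a `---` separator row):

```python
def parse_markdown_table(md: str) -> list[list[str]]:
    """Parse a simple markdown table into a list of rows,
    skipping the header/body separator line."""
    rows = []
    for line in md.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # ignore any surrounding prose
        cells = [c.strip() for c in line.strip("|").split("|")]
        # Skip the separator row, e.g. | --- | :---: |
        if all(c and set(c) <= set(":-") for c in cells):
            continue
        rows.append(cells)
    return rows
```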
You can evaluate the model's accuracy on standard benchmarks with `lmms-eval`:
```shell
python3 -m lmms_eval \
    --model openai_compatible \
    --model_args "model_version=zai-org/GLM-OCR" \
    --tasks ocrbench \
    --batch_size 128 \
    --log_samples \
    --log_samples_suffix "openai_compatible" \
    --output_path ./logs
```
GLM-OCR achieves 94.62 on OmniDocBench V1.5, ranking #1 among all models, demonstrating state-of-the-art performance across major document understanding benchmarks.