import { DeepSeekOCR2Deployment } from '/src/snippets/autoregressive/deepseek-ocr-v2-deployment.jsx';
DeepSeek-OCR-2 is DeepSeek's next-generation OCR (Optical Character Recognition) model, building on DeepSeek-OCR with improved accuracy and broader document understanding capabilities. The model is optimized for high-accuracy text extraction from images across a wide variety of document types and formats.
Key Features:
Available Models:
License: To use DeepSeek-OCR-2, you must agree to DeepSeek's Community License. See LICENSE for details.
For more details, please refer to the official DeepSeek-OCR-2 repository.
Please refer to the official SGLang installation guide for installation instructions.
For SGLang CPU installation, please refer to the CPU version installation guide.
This section provides deployment configurations optimized for different hardware platforms and use cases.
Interactive Command Generator: Use the configuration selector below to generate the deployment command for your hardware platform, quantization method, and deployment strategy. SGLang supports serving DeepSeek-OCR-2 on NVIDIA H200 and B200 GPUs, AMD MI300X, MI325X, and MI355X GPUs, and Intel Xeon CPUs.
<DeepSeekOCR2Deployment />
Note: DeepSeek-OCR-2 has ~3B parameters and easily fits on a single modern GPU. For low-latency serving, no model parallelism is needed. For high-throughput requirements, consider using data parallelism with the SGLang Model Gateway; see DP, DPA and SGLang DP Router for more details.
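For scripted deployments, the kind of command the selector produces can be assembled programmatically. The following is a minimal sketch, not the selector's actual logic: `build_launch_command` is a hypothetical helper, and it only uses flags that appear in the launch commands elsewhere on this page (`--model-path`, `--enable-multimodal`, `--host`, `--port`, `--tp`, `--dp`). Platform-specific or quantization flags are not covered.

```python
import shlex


def build_launch_command(
    model_path: str = "deepseek-ai/DeepSeek-OCR-2",
    host: str = "0.0.0.0",
    port: int = 30000,
    tp: int = 1,
    dp: int = 1,
) -> str:
    """Assemble an `sglang serve` command line (hypothetical helper)."""
    args = [
        "sglang", "serve",
        "--model-path", model_path,
        "--enable-multimodal",
        "--host", host,
        "--port", str(port),
    ]
    # Emit parallelism flags only when they differ from the single-GPU default,
    # since a ~3B model does not need model parallelism.
    if tp > 1:
        args += ["--tp", str(tp)]
    if dp > 1:
        args += ["--dp", str(dp)]
    # shlex.join quotes any argument that needs it for a POSIX shell.
    return shlex.join(args)


single_gpu_command = build_launch_command()
data_parallel_command = build_launch_command(dp=4)
print(single_gpu_command)
print(data_parallel_command)
```

The resulting string can be pasted into a shell or passed to a process launcher after splitting it again with `shlex.split`.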
--dist-timeout 3600.
Refer to the Notes part in the serving engine launching section in the SGLang CPU server document to better understand how to configure the arguments, especially the NUMA binding settings.
OpenAI-compatible request example
import requests

# OpenAI-compatible chat endpoint of the local SGLang server.
url = "http://localhost:30000/v1/chat/completions"
data = {
    "model": "deepseek-ai/DeepSeek-OCR-2",
    "messages": [
        {
            "role": "user",
            "content": [
                # The text part carries the OCR prompt; the image part points at the document to read.
                {"type": "text", "text": "<image>\n<|grounding|>Convert the document to markdown."},
                {"type": "image_url", "image_url": {"url": "https://example.com/your_image.jpg"}},
            ],
        }
    ],
    "max_tokens": 512,
}
response = requests.post(url, json=data)
print(response.text)
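When the image is a local file rather than a public URL, one common approach with OpenAI-compatible endpoints is to embed it as a base64 data URL. The sketch below assumes the server accepts `data:` URLs in the `image_url` field and that the response follows the standard chat-completions layout (`choices[0].message.content`); the helper names `image_to_data_url`, `build_ocr_request`, and `extract_text` are illustrative, not part of SGLang.

```python
import base64


def image_to_data_url(image_bytes: bytes, mime_type: str = "image/jpeg") -> str:
    """Encode raw image bytes as a data URL (assumed to be accepted by the server)."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_ocr_request(image_url: str, prompt: str, max_tokens: int = 512) -> dict:
    """Build the same payload shape as the request example above."""
    return {
        "model": "deepseek-ai/DeepSeek-OCR-2",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
        "max_tokens": max_tokens,
    }


def extract_text(response_json: dict) -> str:
    """Pull the generated text out of a chat-completions style response."""
    return response_json["choices"][0]["message"]["content"]


# Offline demonstration with placeholder bytes and a hand-written response.
demo_url = image_to_data_url(b"abc")
demo_payload = build_ocr_request(demo_url, "<image>\nFree OCR.")
demo_response = {"choices": [{"message": {"role": "assistant", "content": "Hello"}}]}
print(demo_url)
print(extract_text(demo_response))
```

In real use, read the bytes with `open(path, "rb").read()`, post the payload with `requests.post(url, json=payload)`, and pass `response.json()` to `extract_text`.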
Reference
The following prompts are recommended by the official model card.
Structured document conversion (extracts text while preserving layout):
<image>
<|grounding|>Convert the document to markdown.
Free-form OCR (extracts text without layout information):
<image>
Free OCR.
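To avoid retyping the prompts, they can be kept in a small lookup table. This is only a convenience sketch: the mode names `markdown` and `free` and the helper `get_prompt` are made up for illustration, while the prompt strings themselves are the two shown above.

```python
# The two recommended prompts, keyed by a hypothetical mode name.
PROMPTS = {
    "markdown": "<image>\n<|grounding|>Convert the document to markdown.",
    "free": "<image>\nFree OCR.",
}


def get_prompt(mode: str) -> str:
    """Return the prompt text for a mode, failing loudly on unknown modes."""
    try:
        return PROMPTS[mode]
    except KeyError:
        known = ", ".join(sorted(PROMPTS))
        raise ValueError(f"unknown mode {mode!r}; expected one of: {known}") from None


print(get_prompt("markdown"))
print(get_prompt("free"))
```

The returned string goes into the `text` part of the request's `content` list.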
Test Environment:
We use SGLang's built-in benchmarking tool, sglang.bench_serving, to evaluate performance on the ShareGPT_Vicuna_unfiltered dataset, which contains real conversation data and therefore reflects practical usage better than synthetic prompts. The benchmark commands pass --random-input-len 1024 and --random-output-len 1024, but the reported totals show much shorter requests on average (roughly 200 to 300 input tokens each), so the results appear to reflect the dataset's natural prompt lengths rather than fixed 1024-token inputs and outputs. The reported input tokens are all text tokens, so these numbers do not include image processing. For more details on how to perform evaluation, see Evaluating New Models with SGLang.
sglang serve \
  --model-path deepseek-ai/DeepSeek-OCR-2 \
  --enable-multimodal \
  --host 0.0.0.0 \
  --port 30000
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 0.0.0.0 \
  --port 30000 \
  --model deepseek-ai/DeepSeek-OCR-2 \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --num-prompts 10 \
  --max-concurrency 1
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 1
Successful requests: 10
Benchmark duration (s): 3.54
Total input tokens: 1972
Total input text tokens: 1972
Total generated tokens: 2784
Total generated tokens (retokenized): 2710
Request throughput (req/s): 2.83
Input token throughput (tok/s): 557.53
Output token throughput (tok/s): 787.10
Peak output token throughput (tok/s): 818.00
Peak concurrent requests: 5
Total token throughput (tok/s): 1344.63
Concurrency: 1.00
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 352.69
Median E2E Latency (ms): 392.34
P90 E2E Latency (ms): 540.64
P99 E2E Latency (ms): 639.01
---------------Time to First Token----------------
Mean TTFT (ms): 18.08
Median TTFT (ms): 16.57
P99 TTFT (ms): 25.67
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 1.18
Median TPOT (ms): 1.21
P99 TPOT (ms): 1.22
---------------Inter-Token Latency----------------
Mean ITL (ms): 1.21
Median ITL (ms): 1.21
P95 ITL (ms): 1.28
P99 ITL (ms): 1.44
Max ITL (ms): 4.32
==================================================
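To track results like these across runs, the report can be turned into a dictionary. The parser below is a rough sketch that assumes the `Name: value` line layout shown above; `parse_benchmark_report` is a hypothetical helper, not part of `sglang.bench_serving`. The sample is a shortened copy of the report.

```python
def parse_benchmark_report(report: str) -> dict:
    """Map each 'Name: value' line to a float where possible, else a string."""
    metrics = {}
    for line in report.splitlines():
        # Split on the last colon so names containing parentheses stay intact.
        name, sep, value = line.rpartition(":")
        if not sep:
            continue  # banner and section-divider lines contain no colon
        try:
            metrics[name.strip()] = float(value)
        except ValueError:
            metrics[name.strip()] = value.strip()
    return metrics


sample = """\
============ Serving Benchmark Result ============
Backend: sglang
Successful requests: 10
Benchmark duration (s): 3.54
Total generated tokens: 2784
Output token throughput (tok/s): 787.10
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 352.69
"""

metrics = parse_benchmark_report(sample)
# Sanity check: throughput should be close to tokens divided by duration.
recomputed = metrics["Total generated tokens"] / metrics["Benchmark duration (s)"]
print(metrics["Backend"], metrics["Output token throughput (tok/s)"], round(recomputed, 1))
```

The recomputed throughput differs slightly from the reported one because the printed duration is rounded to two decimals.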
sglang serve \
  --model-path deepseek-ai/DeepSeek-OCR-2 \
  --enable-multimodal \
  --tp 1 \
  --ep 1 \
  --dp 1 \
  --enable-dp-attention \
  --host 0.0.0.0 \
  --port 30000
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 0.0.0.0 \
  --port 30000 \
  --model deepseek-ai/DeepSeek-OCR-2 \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --num-prompts 1000 \
  --max-concurrency 100
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 100
Successful requests: 1000
Benchmark duration (s): 14.79
Total input tokens: 301698
Total input text tokens: 301698
Total generated tokens: 188375
Total generated tokens (retokenized): 185236
Request throughput (req/s): 67.63
Input token throughput (tok/s): 20402.54
Output token throughput (tok/s): 12738.99
Peak output token throughput (tok/s): 17508.00
Peak concurrent requests: 187
Total token throughput (tok/s): 33141.53
Concurrency: 86.87
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 1284.50
Median E2E Latency (ms): 866.07
P90 E2E Latency (ms): 3027.32
P99 E2E Latency (ms): 5490.63
---------------Time to First Token----------------
Mean TTFT (ms): 86.08
Median TTFT (ms): 50.09
P99 TTFT (ms): 613.92
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 7.79
Median TPOT (ms): 6.54
P99 TPOT (ms): 50.10
---------------Inter-Token Latency----------------
Mean ITL (ms): 6.42
Median ITL (ms): 4.64
P95 ITL (ms): 23.65
P99 ITL (ms): 39.62
Max ITL (ms): 452.65
==================================================
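A quick way to read the two reports side by side is to derive per-request averages and the throughput-versus-latency trade-off from the printed totals. The arithmetic below uses only numbers copied from the reports above; the dictionary layout and the helper `per_request` are illustrative.

```python
# Figures copied from the two benchmark reports above.
low = {"requests": 10, "input_tokens": 1972, "output_tokens": 2784,
       "output_tok_s": 787.10, "mean_tpot_ms": 1.18}
high = {"requests": 1000, "input_tokens": 301698, "output_tokens": 188375,
        "output_tok_s": 12738.99, "mean_tpot_ms": 7.79}


def per_request(run: dict) -> tuple:
    """Average input and output tokens per successful request."""
    return (run["input_tokens"] / run["requests"],
            run["output_tokens"] / run["requests"])


for label, run in (("max concurrency 1", low), ("max concurrency 100", high)):
    avg_in, avg_out = per_request(run)
    print(f"{label}: {avg_in:.1f} input / {avg_out:.1f} output tokens per request")

# How much aggregate output throughput grows, and how much slower each
# individual token arrives, when moving from concurrency 1 to 100.
throughput_gain = high["output_tok_s"] / low["output_tok_s"]
tpot_slowdown = high["mean_tpot_ms"] / low["mean_tpot_ms"]
print(f"output throughput gain: {throughput_gain:.2f}x")
print(f"mean TPOT slowdown: {tpot_slowdown:.2f}x")
```

Under these figures, aggregate output throughput grows far more than per-token latency does, which is the usual motivation for batching many concurrent requests on one GPU.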