docs/getting_started/rtx.md
Whether you're an AI enthusiast, researcher, or developer working with document processing, this guide will help you unlock the full potential of your NVIDIA RTX GPU with Docling.
By leveraging GPU acceleration, you can achieve up to 6x speedup compared to CPU-only processing. This dramatic performance improvement makes GPU acceleration especially valuable for processing large batches of documents, handling high-throughput document conversion workflows, or experimenting with advanced document understanding models.
<!-- TBA. Performance improvement figure. -->

Before setting up GPU acceleration, ensure you have the components covered in the steps below: up-to-date NVIDIA GPU drivers, the CUDA Toolkit, cuDNN, and a CUDA-enabled PyTorch build.
First, ensure you have the latest NVIDIA GPU drivers installed:
Verify the installation:
```bash
nvidia-smi
```
This command should display your GPU information and driver version.
CUDA is NVIDIA's parallel computing platform required for GPU acceleration.
Follow the official installation guide for your operating system at [NVIDIA CUDA Downloads](https://developer.nvidia.com/cuda-downloads). The installer will guide you through the process and automatically set up the required environment variables.
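As an optional sanity check, you can confirm from Python that the CUDA compiler is discoverable after installation. This is a minimal sketch; note that `CUDA_PATH` is typically set by the Windows installer and may legitimately be unset on Linux:

```python
import os
import shutil

# Optional sanity check: the CUDA compiler should be discoverable after install.
print(shutil.which("nvcc") or "nvcc not found on PATH")
# CUDA_PATH is typically set by the Windows installer; it may be unset on Linux.
print(os.environ.get("CUDA_PATH", "CUDA_PATH not set"))
```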
cuDNN provides optimized implementations for deep learning operations.
Follow the official installation guide at [NVIDIA cuDNN Downloads](https://developer.nvidia.com/cudnn). The guide provides detailed instructions for all supported platforms.
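Once PyTorch is installed (next step), you can confirm that it sees cuDNN. This is a quick sketch using PyTorch's built-in introspection rather than an official cuDNN verification procedure:

```python
import torch

# Requires the CUDA-enabled PyTorch build installed in the next step.
print(f"cuDNN available: {torch.backends.cudnn.is_available()}")
print(f"cuDNN version: {torch.backends.cudnn.version()}")
```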
To use GPU acceleration with Docling, you need to install PyTorch with CUDA support from PyTorch's dedicated package index using the `--index-url` flag:
```bash
# For CUDA 12.8 (current default for PyTorch)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

# For CUDA 13.0
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130
```
!!! note

    The `--index-url` parameter is crucial as it ensures you get the CUDA-enabled version of PyTorch instead of the CPU-only version.

For other CUDA versions and installation options, refer to the [PyTorch Installation Matrix](https://pytorch.org/get-started/locally/).
Verify PyTorch CUDA installation:
```python
import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")
print(f"GPU device: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'None'}")
```
Install Docling with all dependencies:
```bash
pip install docling
```
That's it! Docling will automatically detect and use your RTX GPU when available. No additional configuration is required for basic usage.
```python
from docling.document_converter import DocumentConverter

# Docling automatically uses GPU when available
converter = DocumentConverter()
result = converter.convert("document.pdf")
```
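From there you can export the converted document, for example to Markdown (using the `result` from the snippet above):

```python
# Export the converted document to Markdown.
print(result.document.export_to_markdown())
```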
For optimal GPU performance with large document batches, you can adjust batch sizes and explicitly configure the accelerator:
```python
from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import ThreadedPdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.threaded_standard_pdf_pipeline import ThreadedStandardPdfPipeline

# Explicitly configure GPU acceleration
accelerator_options = AcceleratorOptions(
    device=AcceleratorDevice.CUDA,  # Use CUDA for NVIDIA GPUs
)

# Configure pipeline for optimal GPU performance
pipeline_options = ThreadedPdfPipelineOptions(
    accelerator_options=accelerator_options,
    ocr_batch_size=64,  # Increase batch size for GPU
    layout_batch_size=64,  # Increase batch size for GPU
    table_batch_size=4,
)

# Create converter with custom settings (threaded pipeline + custom options)
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=ThreadedStandardPdfPipeline,
            pipeline_options=pipeline_options,
        )
    }
)

# Convert documents
result = converter.convert("document.pdf")
```
Adjust batch sizes based on your GPU memory (see Performance Optimization Tips below).
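As a rough starting point, you can pick batch sizes from the amount of VRAM PyTorch reports. The thresholds below are illustrative assumptions, not Docling recommendations; tune them for your workload:

```python
import torch

# Illustrative heuristic only: scale batch size with total VRAM.
if torch.cuda.is_available():
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    batch_size = 64 if vram_gb >= 16 else 32 if vram_gb >= 8 else 16
    print(f"{vram_gb:.0f} GB VRAM -> trying batch size {batch_size}")
```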
For maximum performance with Vision Language Models (VLM), you can run a local inference server on your RTX GPU. This approach provides significantly better throughput than inline VLM processing.
vLLM provides the best performance for GPU-accelerated VLM inference. Start the vLLM server with optimized parameters:
```bash
vllm serve ibm-granite/granite-docling-258M \
    --host 127.0.0.1 --port 8000 \
    --max-num-seqs 512 \
    --max-num-batched-tokens 8192 \
    --enable-chunked-prefill \
    --gpu-memory-utilization 0.9
```
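To confirm the server is up before pointing Docling at it, you can query vLLM's OpenAI-compatible model listing endpoint. A quick sketch using only the standard library; the host and port match the command above:

```python
import json
import urllib.request

# List models served by the local vLLM instance (OpenAI-compatible API).
with urllib.request.urlopen("http://127.0.0.1:8000/v1/models") as resp:
    models = json.load(resp)
print([m["id"] for m in models["data"]])
```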
On Windows, you can use llama-server from llama.cpp for GPU-accelerated VLM inference:
```powershell
llama-server.exe `
    --hf-repo ibm-granite/granite-docling-258M-GGUF `
    -cb `
    -ngl -1 `
    --port 8000 `
    --context-shift `
    -np 16 -c 131072
```
!!! note "Performance Comparison" vLLM delivers approximately 4x better performance compared to llama-server. For Windows users seeking maximum performance, consider running vLLM via WSL2 (Windows Subsystem for Linux). See vLLM on RTX 5090 via Docker for detailed WSL2 setup instructions.
Once your inference server is running, configure Docling to use it:
```python
from docling.datamodel import vlm_model_specs
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import VlmPipelineOptions
from docling.datamodel.settings import settings
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline

BATCH_SIZE = 64

# Configure VLM options for the remote inference server
vlm_options = vlm_model_specs.GRANITEDOCLING_VLLM_API
vlm_options.concurrency = BATCH_SIZE
# When running with llama.cpp (llama-server), use the different model name:
# vlm_options.params["model"] = "ibm-granite_granite-docling-258M-GGUF_granite-docling-258M-BF16.gguf"

# Set page batch size to match or exceed concurrency
settings.perf.page_batch_size = BATCH_SIZE

# Create converter with VLM pipeline
pipeline_options = VlmPipelineOptions(vlm_options=vlm_options, enable_remote_services=True)
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_cls=VlmPipeline, pipeline_options=pipeline_options)
    }
)
```
For more details on VLM pipeline configuration, see the GPU Support Guide.
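With concurrency raised, throughput comes from keeping many pages in flight at once. Below is a minimal batch-conversion sketch, assuming the `converter` from the snippet above; the directory names are placeholders:

```python
from pathlib import Path

# Convert every PDF in a folder; convert_all keeps many pages in flight,
# letting the inference server batch requests and keep the GPU busy.
sources = sorted(Path("input_pdfs").glob("*.pdf"))
output_dir = Path("output")
output_dir.mkdir(exist_ok=True)

for result in converter.convert_all(sources):
    out_file = output_dir / (result.input.file.stem + ".md")
    out_file.write_text(result.document.export_to_markdown())
```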
Adjust batch sizes based on your GPU memory. You can monitor memory usage from PyTorch while tuning:
```python
import torch

# Check GPU memory
if torch.cuda.is_available():
    print(f"GPU Memory allocated: {torch.cuda.memory_allocated(0) / 1024**3:.2f} GB")
    print(f"GPU Memory reserved: {torch.cuda.memory_reserved(0) / 1024**3:.2f} GB")
```
If you encounter out-of-memory errors:

- Reduce the batch sizes in your `pipeline_options`
- Clear the GPU cache between runs:

```python
import torch

# Release cached GPU memory held by PyTorch.
torch.cuda.empty_cache()
```
If `torch.cuda.is_available()` returns `False`:

- Check your driver installation with `nvidia-smi`
- Check your CUDA toolkit installation with `nvcc --version`
- Make sure you installed the CUDA-enabled PyTorch build (see the note above)

If GPU acceleration doesn't improve performance:

- Watch GPU utilization during conversion with `nvidia-smi -l 1`
- Confirm that `torch.cuda.is_available()` returns `True` in the environment where Docling runs