This directory contains examples for using vLLM's chunked processing feature to handle long-text embedding inputs that exceed the model's maximum context length.
Use the provided script to start a vLLM server with chunked processing enabled:

```bash
# Basic usage (supports very long texts, up to ~3M tokens)
./service.sh

# Custom configuration with a different model
MODEL_NAME="jinaai/jina-embeddings-v3" \
MAX_EMBED_LEN=1048576 \
./service.sh

# For extremely long documents
MODEL_NAME="intfloat/multilingual-e5-large" \
MAX_EMBED_LEN=3072000 \
./service.sh
```
Run the comprehensive test client:

```bash
python client.py
```
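For a minimal request without the full test client, the standard OpenAI SDK works against this server. The sketch below assumes the `service.sh` defaults (port `31090`, `API_KEY=EMPTY`, `intfloat/multilingual-e5-large`):

```python
# Minimal sketch using the OpenAI SDK against the local vLLM server.
# Assumes service.sh defaults: port 31090, API key "EMPTY",
# model intfloat/multilingual-e5-large.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:31090/v1", api_key="EMPTY")

# An input long enough to trigger chunked processing on the server.
long_text = "Chunked processing lets vLLM embed very long documents. " * 10000

response = client.embeddings.create(
    model="intfloat/multilingual-e5-large",
    input=long_text,
)
embedding = response.data[0].embedding
print(f"Embedding dimension: {len(embedding)}")
```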
| File | Description |
|---|---|
| `service.sh` | Server startup script with chunked processing enabled |
| `client.py` | Comprehensive test client for long text embedding |
The key parameters for chunked processing are passed via `--pooler-config`:

```json
{
  "pooling_type": "auto",
  "use_activation": true,
  "enable_chunked_processing": true,
  "max_embed_len": 3072000
}
```
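For reference, a hypothetical sketch of the kind of command `service.sh` might run; the flag name follows this README, but check `service.sh` itself for the authoritative invocation:

```python
# Hypothetical launcher sketch; service.sh is the authoritative version.
import json
import subprocess

pooler_config = {
    "pooling_type": "auto",
    "use_activation": True,
    "enable_chunked_processing": True,
    "max_embed_len": 3072000,
}

subprocess.run(
    [
        "vllm", "serve", "intfloat/multilingual-e5-large",
        "--pooler-config", json.dumps(pooler_config),
        "--port", "31090",
    ],
    check=True,
)
```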
!!! note
    `pooling_type` sets the model's own pooling strategy within each chunk. When the input exceeds the model's native maximum length, cross-chunk aggregation automatically uses the MEAN strategy.
When the input exceeds the model's native maximum length, chunked processing combines the per-chunk embeddings using MEAN aggregation:
| Component | Behavior | Description |
|---|---|---|
| Within chunks | Model's native pooling | Uses the model's configured pooling strategy |
| Cross-chunk aggregation | Always MEAN | Weighted averaging based on chunk token counts |
| Performance | Optimal | All chunks processed for complete semantic coverage |
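Conceptually, the cross-chunk step is a token-count-weighted mean. The sketch below is illustrative only, not vLLM's actual implementation:

```python
import numpy as np

def mean_aggregate(chunk_embeddings: list[list[float]], token_counts: list[int]) -> np.ndarray:
    """Weighted-average per-chunk embeddings by chunk token count (illustrative)."""
    embs = np.asarray(chunk_embeddings, dtype=np.float64)   # (n_chunks, dim)
    weights = np.asarray(token_counts, dtype=np.float64)    # (n_chunks,)
    pooled = (weights[:, None] * embs).sum(axis=0) / weights.sum()
    return pooled / np.linalg.norm(pooled)  # re-normalize the combined vector

# e.g. two full 4096-token chunks and a final partial chunk of 1808 tokens:
# mean_aggregate([e1, e2, e3], [4096, 4096, 1808])
```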
| Variable | Default | Description |
|---|---|---|
| `MODEL_NAME` | `intfloat/multilingual-e5-large` | Embedding model to use (multiple models supported) |
| `PORT` | `31090` | Server port |
| `GPU_COUNT` | `1` | Number of GPUs to use |
| `MAX_EMBED_LEN` | `3072000` | Maximum embedding input length (supports very long documents) |
| `POOLING_TYPE` | `auto` | Model's native pooling type: `auto`, `MEAN`, `CLS`, `LAST` (affects within-chunk pooling only, not cross-chunk aggregation) |
| `API_KEY` | `EMPTY` | API key for authentication |
The `max_embed_len` parameter provides:

- Acceptance of inputs longer than `max_model_len` without environment variables
- Automatic chunking of inputs that exceed `max_position_embeddings` to maintain semantic integrity
- A replacement for the `VLLM_ALLOW_LONG_MAX_MODEL_LEN` environment variable

With `MAX_EMBED_LEN=3072000`, you can process extremely long documents of up to roughly 3M tokens. Performance characteristics of chunked processing:
| Aspect | Behavior | Performance |
|---|---|---|
| Chunk Processing | All chunks processed with native pooling | Consistent with input length |
| Cross-chunk Aggregation | MEAN weighted averaging | Minimal overhead |
| Memory Usage | Proportional to number of chunks | Moderate, scalable |
| Semantic Quality | Complete text coverage | Optimal for long documents |
The test client demonstrates embedding inputs far beyond the model's native context window. If something goes wrong, check the following common issues.
**Chunked processing not enabled:**

```text
ValueError: This model's maximum position embeddings length is 4096 tokens...
```

Solution: ensure `enable_chunked_processing: true` is set in the pooler config.
**Input exceeds `max_embed_len`:**

```text
ValueError: This model's maximum embedding input length is 3072000 tokens...
```

Solution: increase `max_embed_len` in the pooler config or reduce the input length.
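A pre-flight length check on the client side can avoid this error. A sketch, assuming the `transformers` tokenizer for the served model is available locally:

```python
# Sketch: count tokens before sending, so oversized inputs fail fast client-side.
from transformers import AutoTokenizer

MAX_EMBED_LEN = 3_072_000  # keep in sync with max_embed_len in the pooler config

tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-large")
long_text = "..."  # the document to embed

n_tokens = len(tokenizer.encode(long_text, add_special_tokens=False))
if n_tokens > MAX_EMBED_LEN:
    raise ValueError(f"Input has {n_tokens} tokens; limit is {MAX_EMBED_LEN}")
```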
**Memory errors:**

```text
RuntimeError: CUDA out of memory
```

Solution: reduce the chunk size by adjusting the model's `max_position_embeddings`, or use fewer GPUs.
**Slow processing:** expected behavior; long texts require multiple inference calls (one per chunk), so latency grows with input length.
Server logs show chunked processing activity:

```text
INFO: Input length 150000 exceeds max_position_embeddings 4096, will use chunked processing
INFO: Split input of 150000 tokens into 37 chunks (max_chunk_size: 4096)
```
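The chunk count in the second log line is simply the ceiling division of the input length by the model's maximum position embeddings:

```python
import math

max_position_embeddings = 4096  # the model's native context window
input_tokens = 150_000

print(math.ceil(input_tokens / max_position_embeddings))  # 37, as in the log above
```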
To extend chunked processing support to other embedding models, start the server with a different `MODEL_NAME` and a compatible `POOLING_TYPE`, as in the usage examples above.
In summary, the new `max_embed_len` parameter provides:

- No dependence on the `VLLM_ALLOW_LONG_MAX_MODEL_LEN` environment variable
- Acceptance of inputs longer than `max_model_len`, up to `max_embed_len`