# Long Text Embedding with Chunked Processing

This directory contains examples of using vLLM's chunked processing feature to embed long text that exceeds the model's maximum context length.

## 🚀 Quick Start

### Start the Server

Use the provided script to start a vLLM server with chunked processing enabled:

```bash
# Basic usage (supports very long texts up to ~3M tokens)
./service.sh

# Custom configuration with different models
MODEL_NAME="jinaai/jina-embeddings-v3" \
MAX_EMBED_LEN=1048576 \
./service.sh

# For extremely long documents
MODEL_NAME="intfloat/multilingual-e5-large" \
MAX_EMBED_LEN=3072000 \
./service.sh
```

### Test Long Text Embedding

Run the comprehensive test client:

```bash
python client.py
```
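`client.py` exercises the server through the standard OpenAI-compatible `/v1/embeddings` endpoint. As a minimal sketch of what such a request looks like (the model name and port are the defaults from `service.sh`; only the payload is built here, no server is contacted):

```python
import json

# Stand-in for a very long document; after tokenization this could be
# far longer than the model's 4096-token native limit.
long_text = "vLLM chunked processing example. " * 1000

# Request body for POST http://localhost:31090/v1/embeddings
# (send with an "Authorization: Bearer EMPTY" header).
payload = {
    "model": "intfloat/multilingual-e5-large",
    "input": long_text,
}
body = json.dumps(payload)
```

Any OpenAI-compatible client (for example, the official `openai` Python package pointed at `http://localhost:31090/v1`) can send the same request; chunking happens transparently on the server side.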

๐Ÿ“ Files

FileDescription
service.shServer startup script with chunked processing enabled
client.pyComprehensive test client for long text embedding

โš™๏ธ Configuration

Server Configuration

The key parameters for chunked processing are in the --pooler-config:

json
{
  "pooling_type": "auto",
  "use_activation": true,
  "enable_chunked_processing": true,
  "max_embed_len": 3072000
}

!!! note
    `pooling_type` sets the model's own pooling strategy for processing within each chunk. Cross-chunk aggregation automatically uses the MEAN strategy when input exceeds the model's native maximum length.

### Chunked Processing Behavior

Chunked processing uses MEAN aggregation to combine chunk results when input exceeds the model's native maximum length:

| Component | Behavior | Description |
|-----------|----------|-------------|
| Within chunks | Model's native pooling | Uses the model's configured pooling strategy |
| Cross-chunk aggregation | Always MEAN | Weighted averaging based on chunk token counts |
| Performance | Optimal | All chunks processed for complete semantic coverage |
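The cross-chunk MEAN step can be pictured as a token-count-weighted average of the per-chunk embeddings. A minimal sketch, with plain Python lists standing in for the model's pooled chunk outputs (the function name is illustrative, not a vLLM internal):

```python
def weighted_mean(chunk_embeddings, chunk_token_counts):
    """Combine per-chunk embeddings, weighting each chunk by its token count."""
    dim = len(chunk_embeddings[0])
    total_tokens = sum(chunk_token_counts)
    combined = [0.0] * dim
    for emb, count in zip(chunk_embeddings, chunk_token_counts):
        for i in range(dim):
            combined[i] += emb[i] * (count / total_tokens)
    return combined

# Two chunks: a full 4096-token chunk and a 1024-token remainder.
chunks = [[1.0, 0.0], [0.0, 1.0]]
counts = [4096, 1024]
print(weighted_mean(chunks, counts))  # [0.8, 0.2]
```

Weighting by token count means a short trailing chunk cannot dominate the final embedding the way it would under a plain (unweighted) mean.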

### Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `MODEL_NAME` | `intfloat/multilingual-e5-large` | Embedding model to use (supports multiple models) |
| `PORT` | `31090` | Server port |
| `GPU_COUNT` | `1` | Number of GPUs to use |
| `MAX_EMBED_LEN` | `3072000` | Maximum embedding input length (supports very long documents) |
| `POOLING_TYPE` | `auto` | Model's native pooling type: `auto`, `MEAN`, `CLS`, `LAST` (affects only within-chunk pooling, not cross-chunk aggregation) |
| `API_KEY` | `EMPTY` | API key for authentication |

## 🔧 How It Works

1. Enhanced input validation: `max_embed_len` allows accepting inputs longer than `max_model_len` without requiring environment variables
2. Smart chunking: Text is split based on `max_position_embeddings` to maintain semantic integrity
3. Unified processing: Each chunk is processed separately through the model using its configured pooling strategy
4. MEAN aggregation: When input exceeds the model's native length, results are combined by token-count-weighted averaging across all chunks
5. Consistent output: Final embeddings keep the same dimensionality as standard processing
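The chunking step above amounts to slicing the token sequence into consecutive windows no longer than `max_position_embeddings`. A rough sketch of that splitting, with token IDs as a plain list (the function name is illustrative, not vLLM's actual code):

```python
def split_into_chunks(token_ids, max_chunk_size):
    """Split a token sequence into consecutive chunks of at most max_chunk_size."""
    return [
        token_ids[start:start + max_chunk_size]
        for start in range(0, len(token_ids), max_chunk_size)
    ]

tokens = list(range(10000))       # stand-in for a tokenized document
chunks = split_into_chunks(tokens, 4096)
print([len(c) for c in chunks])   # [4096, 4096, 1808]
```

Every chunk except possibly the last is exactly `max_chunk_size` tokens, and no token is dropped, which is what lets the aggregation step cover the complete text.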

### Input Length Handling

- Within `max_embed_len`: Input is accepted and processed (up to 3M+ tokens)
- Exceeds `max_position_embeddings`: Chunked processing is triggered automatically
- Exceeds `max_embed_len`: Input is rejected with a clear error message
- No environment variables required: Works without `VLLM_ALLOW_LONG_MAX_MODEL_LEN`
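The cases above form a simple decision rule. A hedged sketch, with illustrative function and return names rather than vLLM's actual code:

```python
def classify_input(num_tokens, max_position_embeddings, max_embed_len):
    """Decide how an embedding request of num_tokens is handled."""
    if num_tokens > max_embed_len:
        return "reject"     # clear error returned to the client
    if num_tokens > max_position_embeddings:
        return "chunked"    # chunked processing is triggered
    return "single"         # normal single-pass processing

# Using this example's defaults (4096-token model, max_embed_len=3072000):
print(classify_input(1_000, 4096, 3_072_000))      # single
print(classify_input(150_000, 4096, 3_072_000))    # chunked
print(classify_input(5_000_000, 4096, 3_072_000))  # reject
```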

### Extreme Long Text Support

With `MAX_EMBED_LEN=3072000`, you can process:

- Academic papers: Full research papers with references
- Legal documents: Complete contracts and legal texts
- Books: Entire chapters or small books
- Code repositories: Large codebases and documentation

## 📊 Performance Characteristics

### Chunked Processing Performance

| Aspect | Behavior | Performance |
|--------|----------|-------------|
| Chunk processing | All chunks processed with native pooling | Scales with input length |
| Cross-chunk aggregation | MEAN weighted averaging | Minimal overhead |
| Memory usage | Proportional to number of chunks | Moderate, scalable |
| Semantic quality | Complete text coverage | Optimal for long documents |

## 🧪 Test Cases

The test client demonstrates:

- ✅ Short text: Normal processing (baseline)
- ✅ Medium text: Single-chunk processing
- ✅ Long text: Multi-chunk processing with aggregation
- ✅ Very long text: Many-chunk processing
- ✅ Extremely long text: Document-level processing (100K+ tokens)
- ✅ Batch processing: Mixed-length inputs in one request
- ✅ Consistency: Reproducible results across runs

๐Ÿ› Troubleshooting

Common Issues

  1. Chunked processing not enabled:

    log
    ValueError: This model's maximum position embeddings length is 4096 tokens...
    

    Solution: Ensure enable_chunked_processing: true in pooler config

  2. Input exceeds max_embed_len:

    log
    ValueError: This model's maximum embedding input length is 3072000 tokens...
    

    Solution: Increase max_embed_len in pooler config or reduce input length

  3. Memory errors:

    log
    RuntimeError: CUDA out of memory
    

    Solution: Reduce chunk size by adjusting model's max_position_embeddings or use fewer GPUs

  4. Slow processing: Expected: Long text takes more time due to multiple inference calls

### Debug Information

Server logs show chunked processing activity:

```log
INFO: Input length 150000 exceeds max_position_embeddings 4096, will use chunked processing
INFO: Split input of 150000 tokens into 37 chunks (max_chunk_size: 4096)
```
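The chunk count in the second log line follows directly from a ceiling division. A quick check, assuming the numbers shown in the log:

```python
import math

input_tokens = 150_000
max_chunk_size = 4096  # the model's max_position_embeddings

# Ceiling division: 150000 / 4096 ≈ 36.6, so 37 chunks are needed.
num_chunks = math.ceil(input_tokens / max_chunk_size)
print(num_chunks)  # 37
```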

๐Ÿค Contributing

To extend chunked processing support to other embedding models:

  1. Check model compatibility with the pooling architecture
  2. Test with various text lengths
  3. Validate embedding quality compared to single-chunk processing
  4. Submit PR with test cases and documentation updates

## 🆕 Enhanced Features

### `max_embed_len` Parameter

The new `max_embed_len` parameter provides:

- Simplified configuration: No need for the `VLLM_ALLOW_LONG_MAX_MODEL_LEN` environment variable
- Flexible input validation: Accepts inputs longer than `max_model_len`, up to `max_embed_len`
- Extreme length support: Processes documents with millions of tokens
- Clear error messages: Better feedback when inputs exceed limits
- Backward compatibility: Existing configurations continue to work