Multimodal requests carry non-text content parts (image_url, audio, video). Each part goes through a per-modality preprocessing and encoder path before the tokens reach the transformer.
Content parts are ordered in the request body. The engine preserves that order when it interleaves media tokens with text tokens, so text surrounding an image appears on either side of the image tokens in the final token sequence. The transformer sees one uniform token stream.
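The ordering rule can be illustrated with a minimal sketch. The part dictionaries and the `tokenize_text` and `encode_media` callables are hypothetical stand-ins, not the engine's actual interfaces; the point is only that parts are walked in request order and their tokens are appended to one flat sequence.

```python
from typing import Callable, Sequence


def interleave(
    parts: Sequence[dict],
    tokenize_text: Callable[[str], list],
    encode_media: Callable[[dict], list],
) -> list:
    """Build one token stream, keeping content parts in request order.

    Assumption: each part is a dict with a "type" key, and text parts
    carry their string under "text". Both callables are placeholders.
    """
    tokens: list = []
    for part in parts:
        if part["type"] == "text":
            tokens.extend(tokenize_text(part["text"]))
        else:
            # Any non-text part (image_url, audio, video) goes through
            # its modality-specific encoder path.
            tokens.extend(encode_media(part))
    return tokens


# Toy stand-ins so the sketch runs without a model.
def fake_tokenize(text: str) -> list:
    return text.split()


def fake_encode(part: dict) -> list:
    return [f"<{part['type']}>"] * 2


example_parts = [
    {"type": "text", "text": "before the"},
    {"type": "image_url", "image_url": {"url": "file:///tmp/cat.png"}},
    {"type": "text", "text": "after"},
]
stream = interleave(example_parts, fake_tokenize, fake_encode)
```

With these stand-ins, the image tokens sit between the tokens of the two text parts, mirroring the order of the request body.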
An image reference (file://, http(s)://, data:image/...;base64,) is resolved to a pixel buffer. HTTP URLs trigger a fetch. Multiple images in one request are encoded as a batch.
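As a rough illustration of the resolution step, the sketch below dispatches on the URL scheme and returns the raw image bytes using only the Python standard library. The function name `load_image_bytes` and the timeout parameter are hypothetical. Decoding those bytes into a pixel buffer is left out because it needs an image library, and the engine's own decoder is not described here.

```python
import base64
from pathlib import Path
from urllib.parse import unquote_to_bytes, urlparse
from urllib.request import url2pathname, urlopen


def load_image_bytes(ref: str, timeout: float = 10.0) -> bytes:
    """Return the raw bytes behind an image reference (illustrative only)."""
    scheme = urlparse(ref).scheme
    if scheme == "data":
        # data:<mediatype>[;base64],<payload>
        header, _, payload = ref.partition(",")
        if header.endswith(";base64"):
            return base64.b64decode(payload)
        return unquote_to_bytes(payload)
    if scheme == "file":
        return Path(url2pathname(urlparse(ref).path)).read_bytes()
    if scheme in ("http", "https"):
        # Only this branch touches the network.
        with urlopen(ref, timeout=timeout) as response:
            return response.read()
    raise ValueError(f"unsupported image reference scheme: {scheme!r}")
```

Only the `http` and `https` branch performs a fetch; `file://` and `data:` references are resolved locally.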
Video is decoded to frames before model preprocessing. Each selected frame then flows through the image path.
Supported containers and FFmpeg installation are covered in Set up video input.
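How frames are selected is not specified above. As one possible policy, and purely as an assumption, the sketch below samples frames uniformly by taking the midpoint of each of `max_frames` equal-width bins. The function name and both parameters are hypothetical.

```python
def select_frame_indices(total_frames: int, max_frames: int) -> list[int]:
    """Pick up to max_frames frame indices, spread evenly over the video.

    Assumption: uniform sampling. The engine's real selection policy
    may differ (for example, sampling at a fixed frame rate).
    """
    if total_frames <= 0 or max_frames <= 0:
        return []
    if total_frames <= max_frames:
        # Short clip: keep every frame.
        return list(range(total_frames))
    step = total_frames / max_frames
    # Midpoint of each bin, truncated to an integer frame index.
    return [int(i * step + step / 2) for i in range(max_frames)]
```

Each returned index would then identify one decoded frame to send through the image path.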
Encoder outputs are cached by content hash and modality. When the same image, video, or audio clip appears in a later request, or in a later turn of the same session, the encoder pass is skipped and the cached tokens are reused.
The modality is part of the key: identical bytes processed as an image versus as a video frame can yield different token counts, and the cache keeps them separate.
The cache is LRU with a fixed capacity per model. Hit and miss counters are exposed for observability.
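The caching behavior described above can be sketched as a small LRU map keyed by modality and content hash. The class name `EncoderCache`, its method names, and the choice of SHA-256 are assumptions made for illustration; the engine's actual hash function and interface are not documented here.

```python
import hashlib
from collections import OrderedDict
from typing import Callable


class EncoderCache:
    """LRU cache for encoder outputs, keyed by (modality, content hash)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._entries: OrderedDict = OrderedDict()
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(modality: str, content: bytes) -> tuple:
        # Modality is part of the key, so identical bytes encoded as an
        # image and as a video frame never share an entry.
        return (modality, hashlib.sha256(content).hexdigest())

    def get_or_encode(
        self, modality: str, content: bytes, encode: Callable[[bytes], list]
    ) -> list:
        k = self.key(modality, content)
        if k in self._entries:
            self.hits += 1
            self._entries.move_to_end(k)  # mark as most recently used
            return self._entries[k]
        self.misses += 1
        tokens = encode(content)  # the encoder pass runs only on a miss
        self._entries[k] = tokens
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict least recently used
        return tokens
```

In this sketch a repeated `get_or_encode("image", data, encoder)` call returns the stored tokens without invoking `encoder` again, while the same bytes under `"video"` count as a separate miss. The `hits` and `misses` attributes play the role of the observability counters.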