showcase/shell-docs/src/content/ag-ui/drafts/multimodal-messages.mdx
The AG-UI protocol currently supports only text-based user messages. As LLMs increasingly accept multimodal inputs (images, audio, files), the protocol needs to evolve to handle these richer input types.
Evolve AG-UI to support multimodal input messages without breaking existing apps. Inputs may include text, images, audio, video, and documents. Each modality is represented as a distinct, typed content part with a clear source discriminator (`data` for inline base64, `url` for references).

Extend the `UserMessage` `content` property to accept either a string or an array of `InputContent` objects. Each modality (image, audio, video, document) has its own dedicated part type whose `source` is either inline data or a URL reference, making it straightforward to map content parts to any LLM provider's API.
```typescript
/**
 * Supported input modality types for multimodal content.
 */
type Modality = "text" | "image" | "audio" | "video" | "document";

// ── Source types ──────────────────────────────────────────────

interface InputContentDataSource {
  /** Indicates this is inline data content. */
  type: "data";
  /** The base64-encoded content value. */
  value: string;
  /** MIME type of the content (e.g., "image/png", "audio/wav"). Required. */
  mimeType: string;
}

interface InputContentUrlSource {
  /** Indicates this is URL-referenced content. */
  type: "url";
  /** HTTP(S) URL or data URI pointing to the content. */
  value: string;
  /** Optional MIME type hint for when it can't be inferred from the URL. */
  mimeType?: string;
}

type InputContentSource = InputContentDataSource | InputContentUrlSource;

// ── Content part types ────────────────────────────────────────

interface TextInputPart {
  type: "text";
  /** The text content. */
  text: string;
}

interface ImageInputPart<TMetadata = unknown> {
  type: "image";
  /** Source of the image content. */
  source: InputContentSource;
  /** Provider-specific metadata (e.g., OpenAI detail: "auto" | "low" | "high"). */
  metadata?: TMetadata;
}

interface AudioInputPart<TMetadata = unknown> {
  type: "audio";
  /** Source of the audio content. */
  source: InputContentSource;
  /** Provider-specific metadata (e.g., format, sample rate). */
  metadata?: TMetadata;
}

interface VideoInputPart<TMetadata = unknown> {
  type: "video";
  /** Source of the video content. */
  source: InputContentSource;
  /** Provider-specific metadata (e.g., duration, resolution). */
  metadata?: TMetadata;
}

interface DocumentInputPart<TMetadata = unknown> {
  type: "document";
  /** Source of the document content. */
  source: InputContentSource;
  /** Provider-specific metadata (e.g., Anthropic media_type for PDFs). */
  metadata?: TMetadata;
}

type InputContent =
  | TextInputPart
  | ImageInputPart
  | AudioInputPart
  | VideoInputPart
  | DocumentInputPart;

// ── Updated UserMessage ───────────────────────────────────────

type UserMessage = {
  id: string;
  role: "user";
  content: string | InputContent[];
  name?: string;
};
```
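Because `content` is a union, consumers can normalize it once at the boundary and handle a single shape downstream. A minimal sketch (the `normalizeContent` helper is illustrative, not part of the proposal):

```typescript
// Minimal re-declarations so the example is self-contained.
type TextInputPart = { type: "text"; text: string };
type InputContent =
  | TextInputPart
  | { type: "image" | "audio" | "video" | "document"; source: unknown; metadata?: unknown };

type UserMessage = {
  id: string;
  role: "user";
  content: string | InputContent[];
  name?: string;
};

// Hypothetical helper: collapse the string | InputContent[] union into a
// uniform array so downstream code only deals with content parts.
function normalizeContent(message: UserMessage): InputContent[] {
  return typeof message.content === "string"
    ? [{ type: "text", text: message.content }]
    : message.content;
}

// A legacy string message and a multimodal message normalize the same way.
const legacy: UserMessage = { id: "msg-001", role: "user", content: "Hello" };
console.log(normalizeContent(legacy)); // → a single text part wrapping the string
```

This keeps backward compatibility out of every adapter's hot path: only the normalization step knows the union exists.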
The `Modality` type enumerates the supported content modalities:

| Value | Description |
|---|---|
| "text" | Plain text content |
| "image" | Image content (JPEG, PNG, GIF, WebP, etc.) |
| "audio" | Audio content (WAV, MP3, OGG, etc.) |
| "video" | Video content (MP4, WebM, etc.) |
| "document" | Document content (PDF, DOCX, XLSX, etc.) |
Every non-text content part carries a `source` property that describes how the content is delivered. The source is a discriminated union with two variants.

`InputContentDataSource`: inline base64-encoded content.

| Property | Type | Required | Description |
|---|---|---|---|
| type | "data" | ✓ | Discriminator for inline data |
| value | string | ✓ | Base64-encoded content |
| mimeType | string | ✓ | MIME type (required to ensure correct handling) |

`InputContentUrlSource`: URL-referenced content.

| Property | Type | Required | Description |
|---|---|---|---|
| type | "url" | ✓ | Discriminator for URL reference |
| value | string | ✓ | HTTP(S) URL or data URI |
| mimeType | string | | Optional MIME type hint |
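Dispatching on `source.type` narrows the union, so each variant's fields are fully typed. For instance, a consumer targeting a provider that only accepts URLs might resolve any source to a single URL string (a sketch; `resolveToUrl` is a hypothetical helper, not part of the proposal):

```typescript
type InputContentSource =
  | { type: "data"; value: string; mimeType: string }
  | { type: "url"; value: string; mimeType?: string };

// Hypothetical helper: inline base64 data becomes a data URI,
// URL references pass through unchanged.
function resolveToUrl(source: InputContentSource): string {
  switch (source.type) {
    case "data":
      // TypeScript narrows here, so mimeType is known to be present.
      return `data:${source.mimeType};base64,${source.value}`;
    case "url":
      return source.value;
  }
}

console.log(resolveToUrl({ type: "data", value: "AAAA", mimeType: "image/png" }));
// data:image/png;base64,AAAA
```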
`TextInputPart`: represents plain text content within a multimodal message.

| Property | Type | Description |
|---|---|---|
| type | "text" | Identifies this as text content |
| text | string | The text content |

`ImageInputPart`: represents image content. Maps directly to provider image inputs (e.g., OpenAI vision, Anthropic image blocks).

| Property | Type | Description |
|---|---|---|
| type | "image" | Identifies this as image content |
| source | InputContentSource | Either inline data or URL reference |
| metadata | TMetadata? | Provider-specific metadata (e.g., OpenAI detail level) |

`AudioInputPart`: represents audio content.

| Property | Type | Description |
|---|---|---|
| type | "audio" | Identifies this as audio content |
| source | InputContentSource | Either inline data or URL reference |
| metadata | TMetadata? | Provider-specific metadata (e.g., format, sample rate) |

`VideoInputPart`: represents video content.

| Property | Type | Description |
|---|---|---|
| type | "video" | Identifies this as video content |
| source | InputContentSource | Either inline data or URL reference |
| metadata | TMetadata? | Provider-specific metadata (e.g., duration, resolution) |

`DocumentInputPart`: represents document content such as PDFs, Word documents, or spreadsheets.

| Property | Type | Description |
|---|---|---|
| type | "document" | Identifies this as document content |
| source | InputContentSource | Either inline data or URL reference |
| metadata | TMetadata? | Provider-specific metadata (e.g., Anthropic media_type) |
The generic metadata field on each content part allows provider-specific information to flow through the protocol without polluting the core schema. Examples:

- `ImageInputPart<{ detail: 'auto' | 'low' | 'high' }>`
- `DocumentInputPart<{ media_type: 'application/pdf' }>`

A plain string message continues to work unchanged:

```json
{
  "id": "msg-001",
  "role": "user",
  "content": "What's in this image?"
}
```
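To show how the generic parameter pins provider-specific metadata down at compile time, here is a small sketch (the `OpenAIImagePart` alias is illustrative; the detail values mirror OpenAI's image detail option):

```typescript
// Minimal re-declarations so the example is self-contained.
type InputContentSource =
  | { type: "data"; value: string; mimeType: string }
  | { type: "url"; value: string; mimeType?: string };

interface ImageInputPart<TMetadata = unknown> {
  type: "image";
  source: InputContentSource;
  metadata?: TMetadata;
}

// Instantiating the generic yields a part type whose metadata is type-checked.
type OpenAIImagePart = ImageInputPart<{ detail: "auto" | "low" | "high" }>;

const part: OpenAIImagePart = {
  type: "image",
  source: { type: "url", value: "https://example.com/photo.png" },
  metadata: { detail: "high" }, // detail: "huge" would be a compile error
};

console.log(part.metadata?.detail);
// high
```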
Text with an inline base64-encoded image:

```json
{
  "id": "msg-002",
  "role": "user",
  "content": [
    {
      "type": "text",
      "text": "What's in this image?"
    },
    {
      "type": "image",
      "source": {
        "type": "data",
        "value": "/9j/4AAQSkZJRg...",
        "mimeType": "image/jpeg"
      }
    }
  ]
}
```

Text with a URL-referenced image and provider-specific metadata:

```json
{
  "id": "msg-003",
  "role": "user",
  "content": [
    {
      "type": "text",
      "text": "What's in this image?"
    },
    {
      "type": "image",
      "source": {
        "type": "url",
        "value": "https://example.com/photo.png"
      },
      "metadata": {
        "detail": "high"
      }
    }
  ]
}
```

Comparing multiple images in one message:

```json
{
  "id": "msg-004",
  "role": "user",
  "content": [
    {
      "type": "text",
      "text": "What are the differences between these images?"
    },
    {
      "type": "image",
      "source": {
        "type": "url",
        "value": "https://example.com/image1.png",
        "mimeType": "image/png"
      }
    },
    {
      "type": "image",
      "source": {
        "type": "url",
        "value": "https://example.com/image2.png",
        "mimeType": "image/png"
      }
    }
  ]
}
```

Audio transcription:

```json
{
  "id": "msg-005",
  "role": "user",
  "content": [
    {
      "type": "text",
      "text": "Please transcribe this audio recording"
    },
    {
      "type": "audio",
      "source": {
        "type": "url",
        "value": "https://example.com/meeting-recording.wav",
        "mimeType": "audio/wav"
      }
    }
  ]
}
```

Document analysis:

```json
{
  "id": "msg-006",
  "role": "user",
  "content": [
    {
      "type": "text",
      "text": "Summarize the key points from this PDF"
    },
    {
      "type": "document",
      "source": {
        "type": "url",
        "value": "https://example.com/reports/q4-2024.pdf",
        "mimeType": "application/pdf"
      }
    }
  ]
}
```

Video with metadata:

```json
{
  "id": "msg-007",
  "role": "user",
  "content": [
    {
      "type": "text",
      "text": "Describe what happens in this video"
    },
    {
      "type": "video",
      "source": {
        "type": "url",
        "value": "https://example.com/demo.mp4",
        "mimeType": "video/mp4"
      },
      "metadata": {
        "duration": 120
      }
    }
  ]
}
```

Mixed media in a single message (inline image plus referenced document):

```json
{
  "id": "msg-008",
  "role": "user",
  "content": [
    {
      "type": "text",
      "text": "Compare the screenshot with the design spec"
    },
    {
      "type": "image",
      "source": {
        "type": "data",
        "value": "iVBORw0KGgo...",
        "mimeType": "image/png"
      }
    },
    {
      "type": "document",
      "source": {
        "type": "url",
        "value": "https://example.com/design-spec.pdf",
        "mimeType": "application/pdf"
      }
    }
  ]
}
```
TypeScript SDK:

- `Modality` type and all `InputContent` part types in `@ag-ui/core`
- `InputContentSource`, `InputContentDataSource`, and `InputContentUrlSource` types
- `UserMessage` with `content: string | InputContent[]`

Python SDK:

- Equivalent models (`TextInputPart`, `ImageInputPart`, etc.)
- `InputContentSource` discriminated union
- Updated `UserMessage` model

Frameworks need to:

- Iterate over `InputContent` parts and dispatch on `part.type`
- Check `source.type` to determine whether to send inline data or a URL to the provider
- Pass `metadata` through to providers that support it
- Validate that `mimeType` is appropriate for the declared content part type

Use cases:

- Users can upload images (`ImageInputPart`) and ask questions about them.
- Upload PDFs, Word documents, or spreadsheets (`DocumentInputPart`) for analysis.
- Process voice recordings, podcasts, or meeting audio (`AudioInputPart`).
- Analyze video content (`VideoInputPart`) for summaries, descriptions, or content moderation.
- Compare multiple images, documents, or mixed media using different content part types in a single message.
- Share screenshots (`ImageInputPart`) for UI/UX feedback or debugging assistance.

Testing should cover:

- The `InputContent` type and `InputContentSource` variants
- The `source.type` discriminator correctly narrows the union
- Backward compatibility with string `content`
- `metadata` passthrough for provider-specific fields
- `InputContentDataSource`
- `TMetadata` works across SDKs
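The framework dispatch described above can be sketched as follows. The target shapes loosely follow OpenAI's chat content items and are assumptions for illustration, not part of the proposal; a real adapter would branch per provider:

```typescript
// Minimal re-declarations so the example is self-contained.
type InputContentSource =
  | { type: "data"; value: string; mimeType: string }
  | { type: "url"; value: string; mimeType?: string };

type InputContent =
  | { type: "text"; text: string }
  | { type: "image"; source: InputContentSource; metadata?: Record<string, unknown> }
  | { type: "audio"; source: InputContentSource; metadata?: Record<string, unknown> }
  | { type: "video"; source: InputContentSource; metadata?: Record<string, unknown> }
  | { type: "document"; source: InputContentSource; metadata?: Record<string, unknown> };

// Resolve either source variant to a URL string (data URI for inline data).
function sourceUrl(source: InputContentSource): string {
  return source.type === "data"
    ? `data:${source.mimeType};base64,${source.value}`
    : source.value;
}

// Hypothetical adapter: dispatch on part.type, emit a provider-shaped item,
// and spread metadata through where the provider supports it.
function toProviderItem(part: InputContent): Record<string, unknown> {
  switch (part.type) {
    case "text":
      return { type: "text", text: part.text };
    case "image":
      return {
        type: "image_url",
        image_url: { url: sourceUrl(part.source), ...(part.metadata ?? {}) },
      };
    case "audio":
    case "video":
    case "document":
      // Providers differ widely for these modalities; this is a placeholder shape.
      return { type: part.type, url: sourceUrl(part.source) };
  }
}
```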