Back to Antigravity Manager

Vision MCP (built-in server)

docs/zai/vision-mcp.md

4.1.333.5 KB
Original Source

Vision MCP (built-in server)

Why we implemented it this way

The upstream Vision MCP package (@z_ai/mcp-server) is designed as a local stdio server. In a desktop app + embedded proxy, requiring users (or the app) to manage a separate Node runtime/process increases operational complexity.

Instead, we implement a built-in Vision MCP server directly in the proxy:

  • No extra runtime dependency.
  • Single place to store the z.ai key (proxy config).
  • Apps can talk to the local proxy using standard MCP over HTTP.

Local endpoint

  • /mcp/zai-mcp-server/mcp

Wired in:

Protocol surface (minimal Streamable HTTP MCP)

Handler:

Implemented methods:

  • POST /mcp:
    • initialize
    • tools/list
    • tools/call
  • GET /mcp:
    • returns an SSE stream (keepalive) for an existing session
  • DELETE /mcp:
    • terminates a session

Session storage:

Notes:

  • This is intentionally minimal to support tool calls.
  • Prompts/resources, resumability, and streamed tool output can be added later if needed.

Tool set

Tool registry:

Tool execution:

Supported tools (mirrors the upstream package at a high level):

  • ui_to_artifact
  • extract_text_from_screenshot
  • diagnose_error_screenshot
  • understand_technical_diagram
  • analyze_data_visualization
  • ui_diff_check
  • analyze_image
  • analyze_video

Upstream calls

Vision tools call the z.ai vision chat completions endpoint:

  • https://api.z.ai/api/paas/v4/chat/completions

Implementation:

Auth:

  • Uses Authorization: Bearer <proxy.zai.api_key>

Payload:

  • model: glm-4.6v (currently hardcoded)
  • messages: system prompt + a multimodal user message containing images/videos + text prompt
  • stream: false (currently returns a single tool result)

Local file handling

To support local file paths passed by MCP clients:

  • Images (.png, .jpg, .jpeg) are read and encoded as data:<mime>;base64,... (5 MB max)
  • Videos (.mp4, .mov, .m4v) are read and encoded as data:<mime>;base64,... (8 MB max)

Implementation:

Quick validation (raw JSON-RPC)

  1. Initialize:
    • POST /mcp/zai-mcp-server/mcp with {\"jsonrpc\":\"2.0\",\"id\":1,\"method\":\"initialize\",\"params\":{\"protocolVersion\":\"2024-11-05\",\"capabilities\":{}}}
    • capture Mcp-Session-Id response header
  2. List tools:
    • POST /mcp/zai-mcp-server/mcp with Mcp-Session-Id: <id> and {\"jsonrpc\":\"2.0\",\"id\":2,\"method\":\"tools/list\"}
  3. Call tool:
    • POST /mcp/zai-mcp-server/mcp with Mcp-Session-Id: <id> and {\"jsonrpc\":\"2.0\",\"id\":3,\"method\":\"tools/call\",\"params\":{\"name\":\"analyze_image\",\"arguments\":{\"image_source\":\"/path/to/file.png\",\"prompt\":\"Describe this image\"}}}