# internal/ai/vision/ollama

Last Updated: February 23, 2026
This package provides PhotoPrism's native adapter for Ollama-compatible multimodal models. It lets Caption, Labels, and future Generate workflows call locally hosted models without changing worker logic, reusing the shared API client (`internal/ai/vision/api_client.go`) and result types (`LabelResult`, `CaptionResult`). Requests stay inside your infrastructure, rely on base64 thumbnails, and honor the same ACL, timeout, and logging hooks as the default TensorFlow engines. The adapter resolves `${OLLAMA_BASE_URL}/api/generate`, trimming trailing slashes and defaulting to `http://ollama:11434`; set `OLLAMA_BASE_URL=https://ollama.com` to opt into cloud defaults.
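The resolution rules just described can be sketched with a small helper. This is an illustrative stand-in, not the package's actual code; the function name `resolveGenerateURL` is hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// resolveGenerateURL sketches the rules above: trim trailing slashes,
// default to the in-cluster daemon, and detect the cloud endpoint.
// The adapter's real helper may differ in detail.
func resolveGenerateURL(base string) (endpoint string, cloud bool) {
	base = strings.TrimRight(strings.TrimSpace(base), "/")
	if base == "" {
		base = "http://ollama:11434" // default when OLLAMA_BASE_URL is unset
	}
	return base + "/api/generate", base == "https://ollama.com"
}

func main() {
	endpoint, cloud := resolveGenerateURL(os.Getenv("OLLAMA_BASE_URL"))
	fmt.Println(endpoint, cloud)
}
```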
- Engine defaults live in `internal/ai/vision/ollama` and are applied whenever a model sets `Engine: ollama`. Aliases map to `ApiFormatOllama`, `scheme.Base64`, and a default 720 px thumbnail. Cloud defaults are selected only when `OLLAMA_BASE_URL` equals `https://ollama.com`.
- `decodeOllamaResponse` keeps the most recent chunk, and the parser supports both `response` and `thinking` fallbacks for captions and labels.
- `Format: json` is the default for label models targeting the Ollama engine. Use `photoprism vision run` to validate multi-image prompts before changing that invariant.
- Models are declared in `vision.yml` via `Name`, `Engine`, and `Service.Uri`, and can be exercised with the `photoprism vision` CLI. `Config.Model(ModelType)` returns the top-most enabled entry. When `Engine: ollama`, `ApplyEngineDefaults()` fills in the request/response format, base64 file scheme, and a 720 px resolution unless overridden.
- `ollamaBuilder.Build` wraps thumbnails with `NewApiRequestOllama`, which encodes them as base64 strings. `Model.GetModel()` resolves the exact Ollama tag (`gemma3:4b`, `qwen2.5vl:7b`, etc.).
- `PerformApiRequest` uses a single HTTP POST (default timeout 10 min). Authentication is optional; provide `Service.Key` if you proxy through an API gateway.
- `ollamaParser.Parse` converts payloads into `ApiResponse`. It normalizes confidences (`LabelConfidenceDefault = 0.5` when missing), copies NSFW scores, and canonicalizes label names via `normalizeLabelResult`. `entity.SrcOllama` is stamped on labels/captions so UI badges and audits reflect the new source.
- `LabelSystem` enforces single-word nouns. Set `System` to override; assign `LabelSystemSimple` when you need descriptive phrases. Captions use `CaptionPrompt`, which requests one sentence in active voice. Labels use `LabelPromptDefault`; when `DetectNSFWLabels` is true, the adapter swaps in `LabelPromptNSFW`. Set `Prompt` to `LabelPromptStrict` when tighter output is required.
- The default label schema is `schema.LabelsJson(nsfw)` (a simple JSON template). Setting `Format: json` auto-attaches a reminder (`model.SchemaInstructions()`). Override via `Schema` (inline YAML) or `SchemaFile`.
- `PHOTOPRISM_VISION_LABEL_SCHEMA_FILE` always wins if present.
- Default sampling for labels: `Temperature` equals `DefaultTemperature` (0.1 unless configured), `TopP=0.9`, `Stop=["\n\n"]`. For captions, only `Temperature` is set; other parameters inherit global defaults.
- `Options` merge with engine defaults. Leave `ForceJson=true` for labels so PhotoPrism can reject malformed payloads early.

| Model (Ollama Tag) | Size & Footprint | Strengths | JSON & Language Notes | When To Use |
|---|---|---|---|---|
| `gemma3:4b` / `12b` / `27b` | 4B/12B/27B parameters, ~3.3 GB → 17 GB downloads, 128 K context | Multimodal text+image reasoning with SigLIP encoder, handles OCR/long documents, supports tool/function calling | Emits structured JSON reliably; >140 languages with strong default English output | High-quality captions + multilingual labels when you have ≥12 GB VRAM (4B works on 8 GB with Q4_K_M) |
| `qwen2.5vl:7b` | 8.29 B params (Q4_K_M), ≈6 GB download, 125 K context | Excellent charts, GUI grounding, DocVQA, multi-image reasoning, agentic tool use | JSON mode tuned for schema compliance; supports 20+ languages with strong Chinese/English parity | Label extraction for mixed-language archives or UI/diagram analysis |
| `qwen3-vl:2b` / `4b` / `8b` | Dense 2B/4B/8B tiers (~3 GB, ~3.5 GB, ~6 GB downloads) with native 256 K context extendable to 1 M; fits single 12–24 GB GPUs or high-end CPUs (2B) | Spatial + video reasoning upgrades (Interleaved-MRoPE, DeepStack), 32-language OCR, GUI/agent control, long-document ingest | Emits JSON reliably when prompts specify a schema; multilingual captions/labels, with Thinking variants boosting STEM reasoning | General-purpose captions/labels when you need long-context doc/video support without cloud APIs; 2B for CPU/edge, 4B as balanced default, 8B when accuracy outweighs latency |
| `llama3.2-vision:11b` | 11 B params, ~7.8 GB download, requires ≥8 GB VRAM; 90 B variant needs ≥64 GB | Strong general reasoning, captioning, OCR; supported by Meta ecosystem tooling | Vision tasks officially supported in English; text-only tasks cover eight major languages | Keeping captions consistent with Meta-compatible prompts, or when teams already standardize on Llama 3.x |
| `minicpm-v:8b-2.6` | 8 B params, ~5.5 GB download, 32 K context | Optimized for edge GPUs, high OCR accuracy, multi-image/video support, low token count (≈640 tokens for 1.8 MP) | Multilingual (EN/ZH/DE/FR/IT/KR); emits concise JSON but may need stricter stop sequences | Memory-constrained deployments that still require NSFW/OCR-aware label output |
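Whichever model you pick, the confidence handling that `ollamaParser.Parse` applies (described earlier) can be sketched as follows. This is a simplified stand-in: the real `normalizeLabelResult` does more, and the struct here is illustrative.

```go
package main

import (
	"fmt"
	"strings"
)

// LabelConfidenceDefault stands in for the package constant mentioned above.
const LabelConfidenceDefault = 0.5

type labelResult struct {
	Name       string
	Confidence float64
}

// normalizeLabel sketches the parser's behavior: missing confidences fall
// back to the default, and names are trimmed and lower-cased for matching.
func normalizeLabel(l labelResult) labelResult {
	if l.Confidence <= 0 {
		l.Confidence = LabelConfidenceDefault
	}
	l.Name = strings.ToLower(strings.TrimSpace(l.Name))
	return l
}

func main() {
	fmt.Println(normalizeLabel(labelResult{Name: "  Beach "})) // prints {beach 0.5}
}
```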
Tip: pull models inside the dev container with `docker compose --profile ollama up -d` and then `docker compose exec ollama ollama pull gemma3:4b`. Keep the profile stopped when you do not need the extra GPU/CPU load.
Qwen3-VL models can stream structured output via `thinking` while leaving `response` empty. The parser checks `response` first and falls back to `thinking`, so captions/labels continue to work with either field.
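This decode-then-fall-back behavior can be sketched like so. It is a simplified model, assuming NDJSON streaming chunks as returned by `/api/generate`; the adapter's actual decoder (`decodeOllamaResponse`) handles more metadata.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// chunk models only the streamed fields that matter for this sketch.
type chunk struct {
	Response string `json:"response"`
	Thinking string `json:"thinking"`
	Done     bool   `json:"done"`
}

// decodeStream accumulates both fields across NDJSON chunks, then prefers
// response and falls back to thinking, mirroring the parser described above.
func decodeStream(body string) string {
	var response, thinking strings.Builder
	sc := bufio.NewScanner(strings.NewReader(body))
	for sc.Scan() {
		var c chunk
		if err := json.Unmarshal(sc.Bytes(), &c); err != nil {
			continue // skip malformed lines in this sketch
		}
		response.WriteString(c.Response)
		thinking.WriteString(c.Thinking)
	}
	if s := strings.TrimSpace(response.String()); s != "" {
		return s
	}
	return strings.TrimSpace(thinking.String())
}

func main() {
	stream := `{"thinking":"a sunny ","done":false}` + "\n" + `{"thinking":"beach scene","done":true}`
	fmt.Println(decodeStream(stream)) // prints "a sunny beach scene"
}
```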
## Environment Variables

- `PHOTOPRISM_VISION_LABEL_SCHEMA_FILE` — Absolute path to a JSON snippet that overrides the default label schema (applies to every Ollama label model).
- `PHOTOPRISM_VISION_YAML` — Custom `vision.yml` path. Keep it synced in Git if you automate deployments.
- `OLLAMA_HOST`, `OLLAMA_MODELS`, `OLLAMA_MAX_QUEUE`, `OLLAMA_NUM_PARALLEL`, etc. — Provided in `compose*.yaml` to tune the Ollama daemon. Adjust `OLLAMA_KEEP_ALIVE` if you want models to stay loaded between worker batches.
- `OLLAMA_API_KEY` / `OLLAMA_API_KEY_FILE` — Default bearer token picked up when `Service.Key` is empty; useful for hosted Ollama services (e.g., Ollama Cloud).
- `OLLAMA_BASE_URL` — Base URL for the Ollama API; defaults to `http://ollama:11434`, trailing slashes are trimmed. Set to `https://ollama.com` to enable cloud defaults.
- `PHOTOPRISM_LOG_LEVEL=trace` — Enables verbose request/response previews (truncated to avoid leaking images). Use temporarily when debugging parsing issues.

## vision.yml Example

```yaml
Models:
  - Type: labels
    Name: qwen2.5vl:7b
    Engine: ollama
    Run: newly-indexed
    Resolution: 720
    Format: json
    Options:
      Temperature: 0.05
      Stop: ["\n\n"]
      ForceJson: true
    Service:
      Uri: ${OLLAMA_BASE_URL}/api/generate
      RequestFormat: ollama
      ResponseFormat: ollama
      FileScheme: base64
      Think: "false"
  - Type: caption
    Name: gemma3:4b
    Engine: ollama
    Disabled: false
    Options:
      Temperature: 0.2
    Service:
      Uri: ${OLLAMA_BASE_URL}/api/generate
      Think: "false"
```
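The defaults-filling step that makes most of these fields optional (`ApplyEngineDefaults()`, mentioned earlier) can be sketched as: when a field is unset, fill in the Ollama engine default. The struct and function below are a minimal stand-in, not the adapter's actual types.

```go
package main

import "fmt"

// model holds only the fields relevant to this sketch; the real
// configuration struct in internal/ai/vision has many more.
type model struct {
	Type       string
	Format     string
	FileScheme string
	Resolution int
}

// applyEngineDefaults sketches ApplyEngineDefaults() for Engine: ollama:
// JSON format for label models, base64 file scheme, and 720 px resolution,
// each applied only when the field was left unset.
func applyEngineDefaults(m model) model {
	if m.Type == "labels" && m.Format == "" {
		m.Format = "json"
	}
	if m.FileScheme == "" {
		m.FileScheme = "base64"
	}
	if m.Resolution <= 0 {
		m.Resolution = 720
	}
	return m
}

func main() {
	fmt.Println(applyEngineDefaults(model{Type: "labels"})) // prints {labels json base64 720}
}
```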
## Guidelines
- Pin exact model tags (`model:version`) so upgrades are deliberate.
- `Service.Think` is optional and is sent only when non-empty. Keep it quoted (for example `"false"` or `"low"`) so YAML preserves it as a string; PhotoPrism serializes `"true"`/`"false"` as JSON booleans for Ollama compatibility. Note that `think:true` may fail on models that do not implement reasoning, and `think:false` can still yield empty `response` fields on some reasoning-capable models.
- Validate changes with a small batch first (`photoprism vision run -m labels --count 1`).
- To authenticate through a gateway, set `Service.Key` to `Bearer <token>` and configure the proxy to inject/validate it.
- Use `Run: newly-indexed` for incremental runs, `Run: manual` for ad-hoc CLI calls, or `Run: on-schedule` when paired with the scheduler. Leave `Run: auto` if you want the worker to decide based on other model states.
- The default request timeout is 10 minutes (`ServiceTimeout`). Ollama streaming responses complete faster in practice; if you need stricter SLAs, wrap `photoprism vision run` in a job runner and retry failed batches manually.
- `labels.go` stops at the first successful engine, so duplicates are avoided.
- Mount `./storage/services/ollama:/root/.ollama` (see Compose) so pulled models survive container restarts.
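The `Service.Think` serialization rule can be sketched as a small conversion. This is a hypothetical illustration of the described behavior; the pass-through of other values such as `"low"` is an assumption, and `thinkValue` is not an actual adapter function.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// thinkValue sketches the rule above: the quoted YAML strings "true"/"false"
// become JSON booleans, an empty value means "omit the field", and other
// non-empty values (e.g. "low") are assumed to pass through as strings.
func thinkValue(s string) any {
	switch s {
	case "":
		return nil // field is not sent at all
	case "true":
		return true
	case "false":
		return false
	default:
		return s
	}
}

func main() {
	payload := map[string]any{"model": "gemma3:4b"}
	if v := thinkValue("false"); v != nil {
		payload["think"] = v
	}
	b, _ := json.Marshal(payload)
	fmt.Println(string(b)) // prints {"model":"gemma3:4b","think":false}
}
```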
## Testing & Troubleshooting

- Run `docker compose exec ollama ollama list` during deployments to verify model availability.
- Re-caption a sample: `photoprism vision run -m caption --count 5 --force`.
- Re-label a sample: `photoprism vision run -m labels --count 5 --force`.
- Check `photoprism vision ls` for `source=ollama`.
- `go test ./internal/ai/vision/ollama ./internal/ai/vision -run Ollama -count=1` covers transport parsing and model defaults.
- Add fixtures under `internal/ai/vision/testdata` when capturing new response shapes; keep files small and anonymized.
- Set `PHOTOPRISM_LOG_LEVEL=debug` to watch summary lines ("processed labels/caption via ollama").
- Use `log.Trace` sparingly; it prints truncated JSON blobs for troubleshooting.
- `/api/v1/metrics` exposes counts per label source; scrape after a batch to compare throughput with TensorFlow/OpenAI runs.

## Related Files

- `internal/ai/vision/ollama/*.go` — Engine defaults, schema helpers, transport structs.
- `internal/ai/vision/engine_ollama.go` — Builder/parser glue plus label/caption normalization.
- `internal/ai/vision/api_ollama.go` — Base64 payload builder.
- `internal/ai/vision/api_client.go` — Streaming decoder shared among engines.
- `internal/ai/vision/models.go` — Default caption model definition (gemma3).
- `compose*.yaml` — Ollama service profile, Traefik labels, and persistent volume wiring.
- `frontend/src/common/util.js` — Maps `src="ollama"` to the correct badge; keep it updated when adding new source strings.