Ollama: Project-Specific Analysis & SOP

Architecture Summary

Ollama is a local LLM runtime that serves models via a REST API on localhost:11434. It handles model downloading, quantization, GPU/CPU inference, and memory management.

┌─────────────────────────────────────────────┐
│                Ollama Server                │
│  ┌───────────┐ ┌──────────┐ ┌────────────┐  │
│  │   Model   │ │ Generate │ │ Embeddings │  │
│  │  Manager  │ │  Engine  │ │   Engine   │  │
│  └─────┬─────┘ └────┬─────┘ └─────┬──────┘  │
│        │            │             │         │
│  ┌─────┴────────────┴─────────────┴───────┐ │
│  │          REST API (port 11434)         │ │
│  │  /api/tags  /api/generate  /api/embed  │ │
│  │  /api/pull  /api/chat      /api/show   │ │
│  │  /api/delete  /api/copy    /api/ps     │ │
│  └─────────────────┬──────────────────────┘ │
└────────────────────┼────────────────────────┘
                     │
          ┌──────────┴──────────┐
          │  llama.cpp backend  │
          │  GGUF model format  │
          │  GPU/CPU inference  │
          └─────────────────────┘
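Before issuing any other call, the wrapper can verify the server is up: a GET on the root of the default port returns a plain "Ollama is running" message. A minimal sketch with requests (the two-second timeout is an arbitrary choice, not an Ollama requirement):

```python
import requests

OLLAMA_URL = "http://localhost:11434"  # Ollama's default bind address

def server_alive(base_url: str = OLLAMA_URL) -> bool:
    """Return True if the Ollama server answers on its root endpoint."""
    try:
        # GET / responds with "Ollama is running" when the server is up.
        return requests.get(base_url, timeout=2).ok
    except requests.exceptions.ConnectionError:
        return False

if __name__ == "__main__":
    print("server up:", server_alive())
```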

CLI Strategy: REST API Wrapper

Ollama already provides a clean REST API. Our CLI wraps it with:

  1. requests — HTTP client for all API calls
  2. Streaming NDJSON — For progressive output during generation and model pulls (see the sketch after this list)
  3. Click CLI — Structured command groups matching the API surface
  4. REPL — Interactive mode for exploratory use
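To make item 2 concrete: the streaming endpoints return NDJSON, one JSON object per line, so the wrapper can iterate over response lines and decode each chunk as it arrives. A sketch under those assumptions; the helper name and the example model are illustrative, not the project's actual API:

```python
import json
from typing import Any, Iterator

import requests

BASE_URL = "http://localhost:11434"

def stream_ndjson(path: str, payload: dict[str, Any]) -> Iterator[dict[str, Any]]:
    """POST to a streaming Ollama endpoint, yielding each NDJSON chunk as a dict."""
    with requests.post(f"{BASE_URL}{path}", json=payload, stream=True) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if line:  # skip keep-alive blank lines
                yield json.loads(line)

# Progressive output: print tokens as /api/generate streams them.
for chunk in stream_ndjson("/api/generate",
                           {"model": "llama3.2", "prompt": "Why is the sky blue?"}):
    print(chunk.get("response", ""), end="", flush=True)
    if chunk.get("done"):
        print()
```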

API Endpoints

| Endpoint      | Method | Purpose                     |
|---------------|--------|-----------------------------|
| /             | GET    | Server status check         |
| /api/tags     | GET    | List local models           |
| /api/show     | POST   | Model details               |
| /api/pull     | POST   | Download model (streaming)  |
| /api/delete   | DELETE | Remove model                |
| /api/copy     | POST   | Copy/rename model           |
| /api/ps       | GET    | Running models              |
| /api/generate | POST   | Text generation (streaming) |
| /api/chat     | POST   | Chat completion (streaming) |
| /api/embed    | POST   | Generate embeddings         |
| /api/version  | GET    | Server version              |
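The non-streaming endpoints are plain JSON-over-HTTP. As an illustration (not project code), /api/embed takes a model plus an input string or list of strings and returns one vector per input; the model name below is just an example:

```python
import requests

resp = requests.post(
    "http://localhost:11434/api/embed",
    json={"model": "nomic-embed-text", "input": "hello world"},
)
resp.raise_for_status()
# The response body carries an "embeddings" list: one vector per input string.
vectors = resp.json()["embeddings"]
print(len(vectors), "vector(s) of dimension", len(vectors[0]))
```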

Command Map: Ollama Native CLI → CLI-Anything

| Ollama CLI                  | CLI-Anything                                  |
|-----------------------------|-----------------------------------------------|
| ollama list                 | model list                                    |
| ollama show <name>          | model show <name>                             |
| ollama pull <name>          | model pull <name>                             |
| ollama rm <name>            | model rm <name>                               |
| ollama cp <src> <dst>       | model copy <src> <dst>                        |
| ollama ps                   | model ps                                      |
| ollama run <model> <prompt> | generate text --model <name> --prompt "..."   |
| (no equivalent)             | generate chat --model <name> --message "..."  |
| (no equivalent)             | embed text --model <name> --input "..."       |
| ollama serve                | (external — must be running)                  |
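One way the Click command groups could realize this mapping, shown for model list delegating to GET /api/tags (the group layout and output format are assumptions, not the project's exact code):

```python
import click
import requests

BASE_URL = "http://localhost:11434"

@click.group()
def cli() -> None:
    """CLI-Anything wrapper around the Ollama REST API."""

@cli.group()
def model() -> None:
    """Model management commands (list, show, pull, rm, copy, ps)."""

@model.command("list")
def model_list() -> None:
    """Equivalent of `ollama list`: GET /api/tags."""
    resp = requests.get(f"{BASE_URL}/api/tags")
    resp.raise_for_status()
    # /api/tags returns {"models": [{"name": ..., "size": ...}, ...]}
    for m in resp.json().get("models", []):
        click.echo(f"{m['name']}\t{m.get('size', '?')}")

if __name__ == "__main__":
    cli()
```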

Model Parameters (options)

| Parameter      | Type      | Description                     |
|----------------|-----------|---------------------------------|
| temperature    | float     | Sampling temperature (0.0-2.0)  |
| top_p          | float     | Nucleus sampling threshold      |
| top_k          | int       | Top-k sampling                  |
| num_predict    | int       | Max tokens to generate          |
| repeat_penalty | float     | Repetition penalty              |
| seed           | int       | Random seed for reproducibility |
| stop           | list[str] | Stop sequences                  |
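When generating, these parameters are nested under an options key in the request body rather than passed at the top level. A sketch with arbitrary example values:

```python
import requests

payload = {
    "model": "llama3.2",
    "prompt": "List three prime numbers.",
    "stream": False,            # single JSON response instead of NDJSON
    "options": {
        "temperature": 0.2,     # low temperature for more deterministic output
        "num_predict": 64,      # cap the number of generated tokens
        "seed": 42,             # reproducible sampling
        "stop": ["\n\n"],       # halt at the first blank line
    },
}
resp = requests.post("http://localhost:11434/api/generate", json=payload)
resp.raise_for_status()
print(resp.json()["response"])
```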

Test Coverage Plan

  1. Unit tests (test_core.py): no Ollama server needed (see the test sketch after this list)

    • URL construction in backend
    • Output formatting
    • CLI argument parsing via Click test runner
    • Session state management
    • Error handling paths
  2. E2E tests (test_full_e2e.py): requires a running Ollama server

    • List models
    • Pull a small model
    • Generate text
    • Chat completion
    • Show model info
    • Embeddings
    • Delete model
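For the unit-test side, Click's CliRunner invokes commands in-process while the HTTP layer is stubbed, so no server is needed. A sketch assuming a hypothetical cli_anything.cli module; the import path and patched target are placeholders for the project's real layout:

```python
from unittest import mock

from click.testing import CliRunner

from cli_anything.cli import cli  # hypothetical module path

def test_model_list_formats_output() -> None:
    """`model list` should call /api/tags and print one model per line."""
    fake = mock.Mock()
    fake.json.return_value = {"models": [{"name": "llama3.2", "size": 123}]}
    fake.raise_for_status.return_value = None
    with mock.patch("cli_anything.cli.requests.get", return_value=fake):
        result = CliRunner().invoke(cli, ["model", "list"])
    assert result.exit_code == 0
    assert "llama3.2" in result.output
```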