docs/src/pages/changelog/2026-05-22-jan-v0.8.0.mdx
import ChangelogHeader from "@/components/Changelog/ChangelogHeader"
<ChangelogHeader title="Jan v0.8.0: Multi-Token Prediction, llama.cpp Router Mode & Inline MCP Approval" date="2026-05-22" />
Multi-Token Prediction (MTP)
Jan now supports llama.cpp's Multi-Token Prediction for compatible models (e.g. GLM 4.5/4.6 and other architectures with MTP heads). Jan detects MTP-capable models from GGUF metadata ({arch}.nextn_predict_layers) at import time and exposes a per-model toggle plus draft tunables (spec-draft-n-max, spec-draft-n-min, spec-draft-p-min) in Model Settings. When enabled, spec-type = draft-mtp is emitted into the router preset, letting the model draft multiple tokens per step for faster generation. Requires the bundled llama.cpp build b9193 or newer; older backends disable the toggle with an upgrade hint.
llama.cpp Router Mode
Jan's local inference engine now runs as a single unified router process instead of spawning a separate server for every model. The router loads and unloads models on demand, so switching between them is faster and uses less memory. This release also adds:
Inline MCP Tool Approval & Citation Cards
MCP tool calls no longer interrupt the conversation with a blocking dialog. Approval panels now appear inline inside each tool card, showing the exact arguments before you accept or deny. RAG results are displayed as numbered citation cards with source previews inside the tool output, and assistant responses include superscript markers linking back to each matched source (cosine similarity ≥ 0.65). Web search results show citation cards in the tool output but do not inject superscript markers into the response text.
Model Fit Labels & Bulk Delete
The Hub and provider model lists now show a colored fit pill — Fits, May be slow, or Won't fit — based on your hardware, without downloading anything. Model quantizations are grouped as Small, Balanced, or Large with a Recommended tag on the default download. A new Delete All button removes every managed model download at once and shows the total disk space to be freed (imported models are left untouched). Failed downloads no longer get stuck — they are cleared from the queue and a toast is shown. To retry, restart the download from the Hub or provider list.
Provider List Redesign
The provider list in Settings is now split into Local (llama.cpp, MLX) and Remote (OpenAI, Anthropic, Google, Groq, and others) sections so it is immediately clear which providers run on your device and which send data to the cloud.
Per-Thread Activity Indicators & Navigation
The thread sidebar now shows a small spinner next to each thread that is actively generating, loading a model, or running tools. Navigating away from an active thread no longer interrupts it — the work keeps running and the UI restores the correct state when you return. Deleting a thread properly stops all in-flight activity.
Audio Attachments
You can now attach audio files (WAV, MP3) directly in the chat input alongside images and documents. The Add Audio option appears when the active model supports audio input. Attached files show a preview chip with the filename and duration before sending.
File Attachment Progress Bar
File uploads now display a progress bar so you can see how much has transferred before sending your message.
Backend Dependency Checker
After installing a llama.cpp backend, Jan scans for required libraries (CUDA, Vulkan, cuDNN) and displays a checklist of anything missing with links to the official installers.
KV Cache Default Reverted to f16
The default KV cache type was temporarily changed to q8_0 in a prior release and has been reverted to f16 as a safer default. This change is not automatically migrated — if your models fail to load after updating, go to Settings > llama.cpp > KV Cache K Type / V Type and set both back to f16.
Vercel AI SDK v6
OpenAI, Anthropic, and Gemini model lists are now auto-populated when you enter an API key, with capabilities (tools, vision, audio) inferred from model IDs. Fallback API keys persist across restarts. Gemini 3.0 models route through Google's native SDK for correct tool-calling behavior. Error messages for context overflows and template errors are clearer and more actionable.
Server-Side MCP Tool Execution
A new /v1/orchestrations endpoint runs MCP tool calls server-side on localhost:1337, and a Settings toggle lets you enable the same orchestrator on /v1/chat/completions. An optional lightweight router model can be configured to pre-select which MCP servers a request needs, keeping the tool list short for the main model.
Message Queue & Editable Pending Chips
You can now keep typing while a response is streaming. Queued messages appear as editable chips in the input area, send automatically as soon as the current turn finishes, and the Stop button uses a two-stage interaction so you can cancel either the queue or the active generation.
Auto-Summarized Thread Titles
After your first response, Jan summarizes the thread title in the background using a cheap inference pass that can be cancelled if you rename the thread yourself.
Chain of Thought, Regenerate & Edit Improvements
Failed assistant responses now show a regenerate button instead of leaving the conversation stuck. A new Chain of Thought rendering surfaces step-by-step reasoning more cleanly, and you can scroll through streaming thinking content without being auto-interrupted.
Settings, Assistants & UX
Remote Provider Improvements
<thought> reasoning tags are recognized by the parser and rendered as a collapsible reasoning block instead of appearing as raw text in the responseBug Fixes
ctx_size overflow no longer breaks reloading a chattools/list before signalling ready, and reconnect automatically on disconnect; tool parameter schemas missing type are normalized for strict providersq8_0, which broke loading for some models that require flash attention. v0.8.0 reverts the default to f16 and removes the previous f16 → q8_0 migration code. Existing installs are not migrated automatically — if a model fails to load after updating, go to Settings > llama.cpp > KV Cache K Type / V Type and switch both back to f16 manually.Update your Jan or download the latest.
For the complete list of changes, see the GitHub release notes