docs/content/features/middleware.md
+++
title = "Middleware: PII filtering and intelligent routing"
weight = 27
toc = true
description = "Per-model PII redaction and policy-based request routing"
tags = ["Routing", "Privacy", "PII", "Middleware", "Advanced"]
categories = ["Features"]
+++
LocalAI ships a request-middleware layer that sits between the HTTP API and
the backend dispatcher. Two subsystems share that layer because they share
the same lifecycle hook: PII filtering scans the request body before it
reaches a backend, and the intelligent router rewrites input.Model so
a single client-facing model name fans out across multiple downstream
targets.
Both are inspected and configured from the same admin page
(/app/middleware), backed by the same REST surface (/api/middleware/*,
/api/pii/*, /api/router/*) and the same MCP tools.
client ── auth ── route-model ── per-model PII ── backend ── client
                       │               │
                       │               └─── event log
                       └─── decision log
The router runs first (it picks the target model so per-model PII has
something to gate on), per-model PII runs next (gated by the resolved
config), and the backend executes. Filtering is request-side only —
the request body is scanned and rewritten before forwarding; the response
is not touched (NER over a streamed response is left as a follow-up). Each
subsystem writes to its own admin-visible log: /api/router/decisions for
routing, /api/pii/events for redaction and block actions.
PII redaction is NER-based and runs request-side (input). It is
off by default, flipping to on for any cloud-proxy backend
because that traffic crosses the network to a third-party provider. Pick a
default detector so those models are actually
scanned. Explicit pii.enabled in a model's YAML always wins over the
backend default.
Filtering runs on every text-accepting endpoint that has an adapter wired:
/v1/chat/completions and /v1/messages (chat), /v1/completions,
/v1/embeddings, /v1/edits, and the Ollama /api/chat, /api/generate
and /api/embed endpoints, plus the [MITM proxy]({{< relref "mitm-proxy.md" >}})
request body. Image, audio (TTS/STT), video, rerank, and the realtime
WebSocket are not filtered yet (different prompt-PII semantics; realtime is
not HTTP middleware).
A request's messages are scanned as one document (joined in order), so
the NER detector keeps conversational context: whether 4421 is a PIN or
jdoe_42 is a username is usually decided by the question asked in the
previous message, and a bidirectional encoder only sees that context when
the messages share a forward pass. Detected spans are mapped back to the
individual message they fall in, so redaction still rewrites each message
field in place and events carry message-local offsets.
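The mapping from document-level spans back to message-local offsets can be sketched as follows. This is a minimal illustration, not LocalAI's implementation: the `"\n\n"` separator and the helper names `join_messages` and `to_local` are assumptions made for the example.

```python
from typing import List, Tuple

SEPARATOR = "\n\n"  # assumption: the joiner LocalAI actually uses is not documented here


def join_messages(messages: List[str]) -> Tuple[str, List[int]]:
    """Join messages in order and record where each one starts in the joined text."""
    starts, pos = [], 0
    for text in messages:
        starts.append(pos)
        pos += len(text) + len(SEPARATOR)
    return SEPARATOR.join(messages), starts


def to_local(messages: List[str], starts: List[int],
             start: int, end: int) -> Tuple[int, int, int]:
    """Map a [start, end) span in the joined text to (message_index, local_start, local_end)."""
    for i, base in enumerate(starts):
        if base <= start and end <= base + len(messages[i]):
            return i, start - base, end - base
    raise ValueError("span crosses a message boundary")
```

The detector sees the question and the answer in one pass, while each reported span still indexes into the single message it came from.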
The earlier regex pattern tier (pii.patterns, the built-in pattern catalogue,
--pii-config, the /api/pii/patterns|test|decide endpoints) and
response/streaming-side redaction have been removed. Detection is now driven
entirely by token-classification (NER) models. Legacy keys no-op with a
startup warning.
A detector is a token_classify model (e.g. an openai-privacy-filter
GGUF) that carries the detection policy in a top-level pii_detection:
block — defined once, on the model itself:
name: privacy-filter-multilingual
backend: privacy-filter
embeddings: true              # TOKEN_CLS pooling
known_usecases:
  - token_classify
pii_detection:
  min_score: 0.5              # drop detections below this confidence
  default_action: mask        # applied to any detected group with no entry
  entity_actions:             # which PII to block vs mask vs allow-log
    PASSWORD: block
    CREDITCARD: block
    EMAIL: mask
mask rewrites the matched span to [REDACTED:ner:<GROUP>] in the request
body before forwarding. block returns HTTP 400 (error.type=pii_blocked)
without forwarding. allow detects and logs (a PIIEvent is still recorded)
but leaves the text unchanged. The entity-group names are whatever the model
emits (the privacy-filter family uses uppercase names like EMAIL,
PASSWORD, CREDITCARD).
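How the three actions combine with min_score and default_action can be sketched as below. This is an illustrative model of the policy described above, not LocalAI's code: the `Detection` type, the `PIIBlocked` exception, and `apply_policy` are hypothetical names, and the sketch assumes the detected spans do not overlap.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Detection:
    group: str    # entity group as emitted by the model, e.g. "EMAIL"
    start: int    # span start offset in the text
    end: int      # span end offset (exclusive)
    score: float  # detector confidence


class PIIBlocked(Exception):
    """Stands in for the HTTP 400 (error.type=pii_blocked) response."""


def apply_policy(text: str, detections: List[Detection],
                 entity_actions: Dict[str, str],
                 default_action: str = "mask", min_score: float = 0.5) -> str:
    # Drop low-confidence detections, then look up the action for each group.
    kept = [d for d in detections if d.score >= min_score]
    decided = [(d, entity_actions.get(d.group, default_action)) for d in kept]
    blocked = sorted({d.group for d, action in decided if action == "block"})
    if blocked:
        raise PIIBlocked(", ".join(blocked))
    # Rewrite right to left so earlier offsets stay valid; "allow" leaves text as-is.
    for d, action in sorted(decided, key=lambda pair: pair[0].start, reverse=True):
        if action == "mask":
            text = text[:d.start] + f"[REDACTED:ner:{d.group}]" + text[d.end:]
    return text
```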
NER is the wrong tool for high-entropy, highly-regular secrets — API keys, tokens, private-key blocks. A trained NER model has no "API key" class, so it fragments a key into the nearest categories it does know and can leave the secret part exposed. Those secrets are exactly what a regex catches cheaply.
A pattern detector is a detector model (backend: pattern) that matches
secrets with a restricted regex subset compiled to Go's RE2 engine —
linear-time, no backtracking, no ReDoS. It runs entirely in-process: no model
download, no backend, zero VRAM. Install the gallery's secret-filter for a
ready-made set, or define your own:
name: secret-filter
backend: pattern
known_usecases: [token_classify]   # so it appears in the detector picker
pii_detection:
  default_action: block            # a leaked credential shouldn't leave
  builtins:                        # built-in catalogue (enable by name)
    - anthropic_api_key
    - openai_api_key
    - github_token
    - aws_access_key
    - private_key_block
  patterns:                        # operator-defined, restricted subset
    - name: INTERNAL_TOKEN
      match: "tok-[A-Za-z0-9]{32,64}"
      action: block                # optional per-pattern override
      min_len: 36                  # optional length floor
A match is reported under its group (built-in group name, or the pattern
name), so entity_actions / default_action apply exactly as for NER.
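To make the INTERNAL_TOKEN entry concrete, here is a small sketch of what such a pattern matches. It uses Python's `re` module purely as a stand-in (LocalAI compiles patterns with Go's RE2), and it assumes min_len is a floor on the length of the matched text; the function name `find_internal_tokens` is hypothetical.

```python
import re

# Same expression as the INTERNAL_TOKEN pattern in the config above.
PATTERN = re.compile(r"tok-[A-Za-z0-9]{32,64}")
MIN_LEN = 36  # assumption: applied to the length of the whole match


def find_internal_tokens(text: str):
    """Return (start, end, group) for every match that meets the length floor."""
    return [
        (m.start(), m.end(), "INTERNAL_TOKEN")
        for m in PATTERN.finditer(text)
        if m.end() - m.start() >= MIN_LEN
    ]
```

Each hit is reported under the pattern name, which is the group that entity_actions and default_action are then applied to.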
The restricted grammar (validated at load; an invalid pattern is rejected, not silently ignored):
- Allowed: character classes […] and \w \d \s, alternation, anchors ^ $ \b, and quantifiers ? * + {m,n}.
- Rejected: . (any-char), capturing groups, and {n,m} bounds over 4096.
- Required: a literal anchor (such as sk-ant-, ghp_, AKIA). This admits real key
shapes but rejects open-ended ones: an email or a bare \w+ has no such
anchor and belongs to the NER tier.
Use both tiers together: reference an NER detector and a pattern detector in a
model's pii.detectors (or as instance defaults); their hits union, and a
block from either rejects the request.
Any model opts in by enabling PII and referencing one or more detectors — no per-consumer policy:
name: claude-strict
backend: cloud-proxy
proxy:
  mode: passthrough
  provider: anthropic
  upstream_url: https://api.anthropic.com/v1/messages
  api_key_env: ANTHROPIC_API_KEY
pii:
  enabled: true                # default-on for cloud-proxy; explicit for audit
  detectors:
    - privacy-filter-multilingual
Multiple detectors union their detections; overlapping spans resolve to
the strongest action (block > mask > allow). A configured detector
that can't be loaded fails the request closed (HTTP 503,
error.type=pii_ner_unavailable) rather than silently skipping the check.
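The union-and-strongest-action rule can be sketched as follows. The text above only states which action wins on overlap; merging the overlapping spans into one covering span is an assumption of this sketch, and `merge_hits` is a hypothetical helper.

```python
RANK = {"allow": 0, "mask": 1, "block": 2}  # block > mask > allow


def merge_hits(hits):
    """hits: (start, end, action) tuples pooled from every configured detector.

    Returns non-overlapping spans, each carrying the strongest action that
    any overlapping detection asked for.
    """
    merged = []
    for start, end, action in sorted(hits):
        if merged and start < merged[-1][1]:  # overlaps the previous span
            p_start, p_end, p_action = merged[-1]
            stronger = max(p_action, action, key=RANK.__getitem__)
            merged[-1] = (p_start, max(p_end, end), stronger)
        else:
            merged.append((start, end, action))
    return merged
```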
The same NER path runs on the [MITM proxy]({{< relref "mitm-proxy.md" >}})
request body for intercepted hosts. Response/output redaction is out of
scope for now.
The Detector models table on the Middleware → Filtering page lists every
token_classify detector model (neural NER models and in-process pattern
matchers alike) and exposes a per-row Default toggle. Toggling a detector
on adds it to the instance-wide default detector set — one or more models
applied to any PII-enabled model that names none of its own pii.detectors.
It is persisted through POST /api/settings and read live, so a change takes
effect on the next request without a restart. A default that names a model no
longer loaded still appears (marked not loaded) so it can be toggled off.
This is what makes cloud-proxy / MITM redaction work out of the box: those
backends default to PII-enabled but ship no detector list, so without a
default detector the filter runs with nothing to scan. Set one here and
cloud-proxy traffic is scanned with no per-model config.
Resolution precedence (the single decision point is ResolvePIIPolicy,
shared by the chat middleware and the MITM listener so both agree):
1. An explicit pii.enabled on the model wins, whether true or false.
2. Otherwise the backend default applies (on for cloud-proxy).
3. Detectors come from the model's pii.detectors; if it lists none, the
instance-wide default detector(s) are used.
A model that resolves enabled but ends up with no detector at all (a cloud-proxy model with no model detectors and no instance default) scans nothing; set a default detector to close that gap.
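The precedence rules above can be written down as a small function. This is a Python paraphrase for illustration only; the real decision point is the Go function ResolvePIIPolicy, whose signature is not shown in this document, so the parameter names here are assumptions.

```python
from typing import List, Optional, Tuple


def resolve_pii_policy(backend: str, enabled: Optional[bool],
                       model_detectors: List[str],
                       default_detectors: List[str]) -> Tuple[bool, List[str]]:
    """Return (enabled, detectors) for one model.

    enabled is None when the model's YAML does not set pii.enabled.
    """
    if enabled is None:
        # No explicit setting: fall back to the backend default.
        enabled = backend == "cloud-proxy"
    if not enabled:
        return False, []
    # The model's own detector list wins; otherwise use the instance-wide defaults.
    return True, list(model_detectors) or list(default_detectors)
```

An enabled result with an empty detector list is the gap described above: the filter is on but has nothing to scan with.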
The /app/middleware page (admin role only) has four tabs — Filtering,
Routing, MITM Proxy (see the [MITM doc]({{< relref "mitm-proxy.md" >}})),
and Events. The Filtering tab has a Detector models table (every
token_classify filter model, with the per-row Default toggle above and an
edit link to each detector's config, plus an Add detector model button) and
a per-model table listing only the models PII can actually apply to — chat /
completion / embeddings / edit consumers and cloud-proxy models, not
VAD/STT/image models or the detector models themselves. Each row reports the
effective enabled state as an inline toggle — flipping it writes an
explicit pii.enabled to that model's YAML (a server-side deep-merge that
preserves pii.detectors and every other field), so a cloud-proxy model shown
on by backend default can be turned off, and vice-versa — plus the
resolved detector(s) — with a (default) marker when they come from the
instance-wide default rather than the model's YAML — why it is on (YAML /
backend default), and the recent event count. Detection policy
(entity→action, min score) is still edited on each detector model's config
(Models → edit → PII), not globally.
The same detection pipeline is also exposed as a standalone service, so a
client can scan or sanitise a string without routing a full chat request
through it (the inline path above). Two endpoints, both requiring a normal API
key (the pii_filter feature — not admin):
- POST /api/pii/analyze: detect only. Returns the matched entity spans
(entity_type, source ner|pattern, start/end, score, action)
and a blocked flag, without modifying the text.
- POST /api/pii/redact: apply the configured policy. Returns redacted_text
(with masked spans replaced by [REDACTED:<id>]) and masked; when a block
action fires it returns 400 with type: pii_blocked and the offending
entities, never a redacted body.
Both take the same request: text plus a detector selection, either explicit
detector model names in detectors, or a consuming model whose effective
policy is used: the model's own pii.detectors, else the
instance-wide default detectors, exactly as
the inline filter resolves them. A model with PII disabled — or enabled but
with no detector anywhere — is a 400: the inline filter would scan nothing
for it, and the API says so rather than implying a clean scan. The detection
policy lives on the detector models exactly as for the inline filter. The raw
matched value is never returned (an admin may pass reveal: true to include
the audit hash_prefix).
text is scanned as a single document. To reproduce the inline filter's
conversation-context behaviour for multi-message content, join the messages
with blank lines into one text — NER detection quality depends on that
context (a bare 4421 is nothing; after "what are the last four digits of
your card?" it is a PIN).
# Redact with an explicit pattern/NER detector
# (the Authorization header is double-quoted so the shell expands $API_KEY)
curl -sX POST http://localhost:8080/api/pii/redact \
  -H "Authorization: Bearer $API_KEY" -H 'Content-Type: application/json' \
  -d '{"text":"reach me at jane.doe@example.com","detectors":["my-ner-model"]}'
# => {"redacted_text":"reach me at [REDACTED:ner:EMAIL]","masked":true,...}
# Analyze using a consuming model's configured detectors
curl -sX POST http://localhost:8080/api/pii/analyze \
  -H "Authorization: Bearer $API_KEY" -H 'Content-Type: application/json' \
  -d '{"text":"sk-ant-api03-…","model":"gpt-4"}'
# => {"entities":[{"entity_type":"ANTHROPIC_KEY","source":"pattern",...,"action":"block"}],"blocked":true}
Calls are audited in the same event log, tagged with an origin of
pii_analyze / pii_redact (the inline filter records middleware, the MITM
proxy records proxy), so GET /api/pii/events?origin=pii_redact shows just
the redact-API rows.
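For multi-message content, the request body for either endpoint can be assembled as sketched below. Only the text, detectors, and model fields come from the description above; the helper name `build_pii_request` is hypothetical, and treating detectors and model as mutually exclusive is an assumption of this sketch.

```python
import json
from typing import List, Optional


def build_pii_request(messages: List[str],
                      detectors: Optional[List[str]] = None,
                      model: Optional[str] = None) -> str:
    """Build the JSON body for /api/pii/analyze or /api/pii/redact."""
    if bool(detectors) == bool(model):
        raise ValueError("pass either detectors or model")
    # Join with blank lines so the detector keeps conversational context.
    body = {"text": "\n\n".join(messages)}
    if detectors:
        body["detectors"] = list(detectors)
    else:
        body["model"] = model
    return json.dumps(body)
```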
| Method | Path | Auth | Purpose |
|---|---|---|---|
| POST | /api/pii/analyze | api key (pii_filter) | Detect PII in a string; returns entity spans, no mutation. |
| POST | /api/pii/redact | api key (pii_filter) | Redact a string per policy; returns redacted_text or 400 pii_blocked. |
| GET | /api/pii/events | admin | Recent middleware events — PII redactions, MITM connect/traffic, admission denials. Filterable by correlation_id, user_id, pattern_id (e.g. ner:EMAIL), kind, origin. |
| GET | /api/middleware/status | admin | Aggregated dashboard data: per-model PII state + detectors + router status + MITM status + admission status. One round-trip for the UI. |
The same surface is mirrored through the LocalAI Assistant MCP server:
| Tool | Read/Write | Purpose |
|---|---|---|
| get_pii_events | read | Recent redaction / block events with optional filters. |
| get_middleware_status | read | Aggregator: the same payload as GET /api/middleware/status. |
Detection policy is part of a detector model's config, so it is managed
through the model-config tools (edit_model_config), not a dedicated PII
tool.
A router model is a model whose YAML carries a router: block. When
a client addresses it ("model": "smart-router"), the middleware
classifies the prompt, picks a downstream candidate model, rewrites
input.Model to the candidate, and the standard model-resolution path
runs against that resolved target. ACL checks, disabled-state, and
per-model PII all apply to the resolved model — the router does
model selection only.
Candidates must not themselves be router models. A
smart-router → claude-strict → cloud-proxy chain is fine
(claude-strict is a regular cloud-proxy model). A
smart-router → other-router → real-model chain is rejected at runtime
by the middleware (the dispatcher returns HTTP 500 with a
depth-1 invariant error). This keeps the dispatch graph acyclic and
predictable.
If no candidate's label set covers the active label set from the classifier,
or the classifier errors out, the router uses cfg.Router.Fallback.
An empty fallback causes the dispatch to fail with HTTP 500 rather
than silently routing somewhere unintended — fail-fast, not
silent-bypass.
LocalAI ships two classifier implementations. Pick one with classifier:
in the router YAML:
| Classifier | When to use | Underlying primitive |
|---|---|---|
| score (default) | Small classifier-tuned LM (Arch-Router-style). Best when label vocabulary is well-covered by next-token continuation. | Score gRPC primitive (llama-cpp, vLLM). |
| colbert | When label descriptions are abstract or short and a next-token classifier produces flat distributions. Robust on long-form policy descriptions. | rerankers backend in ColBERT mode (e.g. bge-m3-colbert from the gallery). |
Both classifiers share the same YAML shape: classifier_model,
policies, candidates, fallback, activation_threshold,
classifier_cache_size, and the optional embedding_cache block.
The score classifier works like this:
1. Each policy label is scored as a candidate continuation of the prompt
through the Score gRPC primitive (backend.proto::Score), which returns
per-candidate log-probabilities, length-normalized so candidates of unequal
token length stay comparable.
2. A softmax turns those scores into a probability distribution; every label
whose probability clears activation_threshold joins the active label set.
3. The router picks the first candidate whose Labels is a superset of the
active set. Admins order candidates smallest → largest so a single-label
query routes to the smallest capable model, while a query that
activates multiple labels falls to a candidate that covers them all.
This is the Arch-Router approach extended for multi-label. The distribution carries more signal than the argmax: reading off the spread lets one prompt activate multiple policies and route to a model capable of all of them.
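The selection rule can be sketched in a few lines. This is an illustration under stated assumptions rather than the router's code: it assumes the softmax is taken directly over the length-normalized log-probabilities, that "clears" means greater than or equal, and the function names are hypothetical.

```python
import math
from typing import Dict, List, Set, Tuple


def active_labels(logprobs: Dict[str, float], threshold: float) -> Set[str]:
    """Softmax the per-label scores and keep the labels that clear the threshold."""
    top = max(logprobs.values())
    exps = {label: math.exp(lp - top) for label, lp in logprobs.items()}
    total = sum(exps.values())
    return {label for label, e in exps.items() if e / total >= threshold}


def pick_candidate(active: Set[str], candidates: List[Tuple[str, List[str]]],
                   fallback: str = "") -> str:
    """Return the first candidate whose labels cover the active set."""
    for model, labels in candidates:
        if active <= set(labels):
            return model
    if fallback:
        return fallback
    # Mirrors the fail-fast behaviour: no silent routing when nothing matches.
    raise RuntimeError("no candidate covers the active label set and no fallback is set")
```

Because candidates are checked in order, listing them smallest to largest is what sends a single-label prompt to the smallest model that can handle it.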
Arch-Router-1.5B is the canonical choice. It's a Qwen-2.5-1.5B-Instruct base trained specifically on routing-policy continuation, so the ChatML system-prompt
The classifier model must support the Score gRPC primitive (today: the
llama-cpp and vLLM backends) and use the ChatML chat template. Any small
ChatML instruct model works under those constraints, but expect flatter
probability distributions which translate to a higher
activation_threshold to keep noise out of the active label set.
On llama-cpp, scoring rides the server's task queue alongside
generation and embeddings, so the classifier may share a model config
with chat/completion/embeddings — a dedicated scorer model is no
longer required. Repeated calls with the same prompt also reuse the
prompt's KV cache across candidates.
The colbert classifier reranks each policy description against the
prompt via the rerankers backend and activates the labels whose
relevance scores clear activation_threshold (default 0.5 for
reranker-style scores in [0, 1]).
router:
  classifier: colbert
  classifier_model: bge-m3-colbert   # gallery entry; loads BAAI/bge-m3 in ColBERT mode
  activation_threshold: 0.5
  policies:
    - label: code-generation
      description: writing, debugging, reading, or explaining code
    - label: casual-chat
      description: small talk, greetings, jokes
  candidates: [...]
The reranker scores the description (natural English) rather than
asking a small LM to score the label as a next-token continuation,
so it tends to be more robust when policy labels are abstract slugs
(compliance-review, tier-2-support). The trade-off is one
reranker round-trip per request — bge-m3 in ColBERT mode is fast
enough on GPU that this is comparable to the Score path for most
workloads. The embedding_cache block applies identically.
The reranker model's type: (in the model YAML) selects which
underlying scoring head loads — colbert for late-interaction MaxSim,
cross-encoder for cross-attention scoring. The classifier itself is
indifferent; pick the head that fits your latency / quality budget.
name: smart-router
known_usecases:
  - chat
router:
  # `score` (Arch-Router-style next-token scoring) or `colbert`
  # (rerank policy descriptions). See "Available classifiers" above.
  classifier: score
  # A model loaded by LocalAI that supports the Score gRPC primitive
  # (llama-cpp and vLLM ship implementations). Arch-Router-1.5B is the
  # canonical choice.
  classifier_model: arch-router-1.5b
  # Bounded LRU keyed on (case-folded, whitespace-trimmed) prompt. Prompts
  # repeat in agent loops; the cache amortises the classifier round-trip
  # across them. 0 here means "use the default" (1024); the cache cannot be
  # disabled from YAML today.
  classifier_cache_size: 256
  # Softmax probability floor a label must clear to join the active label set.
  # 0 = use the package default (0.15). 0.40 is a better empirical
  # starting point on Arch-Router-1.5B; see the tuning note below.
  activation_threshold: 0.40
  # Used when no candidate covers the active label set, or the classifier
  # itself errors. Empty here = fail-fast with HTTP 500.
  fallback: qwen3-0.6b
  # The label vocabulary. Descriptions are fed verbatim into the
  # classifier's system prompt; short, action-oriented sentences work
  # best ("writing or debugging code", "small talk").
  policies:
    - label: code-generation
      description: writing, debugging, reading, or explaining code in any programming language
    - label: casual-chat
      description: small talk, greetings, jokes, or general conversation with no specific task
    - label: math-reasoning
      description: arithmetic, equations, percentage calculations, or step-by-step word problems
  # Routing table: order matters (smallest → largest). See "Score
  # classifier" above for the matching rule.
  candidates:
    - model: qwen3-0.6b
      labels: [casual-chat]
    - model: qwen_qwen3.5-2b
      labels: [code-generation, casual-chat, math-reasoning]
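The classifier_cache_size comment in the config above describes a bounded LRU keyed on the normalized prompt. A minimal sketch of that idea, assuming normalization means stripping surrounding whitespace and case-folding (the class name `DecisionCache` is hypothetical):

```python
from collections import OrderedDict

DEFAULT_SIZE = 1024


class DecisionCache:
    """Bounded LRU from normalized prompt to routing decision."""

    def __init__(self, size: int = 0):
        self.size = size or DEFAULT_SIZE  # 0 means "use the default"
        self._items = OrderedDict()

    @staticmethod
    def key(prompt: str) -> str:
        return prompt.strip().casefold()

    def get(self, prompt: str):
        k = self.key(prompt)
        if k not in self._items:
            return None
        self._items.move_to_end(k)  # mark as most recently used
        return self._items[k]

    def put(self, prompt: str, decision: str) -> None:
        k = self.key(prompt)
        self._items[k] = decision
        self._items.move_to_end(k)
        while len(self._items) > self.size:
            self._items.popitem(last=False)  # evict the least recently used entry
```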
The activation_threshold is the single knob you'll want to tune per (classifier-model, policy-set) pair. On Arch-Router-1.5B with the three-policy setup above, sweeping the threshold over a hand-labeled 30-prompt corpus produced:
| Threshold | Label-set accuracy | End-to-end routing accuracy |
|---|---|---|
| 0.15 (package default) | 30% | 73% |
| 0.30 | 57% | 87% |
| 0.40 | 60% | 90% |
| 0.45 | 67% | 97% |
| 0.50 | 67% | 97% |
The classifier's argmax matches the dominant label 93% of the time on this corpus — what the threshold controls is how much secondary-label noise leaks into the active label set. Low thresholds push single-label queries to multi-label-capable (larger) candidates unnecessarily; 0.40 keeps the dominant label dominant without losing genuine compound activations.
Re-tune per (classifier-model, policy-set) pair. The /api/score
endpoint (see below) is the convenient probe — it returns the raw
length-normalized log-probabilities so you can sweep thresholds offline
without driving real chat completions.
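An offline sweep of this kind can be sketched as below. The input is assumed to be per-prompt label log-probabilities (for example collected through /api/score) paired with hand-assigned label sets; the function names are hypothetical, and the sketch assumes a softmax over the returned scores.

```python
import math
from typing import Dict, List, Set, Tuple

Sample = Tuple[Dict[str, float], Set[str]]  # (label log-probs, expected label set)


def active_labels(logprobs: Dict[str, float], threshold: float) -> Set[str]:
    """Softmax the scores and keep labels whose probability clears the threshold."""
    top = max(logprobs.values())
    exps = {label: math.exp(lp - top) for label, lp in logprobs.items()}
    total = sum(exps.values())
    return {label for label, e in exps.items() if e / total >= threshold}


def sweep(samples: List[Sample], thresholds: List[float]) -> Dict[float, float]:
    """Label-set accuracy (exact match with the expected set) per threshold."""
    return {
        t: sum(active_labels(lp, t) == want for lp, want in samples) / len(samples)
        for t in thresholds
    }
```

Running `sweep` over your own labeled prompts gives the same kind of table as above for your classifier model and policy set.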
Classification is the most expensive thing the middleware does. The score classifier already memo-caches verbatim repeats (case- and whitespace-folded prompt → decision); the embedding cache is the L2 tier that catches semantically similar prompts — "How do I exit vim?" and "i need to quit vim" can share a decision instead of running the classifier twice.
Pairs naturally with a larger / slower classifier model: the steady-state
cost on cache hits collapses to one embedding round-trip plus a KNN
search, both well under 100ms with nomic-embed-text-v1.5 + local-store.
Add an embedding_cache: block to a router model:
router:
  classifier: score
  classifier_model: arch-router-1.5b
  policies: [...]
  candidates: [...]
  embedding_cache:
    embedding_model: nomic-embed-text-v1.5   # any loaded embedding model
    similarity_threshold: 0.80               # cosine sim floor for a hit (default 0.80)
    confidence_threshold: 0.60               # min top-label prob to cache a decision (default 0.60)
    # store_name: router-cache-smart-router  # optional override; defaults to "router-cache-<router>"
Omit the block entirely to disable. The cache adds two new failure modes (embedder unavailable, store unavailable) — both fall through to the inner classifier so routing keeps working.
For each request:
1. Embed the prompt with embedding_model.
2. Look up the nearest stored embedding; if its cosine similarity clears
similarity_threshold, return the cached decision
(Cached=true, CacheSimilarity=<sim> in the decision log).
3. Otherwise run the inner classifier; if decision.score >= confidence_threshold,
insert (embedding, decision) into the store. Low-confidence
decisions are deliberately skipped so they can't poison future
paraphrases.
The local-store collection is named router-cache-<router-model-name> by
default — each router gets its own collection so two routers can't
cross-contaminate. Collections persist on disk (local-store is the
canonical persistent vector backend), so the cache survives restarts.
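The per-request cache logic can be sketched with an in-memory list standing in for the local-store collection. This is an illustration only: the class name `SemanticCache` is hypothetical, and a linear scan replaces the KNN search.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, similarity_threshold: float = 0.80,
                 confidence_threshold: float = 0.60):
        self.similarity_threshold = similarity_threshold
        self.confidence_threshold = confidence_threshold
        self.entries: List[Tuple[Sequence[float], str]] = []  # (embedding, decision)

    def lookup(self, embedding: Sequence[float]) -> Optional[str]:
        """Return the cached decision of the nearest entry if it is similar enough."""
        best = max(self.entries, key=lambda e: cosine(e[0], embedding), default=None)
        if best is not None and cosine(best[0], embedding) >= self.similarity_threshold:
            return best[1]
        return None

    def maybe_store(self, embedding: Sequence[float], decision: str, score: float) -> bool:
        """Cache only confident decisions so weak ones cannot poison paraphrases."""
        if score < self.confidence_threshold:
            return False
        self.entries.append((embedding, decision))
        return True
```

On a miss the caller runs the inner classifier and then calls `maybe_store` with the resulting decision and its top-label score.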
A config change invalidates the in-process classifier cache (it is
fingerprinted by yaml.Marshal), but the underlying local-store collection
still holds the old payloads. Flush it manually via local-store admin, or
rename store_name, if you need a hard reset.
The /app/middleware page has a Routing tab listing every router
model's classifier, policies, candidates, and fallback. The Events
tab shows the decision log — one row per classified request with
correlation ID, requested model, served model, classifier name, active
labels, top-label score, and latency.
Routing decisions are stored in an in-process ring buffer (default
capacity 5,000). The decision log is for audit and tuning — the
canonical usage log lives in /api/usage and correlates by request ID.
| Method | Path | Auth | Purpose |
|---|---|---|---|
| GET | /api/router/status | any | Router configuration: each router model's classifier, policies, candidates. |
| GET | /api/router/decisions | admin | Decision log with optional filters (correlation_id, user_id, router_model, limit). |
| POST | /api/score | admin | Direct access to the Score gRPC primitive — useful for offline threshold tuning. Body: {"model": "<classifier-model>", "prompt": "<chatml-prompt>", "candidates": ["label-a", ...], "length_normalize": true}. The llama-cpp and vLLM backends implement Score; other backends return UNIMPLEMENTED. |
| Tool | Read/Write | Purpose |
|---|---|---|
| get_router_decisions | read | Recent decision log with optional filters. |
| get_middleware_status | read | Includes the router section listing configured router models. |
Mutating routing config — adding a candidate, changing the classifier
model — is YAML-only today; reload with POST /models/reload to pick
up edits without restarting.
- POST /models/reload re-reads from disk; the next request
rebuilds the classifier from the new config (the classifier cache is
fingerprinted by yaml.Marshal(RouterConfig) so it invalidates
automatically).
- … backend/cpp/llama-cpp/grpc-server.cpp::Score. Until then,
classifier_cache_size is the highest-leverage knob for repeat-query
workloads (agent loops).
- … proxy-* backends to send simple prompts to local
models and complex ones to cloud providers.
- … /app/middleware page; in
no-auth single-user mode the synthetic local user has admin role
automatically.