website/docs/developer-guide/plugin-llm-access.md
ctx.llm is the supported way for a plugin to make an LLM call.
Chat completion, structured extraction, sync, async, with or without
images — same surface, same trust gate, same host-owned credentials.
Plugins reach for this when they need to do something that involves the model but isn't part of the agent's conversation. A hook that rewrites a tool error into something a non-engineer can read. A gateway adapter that translates an inbound message before queuing it. A slash command that summarises a long paste. A scheduled job that scores yesterday's activity and writes one line to a status board. A pre-filter that decides whether a message is worth waking the agent up for at all.
These are jobs the agent shouldn't be in the loop on. They want one LLM call, a typed answer, and to be done.
```python
result = ctx.llm.complete(messages=[{"role": "user", "content": "ping"}])
return result.text
```
That's the whole API in one line. No keys, no provider config, no SDK initialisation. The plugin runs against whatever provider and model the user is currently using — when they switch providers, the plugin follows them automatically.
```python
result = ctx.llm.complete(
    messages=[
        {"role": "system", "content": "Rewrite errors as one short sentence a non-engineer can act on."},
        {"role": "user", "content": traceback_text},
    ],
    max_tokens=64,
    purpose="hooks.error-rewrite",
)
return result.text
```
purpose is a free-form audit string — it shows up in agent.log
and in result.audit so operators can see which plugin made which
call. Optional but recommended for anything that fires often.
When the plugin needs a typed answer, switch to the structured lane:
```python
result = ctx.llm.complete_structured(
    instructions="Score this support reply for urgency (0–1) and pick a category.",
    input=[{"type": "text", "text": message_body}],
    json_schema=TRIAGE_SCHEMA,
    purpose="support.triage",
    temperature=0.0,
    max_tokens=128,
)
# result.parsed is None when the model did not return valid JSON.
if result.parsed is not None and result.parsed["urgency"] > 0.8:
    await dispatch_to_oncall(result.parsed["category"], message_body)
```
The host requests JSON output from the provider, parses it locally
as a fallback, validates against your schema if jsonschema is
installed, and hands back a Python object on result.parsed. If the
model couldn't produce valid JSON, result.parsed is None and
result.text carries the raw response.
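`TRIAGE_SCHEMA` is not defined in the snippet above. The following is a minimal sketch of what it might look like, assuming only the two fields the snippet reads (`urgency` and `category`). The category values and the `needs_oncall` helper are hypothetical, and the `SimpleNamespace` objects stand in for the real result type so the sketch runs on its own.

```python
from types import SimpleNamespace

# Hypothetical schema: only "urgency" and "category" come from the snippet
# above; the enum values are made up for illustration.
TRIAGE_SCHEMA = {
    "type": "object",
    "properties": {
        "urgency": {"type": "number", "minimum": 0, "maximum": 1},
        "category": {"type": "string", "enum": ["billing", "outage", "other"]},
    },
    "required": ["urgency", "category"],
}


def needs_oncall(result, threshold: float = 0.8) -> bool:
    """Return True only when the model produced a parsed, urgent answer."""
    parsed = result.parsed
    if not isinstance(parsed, dict):
        # parsed is None when the model could not produce valid JSON.
        return False
    urgency = parsed.get("urgency")
    return isinstance(urgency, (int, float)) and urgency > threshold


# Stand-ins for the structured result object, just to exercise the helper.
urgent = SimpleNamespace(parsed={"urgency": 0.93, "category": "outage"}, text="")
unparsed = SimpleNamespace(parsed=None, text="sorry, I can't help with that")
```

Checking `result.parsed` before indexing keeps a malformed model response from turning into a `TypeError` inside the plugin.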
The surface is four methods: `complete()` for chat, `complete_structured()` for typed JSON, and `acomplete()` and `acomplete_structured()` for asyncio. Same arguments, same result objects.

Two complete plugins below: one chat, one structured. Both ship inside a single `register(ctx)` function and need zero outside configuration to run against whatever model the user has active.
`/tldr` (chat):

```python
def register(ctx):
    ctx.register_command(
        name="tldr",
        handler=lambda raw: _tldr(ctx, raw),
        description="Summarise the supplied text in one paragraph.",
        args_hint="<text>",
    )


def _tldr(ctx, raw_args: str) -> str:
    text = raw_args.strip()
    if not text:
        return "Usage: /tldr <text to summarise>"
    result = ctx.llm.complete(
        messages=[
            {"role": "system",
             "content": "Summarise the user's text in one tight paragraph. No preamble."},
            {"role": "user", "content": text},
        ],
        max_tokens=256,
        temperature=0.3,
        purpose="tldr",
    )
    return result.text
```
result.text is the model's response; result.usage carries token
counts; result.provider and result.model carry attribution.
`/paste-to-tasks` (structured):

```python
def register(ctx):
    ctx.register_command(
        name="paste-to-tasks",
        handler=lambda raw: _paste_to_tasks(ctx, raw),
        description="Turn freeform meeting notes into structured tasks.",
        args_hint="<text>",
    )


_TASKS_SCHEMA = {
    "type": "object",
    "properties": {
        "tasks": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "owner": {"type": "string"},
                    "action": {"type": "string"},
                    "due": {"type": "string", "description": "ISO date or empty"},
                },
                "required": ["action"],
            },
        },
    },
    "required": ["tasks"],
}


def _paste_to_tasks(ctx, raw_args: str) -> str:
    if not raw_args.strip():
        return "Usage: /paste-to-tasks <meeting notes>"
    result = ctx.llm.complete_structured(
        instructions=(
            "Extract concrete action items from these meeting notes. "
            "One task per actionable line. If no owner is named, leave 'owner' blank."
        ),
        input=[{"type": "text", "text": raw_args}],
        json_schema=_TASKS_SCHEMA,
        schema_name="meeting.tasks",
        purpose="paste-to-tasks",
        temperature=0.0,
        max_tokens=512,
    )
    if result.parsed is None:
        return f"Couldn't parse a response. Raw output:\n{result.text}"
    lines = [f"- [{t.get('owner') or '?'}] {t['action']}" for t in result.parsed["tasks"]]
    return "\n".join(lines) or "(no tasks found)"
```
A third worked example, this time with image input, lives in the
hermes-example-plugins
repo (companion repo for reference plugins — not bundled with
hermes-agent itself). For the async surface (acomplete() /
acomplete_structured() with asyncio.gather()), see
plugin-llm-async-example
in the same repo.
| You want… | Reach for |
|---|---|
| A free-form text response (translation, summary, rewrite, generation) | complete() |
| A multi-turn prompt (system + few-shot examples + user) | complete() |
| A typed dict back, validated against a schema | complete_structured() |
| Image-or-text input with a typed dict back | complete_structured() |
| The same call from async code (gateway adapters, async hooks) | acomplete() / acomplete_structured() |
Everything else — provider selection, model resolution, auth, fallback, timeout, vision routing — is the same across all four.
ctx.llm is an instance of agent.plugin_llm.PluginLlm.
`complete()`

```python
result = ctx.llm.complete(
    messages=[{"role": "user", "content": "Hi"}],
    provider=None,        # optional, gated: Hermes provider id (e.g. "openrouter")
    model=None,           # optional, gated: whatever string that provider expects
    temperature=None,
    max_tokens=None,
    timeout=None,         # seconds
    agent_id=None,        # optional, gated
    profile=None,         # optional, gated: explicit auth-profile name
    purpose="optional-audit-string",
)
# → PluginLlmCompleteResult(text, provider, model, agent_id, usage, audit)
```
Plain chat completion. messages is the standard OpenAI shape — a
list of {"role": "...", "content": "..."} dicts. Multi-turn
prompts (system + few-shot user/assistant pairs + final user) work
exactly as they would with the OpenAI SDK.
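As an illustration of the multi-turn shape described above, here is a small sketch that assembles a system prompt, few-shot pairs, and the final user turn into one `messages` list. The helper name `build_few_shot_messages` and the sentiment example are hypothetical; only the `{"role", "content"}` dict shape comes from this page.

```python
from typing import Dict, List, Tuple

Message = Dict[str, str]


def build_few_shot_messages(
    system: str,
    examples: List[Tuple[str, str]],
    user_input: str,
) -> List[Message]:
    """Assemble system + (user, assistant) example pairs + the final user turn."""
    messages: List[Message] = [{"role": "system", "content": system}]
    for example_input, example_output in examples:
        messages.append({"role": "user", "content": example_input})
        messages.append({"role": "assistant", "content": example_output})
    messages.append({"role": "user", "content": user_input})
    return messages


messages = build_few_shot_messages(
    system="Classify the sentiment as positive or negative. Answer with one word.",
    examples=[("I love this.", "positive"), ("This is broken again.", "negative")],
    user_input="Setup took five minutes and just worked.",
)
# The list can then be passed along as ctx.llm.complete(messages=messages, ...).
```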
provider= and model= are independent and follow the same shape
as the host's main config (model.provider + model.model). Set
just model= to use the user's active provider with a different
model on it. Set both to switch providers entirely. Either argument
without operator opt-in raises PluginLlmTrustError.
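A plugin that prefers a particular model but should still work without operator opt-in could catch the trust error and retry with no overrides. The sketch below is one way to do that. `PluginLlmTrustError` and `FakeLlm` are local stand-ins so the example runs on its own (this page does not say where the real exception is imported from), and `complete_with_preferred_model` is a hypothetical helper.

```python
from types import SimpleNamespace


class PluginLlmTrustError(Exception):
    """Stand-in for the real exception so this sketch is self-contained."""


class FakeLlm:
    """Stand-in for ctx.llm that denies every model override."""

    def complete(self, messages, model=None, **kwargs):
        if model is not None:
            raise PluginLlmTrustError("model override not allowed")
        return SimpleNamespace(text="pong", model="user-default")


def complete_with_preferred_model(llm, messages, preferred_model):
    """Try the preferred model; fall back to the user's active model if gated."""
    try:
        return llm.complete(messages=messages, model=preferred_model)
    except PluginLlmTrustError:
        # The operator has not opted in; run with no overrides instead.
        return llm.complete(messages=messages)


result = complete_with_preferred_model(
    FakeLlm(), [{"role": "user", "content": "ping"}], "openai/gpt-4o-mini"
)
```

Whether falling back silently is appropriate depends on the plugin; one that only makes sense on a specific model may prefer to surface the error to the operator.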
`complete_structured()`

```python
result = ctx.llm.complete_structured(
    instructions="What you want extracted.",
    input=[
        {"type": "text", "text": "..."},
        {"type": "image", "data": b"...", "mime_type": "image/png"},
        {"type": "image", "url": "https://..."},
    ],
    json_schema={...},    # optional; triggers parsed result + validation
    json_mode=False,      # set True without a schema to ask for JSON anyway
    schema_name=None,     # optional human-readable schema name
    system_prompt=None,
    provider=None,        # optional, gated
    model=None,           # optional, gated
    temperature=None,
    max_tokens=None,
    timeout=None,
    agent_id=None,
    profile=None,
    purpose=None,
)
# → PluginLlmStructuredResult(text, provider, model, agent_id,
#                             usage, parsed, content_type, audit)
```
Inputs are typed text or image blocks (raw bytes get base64 encoded
as a data: URL automatically). When json_schema or
json_mode=True is supplied, the host requests JSON output via
response_format, parses it locally as a fallback, and validates
against your schema if jsonschema is installed.
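To make the "raw bytes become a `data:` URL" step concrete, here is a sketch of that encoding using only the standard library. It illustrates the general technique, not the host's actual code, and `to_data_url` is a hypothetical name.

```python
import base64


def to_data_url(data: bytes, mime_type: str) -> str:
    """Encode raw bytes as a base64 data: URL."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# The first four bytes of a PNG file, used here as a tiny sample payload.
url = to_data_url(b"\x89PNG", "image/png")
```

Plugins do not need to do this themselves; passing `data` and `mime_type` in an image block is enough.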
- `result.content_type == "json"`: `result.parsed` is a Python object that matches your schema.
- `result.content_type == "text"`: parsing or validation failed; inspect `result.text` for the raw model response.

`acomplete()` / `acomplete_structured()`

```python
result = await ctx.llm.acomplete(messages=...)
result = await ctx.llm.acomplete_structured(instructions=..., input=...)
```
Same arguments and result types as their sync counterparts. Use these from gateway adapters, async hooks, or any plugin code already running on an asyncio loop.
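A common reason to use the async methods is to fan out several independent calls at once. The sketch below shows the pattern with `asyncio.gather()`. `FakeAsyncLlm` is a stand-in for `ctx.llm` that simply upper-cases the prompt, and `complete_all` and the `purpose` string are hypothetical.

```python
import asyncio
from types import SimpleNamespace


class FakeAsyncLlm:
    """Stand-in for ctx.llm; echoes the last user message in upper case."""

    async def acomplete(self, messages, **kwargs):
        await asyncio.sleep(0)  # yield to the loop, as a network call would
        return SimpleNamespace(text=messages[-1]["content"].upper())


async def complete_all(llm, texts):
    """Issue one acomplete() call per text and wait for all of them together."""
    calls = [
        llm.acomplete(
            messages=[{"role": "user", "content": text}],
            max_tokens=64,
            purpose="example.fan-out",
        )
        for text in texts
    ]
    results = await asyncio.gather(*calls)  # results keep the input order
    return [r.text for r in results]


replies = asyncio.run(complete_all(FakeAsyncLlm(), ["first", "second"]))
```

Inside a gateway adapter or async hook the loop is already running, so the plugin would `await complete_all(ctx.llm, texts)` directly rather than call `asyncio.run()`.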
```python
@dataclass
class PluginLlmCompleteResult:
    text: str                 # the assistant's response
    provider: str             # e.g. "openrouter", "anthropic"
    model: str                # whatever the provider returned for this call
    agent_id: str             # whose model/auth was used
    usage: PluginLlmUsage     # tokens + cache + cost estimate
    audit: Dict[str, Any]     # plugin_id, purpose, profile


@dataclass
class PluginLlmStructuredResult(PluginLlmCompleteResult):
    parsed: Optional[Any]     # JSON object when content_type == "json"
    content_type: str         # "json" or "text"
    # audit also carries schema_name when supplied
```
usage carries input_tokens, output_tokens, total_tokens,
cache_read_tokens, cache_write_tokens, and cost_usd when the
provider returns those fields.
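A scheduled job that makes several calls may want to total their usage. The sketch below sums a few of the fields listed above across results. It assumes a field the provider did not report is `None` or absent, which this page does not state explicitly; `total_usage` is a hypothetical helper, and the `SimpleNamespace` objects stand in for real results.

```python
from types import SimpleNamespace


def total_usage(results):
    """Sum token counts and cost across results, treating missing values as 0."""
    totals = {"input_tokens": 0, "output_tokens": 0, "total_tokens": 0, "cost_usd": 0.0}
    for result in results:
        for field in totals:
            # Assumption: an unreported field is None or missing on usage.
            value = getattr(result.usage, field, None)
            totals[field] += value or 0
    return totals


# Stand-in results: the first provider reported no cost estimate.
first = SimpleNamespace(usage=SimpleNamespace(
    input_tokens=10, output_tokens=5, total_tokens=15, cost_usd=None))
second = SimpleNamespace(usage=SimpleNamespace(
    input_tokens=20, output_tokens=8, total_tokens=28, cost_usd=0.25))

totals = total_usage([first, second])
```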
The default behaviour is fail-closed. With no `plugins.entries` config block, a plugin can:

- call any of the four methods against the user's active provider and model,
- set the request-shaping arguments (`temperature`, `max_tokens`, `timeout`, `system_prompt`, `purpose`, `messages`, `instructions`, `input`, `json_schema`),

…and that's it. The `provider=`, `model=`, `agent_id=`, and `profile=` arguments raise `PluginLlmTrustError` until the operator opts in.
Most plugins never need this section. A plugin that just calls
ctx.llm.complete(messages=...) with no overrides runs against
whatever the user has active and works zero-config. The block below
is only relevant when a plugin specifically wants to pin to a
different model or provider than the user.
```yaml
plugins:
  entries:
    my-plugin:
      llm:
        # Allow this plugin to choose a different Hermes provider
        # (must be one Hermes already knows about; same names as
        # `hermes model` and config.yaml model.provider).
        allow_provider_override: true
        # Optionally restrict which providers. Use ["*"] for any.
        allowed_providers:
          - openrouter
          - anthropic
        # Allow this plugin to ask for a specific model.
        allow_model_override: true
        # Optionally restrict which models. Use ["*"] for any.
        # Models are matched literally against whatever string the
        # plugin sends; Hermes does not look anything up.
        allowed_models:
          - openai/gpt-4o-mini
          - anthropic/claude-3-5-haiku
        # Allow cross-agent calls (rare).
        allow_agent_id_override: false
        # Allow the plugin to request a specific stored auth profile
        # (e.g. a different OAuth account on the same provider).
        allow_profile_override: false
```
The plugin id is the manifest name: field for flat plugins, or the
path-derived key for nested plugins (image_gen/openai,
memory/honcho, etc.).
| Override | Default | Config key |
|---|---|---|
| `provider=` | denied | `allow_provider_override: true` |
| ↳ allowlist | n/a | `allowed_providers: [...]` |
| `model=` | denied | `allow_model_override: true` |
| ↳ allowlist | n/a | `allowed_models: [...]` |
| `agent_id=` | denied | `allow_agent_id_override: true` |
| `profile=` | denied | `allow_profile_override: true` |
Each override is independently gated. Granting allow_model_override
does not also grant allow_provider_override — a plugin trusted
to pick a model is still pinned to the user's active provider unless
it gets the provider gate as well.
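The gating rules can be modelled in a few lines. The sketch below is a reading of the documented behaviour, not the host's implementation: `override_allowed` is a hypothetical function, and it assumes an absent allowlist means "no extra restriction", which this page does not state explicitly.

```python
def override_allowed(llm_config: dict, kind: str, value: str) -> bool:
    """Model the documented gate for kind in {"provider", "model"}.

    Assumption: a missing allowlist places no extra restriction once the
    gate itself is switched on.
    """
    if not llm_config.get(f"allow_{kind}_override", False):
        return False  # fail-closed: the gate itself is off
    allowlist = llm_config.get(f"allowed_{kind}s")
    if allowlist is None or "*" in allowlist:
        return True
    return value in allowlist  # literal match, no lookup


# A plugin trusted to pick one specific model, and nothing else.
config = {
    "allow_model_override": True,
    "allowed_models": ["openai/gpt-4o-mini"],
}
```

With this config the model override passes for `openai/gpt-4o-mini` only, and any provider override is still denied because the provider gate was never granted.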
The request-shaping arguments (`temperature`, `max_tokens`, `timeout`, `system_prompt`, `purpose`, `messages`, `instructions`, `input`, `json_schema`, `schema_name`, `json_mode`) are always allowed; they don't pick credentials or routes. Add a `plugins.entries` block only for plugins that want finer routing.

A complete list of the things `ctx.llm` does for the plugin so you don't have to:
- Resolves `model.provider` + `model.model` from the user's config (or the explicit overrides when trusted).
- Reads credentials from `~/.hermes/auth.json` / env, including the credential pool when one is configured. The plugin never sees them.
- Honours the `timeout=` argument, falling back to `auxiliary.<task>.timeout` config or the global aux default.
- Sends `response_format` to the provider when you ask for JSON, then re-parses locally from a code-fenced response if the provider returned one.
- Validates against your `json_schema` when `jsonschema` is installed; logs a debug line and skips strict validation otherwise.
- Logs each call to `agent.log` with the plugin id, provider/model, purpose, and token totals.

What stays with the plugin:

- The prompt: `messages` for chat, `instructions` + `input` for structured. The plugin builds the prompt; the host runs it.
- Error handling: `complete_structured()` raises `ValueError` on empty inputs and on schema-validation failure. `PluginLlmTrustError` fires when the trust gate denies an override. Anything else (provider 5xx, no credentials configured, timeout) raises whatever `auxiliary_client.call_llm()` raises.
- Cost: don't call `complete()` for every gateway message without thinking about token spend.

Existing `ctx.*` methods extend an existing Hermes subsystem:
| Method | What it does |
|---|---|
| `ctx.register_tool` | adds a tool the agent can call |
| `ctx.register_platform` | wires a new gateway adapter |
| `ctx.register_image_gen_provider` | replaces an image-gen backend |
| `ctx.register_memory_provider` | replaces the memory backend |
| `ctx.register_context_engine` | replaces the context compressor |
| `ctx.register_hook` | observes a lifecycle event |
ctx.llm is the first surface that lets a plugin run the same
model the user is talking to, out of band, without any of the
above. That's its only job. If your plugin needs to register a
tool the agent invokes, use register_tool. If it needs to react
to a lifecycle event, use register_hook. If it needs to make its
own model call — for any reason, structured or not — ctx.llm.
- `agent/plugin_llm.py`
- `tests/agent/test_plugin_llm.py`
- `plugin-llm-example`: sync structured extraction with image input
- `plugin-llm-async-example`: async with `asyncio.gather()`