Known Failure Modes

This document catalogs known failure modes across the Eliza system, organized by subsystem. Each entry describes the observable symptoms, underlying root cause, any current mitigation in place, and the remaining gap or risk.


Runtime / Lifecycle

F-01: Plugin loading order dependency

| Field | Detail |
| --- | --- |
| Status | Open |
| Symptoms | `undefined` service errors at startup. A plugin attempts to call a service registered by another plugin that has not yet loaded. |
| Root cause | There is no explicit dependency graph governing plugin load order. Plugins are loaded in an undefined order and may reference services registered by peers that have not completed initialization. |
| Current mitigation | Some plugins use retry loops to wait for dependent services to become available. |
| Gap / Risk | No formal DAG-based resolution. Retry loops are ad hoc and inconsistent across plugins. A plugin that fails its retry window will surface a confusing runtime error with no indication of which dependency is missing. |
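
The DAG-based resolution named in the F-01 gap could look roughly like the sketch below. It is illustrative only: the `PluginSpec` shape and its `dependsOn` field are hypothetical and do not describe an existing Eliza API. The sketch orders plugins so that dependencies load first, and it names the missing dependency instead of failing with an `undefined` service error.

```typescript
// Hypothetical plugin descriptor; `dependsOn` is an assumed field.
interface PluginSpec {
  name: string;
  dependsOn: string[];
}

// Depth-first topological sort: dependencies appear before their dependents.
function resolveLoadOrder(plugins: PluginSpec[]): string[] {
  const byName = new Map<string, PluginSpec>();
  for (const p of plugins) byName.set(p.name, p);

  const order: string[] = [];
  const state = new Map<string, "visiting" | "done">();

  const visit = (name: string, path: string[]): void => {
    if (state.get(name) === "done") return;
    if (state.get(name) === "visiting") {
      throw new Error(`Dependency cycle: ${[...path, name].join(" -> ")}`);
    }
    const spec = byName.get(name);
    if (!spec) {
      const requiredBy = path.length > 0 ? path[path.length - 1] : "(root)";
      throw new Error(`Plugin "${requiredBy}" depends on missing plugin "${name}"`);
    }
    state.set(name, "visiting");
    for (const dep of spec.dependsOn) visit(dep, [...path, name]);
    state.set(name, "done");
    order.push(name);
  };

  for (const p of plugins) visit(p.name, []);
  return order;
}
```

With an explicit order, a missing or cyclic dependency fails once at startup with a message that identifies the plugin at fault, rather than inside a per-plugin retry loop.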

F-02: Coordinator bridges not re-wired after restart

| Field | Detail |
| --- | --- |
| Status | Open |
| Symptoms | Coding agents become unresponsive after a runtime restart. Commands sent to agents receive no response. |
| Root cause | Coordinator bridges are wired at boot time. On restart, the runtime must re-establish these connections via a retry loop. |
| Current mitigation | A retry loop exists to re-wire bridges after restart. |
| Gap / Risk | If the retry loop exhausts its attempts, the failure is silent. No error is surfaced to the user, and coding agents remain permanently disconnected until a full process restart. |
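
One way to close the silent-exhaustion gap in F-02 is to give the retry loop an explicit exhaustion callback. The sketch below is a minimal illustration under assumed names: `wireWithRetry`, `wire`, and `onExhausted` are hypothetical and do not correspond to functions in the Eliza codebase.

```typescript
// Hypothetical helper: `wire` stands in for whatever call re-establishes a bridge.
// Returns true on success; calls `onExhausted` with the last error and returns
// false once every attempt has failed, so the failure is never silent.
async function wireWithRetry(
  wire: () => Promise<void>,
  maxAttempts: number,
  delayMs: number,
  onExhausted: (lastError: unknown) => void,
): Promise<boolean> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await wire();
      return true;
    } catch (err) {
      lastError = err;
      // Wait between attempts, but not after the final one.
      if (attempt < maxAttempts) {
        await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  onExhausted(lastError);
  return false;
}
```

The callback is where a runtime event or UI notification would be emitted; what that notification looks like is left open here.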

Chat / Streaming

F-03: SSE stream interruption on network blip

| Field | Detail |
| --- | --- |
| Status | Fixed (PR #806) |
| Symptoms | A chat response cuts off mid-stream. The user sees a partial message with no indication that the stream was interrupted. |
| Root cause | Server-Sent Events (SSE) have no built-in replay or resume mechanism. When the network connection drops, the stream terminates and there is no way to recover from the last received offset. |
| Current mitigation | SSE interruption detection with a visual indicator and a retry button in the chat UI. |
| Gap / Risk | No automatic retry with offset tracking. The user loses the partial response and must re-trigger the full generation. On long responses this is especially costly in both time and tokens. |

F-03b: Post-generation error replaces streamed text

| Field | Detail |
| --- | --- |
| Status | Fixed (PR #1833) |
| Symptoms | The LLM streams a full reply successfully, but a post-action continuation fails. The already-streamed text is discarded and replaced with a generic "provider issue" message, confusing the user. |
| Root cause | The error handler did not distinguish between failures that occurred before any text was streamed and failures that occurred after. Both paths produced the same generic fallback reply. |
| Current mitigation | The streaming error handler now checks whether text was already delivered. If so, the streamed text is preserved in the final `done` SSE event instead of being replaced. Errors are logged for diagnosis. |
| Gap / Risk | None. The user retains the partial or complete reply that was already visible. |
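
The distinction described in F-03b (failure before any text was streamed versus failure after) can be sketched as follows. The `DoneEvent` shape, the `buildDoneEvent` name, and the fallback wording are assumptions for illustration; the actual handler and payload may differ.

```typescript
// Hypothetical payload for the final SSE event; the real shape may differ.
interface DoneEvent {
  type: "done";
  text: string;
}

// Assumed wording for the generic fallback reply.
const GENERIC_FALLBACK = "Something went wrong with the provider. Please try again.";

function buildDoneEvent(streamedText: string, error?: Error): DoneEvent {
  if (!error) {
    return { type: "done", text: streamedText };
  }
  // Log in both failure paths so the error can still be diagnosed.
  console.error("generation failed:", error.message);
  if (streamedText.length > 0) {
    // Text already reached the user: keep it instead of replacing it.
    return { type: "done", text: streamedText };
  }
  // Nothing was streamed, so the generic fallback is the only reply.
  return { type: "done", text: GENERIC_FALLBACK };
}
```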

F-04: Insufficient credits fallback detection

| Field | Detail |
| --- | --- |
| Status | Fixed (PR #806) |
| Symptoms | The user receives an empty response or a generic error message when the model provider returns a credit/quota exhaustion error. |
| Root cause | Provider-specific credit exhaustion errors were not fully mapped in the error handling pipeline. Some providers return non-standard error shapes that were not caught. |
| Current mitigation | Expanded credit exhaustion detection covers HTTP 402, 429 with billing context, and structured error shapes. |
| Gap / Risk | Error shapes outside the mapped set may still go undetected. In that case the user gets no actionable feedback and cannot distinguish between a system bug and a billing issue. |
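
The three detection paths named in the F-04 mitigation (HTTP 402, 429 with billing context, structured error shapes) can be illustrated with a small classifier. This is a sketch under assumptions: `ProviderError` is a hypothetical normalized shape, and the keyword list is a guess rather than the mapping used in PR #806.

```typescript
// Hypothetical normalized error shape; real provider errors differ.
interface ProviderError {
  status?: number;
  code?: string;
  message?: string;
}

// Assumed keywords that signal a billing problem rather than a plain rate limit.
const BILLING_HINTS = /credit|quota|billing|payment/i;

function isCreditExhaustion(err: ProviderError): boolean {
  // 402 Payment Required is unambiguous.
  if (err.status === 402) return true;
  // 429 is usually a rate limit; treat it as billing only with supporting text.
  if (err.status === 429) {
    return BILLING_HINTS.test(`${err.code ?? ""} ${err.message ?? ""}`);
  }
  // Structured shapes: rely on a machine-readable code such as "insufficient_quota".
  return err.code !== undefined && BILLING_HINTS.test(err.code);
}
```

A provider whose error matches none of these branches falls through as a generic error.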

Connectors

F-05: WhatsApp QR session state loss on restart

| Field | Detail |
| --- | --- |
| Status | Fixed (PR #826) |
| Symptoms | After a process restart, the WhatsApp connector requires the user to re-scan the QR code to re-authenticate. |
| Root cause | Session state is persisted to `authDir`, but `stop()` closed the socket without flushing pending credential writes. A `creds.update` event that fired but whose `saveCreds()` hadn't completed would lose session state. Additionally, no notification was emitted on `loggedOut` disconnect, and no logging indicated whether session restoration succeeded. |
| Current mitigation | `stop()` now flushes credentials via `saveCreds()` before closing the socket. A `loggedOut` disconnect emits a `WHATSAPP_DISCONNECTED` runtime event for UI notification. Session restoration outcome is logged after `connect()`. |
| Gap / Risk | If the WhatsApp device is explicitly removed server-side, re-pairing via QR is still required (this is a WhatsApp protocol limitation, not a bug). |

F-06: Discord/Telegram token expiry

| Field | Detail |
| --- | --- |
| Status | Fixed (PR #806) |
| Symptoms | The Discord or Telegram connector stops working silently. Messages are no longer received or sent. |
| Root cause | There is no token refresh mechanism. When a token expired or was revoked, the connector entered a failed state without notification. |
| Current mitigation | Connector health monitor with WebSocket alerts on disconnect. |
| Gap / Risk | Before the fix there was no user alert on token failure: the connector appeared connected in the UI but was functionally dead, and the user discovered the issue only when messages stopped being processed. The mitigation adds alerting only; it does not add token refresh. |

Knowledge

F-07: Documents service loading timeout

| Field | Detail |
| --- | --- |
| Status | Fixed (PR #806) |
| Symptoms | The knowledge tab in the dashboard appears empty or shows a loading spinner indefinitely. |
| Root cause | The embedding service can take more than 10 seconds to initialize on large databases. When this exceeds the configured timeout, the load fails. |
| Current mitigation | Shared documents service loader with a configurable timeout and a client retry UI. |
| Gap / Risk | Before the fix the failure was silent: the UI did not indicate that loading timed out or provide a way to retry, and the user saw an empty knowledge tab with no explanation. Slow embedding initialization itself is not addressed, so loads on large databases can still time out and require a retry. |

F-08: Large document upload rejected

| Field | Detail |
| --- | --- |
| Status | Fixed (PR #816) |
| Symptoms | Uploading a document larger than 32 MB fails. The upload is rejected by the server. |
| Root cause | The document upload body limit enforces a hard cap on upload size. |
| Current mitigation | Both upload endpoints (`/api/documents` and `/api/documents/bulk`) return a clear 413 error: "Document upload exceeds the 32 MB limit. Split large files into smaller parts before uploading." |
| Gap / Risk | No auto-chunking fallback. Users must split large files manually before uploading. |
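
The size check in F-08 and the manual split it asks for can be sketched as below. The function names are hypothetical, and the constant assumes that "32 MB" means 32 × 1024 × 1024 bytes, which the source does not state.

```typescript
// Assumption: "32 MB" is interpreted as 32 MiB.
const MAX_UPLOAD_BYTES = 32 * 1024 * 1024;

interface UploadCheck {
  status: 200 | 413;
  error?: string;
}

// Mirrors the documented behavior: oversized bodies get a 413 with a clear message.
function checkUploadSize(bodyBytes: number): UploadCheck {
  if (bodyBytes > MAX_UPLOAD_BYTES) {
    return {
      status: 413,
      error:
        "Document upload exceeds the 32 MB limit. Split large files into smaller parts before uploading.",
    };
  }
  return { status: 200 };
}

// Minimum number of parts a user needs when splitting a file manually.
function partsNeeded(fileBytes: number): number {
  return Math.max(1, Math.ceil(fileBytes / MAX_UPLOAD_BYTES));
}
```

For example, an 80 MiB file needs at least three parts under this limit.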

Triggers

F-09: Trigger execution during agent restart

| Field | Detail |
| --- | --- |
| Status | Fixed (PR #826) |
| Symptoms | A trigger fires during an agent restart, but the resulting action is lost. The trigger is consumed but the side effect never occurs. |
| Root cause | `dispatchInstruction()` throws when `AutonomyService` is unavailable during restart. The execution was recorded as an error run with `runCount` incremented, consuming `once` triggers or triggers at `maxRuns` despite never actually executing. |
| Current mitigation | `executeTriggerTask()` checks `isAutonomyServiceAvailable()` before dispatching. Scheduler-sourced triggers return "skipped" without incrementing `runCount` when the service is unavailable, allowing retry on the next scheduler cycle. Manual triggers bypass the guard since the user explicitly requested execution. |
| Gap / Risk | If the autonomy service remains unavailable for an extended period, scheduled triggers will keep skipping. No backlog replay mechanism exists, but triggers will execute on the next cycle once the service recovers. |
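
The guard described in F-09 can be modeled in a few lines. This is a simplified sketch, not the real `executeTriggerTask()`: the `runTrigger` name, the types, and the choice to count a failed manual run are all assumptions made for illustration.

```typescript
type TriggerSource = "scheduler" | "manual";
type RunOutcome = "executed" | "skipped" | "error";

// Hypothetical, reduced trigger state.
interface TriggerState {
  runCount: number;
}

function runTrigger(
  trigger: TriggerState,
  source: TriggerSource,
  autonomyAvailable: boolean,
  dispatch: () => void,
): RunOutcome {
  // Scheduler-sourced triggers are skipped without consuming a run,
  // so they can fire again on the next scheduler cycle.
  if (source === "scheduler" && !autonomyAvailable) {
    return "skipped";
  }
  // Manual triggers bypass the guard. Assumption: every dispatched run counts.
  trigger.runCount += 1;
  try {
    dispatch();
    return "executed";
  } catch {
    return "error";
  }
}
```

The key property is that a skipped run leaves `runCount` untouched, so a `once` trigger is not consumed while the service is down.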

F-10: dedupeKey assumes deterministic generation

| Field | Detail |
| --- | --- |
| Status | Fixed (PR #811) |
| Symptoms | Duplicate triggers are created for what should be a single logical trigger. |
| Root cause | `buildTriggerConfig()` could produce different `dedupeKey` values for triggers that are semantically identical. The deduplication mechanism assumed that key generation was fully deterministic for equivalent inputs, but this was not guaranteed. |
| Current mitigation | Deterministic deduplication key generation. |
| Gap / Risk | No user-level deduplication control. Users cannot manually specify or override deduplication keys, so any equivalence the generated key does not capture still produces redundant triggers that fire multiple times for the same event. |
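
A common source of non-deterministic keys is object property order. The sketch below shows one way to make a key deterministic by serializing with sorted keys. It is a generic technique offered as an illustration, not a description of how `buildTriggerConfig()` derives `dedupeKey`; `canonicalize` and `buildDedupeKey` are hypothetical names.

```typescript
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Serialize with object keys sorted, so property order never changes the result.
// Array order is preserved, because it is usually meaningful.
function canonicalize(value: Json): string {
  if (value === null || typeof value !== "object") {
    return JSON.stringify(value);
  }
  if (Array.isArray(value)) {
    return `[${value.map((item) => canonicalize(item)).join(",")}]`;
  }
  const keys = Object.keys(value).sort();
  const entries = keys.map((k) => `${JSON.stringify(k)}:${canonicalize(value[k] as Json)}`);
  return `{${entries.join(",")}}`;
}

// Hypothetical key builder: equivalent configs always map to the same string.
function buildDedupeKey(config: { [key: string]: Json }): string {
  return canonicalize(config);
}
```

In practice the canonical string would likely be hashed to keep keys short; that step is omitted here.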

Coding Agents / PTY

F-11: Legacy coordinator wiring exhaustion

| Field | Detail |
| --- | --- |
| Status | Removed with legacy PTY/coordinator path |
| Symptoms | Coding agents do not respond to commands. No error is shown in the UI. |
| Root cause | The removed PTY coordinator path could fail to wire its bridge after repeated attempts. |
| Current mitigation | Task agents now route through ACP only; there is no separate coordinator bridge to wire. |
| Gap / Risk | Historical record only. |

F-12: Deferred task delivery race

| Field | Detail |
| --- | --- |
| Status | Fixed (plugin-agent-orchestrator 0.3.4, PR #7; eliza PR #817) |
| Symptoms | The first task sent to a coding agent is not received. Subsequent tasks work normally. |
| Root cause | The listener must be attached before `pushDefaultRules` executes, which includes a 1500 ms sleep. If the listener attachment is delayed (e.g., under heavy system load), the task delivery window is missed. |
| Current mitigation | Fixed ordering ensures the listener is attached before `pushDefaultRules` is called. A 30-second timeout fallback forces task delivery if `session_ready` is never received (covers edge cases where the ready detection pattern doesn't match a CLI update). |
| Gap / Risk | Timeout fallback is a last resort; if the CLI prompt changes significantly, the agent may not be fully ready when the task is force-delivered after 30 s. |

F-13: Stall classification cascade

| Field | Detail |
| --- | --- |
| Status | Fixed (PR #795) |
| Symptoms | Stall responses are delayed across all coding agent sessions. A single stalled session blocks classification for others. |
| Root cause | The stall classification queue is serialized. Each classification requires an LLM call, and a slow LLM response blocked all subsequent classifications in the queue. |
| Current mitigation | A 15 s timeout guard on stall classification LLM calls prevents cascade blocking. |
| Gap / Risk | Before the fix there was no timeout on the LLM classification call, so a single slow or hung request could cascade into multi-minute delays for all active sessions. The timeout bounds the delay per call at 15 s; no parallel classification or circuit breaker is in place. |
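
A timeout guard of the kind described in F-13 can be written as a small wrapper around the classification promise. This is a generic sketch: `withTimeout` is a hypothetical helper, and what the real code returns on timeout is not stated in the source, so the caller supplies the fallback.

```typescript
// Resolve with the work's result, or with `onTimeout()` if the work takes
// longer than `timeoutMs`. Rejections from the work are passed through.
function withTimeout<T>(
  work: Promise<T>,
  timeoutMs: number,
  onTimeout: () => T,
): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => resolve(onTimeout()), timeoutMs);
    work.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (err) => {
        clearTimeout(timer);
        reject(err);
      },
    );
  });
}
```

In a serialized queue, wrapping each LLM call this way caps how long one hung request can hold up the sessions behind it. Note that the underlying request is not cancelled; the wrapper only stops waiting for it.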

F-14: WebSocket reconnect exhaustion

| Field | Detail |
| --- | --- |
| Status | Fixed (PR #795, #811) |
| Symptoms | The terminal view in the UI goes dead. No further output is displayed and input is not accepted. No retry option is presented. |
| Root cause | The WebSocket connection to the PTY backend has a maximum of 15 reconnect attempts. Once exhausted, the connection is permanently dropped. |
| Current mitigation | `ConnectionFailedBanner` shows a retry UI when WebSocket reconnection exhausts its attempts. |
| Gap / Risk | Before the fix there was no banner or button to retry after exhaustion and no visual indication that the connection was lost; the user had to refresh the entire page. Recovery after exhaustion is still manual, via the banner. |
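
The reconnect budget and the retry banner in F-14 can be modeled as a small state machine. The sketch is illustrative: the state shape and function names are hypothetical, and only the limit of 15 attempts comes from this document.

```typescript
const MAX_RECONNECT_ATTEMPTS = 15; // limit stated in F-14

type ConnStatus = "connected" | "reconnecting" | "failed";

interface ConnState {
  status: ConnStatus;
  attempts: number; // reconnect attempts since the last successful open
}

function onSocketOpen(): ConnState {
  return { status: "connected", attempts: 0 };
}

// Each close schedules another attempt until the budget is spent.
function onSocketClose(state: ConnState): ConnState {
  if (state.attempts >= MAX_RECONNECT_ATTEMPTS) {
    return { status: "failed", attempts: state.attempts };
  }
  return { status: "reconnecting", attempts: state.attempts + 1 };
}

// A manual retry from the banner starts a fresh budget with its first attempt.
function onManualRetry(state: ConnState): ConnState {
  return state.status === "failed" ? { status: "reconnecting", attempts: 1 } : state;
}

// The banner is shown only once automatic reconnection has given up.
function shouldShowRetryBanner(state: ConnState): boolean {
  return state.status === "failed";
}
```

Keeping `failed` as an explicit state is what lets the UI distinguish a connection that is still retrying from one that has given up.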

Training

F-15: Backend availability not validated

| Field | Detail |
| --- | --- |
| Status | Fixed (PR #811) |
| Symptoms | A training job is submitted and starts, but fails immediately or partway through execution. |
| Root cause | The system did not check for MLX or CUDA backend availability before accepting a training job submission. The job was dispatched to a backend that might not exist or might not have sufficient resources. |
| Current mitigation | Pre-submission backend availability validation. |
| Gap / Risk | Before the fix there was no pre-submission validation. Users wasted time waiting for a job that was doomed to fail, and the error message from the failed job might not clearly indicate that the required backend was unavailable. |

General / UI

F-16: No React error boundary

| Field | Detail |
| --- | --- |
| Status | Fixed (PR #795) |
| Symptoms | A white screen appears in the browser. The entire UI is unresponsive. |
| Root cause | No `ErrorBoundary` component wrapped the top-level routes. An unhandled exception in any React component propagated to the root and unmounted the entire application. |
| Current mitigation | A React `ErrorBoundary` wrapping `ViewRouter` catches render crashes with a fallback UI. |
| Gap / Risk | Before the fix, a single component error crashed the entire UI with no fallback UI, no error message, and no way to recover without a page refresh; errors in rarely used components could take down the entire dashboard. Error boundaries catch errors thrown during rendering only, so exceptions in event handlers or asynchronous code are not covered by the boundary. |

F-17: Hooks system untested

| Field | Detail |
| --- | --- |
| Status | Fixed (PR #813) |
| Symptoms | Unknown. Failures in the hooks system may go undetected. |
| Root cause | The hooks discovery and loader modules had zero test coverage. |
| Current mitigation | Registry tested in `registry.test.ts`; eligibility tested in `hooks.test.ts`; discovery and loader tested in PR #813. All 5 hooks source modules now have unit coverage. |
| Gap / Risk | No integration-level coverage (full filesystem + real imports). Existing tests are unit tests with mocked I/O. |