plugins/plugin-vision/README.md
A powerful visual perception plugin for ElizaOS that provides agents with real-time camera integration and scene analysis capabilities. This plugin enables agents to "see" their environment, describe scenes, detect people and objects, and make decisions based on visual input.
- runtime.useModel(IMAGE_DESCRIPTION) for scene understanding.
- RapidOCR on onnxruntime-node replaces Tesseract.js as the default OCR backend: ~80 MB total, sub-second on CPU, multi-language. Models are fetched on first use to $ELIZA_STATE_DIR/models/rapidocr/. Fallback chain: RapidOCR → Apple Vision (iOS/macOS, owned by plugin-ios) → Tesseract.js (last resort).
- YOLO on onnxruntime-node replaces COCO-SSD. It is class-filterable: a dedicated PersonDetector uses the same model with a person-only filter.
- Face recognition: face-api.js remains the default.
- VisionServiceLifecycleManager ties each sub-service (YOLO / OCR / face / pose) to the WS1 memory arbiter (when registered) via the IModelArbiter contract. An idle watchdog releases sub-services after idleUnloadMs (default 60s). On memory-pressure events the coldest holders are released first.
- VisionService routes every scene-describe call through runtime.useModel(IMAGE_DESCRIPTION). Locally, eliza-1 (Qwen3.5-VL) registers that slot via plugin-local-inference; otherwise the runtime rotates to whichever cloud/remote provider has registered IMAGE_DESCRIPTION. plugin-vision no longer ships its own VLM; Florence-2 has been removed.
- enableCamera(), disableCamera(), enableScreen(displayIds?), disableScreen() on the service, plus matching enable_camera / disable_camera / enable_screen / disable_screen ops on the action surface.
- Capacitor camera source (src/mobile/capacitor-camera.ts). plugin-aosp (WS8) and plugin-ios (WS9) wire native sides on top of this.

- macOS: set executionProviders: ['coreml','cpu'] to opt into CoreML acceleration. Apple Vision is the preferred OCR backend on darwin once WS9 lands.
- Windows: set executionProviders: ['dml','cpu'].
- Mobile: onnxruntime-node is not the right backend on mobile. plugin-ios (WS9) bridges to CoreML / Apple Vision via Swift; plugin-aosp (WS8) bridges to NNAPI / ML Kit via JNI. Both register a MobileCameraSource so the runtime API stays platform-agnostic.

npm install @elizaos/plugin-vision
# or
cd plugins/plugin-vision
bun install
bun run build
The plugin requires platform-specific camera tools:
- macOS: brew install imagesnap
- Linux: sudo apt-get install fswebcam

# Camera selection (partial name match, case-insensitive)
CAMERA_NAME=obsbot
# Pixel change threshold (percentage, default: 50)
PIXEL_CHANGE_THRESHOLD=30
# Enable advanced computer vision features (default: false)
ENABLE_OBJECT_DETECTION=true
ENABLE_POSE_DETECTION=true
ENABLE_FACE_RECOGNITION=false
# Vision mode: OFF, CAMERA, SCREEN, BOTH
VISION_MODE=CAMERA
# Update intervals (milliseconds)
TF_UPDATE_INTERVAL=1000
VLM_UPDATE_INTERVAL=10000
# Screen capture settings
SCREEN_CAPTURE_INTERVAL=2000
OCR_ENABLED=true
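
The settings above arrive as strings, so it can help to see them normalized into typed values. The following is a minimal sketch, not the plugin's actual parsing code: `parseVisionSettings`, `VisionSettings`, and the `CAMERA` fallback for an unset or unrecognized `VISION_MODE` are assumptions made for illustration. Only the defaults stated in the comments above (threshold 50, detection flags false) are taken from this README.

```typescript
// Hypothetical helper, not part of plugin-vision's API.
type VisionMode = "OFF" | "CAMERA" | "SCREEN" | "BOTH";

interface VisionSettings {
  cameraName?: string;
  pixelChangeThreshold: number;
  enableObjectDetection: boolean;
  enablePoseDetection: boolean;
  visionMode: VisionMode;
}

const VISION_MODES: readonly VisionMode[] = ["OFF", "CAMERA", "SCREEN", "BOTH"];

// Treats only the string "true" (any case) as true; unset or empty uses the fallback.
function parseBool(value: string | undefined, fallback: boolean): boolean {
  if (value === undefined || value.trim() === "") return fallback;
  return value.trim().toLowerCase() === "true";
}

// Unset, empty, or non-numeric values use the fallback.
function parseNumber(value: string | undefined, fallback: number): number {
  if (value === undefined || value.trim() === "") return fallback;
  const parsed = Number(value);
  return Number.isFinite(parsed) ? parsed : fallback;
}

function parseVisionSettings(
  env: Record<string, string | undefined>,
): VisionSettings {
  const mode = (env.VISION_MODE ?? "").trim().toUpperCase();
  return {
    cameraName: env.CAMERA_NAME,
    // Documented default: 50 (percent).
    pixelChangeThreshold: parseNumber(env.PIXEL_CHANGE_THRESHOLD, 50),
    // Documented default: false.
    enableObjectDetection: parseBool(env.ENABLE_OBJECT_DETECTION, false),
    enablePoseDetection: parseBool(env.ENABLE_POSE_DETECTION, false),
    // Assumption: fall back to CAMERA when the value is unset or unrecognized.
    visionMode: VISION_MODES.includes(mode as VisionMode)
      ? (mode as VisionMode)
      : "CAMERA",
  };
}

// Example input mirroring the values shown above.
const exampleSettings = parseVisionSettings({
  CAMERA_NAME: "obsbot",
  PIXEL_CHANGE_THRESHOLD: "30",
  ENABLE_OBJECT_DETECTION: "true",
  VISION_MODE: "CAMERA",
});
```

Taking a plain record instead of reading `process.env` directly keeps the helper usable with character `settings` as well as environment variables.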
{
"name": "VisionAgent",
"plugins": ["@elizaos/plugin-vision"],
"settings": {
"CAMERA_NAME": "obsbot",
"PIXEL_CHANGE_THRESHOLD": "30",
"ENABLE_OBJECT_DETECTION": "true",
"ENABLE_POSE_DETECTION": "true"
}
}
Analyzes the current visual scene and provides a detailed description.
Similes: ANALYZE_SCENE, WHAT_DO_YOU_SEE, VISION_CHECK, LOOK_AROUND
Example:
User: "What do you see?"
Agent: "Looking through the camera, I see a home office setup with a person sitting at a desk. There are 2 monitors, a keyboard, and various desk accessories. I detected 5 objects total: 1 person, 2 monitors, 1 keyboard, and 1 chair."
Captures the current frame and returns it as a base64 image attachment.
Similes: TAKE_PHOTO, SCREENSHOT, CAPTURE_FRAME, TAKE_PICTURE
Example:
User: "Take a photo"
Agent: "I've captured an image from the camera." [Image attached]
Changes the vision mode (OFF, CAMERA, SCREEN, or BOTH).
Similes: CHANGE_VISION_MODE, SET_VISION, TOGGLE_VISION
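
One way to think about the four modes is as a pair of camera and screen toggles. The sketch below maps each mode to the enable_camera / disable_camera / enable_screen / disable_screen ops named in this README. `opsForMode` and the ordering of the returned ops are assumptions for illustration; the plugin's own action handler may work differently.

```typescript
// Hypothetical mapping, not the plugin's actual handler.
type VisionMode = "OFF" | "CAMERA" | "SCREEN" | "BOTH";

type VisionOp =
  | "enable_camera"
  | "disable_camera"
  | "enable_screen"
  | "disable_screen";

// Returns the camera op followed by the screen op for the requested mode.
function opsForMode(mode: VisionMode): [VisionOp, VisionOp] {
  const wantCamera = mode === "CAMERA" || mode === "BOTH";
  const wantScreen = mode === "SCREEN" || mode === "BOTH";
  return [
    wantCamera ? "enable_camera" : "disable_camera",
    wantScreen ? "enable_screen" : "disable_screen",
  ];
}

// Example: SCREEN mode turns the camera off and the screen capture on.
const screenOps = opsForMode("SCREEN");
```

On the service side, the same pair would correspond to calling enableCamera() or disableCamera(), then enableScreen(displayIds?) or disableScreen().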
Assigns a name to a detected entity for tracking.
Similes: LABEL_ENTITY, NAME_OBJECT, IDENTIFY_ENTITY
Identifies a person using face recognition (requires face recognition to be enabled).
Similes: RECOGNIZE_PERSON, IDENTIFY_FACE
Starts tracking an entity with a persistent ID.
Similes: START_TRACKING, FOLLOW_ENTITY
Stops the autonomous agent loop (useful for debugging with autonomy plugin).
Similes: STOP_AUTONOMOUS, HALT_AUTONOMOUS, KILL_AUTO_LOOP
The vision provider is non-dynamic (always active) and provides:
{
visionAvailable: boolean,
sceneDescription: string,
cameraStatus: string,
cameraId?: string,
peopleCount?: number,
objectCount?: number,
sceneAge?: number,
lastChange?: number
}
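
As a rough illustration of consuming this shape, the sketch below types the fields listed above and turns them into a one-line summary. The interface name `VisionProviderState` and the helper `summarizeVisionState` are hypothetical; only the field names and types come from this README.

```typescript
// Field names and types follow the provider shape above; the interface name is assumed.
interface VisionProviderState {
  visionAvailable: boolean;
  sceneDescription: string;
  cameraStatus: string;
  cameraId?: string;
  peopleCount?: number;
  objectCount?: number;
  sceneAge?: number;
  lastChange?: number;
}

// Hypothetical helper: builds a short summary, skipping optional fields that are absent.
function summarizeVisionState(state: VisionProviderState): string {
  if (!state.visionAvailable) {
    return `Vision unavailable (camera status: ${state.cameraStatus}).`;
  }
  const parts: string[] = [state.sceneDescription];
  if (state.peopleCount !== undefined) parts.push(`People: ${state.peopleCount}`);
  if (state.objectCount !== undefined) parts.push(`Objects: ${state.objectCount}`);
  return parts.join(" | ");
}

// Example with made-up values.
const exampleSummary = summarizeVisionState({
  visionAvailable: true,
  sceneDescription: "A home office with a person at a desk.",
  cameraStatus: "connected",
  peopleCount: 1,
  objectCount: 5,
});
```

Checking optional fields against `undefined` rather than truthiness keeps a count of 0 in the summary.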
Enable with ENABLE_OBJECT_DETECTION=true and/or ENABLE_POSE_DETECTION=true
Example autonomous behavior:
// Agent autonomously monitors environment
"I notice someone just entered the room.";
"The lighting has changed significantly.";
"A new object has appeared on the desk.";
plugin-vision/
├── README.md # This file
├── package.json # TypeScript package config
├── src/ # TypeScript implementation (primary)
│ ├── index.ts # Plugin entry point
│ ├── service.ts # Vision service
│ ├── provider.ts # Vision provider
│ ├── action.ts # All actions
│ ├── entity-tracker.ts # Entity tracking
│ ├── screen-capture.ts # Screen capture
│ ├── ocr-service.ts # OCR service
│ ├── face-recognition.ts # Face recognition
│ ├── vision-worker-manager.ts # Worker management
│ └── tests/ # E2E tests
# Run E2E tests
cd plugins/plugin-vision
npx vitest
# Run local E2E tests
bun run test:e2e:local
Contributions are welcome! Please see the main ElizaOS repository for contribution guidelines.
MIT
For issues and feature requests, please use the GitHub issue tracker.