Back to Onyx

Overview of Onyx Background Jobs

backend/onyx/background/README.md

3.3.57.2 KB
Original Source

Overview of Onyx Background Jobs

The background jobs take care of:

  1. Pulling/Indexing documents (from connectors)
  2. Updating document metadata (from connectors)
  3. Cleaning up checkpoints and logic around indexing work (indexing indexing checkpoints and index attempt metadata)
  4. Handling user uploaded files and deletions (from the Projects feature and uploads via the Chat)
  5. Reporting metrics on things like queue length for monitoring purposes

Worker → Queue Mapping

WorkerFileQueues
Primaryapps/primary.pycelery
Lightapps/light.pyvespa_metadata_sync, connector_deletion, doc_permissions_upsert, checkpoint_cleanup, index_attempt_cleanup
Heavyapps/heavy.pyconnector_pruning, connector_doc_permissions_sync, connector_external_group_sync, csv_generation, sandbox
Docprocessingapps/docprocessing.pydocprocessing
Docfetchingapps/docfetching.pyconnector_doc_fetching
User File Processingapps/user_file_processing.pyuser_file_processing, user_file_project_sync, user_file_delete
Monitoringapps/monitoring.pymonitoring
Background (consolidated)apps/background.pyAll queues above except celery

Non-Worker Apps

AppFilePurpose
Beatbeat.pyCelery beat scheduler with DynamicTenantScheduler that generates per-tenant periodic task schedules
Clientclient.pyMinimal app for task submission from non-worker processes (e.g., API server)

Shared Module

app_base.py provides:

  • TenantAwareTask - Base task class that sets tenant context
  • Signal handlers for logging, cleanup, and lifecycle events
  • Readiness probes and health checks

Worker Details

Primary (Coordinator and task dispatcher)

It is the single worker which handles tasks from the default celery queue. It is a singleton worker ensured by the PRIMARY_WORKER Redis lock which it touches every CELERY_PRIMARY_WORKER_LOCK_TIMEOUT / 8 seconds (using Celery Bootsteps)

On startup:

  • waits for redis, postgres, document index to all be healthy
  • acquires the singleton lock
  • cleans all the redis states associated with background jobs
  • mark orphaned index attempts failed

Then it cycles through its tasks as scheduled by Celery Beat:

TaskFrequencyDescription
check_for_indexing15sScans for connectors needing indexing → dispatches to DOCFETCHING queue
check_for_vespa_sync_task20sFinds stale documents/document sets → dispatches sync tasks to VESPA_METADATA_SYNC queue
check_for_pruning20sFinds connectors due for pruning → dispatches to CONNECTOR_PRUNING queue
check_for_connector_deletion20sProcesses deletion requests → dispatches to CONNECTOR_DELETION queue
check_for_user_file_processing20sChecks for user uploads → dispatches to USER_FILE_PROCESSING queue
check_for_checkpoint_cleanup1hCleans up old indexing checkpoints
check_for_index_attempt_cleanup30mCleans up old index attempts
celery_beat_heartbeat1mHeartbeat for Beat watchdog

Watchdog is a separate Python process managed by supervisord which runs alongside celery workers. It checks the ONYX_CELERY_BEAT_HEARTBEAT_KEY in Redis to ensure Celery Beat is not dead. Beat schedules the celery_beat_heartbeat for Primary to touch the key and share that it's still alive. See supervisord.conf for watchdog config.

Light

Fast and short living tasks that are not resource intensive. High concurrency: Can have 24 concurrent workers, each with a prefetch of 8 for a total of 192 tasks in flight at once.

Tasks it handles:

  • Syncs access/permissions, document sets, boosts, hidden state
  • Deletes documents that are marked for deletion in Postgres
  • Cleanup of checkpoints and index attempts

Heavy

Long running, resource intensive tasks, handles pruning and sandbox operations. Low concurrency - max concurrency of 4 with 1 prefetch.

Does not interact with the Document Index, it handles the syncs with external systems. Large volume API calls to handle pruning and fetching permissions, etc.

Generates CSV exports which may take a long time with significant data in Postgres.

Sandbox (new feature) for running Next.js, Python virtual env, OpenCode AI Agent, and access to knowledge files

Docprocessing, Docfetching, User File Processing

Docprocessing and Docfetching are for indexing documents:

  • Docfetching runs connectors to pull documents from external APIs (Google Drive, Confluence, etc.), stores batches to file storage, and dispatches docprocessing tasks
  • Docprocessing retrieves batches, runs the indexing pipeline (chunking, embedding), and indexes into the Document Index
  • User Files come from uploads directly via the input bar

Monitoring

Observability and metrics collections:

  • Queue lengths, connector success/failure, connector latencies
  • Memory of supervisor managed processes (workers, beat, slack)
  • Cloud and multitenant specific monitorings

Prometheus Metrics

Workers can expose Prometheus metrics via a standalone HTTP server. Currently docfetching and docprocessing have push-based task lifecycle metrics; the monitoring worker runs pull-based collectors for queue depth and connector health.

For the full metric reference, integration guide, and PromQL examples, see docs/METRICS.md.