Back to Onyx

User Library Sync to Sandboxes

docs/craft/features/user-library-sync.md

4.0.0-persistent-indexing27.0 KB
Original Source

User Library Sync to Sandboxes

How user-uploaded files (PDFs, spreadsheets, slides) get from the user's library into the agent's sandbox. Replaces the symbiotic-S3 sync that was removed in PR #11042.

1. Architecture

Three layers, mirroring the skills pipeline:

HTTP layer        backend/onyx/server/features/build/api/user_library.py
  │ thin: validate input, call db helpers, return response
  ▼
DB layer          backend/onyx/server/features/build/db/user_library.py
  │ owns: file_store I/O, Document upserts, ownership checks, quota
  ▼
Sandbox sync      backend/onyx/server/features/build/sandbox/user_library.py
                  builds FileSet from DB + file_store, pushes via push daemon

Files are stored in the default file store (get_default_file_store()) — same backend skills use. There is no Craft-specific S3 bucket for user files.

2. Storage model

FieldWherePurpose
File contentsfile_store (S3/local)Raw bytes
Document.idCRAFT_FILE__{user_id}__{sha256(path)[:16]}Deterministic, ownership encoded in prefix
Document.linkfile_store file_idPointer for retrieval
Document.file_idfile_store file_id (mirror)Schema-level pointer
Document.doc_metadata{file_path, file_size, mime_type, is_directory, sync_disabled, file_store_id}Display + filtering
Connectorshared "User Library" connector, DocumentSource.CRAFT_FILEReuses indexing infrastructure
Credentialone per user, empty credential_json={}Pairs with connector for ownership

FileOrigin.USER_FILE is the file-store origin tag.

3. Sync pipeline

Mount path

/workspace/managed/user_library/ in the sandbox pod. Same atomic symlink swap as skills — the agent sees a stable path while the daemon swaps versioned directories underneath.

Trigger points

TriggerCalls
Session creation (cold start)hydrate_user_library(sandbox_id, user_id, db_session)
Existing session resumehydrate_user_library(...) (same — re-hydrate to get latest state)
File upload (single or zip)sync_user_library_to_active_sandboxes(user_id, db_session)
File deletesync_user_library_to_active_sandboxes(...)
Toggle sync_disabledsync_user_library_to_active_sandboxes(...)

All synchronous — no Celery, matching the skills pattern.

Data flow

build_user_library_fileset(user_id, db_session)
    1. list_user_files() → CRAFT_FILE Document rows for user
    2. Filter: skip is_directory=True, skip sync_disabled=True
    3. For each remaining doc: file_store.read_file(doc.link)
    4. Strip leading "/" from file_path (push daemon rejects absolute paths)
    5. Return FileSet { relative_path: bytes }
              │
              ▼
sandbox_manager.push_to_sandbox / push_to_sandboxes
    mount_path="/workspace/managed/user_library"
    → tar.gz, Ed25519-sign, POST to push daemon, atomic swap

Empty filesets push too (clears stale files via the swap).

4. DB layer API

backend/onyx/server/features/build/db/user_library.py:

FunctionPurpose
get_or_create_craft_connector(db, user)Returns (connector_id, credential_id). Idempotent.
get_user_storage_bytes(db, user_id)SQL aggregation for quota
build_document_id(user_id, path)Deterministic doc_id
list_user_files(db, user_id)All CRAFT_FILE docs for the user
fetch_user_file_for_user(db, doc_id, user_id)Lookup + ownership check; raises OnyxError(NOT_FOUND) on miss
store_user_file(db, ..., file_path, content, mime_type)Save to file store + upsert Document. Returns (doc_id, file_id, old_blob_id_to_delete). The new blob is saved first; the caller must pass old_blob_id_to_delete to cleanup_old_blobs after their final commit.
cleanup_old_blobs(blob_ids)Delete superseded blobs. Must be called after the final DB commit — if called before and the commit fails, the document rolls back to point at the now-deleted blob.
create_directory_record(db, user_id, connector_id, credential_id, dir_path)Virtual directory document (no file store object)
set_sync_disabled(db, user_id, doc, sync_disabled)Toggle file or directory (recursive into children)
delete_user_file(db, doc)Delete blob from file store + Document row

The HTTP layer (api/user_library.py) calls these directly — no business logic in the endpoints.

5. Ownership

Encoded in the document_id prefix: CRAFT_FILE__{user_id}__{hash}. fetch_user_file_for_user checks the prefix matches the calling user before any DB lookup. A foreign user's doc_id returns NOT_FOUND (not 403), so existence is never leaked across users.

6. Files

New:

  • backend/onyx/server/features/build/sandbox/user_library.py — sync module
  • backend/tests/external_dependency_unit/craft/test_user_library_sync.py — ext-dep tests

Modified:

  • backend/onyx/server/features/build/db/user_library.py — added all the CRUD/storage helpers
  • backend/onyx/server/features/build/api/user_library.py — thinned to call db helpers; PersistentDocumentWriter usage removed; HTTPExceptionOnyxError
  • backend/onyx/server/features/build/session/manager.py_hydrate_user_library called in both create_session__no_commit and get_or_create_empty_session
  • backend/onyx/skills/push.py — per-failure logging (consistent with user_library push logging)
  • backend/tests/external_dependency_unit/craft/conftest.pySandboxHandle.provision_for now returns (Sandbox, Path) tuple, eliminating duplicated _provision_with_status helpers

Deleted:

  • backend/onyx/server/features/build/indexing/persistent_document_writer.py
  • backend/tests/external_dependency_unit/craft/test_persistent_document_writer.py

7. Tests

Tests follow the layered paradigm in docs/craft/tests/coverage-and-overview.md.

Ext-dep (test_user_library_sync.py)

Real Postgres + real LocalSandboxManager on tmp_path. Seeds via the same DB helpers production uses (no DocumentMetadata construction in test code).

  • test_hydrate_pushes_files_to_sandbox — happy path
  • test_sync_disabled_files_excluded — filter contract via set_sync_disabled
  • test_directories_excluded_from_fileset — directory exclusion via create_directory_record
  • test_sync_after_delete_removes_file — atomic swap cleanup after delete_user_file

Integration (test_user_library_api.py, pre-existing)

HTTP-level coverage including cross-user 404, toggle, delete, upload caps. Already on the test-overhaul branch.

K8s

No new K8s tests — push daemon contract (signed tarball, atomic swap) is already covered by existing K8s tests. User library sync reuses write_files_to_sandbox() unchanged.

8. Not in scope

  • Per-session file selection (all of the user's non-disabled files go to all of their active sandboxes)
  • Streaming for large files (office docs are typically MB-scale; 100 MiB bundle limit is fine)
  • New DB tables or migrations (reuses Document + doc_metadata)
  • Changes to the push daemon or extract logic