The Onyx file store provides a unified interface for storing files and large binary objects. It supports three storage backends: S3-compatible storage (AWS S3, MinIO, Digital Ocean Spaces, etc.), Google Cloud Storage (GCS), and PostgreSQL Large Objects.
The file store uses a single database table (file_record) to store file metadata while the actual file content is stored in the configured storage backend. This approach provides scalability, cost-effectiveness, and decouples file storage from the database.
The file_record table contains the following columns:
- file_id (primary key): Unique identifier for the file
- display_name: Human-readable name for the file
- file_origin: Origin/source of the file (enum)
- file_type: MIME type of the file
- file_metadata: Additional metadata as JSON
- bucket_name: External storage bucket/container name
- object_key: External storage object key/path
- created_at: Timestamp when the file was created
- updated_at: Timestamp when the file was last updated

The backend is selected via the FILE_STORE_BACKEND environment variable:

| Value | Backend | Description |
|---|---|---|
| s3 (default) | S3-compatible | AWS S3, MinIO, Digital Ocean Spaces, etc. |
| gcs | Google Cloud Storage | Native GCS with ADC/Workload Identity support |
| postgres | PostgreSQL Large Objects | No external storage service required |

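As a rough illustration of how a single environment variable can select the backend, the sketch below maps each value from the table to a placeholder class. The stub classes and the `select_backend` helper are hypothetical stand-ins, not the actual Onyx code; only the variable name, its three values, and the `s3` default come from the table above.

```python
import os
from typing import Mapping


# Placeholder classes standing in for the real backends (assumed for illustration).
class S3Stub: ...
class GCSStub: ...
class PostgresStub: ...


BACKENDS = {"s3": S3Stub, "gcs": GCSStub, "postgres": PostgresStub}


def select_backend(env: Mapping[str, str] = os.environ) -> type:
    """Return the stub class for FILE_STORE_BACKEND, defaulting to 's3'."""
    value = env.get("FILE_STORE_BACKEND", "s3")
    try:
        return BACKENDS[value]
    except KeyError:
        # Fail loudly on typos instead of silently falling back to a default.
        raise ValueError(
            f"Unsupported FILE_STORE_BACKEND {value!r}; "
            f"expected one of {sorted(BACKENDS)}"
        ) from None
```

Rejecting unknown values is a choice made for this sketch; it keeps a misspelled backend name from quietly selecting a different storage location.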
AWS S3:

```bash
FILE_STORE_BACKEND=s3
S3_FILE_STORE_BUCKET_NAME=your-bucket-name  # Defaults to 'onyx-file-store-bucket'
S3_FILE_STORE_PREFIX=onyx-files  # Optional, defaults to 'onyx-files'

# AWS credentials (use one of these methods):
# 1. Environment variables
S3_AWS_ACCESS_KEY_ID=your-access-key
S3_AWS_SECRET_ACCESS_KEY=your-secret-key
AWS_REGION_NAME=us-east-2  # Optional, defaults to 'us-east-2'

# 2. IAM roles (recommended for EC2/ECS deployments)
# No additional configuration needed if using IAM roles
```

MinIO:

```bash
FILE_STORE_BACKEND=s3
S3_FILE_STORE_BUCKET_NAME=your-bucket-name
S3_ENDPOINT_URL=http://localhost:9000  # MinIO endpoint
S3_AWS_ACCESS_KEY_ID=minioadmin
S3_AWS_SECRET_ACCESS_KEY=minioadmin
AWS_REGION_NAME=us-east-1  # Any region name
S3_VERIFY_SSL=false  # Optional, defaults to false
```

Digital Ocean Spaces:

```bash
FILE_STORE_BACKEND=s3
S3_FILE_STORE_BUCKET_NAME=your-space-name
S3_ENDPOINT_URL=https://nyc3.digitaloceanspaces.com
S3_AWS_ACCESS_KEY_ID=your-spaces-key
S3_AWS_SECRET_ACCESS_KEY=your-spaces-secret
AWS_REGION_NAME=nyc3
```

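To make the role of each S3 variable concrete, here is a minimal sketch that collects them into one settings dictionary. The `s3_client_settings` helper and its key names are hypothetical, and treating only the string `true` as enabling SSL verification is an assumption; the variable names and defaults are the ones documented in the S3 configurations above.

```python
from typing import Mapping


def s3_client_settings(env: Mapping[str, str]) -> dict[str, object]:
    """Collect S3 settings from the documented environment variables."""
    access_key = env.get("S3_AWS_ACCESS_KEY_ID")
    secret_key = env.get("S3_AWS_SECRET_ACCESS_KEY")
    return {
        "bucket_name": env.get("S3_FILE_STORE_BUCKET_NAME", "onyx-file-store-bucket"),
        "prefix": env.get("S3_FILE_STORE_PREFIX", "onyx-files"),
        "region_name": env.get("AWS_REGION_NAME", "us-east-2"),
        # None means no custom endpoint is set, i.e. plain AWS S3.
        "endpoint_url": env.get("S3_ENDPOINT_URL"),
        # Assumption: only the literal string "true" turns verification on.
        "verify_ssl": env.get("S3_VERIFY_SSL", "false").lower() == "true",
        # Static keys count only when both are present; otherwise the caller
        # is expected to rely on ambient credentials such as IAM roles.
        "use_static_credentials": bool(access_key and secret_key),
    }
```

With the MinIO values shown earlier, the result carries `http://localhost:9000` as the endpoint and static credentials enabled; with an empty environment it falls back to the documented defaults and leaves credential lookup to the platform.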
Google Cloud Storage:

```bash
FILE_STORE_BACKEND=gcs
GCS_FILE_STORE_BUCKET_NAME=your-bucket-name  # Required
GCS_FILE_STORE_PREFIX=onyx-files  # Optional, defaults to 'onyx-files'
GCS_PROJECT_ID=your-gcp-project  # Optional, auto-detected via ADC

# Authentication (use one of these methods, in priority order):
# 1. Workload Identity / ADC (recommended for GKE, Cloud Run, Compute Engine)
#    No additional configuration needed. Credentials are resolved automatically
#    from the environment: GKE Workload Identity, instance metadata server,
#    GOOGLE_APPLICATION_CREDENTIALS env var, or gcloud CLI.

# 2. Service account key file
GCS_SERVICE_ACCOUNT_KEY_PATH=/path/to/service-account-key.json

# 3. Inline service account JSON (for environments where file mounts are impractical)
GCS_SERVICE_ACCOUNT_KEY_JSON='{"type":"service_account","project_id":"...","private_key":"..."}'
```

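A small pre-flight check can catch a malformed inline key before the application starts. The sketch below reports which explicit service account settings are present and validates the inline JSON; the `explicit_gcs_credentials` helper is hypothetical and makes no claim about which method the real file store prefers when several are set. An empty result means nothing explicit is configured, leaving ADC/Workload Identity as the remaining option.

```python
import json
from typing import Mapping


def explicit_gcs_credentials(env: Mapping[str, str]) -> list[str]:
    """List the explicit service account settings found in the environment."""
    found: list[str] = []
    if env.get("GCS_SERVICE_ACCOUNT_KEY_PATH"):
        found.append("key_path")

    raw_json = env.get("GCS_SERVICE_ACCOUNT_KEY_JSON")
    if raw_json:
        # json.JSONDecodeError subclasses ValueError, so malformed JSON
        # surfaces as ValueError as well.
        info = json.loads(raw_json)
        if not isinstance(info, dict) or info.get("type") != "service_account":
            raise ValueError(
                "GCS_SERVICE_ACCOUNT_KEY_JSON is not a service account key"
            )
        found.append("key_json")
    return found
```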
Required IAM permissions:

On the GCS bucket (object operations + existence check):

- storage.objects.create, storage.objects.get, storage.objects.delete (CRUD operations)
- storage.buckets.get (for initialize() to check bucket existence)

At the project level (only if initialize() should auto-create the bucket):

- storage.buckets.create

The predefined role roles/storage.objectAdmin (granted on the bucket) covers all object operations. For initial bucket creation, roles/storage.admin at the project level is needed.

The file store works with any S3-compatible service. Configure:

- S3_FILE_STORE_BUCKET_NAME: Your bucket/container name
- S3_ENDPOINT_URL: The service endpoint URL
- S3_AWS_ACCESS_KEY_ID and S3_AWS_SECRET_ACCESS_KEY: Your credentials
- AWS_REGION_NAME: The region (any valid region name)

PostgreSQL Large Objects:

```bash
FILE_STORE_BACKEND=postgres
# No additional configuration needed: files are stored directly in PostgreSQL.
```

The system provides three implementations of the abstract FileStore interface:

- S3BackedFileStore (file_store.py): For S3-compatible storage (AWS S3, MinIO, etc.)
- GCSBackedFileStore (gcs_file_store.py): For Google Cloud Storage with native ADC support
- PostgresBackedFileStore (postgres_file_store.py): For PostgreSQL Large Objects

The factory function get_default_file_store() returns the appropriate implementation based on FILE_STORE_BACKEND. The database uses generic column names (bucket_name, object_key) to maintain compatibility across all backends.

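To picture how one metadata row can describe a file regardless of backend, the columns of file_record can be modelled as a plain dataclass. This is only a sketch: the class name `FileRecordSketch` and all Python types are assumptions, and the real table is a database model rather than a dataclass.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Optional


def _now() -> datetime:
    return datetime.now(timezone.utc)


@dataclass
class FileRecordSketch:
    file_id: str                    # primary key
    display_name: str
    file_origin: str                # an enum in the real table; a string here
    file_type: str                  # MIME type
    bucket_name: str                # generic: bucket or container name
    object_key: str                 # generic: key or path within that bucket
    file_metadata: Optional[dict[str, Any]] = None
    created_at: datetime = field(default_factory=_now)
    updated_at: datetime = field(default_factory=_now)
```

Because bucket_name and object_key carry no backend-specific meaning, the same record shape works whether the bytes live in S3, GCS, or elsewhere.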
The FileStore abstract base class defines the following methods:

- initialize(): Initialize the storage backend (create bucket if needed)
- has_file(file_id, file_origin, file_type): Check if a file exists
- save_file(content, display_name, file_origin, file_type, file_metadata, file_id): Save a file
- read_file(file_id, mode, use_tempfile): Read file content
- read_file_record(file_id): Get file metadata from database
- get_file_size(file_id): Get file size in bytes
- delete_file(file_id): Delete a file and its metadata
- get_file_with_mime_type(file_id): Get file with parsed MIME type
- change_file_id(old_file_id, new_file_id): Rename a file
- list_files_by_prefix(prefix): List files matching a prefix

```python
from onyx.file_store.file_store import get_default_file_store
from onyx.configs.constants import FileOrigin

# Get the configured file store
file_store = get_default_file_store()

# Initialize the storage backend (creates bucket if needed)
file_store.initialize()

# Save a file
with open("example.pdf", "rb") as f:
    file_id = file_store.save_file(
        content=f,
        display_name="Important Document.pdf",
        file_origin=FileOrigin.OTHER,
        file_type="application/pdf",
        file_metadata={"department": "engineering", "version": "1.0"},
    )

# Check if a file exists
exists = file_store.has_file(
    file_id=file_id,
    file_origin=FileOrigin.OTHER,
    file_type="application/pdf",
)

# Read a file
file_content = file_store.read_file(file_id)

# Read file with temporary file (for large files)
file_content = file_store.read_file(file_id, use_tempfile=True)

# Get file metadata
file_record = file_store.read_file_record(file_id)

# Get file with MIME type detection
file_with_mime = file_store.get_file_with_mime_type(file_id)

# Delete a file
file_store.delete_file(file_id)
```

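For unit tests that should not touch real storage, a dictionary-backed stand-in can mimic part of the interface. The `InMemoryFileStore` class below is hypothetical and is not part of Onyx; it covers only a subset of the listed methods, uses plain strings where the real code uses FileOrigin, and assumes that has_file matches on origin and type as well as ID.

```python
import io
import uuid
from typing import IO, Any, Optional


class InMemoryFileStore:
    """Hypothetical test double covering a subset of the listed methods."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}
        self._records: dict[str, dict[str, Any]] = {}

    def initialize(self) -> None:
        # Nothing to create for an in-memory store.
        pass

    def save_file(
        self,
        content: IO[bytes],
        display_name: str,
        file_origin: str,
        file_type: str,
        file_metadata: Optional[dict[str, Any]] = None,
        file_id: Optional[str] = None,
    ) -> str:
        # Generate an ID when the caller does not supply one.
        file_id = file_id or str(uuid.uuid4())
        self._blobs[file_id] = content.read()
        self._records[file_id] = {
            "display_name": display_name,
            "file_origin": file_origin,
            "file_type": file_type,
            "file_metadata": file_metadata,
        }
        return file_id

    def has_file(self, file_id: str, file_origin: str, file_type: str) -> bool:
        record = self._records.get(file_id)
        return (
            record is not None
            and record["file_origin"] == file_origin
            and record["file_type"] == file_type
        )

    def read_file(self, file_id: str) -> IO[bytes]:
        return io.BytesIO(self._blobs[file_id])

    def get_file_size(self, file_id: str) -> int:
        return len(self._blobs[file_id])

    def delete_file(self, file_id: str) -> None:
        # Remove both the content and its metadata.
        del self._blobs[file_id]
        del self._records[file_id]

    def list_files_by_prefix(self, prefix: str) -> list[str]:
        return sorted(fid for fid in self._records if fid.startswith(prefix))
```

A test can pass `io.BytesIO(b"...")` as the content and then exercise the same save, read, and delete sequence shown in the usage example above.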
The blob storage connector (backend/onyx/connectors/blob/connector.py) also supports native GCS authentication via the admin UI. When creating a Google Cloud Storage connector, three auth methods are available:
Security note: When using ADC/Workload Identity in the blob connector, the connector inherits the permissions of the pod's service account. If that service account has access to buckets beyond the intended connector target (e.g., the internal file store bucket), an admin could point a connector at those buckets. This mirrors the existing S3 "Assume Role" auth method. The mitigation is IAM scoping at the infrastructure level: grant the pod's service account access only to the buckets it should reach.
When deploying the application, ensure that file_store.initialize() is called during application startup so that the bucket exists.

The file store will automatically create the bucket if it doesn't exist and the credentials have sufficient permissions.
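One way to wire this into startup is a small wrapper that calls initialize() once and fails fast if the bucket cannot be reached or created. The function name and log messages below are illustrative assumptions; any object with an initialize() method, such as the one returned by get_default_file_store(), could be passed in.

```python
import logging

logger = logging.getLogger(__name__)


def initialize_file_store_at_startup(file_store) -> None:
    """Call initialize() once and surface failures before serving traffic."""
    try:
        file_store.initialize()
    except Exception:
        # Log with traceback, then re-raise so the process does not start
        # with an unusable file store.
        logger.exception(
            "File store initialization failed; check bucket name and credentials"
        )
        raise
    logger.info("File store initialized")
```

Re-raising keeps a misconfigured bucket or missing permission from going unnoticed until the first upload.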