docs/rfcs/2025-12-30-export-import-v2.md
This RFC proposes a redesigned export/import system (V2) for GreptimeDB that addresses fundamental issues in the current implementation. The new design leverages time-series characteristics for efficient chunking, provides clear storage semantics, and ensures data reliability through comprehensive validation mechanisms.
The current export/import implementation has several critical issues:
- `output_dir` serves dual purposes (local vs remote), causing confusion

A snapshot represents the complete state of a GreptimeDB catalog at a point in time, including schemas and data.
Snapshot structure:
snapshot-20250101/
├── manifest.json          # Snapshot metadata and chunk index
├── schema/
│   ├── schemas.json       # Schema definitions (JSON)
│   ├── tables.json        # Table definitions (JSON)
│   └── views.json         # View definitions (JSON)
└── data/
    ├── 1/
    │   ├── public.metrics.parquet
    │   └── public.logs.parquet
    ├── 2/
    │   ├── public.metrics.parquet
    │   └── public.logs.parquet
    └── 3/
        ├── public.metrics.parquet
        └── public.logs.parquet
Key properties:
- A schema-only snapshot contains only manifest.json and schema/; data/ is absent, chunks is empty, and later data appends are rejected (use --force to recreate)

A chunk is a time-range partition of data. Each chunk is independently exportable/importable and retryable.
Chunk properties:
- Each chunk covers a contiguous time range with start_time and end_time (recorded in the manifest)

Chunk directory naming:
- 1/, 2/, 3/, ...

V2 supports two storage types:
| Type | Example | Use Case |
|---|---|---|
| Remote Storage | s3://bucket/snapshots | Production (recommended) |
| Server Path | file:///data/backup | Local dev/testing |
Important: Local paths (e.g., /tmp/export, ./backup) are not supported because schema export (CLI) and data export (server) run in different processes, which would split the snapshot across two machines.
# Full snapshot to S3
greptime export create \
--to s3://my-bucket/snapshots/prod-20250101
# Incremental snapshot (time range)
greptime export create \
--start-time 2024-12-01T00:00:00Z \
--end-time 2024-12-31T23:59:59Z \
--to s3://my-bucket/snapshots/prod-december
# Schema-only export
greptime export create \
--schema-only \
--to s3://my-bucket/snapshots/prod-schema-only
Schema-only snapshots cannot be resumed with data; use `--force` to recreate.
# Export with specific format (default: parquet)
greptime export create \
--format csv \
--to s3://my-bucket/snapshots/prod-csv
# Resume interrupted export (automatic if snapshot exists)
greptime export create \
--to s3://my-bucket/snapshots/prod-20250101
# Force recreate (delete existing and start over)
greptime export create \
--to s3://my-bucket/snapshots/prod-20250101 \
--force
# Full import
greptime import \
--from s3://my-bucket/snapshots/prod-20250101
# Partial import (selected schemas)
greptime import \
--from s3://my-bucket/snapshots/prod-20250101 \
--schemas public,private
# Dry-run (verify without importing)
greptime import \
--from s3://my-bucket/snapshots/prod-20250101 \
--dry-run
The export/import system consists of four main components:
All components use OpenDAL for storage abstraction, supporting S3, OSS, GCS, Azure Blob, and local filesystem.
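As a rough sketch of that abstraction layer, the snippet below builds an OpenDAL operator for an S3 snapshot location. It assumes a recent `opendal` release (builder method style varies between versions); the bucket and root values are placeholders for whatever `--to` resolves to.

```rust
use opendal::{services::S3, Operator};

// Build an operator for an s3://my-bucket/snapshots/... location.
// Credentials and region would normally come from flags or the environment.
fn snapshot_operator() -> opendal::Result<Operator> {
    let builder = S3::default()
        .bucket("my-bucket")
        .root("/snapshots/prod-20250101")
        .region("us-east-1");
    Ok(Operator::new(builder)?.finish())
}
```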
The manifest is a JSON file containing snapshot metadata and chunk index:
Key fields:
- snapshot_id: Unique identifier (UUID)
- catalog, schemas: Catalog and schema list
- time_range: Overall time range covered
- schema_only: Whether the snapshot contains schema only
- chunks[]: Array of chunk metadata
- format: Data format for exported files
- checksum: Snapshot-level SHA256 checksum

Chunk metadata structure:
Each chunk entry in the manifest contains:
- id: Chunk identifier (sequential number)
- time_range: Start and end timestamps
- status: Export status (Pending, InProgress, Completed, Failed)
- files: List of data files in the chunk directory
- checksum: Chunk-level checksum for integrity verification
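For concreteness, the field lists above map onto a serde-style structure roughly like the one below. The Rust types and exact shapes are illustrative assumptions, not a finalized wire format.

```rust
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
struct Manifest {
    snapshot_id: String,   // UUID
    catalog: String,
    schemas: Vec<String>,
    time_range: TimeRange, // overall range covered by the snapshot
    schema_only: bool,
    format: String,        // e.g. "parquet"
    checksum: String,      // snapshot-level SHA256
    chunks: Vec<ChunkMeta>,
}

#[derive(Serialize, Deserialize)]
struct TimeRange {
    start: String,         // RFC 3339 timestamps
    end: String,
}

#[derive(Serialize, Deserialize)]
struct ChunkMeta {
    id: u64,               // sequential chunk number (directory name)
    time_range: TimeRange,
    status: ChunkStatus,
    files: Vec<String>,    // data files inside the chunk directory
    checksum: String,      // chunk-level checksum
}

#[derive(Serialize, Deserialize)]
enum ChunkStatus {
    Pending,
    InProgress,
    Completed,
    Failed,
}
```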
Schema definitions are stored as JSON (not SQL) for better version compatibility and programmatic processing.

Why JSON instead of SQL?
Data is exported via COPY DATABASE, supporting multiple formats:
Format is specified via --format flag and recorded in manifest.json. Import automatically detects the format from manifest.
Export/import operations validate storage paths to prevent misconfigurations:
Path types:
- s3://, oss://, gs://, azblob:// → Remote storage (recommended)
- file:/// → Server-local path (only allowed when CLI and server are co-located)
- /path, ./path → Rejected (would split the snapshot across machines)

Validation:
- For file:///, verify the server endpoint resolves to localhost or a local IP
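A minimal sketch of this classification rule, using hypothetical names (`StorageLocation`, `classify_location`) rather than an existing GreptimeDB API:

```rust
#[derive(Debug, PartialEq)]
enum StorageLocation {
    Remote,      // s3://, oss://, gs://, azblob://
    ServerLocal, // file:/// (only when CLI and server are co-located)
}

fn classify_location(uri: &str) -> Result<StorageLocation, String> {
    const REMOTE_SCHEMES: [&str; 4] = ["s3://", "oss://", "gs://", "azblob://"];
    if REMOTE_SCHEMES.iter().any(|s| uri.starts_with(s)) {
        Ok(StorageLocation::Remote)
    } else if uri.starts_with("file:///") {
        // Caller must additionally verify the server endpoint resolves to
        // localhost or a local IP before accepting a server-local path.
        Ok(StorageLocation::ServerLocal)
    } else {
        // Bare local paths like /path or ./path would split the snapshot
        // across the CLI and server machines, so they are rejected.
        Err(format!("unsupported location: {uri}"))
    }
}
```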
Data is partitioned into time-range chunks for efficient parallel processing and retry.

Algorithm:
Chunk time window selection:
The optimal chunk time window depends on data density (volume per unit time):
Example: 500GB database spanning 30 days → ~16.7GB/day → use 1h chunks → ~695MB/chunk
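A sketch of the window split implied by this example, using `chrono` for timestamps; the function name is hypothetical:

```rust
use chrono::{DateTime, Duration, Utc};

// Split [start, end) into fixed time windows; the last chunk may be shorter.
// E.g. 30 days with a 1h window yields 720 chunks, i.e. ~695 MB per chunk
// for a 500 GB database.
fn chunk_windows(
    start: DateTime<Utc>,
    end: DateTime<Utc>,
    window: Duration,
) -> Vec<(DateTime<Utc>, DateTime<Utc>)> {
    let mut chunks = Vec::new();
    let mut cursor = start;
    while cursor < end {
        let next = (cursor + window).min(end);
        chunks.push((cursor, next));
        cursor = next;
    }
    chunks
}
```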
V2 leverages the existing COPY DATABASE TO statement for data export, with an additional tooling layer for chunking, resume, and metadata management.
How it works:
COPY DATABASE <schema> TO '<chunk_path>' WITH (
START_TIME = '<chunk_start>',
END_TIME = '<chunk_end>',
FORMAT = 'parquet'
)
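For illustration, the tooling layer can render one such statement per chunk; the helper below is a hypothetical sketch that simply fills in the template above (credentials/CONNECTION options for object storage are omitted):

```rust
// Render the COPY DATABASE statement for one chunk of one schema.
fn copy_database_stmt(schema: &str, chunk_path: &str, start: &str, end: &str) -> String {
    format!(
        "COPY DATABASE {schema} TO '{chunk_path}' \
         WITH (START_TIME = '{start}', END_TIME = '{end}', FORMAT = 'parquet')"
    )
}

// Example: chunk 1 of schema `public` covering the first day of December.
// copy_database_stmt(
//     "public",
//     "s3://my-bucket/snapshots/prod-december/data/1/",
//     "2024-12-01T00:00:00Z",
//     "2024-12-02T00:00:00Z",
// )
```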
Separation of concerns:
Three-layer checksum validation ensures data integrity:
Checksums are verified during import before data is written to the database.
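As a sketch of the per-file layer (assuming SHA-256 via the `sha2` and `hex` crates; how per-file digests roll up into chunk- and snapshot-level checksums is up to the manifest writer):

```rust
use sha2::{Digest, Sha256};
use std::{
    fs::File,
    io::{self, Read},
};

// Stream a data file and return its SHA-256 digest as a hex string.
fn file_sha256(path: &str) -> io::Result<String> {
    let mut file = File::open(path)?;
    let mut hasher = Sha256::new();
    let mut buf = [0u8; 8192];
    loop {
        let n = file.read(&mut buf)?;
        if n == 0 {
            break;
        }
        hasher.update(&buf[..n]);
    }
    Ok(hex::encode(hasher.finalize()))
}
```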
Chunk-level retry:
Resume capability:
- Use --force (export only) to delete the existing snapshot and start over; a minimal resume sketch follows the scenarios below

Scenario 1: Export to different paths ✅
Scenario 2: Export to same path ⚠️
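A minimal resume sketch driven by the statuses recorded in the manifest; the type and function names are illustrative only:

```rust
#[derive(Clone, Copy, PartialEq)]
enum ChunkStatus {
    Pending,
    InProgress,
    Completed,
    Failed,
}

struct ChunkEntry {
    id: u64,
    status: ChunkStatus,
}

// On re-run, completed chunks are skipped; pending, interrupted (InProgress)
// and failed chunks are (re)processed.
fn plan_resume(manifest_chunks: &[ChunkEntry]) -> Vec<u64> {
    manifest_chunks
        .iter()
        .filter(|c| c.status != ChunkStatus::Completed)
        .map(|c| c.id)
        .collect()
}
```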
greptime export create [OPTIONS] --to <LOCATION>
Required:
--to <LOCATION> Target storage location
Optional:
--catalog <CATALOG> Catalog name (default: greptime)
--schemas <SCHEMAS> Comma-separated schema list (default: all)
--start-time <TIMESTAMP> Time range start (default: earliest)
--end-time <TIMESTAMP> Time range end (default: now)
--chunk-time-window <DURATION> Chunk time window (default: 1d)
--parallelism <N> Concurrency level (default: 1)
--format <FORMAT> Export format for data files: parquet (default), csv, json, or other formats supported by COPY DATABASE
--schema-only Export schema only, no data
--force Delete existing snapshot and recreate
Behavior:
- If snapshot doesn't exist: create new snapshot
- If snapshot exists: automatically resume export (skip completed chunks,
retry failed chunks, process pending chunks)
- If --force is specified: delete existing snapshot first, then create new one
greptime import [OPTIONS] --from <SNAPSHOT>
Required:
--from <SNAPSHOT> Source snapshot location
Optional:
--catalog <CATALOG> Catalog name (default: greptime)
--schemas <SCHEMAS> Comma-separated schema list (default: all)
--parallelism <N> Concurrency level (default: 1)
--dry-run Verify without importing
--time-range <RANGE> Import partial time range only
Behavior:
- Automatically resumes if import was previously interrupted
- Skips completed chunks, retries failed chunks, processes pending chunks
# List snapshots
greptime export list --location s3://bucket/snapshots
# Verify snapshot integrity
greptime export verify --snapshot s3://bucket/snapshots/prod-20250101
# Delete snapshot
greptime export delete --snapshot s3://bucket/snapshots/old-snapshot
Alternatives considered:

Decision: Fixed time-window chunking

Alternative: SQL dumps (V1 approach)

Decision: JSON schema format