pkg/cmd/roachprod-centralized/docs/ARCHITECTURE.md
This document describes the system architecture, design patterns, and key components of the roachprod-centralized service.
The roachprod-centralized service is designed as a modern, cloud-native REST API that centralizes management of CockroachDB roachprod clusters across multiple cloud providers. The architecture follows clean architecture principles with clear separation of concerns and dependency inversion.
┌─────────────────────────────────────────────────────────────────┐
│                          Load Balancer                          │
└────────────────────────────────┬────────────────────────────────┘
                                 │
┌────────────────────────────────▼────────────────────────────────┐
│                      API Gateway / Router                       │
│                      (Gin HTTP Framework)                       │
└────────────────────────────────┬────────────────────────────────┘
                                 │
┌────────────────────────────────▼────────────────────────────────┐
│                        Controllers Layer                        │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │   Health    │ │  Clusters   │ │    Tasks    │ │ Public DNS  │ │
│ │ Controller  │ │ Controller  │ │ Controller  │ │ Controller  │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │
└────────────────────────────────┬────────────────────────────────┘
                                 │
┌────────────────────────────────▼────────────────────────────────┐
│                         Services Layer                          │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │   Health    │ │  Clusters   │ │    Tasks    │ │ Public DNS  │ │
│ │   Service   │ │   Service   │ │   Service   │ │   Service   │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │
└────────────────────────────────┬────────────────────────────────┘
                                 │
┌────────────────────────────────▼────────────────────────────────┐
│                       Repositories Layer                        │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │   Health    │ │  Clusters   │ │    Tasks    │ │   Config    │ │
│ │ Repository  │ │ Repository  │ │ Repository  │ │ Repository  │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │
└────────────────────────────────┬────────────────────────────────┘
                                 │
┌────────────────────────────────▼────────────────────────────────┐
│                          Storage Layer                          │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │   Memory    │ │ CockroachDB │ │    Files    │ │ Cloud APIs  │ │
│ │   Storage   │ │  Database   │ │   System    │ │ Integration │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────┐
│               Background Systems                │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │Task Workers │ │   Cluster   │ │     DNS     │ │
│ │    Pool     │ │    Sync     │ │    Sync     │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────┘
Controllers Layer (controllers/)

Responsibility: HTTP request/response handling and routing
Key Features:
- Responses carry request_id and result_type
- AuthorizationRequirement on handlers (RequiredPermissions / AnyOf)

Services Layer (services/)

Responsibility: Business logic and orchestration
Each service package contains:
Clusters Service (clusters/):
- tasks/: Cluster-related background tasks
- models/: Cluster-specific data models
- mocks/: Test mocks

Tasks Service (tasks/):
- service.go: Orchestration and lifecycle
- api.go: Public CRUD operations
- coordination.go: Inter-service helpers
- registry.go: Task type registration and hydration
- operations.go: Business operations
- internal/processor/: Worker pool and task execution
- internal/scheduler/: Periodic task scheduling
- internal/metrics/: Metrics collection
- tasks/: Concrete task implementations (e.g., purge)
- types/: Task interfaces and DTOs
- mocks/: Test mocks

Public DNS Service (public-dns/):
- tasks/: DNS-related background tasks
- models/: DNS-specific data models
- mocks/: Test mocks

Health Service (health/):
- tasks/: Health check background tasks
- mocks/: Test mocks

Key Features:
To keep authorization understandable and maintainable:
This split keeps route declarations simple while preventing authorization bypasses in complex business flows.
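The AuthorizationRequirement declared on handlers (RequiredPermissions / AnyOf) can be pictured as a small value that is evaluated before the handler runs. The following is a minimal sketch, assuming RequiredPermissions means "all of these" and AnyOf means "at least one of these"; the struct layout, the `Satisfied` method, and the permission strings are hypothetical and not taken from the service's code.

```go
package main

import "fmt"

// AuthorizationRequirement is a hypothetical shape for a per-handler requirement.
type AuthorizationRequirement struct {
	RequiredPermissions []string // assumed: every entry must be granted
	AnyOf               []string // assumed: at least one entry must be granted, if non-empty
}

// Satisfied reports whether the granted permission set meets the requirement.
func (r AuthorizationRequirement) Satisfied(granted map[string]bool) bool {
	for _, p := range r.RequiredPermissions {
		if !granted[p] {
			return false
		}
	}
	if len(r.AnyOf) == 0 {
		return true
	}
	for _, p := range r.AnyOf {
		if granted[p] {
			return true
		}
	}
	return false
}

func main() {
	// Permission names below are illustrative only.
	req := AuthorizationRequirement{
		RequiredPermissions: []string{"clusters:read"},
		AnyOf:               []string{"clusters:sync", "admin"},
	}
	granted := map[string]bool{"clusters:read": true, "admin": true}
	fmt.Println(req.Satisfied(granted)) // prints: true
}
```

Keeping the requirement as plain data means a route declaration only names what it needs, while the evaluation logic lives in one place.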
Repositories Layer (repositories/)

Responsibility: Data access abstraction
Each repository has multiple implementations:
- memory/: In-memory storage for development
- cockroachdb/: Production database storage with migrations
- mocks/: Testing support

Repositories:
Key Features:
Configuration (config/)

Responsibility: Multi-source configuration management
The configuration system supports hierarchical configuration from multiple sources:
Packages:
- types/: Configuration structure definitions
- flags/: CLI flag handling (Cobra integration)
- env/: Environment variable processing
- processing/: Configuration merging and validation
- recursive/: Recursive configuration merging

Configuration Sources (in order of precedence):
1. Environment variables (ROACHPROD_*)
2. Command-line flags
3. YAML configuration file
4. Default values

Key Features:
Utilities (utils/)

Responsibility: Shared utilities across the application
Packages:
- api/: HTTP API utilities
  - bindings/: Request binding helpers for Gin framework
- database/: Database connection and helper utilities
- filters/: Query filtering system (Stripe-style)
  - types/: Filter type definitions
  - memory/: In-memory filter implementation
  - sql/: SQL query filter generation
- logger/: Structured logging wrapper

Responsibility: Asynchronous task processing
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Task Queue    │◄──►│ Task Processor  │◄──►│  Task Workers   │
│   (Database)    │    │   Coordinator   │    │      Pool       │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                      │                      │
         │             ┌────────┴────────┐             │
         └─────────────┤   Task Types    ├─────────────┘
                       │                 │
                       │ • Cluster Sync  │
                       │ • DNS Sync      │
                       │ • Health Check  │
                       └─────────────────┘
HTTP Request
         │
         ▼
┌─────────────────┐
│   Middleware    │ ◄─ Authentication, Logging, CORS
│    Pipeline     │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Controller    │ ◄─ Request validation, parameter binding
│                 │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│     Service     │ ◄─ Business logic, orchestration
│                 │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Repository    │ ◄─ Data access, persistence
│                 │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    Response     │ ◄─ Formatted JSON with request_id/result_type
│                 │
└─────────────────┘
API Request (POST /clusters/sync)
         │
         ▼
┌─────────────────┐
│   Controller    │ ◄─ Validate request
│                 │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│     Service     │ ◄─ Create task record
│                 │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Task Queue    │ ◄─ Store task in database
│                 │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Background    │ ◄─ Process task asynchronously
│     Worker      │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Cloud Provider  │ ◄─ Execute actual operations
│      APIs       │
└─────────────────┘
The background task system uses a distributed, fault-tolerant design that supports multiple deployment modes:
All-in-One Mode:
API-Only Mode (--no-workers):
Workers-Only Mode (workers command):
Tasks flow through a simple state machine:
┌─────────────────────────┐
│      Task Created       │
│ (via API or background  │
│      periodic job)      │
└────────────┬────────────┘
             │
             ▼
      ┌────────────┐
      │  pending   │ ◄─── Tasks wait here until claimed by worker
      └──────┬─────┘
             │
             │ Worker claims task
             ▼
      ┌────────────┐
      │  running   │ ◄─── Worker actively processing task
      └──────┬─────┘
             │
      ┌──────┴────────────┐
      │                   │
      │ Success           │ Error/Timeout
      ▼                   ▼
┌────────────┐     ┌─────────────┐
│    done    │     │   failed    │ ◄─── Terminal states
└────────────┘     └─────────────┘      (no auto-retry)
State Transitions:
- pending → running: Worker claims task via GetTasksForProcessing()
- running → done: Task execution succeeds
- running → failed: Task execution errors or times out

Note: There is no automatic retry mechanism. Failed tasks remain in the failed state. Retries must be handled at the application level (e.g., by manually re-creating the task).
| Task Type | Description | Trigger |
|---|---|---|
| cluster_sync | Synchronize cluster data from cloud providers | POST /clusters/sync, Initial sync, Periodic sync |
| dns_sync | Update DNS records for clusters | POST /public-dns/sync |
| health_check | Periodic system health validation | Scheduled |
When instances start, they use an intelligent decision algorithm to determine whether to perform an initial cluster synchronization. This prevents redundant syncs in distributed deployments while ensuring data freshness.
The system performs an initial sync if any of these conditions are true:
- No completed cluster_sync task is found in the database
- The most recent completed sync is older than PeriodicRefreshInterval (default: 10 minutes)

The system skips the initial sync only when ALL conditions are met:
The system tracks API instance coverage because:
          Instance Starts
                 │
                 ▼
┌─────────────────────────────────┐
│   Health Service Registers      │ ◄─ Sends heartbeat to database
│   Instance (heartbeat starts)   │
└────────────────┬────────────────┘
                 │
                 ▼
┌─────────────────────────────────┐
│   Clusters Service Checks       │ ◄─ Query for recent completed sync
│   If Initial Sync Needed        │    Check for other healthy instances
└────────────────┬────────────────┘
                 │
        ┌────────┴────────┐
        │                 │
    YES │                 │ NO
        ▼                 ▼
 ┌──────────────┐ ┌──────────────┐
 │   Schedule   │ │ Skip Initial │
 │  Sync Task   │ │     Sync     │
 └──────┬───────┘ └───────┬──────┘
        │                 │
        ▼                 │
 ┌──────────────┐         │
 │   Wait for   │         │
 │  Task Done   │         │
 └──────┬───────┘         │
        │                 │
        └────────┬────────┘
                 │
                 ▼
         ┌───────────────┐
         │   Start API   │ ◄─ Fresh data guaranteed
         │    Server     │
         └───────────────┘
Scenario 1: First Worker Instance
Scenario 2: Second Worker Instance (10 minutes later)
Scenario 3: First API Instance (5 minutes after workers)
Scenario 4: Second API Instance (2 minutes later)
┌─────────────────────────────────────────────────────────────────┐
│                   Cloud Provider Abstraction                    │
└────────────────────────────────┬────────────────────────────────┘
                                 │
┌────────────────────────────────▼────────────────────────────────┐
│                    Provider Implementations                     │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │     GCE     │ │     AWS     │ │    Azure    │ │     IBM     │ │
│ │  Provider   │ │  Provider   │ │  Provider   │ │  Provider   │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │
└────────────────────────────────┬────────────────────────────────┘
                                 │
┌────────────────────────────────▼────────────────────────────────┐
│                           Cloud APIs                            │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ GCP Compute │ │   AWS EC2   │ │  Azure VMs  │ │  IBM Cloud  │ │
│ │   Engine    │ │             │ │             │ │     VMs     │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Each provider supports:
-- Tasks table (primary workload)
CREATE TABLE tasks (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    type STRING NOT NULL,
    status STRING NOT NULL DEFAULT 'pending',
    payload JSONB,
    result JSONB,
    consumer_id STRING,
    created_at TIMESTAMPTZ DEFAULT now(),
    updated_at TIMESTAMPTZ DEFAULT now(),
    INDEX idx_tasks_status_type (status, type),
    INDEX idx_tasks_consumer (consumer_id),
    INDEX idx_tasks_created (created_at)
);

-- Clusters table (cached cloud data)
CREATE TABLE clusters (
    name STRING PRIMARY KEY,
    provider STRING NOT NULL,
    data JSONB NOT NULL,
    created_at TIMESTAMPTZ DEFAULT now(),
    updated_at TIMESTAMPTZ DEFAULT now(),
    INDEX idx_clusters_provider (provider)
);
HTTP request authentication flow:
┌──────────┐
│  Client  │
│          │
└─────┬────┘
      │ HTTP Request with JWT
      │ (X-Goog-IAP-JWT-Assertion header)
      ▼
┌─────────────────┐
│  Load Balancer  │
│    (Optional    │
│   Google IAP)   │
└────────┬────────┘
         │ Forwards request with JWT
         ▼
┌─────────────────────────────────────────┐
│          roachprod-centralized          │
│                                         │
│  ┌───────────────────────────────────┐  │
│  │      Gin Middleware Pipeline      │  │
│  │                                   │  │
│  │  1. Request ID                    │  │
│  │  2. Logging                       │  │
│  │  3. JWT Authentication (optional) │◄─┼── Configured via
│  │     - Extract JWT from header     │  │   --api-authentication-disabled
│  │     - Validate signature          │  │   --api-authentication-jwt-audience
│  │     - Check audience              │  │
│  │                                   │  │
│  └────────────────┬──────────────────┘  │
│                   │                     │
│                   ▼                     │
│  ┌───────────────────────────────────┐  │
│  │        Controller Handler         │  │
│  │  (clusters, tasks, dns, health)   │  │
│  └───────────────────────────────────┘  │
└─────────────────────────────────────────┘
Authentication Modes:
- --api-authentication-disabled=true (no JWT validation)

┌─────────────────────────────────┐
│      Configuration Sources      │
│  (Highest to Lowest Priority)   │
└────────────────┬────────────────┘
                 ▼
┌─────────────────────────────────┐
│      Environment Variables      │
│      (ROACHPROD_* prefix)       │
└────────────────┬────────────────┘
                 ▼
┌─────────────────────────────────┐
│       Command Line Flags        │
│ (--api-port, --log-level, etc.) │
└────────────────┬────────────────┘
                 ▼
┌─────────────────────────────────┐
│     YAML Configuration File     │
│  (config.yaml, --config flag)   │
└────────────────┬────────────────┘
                 ▼
┌─────────────────────────────────┐
│         Default Values          │
│        (Built into code)        │
└─────────────────────────────────┘
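The priority order in the diagram amounts to "first source that defines a key wins". The following is a minimal sketch of that lookup, assuming flat string keys; the `Source` type and `Resolve` function are hypothetical and much simpler than the recursive merging the config packages perform, and the sample values are made up.

```go
package main

import "fmt"

// Source is one configuration layer; keys it does not define fall through
// to lower-priority layers.
type Source map[string]string

// Resolve walks sources from highest to lowest priority and returns the
// first value found for key.
func Resolve(key string, sources ...Source) (string, bool) {
	for _, src := range sources {
		if v, ok := src[key]; ok {
			return v, true
		}
	}
	return "", false
}

func main() {
	// Layers in the documented order: environment, flags, YAML file, defaults.
	env := Source{"log-level": "debug"}
	flags := Source{"log-level": "warn", "api-port": "9090"}
	file := Source{"api-port": "8080"}
	defaults := Source{"api-port": "8080", "log-level": "info"}

	port, _ := Resolve("api-port", env, flags, file, defaults)
	level, _ := Resolve("log-level", env, flags, file, defaults)
	fmt.Println(port, level) // prints: 9090 debug
}
```

Here the environment layer overrides the flag for log-level, while api-port falls through to the flag because the environment does not set it.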
// Service interfaces define contracts
type IClusterService interface {
    GetAllClusters(ctx context.Context, ...) ([]Cluster, error)
}

// Implementation injected at runtime
type ClusterService struct {
    repo   clusters.IRepository
    logger *logger.Logger
}

// Factory pattern for service creation
func NewServicesFromConfig(cfg *config.Config) (*Services, error) {
    // Create repositories based on configuration
    // Inject dependencies into services
    // Return configured service collection
}

// Abstract interface for data access
type IRepository interface {
    Create(ctx context.Context, cluster *Cluster) error
    GetByName(ctx context.Context, name string) (*Cluster, error)
    Update(ctx context.Context, cluster *Cluster) error
    Delete(ctx context.Context, name string) error
}

// Multiple implementations
type MemoryRepository struct { /* ... */ }
type CockroachDBRepository struct { /* ... */ }
The service supports three deployment modes for different scalability and reliability requirements:
Single process handles both API and background work:
┌───────────────────────────┐
│         Instance          │
│  ┌─────────────────────┐  │
│  │     API Server      │  │
│  │    + Controllers    │  │
│  └─────────────────────┘  │
│  ┌─────────────────────┐  │
│  │   Background Work   │  │
│  │  - Task Workers     │  │
│  │  - Periodic Sync    │  │
│  │  - Health Heartbeat │  │
│  └─────────────────────┘  │
└─────────────┬─────────────┘
              │
    ┌─────────▼─────────┐
    │     Database      │
    │ (memory or CRDB)  │
    └───────────────────┘
Use Case: Development, small production deployments
Database: Memory or CockroachDB
Command: roachprod-centralized api
Separate API tier and worker tier for independent scaling:
               ┌─────────────┐
               │Load Balancer│
               │ (HTTP :80)  │
               └──────┬──────┘
                      │
          ┌───────────┴───────────┐
          │                       │
          ▼                       ▼
 ┌─────────────────┐     ┌─────────────────┐
 │ API Instance 1  │     │ API Instance 2  │
 │                 │     │                 │
 │ ┌─────────────┐ │     │ ┌─────────────┐ │
 │ │ Controllers │ │     │ │ Controllers │ │
 │ │ HTTP Routes │ │     │ │ HTTP Routes │ │
 │ │ (No Workers)│ │     │ │ (No Workers)│ │
 │ └─────────────┘ │     │ └─────────────┘ │
 └────────┬────────┘     └────────┬────────┘
          │                       │
          └───────────┬───────────┘
                      │ (Read/write cluster data, enqueue tasks)
                      ▼
             ┌─────────────────┐
             │   CockroachDB   │
             │ (Shared state,  │
             │   task queue)   │
             └─────────────────┘
                      ▲
                      │ (Poll & process tasks, coordinate via DB)
          ┌───────────┴───────────┐
          │                       │
┌─────────┴─────────┐   ┌─────────┴─────────┐
│ Worker Instance 1 │   │ Worker Instance 2 │
│                   │   │                   │
│ ┌───────────────┐ │   │ ┌───────────────┐ │
│ │ Task Workers  │ │   │ │ Task Workers  │ │
│ │ Periodic Sync │ │   │ │ Periodic Sync │ │
│ │ (Metrics Only)│ │   │ │ (Metrics Only)│ │
│ └───────────────┘ │   │ └───────────────┘ │
└───────────────────┘   └───────────────────┘
Use Case: High-availability, load distribution, independent scaling

Database: CockroachDB (required)

Commands:
- roachprod-centralized api --no-workers
- roachprod-centralized workers

Benefits:
Key Differences from All-in-One:
API-Only Mode (--no-workers):
Workers-Only Mode (workers command):
API Tier Scaling:
- /health endpoint for LB health checks

Worker Tier Scaling: