pkg/workload/workload_generator/README.md
The workload generator is a tool that automatically generates realistic workloads from CockroachDB debug zips. It analyzes schema definitions and SQL statements from debug logs to recreate both the database schema and query patterns, enabling you to replay production-like workloads in testing or development environments.
The workload generator extracts information from debug zips to create:
- The database schema, initialized in a target cluster
- Synthetic data that respects column types and foreign keys
- A schema YAML file plus read and write SQL workload files

This is useful for establishing performance baselines, comparing schema changes, and exercising a test cluster with production-like query patterns.
Debug zip: An unzipped CockroachDB debug bundle containing:
- crdb_internal.create_statements.txt (DDL statements)
- nodes/*/crdb_internal.node_statement_statistics.txt (SQL queries)

Generate a debug zip with:
cockroach debug zip /tmp/debug.zip --url='postgresql://...'
unzip /tmp/debug.zip -d /tmp/debug_logs
Note: The schema is almost always present in debug zips. In the rare case that it is missing, you can provide a DDL file as a fallback.
Target database: A running CockroachDB cluster where you want to initialize the workload
DDL file (fallback if schema missing from debug zip):
If the debug zip is missing schema information, you can create a DDL file manually:
cockroach sql --url='postgresql://...' \
--execute="show create all tables;" > ddl_file.sql
Before running the full init, discover which databases are in the debug logs:
./cockroach workload init workload_generator \
'postgresql://root@localhost:26257?sslmode=disable' \
--debug-logs=/path/to/debug_logs \
--list-dbs
Output:
Databases found in debug logs:
- my_app_db
- analytics_db
Generate schema, load data, and create SQL workload files:
./cockroach workload init workload_generator \
'postgresql://root@localhost:26257/my_app_db?sslmode=disable' \
--debug-logs=/path/to/debug_logs \
--rows=10000
This will:
- Create the tables in my_app_db and load synthetic data (scaled from the 10,000 base rows)
- Write my_app_db_read.sql and my_app_db_write.sql
- Write schema_my_app_db.yaml

Execute the extracted workload against your database:
./cockroach workload run workload_generator \
'postgresql://root@localhost:26257/my_app_db?sslmode=disable' \
--input-yaml=schema_my_app_db.yaml \
--duration=5m \
--concurrency=50 \
--read-pct=70
Generate schema files without initializing tables (useful for reviewing before loading data):
./cockroach workload init workload_generator \
'postgresql://root@localhost:26257/test_db?sslmode=disable' \
--debug-logs=/path/to/debug_logs \
--schema-only
The schema is almost always present in debug zips. Use the --ddl-file option only in the rare case that it is missing from the debug zip:
# First, create the DDL file from the source database
cockroach sql --url='postgresql://source-db:26257/mydb' \
--execute="show create all tables;" > ddl_export.sql
# Then use it with workload_generator
./cockroach workload init workload_generator \
'postgresql://root@localhost:26257/test_db?sslmode=disable' \
--ddl-file=ddl_export.sql \
--rows=5000
Note: When using --ddl-file, SQL workload extraction still requires the debug zip's
statement statistics. The DDL file only provides the schema information.
Write generated files to a specific location:
./cockroach workload init workload_generator \
'postgresql://root@localhost:26257/test_db?sslmode=disable' \
--debug-logs=/path/to/debug_logs \
--output-dir=/path/to/workloads
Output files:
- /path/to/workloads/schema_test_db.yaml
- /path/to/workloads/test_db_read.sql
- /path/to/workloads/test_db_write.sql

Skip schema generation and use an existing YAML file:
./cockroach workload init workload_generator \
'postgresql://root@localhost:26257/test_db?sslmode=disable' \
--input-yaml=schema_test_db.yaml
Flags for workload init:

| Flag | Default | Description |
|---|---|---|
| --debug-logs | "" | Path to unzipped debug logs directory |
| --ddl-file | "" | Path to DDL file (fallback if schema missing from debug zip) |
| --list-dbs | false | List databases in debug logs and exit |
| --rows | 1000 | Base row count for tables without FKs |
| --input-yaml | "" | Use existing schema YAML instead of generating |
| --output-dir | "." | Directory for generated files |
| --schema-only | false | Generate schema files only, skip table init |
| --db | "" | Override database name |
| --drop | false | Drop existing database before init |
Flags for workload run:

| Flag | Default | Description |
|---|---|---|
| --input-yaml | "" | Path to schema YAML file (required for run) |
| --duration | 0 | How long to run (0 = forever) |
| --concurrency | 24 | Number of concurrent workers |
| --read-pct | 50 | Percentage of operations that are reads (0-100) |
| --max-rate | 0 | Max operations per second (0 = unlimited) |
| --tx-timeout | 5m | Per-transaction timeout |
| --histograms | "" | File to write latency histogram data |
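To make the read/write split concrete, here is a minimal sketch, not the tool's actual implementation, of how a worker could choose between the extracted read and write statements based on --read-pct:

```go
package main

import (
	"fmt"
	"math/rand"
)

// pickStatement chooses a statement for one operation. readPct mirrors the
// --read-pct flag: the percentage of operations that should be reads. The
// two slices stand in for the queries loaded from <dbname>_read.sql and
// <dbname>_write.sql.
func pickStatement(readPct int, reads, writes []string, rng *rand.Rand) string {
	if rng.Intn(100) < readPct && len(reads) > 0 {
		return reads[rng.Intn(len(reads))]
	}
	return writes[rng.Intn(len(writes))]
}

func main() {
	rng := rand.New(rand.NewSource(1))
	reads := []string{"select id, email from users where id = $1"}
	writes := []string{"insert into users (email, created_at) values ($1, $2)"}
	// With --read-pct=70, roughly 70% of operations draw from the read pool.
	for i := 0; i < 5; i++ {
		fmt.Println(pickStatement(70, reads, writes, rng))
	}
}
```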
The generator processes the debug logs in four stages:

- Parse DDL: extracts table definitions from crdb_internal.create_statements.txt.
- Build Schema: creates a YAML schema with metadata for each table (types, primary keys, foreign keys, fanout).
- Generate Data: populates the tables with synthetic data.
- Extract Workload: parses SQL queries from the statement statistics, replaces literals (_, __more__) with column-aware placeholders (a sketch of this rewrite follows the example output files below), and writes the queries to <dbname>_read.sql and <dbname>_write.sql; the read/write mix at run time is controlled by --read-pct.

The schema YAML (schema_<dbname>.yaml) contains complete schema metadata:
users:
count: 10000
columns:
id:
type: sequence
isPrimaryKey: true
email:
type: string
args:
pattern: "email"
isUnique: true
created_at:
type: timestamp
pk: [id]
original_table: public.users
column_order: [id, email, created_at]
orders:
count: 50000 # Scaled by FK fanout
columns:
id:
type: sequence
isPrimaryKey: true
user_id:
type: int
hasForeignKey: true
fk: users.id
fanout: 5
pk: [id]
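If you need to work with the schema YAML programmatically, the fields above map naturally onto Go structs. The following is an illustrative sketch only; the type and field names are assumptions, not the package's internal types:

```go
// Package schemaspec sketches a Go representation of schema_<dbname>.yaml.
package schemaspec

// ColumnSpec mirrors a single entry under columns in the YAML above.
type ColumnSpec struct {
	Type          string                 `yaml:"type"`
	Args          map[string]interface{} `yaml:"args,omitempty"`
	IsPrimaryKey  bool                   `yaml:"isPrimaryKey,omitempty"`
	IsUnique      bool                   `yaml:"isUnique,omitempty"`
	HasForeignKey bool                   `yaml:"hasForeignKey,omitempty"`
	FK            string                 `yaml:"fk,omitempty"`
	Fanout        int                    `yaml:"fanout,omitempty"`
}

// TableSpec mirrors a single table entry.
type TableSpec struct {
	Count         int                   `yaml:"count"`
	Columns       map[string]ColumnSpec `yaml:"columns"`
	PK            []string              `yaml:"pk"`
	OriginalTable string                `yaml:"original_table,omitempty"`
	ColumnOrder   []string              `yaml:"column_order,omitempty"`
}

// Schema maps table names (users, orders, ...) to their specs.
type Schema map[string]TableSpec
```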
The read workload file (<dbname>_read.sql) contains SELECT queries extracted from statement statistics:
-- Transaction 1
select id, email from users where id = $1;
select count(*) from orders where user_id = $2;
-- Transaction 2
select * from orders where created_at > $1 limit $2;
The write workload file (<dbname>_write.sql) contains INSERT, UPDATE, and DELETE queries:
-- Transaction 1
insert into users (email, created_at) values ($1, $2);
-- Transaction 2
update orders set status = $1 where id = $2;
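The placeholder rewriting mentioned under Extract Workload can be pictured roughly as follows. This is a simplified sketch: statement statistics render literals as _ and __more__, and while the real generator substitutes column-aware placeholders, this version only assigns sequential parameter numbers:

```go
package main

import (
	"fmt"
	"regexp"
)

// literalToken matches the literal markers that appear in statement
// statistics. Word boundaries keep underscores inside identifiers
// (user_id, created_at) from being rewritten.
var literalToken = regexp.MustCompile(`\b(__more__|_)\b`)

// toPlaceholders replaces each literal marker with the next $N parameter.
func toPlaceholders(stmt string) string {
	n := 0
	return literalToken.ReplaceAllStringFunc(stmt, func(string) string {
		n++
		return fmt.Sprintf("$%d", n)
	})
}

func main() {
	fmt.Println(toPlaceholders("select id, email from users where id = _"))
	// select id, email from users where id = $1
	fmt.Println(toPlaceholders("select * from orders where created_at > _ limit _"))
	// select * from orders where created_at > $1 limit $2
}
```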
The workload generator creates synthetic data that respects your schema:
- Sequences for SERIAL and INT primary keys
- gen_random_uuid() for UUID columns

Row counts are automatically scaled based on foreign key relationships:
Base table (no FKs): 1000 rows (--rows value)
1 level deep (FKs): 5000 rows (1000 × fanout of 5)
2 levels deep: 25000 rows (5000 × fanout of 5)
Override with --rows flag to scale the entire workload.
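The scaling is just repeated multiplication by the fanout. A minimal sketch, assuming a uniform fanout read from the schema YAML (the real generator derives counts from each table's own foreign keys):

```go
package main

import "fmt"

// scaledRowCount applies the fanout-based scaling described above: a table's
// row count is its parent's count multiplied by the FK fanout, starting from
// the --rows base for tables with no foreign keys. depth is the table's
// distance from an FK-free base table.
func scaledRowCount(baseRows, fanout, depth int) int {
	rows := baseRows
	for i := 0; i < depth; i++ {
		rows *= fanout
	}
	return rows
}

func main() {
	// With --rows=1000 and a fanout of 5, as in the example schema:
	fmt.Println(scaledRowCount(1000, 5, 0)) // base table:   1000
	fmt.Println(scaledRowCount(1000, 5, 1)) // 1 level deep: 5000
	fmt.Println(scaledRowCount(1000, 5, 2)) // 2 levels deep: 25000
}
```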
If the debug zip is incomplete or corrupted:
- Ensure crdb_internal.create_statements.txt exists and is valid.
- Check that --debug-logs points to the unzipped directory, not the .zip file:
# Wrong
--debug-logs=/tmp/debug.zip
# Correct
--debug-logs=/tmp/debug_logs
If the database name doesn't match what's in the debug logs, use --list-dbs to see the available databases:
./cockroach workload init workload_generator \
'postgresql://...' \
--debug-logs=/path/to/debug_logs \
--list-dbs
In rare cases, the debug zip may not contain schema information in crdb_internal.create_statements.txt.
If this happens, create a DDL file manually from the source database:
# Extract DDL from the source database
cockroach sql --url='postgresql://source-db:26257/mydb' \
--execute="show create all tables;" > ddl_export.sql
# Use the DDL file with workload_generator
./cockroach workload init workload_generator \
'postgresql://...' \
--ddl-file=ddl_export.sql \
--debug-logs=/path/to/debug_logs \
--rows=10000
The --ddl-file flag provides the schema while the debug logs still provide SQL workload queries.
If tables fail to initialize in the correct foreign-key order, this indicates a bug in topological sorting. As a workaround, use --schema-only to generate the files, then use --input-yaml to run the workload.

For performance tuning:
- Adjust --concurrency (default: 24)
- Reduce --rows for faster initialization
- Use --max-rate to limit throughput if it is overwhelming the cluster
- Increase --tx-timeout if queries are timing out

Large tables can consume significant memory. Solutions:
- Reduce the --rows value
- Use --schema-only, then manually load smaller batches

If your debug logs contain multiple databases:
# List all databases
./cockroach workload init workload_generator \
'postgresql://...' --debug-logs=/path/to/logs --list-dbs
# Initialize each separately
for db in db1 db2 db3; do
./cockroach workload init workload_generator \
"postgresql://.../$db?sslmode=disable" \
--debug-logs=/path/to/logs \
--rows=5000
done
Export histograms for analysis:
./cockroach workload run workload_generator \
'postgresql://...' \
--input-yaml=schema_mydb.yaml \
--duration=10m \
--histograms=latency.json
Integrate with monitoring:
./cockroach workload run workload_generator \
'postgresql://...' \
--input-yaml=schema_mydb.yaml \
--prometheus-port=2112 \
--duration=1h
Edit the schema_<dbname>.yaml file to customize data generation:
users:
count: 50000 # Change row count
columns:
email:
type: string
args:
pattern: "[email protected]" # Custom pattern
status:
type: enum
args:
values: ["active", "inactive", "pending"] # Custom enum values
Then reinitialize with the modified YAML:
./cockroach workload init workload_generator \
'postgresql://...' \
--input-yaml=schema_mydb.yaml \
--drop # Drop and recreate with new settings
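To see how the edited args could drive value generation, here is a hedged sketch; generateValue and its rules are hypothetical illustrations of the customization points above, not the generator's actual logic:

```go
package main

import (
	"fmt"
	"math/rand"
)

// generateValue produces one synthetic value for a column. Enum columns draw
// from the values list; string columns with the "email" pattern get synthetic
// addresses; everything else falls back to a generic value.
func generateValue(colType string, args map[string]interface{}, row int, rng *rand.Rand) string {
	switch colType {
	case "enum":
		values := args["values"].([]string)
		return values[rng.Intn(len(values))]
	case "string":
		if args["pattern"] == "email" {
			return fmt.Sprintf("user%d@example.com", row)
		}
	}
	return fmt.Sprintf("value_%d", row)
}

func main() {
	rng := rand.New(rand.NewSource(1))
	emailArgs := map[string]interface{}{"pattern": "email"}
	statusArgs := map[string]interface{}{"values": []string{"active", "inactive", "pending"}}
	for row := 1; row <= 3; row++ {
		fmt.Println(generateValue("string", emailArgs, row, rng), generateValue("enum", statusArgs, row, rng))
	}
}
```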
Performance notes:
- Initialization time scales with row count and FK complexity
- Run performance depends on query complexity and concurrency
- Memory usage during init scales with batch size and table count
- Reduce --rows if memory constrained

A complete end-to-end workflow:

# 1. Generate debug zip from production
cockroach debug zip /tmp/prod_debug.zip \
--url='postgresql://root@prod-cluster:26257'
# 2. Unzip
unzip /tmp/prod_debug.zip -d /tmp/prod_logs
# 3. List available databases
./cockroach workload init workload_generator \
'postgresql://root@test-cluster:26257?sslmode=disable' \
--debug-logs=/tmp/prod_logs \
--list-dbs
# 4. Initialize workload (10K base rows)
./cockroach workload init workload_generator \
'postgresql://root@test-cluster:26257/myapp?sslmode=disable' \
--debug-logs=/tmp/prod_logs \
--rows=10000 \
--output-dir=./workloads
# 5. Run workload for 1 hour with metrics
./cockroach workload run workload_generator \
'postgresql://root@test-cluster:26257/myapp?sslmode=disable' \
--input-yaml=./workloads/schema_myapp.yaml \
--duration=1h \
--concurrency=50 \
--read-pct=80 \
--histograms=./workloads/metrics.json \
--prometheus-port=2112
To compare performance before and after a schema change:

# Generate baseline workload from production
./cockroach workload init workload_generator \
'postgresql://root@test1:26257/app?sslmode=disable' \
--debug-logs=/tmp/prod_logs \
--rows=50000
# Run baseline performance test
./cockroach workload run workload_generator \
'postgresql://root@test1:26257/app?sslmode=disable' \
--input-yaml=schema_app.yaml \
--duration=10m \
--histograms=baseline.json
# Apply schema changes to test2
# ... make schema modifications ...
# Run workload against modified schema
./cockroach workload run workload_generator \
'postgresql://root@test2:26257/app?sslmode=disable' \
--input-yaml=schema_app.yaml \
--duration=10m \
--histograms=modified.json
# Compare metrics
diff baseline.json modified.json
System schemas (information_schema, crdb_internal) are filtered out.

When making changes to the workload generator:
- Add tests in *_test.go files
- Run ./dev test pkg/workload/workload_generator
- Format with crlfmt -w -tab 2 *.go