docs/code-indexing/cobol/performance.md
This document covers real-world benchmarks, worker pool configuration, memory management, known limitations, and troubleshooting for COBOL indexing.
The PROJECT-NAME project is a large Italian payroll system written in COBOL. It serves as the primary benchmark for COBOL indexing performance.
Corpus and scan statistics:

| Metric | Value |
|---|---|
| Paths scanned | 14,217 |
| Parseable files | 13,129 |
| Total source size | 224 MB |
| Chunks | 12 (at 20 MB budget) |
| Copybooks loaded | 2,976 |
| Copybooks used in expansion | 2,955 |
| Key directories | s/ (7,773 programs), c/ (3,036 copybooks), wfproc/ (1,973 workflow files) |
Resulting graph:

| Metric | Value |
|---|---|
| Graph nodes | 2.79M |
| Graph edges | 5.67M |
| Clusters (communities) | 16,679 |
| Execution flows | 300 |
Indexing time by phase:

| Phase | Duration |
|---|---|
| Total | ~251s |
| KuzuDB write | 132s |
| Full-text search indexing | 6.7s |
| Regex extraction (avg per file) | ~1ms |
| COPY expansion + deep indexing | Remainder (~112s) |
Benchmark invocation:

```bash
cd /path/to/PROJECT-NAME
GITNEXUS_COBOL_DIRS=s,c,wfproc GITNEXUS_VERBOSE=1 node --max-old-space-size=8192 \
  /path/to/gitnexus/dist/cli/index.js analyze --force
```
| Metric | Value |
|---|---|
| Graph nodes | 12,323 |
| Graph edges | 8,893 |
| Total time | 7.4s |

| Metric | Value |
|---|---|
| Graph nodes | 14,016 |
| Graph edges | 15,452 |
| Total time | 9.3s |

| Metric | Value |
|---|---|
| Per-iteration | 0.65ms |
| Throughput | ~382K lines/sec |
The worker pool splits each worker's chunk into sub-batches to bound the peak memory consumed by each postMessage serialization. COBOL repos use a smaller sub-batch size than the default:
| Parameter | Default | COBOL Mode |
|---|---|---|
| Sub-batch size | 1,500 files | 200 files |
| Per sub-batch timeout | 120s | 120s (configurable) |
Why 200? COBOL regex extraction plus preprocessing takes ~1ms per file on average, but with COPY expansion and deep indexing the effective time is ~150ms per file. At the default sub-batch size of 1,500, that would be ~225s per sub-batch, exceeding the 120s timeout; at 200 files, a sub-batch finishes in roughly 30s, comfortably within it.
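A minimal sketch of the splitting-plus-timeout pattern (helper names like `processBatch` are hypothetical; the real logic lives in worker-pool.ts):

```typescript
// Sketch only: processBatch stands in for the real worker postMessage round-trip.
const SUB_BATCH_SIZE = process.env.GITNEXUS_COBOL_DIRS ? 200 : 1500;
const TIMEOUT_MS = Number(process.env.GITNEXUS_WORKER_TIMEOUT_MS ?? 120_000);

function withTimeout<T>(p: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: NodeJS.Timeout | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  return Promise.race([p, timeout]).finally(() => clearTimeout(timer));
}

async function processChunk(
  files: string[],
  processBatch: (batch: string[]) => Promise<void>,
): Promise<void> {
  for (let i = 0; i < files.length; i += SUB_BATCH_SIZE) {
    const batch = files.slice(i, i + SUB_BATCH_SIZE);
    // One timeout per sub-batch: a single pathological file fails its own
    // batch without stalling the whole chunk.
    await withTimeout(processBatch(batch), TIMEOUT_MS, `sub-batch ${i / SUB_BATCH_SIZE}`);
  }
}
```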
COBOL mode is activated automatically when GITNEXUS_COBOL_DIRS is set:
```typescript
// From pipeline.ts
const cobolSubBatch = process.env.GITNEXUS_COBOL_DIRS ? 200 : undefined;
workerPool = createWorkerPool(workerUrl, undefined, cobolSubBatch);
```
Workers default to `min(8, cpus - 1)`. For COBOL repos this is usually sufficient: regex extraction is CPU-bound but fast, and the bottleneck is typically the KuzuDB write, not extraction.
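As a sketch, that default amounts to the following (assuming Node's `os` module):

```typescript
import * as os from "node:os";

// Cap at 8 workers, leave one core for the main thread, never go below 1.
const workerCount = Math.min(8, Math.max(1, os.cpus().length - 1));
```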
| Environment Variable | Default | Purpose |
|---|---|---|
| GITNEXUS_WORKER_TIMEOUT_MS | 120,000 (2 min) | Per sub-batch processing timeout |
| GITNEXUS_WORKER_STARTUP_TIMEOUT_MS | 60,000 (1 min) | Worker initialization timeout (tree-sitter loading) |
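Reading these with their documented defaults could look like this (a sketch, not the exact source):

```typescript
// Fall back to the documented defaults when the variables are unset.
const workerTimeoutMs = Number(process.env.GITNEXUS_WORKER_TIMEOUT_MS ?? 120_000);
const startupTimeoutMs = Number(process.env.GITNEXUS_WORKER_STARTUP_TIMEOUT_MS ?? 60_000);
```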
For COBOL-only repos, worker startup is faster because tree-sitter native modules are loaded lazily (skipped entirely if only COBOL files are present).
```typescript
const MAX_DATA_ITEMS_PER_FILE = 500;
```
This constant appears in both parse-worker.ts (worker path) and parsing-processor.ts (sequential fallback).
Some COBOL programs, especially after COPY expansion, can define 10,000+ data items. At that scale the graph balloons and the KuzuDB write phase slows proportionally. The cap truncates data items beyond the 500th in source order; since 01-level records appear first in COBOL source, the cap preserves the top-level record definitions and drops only deeply nested subordinate items.
To increase the cap for specific needs, modify the MAX_DATA_ITEMS_PER_FILE constant in both files.
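The truncation itself is simple. A sketch, with an illustrative `DataItem` shape:

```typescript
interface DataItem {
  name: string;
  level: number; // COBOL level number: 01, 05, 10, ...
  line: number;
}

const MAX_DATA_ITEMS_PER_FILE = 500;

// Keep the first 500 items in source order; 01-level records come first,
// so top-level record definitions survive the cap.
function capDataItems(items: DataItem[]): DataItem[] {
  return items.slice(0, MAX_DATA_ITEMS_PER_FILE);
}
```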
A per-file MAX_TOTAL_EXPANSIONS = 500 limit prevents exponential blowup from diamond-shaped COPY graphs (e.g., N copybooks each containing N COPY statements). Once the limit is reached, further COPY statements in that file are left unexpanded. See copy-expansion.md for details.
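A sketch of the budget guard (the `copybooks` map and the regex are illustrative; see copy-expansion.md for the actual engine):

```typescript
const MAX_TOTAL_EXPANSIONS = 500;
const copybooks = new Map<string, string>(); // copybook name -> source text

// Shared mutable budget per file: once spent, remaining COPY statements
// are left unexpanded instead of recursing further.
function expandCopies(source: string, budget = { remaining: MAX_TOTAL_EXPANSIONS }): string {
  return source.replace(/^ *COPY +([A-Z0-9-]+) *\./gim, (stmt, name: string) => {
    const body = copybooks.get(name.toUpperCase());
    if (!body || budget.remaining <= 0) return stmt;
    budget.remaining--;
    return expandCopies(body, budget);
  });
}
```

Note that the budget also bounds circular includes: a self-referential copybook can recurse at most 500 times before expansion stops.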
All copybook content is loaded upfront into a Map before chunk processing begins (2,976 copybooks for PROJECT-NAME); the map is released afterwards by setting `cobolCopybookContents = undefined`.
Source files are grouped into chunks of at most 20 MB (`CHUNK_BYTE_BUDGET`). Each chunk is read, parsed, and written before its buffers are released, so only ~20 MB of source plus ~200-400 MB of working memory (ASTs, extracted records, serialization) is active at any time.
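The grouping is a greedy pass over the file list; a sketch assuming a simple `SourceFile` shape:

```typescript
const CHUNK_BYTE_BUDGET = 20 * 1024 * 1024; // 20 MB of source per chunk

interface SourceFile {
  path: string;
  bytes: number;
}

// Greedy chunking: accumulate files until the budget would overflow,
// then start a new chunk. 224 MB of source yields 12 chunks at 20 MB.
function groupIntoChunks(files: SourceFile[]): SourceFile[][] {
  const chunks: SourceFile[][] = [];
  let current: SourceFile[] = [];
  let used = 0;
  for (const file of files) {
    if (current.length > 0 && used + file.bytes > CHUNK_BYTE_BUDGET) {
      chunks.push(current);
      current = [];
      used = 0;
    }
    current.push(file);
    used += file.bytes;
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}
```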
The warnedCircular set (used by the COPY expansion engine) is shared across all files in a chunk. This prevents the same circular copybook warning (e.g., ANAZI includes itself) from being logged thousands of times.
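A sketch of the chunk-scoped dedupe (the function name is hypothetical):

```typescript
// One Set per chunk: a self-including copybook like ANAZI warns once,
// not once per file that pulls it in.
const warnedCircular = new Set<string>();

function warnCircularOnce(copybookName: string): void {
  if (warnedCircular.has(copybookName)) return;
  warnedCircular.add(copybookName);
  console.warn(`[cobol] circular COPY detected for ${copybookName}; skipping expansion`);
}
```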
| Limitation | Impact | Workaround |
|---|---|---|
| tree-sitter-cobol hangs on ~5% of files | Cannot use tree-sitter for COBOL | Regex-only extraction (current approach) |
| Data item cap (500/file) | May miss deeply nested items in large programs | Increase MAX_DATA_ITEMS_PER_FILE in source |
| Circular copybooks (ANAZI, ANDIP, QDIPE) | Self-referential includes cannot be expanded | Detected and skipped with warning |
| wfproc/ files may not be pure COBOL | Workflow files may produce extraction noise | Exclude wfproc from GITNEXUS_COBOL_DIRS if problematic |
| No MOVE DATA_FLOW edges yet | Data flow between variables not in graph | Reserved for future release |
| Continuation line handling | Some complex multi-line continuations (especially in string literals spanning 3+ lines) may not merge correctly | Known edge case; affects <0.1% of lines |
| Single-line EXEC blocks | EXEC SQL SELECT ... END-EXEC on one line is handled, but pathological nesting is not | Extremely rare in practice |
| Extension case sensitivity | .GNM and .gnm are matched differently | Use the exact case from the codebase |
```
[pipeline] COPY expansion failed for s/BGTABFL: Cannot read properties of null
```
Cause: A copybook referenced by a COPY statement cannot be found.
Fix:
- Verify that GITNEXUS_COBOL_DIRS includes the directory containing the copybooks (typically `c`).
- Check that the copybook directory is not excluded by `.gitignore`.

```
Worker 3 sub-batch timed out after 120s (chunk: 200 items)
```
Cause: A sub-batch took longer than the timeout. Typically happens when one file is extremely large (50,000+ lines after COPY expansion).
Fix: Increase the timeout:
```bash
GITNEXUS_WORKER_TIMEOUT_MS=300000 gitnexus analyze
```
```
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory
```
Fix: Increase Node.js heap size:
```bash
node --max-old-space-size=16384 /path/to/gitnexus/dist/cli/index.js analyze
```
For very large repos (>500MB source), consider --max-old-space-size=32768.
Rule: Only ONE gitnexus analyze process should run at a time per repository. Concurrent writes to KuzuDB corrupt the database.
If corruption occurs:
```bash
# Remove the KuzuDB directory and re-index
rm -rf .gitnexus/kuzu
gitnexus analyze --force
```
The KuzuDB write phase (132s for PROJECT-NAME) is the bottleneck for large COBOL repos. This is proportional to the number of nodes and edges being written. Reducing MAX_DATA_ITEMS_PER_FILE or excluding non-essential directories from GITNEXUS_COBOL_DIRS can help.
Enable verbose logging to see per-phase timing and statistics:
```bash
GITNEXUS_VERBOSE=1 gitnexus analyze
```
This outputs per-phase timing and summary statistics for each stage.

Relevant source files:
- `gitnexus/src/core/ingestion/workers/worker-pool.ts` -- `DEFAULT_SUB_BATCH_SIZE`, `SUB_BATCH_TIMEOUT_MS`, `WORKER_STARTUP_TIMEOUT_MS`
- `gitnexus/src/core/ingestion/pipeline.ts` -- `CHUNK_BYTE_BUDGET`, COBOL sub-batch configuration, chunk lifecycle
- `gitnexus/src/core/ingestion/workers/parse-worker.ts` -- `MAX_DATA_ITEMS_PER_FILE`, `processCobolRegexOnly()`
- `gitnexus/src/core/ingestion/parsing-processor.ts` -- sequential fallback for `MAX_DATA_ITEMS_PER_FILE`