COBOL Code Indexing

docs/code-indexing/cobol/README.md


GitNexus indexes COBOL codebases using a regex-only extraction strategy, bypassing tree-sitter entirely. This document explains why, how the pipeline works, and links to detailed sub-documents.

Why Regex-Only?

The tree-sitter-cobol grammar (v0.0.1) has three critical limitations that make it unusable for production indexing:

| Issue | Impact | Severity |
| --- | --- | --- |
| External scanner hangs on ~5% of files | No timeout mechanism exists for the C scanner; the process blocks indefinitely | Blocking |
| Only ~15% of paragraph headers detected | Most procedure-division paragraphs are invisible to the grammar | High |
| Patch markers in cols 1-6 cause parse errors | Enterprise COBOL uses non-standard sequence area content (e.g., mzADD, estero, #FIX) | High |

Because a hang inside the external scanner cannot be interrupted (there is no setTimeoutMicros equivalent that reaches into the C scanner), using tree-sitter-cobol would stall the indexing pipeline on a non-trivial fraction of real-world files.

The regex-only approach provides:

  • Speed: ~1ms per file average extraction time
  • Reliability: zero hangs, zero crashes across 13,000+ files
  • Coverage: captures all critical symbols -- program name, paragraphs, sections, CALL, PERFORM, COPY, data items (01-77, 88-level), file declarations, FD entries, EXEC SQL/CICS blocks, ENTRY points, and MOVE statements
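
To make the line-oriented idea concrete, here is a minimal sketch of a single-pass extractor covering a handful of the symbols listed above. It is not the code in cobol-preprocessor.ts: the `CobolSymbols` shape, the function name `extractCobolSymbols`, and every regex are assumptions chosen for illustration, and the sketch assumes fixed-format source (indicator in column 7, code in columns 8-72).

```typescript
// Hypothetical result shape; the real extractor's types are not shown in this document.
interface CobolSymbols {
  programName: string | null;
  sections: string[];
  paragraphs: string[];
  performs: string[];
  calls: string[];
  copies: string[];
}

// All patterns run against the code area (columns 8-72) of a fixed-format line.
const DIVISION = /^\s*([A-Z]+)\s+DIVISION\b/i;
const PROGRAM_ID = /^\s*PROGRAM-ID\.\s*([A-Z0-9-]+)/i;
// Headers start in Area A (column 8), so they are anchored with no leading space.
const SECTION = /^([A-Z0-9][A-Z0-9-]*)\s+SECTION\s*\.\s*$/i;
const PARAGRAPH = /^([A-Z0-9][A-Z0-9-]*)\s*\.\s*$/i;
const PERFORM = /(?:^|[^A-Z0-9-])PERFORM\s+([A-Z0-9][A-Z0-9-]*)/i;
const CALL = /(?:^|[^A-Z0-9-])CALL\s+["']([^"']+)["']/i;
const COPY = /(?:^|[^A-Z0-9-])COPY\s+["']?([A-Z0-9][A-Z0-9-]*)/i;
// Words that follow PERFORM in inline loops rather than naming a paragraph.
const INLINE_PERFORM = new Set(["UNTIL", "VARYING", "WITH", "TEST"]);

function extractCobolSymbols(source: string): CobolSymbols {
  const out: CobolSymbols = {
    programName: null, sections: [], paragraphs: [],
    performs: [], calls: [], copies: [],
  };
  let inProcedure = false; // the only state this sketch tracks

  for (const raw of source.split(/\r?\n/)) {
    if (raw.length < 8) continue;                    // nothing in the code area
    if (raw[6] === "*" || raw[6] === "/") continue;  // comment indicator
    const code = raw.slice(7, 72);

    const div = DIVISION.exec(code);
    if (div) { inProcedure = div[1].toUpperCase() === "PROCEDURE"; continue; }

    const pid = PROGRAM_ID.exec(code);
    if (pid) { out.programName = pid[1].toUpperCase(); continue; }

    const copy = COPY.exec(code);
    if (copy) out.copies.push(copy[1].toUpperCase());

    if (!inProcedure) continue; // paragraphs and calls only exist in PROCEDURE DIVISION

    const sec = SECTION.exec(code);
    if (sec) { out.sections.push(sec[1].toUpperCase()); continue; }
    const para = PARAGRAPH.exec(code);
    if (para) { out.paragraphs.push(para[1].toUpperCase()); continue; }

    const perf = PERFORM.exec(code);
    if (perf && !INLINE_PERFORM.has(perf[1].toUpperCase())) {
      out.performs.push(perf[1].toUpperCase());
    }
    const call = CALL.exec(code);
    if (call) out.calls.push(call[1].toUpperCase());
  }
  return out;
}

// Usage on a small fixed-format sample.
const sample = [
  "000100 IDENTIFICATION DIVISION.",
  "000200 PROGRAM-ID. PAYROLL.",
  "000300 DATA DIVISION.",
  "000400 WORKING-STORAGE SECTION.",
  "000500     COPY CUSTREC.",
  "000600 PROCEDURE DIVISION.",
  "000700 MAIN-LOGIC SECTION.",
  "000800 START-RUN.",
  "000900     PERFORM LOAD-DATA",
  "001000     CALL 'TAXCALC' USING WS-AMT",
  "001100     GOBACK.",
  "001200* PERFORM IGNORED-PARA",
  "001300 LOAD-DATA.",
  "001400     EXIT.",
].join("\n");

const symbols = extractCobolSymbols(sample);
console.log(symbols);
```

Anchoring the paragraph pattern at column 8 is what keeps one-word statements in Area B, such as `GOBACK.`, from being mistaken for paragraph headers. Data items, EXEC blocks, FD entries, and dynamic CALLs are deliberately left out of this sketch.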

Architecture

mermaid
flowchart TD
    A[Repository Scan] --> B{File Detection}
    B -->|Extension match| C[COBOL file]
    B -->|GITNEXUS_COBOL_DIRS match| C
    B -->|No match| Z[Skip]

    C --> D{Copybook?}
    D -->|Yes| E[Add to Copybook Map]
    D -->|No| F[Source Program]

    E --> G[COPY Expansion Engine]
    F --> G

    G -->|Inline copybook content| H[Expanded Source]
    H --> I[Patch Marker Cleanup]
    I --> J[Regex State Machine]

    J --> K[Extracted Symbols]
    K --> L[Graph Model Builder]
    L --> M[Knowledge Graph]

    subgraph "Per-Chunk Processing"
        G
        H
        I
        J
        K
        L
    end

    subgraph "Post-Processing"
        M --> N[Community Detection]
        M --> O[Process Detection]
        M --> P[Contract Detection]
    end

    style J fill:#e8f5e9,stroke:#2e7d32
    style G fill:#e3f2fd,stroke:#1565c0
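
The two pre-processing stages in the diagram, COPY expansion followed by patch marker cleanup, can be sketched roughly as below. This is an illustration under stated assumptions, not the code in cobol-copy-expander.ts: the names `expandCopies`, `cleanPatchMarkers`, and `CopybookMap` are hypothetical, the depth cap of 8 is arbitrary, and REPLACING clauses are not handled (such statements are simply left in place).

```typescript
// Hypothetical copybook map: upper-cased copybook name -> file content.
type CopybookMap = Map<string, string>;

// Matches a plain "COPY NAME." statement in the code area (columns 8-72).
const COPY_LINE = /^\s*COPY\s+["']?([A-Z0-9][A-Z0-9-]*)["']?\s*\.\s*$/i;

// Inline copybooks recursively. `active` holds the copybooks currently being
// expanded, so a copybook that (indirectly) includes itself is not re-entered.
function expandCopies(
  source: string,
  copybooks: CopybookMap,
  active: Set<string> = new Set<string>(),
  maxDepth: number = 8, // assumed cap, not taken from the source
): string {
  const out: string[] = [];
  for (const line of source.split(/\r?\n/)) {
    const isComment = line[6] === "*" || line[6] === "/";
    const m = isComment ? null : COPY_LINE.exec(line.slice(7, 72));
    if (!m) { out.push(line); continue; }

    const name = m[1].toUpperCase();
    const body = copybooks.get(name);
    if (body === undefined || active.has(name) || active.size >= maxDepth) {
      out.push(line); // unresolved, cyclic, or too deep: keep the statement as-is
      continue;
    }
    active.add(name);
    out.push(expandCopies(body, copybooks, active, maxDepth));
    active.delete(name);
  }
  return out.join("\n");
}

// Blank the sequence area (columns 1-6) so markers such as "mzADD" or "#FIX"
// cannot interfere with later pattern matching. Line length is preserved.
function cleanPatchMarkers(line: string): string {
  return line.length > 6 ? "      " + line.slice(6) : "";
}

// Helper for the example: pad the sequence area + indicator to 7 columns.
const fixed = (seq: string, code: string): string =>
  (seq + "       ").slice(0, 7) + code;

const copybookMap: CopybookMap = new Map([
  ["A", [fixed("", "01 REC-A."), fixed("", "COPY B.")].join("\n")],
  ["B", [fixed("", "05 FLD-B PIC X."), fixed("", "COPY A.")].join("\n")], // cycle back to A
]);

const program = [
  fixed("mzADD", "COPY A."),
  fixed("#FIX", "    DISPLAY 'X'."),
].join("\n");

const expanded = expandCopies(program, copybookMap);
const cleaned = expanded.split("\n").map(cleanPatchMarkers);
console.log(cleaned.join("\n"));
```

Tracking the active expansion chain in a set, rather than a global "already seen" list, lets the same copybook be included legitimately in several places while still stopping a copybook from re-entering itself.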

COBOL vs Tree-Sitter Languages

| Feature | COBOL (Regex) | Tree-Sitter Languages |
| --- | --- | --- |
| Parser | Single-pass regex state machine | tree-sitter grammar + queries |
| Speed | ~1ms/file | ~5ms/file |
| AST available | No | Yes |
| COPY expansion | Yes (pre-processing step) | N/A |
| Deep indexing | Data items, SQL, CICS, FD, ENTRY | Type annotations, generics, etc. |
| Call extraction | PERFORM (intra-file) + CALL (cross-program) | AST-based call site detection |
| Import extraction | COPY statements | import/require/use/#include |
| Coverage | All critical symbols | Language-dependent query coverage |
| Failure mode | Never hangs | External scanner can hang (COBOL only) |
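
The split between PERFORM (intra-file) and CALL (cross-program) in the call extraction row implies two different resolution scopes. The sketch below shows one way that could work: PERFORM targets are looked up among the paragraphs of the same program, while CALL targets are looked up in an index of program names across all files. The `ProgramSymbols` and `Edge` shapes, the edge kind names, and `buildCallEdges` are all hypothetical and do not reflect the actual graph model.

```typescript
// Hypothetical per-program extraction result.
interface ProgramSymbols {
  file: string;
  programName: string;
  paragraphs: string[];
  performs: Array<{ from: string; target: string }>; // from = enclosing paragraph
  calls: Array<{ from: string; target: string }>;
}

// Hypothetical edge shape; the real edge types live in the Graph Model document.
interface Edge {
  kind: "PERFORMS" | "CALLS";
  from: string;
  to: string;
  resolved: boolean;
}

function buildCallEdges(programs: ProgramSymbols[]): Edge[] {
  // Cross-program index: program name -> program.
  const byName = new Map<string, ProgramSymbols>();
  for (const p of programs) byName.set(p.programName.toUpperCase(), p);

  const edges: Edge[] = [];
  for (const p of programs) {
    // PERFORM: resolve only against paragraphs of the same file.
    const local = new Set(p.paragraphs.map((n) => n.toUpperCase()));
    for (const { from, target } of p.performs) {
      const name = target.toUpperCase();
      edges.push({
        kind: "PERFORMS",
        from: `${p.file}#${from}`,
        to: `${p.file}#${name}`,
        resolved: local.has(name),
      });
    }
    // CALL: resolve against every known program name.
    for (const { from, target } of p.calls) {
      const callee = byName.get(target.toUpperCase());
      edges.push({
        kind: "CALLS",
        from: `${p.file}#${from}`,
        to: callee ? callee.file : `unresolved:${target.toUpperCase()}`,
        resolved: callee !== undefined,
      });
    }
  }
  return edges;
}

const edges = buildCallEdges([
  {
    file: "payroll.cbl",
    programName: "PAYROLL",
    paragraphs: ["START-RUN", "LOAD-DATA"],
    performs: [
      { from: "START-RUN", target: "LOAD-DATA" },
      { from: "START-RUN", target: "MISSING-PARA" },
    ],
    calls: [
      { from: "START-RUN", target: "TAXCALC" },
      { from: "LOAD-DATA", target: "UNKNOWN" },
    ],
  },
  { file: "taxcalc.cbl", programName: "TAXCALC", paragraphs: ["MAIN"], performs: [], calls: [] },
]);
console.log(edges);
```

Keeping unresolved targets as flagged edges, instead of dropping them, is a design choice made for this sketch: it leaves room for a later pass to resolve calls into programs that are indexed afterwards.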

Sub-Documents

| Document | Description |
| --- | --- |
| File Detection | Extension mapping, GITNEXUS_COBOL_DIRS, copybook classification |
| COPY Expansion | Copybook inlining, REPLACING transformations, cycle detection |
| Regex Extraction | State machine, regex patterns, line processing |
| Deep Indexing | Data items, EXEC SQL/CICS, file declarations, FD, ENTRY, MOVE |
| Graph Model | COBOL-specific node types, edge types, full annotated example |
| Performance | Benchmarks, worker pool tuning, caps, troubleshooting |

Key Source Files

| File | Purpose |
| --- | --- |
| gitnexus/src/core/ingestion/cobol-preprocessor.ts | Patch marker cleanup + regex extraction engine |
| gitnexus/src/core/ingestion/cobol-copy-expander.ts | COPY statement expansion with REPLACING |
| gitnexus/src/core/ingestion/utils.ts | getLanguageFromPath, getLanguageFromFilename |
| gitnexus/src/core/ingestion/pipeline.ts | isCobolCopybook, expandCobolCopies, detectCrossProgamContracts |
| gitnexus/src/core/ingestion/workers/parse-worker.ts | processCobolRegexOnly -- graph model builder |
| gitnexus/src/core/ingestion/workers/worker-pool.ts | Configurable sub-batch size for COBOL |