Back to Gitnexus

COBOL Deep Indexing

docs/code-indexing/cobol/deep-indexing.md

1.6.311.7 KB
Original Source

COBOL Deep Indexing

Beyond basic symbol extraction (program name, paragraphs, CALL, PERFORM, COPY), GitNexus performs deep indexing of COBOL-specific constructs: data items, EXEC SQL/CICS blocks, file declarations, FD entries, ENTRY points, and MOVE statements.

Data Items

Level Numbers

Level RangeMeaningGraph Node Type
01Record (group item)Record
02-49Elementary/group itemsProperty
66RENAMESProperty
77Independent itemProperty
88Condition nameConst

FILLER items are skipped (no useful name for the graph).

Clauses Parsed

The parseDataItemClauses() function extracts these clauses from the trailing text of a data item declaration:

ClausePatternExample
PIC / PICTURE\bPIC(?:TURE)?\s+(?:IS\s+)?(\S+)PIC X(30), PICTURE IS 9(5)V99
USAGE\bUSAGE\s+(?:IS\s+)?(COMP|BINARY|...)USAGE IS COMP-3, BINARY
REDEFINES\bREDEFINES\s+([A-Z][A-Z0-9-]+)REDEFINES WK-DATE-NUM
OCCURS\bOCCURS\s+(\d+)OCCURS 12 TIMES

Standalone COMP variants (without the USAGE keyword) are also detected: COMP, COMP-1 through COMP-6, COMP-X, BINARY, PACKED-DECIMAL.

Data Hierarchy

Data items form a hierarchical structure based on level numbers. The extractor uses a stack algorithm:

Processing order:
  01 WK-RECORD          -> push {01, WK-RECORD}   -> parent: Module
  05 WK-NAME            -> push {05, WK-NAME}     -> parent: WK-RECORD (01 < 05)
  10 WK-FIRST           -> push {10, WK-FIRST}    -> parent: WK-NAME (05 < 10)
  10 WK-LAST            -> pop WK-FIRST, push      -> parent: WK-NAME (05 < 10)
  05 WK-CODE            -> pop WK-LAST, WK-NAME    -> parent: WK-RECORD (01 < 05)
  88 WK-ACTIVE          -> (88 handled separately)  -> parent: WK-CODE

The stack maintains items where each entry's level is strictly less than the next. When a new item arrives with a level <= the top of stack, items are popped until the stack top has a smaller level. A CONTAINS edge is created from the stack top to the new item.

For 88-level condition names, the parent is the immediately preceding non-88 data item (found by scanning backwards).

Annotated Example

cobol
       01  WK-EMPLOYEE.
           05  WK-EMP-ID          PIC 9(6).
           05  WK-EMP-NAME        PIC X(30).
           05  WK-EMP-STATUS      PIC X(01).
               88  WK-ACTIVE      VALUE "A".
               88  WK-INACTIVE    VALUE "I".
           05  WK-SALARY          PIC 9(7)V99 COMP-3.
           05  WK-DEPT            PIC X(04) OCCURS 3 TIMES.

Produces:

  • Record node: WK-EMPLOYEE (level 01, section: working-storage)
  • Property nodes: WK-EMP-ID, WK-EMP-NAME, WK-EMP-STATUS, WK-SALARY, WK-DEPT
  • Const nodes: WK-ACTIVE (values: A), WK-INACTIVE (values: I)
  • CONTAINS edges: WK-EMPLOYEE -> WK-EMP-ID, WK-EMPLOYEE -> WK-EMP-NAME, etc.
  • CONTAINS edges: WK-EMP-STATUS -> WK-ACTIVE, WK-EMP-STATUS -> WK-INACTIVE

Data Item Cap

A maximum of 500 data items per file (MAX_DATA_ITEMS_PER_FILE) are processed. Some COBOL programs (especially after COPY expansion) can have 10,000+ data items, which would cause graph bloat and push the V8 relationship Map past its 16.7M entry limit across thousands of files.

The cap applies after extraction: the first 500 items in source order are kept. Since 01-level records appear first, critical top-level structure is preserved.

EXEC SQL

EXEC SQL blocks are accumulated across lines between EXEC SQL and END-EXEC, then parsed as a unit.

Operation Classification

The first SQL keyword determines the operation:

First KeywordOperation
SELECTSELECT
INSERTINSERT
UPDATEUPDATE
DELETEDELETE
DECLAREDECLARE
OPENOPEN
CLOSECLOSE
FETCHFETCH
(anything else)OTHER

Table Extraction

Tables are extracted from SQL clauses:

Clause PatternExample
FROM <table>SELECT * FROM EMPLOYEES
INSERT INTO <table>INSERT INTO EMPLOYEES
UPDATE <table>UPDATE EMPLOYEES SET ...
JOIN <table>LEFT JOIN DEPARTMENTS ON ...

Note: The INTO pattern is restricted to INSERT INTO to avoid false positives from FETCH ... INTO :host-var and SELECT ... INTO :host-var statements, where INTO introduces host variables rather than table names.

Cursor Detection

cobol
           EXEC SQL
               DECLARE C-EMPLOYEES CURSOR FOR
               SELECT EMP-ID, EMP-NAME FROM EMPLOYEES
               WHERE DEPT = :WK-DEPT
           END-EXEC

Extracts: cursor C-EMPLOYEES, table EMPLOYEES, host variable WK-DEPT.

Host Variables

Host variables are COBOL variables referenced in SQL with a : prefix. The colon is stripped:

sql
WHERE EMP-ID = :WK-EMP-ID AND DEPT = :WK-DEPT

Extracts: WK-EMP-ID, WK-DEPT.

Graph Output

  • CodeElement node per table, with description sql-table op:{OP}
  • CodeElement node per cursor, with description sql-cursor
  • ACCESSES edge from Module to each CodeElement
  • Deduplication: if the same table appears in multiple SQL blocks, only one node is created

EXEC CICS

EXEC CICS blocks are accumulated and parsed similarly to SQL blocks.

Command Detection

Two-word commands are detected first (matched against the block start):

SEND MAP, RECEIVE MAP, SEND TEXT, SEND CONTROL, READ NEXT, READ PREV

If no two-word command matches, the first word is used (e.g., LINK, XCTL, RETURN, READ, WRITE).

Extraction

ElementPatternExample
MAP nameMAP('name') or MAP("name")EXEC CICS SEND MAP('EMPMENU')
PROGRAM namePROGRAM('name') or PROGRAM("name")EXEC CICS LINK PROGRAM('BGTABUP')
TRANSIDTRANSID('name') or TRANSID("name")EXEC CICS START TRANSID('EMP1')

Graph Output

  • MAP: CodeElement node with description cics-map cmd:{CMD} + ACCESSES edge from Module
  • PROGRAM: CALLS edge (cross-program call via CICS LINK/XCTL)
  • TRANSID: CodeElement node with description cics-transid cmd:{CMD} + ACCESSES edge from Module

Annotated Example

cobol
           EXEC CICS
               SEND MAP('EMPMENU')
               MAPSET('EMPSET')
               FROM(WK-MAP-DATA)
               ERASE
           END-EXEC

Produces:

  • CodeElement node: EMPMENU (description: cics-map cmd:SEND MAP)
  • ACCESSES edge: Module -> EMPMENU

File Declarations

SELECT statements in the INPUT-OUTPUT SECTION are accumulated across multiple lines (until a period terminator) and parsed for:

ClausePatternExample
SELECTSELECT <name>SELECT MASTER-FILE
ASSIGNASSIGN TO <file>ASSIGN TO "MASTER.DAT"
ORGANIZATIONORGANIZATION IS <type>ORGANIZATION IS INDEXED
ACCESSACCESS MODE IS <mode>ACCESS MODE IS DYNAMIC
RECORD KEYRECORD KEY IS <field>RECORD KEY IS WK-EMP-ID
FILE STATUSFILE STATUS IS <field>FILE STATUS IS WK-FILE-STATUS

Graph Output

  • CodeElement node with description containing all parsed clauses (e.g., select org:INDEXED access:DYNAMIC key:WK-EMP-ID status:WK-FILE-STATUS assign:MASTER.DAT)
  • RECORD_KEY_OF edge: from Property node to CodeElement (confidence 0.8)
  • FILE_STATUS_OF edge: from Property node to CodeElement (confidence 0.8)

FD Entries

FD (File Description) entries associate a file name with its record layout:

cobol
       FD  MASTER-FILE.
       01  MASTER-RECORD.
           05  MR-EMP-ID       PIC 9(6).
           05  MR-EMP-NAME     PIC X(30).

The extractor tracks pendingFdName state: when an FD line is seen, the next 01-level data item becomes its record.

Graph Output

  • CodeElement node with description fd record:{recordName}
  • CONTAINS edge: FD CodeElement -> Record node
  • CONTAINS edge: SELECT CodeElement -> FD CodeElement (linking file declaration to file description)

ENTRY Points

The ENTRY statement defines additional entry points into a COBOL program (in addition to the main program entry):

cobol
       ENTRY "SUBPROG" USING WK-PARAM-1 WK-PARAM-2.

Graph Output

  • Constructor node with description entry params:{param1},{param2} (or just entry if no parameters)
  • CONTAINS edge: Module -> Constructor
  • Symbol table entry (so the entry point is discoverable by name)

PROCEDURE DIVISION USING

cobol
       PROCEDURE DIVISION USING WK-INPUT-REC WK-OUTPUT-REC.

The USING clause identifies parameters received by the program from its caller.

Graph Output

  • RECEIVES edge: Module -> Property (for each parameter name, confidence 0.8)

MOVE Statements

MOVE statements produce ACCESSES edges in the graph:

cobol
       MOVE WK-NAME TO OUT-NAME.
       MOVE CORRESPONDING WK-INPUT TO WK-OUTPUT.
       MOVE CORR WK-IN TO WK-OUT.

Extraction Details

  • Source and target identifiers are captured
  • CORRESPONDING and its abbreviation CORR are both recognized (bulk field-by-field move)
  • Figurative constants (SPACES, ZEROS, LOW-VALUES, HIGH-VALUES, QUOTES, ALL) are skipped
  • The enclosing paragraph (caller) is tracked for context

MOVE CORRESPONDING / CORR Edge Reasons

MOVE CORRESPONDING (and CORR) produces distinct edge reasons to differentiate from simple MOVE:

EdgeReason (simple MOVE)Reason (CORRESPONDING/CORR)
Read (source)cobol-move-readcobol-move-corresponding-read
Write (target)cobol-move-writecobol-move-corresponding-write

This distinction allows queries to find bulk field-by-field moves separately from simple variable assignments.

GO TO DEPENDING ON

The GO TO statement with multiple targets and a DEPENDING ON clause is a computed branch:

cobol
       GO TO PARA-1 PARA-2 PARA-3
           DEPENDING ON WK-SELECTOR.

All target paragraph names are extracted and emitted as separate gotos entries. Each target produces a CALLS edge in the graph (same semantics as PERFORM). The DEPENDING ON variable is not currently tracked as a data-flow dependency.

SORT INPUT/OUTPUT PROCEDURE

SORT and MERGE statements can specify procedural entry points instead of file-based I/O:

cobol
       SORT SORT-FILE ON ASCENDING KEY SORT-KEY
           INPUT PROCEDURE IS PREPARE-INPUT
           OUTPUT PROCEDURE IS FORMAT-OUTPUT.

INPUT PROCEDURE IS and OUTPUT PROCEDURE IS targets are extracted as control-flow targets (same as PERFORM). They produce performs entries and corresponding CALLS edges in the graph.

Fixed-Format Literal Continuation

In fixed-format COBOL, string literals can span multiple lines using the continuation indicator (- in column 7). When a continuation line starts with a quote character, the extractor joins it with the predecessor by removing the trailing quote from the previous line and the opening quote from the continuation:

Line N:          MOVE "THIS IS A LONG STRI
Line N+1 (cont): -    "NG VALUE" TO WK-FIELD.
Merged:          MOVE "THIS IS A LONG STRING VALUE" TO WK-FIELD.

The trailing " on line N and the opening " on line N+1 are both removed, producing a seamless literal. If no matching quote is found on the predecessor line, the continuation is appended as-is.

Source Files

  • gitnexus/src/core/ingestion/cobol-preprocessor.ts -- All extraction logic, clause parsers, EXEC block parsers
  • gitnexus/src/core/ingestion/workers/parse-worker.ts -- processCobolRegexOnly(), graph node/edge emission
  • gitnexus/src/core/ingestion/parsing-processor.ts -- Sequential fallback with same MAX_DATA_ITEMS_PER_FILE cap