docs/code-indexing/cobol/deep-indexing.md
Beyond basic symbol extraction (program name, paragraphs, CALL, PERFORM, COPY), GitNexus performs deep indexing of COBOL-specific constructs: data items, EXEC SQL/CICS blocks, file declarations, FD entries, ENTRY points, and MOVE statements.
| Level Range | Meaning | Graph Node Type |
|---|---|---|
| 01 | Record (group item) | Record |
| 02-49 | Elementary/group items | Property |
| 66 | RENAMES | Property |
| 77 | Independent item | Property |
| 88 | Condition name | Const |
FILLER items are skipped (no useful name for the graph).
The parseDataItemClauses() function extracts these clauses from the trailing text of a data item declaration:
| Clause | Pattern | Example |
|---|---|---|
PIC / PICTURE | \bPIC(?:TURE)?\s+(?:IS\s+)?(\S+) | PIC X(30), PICTURE IS 9(5)V99 |
USAGE | \bUSAGE\s+(?:IS\s+)?(COMP|BINARY|...) | USAGE IS COMP-3, BINARY |
REDEFINES | \bREDEFINES\s+([A-Z][A-Z0-9-]+) | REDEFINES WK-DATE-NUM |
OCCURS | \bOCCURS\s+(\d+) | OCCURS 12 TIMES |
Standalone COMP variants (without the USAGE keyword) are also detected: COMP, COMP-1 through COMP-6, COMP-X, BINARY, PACKED-DECIMAL.
Data items form a hierarchical structure based on level numbers. The extractor uses a stack algorithm:
Processing order:
01 WK-RECORD -> push {01, WK-RECORD} -> parent: Module
05 WK-NAME -> push {05, WK-NAME} -> parent: WK-RECORD (01 < 05)
10 WK-FIRST -> push {10, WK-FIRST} -> parent: WK-NAME (05 < 10)
10 WK-LAST -> pop WK-FIRST, push -> parent: WK-NAME (05 < 10)
05 WK-CODE -> pop WK-LAST, WK-NAME -> parent: WK-RECORD (01 < 05)
88 WK-ACTIVE -> (88 handled separately) -> parent: WK-CODE
The stack maintains items where each entry's level is strictly less than the next. When a new item arrives with a level <= the top of stack, items are popped until the stack top has a smaller level. A CONTAINS edge is created from the stack top to the new item.
For 88-level condition names, the parent is the immediately preceding non-88 data item (found by scanning backwards).
01 WK-EMPLOYEE.
05 WK-EMP-ID PIC 9(6).
05 WK-EMP-NAME PIC X(30).
05 WK-EMP-STATUS PIC X(01).
88 WK-ACTIVE VALUE "A".
88 WK-INACTIVE VALUE "I".
05 WK-SALARY PIC 9(7)V99 COMP-3.
05 WK-DEPT PIC X(04) OCCURS 3 TIMES.
Produces:
Record node: WK-EMPLOYEE (level 01, section: working-storage)Property nodes: WK-EMP-ID, WK-EMP-NAME, WK-EMP-STATUS, WK-SALARY, WK-DEPTConst nodes: WK-ACTIVE (values: A), WK-INACTIVE (values: I)CONTAINS edges: WK-EMPLOYEE -> WK-EMP-ID, WK-EMPLOYEE -> WK-EMP-NAME, etc.CONTAINS edges: WK-EMP-STATUS -> WK-ACTIVE, WK-EMP-STATUS -> WK-INACTIVEA maximum of 500 data items per file (MAX_DATA_ITEMS_PER_FILE) are processed. Some COBOL programs (especially after COPY expansion) can have 10,000+ data items, which would cause graph bloat and push the V8 relationship Map past its 16.7M entry limit across thousands of files.
The cap applies after extraction: the first 500 items in source order are kept. Since 01-level records appear first, critical top-level structure is preserved.
EXEC SQL blocks are accumulated across lines between EXEC SQL and END-EXEC, then parsed as a unit.
The first SQL keyword determines the operation:
| First Keyword | Operation |
|---|---|
SELECT | SELECT |
INSERT | INSERT |
UPDATE | UPDATE |
DELETE | DELETE |
DECLARE | DECLARE |
OPEN | OPEN |
CLOSE | CLOSE |
FETCH | FETCH |
| (anything else) | OTHER |
Tables are extracted from SQL clauses:
| Clause Pattern | Example |
|---|---|
FROM <table> | SELECT * FROM EMPLOYEES |
INSERT INTO <table> | INSERT INTO EMPLOYEES |
UPDATE <table> | UPDATE EMPLOYEES SET ... |
JOIN <table> | LEFT JOIN DEPARTMENTS ON ... |
Note: The INTO pattern is restricted to INSERT INTO to avoid false positives from FETCH ... INTO :host-var and SELECT ... INTO :host-var statements, where INTO introduces host variables rather than table names.
EXEC SQL
DECLARE C-EMPLOYEES CURSOR FOR
SELECT EMP-ID, EMP-NAME FROM EMPLOYEES
WHERE DEPT = :WK-DEPT
END-EXEC
Extracts: cursor C-EMPLOYEES, table EMPLOYEES, host variable WK-DEPT.
Host variables are COBOL variables referenced in SQL with a : prefix. The colon is stripped:
WHERE EMP-ID = :WK-EMP-ID AND DEPT = :WK-DEPT
Extracts: WK-EMP-ID, WK-DEPT.
CodeElement node per table, with description sql-table op:{OP}CodeElement node per cursor, with description sql-cursorACCESSES edge from Module to each CodeElementEXEC CICS blocks are accumulated and parsed similarly to SQL blocks.
Two-word commands are detected first (matched against the block start):
SEND MAP, RECEIVE MAP, SEND TEXT, SEND CONTROL, READ NEXT, READ PREV
If no two-word command matches, the first word is used (e.g., LINK, XCTL, RETURN, READ, WRITE).
| Element | Pattern | Example |
|---|---|---|
| MAP name | MAP('name') or MAP("name") | EXEC CICS SEND MAP('EMPMENU') |
| PROGRAM name | PROGRAM('name') or PROGRAM("name") | EXEC CICS LINK PROGRAM('BGTABUP') |
| TRANSID | TRANSID('name') or TRANSID("name") | EXEC CICS START TRANSID('EMP1') |
CodeElement node with description cics-map cmd:{CMD} + ACCESSES edge from ModuleCALLS edge (cross-program call via CICS LINK/XCTL)CodeElement node with description cics-transid cmd:{CMD} + ACCESSES edge from Module EXEC CICS
SEND MAP('EMPMENU')
MAPSET('EMPSET')
FROM(WK-MAP-DATA)
ERASE
END-EXEC
Produces:
CodeElement node: EMPMENU (description: cics-map cmd:SEND MAP)ACCESSES edge: Module -> EMPMENUSELECT statements in the INPUT-OUTPUT SECTION are accumulated across multiple lines (until a period terminator) and parsed for:
| Clause | Pattern | Example |
|---|---|---|
| SELECT | SELECT <name> | SELECT MASTER-FILE |
| ASSIGN | ASSIGN TO <file> | ASSIGN TO "MASTER.DAT" |
| ORGANIZATION | ORGANIZATION IS <type> | ORGANIZATION IS INDEXED |
| ACCESS | ACCESS MODE IS <mode> | ACCESS MODE IS DYNAMIC |
| RECORD KEY | RECORD KEY IS <field> | RECORD KEY IS WK-EMP-ID |
| FILE STATUS | FILE STATUS IS <field> | FILE STATUS IS WK-FILE-STATUS |
CodeElement node with description containing all parsed clauses (e.g., select org:INDEXED access:DYNAMIC key:WK-EMP-ID status:WK-FILE-STATUS assign:MASTER.DAT)RECORD_KEY_OF edge: from Property node to CodeElement (confidence 0.8)FILE_STATUS_OF edge: from Property node to CodeElement (confidence 0.8)FD (File Description) entries associate a file name with its record layout:
FD MASTER-FILE.
01 MASTER-RECORD.
05 MR-EMP-ID PIC 9(6).
05 MR-EMP-NAME PIC X(30).
The extractor tracks pendingFdName state: when an FD line is seen, the next 01-level data item becomes its record.
CodeElement node with description fd record:{recordName}CONTAINS edge: FD CodeElement -> Record nodeCONTAINS edge: SELECT CodeElement -> FD CodeElement (linking file declaration to file description)The ENTRY statement defines additional entry points into a COBOL program (in addition to the main program entry):
ENTRY "SUBPROG" USING WK-PARAM-1 WK-PARAM-2.
Constructor node with description entry params:{param1},{param2} (or just entry if no parameters)CONTAINS edge: Module -> Constructor PROCEDURE DIVISION USING WK-INPUT-REC WK-OUTPUT-REC.
The USING clause identifies parameters received by the program from its caller.
RECEIVES edge: Module -> Property (for each parameter name, confidence 0.8)MOVE statements produce ACCESSES edges in the graph:
MOVE WK-NAME TO OUT-NAME.
MOVE CORRESPONDING WK-INPUT TO WK-OUTPUT.
MOVE CORR WK-IN TO WK-OUT.
CORRESPONDING and its abbreviation CORR are both recognized (bulk field-by-field move)caller) is tracked for contextMOVE CORRESPONDING (and CORR) produces distinct edge reasons to differentiate from simple MOVE:
| Edge | Reason (simple MOVE) | Reason (CORRESPONDING/CORR) |
|---|---|---|
| Read (source) | cobol-move-read | cobol-move-corresponding-read |
| Write (target) | cobol-move-write | cobol-move-corresponding-write |
This distinction allows queries to find bulk field-by-field moves separately from simple variable assignments.
The GO TO statement with multiple targets and a DEPENDING ON clause is a computed branch:
GO TO PARA-1 PARA-2 PARA-3
DEPENDING ON WK-SELECTOR.
All target paragraph names are extracted and emitted as separate gotos entries. Each target produces a CALLS edge in the graph (same semantics as PERFORM). The DEPENDING ON variable is not currently tracked as a data-flow dependency.
SORT and MERGE statements can specify procedural entry points instead of file-based I/O:
SORT SORT-FILE ON ASCENDING KEY SORT-KEY
INPUT PROCEDURE IS PREPARE-INPUT
OUTPUT PROCEDURE IS FORMAT-OUTPUT.
INPUT PROCEDURE IS and OUTPUT PROCEDURE IS targets are extracted as control-flow targets (same as PERFORM). They produce performs entries and corresponding CALLS edges in the graph.
In fixed-format COBOL, string literals can span multiple lines using the continuation indicator (- in column 7). When a continuation line starts with a quote character, the extractor joins it with the predecessor by removing the trailing quote from the previous line and the opening quote from the continuation:
Line N: MOVE "THIS IS A LONG STRI
Line N+1 (cont): - "NG VALUE" TO WK-FIELD.
Merged: MOVE "THIS IS A LONG STRING VALUE" TO WK-FIELD.
The trailing " on line N and the opening " on line N+1 are both removed, producing a seamless literal. If no matching quote is found on the predecessor line, the continuation is appended as-is.
gitnexus/src/core/ingestion/cobol-preprocessor.ts -- All extraction logic, clause parsers, EXEC block parsersgitnexus/src/core/ingestion/workers/parse-worker.ts -- processCobolRegexOnly(), graph node/edge emissiongitnexus/src/core/ingestion/parsing-processor.ts -- Sequential fallback with same MAX_DATA_ITEMS_PER_FILE cap