gitnexus-claude-plugin/skills/gitnexus-taint-analysis/SKILL.md
Expert knowledge for the opt-in --pdg program-analysis subsystem: control-flow
graphs, reaching definitions, and intra- + inter-procedural taint. Read this
before touching gitnexus/src/core/ingestion/cfg/** or
gitnexus/src/core/ingestion/taint/**, or when explaining a finding.
- explain MCP tool's findings (intra- vs inter-procedural).
- `--pdg` output.

Taint runs on the graph, not beside it. Each layer is opt-in behind `--pdg`,
and a default analyze run is byte-identical (the golden parity gate is the
hard floor for every change here).

| Layer | Artifact | What it computes | Milestone |
|---|---|---|---|
| L1 | CFG | per-function basic blocks + control-flow edges | M1 #2081 |
| L2 | REACHING_DEF | GEN/KILL def→use data dependence (pure solver) | M2 #2082 |
| L3 | Taint (intra) | source→sink over RD facts, minus sanitizers | M3 #2083 |
| L4 | Taint (inter) | per-function summaries composed over CALLS | M4 #2084 |

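
To make the L2 row concrete, here is a minimal sketch of a GEN/KILL reaching-definitions solver over basic blocks. The `Def` and `Block` shapes and the `reachingDefinitions` name are assumptions made for illustration; they are not the project's CFG types or its `computeReachingDefs` signature.

```typescript
// Hypothetical shapes for illustration only; not the project's actual CFG types.
interface Def { id: number; variable: string }
interface Block { id: number; defs: Def[]; successors: number[] }

/**
 * Forward may-analysis: OUT[b] = GEN[b] ∪ (IN[b] − KILL[b]), where IN[b] is the
 * union of OUT[p] over predecessors p. Returns the sorted definition ids that
 * reach each block's entry.
 */
function reachingDefinitions(blocks: Block[]): Map<number, number[]> {
  const variableOf = new Map<number, string>();
  const preds = new Map<number, number[]>();
  const out = new Map<number, Set<number>>();
  for (const b of blocks) {
    preds.set(b.id, []);
    out.set(b.id, new Set<number>());
    for (const d of b.defs) variableOf.set(d.id, d.variable);
  }
  for (const b of blocks) {
    for (const s of b.successors) preds.get(s)?.push(b.id);
  }

  // IN[b]: union of the predecessors' current OUT sets.
  const entryOf = (b: Block): Set<number> => {
    const facts = new Set<number>();
    for (const p of preds.get(b.id) ?? []) {
      out.get(p)?.forEach((d) => facts.add(d));
    }
    return facts;
  };

  let changed = true;
  while (changed) {
    changed = false;
    for (const b of blocks) {
      const facts = entryOf(b);
      for (const d of b.defs) {
        // KILL: a new definition removes earlier definitions of the same variable.
        Array.from(facts).forEach((id) => {
          if (variableOf.get(id) === d.variable) facts.delete(id);
        });
        facts.add(d.id); // GEN
      }
      const prev = out.get(b.id)!;
      const same = facts.size === prev.size && Array.from(facts).every((d) => prev.has(d));
      if (!same) {
        out.set(b.id, facts);
        changed = true;
      }
    }
  }

  const result = new Map<number, number[]>();
  for (const b of blocks) {
    result.set(b.id, Array.from(entryOf(b)).sort((x, y) => x - y)); // sorted for determinism
  }
  return result;
}

// Diamond CFG: block 0 defines x (def 1), block 1 redefines x (def 2),
// block 2 leaves x alone, and both branches join at block 3.
const diamond: Block[] = [
  { id: 0, defs: [{ id: 1, variable: "x" }], successors: [1, 2] },
  { id: 1, defs: [{ id: 2, variable: "x" }], successors: [3] },
  { id: 2, defs: [], successors: [3] },
  { id: 3, defs: [], successors: [] },
];
const entryFacts = reachingDefinitions(diamond);
// entryFacts.get(3) is [1, 2]: both definitions of x reach the join.
```

The sketch has no I/O and sorts its output, so repeated runs over the same input produce the same result.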
- CFG data reaches the main thread through `ParsedFile.cfgSideChannel` (plain,
  structured-clone-safe data, never AST nodes). The main thread runs the pure
  solvers. NEVER re-parse on the main thread (re-introduces the #1983 OOM).
- L1 to L3 run inside the scope-resolution phase
  (`scope-resolution/pipeline/run.ts`, gated on `input.pdg === true`), because
  the disk-backed ParsedFile store is cleared when that phase ends; a
  standalone post-mro phase would read empty data. The cross-function fixpoint
  (L4) is the exception: it runs in its OWN registered phase (`taintSummaries`)
  AFTER scope-resolution, because it needs the COMPLETE call graph, and consumes
  small plain summary data threaded out via `ScopeResolutionOutput`.
- `computeReachingDefs`, `computeTaintFlows`, `harvestFunctionSummary`, and
  `solveInterprocTaint` are pure and deterministic (no graph, no I/O, no logger;
  sorted outputs). Snapshot tests and content-derived edge ids depend on that.

Intra-procedural taint (L3) is forward reachability over RD facts from matched
sources to matched sinks, killed by sanitizers. Key design points worth
internalizing:
- Telling `exec(escape(x))` (safe) from `exec(x)` (finding): the harvest records
  nested call structure (`SiteRecord.parent`/via-tags) so sanitizer
  interposition is precise.
- A sanitizer neutralizes a set of `SinkKind`s; a sink fires unless its kind is
  in the set. So `escape(req.body)` suppresses `res.send` (xss) but STILL fires
  `db.query` (sql); a kind-blind kill would be a suppressed live injection (the
  forbidden FN direction). `path.basename(t)` neutralizes path-traversal only,
  not command-injection.
- Each source that reaches a sink is its own finding
  (`exec(req.body, req.query)` is two findings).
- Findings are emitted as `TAINTED` edges (BasicBlock→BasicBlock); the path
  rides the `reason` column via the shared versioned codec
  (`taint/path-codec.ts`).

Inter-procedural taint (L4) follows the production approach (Sharir-Pnueli
1981; the same shape as Meta's Pysa and Mariana Trench, and FB Infer), NOT full
IFDS tabulation. Each function is reduced to a compact summary, and summaries
are composed over the already-resolved CALLS graph.
Summary shape (taint/summary-model.ts, whole-parameter granularity):
| Edge | Meaning | Analogue |
|---|---|---|
| param→return | a param flows to the return value | TITO; reserved (the floor already covers its recall; precision pass deferred) |
| param→callee-arg | a param flows into arg j of a call (carries the path's neutralized sink kinds) | TITO into callee |
| param→sink | a param reaches a modelled sink | partial/triggered sink |
| source→return | the function generates+returns a source | generative; composed via the caller's callResults |
| source→callee-arg | a generated source flows into a call | fixpoint SEED |
| callResults | a user-function call's result flows to a sink/return/callee-arg in the caller | composes with callee source→return |

The fixpoint (`taint/interproc-solver.ts`): the unit is (function, parameter,
source). Seed from source→callee-arg, propagate via param→callee-arg, fire a
finding when a tainted param meets param→sink.

- The state space is finite (fn × param × source), so the worklist converges: a
  recursive call just re-proposes an already-visited entry. SCC condensation
  would only refine processing order; correctness and termination don't
  require it.
- A visited key of only (fn, param) collapses multi-source flows: a sink param
  tainted by source A is marked visited and a later flow from source B is
  dropped before firing. This is the recurring multi-source bug class. (Bit M3;
  bit M4 U9.)
- Match each CALLS edge by CALLEE NAME, not call-site line: line-base parity
  (CFG 1-based vs reference site) is fragile; the callee identity is exact, and
  context-insensitivity taints the callee's param identically at every call
  site.
- Output is `TAINT_PATH` edges (Function→Function), with the function-level hop
  chain in `reason` via the same codec; confidence < the intra-procedural 1.0.

Context-insensitivity is the accepted trade-off at this tier: one summary per
function, return/call-site merging accepted (security-conservative). Expect
some FP from merging; the bigger FN sources are unmodeled features (below).
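
The seed/propagate/fire loop described above can be sketched as a small worklist solver. Everything here is an assumption made for illustration: the `FunctionSummary` fields, the `Finding` shape, and the `solveSketch` name are hypothetical and are not the exports of `taint/summary-model.ts` or `taint/interproc-solver.ts`. Neutralized sink kinds and `callResults` are left out to keep it short.

```typescript
// Every name below is a hypothetical illustration, not the project's API.
interface FunctionSummary {
  seeds: { source: string; callee: string; arg: number }[];   // source→callee-arg
  forwards: { param: number; callee: string; arg: number }[]; // param→callee-arg
  sinks: { param: number; kind: string }[];                   // param→sink
}
interface Finding { source: string; fn: string; param: number; kind: string }

function solveSketch(summaries: Map<string, FunctionSummary>): Finding[] {
  const visited = new Set<string>();
  const worklist: { fn: string; param: number; source: string }[] = [];
  const findings: Finding[] = [];

  const propose = (fn: string, param: number, source: string): void => {
    // The key includes the source. Keying on (fn, param) alone would drop
    // source B once source A had already visited the same parameter.
    const key = JSON.stringify([fn, param, source]);
    if (visited.has(key)) return;
    visited.add(key);
    worklist.push({ fn, param, source });
  };

  // Seed: every generated source that flows into a call argument.
  summaries.forEach((summary) => {
    for (const seed of summary.seeds) propose(seed.callee, seed.arg, seed.source);
  });

  while (worklist.length > 0) {
    const entry = worklist.pop()!;
    const summary = summaries.get(entry.fn);
    if (summary === undefined) continue; // unresolved callee: nothing to compose
    for (const sink of summary.sinks) {
      if (sink.param === entry.param) {
        findings.push({ source: entry.source, fn: entry.fn, param: entry.param, kind: sink.kind });
      }
    }
    for (const fwd of summary.forwards) {
      if (fwd.param === entry.param) propose(fwd.callee, fwd.arg, entry.source);
    }
  }

  // Sorted output keeps the result independent of worklist order.
  const order = (f: Finding): string => JSON.stringify([f.source, f.fn, f.param, f.kind]);
  return findings.sort((a, b) => (order(a) < order(b) ? -1 : order(a) > order(b) ? 1 : 0));
}

// Two handlers feed different sources into a self-recursive helper that
// forwards its parameter into a sink-bearing function.
const demo = new Map<string, FunctionSummary>([
  ["handlerA", { seeds: [{ source: "req.body", callee: "helper", arg: 0 }], forwards: [], sinks: [] }],
  ["handlerB", { seeds: [{ source: "req.query", callee: "helper", arg: 0 }], forwards: [], sinks: [] }],
  ["helper", {
    seeds: [],
    forwards: [{ param: 0, callee: "helper", arg: 0 }, { param: 0, callee: "run", arg: 0 }],
    sinks: [],
  }],
  ["run", { seeds: [], forwards: [], sinks: [{ param: 0, kind: "command-injection" }] }],
]);
const demoFindings = solveSketch(demo);
// demoFindings has two entries, one per source; the recursive edge terminates
// because its (fn, param, source) entry is already visited.
```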
The largest is closures/callbacks (arr.forEach(() => sink(y))) — taint
into a callback is dropped without per-library models (true of CodeQL's JS libs
too). Also deferred: field/property flows (obj.x = taint; sink(obj.y)),
field-sensitive access paths, guard-style sanitizers, implicit/control-dependence
flows, promise/async-await threading, and destructured/rest params before a
tainted simple param (the summary port index is the binding ordinal, not the
formal arg position — needs a formal-param index threaded from the worker
BindingEntry). The interprocedural join is also context-insensitive: when one
caller invokes two distinct same-named callees, a flow into one
over-attributes to both (sound — over-report, never a missed flow). Absence of a
finding is NOT proof of safety.
- `FunctionCfg.functionStartLine` is 1-based; Function/Method node `startLine`
  is 0-based, so join at `startLine - 1`. Function nodes have no column, so
  same-line functions (`{a:()=>x(), b:()=>y()}`) are ambiguous → drop (the
  summary driver counts unresolved) rather than cross-wire.
- Variable-length `[:TAINTED*]`/`[:TAINT_PATH*]` queries explode. `TAINT_PATH`
  is therefore MATERIALIZED + anchored at analyze time, never traversed live;
  `explain` reads it source-anchored + LIMIT-guarded.
- `explain` is the only discovery surface. `TAINTED`/`TAINT_PATH` are
  deliberately OUT of `VALID_RELATION_TYPES` (impact's allow-list) and the web
  schema (pinned in `security.test.ts`). `explain` enumerates both layers
  (cross-function findings carry `interprocedural: true`).
- The edge writers and `explain` import `taint/path-codec.ts`. Two hand-rolled
  copies of a wire format drift; never fork it. New metadata extends the format
  WITHIN the version when writer + reader ship together.
- Bump `pdg:N`, NOT `SCHEMA_BUMP` (which cold-invalidates every user).
  Persisted-graph/config changes ride `RepoMeta.pdg`'s key-union mismatch →
  full writeback. Model content rides `taintModelVersion`.

Edit the language model in taint/typescript-model.ts (registered via the
explicit registerBuiltinTaintModels seam, keyed by SupportedLanguages). The
spec is hashable data (no functions). A sanitizer's neutralizes lists the
EXACT sink kinds it defends — never a blanket kill. Add a fixture + assert the
finding (or its absence) in test/unit/taint/ (real-source harness:
test/helpers/ts-cfg-harness.ts); the end-to-end proof is
test/integration/cfg/.
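
As an illustration of why `neutralizes` must list exact sink kinds, the sketch below models two sanitizers and three sinks as plain data and checks whether a sink still fires. The spec shape, the `modelSketch` contents, and the `sinkFires` helper are assumptions for this example; the real spec in `taint/typescript-model.ts` may be structured differently.

```typescript
// Hypothetical spec shape for illustration; the real model may differ.
type SinkKind = "sql" | "xss" | "path-traversal" | "command-injection";
interface SanitizerSpec { callee: string; neutralizes: SinkKind[] }
interface SinkSpec { callee: string; kind: SinkKind }
interface TaintModelSketch { sanitizers: SanitizerSpec[]; sinks: SinkSpec[] }

// Plain data only (no functions), so the spec stays hashable.
const modelSketch: TaintModelSketch = {
  sanitizers: [
    { callee: "escape", neutralizes: ["xss"] },
    { callee: "path.basename", neutralizes: ["path-traversal"] },
  ],
  sinks: [
    { callee: "res.send", kind: "xss" },
    { callee: "db.query", kind: "sql" },
    { callee: "exec", kind: "command-injection" },
  ],
};

/** A sink fires unless a sanitizer on the path neutralizes that sink's kind. */
function sinkFires(model: TaintModelSketch, sinkCallee: string, sanitizersOnPath: string[]): boolean {
  const sink = model.sinks.find((s) => s.callee === sinkCallee);
  if (sink === undefined) return false; // not a modelled sink
  const neutralized = model.sanitizers
    .filter((s) => sanitizersOnPath.indexOf(s.callee) !== -1)
    .some((s) => s.neutralizes.indexOf(sink.kind) !== -1);
  return !neutralized;
}

const xssAfterEscape = sinkFires(modelSketch, "res.send", ["escape"]);      // false: xss is neutralized
const sqlAfterEscape = sinkFires(modelSketch, "db.query", ["escape"]);      // true: escape does not defend sql
const execAfterBasename = sinkFires(modelSketch, "exec", ["path.basename"]); // true: wrong kind
```

A fixture for a new sanitizer would typically assert both directions: the kind it defends is suppressed, and every other kind still fires.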

Checklist for a `--pdg` change:

1. tsc clean (schema additions are exhaustiveness-checked; watch the
api.ts getNodeQuery runtime read-path if a node label is added).
2. Targeted vitest by directory (test/unit/taint, test/unit/cfg,
test/integration/cfg) — verify by ISOLATION, not full-suite exit
(known load-flakes). `node scripts/build.js` before worker/integration runs.
3. Flag-off golden byte-identical (pipeline-graph-golden.test.ts).
4. bench/cfg/measure.mjs --check (no fingerprint drift / budget regression).
5. detect_changes() before commit; impact({direction:'upstream'}) before
editing shared symbols (KnowledgeGraph, RepoMeta, RelationshipType, codec).
Sharir & Pnueli 1981 (functional approach); Reps-Horwitz-Sagiv IFDS (POPL 1995); FlowDroid/StubDroid (access-path summaries); Pysa & Mariana Trench (TITO / propagations, parallel SCC fixpoint); CodeQL Models-as-Data (the richest port notation, incl. callback ports); Infer (content-keyed incremental summaries).