docs/design/plans/2026-02-13-ruby-analyzer-design.md
Date: 2026-02-13
Status: Design approved, ready for implementation planning
Woods has two unmet needs that share the same root requirement:
Self-analysis. The gem cannot analyze itself — it's plain Ruby with no Rails runtime. Contributors (human and AI) have no structured way to query "what calls DependencyGraph#register?" or "how does data flow from extraction to JSON output?"
Execution flow tracing. The dependency graph captures what-depends-on-what but not execution order. When an agent is asked "what happens when a customer creates a checkout?", it can find related units but not the order they execute. This forces agents to reconstruct flow from memory, producing systematic errors (wrong status codes, swapped pipeline ordering, wrong transaction receivers).
Both features need the same AST infrastructure: Prism parsing, method extraction, call site detection, constant resolution. Building them independently would duplicate ~350 lines of fragile work that already exists as regex/indentation heuristics across 4+ extractors.
The proposed architecture: a shared AST layer (lib/woods/ast/) that provides robust Ruby source parsing, consumed by two independent features:

- RubyAnalyzer — self-analysis that produces ExtractedUnit objects for self-referencing dataflow maps
- FlowAssembler — execution flow tracing over the existing DependencyGraph

The AST layer also resolves two open backlog items (#12 and #13, detailed below).
FlowAssembler's AstParser + OperationExtractor and RubyAnalyzer's PrismHelpers + MethodAnalyzer do the same work with different names: parse Ruby source, walk AST nodes, extract call sites, resolve constants. Building them independently means maintaining two parallel implementations of the same parsing logic.
Prism is in Ruby stdlib since 3.3. No new dependencies. It produces a well-documented AST with source location tracking. The gem requires Ruby 3.0+, and Prism is available as a gem for Ruby 3.0-3.2. RubyVM::AbstractSyntaxTree is deprecated in favor of Prism.
Static analysis shows what could be called. TracePoint shows what is called during test execution. The combination surfaces dead code, hot paths, and untested branches. Static is the source of truth; tracing adds color.
FlowAssembler consumes existing ExtractedUnit data and the DependencyGraph. It doesn't modify extractors or add fields to ExtractedUnit. Flows are cross-cutting paths through the graph, not individual code units. This keeps the extraction layer stable while adding a new query capability.
lib/woods/
├── ast/ # SHARED AST LAYER
│ ├── parser.rb # Prism adapter (parser gem fallback)
│ ├── node.rb # Normalized AstNode struct
│ ├── method_extractor.rb # Extract method body ASTs from source
│ ├── call_site_extractor.rb # Extract call sites from any AST node
│ └── constant_resolver.rb # Resolve constant paths to FQNs
│
├── ruby_analyzer.rb # Self-analysis orchestrator
├── ruby_analyzer/
│ ├── class_analyzer.rb # Classes, modules, inheritance, mixins
│ ├── method_analyzer.rb # Method defs, call graph, parameters
│ ├── dataflow_analyzer.rb # Data shape transformations
│ └── trace_enricher.rb # Optional TracePoint integration
│
├── flow_assembler.rb # Flow tracing orchestrator
├── flow_document.rb # FlowDocument value object
└── flow_analysis/
├── operation_extractor.rb # Ordered operation extraction from method bodies
└── response_code_mapper.rb # render/redirect → HTTP status codes
Shared AST layer (lib/woods/ast/):

Ast::Parser — Adapter that normalizes Prism and the parser gem to a common interface. Auto-detects the available parser at load time.
Woods::Ast::Parser.new.parse(source) # → Ast::Node (root)
Woods::Ast::Parser.new.extract_method(source, "create") # → Ast::Node | nil
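The load-time auto-detection can be sketched as a guarded require chain. This is illustrative, not the gem's actual code; the BACKEND constant name is an assumption.

```ruby
# Sketch: prefer Prism (stdlib on Ruby 3.3+, a gem on 3.0-3.2), fall back to
# the parser gem. BACKEND is an illustrative name, not the gem's API.
BACKEND =
  begin
    require "prism"
    :prism
  rescue LoadError
    begin
      require "parser/current"
      :parser_gem
    rescue LoadError
      raise LoadError, "no Ruby parser available: install prism or parser"
    end
  end
```

The adapter can then branch on BACKEND once, keeping every consumer unaware of which parser produced the tree.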
Ast::Node — Normalized AST node struct used by all consumers:
Ast::Node = Struct.new(
:type, # Symbol: :send, :block, :if, :rescue, :def, :class, etc.
:children, # Array<Ast::Node>
:line, # Integer: source line number
:receiver, # String, nil: method call receiver (for :send)
:method_name, # String, nil: method name (for :send, :def)
:arguments, # Array<String>: argument representations (for :send)
keyword_init: true
)
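For concreteness, a :send node for an illustrative call payments.refund!(amount) would instantiate as follows (module wrapper added so the snippet is self-contained; the field values are made up for the example):

```ruby
module Ast
  # Same shape as the struct defined above.
  Node = Struct.new(
    :type, :children, :line, :receiver, :method_name, :arguments,
    keyword_init: true
  )
end

# A :send node representing `payments.refund!(amount)` on line 12.
call = Ast::Node.new(
  type: :send, children: [], line: 12,
  receiver: "payments", method_name: "refund!", arguments: ["amount"]
)
```

keyword_init: true lets callers omit fields that don't apply (e.g. receiver on a :def node), which default to nil.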
Ast::MethodExtractor — Extracts method body ASTs. Replaces the ~240 lines of nesting_delta / neutralize_strings_and_comments / indentation-based boundary detection duplicated across controller and mailer extractors.
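What AST-based boundary detection looks like compared to the indentation heuristics can be sketched as a breadth-first search for the named DefNode (illustrative code assuming Prism is available, not the gem's implementation; the sample controller is made up):

```ruby
require "prism"

# Sketch: find the DefNode for a given method name instead of tracking
# indentation and nesting deltas by hand.
def extract_method(source, name)
  queue = [Prism.parse(source).value]
  while (node = queue.shift)
    return node if node.is_a?(Prism::DefNode) && node.name == name.to_sym
    queue.concat(node.compact_child_nodes)
  end
  nil
end

source = <<~RUBY
  class CheckoutsController
    def create
      order.save!
    rescue ActiveRecord::RecordInvalid
      head :unprocessable_entity
    end
  end
RUBY

node = extract_method(source, "create")
```

The rescue clause that breaks indentation heuristics is simply part of the DefNode's subtree here, so no special casing is needed.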
Ast::CallSiteExtractor — Extracts call sites (receiver, method, args, line) from any AST node. Shared by RubyAnalyzer's MethodAnalyzer (for call graph building) and FlowAssembler's OperationExtractor (for flow ordering).
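A sketch of call-site extraction with a Prism visitor (assumes Prism; rendering the receiver with Node#slice is one possible normalization, not necessarily the gem's):

```ruby
require "prism"

# Sketch: collect receiver, method name, and line for every call, in source
# order. Illustrative code, not the gem's implementation.
class CallSites < Prism::Visitor
  attr_reader :sites

  def initialize
    @sites = []
    super()
  end

  def visit_call_node(node)
    @sites << {
      receiver: node.receiver&.slice, # source text of the receiver, or nil
      method_name: node.name,
      line: node.location.start_line
    }
    super # keep walking so nested and chained calls are found too
  end
end

visitor = CallSites.new
Prism.parse("@order.save!\nOrderMailer.confirm(42)").value.accept(visitor)
```

Because the visitor walks the tree depth-first in source order, the sites array already satisfies the ordering requirement FlowAssembler needs.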
Ast::ConstantResolver — Resolves constant paths (e.g., Woods::Extractor → fully qualified name). Shared by all consumers that need to map constants to known units.
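The lookup against a known_constants list might work like Ruby's own constant lookup, trying the innermost namespace first (the function name and strategy below are assumptions, not the gem's actual algorithm):

```ruby
# Sketch of constant resolution against a known-constants list.
def resolve_constant(ref, nesting:, known:)
  return ref.delete_prefix("::") if ref.start_with?("::") # explicit top-level

  # Try the innermost nesting first, then each enclosing scope, then top level.
  nesting.length.downto(0) do |depth|
    candidate = (nesting.first(depth) + [ref]).join("::")
    return candidate if known.include?(candidate)
  end
  nil
end

known = ["Woods::Extractor", "Woods::Ast::Parser", "Logger"]
resolve_constant("Extractor", nesting: ["Woods", "Ast"], known: known)
# → "Woods::Extractor"
```

Returning nil for unknown constants keeps the resolver conservative: callers can skip edges they cannot ground in a known unit.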
Entry point: Woods::RubyAnalyzer.analyze(paths:, trace_data: nil)
ClassAnalyzer — Walks Prism AST via Ast::Parser to extract class/module definitions, superclasses, includes, constant references. Produces :ruby_class and :ruby_module ExtractedUnit objects.
MethodAnalyzer — Uses Ast::MethodExtractor + Ast::CallSiteExtractor to extract method definitions, call sites, parameters, visibility. Produces :ruby_method units linked to parent class via dependencies.
DataFlowAnalyzer — Uses Ast::CallSiteExtractor to find data transformation boundaries (.new, .to_h, .to_json, assignment patterns). Adds data_transformations metadata to existing units. Initial version is conservative.
TraceEnricher — Optional runtime layer. Wraps test execution with TracePoint.new(:call, :return) filtered to woods source paths. Records caller→callee pairs, call counts, argument types. Writes tmp/trace_data.json. RubyAnalyzer merges trace data into static analysis.
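The recording core is a few lines of stdlib TracePoint. A minimal sketch of the filtering and counting described above (the Checkout class is a stand-in, and filtering on __FILE__ stands in for filtering on the woods source paths):

```ruby
# Sketch: count calls to methods defined in this file during a traced block.
counts = Hash.new(0)

trace = TracePoint.new(:call) do |tp|
  # The real enricher filters tp.path against the woods source tree.
  counts["#{tp.defined_class}##{tp.method_id}"] += 1 if tp.path == __FILE__
end

class Checkout
  def create
    validate
  end

  def validate
    true
  end
end

trace.enable do
  2.times { Checkout.new.create }
end
```

The real enricher would also subscribe to :return events to measure pairs, and serialize counts to tmp/trace_data.json.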
Entry point: Woods::FlowAssembler.new(graph:, extracted_dir:).assemble(entry_point)
OperationExtractor — Uses Ast::CallSiteExtractor + domain-specific classification (transaction detection, async enqueue detection, response call detection). Extracts operations in source line order with nesting for transaction blocks and conditionals.
ResponseCodeMapper — Maps render/redirect AST nodes to HTTP status codes via Rack::Utils::SYMBOL_TO_STATUS_CODE. FlowAssembler-specific, no equivalent in self-analysis.
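A sketch of the mapping logic. The inline status table is a small stand-in for Rack::Utils::SYMBOL_TO_STATUS_CODE so the snippet carries no Rack dependency, and the method shape is illustrative, not the gem's API:

```ruby
# Subset stand-in for Rack::Utils::SYMBOL_TO_STATUS_CODE (the real table
# comes from the rack gem).
SYMBOL_TO_STATUS = { ok: 200, created: 201, found: 302, unprocessable_entity: 422 }.freeze

# Map a response-call shape like render(status: :created), head :ok, or
# redirect_to(...) to a status code; nil when unresolvable, per the design.
def status_for(method_name, status_kwarg = nil)
  case method_name
  when :redirect_to then SYMBOL_TO_STATUS.fetch(status_kwarg, 302) # default 302
  when :render, :head then SYMBOL_TO_STATUS[status_kwarg || :ok]
  end
end
```

Returning nil (rather than guessing) for unresolvable cases is what lets downstream consumers distinguish "unknown status" from a real 200.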
FlowDocument — Value object holding the assembled flow tree. to_h for JSON, to_markdown for human-readable output.
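The round-trip contract (FlowDocument.from_h(doc.to_h) == doc) can be sketched with a keyword-init Struct; the fields here are illustrative, not the final schema:

```ruby
# Sketch of the FlowDocument value-object contract.
FlowDocument = Struct.new(:entry_point, :operations, keyword_init: true) do
  def to_h
    { entry_point: entry_point, operations: operations }
  end

  def self.from_h(hash)
    new(entry_point: hash[:entry_point], operations: hash[:operations])
  end
end

doc = FlowDocument.new(
  entry_point: "CheckoutsController#create",
  operations: [{ call: "Order#save!", line: 4 }]
)
round_tripped = FlowDocument.from_h(doc.to_h)
```

Struct compares by class and field values, so the round-trip equality falls out for free as long as to_h and from_h stay symmetric.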
Full FlowAssembler design (entry point resolution, recursive expansion, cycle detection, edge cases, output format, testing strategy) is in docs/FLOW_EXTRACTION.md.
Three new type values for self-analysis, distinct from Rails types:
- :ruby_class — a class definition
- :ruby_module — a module definition
- :ruby_method — a method definition (linked to parent class/module)

These coexist with existing types (:model, :controller, etc.) in the same DependencyGraph.
tmp/woods_self/
├── manifest.json
├── dependency_graph.json
├── graph_analysis.json
├── ruby_classes/
│ ├── Woods__Extractor.json
│ ├── Woods__ExtractedUnit.json
│ └── _index.json
├── ruby_modules/
│ └── _index.json
└── ruby_methods/
├── Woods__Extractor__extract_all.json
└── _index.json
Each unit JSON includes standard ExtractedUnit fields plus:
- call_graph — methods this method calls
- called_by — methods that call this method (reverse pass)
- data_transformations — type/shape changes at boundaries
- trace_data — runtime call counts, argument types, hot path flag (when available)

docs/self-analysis/
├── DATAFLOW.md # Data transformation pipeline (Mermaid flowchart)
├── CALL_GRAPH.md # Class-level call relationships (Mermaid graph)
├── DEPENDENCY_MAP.md # Class dependency graph (Mermaid graph)
└── ARCHITECTURE.md # Combined summary with embedded diagrams + stats
Mermaid files are generated from JSON — a view, not a source of truth.
On-demand via rake task:
bundle exec rake woods:flow[CheckoutsController#create] # Markdown
FORMAT=json bundle exec rake woods:flow[CheckoutsController#create] # JSON
Output format documented in docs/FLOW_EXTRACTION.md.
All self-analysis output is committed to the repo, with .gitattributes entries:
tmp/woods_self/** linguist-generated=true
docs/self-analysis/** linguist-generated=true
Collapses generated files in GitHub PRs while keeping them accessible.
scripts/regenerate-self-analysis.sh:
#!/bin/bash
set -e

# Regenerate self-analysis output when staged changes touch lib/.
changed_files=$(git diff --cached --name-only -- 'lib/')
if [ -n "$changed_files" ]; then
  bundle exec rake woods:self_analyze
  git add tmp/woods_self/ docs/self-analysis/
fi
The manifest includes a source_checksum (SHA256 of all lib/**/*.rb contents concatenated, sorted by path). If the checksum matches, the hook is a no-op.
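The checksum computation might look like this (the function name is illustrative; the sort-then-concatenate-then-digest order matches the manifest description above, so the result is deterministic across machines):

```ruby
require "digest"
require "tmpdir"

# Sketch: SHA256 over all *.rb contents under a root, concatenated in
# path-sorted order.
def source_checksum(root)
  files = Dir.glob(File.join(root, "**", "*.rb")).sort
  Digest::SHA256.hexdigest(files.map { |f| File.read(f) }.join)
end
```

The hook would compare this against the source_checksum recorded in manifest.json and exit early on a match.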
bundle exec rake woods:self_analyze # Static analysis + Mermaid generation
bundle exec rake woods:self_trace # Run specs with TracePoint, write tmp/trace_data.json
bundle exec rake woods:flow[entry] # Generate execution flow trace (requires Rails boot)
The shared AST layer resolves two open optimization backlog items as a side effect:
| Backlog Item | Current State | Resolution |
|---|---|---|
| #12 — Fragile method boundary detection | ~240 lines of nesting_delta + neutralize_strings_and_comments duplicated in controller + mailer extractors | Replaced by Ast::MethodExtractor |
| #13 — Fragile scope extraction regex | ~90 lines of regex + dual-depth tracking in model extractor | Replaced by Ast::Parser block boundary detection |
Additionally, neutralize_strings_and_comments() is duplicated in 3+ extractors. The AST parser handles this natively, eliminating all copies.
This design is structured for parallel agent execution. The shared AST layer is the critical path — once it's built and tested, RubyAnalyzer and FlowAssembler can proceed independently.
               ┌─────────────────┐
               │  AST Layer (L0) │
               │  parser, node,  │
               │  method_ext,    │
               │  call_site_ext, │
               │  const_resolver │
               └────────┬────────┘
                        │
       ┌────────────────┼──────────────────┐
       │                │                  │
       ▼                ▼                  ▼
┌─────────────┐  ┌─────────────┐  ┌─────────────────┐
│ RubyAnalyzer│  │FlowAssembler│  │Extractor Backlog│
│    (L1a)    │  │    (L1b)    │  │ #12, #13 (L1c)  │
└──────┬──────┘  └──────┬──────┘  └─────────────────┘
       │                │
       ▼                ▼
┌──────────────┐ ┌──────────────┐
│ Self-Analysis│ │  Rake Tasks  │
│ Output (L2a) │ │ + Docs (L2b) │
└──────┬───────┘ └──────────────┘
       │
       ▼
┌──────────────┐
│  Automation  │
│  Hook (L3)   │
└──────────────┘
| Agent | Specialty | Scope | Blocked By |
|---|---|---|---|
| ast-foundation | AST parsing, Prism API, normalized node model | lib/woods/ast/ + specs | Nothing (start immediately) |
| ruby-analyzer | Class/method/dataflow analysis, ExtractedUnit production | lib/woods/ruby_analyzer/ + specs | ast-foundation |
| flow-assembler | Execution flow tracing, graph traversal, FlowDocument | lib/woods/flow_assembler.rb, flow_analysis/, flow_document.rb + specs | ast-foundation |
| output-and-automation | JSON/Mermaid output, rake tasks, pre-commit hook, .gitattributes | lib/tasks/, scripts/, docs/self-analysis/ | ruby-analyzer (for self-analysis output), flow-assembler (for flow rake task) |
| backlog-cleanup | Replace regex heuristics in existing extractors with AST layer | Controller, mailer, model extractors | ast-foundation |
[
{
"id": "L0-1",
"title": "Implement Ast::Node normalized struct",
"file": "lib/woods/ast/node.rb",
"spec": "spec/ast/node_spec.rb",
"acceptance": "Struct with type, children, line, receiver, method_name, arguments. Supports keyword_init."
},
{
"id": "L0-2",
"title": "Implement Ast::Parser with Prism adapter",
"file": "lib/woods/ast/parser.rb",
"spec": "spec/ast/parser_spec.rb",
"acceptance": "Parses Ruby source into Ast::Node tree. Auto-detects Prism availability. Falls back to parser gem if Prism unavailable. Tests verify identical output for both parsers on known snippets."
},
{
"id": "L0-3",
"title": "Implement Ast::MethodExtractor",
"file": "lib/woods/ast/method_extractor.rb",
"spec": "spec/ast/method_extractor_spec.rb",
"acceptance": "Extracts method body AST by name from source. Handles: def/end, multi-line signatures, rescue/ensure blocks, class methods (def self.foo). Tests include edge cases that break current indentation heuristics."
},
{
"id": "L0-4",
"title": "Implement Ast::CallSiteExtractor",
"file": "lib/woods/ast/call_site_extractor.rb",
"spec": "spec/ast/call_site_extractor_spec.rb",
"acceptance": "Extracts call sites from any AST node. Returns [{receiver:, method_name:, arguments:, line:}]. Handles: method calls, chained calls, block-passing calls. Tests verify source-order preservation."
},
{
"id": "L0-5",
"title": "Implement Ast::ConstantResolver",
"file": "lib/woods/ast/constant_resolver.rb",
"spec": "spec/ast/constant_resolver_spec.rb",
"acceptance": "Resolves constant paths from AST nodes to fully qualified names. Handles: nested modules (A::B::C), relative constants, top-level (::Foo). Takes a known_constants list for disambiguation."
}
]
[
{
"id": "L1a-1",
"title": "Implement ClassAnalyzer",
"file": "lib/woods/ruby_analyzer/class_analyzer.rb",
"spec": "spec/ruby_analyzer/class_analyzer_spec.rb",
"acceptance": "Extracts class/module definitions from Ruby files. Produces :ruby_class and :ruby_module ExtractedUnit objects with identifier, file_path, namespace, source_code, dependencies (superclass, includes)."
},
{
"id": "L1a-2",
"title": "Implement MethodAnalyzer",
"file": "lib/woods/ruby_analyzer/method_analyzer.rb",
"spec": "spec/ruby_analyzer/method_analyzer_spec.rb",
"acceptance": "Extracts method definitions for each class. Produces :ruby_method ExtractedUnit objects linked to parent class via dependencies. Includes call_graph metadata (methods called). Uses Ast::MethodExtractor + Ast::CallSiteExtractor."
},
{
"id": "L1a-3",
"title": "Implement DataFlowAnalyzer",
"file": "lib/woods/ruby_analyzer/dataflow_analyzer.rb",
"spec": "spec/ruby_analyzer/dataflow_analyzer_spec.rb",
"acceptance": "Identifies .new, .to_h, .to_json calls and annotates units with data_transformations metadata. Conservative: only explicit transformation calls, no inference."
},
{
"id": "L1a-4",
"title": "Implement TraceEnricher",
"file": "lib/woods/ruby_analyzer/trace_enricher.rb",
"spec": "spec/ruby_analyzer/trace_enricher_spec.rb",
"acceptance": "TracePoint recording filtered to woods paths. Writes trace_data.json. Merges into static analysis: hot paths, traced_callers, untested method flags."
},
{
"id": "L1a-5",
"title": "Implement RubyAnalyzer orchestrator",
"file": "lib/woods/ruby_analyzer.rb",
"spec": "spec/ruby_analyzer_spec.rb",
"acceptance": "analyze(paths:, trace_data:) coordinates ClassAnalyzer + MethodAnalyzer + DataFlowAnalyzer. Returns Array<ExtractedUnit>. Feeds into DependencyGraph and GraphAnalyzer. Run on lib/woods/ produces correct self-analysis."
}
]
[
{
"id": "L1b-1",
"title": "Implement OperationExtractor",
"file": "lib/woods/flow_analysis/operation_extractor.rb",
"spec": "spec/flow_analysis/operation_extractor_spec.rb",
"acceptance": "Uses Ast::CallSiteExtractor + domain classification. Extracts operations in source order: method calls, transaction blocks (with nesting), async enqueues, response calls, conditionals. Tests verify correct ordering and nesting."
},
{
"id": "L1b-2",
"title": "Implement ResponseCodeMapper",
"file": "lib/woods/flow_analysis/response_code_mapper.rb",
"spec": "spec/flow_analysis/response_code_mapper_spec.rb",
"acceptance": "Maps render/redirect AST nodes to HTTP status codes via Rack::Utils. Handles: status kwarg, render_<status> convention, head, redirect_to (default 302). Returns nil for unresolvable."
},
{
"id": "L1b-3",
"title": "Implement FlowDocument value object",
"file": "lib/woods/flow_document.rb",
"spec": "spec/flow_document_spec.rb",
"acceptance": "to_h produces JSON matching the format in FLOW_EXTRACTION.md. to_markdown produces the table format. Round-trip: FlowDocument.from_h(doc.to_h) == doc."
},
{
"id": "L1b-4",
"title": "Implement FlowAssembler orchestrator",
"file": "lib/woods/flow_assembler.rb",
"spec": "spec/flow_assembler_spec.rb",
"acceptance": "assemble(entry_point) walks DependencyGraph, calls OperationExtractor per unit, resolves cross-unit calls recursively. Cycle detection via visited set. Configurable max_depth (default 5). Prepends before_action filters from controller metadata."
}
]
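The recursive expansion, visited-set cycle detection, and max_depth cap from L1b-4 can be sketched as follows (the graph is a plain Hash of caller to callees here; the real FlowAssembler walks the DependencyGraph and its names may differ):

```ruby
# Sketch: expand a flow tree from an entry point, marking cycles and capping
# depth instead of recursing forever.
def expand(graph, unit, visited: [], max_depth: 5)
  return { unit: unit, note: :cycle } if visited.include?(unit)
  return { unit: unit, note: :max_depth } if visited.length >= max_depth

  callees = graph.fetch(unit, [])
  {
    unit: unit,
    calls: callees.map { |c| expand(graph, c, visited: visited + [unit], max_depth: max_depth) }
  }
end

graph = {
  "CheckoutsController#create" => ["Order#save!", "OrderMailer#confirm"],
  "Order#save!" => ["Order#validate"],
  "Order#validate" => ["Order#save!"] # deliberate cycle
}

flow = expand(graph, "CheckoutsController#create")
```

Passing visited as a fresh array per branch (visited + [unit]) means only true ancestor cycles are cut, while the same unit can still appear on sibling branches.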
[
{
"id": "L2-1",
"title": "Implement self_analyze rake task",
"file": "lib/tasks/woods.rake",
"acceptance": "rake woods:self_analyze runs RubyAnalyzer on lib/woods/, writes JSON to tmp/woods_self/. Includes staleness detection via source_checksum."
},
{
"id": "L2-2",
"title": "Implement self_trace rake task",
"file": "lib/tasks/woods.rake",
"acceptance": "rake woods:self_trace runs specs with TracePoint recording, writes tmp/trace_data.json."
},
{
"id": "L2-3",
"title": "Implement flow rake task",
"file": "lib/tasks/woods.rake",
"acceptance": "rake woods:flow[entry_point] generates flow document. FORMAT=json|markdown. MAX_DEPTH configurable."
},
{
"id": "L2-4",
"title": "Implement Mermaid generation from JSON",
"files": ["lib/woods/ruby_analyzer/mermaid_renderer.rb"],
"acceptance": "Reads self-analysis JSON, produces DATAFLOW.md, CALL_GRAPH.md, DEPENDENCY_MAP.md, ARCHITECTURE.md in docs/self-analysis/."
},
{
"id": "L2-5",
"title": "Implement pre-commit hook and .gitattributes",
"files": ["scripts/regenerate-self-analysis.sh", ".gitattributes"],
"acceptance": "Hook detects lib/ changes in staged files, runs self_analyze, adds output. .gitattributes marks output as linguist-generated."
}
]
[
{
"id": "L1c-1",
"title": "Replace controller method boundary detection with Ast::MethodExtractor",
"file": "lib/woods/extractors/controller_extractor.rb",
"acceptance": "Remove extract_action_source indentation heuristic (~120 lines). Use Ast::MethodExtractor. Existing controller_extractor_spec passes. Output unchanged."
},
{
"id": "L1c-2",
"title": "Replace mailer method boundary detection with Ast::MethodExtractor",
"file": "lib/woods/extractors/mailer_extractor.rb",
"acceptance": "Remove duplicated extract_action_source (~120 lines). Use Ast::MethodExtractor. Existing specs pass."
},
{
"id": "L1c-3",
"title": "Replace model scope extraction regex with AST parsing",
"file": "lib/woods/extractors/model_extractor.rb",
"acceptance": "Remove extract_scope_source regex + scope_keyword_delta (~90 lines). Use Ast::Parser for block boundary detection. Existing model_extractor_spec passes."
},
{
"id": "L1c-4",
"title": "Remove neutralize_strings_and_comments duplicates",
"files": ["All extractors with the method"],
"acceptance": "No extractor defines neutralize_strings_and_comments. AST parser handles this natively. Full spec suite passes."
}
]
.gitattributes with linguist-generated markers

Context: After v1, the AST layer is proven. The remaining extractors (service, job, GraphQL, serializer, etc.) still use regex-based source parsing.
Goal: All extractors delegate AST-level work to ast/, keeping only domain-specific reflection.
Tasks:
[
{
"id": "phase2-1",
"title": "Migrate ModelExtractor to AST layer (beyond scope extraction)",
"description": "ModelExtractor still uses regex for class definition detection and some dependency scanning. Refactor remaining regex patterns to use Ast::Parser and Ast::CallSiteExtractor.",
"files": ["lib/woods/extractors/model_extractor.rb"],
"acceptance": "No regex-based source parsing remains in ModelExtractor. Existing specs pass."
},
{
"id": "phase2-2",
"title": "Migrate ServiceExtractor to AST layer",
"description": "ServiceExtractor is the most regex-heavy extractor. Replace entry point detection, public method extraction, and initialize parameter parsing with AST equivalents.",
"files": ["lib/woods/extractors/service_extractor.rb"],
"acceptance": "ServiceExtractor uses Ast::MethodExtractor for method discovery and Ast::CallSiteExtractor for dependency detection. Existing specs pass."
},
{
"id": "phase2-3",
"title": "Migrate remaining extractors",
"description": "Apply the pattern to JobExtractor, GraphQLExtractor, SerializerExtractor, ManagerExtractor, PolicyExtractor, ValidatorExtractor, PhlexExtractor, ViewComponentExtractor.",
"files": ["lib/woods/extractors/*.rb"],
"acceptance": "All extractor outputs unchanged. Full spec suite passes."
}
]
Context: TraceEnricher records method calls during test runs. Cross-referencing with spec files produces a coverage map.
Goal: "Which specs test which source methods?" and "Which methods have no spec coverage?"
Tasks:
[
{
"id": "phase3-1",
"title": "Extend TraceEnricher to record caller source file",
"description": "Record source file of caller alongside method pairs. When caller is in spec/, creates a spec→source link.",
"acceptance": "trace_data.json includes caller_file. Spec-originating calls identifiable."
},
{
"id": "phase3-2",
"title": "Build spec coverage analyzer",
"description": "Reads trace_data.json, produces coverage report: source method → spec files, spec file → source methods.",
"acceptance": "Coverage report JSON produced. Cross-references match actual test execution."
},
{
"id": "phase3-3",
"title": "Add coverage gaps to Mermaid output",
"description": "Red nodes for untested methods in call graph and dependency map diagrams.",
"acceptance": "Mermaid diagrams visually distinguish tested vs untested methods."
}
]
Context: Existing MCP server reads extraction JSON from a configurable directory.
Goal: Expose self-analysis via dedicated MCP resource without switching index directories.
Tasks:
[
{
"id": "phase4-1",
"title": "Add codebase://self resource to MCP server",
"description": "New resource scoped to self-analysis output directory.",
"acceptance": "Resource returns self-analysis manifest and graph data. Existing resources unaffected."
},
{
"id": "phase4-2",
"title": "Add self-analysis query tools",
"description": "Scope parameter on existing tools (or self_ prefix) to query self-analysis index.",
"acceptance": "Agents can query gem's own class/method data through MCP."
}
]
Not yet planned. Key questions to answer after Phase 2: