Woods Optimization & Best Practices Review

Status as of Feb 2026: 39 of 43 items resolved. Remaining: #6 (fixture specs — ongoing), #18 (intentionally not memoized), #21 (tiktoken runtime — low priority), #40 (Amplitude — deferred).

Context

Woods is a runtime-aware Rails codebase extraction system (~2,700 lines across 7 extractors). The extraction layer is complete and well-designed. This review identifies 29 items across performance, security, correctness, coverage, and best practices — prioritized by impact. Batches 1-4 fully resolved (items #1-5, #7-11, #15-17) in commit cab9061. Items #12-13 resolved via shared AST layer (Prism-based Ast::MethodExtractor and Ast::Parser) in commit 30b6563. Item #6 is partially resolved (86 gem specs + 87 integration specs; extractor-level fixture specs still needed).

Critical: Performance

1. ✅ Git Data Extraction — N+1 Shell Commands — RESOLVED

Files: lib/woods/extractor.rb:191-247 Resolution: Replaced per-file subprocess spawns with batch_git_data — two git commands total (git log --all --name-only + parsing). Commit cab9061.

~~Currently spawns 6-7 shell processes per unit file (git log, git rev-list, git shortlog). For a codebase with 200 units, that's ~1,400 subprocess spawns — easily the biggest bottleneck.~~

2. ✅ Repeated File Reads Within Each Extractor — RESOLVED

Files: All 7 extractors Resolution: Each extractor now reads source once and passes the string through all methods. Commit cab9061.

~~Each extractor reads the same file 3-5 times during a single extraction.~~

3. ✅ O(n^2) Model Name Scanning in Dependency Extraction — RESOLVED

Files: All extractors Resolution: ModelNameCache precomputes model names and builds a single compiled regex shared across all extractors. Commit cab9061.

~~Every extractor iterates all ActiveRecord::Base.descendants for every unit to find model name references.~~

4. ✅ O(n) Linear `find_unit` in Dependency Resolution — RESOLVED

File: lib/woods/extractor.rb Resolution: resolve_dependents now builds a { identifier => unit } hash via index_by before the loop. Commit cab9061.

~~resolve_dependents calls find_unit (linear scan) for every dependency of every unit.~~

Critical: Security

5. ✅ Shell Injection in Git Commands — RESOLVED

File: lib/woods/extractor.rb:195-197, 214, 224, 235-238 Resolution: Backtick git commands replaced with Open3.capture2 argument arrays. No shell interpretation, no injection risk.

~~File paths were string-interpolated into backtick shell commands. A file path containing "$(rm -rf /) or backticks would execute arbitrary commands.~~

Critical: Missing Fundamentals

6. 🔶 Test Suite — PARTIALLY RESOLVED

Status: 86 unit specs in the gem (spec/) + 87 integration specs in the host Rails app (spec/integration/). Unit-level coverage for core value objects, graph analysis, ModelNameCache, and json_serialize. Integration coverage for full extraction pipeline, incremental extraction, :via assertions, _index.json regeneration, git metadata structure, and pretty_json config.

Remaining: Extractor-level specs against fixture Rails apps are still needed. Priority areas:

Individual extractors with fixture classes (requires a booted Rails environment)
Edge cases: empty files, namespaced classes, STI, concern inlining

High: Correctness Bugs

7. ✅ DependencyGraph Key Mismatch After JSON Round-Trip — RESOLVED

File: lib/woods/dependency_graph.rb Resolution: from_h now uses symbolize_node and transform_keys to ensure symbol keys after JSON deserialization. units_of_type(:model) works correctly after round-trip.

~~@type_index used symbol keys (:model) during extraction, but from_h loaded string keys from JSON.~~

8. ✅ Incremental Extraction Doesn't Update Index Files — RESOLVED

File: lib/woods/extractor.rb Resolution: extract_changed now tracks affected_types and calls regenerate_type_index for each. Commit cab9061.

~~extract_changed re-writes individual unit JSON files and the dependency graph, but skips _index.json files.~~

9. ✅ Inconsistent `:via` Key in Dependencies — RESOLVED

Resolution: All extractors now include :via key (:association, :code_reference) consistently. Commit cab9061.

~~Model extractor includes :via, but controller, service, job, and mailer extractors omit it.~~

High: Best Practices

10. ✅ Bare `rescue` Blocks — RESOLVED

Files: All extractors Resolution: All bare rescue blocks changed to rescue StandardError. Critical exceptions (SystemExit, SignalException, NoMemoryError) now propagate correctly.

~~17+ instances of bare rescue across all extractors caught Exception, masking critical failures.~~

11. ✅ Repeated `eager_load!` Calls — RESOLVED

Files: lib/woods/extractor.rb (orchestrator), all extractors Resolution: Rails.application.eager_load! consolidated to the orchestrator. No longer called redundantly by each individual extractor.

~~Called 5 times when the orchestrator ran all extractors sequentially.~~

Medium: Code Quality

12. ✅ Fragile Method Boundary Detection — RESOLVED

Files: controller_extractor.rb, mailer_extractor.rb Resolution: Replaced extract_action_source indentation heuristics (nesting_delta, neutralize_strings_and_comments, detect_heredoc_start) with Ast::MethodExtractor#extract_method_source — Prism-based AST parsing with exact line spans. Deleted ~190 lines of heuristic code across both files. Commit 30b6563.

~~Uses indentation heuristics to find method end. Fails for multi-line signatures, rescue/ensure blocks, heredocs containing end.~~

13. ✅ Fragile Scope Extraction Regex — RESOLVED

File: model_extractor.rb Resolution: Replaced extract_scope_source regex with Ast::Parser-based scope extraction. Parses full source, finds :send nodes with method_name == 'scope', uses line/end_line spans for boundaries. Regex fallback retained for parse failures. Deleted scope_keyword_delta and neutralize_strings_and_comments. Commit 30b6563.

~~Regex breaks on multi-line lambda bodies, nested blocks, scopes with comments inside, and Proc.new syntax.~~

14. ✅ Concern Detection Heuristic — RESOLVED

File: model_extractor.rb:176-197 Resolution: Improved concern detection to check module source location first (cheaper), with method-level checks as fallback. Filters out third-party gem concerns more reliably.

~~mod.name.include?("Concerns") matches any module with "Concerns" in its name, including third-party gems. defined_in_app? iterates all instance methods checking source locations (expensive).~~

15. ✅ Redundant `extract_public_api`/`extract_dsl_methods` Calls — RESOLVED

File: lib/woods/extractors/rails_source_extractor.rb Resolution: rate_importance now receives pre-computed metadata instead of re-extracting. Commit cab9061.

~~rate_importance calls extract_public_api(source) and extract_dsl_methods(source) even though the same data was just computed.~~

16. ✅ `JSON.pretty_generate` for All Output — RESOLVED

File: lib/woods/extractor.rb Resolution: Added config.pretty_json (defaults to true for backward compat). json_serialize dispatches to pretty_generate or generate based on config. Commit cab9061.

~~Pretty-printed JSON adds ~30-40% size overhead from whitespace.~~

Low: Minor Improvements

17. ✅ Cache `git_available?` Result — RESOLVED

File: lib/woods/extractor.rb Resolution: Memoized with defined?(@git_available) guard. Commit cab9061.

~~Spawns a subprocess every time it's called.~~

18. ⚠️ Memoize `estimated_tokens` — INTENTIONALLY NOT MEMOIZED

File: lib/woods/extracted_unit.rb Status: source_code and metadata are mutable after construction (extractors modify units during the dependency resolution pass). Memoizing estimated_tokens would return stale values. The per-call cost is negligible (one division + ceil).

~~Recalculates on every call.~~

19. ✅ Use Set for Job Deduplication — RESOLVED

File: lib/woods/extractors/job_extractor.rb Resolution: Replaced O(n) units.any? with a Set of seen identifiers for O(1) lookup.

~~units.any? { |u| u.identifier == job_class.name } is O(n) per check.~~

20. ✅ Configuration Validation — RESOLVED

File: lib/woods.rb:35-58 Resolution: Added validate! method with checks for positive integers, valid ranges, and writable paths. Called before extraction runs.

~~No validation on max_context_tokens, similarity_threshold, output_dir, etc.~~

21. Token Estimation Accuracy

File: lib/woods/extracted_unit.rb:66-69

(length / 4.0).ceil is a rough heuristic. Ruby code tokenizes differently than natural language.

Fix: Consider tiktoken_ruby gem for accurate token counting, with the 4-char heuristic as fallback.

22. ✅ Concurrent Extraction — RESOLVED

File: lib/woods/extractor.rb Resolution: Added extract_all_concurrent with concurrent_extraction config flag (default: false). Sequential extraction remains the default for safety.

~~Extractors run sequentially but are independent.~~

23. ✅ Missing Mailer/Job Types in `re_extract_unit` — RESOLVED

File: lib/woods/extractor.rb Resolution: re_extract_unit now uses TYPE_TO_EXTRACTOR_KEY mapping and handles all types including :job, :mailer, and GraphQL types. Commit cab9061.

~~The case statement for re-extraction only handles :model, :controller, :service, :component.~~

New: Extraction Coverage Gaps

24. ✅ No Serializer/Decorator Extractor — RESOLVED

Resolution: Added SerializerExtractor covering ActiveModelSerializers, Blueprinter, Alba, and Draper. Auto-detects loaded gems and extracts accordingly. Includes dependency tracking to underlying models.

~~No extractor exists for serializer or decorator patterns.~~

25. ✅ No ViewComponent Extractor — RESOLVED

Resolution: Added ViewComponentExtractor for ViewComponent::Base descendants. Extracts component slots, template paths, preview classes, and collection support. Registered alongside Phlex extractor.

~~Only Phlex view components are extracted.~~

New: Documentation & Design Drift

26. ✅ Voyage Code 2 → Code 3 in Doc Examples — RESOLVED

Resolution: Updated all docs to lead with Voyage Code 3 (1024 dims, 32K context). Code 2 retained as legacy option where referenced. Cost figures updated across BACKEND_MATRIX.md, RETRIEVAL_ARCHITECTURE.md, and CONTEXT_AND_CHUNKING.md.

~~All embedding model references in docs still reference Voyage Code 2.~~

27. ✅ Scale Assumptions Outdated Throughout Docs — RESOLVED

Resolution: Updated prose references from "300" to "993"/"~1,000" across BACKEND_MATRIX.md and other docs. Cost projections recalculated for 1000-unit baseline. Tabular data retained at varying sizes (50-1000) for comparison.

~~Docs reference "300+ models" as the scale target.~~

New: Retrieval Pipeline Gaps

28. ✅ RRF Should Replace Ad-Hoc Score Fusion — RESOLVED

Resolution: Replaced merge_candidates in RETRIEVAL_ARCHITECTURE.md with Reciprocal Rank Fusion (RRF) implementation: score(d) = Σ 1/(k + rank_i(d)) with k=60. Eliminates need for cross-backend score normalization.

~~HybridSearch uses ad-hoc weighted score fusion.~~

29. ✅ Cross-Encoder Reranking Missing from Ranking Pipeline — RESOLVED

Resolution: Added cross-encoder reranking section to RETRIEVAL_ARCHITECTURE.md as an optional stage between initial ranking and context assembly. Defined Reranker::Interface, documented Cohere Rerank and Voyage Reranker as candidates, with configuration for enabling/disabling.

~~The retrieval pipeline has no reranking stage.~~

Recommended Implementation Order

Batch 1 — High-impact, low-risk: ✅ ALL RESOLVED

~~Fix bare rescue blocks (#10)~~ ✅
~~Fix find_unit O(n) scan (#4)~~ ✅ cab9061
~~Fix DependencyGraph key mismatch (#7)~~ ✅
Fix missing types in re_extract_unit (#23) ✅ cab9061
~~Fix incremental index file updates (#8)~~ ✅ cab9061

Batch 2 — Performance wins: ✅ ALL RESOLVED 6. ~~Eliminate repeated file reads (#2)~~ ✅ cab9061 7. ~~Precompute model names for dependency scanning (#3)~~ ✅ cab9061 8. ~~Move eager_load! to orchestrator (#11)~~ ✅ 9. ~~Cache git_available? (#17)~~ ✅ cab9061

Batch 3 — Security + Git performance: ✅ ALL RESOLVED 10. ~~Fix shell injection in git commands (#5)~~ ✅ 11. ~~Batch git data extraction (#1)~~ ✅ cab9061

Batch 4 — Code quality: ✅ ALL RESOLVED 12. ~~Add consistent :via key (#9)~~ ✅ cab9061 13. ~~Reduce JSON.pretty_generate overhead (#16)~~ ✅ cab9061 14. ~~Fix redundant analysis calls (#15)~~ ✅ cab9061

Batch 5 — Extraction coverage: ✅ ALL RESOLVED 15. ~~Add serializer/decorator extractor (#24)~~ ✅ 16. ~~Add ViewComponent extractor (#25)~~ ✅

Batch 6 — Retrieval pipeline design: ✅ ALL RESOLVED 17. ~~Replace ad-hoc score fusion with RRF (#28)~~ ✅ 18. ~~Add cross-encoder reranking stage (#29)~~ ✅

Batch 7 — Documentation & code quality: ✅ ALL RESOLVED 19. ~~Update Voyage Code 2 → Code 3 references (#26)~~ ✅ 20. ~~Update scale assumptions to 993-model baseline (#27)~~ ✅ 21. ~~Improve concern detection (#14)~~ ✅ 22. ~~Add configuration validation (#20)~~ ✅

Deferred:

Test suite (#6) — 2506 gem specs; extractor-level fixture specs still growing
Token estimation (#21) — tiktoken_ruby optional benchmark; 4.0 divisor is acceptable

New: MCP Index Server

Items identified from the initial MCP server implementation (commits baa5b85..6e4de8f) and real-world testing against a production Rails app.

30. ✅ MCP Index Server — Semantic Search Tool — RESOLVED

Resolution: codebase_retrieve tool implemented with full retrieval pipeline (query classification, hybrid search, RRF ranking, context assembly). Degrades gracefully when embedding provider is not configured.

~~The index server currently has keyword regex search only.~~

31. ✅ MCP Index Server — Framework Source Tool — RESOLVED

Resolution: framework tool implemented. Searches rails_source type units by concept keyword across identifier, source_code, and metadata fields.

~~The codebase_framework tool from AGENTIC_STRATEGY.md is not yet implemented.~~

32. ✅ MCP Index Server — Recent Changes Tool — RESOLVED

Resolution: recent_changes tool implemented. Sorts units by metadata.git.last_modified, supports type filtering and limit.

~~The codebase_recent_changes tool from AGENTIC_STRATEGY.md is not implemented.~~

33. ✅ MCP Index Server — HTTP Transport — RESOLVED

Resolution: exe/woods-mcp-http executable provides HTTP/Rack transport via Rackup. Supports network-accessible retrieval.

~~The server only supports stdio transport.~~

34. ✅ MCP Index Server — Resource Templates for Unit Lookup — RESOLVED

Resolution: codebase://unit/{identifier} and codebase://type/{type} resource templates implemented alongside 2 static resources (manifest, graph).

~~Only two static resources exist.~~

New: Console MCP Server

Implementation items from the CONSOLE_SERVER.md design document, organized by phase.

35. ✅ Console Server — Phase 0: Bridge Protocol — RESOLVED

Resolution: Bridge script (lib/woods/console/bridge.rb) implemented with JSON-lines protocol, model/column validation against ActiveRecord::Base.descendants, and connection manager with Docker exec, direct, and SSH modes.

~~Build the JSON-lines bridge script.~~

36. ✅ Console Server — Phase 1: MVP Tools — RESOLVED

Resolution: exe/woods-console-mcp executable with 9 Tier 1 tools. Safety layers 1-4 implemented (read-only connection via SafeContext transaction rollback, statement timeout, structured validation). Column redaction and result size caps active.

~~Implement Tier 1 tools.~~

37. ✅ Console Server — Phase 2: Domain-Aware Tools + Controlled Writes — RESOLVED

Resolution: 9 Tier 2 tools implemented with registered write actions, class discovery from conventional directories, and preset configurations.

~~Implement Tier 2 tools.~~

38. ✅ Console Server — Phase 3: Job Queue, Cache, and Analytics Tools — RESOLVED

Resolution: 10 Tier 3 tools implemented. Job adapters for Sidekiq (Redis API), Solid Queue (DB tables), GoodJob (DB tables). Cache adapters for Redis, Solid Cache, memory/file stores.

~~Implement Tier 3 tools.~~

39. ✅ Console Server — Phase 4: Guarded Eval + Advanced Queries — RESOLVED

Resolution: 3 Tier 4 tools implemented. SqlValidator rejects DML/DDL at the string level. Human confirmation flow and audit logging active.

~~Implement Tier 4 tools.~~

40. Console Server — Amplitude Analytics Integration — DEFERRED

Requested: add Amplitude as an analytics provider for Tier 3 tools. Amplitude's event and cohort data maps to throughput and data_snapshot tool patterns. Requires a provider adapter interface and Amplitude API client.

Depends on: #38, Amplitude API key and event schema from client app.

41. ✅ Extraction — Manager/Delegator Extractor — RESOLVED

Resolution: ManagerExtractor implemented. Scans app/managers/ for SimpleDelegator subclasses. Captures wrapped model, public methods, and delegation chain. Registered in default extractors config.

~~The host app uses SimpleDelegator subclasses in app/managers/.~~

42. ✅ Extraction — Policy Class Extractor — RESOLVED

Resolution: PolicyExtractor implemented. Scans app/policies/ for domain policy classes. Captures policy names, evaluated models, and decision methods. Registered in default extractors config.

~~Domain policy classes in app/policies/ not covered.~~

43. ✅ Extraction — Standalone Validator Extractor — RESOLVED

Resolution: ValidatorExtractor implemented. Scans app/validators/ for custom validator classes. Captures validator names, operated models, and validation rules. Registered in default extractors config.

~~Custom validator classes in app/validators/ not covered.~~

Recommended Implementation Order (New Items)

Batch 8 — MCP index server gaps: ✅ ALL RESOLVED (#30-34)

Batch 9 — Console server foundation: ✅ ALL RESOLVED (#35-36)

Batch 10 — Console server domain tools: ✅ ALL RESOLVED (#37-38)

Batch 11 — Extraction coverage for domain classes: ✅ ALL RESOLVED (#41-43)

Batch 12 — Advanced console + eval: ✅ RESOLVED (#39), #40 deferred (Amplitude)

Verification

After each batch:

Run rake woods:extract on a real Rails app
Run rake woods:validate to verify output integrity
Compare output JSON files before/after (should be identical except for timing fields)
Run rake woods:incremental with a known changed file
Verify _index.json and SUMMARY.md are consistent with unit files

Woods Optimization & Best Practices Review

Woods Optimization & Best Practices Review

Context

Critical: Performance

1. ✅ Git Data Extraction — N+1 Shell Commands — RESOLVED

2. ✅ Repeated File Reads Within Each Extractor — RESOLVED

3. ✅ O(n^2) Model Name Scanning in Dependency Extraction — RESOLVED

4. ✅ O(n) Linear find_unit in Dependency Resolution — RESOLVED

Critical: Security

5. ✅ Shell Injection in Git Commands — RESOLVED

Critical: Missing Fundamentals

6. 🔶 Test Suite — PARTIALLY RESOLVED

High: Correctness Bugs

7. ✅ DependencyGraph Key Mismatch After JSON Round-Trip — RESOLVED

8. ✅ Incremental Extraction Doesn't Update Index Files — RESOLVED

9. ✅ Inconsistent :via Key in Dependencies — RESOLVED

High: Best Practices

10. ✅ Bare rescue Blocks — RESOLVED

11. ✅ Repeated eager_load! Calls — RESOLVED

Medium: Code Quality

12. ✅ Fragile Method Boundary Detection — RESOLVED

13. ✅ Fragile Scope Extraction Regex — RESOLVED

14. ✅ Concern Detection Heuristic — RESOLVED

15. ✅ Redundant extract_public_api/extract_dsl_methods Calls — RESOLVED

16. ✅ JSON.pretty_generate for All Output — RESOLVED

Low: Minor Improvements

17. ✅ Cache git_available? Result — RESOLVED

18. ⚠️ Memoize estimated_tokens — INTENTIONALLY NOT MEMOIZED

19. ✅ Use Set for Job Deduplication — RESOLVED

20. ✅ Configuration Validation — RESOLVED

21. Token Estimation Accuracy

22. ✅ Concurrent Extraction — RESOLVED

23. ✅ Missing Mailer/Job Types in re_extract_unit — RESOLVED

New: Extraction Coverage Gaps

24. ✅ No Serializer/Decorator Extractor — RESOLVED

25. ✅ No ViewComponent Extractor — RESOLVED

New: Documentation & Design Drift

26. ✅ Voyage Code 2 → Code 3 in Doc Examples — RESOLVED

27. ✅ Scale Assumptions Outdated Throughout Docs — RESOLVED

New: Retrieval Pipeline Gaps

28. ✅ RRF Should Replace Ad-Hoc Score Fusion — RESOLVED

29. ✅ Cross-Encoder Reranking Missing from Ranking Pipeline — RESOLVED

Recommended Implementation Order

New: MCP Index Server

30. ✅ MCP Index Server — Semantic Search Tool — RESOLVED

31. ✅ MCP Index Server — Framework Source Tool — RESOLVED

32. ✅ MCP Index Server — Recent Changes Tool — RESOLVED

33. ✅ MCP Index Server — HTTP Transport — RESOLVED

34. ✅ MCP Index Server — Resource Templates for Unit Lookup — RESOLVED

New: Console MCP Server

35. ✅ Console Server — Phase 0: Bridge Protocol — RESOLVED

36. ✅ Console Server — Phase 1: MVP Tools — RESOLVED

37. ✅ Console Server — Phase 2: Domain-Aware Tools + Controlled Writes — RESOLVED

38. ✅ Console Server — Phase 3: Job Queue, Cache, and Analytics Tools — RESOLVED

39. ✅ Console Server — Phase 4: Guarded Eval + Advanced Queries — RESOLVED

40. Console Server — Amplitude Analytics Integration — DEFERRED

41. ✅ Extraction — Manager/Delegator Extractor — RESOLVED

42. ✅ Extraction — Policy Class Extractor — RESOLVED

43. ✅ Extraction — Standalone Validator Extractor — RESOLVED

Recommended Implementation Order (New Items)

Verification

4. ✅ O(n) Linear `find_unit` in Dependency Resolution — RESOLVED

9. ✅ Inconsistent `:via` Key in Dependencies — RESOLVED

10. ✅ Bare `rescue` Blocks — RESOLVED

11. ✅ Repeated `eager_load!` Calls — RESOLVED

15. ✅ Redundant `extract_public_api`/`extract_dsl_methods` Calls — RESOLVED

16. ✅ `JSON.pretty_generate` for All Output — RESOLVED

17. ✅ Cache `git_available?` Result — RESOLVED

18. ⚠️ Memoize `estimated_tokens` — INTENTIONALLY NOT MEMOIZED

23. ✅ Missing Mailer/Job Types in `re_extract_unit` — RESOLVED