docs/ARCHITECTURE.md
This doc explains how Woods works from the inside — how extraction, storage, retrieval, and the two MCP servers fit together.
Woods runs in five phases across two environments (three inside the Rails app, two on the host or in CI):
Inside Rails app (rake task):
1. Extract — 34 extractors introspect the live Rails environment
2. Resolve — dependency graph is built and enriched with git data
3. Write — one JSON file per code unit to tmp/woods/
On the host / in CI:
4. Embed — units are chunked and embedded into a vector store
5. Query — MCP server reads the JSON index and answers questions
The key insight: extraction requires a booted Rails application (ActiveRecord::Base.descendants, Rails.application.routes, etc.), but querying does not. The Index MCP server reads static JSON — no Rails, no database.
```
┌──────────────────────────────────────────────────────────┐
│                    Rails Application                     │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐              │
│  │ Extract  │──▶│ Resolve  │──▶│  Enrich  │              │
│  │ 34 types │   │  graph   │   │   git    │              │
│  └──────────┘   └──────────┘   └──────────┘              │
│                                     │                    │
│                                     ▼                    │
│                              ┌──────────────┐            │
│                              │  Write JSON  │            │
│                              │  tmp/woods/  │            │
│                              └──────────────┘            │
└──────────────────────────────────────────────────────────┘
                                      │
                                      ▼
┌──────────────────────────────────────────────────────────┐
│                  Host / CI Environment                   │
│  ┌──────────┐   ┌──────────┐   ┌──────────────────────┐  │
│  │  Embed   │──▶│  Index   │   │   MCP Index Server   │  │
│  │  OpenAI  │   │ pgvector │   │  27 tools, no Rails  │  │
│  │  Ollama  │   │  Qdrant  │   └──────────────────────┘  │
│  └──────────┘   └──────────┘                             │
└──────────────────────────────────────────────────────────┘
                                      ▲
             ┌────────────────────────┘
             │  Console MCP Server
             │  31 tools, live Rails
             │  (runs inside the app)
```
ExtractedUnit is the universal currency of Woods. Extractors produce them, the dependency graph connects them, the embedding pipeline consumes them, and the retrieval pipeline returns them.
Every unit carries:
- identifier: unique key, usually the class name ("User", "OrdersController") or a descriptive string for non-class units ("POST /orders")
- type: what kind of thing this is (:model, :controller, :service, :route, etc.)
- file_path: relative path from Rails.root (e.g., "app/models/user.rb")
- source_code: the annotated source. For models this includes concerns inlined and schema prepended; for controllers this includes a route context header
- metadata: type-specific structured data (associations, callbacks, actions, fields, etc.)
- dependencies: forward edges, [{ type:, target:, via: }]
- dependents: reverse edges, populated in a second pass after all units are registered
- chunks: semantic sub-sections for large units (populated by SemanticChunker)
- estimated_tokens: approximate token count using 4.0 chars/token (benchmarked conservative floor)

Units are serialized to JSON with two additional fields: extracted_at (timestamp) and source_hash (SHA-256 of source_code for change detection).
See EXTRACTOR_REFERENCE.md for the full field table and a complete example JSON.
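To make the field list concrete, here is a minimal sketch of a unit and its serialized form. `UnitSketch`, `CHARS_PER_TOKEN`, and `to_serialized_h` are hypothetical names for illustration, not the actual ExtractedUnit API. Rounding the token estimate up and using an ISO 8601 timestamp are assumptions.

```ruby
require "digest"
require "json"
require "time"

CHARS_PER_TOKEN = 4.0 # ratio quoted above

# Hypothetical stand-in for ExtractedUnit; field names follow the list above.
UnitSketch = Struct.new(
  :identifier, :type, :file_path, :source_code,
  :metadata, :dependencies, :dependents, :chunks,
  keyword_init: true
) do
  # Approximate token count from character length (rounding up is an assumption).
  def estimated_tokens
    (source_code.length / CHARS_PER_TOKEN).ceil
  end

  # SHA-256 of the source, used for change detection.
  def source_hash
    Digest::SHA256.hexdigest(source_code)
  end

  # Adds the two serialization-only fields on top of the unit's own fields.
  def to_serialized_h(now: Time.now.utc)
    to_h.merge(
      estimated_tokens: estimated_tokens,
      extracted_at: now.iso8601,
      source_hash: source_hash
    )
  end
end

unit = UnitSketch.new(
  identifier: "User",
  type: :model,
  file_path: "app/models/user.rb",
  source_code: "class User < ApplicationRecord\nend\n",
  metadata: {},
  dependencies: [{ type: :model, target: "Order", via: :has_many }],
  dependents: [],
  chunks: []
)

puts JSON.pretty_generate(unit.to_serialized_h)
```

Because source_hash depends only on source_code, two extractions of an unchanged file produce the same hash even though extracted_at differs.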
Before any extractor runs, Rails.application.eager_load! is called once to load all application classes into memory. If eager_load! fails with a NameError (common when app/graphql/ references an uninstalled gem — Zeitwerk processes directories alphabetically, so a failure in graphql/ can prevent models/ from loading), the orchestrator falls back to per-directory loading across the 19 directories in EXTRACTION_DIRECTORIES.
```ruby
# Phase 1: Extract
EXTRACTORS.each { |type, klass| @results[type] = klass.new.extract_all }

# Phase 1.5: Deduplicate
# Duplicate identifiers (e.g., engine routes duplicating app routes) are dropped

# Phase 2: Resolve dependents
# Second pass: if A.dependencies includes B, B.dependents gets a back-reference to A

# Phase 3: Graph analysis
# PageRank, orphans, dead ends, hubs, cycles, bridges

# Phase 4: Enrich with git
# batch git log for all file paths → last_modified, contributors, change_frequency

# Phase 5: Write
# One JSON file per unit + _index.json per type + dependency_graph.json + SUMMARY.md + manifest.json
```
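Phases 1.5 and 2 are only described in comments above, so the following sketch shows one way they could work on plain hashes. The hash shape, keeping the first duplicate, and storing dependents as bare identifiers are all assumptions made for illustration.

```ruby
# Hypothetical minimal units: only the fields these two passes need.
units = [
  { identifier: "User",  dependencies: [{ type: :model, target: "Order", via: :association }], dependents: [] },
  { identifier: "Order", dependencies: [], dependents: [] },
  { identifier: "Order", dependencies: [], dependents: [] } # duplicate, e.g. from an engine
]

# Phase 1.5: keep the first unit seen for each identifier (assumed policy).
unique = units.uniq { |unit| unit[:identifier] }

# Phase 2: for every edge A -> B, record A in B's dependents.
by_id = unique.to_h { |unit| [unit[:identifier], unit] }
unique.each do |unit|
  unit[:dependencies].each do |dep|
    target = by_id[dep[:target]]
    next unless target # the dependency points at something that was not extracted

    target[:dependents] << unit[:identifier] unless target[:dependents].include?(unit[:identifier])
  end
end

puts by_id["Order"][:dependents].inspect
```

The second pass has to run after all units are collected, because a unit's dependents may be extracted later than the unit itself.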
Set config.concurrent_extraction = true to run extractors in parallel threads. Thread safety is ensured by:
- ModelNameCache is populated before threads start (avoids a ||= race)
- Results are collected in a Mutex-protected hash

extract_changed(changed_files) re-extracts only the units affected by a set of changed files. It operates on the existing dependency_graph.json and _index.json.

Incremental extraction skips unit types that don't map to individual files: route, middleware, engine, scheduled_job. These require a full extraction to update.
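For the concurrent mode enabled by config.concurrent_extraction, the Mutex-protected collection of results can be sketched as below. The lambdas stand in for real extractors and are hypothetical; this is not the orchestrator's actual code.

```ruby
# Hypothetical extractors: each returns a list of unit identifiers.
extractors = {
  model:      -> { %w[User Order] },
  controller: -> { %w[OrdersController] },
  service:    -> { %w[CheckoutService] }
}

results = {}
lock = Mutex.new

threads = extractors.map do |type, extractor|
  Thread.new do
    units = extractor.call                      # slow work happens outside the lock
    lock.synchronize { results[type] = units }  # only the shared write is serialized
  end
end
threads.each(&:join)

puts results.keys.sort.inspect
```

Keeping the extraction call outside `synchronize` means threads only contend for the brief hash write, not for the extraction itself.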
The DependencyGraph is a directed graph where nodes are ExtractedUnit identifiers and edges are dependency relationships. It tracks:
- Forward edges (@edges): what each unit depends on; populated when units are registered
- Reverse edges (@reverse): what depends on each unit; built during registration and in the resolve phase

```ruby
graph = DependencyGraph.new
graph.register(user_unit)   # adds User to nodes, adds User→Order edge (from belongs_to)
graph.register(order_unit)  # adds Order to nodes

graph.dependencies_of("User")  # => ["Order", "UserService"]
graph.dependents_of("Order")   # => ["User", "OrdersController"]

# Blast radius: what needs re-indexing if user.rb changes?
graph.affected_by(["app/models/user.rb"])  # BFS over reverse edges
```
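The blast-radius traversal can be sketched with a small stand-alone graph. `GraphSketch` and its `register` signature are hypothetical and simplified (the real `register` takes a unit); only the idea of a BFS over reverse edges comes from the text above.

```ruby
require "set"

# Hypothetical re-implementation for illustration; not Woods' DependencyGraph.
class GraphSketch
  def initialize
    @edges = Hash.new { |hash, key| hash[key] = Set.new }   # unit -> what it depends on
    @reverse = Hash.new { |hash, key| hash[key] = Set.new } # unit -> what depends on it
    @files = {}                                             # file_path -> identifier
  end

  def register(identifier, file_path:, depends_on: [])
    @files[file_path] = identifier
    @edges[identifier] # touch so the node exists even without edges
    depends_on.each do |target|
      @edges[identifier] << target
      @reverse[target] << identifier
    end
  end

  def dependencies_of(identifier)
    @edges.fetch(identifier, Set.new).to_a
  end

  def dependents_of(identifier)
    @reverse.fetch(identifier, Set.new).to_a
  end

  # BFS over reverse edges, starting from the units defined in the changed files.
  def affected_by(changed_files)
    queue = changed_files.filter_map { |path| @files[path] }
    seen = Set.new(queue)
    until queue.empty?
      current = queue.shift
      @reverse.fetch(current, Set.new).each do |dependent|
        queue << dependent if seen.add?(dependent) # add? returns nil when already present
      end
    end
    seen.to_a
  end
end

sketch = GraphSketch.new
sketch.register("Order", file_path: "app/models/order.rb")
sketch.register("User", file_path: "app/models/user.rb", depends_on: ["Order"])
sketch.register("OrdersController", file_path: "app/controllers/orders_controller.rb",
                                    depends_on: %w[Order User])

puts sketch.affected_by(["app/models/order.rb"]).inspect
```

A change to order.rb reaches User and OrdersController because both sit on reverse edges from Order, directly or transitively.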
DependencyGraph#pagerank computes importance scores using the reverse edge structure: units with many dependents score higher. This matches the intuition that "important" units are the ones many other units depend on — the same insight as Google's PageRank applied to code graphs.
Scores feed into the retrieval ranker as one signal in the final ranking formula.
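As an illustration of how such scores can be computed, here is a textbook power-iteration PageRank over dependency edges. The function name, the damping factor of 0.85, and the iteration count are assumptions; the actual DependencyGraph#pagerank may differ in all of them.

```ruby
# edges: { "A" => ["B", ...] } meaning A depends on B.
# Rank flows from a unit to its dependencies, so units with many dependents accumulate score.
def pagerank_sketch(edges, damping: 0.85, iterations: 50)
  nodes = (edges.keys + edges.values.flatten).uniq
  count = nodes.size.to_f
  scores = nodes.to_h { |node| [node, 1.0 / count] }

  iterations.times do
    # Rank held by nodes without outgoing edges is spread evenly over all nodes.
    dangling = nodes.sum { |node| edges.fetch(node, []).empty? ? scores[node] : 0.0 }

    incoming = Hash.new(0.0)
    edges.each do |source, targets|
      targets.each { |target| incoming[target] += scores[source] / targets.size }
    end

    scores = nodes.to_h do |node|
      [node, (1.0 - damping) / count + damping * (incoming[node] + dangling / count)]
    end
  end
  scores
end

sample_edges = {
  "OrdersController" => %w[Order User],
  "CheckoutService"  => ["Order"],
  "User"             => ["Order"],
  "Order"            => []
}

ranks = pagerank_sketch(sample_edges)
ranks.sort_by { |_id, score| -score }.each { |id, score| puts format("%-18s %.3f", id, score) }
```

In this sample Order ranks first because three other units depend on it.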
GraphAnalyzer computes read-only structural reports from the graph:
| Metric | What it means |
|---|---|
| Orphans | Units with no dependents — potential dead code or public entry points. Framework sources are excluded (they're naturally unreferenced in the reverse index). |
| Dead ends | Units with no dependencies — self-contained leaf nodes (value objects, standalone utilities) |
| Hubs | Units with many dependents — architectural bottlenecks; changes here have high blast radius |
| Cycles | Circular dependencies — A→B→C→A. Detected via DFS. |
| Bridges | Edges whose removal would disconnect the graph — high-risk structural connections |
Analysis results are written to graph_analysis.json and surfaced in SUMMARY.md.
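Three of these metrics are simple enough to sketch on a plain adjacency hash. The helper names are hypothetical, and the DFS below reports one cycle per back edge rather than enumerating every simple cycle; GraphAnalyzer's actual output may differ.

```ruby
require "set"

# Units nothing else depends on.
def orphans(edges)
  referenced = edges.values.flatten.to_set
  edges.keys.reject { |node| referenced.include?(node) }
end

# Units that depend on nothing.
def dead_ends(edges)
  edges.select { |_node, targets| targets.empty? }.keys
end

# DFS with three states: reaching a node that is still on the current path closes a cycle.
def find_cycles(edges)
  state = {} # nil = unvisited, :active = on current path, :done = finished
  path = []
  cycles = []

  visit = lambda do |node|
    state[node] = :active
    path << node
    edges.fetch(node, []).each do |target|
      if state[target] == :active
        cycles << (path[path.index(target)..] + [target])
      elsif state[target].nil?
        visit.call(target)
      end
    end
    path.pop
    state[node] = :done
  end

  edges.each_key { |node| visit.call(node) if state[node].nil? }
  cycles
end

analysis_edges = {
  "A" => ["B"],
  "B" => ["C"],
  "C" => ["A"],      # closes the cycle A -> B -> C -> A
  "Report" => ["A"], # nothing depends on Report
  "Money" => []      # depends on nothing
}

puts find_cycles(analysis_edges).inspect
```

Note that a unit such as Money, with no edges at all, shows up as both an orphan and a dead end.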
Retrieval is a four-stage pipeline coordinated by Retriever:
Query → [Classify] → [Execute] → [Rank] → [Assemble] → Context string
Classify (QueryClassifier)

Classifies the query to determine:

- Target unit type: :model, :controller, :service, etc. (or nil for cross-type)

Classification determines which search strategy to use and whether framework source context is relevant.
Execute (SearchExecutor)

Executes one or more search strategies based on classification:
| Strategy | When used | How |
|---|---|---|
| Vector | Semantic/conceptual queries | Embeds the query and finds nearest neighbors |
| Keyword | Identifier lookups by name | Exact or prefix match on identifier field |
| Graph | "What uses X?" / "What does X depend on?" | Traverses forward/reverse edges from a starting node |
| Hybrid | Default for ambiguous queries | Combines vector + keyword, re-ranked via RRF |
Rank (Ranker)

Re-ranks candidates using a weighted combination of multiple signals, including recency (last_modified).

Uses Reciprocal Rank Fusion (RRF) to merge ranked lists from multiple search strategies without score normalization.
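Reciprocal Rank Fusion scores each candidate by summing 1 / (k + rank) over every list it appears in, which is why no score normalization is needed. The sketch below uses k = 60, the value commonly paired with RRF; whether Woods uses the same constant is an assumption, and `rrf_merge` is a hypothetical name.

```ruby
# Merge several ranked lists of identifiers into one list using RRF.
def rrf_merge(ranked_lists, k: 60)
  scores = Hash.new(0.0)
  ranked_lists.each do |list|
    list.each_with_index do |id, index|
      scores[id] += 1.0 / (k + index + 1) # ranks are 1-based
    end
  end
  # Highest fused score first; ties broken by identifier for a stable order.
  scores.sort_by { |id, score| [-score, id] }.map(&:first)
end

vector_hits  = %w[CheckoutService Order User]
keyword_hits = %w[Order OrdersController]

puts rrf_merge([vector_hits, keyword_hits]).inspect
```

Order comes out on top because it appears in both lists, even though it is not first in the vector results.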
Assemble (ContextAssembler)

Allocates token budget across layers:

```
Token Budget Allocation:
├── 10%  Structural overview ("Codebase: 42 units — 10 models, 5 controllers, ...")
├── 50%  Primary relevant units (highest-ranked candidates)
├── 25%  Supporting context (direct dependencies of primary units)
└── 15%  Framework reference (Rails source, when query intent = :framework)
```
Units that exceed the budget are truncated to their first semantic chunk. The assembled context string is then optionally post-processed by a formatter (context_format: :claude, :markdown, :plain, :json).
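The budget split and the first-chunk fallback can be sketched as follows. The percentages come from the allocation above; the greedy fill order, the hash shape, and the names `allocate_budget` and `fill_layer` are assumptions for illustration.

```ruby
# Layer shares in percent, as listed above.
BUDGET_SHARES = { structural: 10, primary: 50, supporting: 25, framework: 15 }.freeze

# Integer arithmetic avoids floating-point rounding surprises.
def allocate_budget(total_tokens)
  BUDGET_SHARES.transform_values { |percent| total_tokens * percent / 100 }
end

# Greedy fill: take a unit whole while it fits, otherwise try its first chunk.
def fill_layer(units, budget)
  remaining = budget
  picked = []
  units.each do |unit|
    if unit[:tokens] <= remaining
      picked << { id: unit[:id], form: :full }
      remaining -= unit[:tokens]
    elsif unit[:first_chunk_tokens] <= remaining
      picked << { id: unit[:id], form: :first_chunk }
      remaining -= unit[:first_chunk_tokens]
    end
  end
  picked
end

budget = allocate_budget(8000)
primary_units = [
  { id: "User",  tokens: 3000, first_chunk_tokens: 400 },
  { id: "Order", tokens: 1500, first_chunk_tokens: 300 }
]

puts budget.inspect
puts fill_layer(primary_units, budget[:primary]).inspect
```

With an 8000-token budget the primary layer gets 4000 tokens, so User fits whole and Order falls back to its first chunk.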
Woods uses three independent store abstractions:
| Store | Purpose | Available Backends |
|---|---|---|
| VectorStore | Embedding vectors for semantic search | In-memory (dev/test), pgvector (PostgreSQL), Qdrant |
| MetadataStore | Unit metadata for keyword search and type filtering | In-memory, SQLite, pgvector (JSON columns) |
| GraphStore | Dependency graph for graph-based traversal | In-memory, JSON file (via dependency_graph.json) |
The gem is backend-agnostic by design. MySQL and PostgreSQL have different JSON querying, indexing, and CTE syntax — no backend-specific SQL is written into the core.
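One way to picture a store abstraction is a small duck-typed interface that any backend can satisfy. The class below is a hypothetical in-memory vector store with `upsert` and `search`; the method names and the cosine-similarity ranking are assumptions, not the actual VectorStore API.

```ruby
# Hypothetical in-memory backend: anything responding to #upsert and #search could replace it.
class InMemoryVectorStoreSketch
  def initialize
    @vectors = {}
  end

  def upsert(id, vector)
    @vectors[id] = vector
  end

  # Returns [id, similarity] pairs, most similar first.
  def search(query, limit: 5)
    @vectors
      .map { |id, vector| [id, cosine(query, vector)] }
      .sort_by { |_id, score| -score }
      .first(limit)
  end

  private

  def cosine(left, right)
    dot = left.zip(right).sum { |x, y| x * y }
    norm = Math.sqrt(left.sum { |x| x * x }) * Math.sqrt(right.sum { |x| x * x })
    norm.zero? ? 0.0 : dot / norm
  end
end

store = InMemoryVectorStoreSketch.new
store.upsert("User", [1.0, 0.0])
store.upsert("Order", [0.0, 1.0])
store.upsert("CheckoutService", [0.7, 0.7])

puts store.search([1.0, 0.1], limit: 2).map(&:first).inspect
```

Because callers only see `upsert` and `search`, no SQL dialect leaks into the code that uses the store.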
```ruby
# Local development (SQLite + in-memory vector)
Woods.configure_with_preset(:local)

# PostgreSQL with pgvector
Woods.configure_with_preset(:postgresql)

# Production (Qdrant for vectors, pgvector for metadata)
Woods.configure_with_preset(:production)
```
Or wire backends manually:
```ruby
Woods.configure do |config|
  config.vector_store = :qdrant
  config.vector_store_options = { url: "http://localhost:6333", collection: "woods" }
  config.metadata_store = :sqlite
  config.embedding_provider = :openai
  config.embedding_model = "text-embedding-3-small"
end
```
The two servers have fundamentally different runtime requirements:
Index Server (woods-mcp)

27 tools, 2 resources, 2 templates. Reads pre-extracted JSON. No Rails boot required.
Starts with a path to the extraction output directory and reads from it:
```bash
woods-mcp-start /path/to/rails-app/tmp/woods
```
Use the Index Server for:
The Index Server is safe to run anywhere — it has no database connection and makes no writes to the Rails application.
Console Server (woods-console-mcp)

31 tools, 4 tiers. Bridges to a live Rails process. Runs inside the app.
Starts via rake task inside the Rails app (or docker compose exec):
```bash
bundle exec rake woods:console
```
Use the Console Server for:

- Queries against live data (User.where(...) with schema awareness)

All Console Server queries run inside a rolled-back transaction (SafeContext). SQL is validated by SqlValidator (rejects DML/DDL at the string level) before any database interaction. Writes are silently discarded by the rollback; this is intentional defense-in-depth. Tier 4 tools that need to actually write require explicit human-in-the-loop confirmation.
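String-level rejection of DML/DDL can be sketched with a keyword check. The keyword list, the allowed statement prefixes, and the name `read_only_sql?` are assumptions; the actual SqlValidator rules are not described here. The check is deliberately conservative and would also reject a harmless query whose string literal contains one of the keywords.

```ruby
# Keywords treated as writes or schema changes (assumed list).
FORBIDDEN_SQL = /\b(INSERT|UPDATE|DELETE|DROP|ALTER|TRUNCATE|CREATE|GRANT|REVOKE)\b/i

def read_only_sql?(sql)
  statement = sql.strip.sub(/;\s*\z/, "")  # tolerate one trailing semicolon
  return false if statement.include?(";")  # more than one statement
  return false unless statement.match?(/\A(SELECT|WITH|EXPLAIN)\b/i)

  !statement.match?(FORBIDDEN_SQL)
end

[
  "SELECT * FROM users WHERE updated_at > '2024-01-01'",
  "DELETE FROM users",
  "SELECT 1; DROP TABLE users",
  "WITH gone AS (DELETE FROM users RETURNING *) SELECT * FROM gone"
].each { |sql| puts "#{read_only_sql?(sql)}  #{sql}" }
```

The word-boundary match keeps column names such as updated_at from tripping the UPDATE keyword. A check like this is only a first gate, which is why the rollback described above still matters.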
| Task | Server |
|---|---|
| Find the User model source | Index |
| What jobs does CheckoutService enqueue? | Index |
| How many pending orders are in the database? | Console |
| What does our middleware stack look like? | Index |
| Run a query against the live database | Console |
| Trigger a re-extraction | Index |
| Check Sidekiq queue depth | Console |
Large units are split into semantic chunks before embedding. The SemanticChunker is type-aware — it doesn't split on arbitrary token counts.
Models are split into purpose-specific sections:
- summary: class declaration, table info, concerns list
- associations: all has_many, belongs_to, has_one, HABTM
- callbacks: all before/after/around hooks with side-effects
- validations: all validates and validate calls
- scopes: named scopes
- methods: remaining public and private methods
Each chunk includes a header with the unit's identifier, type, and file path so it's self-contained when retrieved without the parent.
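A rough sketch of purpose-specific splitting for a model follows. It classifies lines with regular expressions, which is a simplification chosen for brevity; how SemanticChunker actually identifies sections is not described here, and `SECTION_PATTERNS`, `section_chunks`, and the header format are hypothetical.

```ruby
# Line patterns for four of the sections listed above (assumed, simplified).
SECTION_PATTERNS = {
  associations: /\A\s*(has_many|belongs_to|has_one|has_and_belongs_to_many)\b/,
  callbacks:    /\A\s*(before|after|around)_\w+/,
  validations:  /\A\s*validates?(_\w+)?\b/,
  scopes:       /\A\s*scope\b/
}.freeze

# Groups matching lines per section and prefixes each chunk with a self-describing header.
def section_chunks(identifier, file_path, source)
  sections = Hash.new { |hash, key| hash[key] = [] }
  source.each_line do |line|
    name, = SECTION_PATTERNS.find { |_name, pattern| line.match?(pattern) }
    sections[name] << line if name
  end
  header = "# #{identifier} (model) #{file_path}\n"
  sections.transform_values { |lines| header + lines.join }
end

model_source = <<~RUBY
  class User < ApplicationRecord
    has_many :orders
    validates :email, presence: true
    before_save :normalize_email
    scope :active, -> { where(active: true) }
  end
RUBY

model_chunks = section_chunks("User", "app/models/user.rb", model_source)
model_chunks.each { |name, content| puts "[#{name}]\n#{content}\n" }
```

Each chunk carries the identifier and path in its header, so an associations chunk retrieved on its own still says which model it belongs to.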
Controllers chunk per-action:
- summary: class declaration, before_action filters, layout
- `<action>`: each public action method with its applicable filters and route context
This matches how queries actually come in: "how does the create action work?" retrieves only the create chunk and the filter context, not the entire controller.
Units below 200 estimated tokens stay as a single :whole chunk. Above that, the semantic chunker applies type-specific splitting. Units that are still large after splitting use the fallback build_default_chunks method (line-based splitting with a 1500-token limit per chunk).
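The size thresholds can be sketched as follows. The sketch skips the type-specific splitting step and goes straight from the :whole check to line-based packing, so it only illustrates the 200-token and 1500-token limits; `line_chunks` and `chunk_unit` are hypothetical names, not build_default_chunks itself.

```ruby
TOKEN_RATIO = 4.0        # chars per token, as used for estimated_tokens
WHOLE_THRESHOLD = 200    # below this, keep the unit as one :whole chunk
MAX_CHUNK_TOKENS = 1500  # limit for line-based fallback chunks

def estimate_tokens(text)
  (text.length / TOKEN_RATIO).ceil
end

# Packs consecutive lines into a chunk until the next line would exceed the limit.
def line_chunks(source, limit: MAX_CHUNK_TOKENS)
  chunks = []
  current = +""
  source.each_line do |line|
    if !current.empty? && estimate_tokens(current + line) > limit
      chunks << current
      current = +""
    end
    current << line
  end
  chunks << current unless current.empty?
  chunks
end

def chunk_unit(source)
  return [{ name: :whole, content: source }] if estimate_tokens(source) < WHOLE_THRESHOLD

  line_chunks(source).each_with_index.map do |content, index|
    { name: :"part_#{index + 1}", content: content }
  end
end

small = "class Money\nend\n"
large = "x = 1\n" * 2000 # 12,000 characters, roughly 3,000 tokens

puts chunk_unit(small).map { |chunk| chunk[:name] }.inspect
puts chunk_unit(large).map { |chunk| chunk[:name] }.inspect
```

Splitting on line boundaries keeps every chunk syntactically readable, though it cannot guarantee that a method or section stays in one piece.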
Token-count splits break semantic units arbitrarily — an associations section split mid-way loses context. Semantic splits align with how the code is actually understood: "tell me about the associations" maps to the associations chunk, not to arbitrary line ranges 150–300.