docs/EXTRACTOR_REFERENCE.md
Woods ships 34 extractors — one for each meaningful category of Rails code. This doc covers what each extractor captures, how to configure them, and the shape of the data they produce.
A full extraction (bundle exec rake woods:extract) runs five numbered phases, plus a dedupe pass (Phase 1.5) between extraction and resolution:
- Phase 1 (Extract): All 34 extractors run, producing ExtractedUnit objects
- Phase 1.5 (Dedupe): Duplicate identifiers are dropped (engines can double-register routes)
- Phase 2 (Resolve): Reverse dependency edges are built (A depends on B → B gets a dependent)
- Phase 3 (Graph): PageRank + structural analysis (orphans, hubs, cycles, bridges)
- Phase 4 (Enrich): Git metadata added (last author, change frequency, recent commits)
- Phase 5 (Write): One JSON file per unit, _index.json per type, dependency_graph.json, SUMMARY.md
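
The dedupe and resolve phases can be sketched in a few lines. This is a minimal illustration, assuming a simplified Unit struct and hypothetical dedupe and resolve_dependents helpers rather than Woods' real classes; only the edge shapes follow the field reference in this doc.

```ruby
# Hypothetical stand-in for ExtractedUnit; only the fields needed here.
Unit = Struct.new(:type, :identifier, :dependencies, :dependents, keyword_init: true)

# Phase 1.5 (sketch): keep the first unit seen for each identifier.
def dedupe(units)
  units.uniq(&:identifier)
end

# Phase 2 (sketch): for every forward edge A -> B, record A as a dependent of B.
def resolve_dependents(units)
  by_identifier = units.to_h { |unit| [unit.identifier, unit] }
  units.each do |unit|
    unit.dependencies.each do |edge|
      target = by_identifier[edge[:target]]
      next unless target # edge points outside the extracted set

      target.dependents << { type: unit.type, identifier: unit.identifier }
    end
  end
  units
end

units = dedupe([
  Unit.new(type: :model, identifier: "User", dependencies: [], dependents: []),
  Unit.new(type: :model, identifier: "Order",
           dependencies: [{ type: :model, target: "User", via: "belongs_to" }], dependents: []),
  Unit.new(type: :model, identifier: "Order", dependencies: [], dependents: []) # duplicate
])
resolve_dependents(units)
puts units.find { |unit| unit.identifier == "User" }.dependents.inspect
```

Indexing units by identifier first keeps the reverse pass linear in the number of edges.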
Extractors discover code one of two ways:
| Strategy | How it works | Examples |
|---|---|---|
| Class-based | ActiveRecord::Base.descendants, ApplicationController.descendants, etc. — requires eager_load! | ModelExtractor, ControllerExtractor, MailerExtractor |
| File-based | Scans conventional directories (app/services, db/migrate, etc.) — more robust for non-AR classes | ServiceExtractor, MigrationExtractor, ViewTemplateExtractor |
Some extractors combine both (e.g., JobExtractor scans directories first, then supplements with ApplicationJob.descendants).
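
The two strategies can be compared in plain Ruby. The sketch below is an assumption-laden illustration: BaseRecord and Invoice are hypothetical stand-ins for ActiveRecord::Base and a model, and ObjectSpace is used in place of Rails' descendants so the example runs without Rails.

```ruby
require "tmpdir"
require "fileutils"

# Class-based discovery (sketch): find loaded classes that inherit from a base.
# Only classes that are already loaded are visible, hence the eager_load! need.
def class_based_discovery(base)
  ObjectSpace.each_object(Class)
             .reject(&:singleton_class?)
             .select { |klass| klass < base }
end

# File-based discovery (sketch): glob conventional directories under a root.
# Missing directories simply contribute no files.
def file_based_discovery(root, directories)
  directories.flat_map { |dir| Dir.glob(File.join(root, dir, "**", "*.rb")) }.sort
end

class BaseRecord; end            # hypothetical stand-in for ActiveRecord::Base
class Invoice < BaseRecord; end  # hypothetical model

Dir.mktmpdir do |root|
  FileUtils.mkdir_p(File.join(root, "app/services"))
  File.write(File.join(root, "app/services/checkout_service.rb"), "class CheckoutService; end\n")

  puts class_based_discovery(BaseRecord).map(&:name).inspect
  files = file_based_discovery(root, %w[app/services app/interactors])
  puts files.map { |path| File.basename(path) }.inspect
end
```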
The orchestrator calls Rails.application.eager_load! once before extraction begins. If that fails with a NameError (common when app/graphql/ references an uninstalled gem), it falls back to per-directory loading via EXTRACTION_DIRECTORIES. This fallback covers the directories that matter for extraction.
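
The fallback flow might look roughly like the following. This is a sketch under stated assumptions: load_for_extraction and both callables are hypothetical, standing in for Rails.application.eager_load! and a per-directory loader, and the directory list is illustrative rather than the real EXTRACTION_DIRECTORIES.

```ruby
# Try a full eager load first; on NameError, load directories one by one.
# Returns a symbol describing which path was taken.
def load_for_extraction(eager_loader:, directory_loader:, directories:)
  eager_loader.call
  :eager_loaded
rescue NameError => e
  warn "eager_load! failed (#{e.message}); falling back to per-directory loading"
  directories.each { |dir| directory_loader.call(dir) }
  :per_directory
end

loaded = []
strategy = load_for_extraction(
  eager_loader: -> { raise NameError, "uninitialized constant GraphQL" },
  directory_loader: ->(dir) { loaded << dir },
  directories: %w[app/models app/controllers app/services]
)
puts strategy
puts loaded.inspect
```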
Every extractor returns Array<ExtractedUnit>. An ExtractedUnit is a self-contained snapshot of one code unit with source, metadata, and relationships. See ExtractedUnit Field Reference at the bottom of this doc.
What it captures: Every non-abstract ActiveRecord::Base descendant with concrete table-backed state. The source_code is the model's actual Ruby source plus all included concerns inlined below it as formatted comment blocks. Schema information (columns, types, indexes, foreign keys) is prepended as a header comment.
Key details:
- Uses ActiveRecord::Base.descendants for discovery (runtime introspection, not static parsing)
- include FooConcern references are resolved and the concern source is appended to source_code. Inlined concern names are recorded in metadata[:inlined_concerns]
- Callbacks captured: before_validation, after_validation, before_save, after_save, around_save, before_create, after_create, around_create, before_update, after_update, around_update, before_destroy, after_destroy, around_destroy, after_commit, after_rollback, after_initialize, after_find, after_touch
- CallbackAnalyzer: detects columns written (self.col =), jobs enqueued (perform_later), and services called
- Chunks: :summary, :associations, :callbacks, :validations, :scopes, :methods
- instance_methods(false) captures every method Rails generates dynamically: enum predicates (status_active?, status_pending?), association builders (build_profile, create_line_item!), attribute accessors, and dynamically registered scopes. Static analysis tools cannot see these methods because they only exist after Rails processes the DSL declarations at boot time

Edge cases:
- callback.options was removed in Rails 4.2, so the extractor uses @if/@unless ivars and ActionFilter duck-typing to extract :only/:except action lists
- Rails-generated callbacks (e.g., autosave_associated_records_for_comments) are filtered by a single combined regex to avoid noise

Example output (abbreviated):
{
"type": "model",
"identifier": "Order",
"file_path": "app/models/order.rb",
"namespace": null,
"source_code": "# == Schema Information\n# id :bigint\n# user_id :bigint\n# status :string\n# total_cents :integer\n#\nclass Order < ApplicationRecord\n belongs_to :user\n has_many :line_items\n ...\nend\n\n# ┌───────────────────────────────────────────────────────────────────┐\n# │ Included from: Auditable │\n# └───────────────────────────────────────────────────────────────────┘\n# module Auditable\n# ...\n# end\n# ──────────────────────── End Auditable ────────────────────────────",
"metadata": {
"associations": [
{ "type": "belongs_to", "name": "user", "target": "User" },
{ "type": "has_many", "name": "line_items", "target": "LineItem" }
],
"callbacks": [
{ "type": "before_save", "filter": "calculate_total", "kind": "before", "conditions": {},
"side_effects": { "columns_written": ["total_cents"], "jobs_enqueued": [], "services_called": [], "mailers_triggered": [], "database_reads": [], "operations": [] } },
{ "type": "after_commit", "filter": "send_confirmation_email", "kind": "after", "conditions": {},
"side_effects": { "columns_written": [], "jobs_enqueued": ["OrderConfirmationJob"], "services_called": [], "mailers_triggered": ["OrderMailer"], "database_reads": [], "operations": [] } }
],
"validations": [
{ "attribute": "status", "type": "inclusion", "options": { "in": ["pending", "paid", "shipped"] }, "conditions": {} }
],
"inlined_concerns": ["Auditable"]
},
"dependencies": [
{ "type": "model", "target": "User", "via": "belongs_to" },
{ "type": "model", "target": "LineItem", "via": "has_many" }
]
}
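
The side_effects entries above suggest pattern-based detection over callback method bodies. The sketch below shows one way that could work; the regexes and the detect_side_effects helper are assumptions made for illustration, not CallbackAnalyzer's actual implementation, and only two of the side-effect kinds are covered.

```ruby
# Illustrative patterns only; assumptions, not CallbackAnalyzer's own.
COLUMN_WRITE = /\bself\.(\w+)\s*=(?![=~>])/        # self.col = value, not ==, =~ or =>
JOB_ENQUEUE  = /\b([A-Z][\w:]*)\.perform_later\b/  # SomeJob.perform_later(...)

# Scan a method's source text and report detected side effects.
def detect_side_effects(method_source)
  {
    columns_written: method_source.scan(COLUMN_WRITE).flatten.uniq,
    jobs_enqueued: method_source.scan(JOB_ENQUEUE).flatten.uniq
  }
end

source = <<~RUBY
  def finalize
    self.total_cents = line_items.sum(&:price_cents)
    self.status = "paid" if self.total_cents == 0
    OrderConfirmationJob.perform_later(id)
  end
RUBY

puts detect_side_effects(source).inspect
```

A text scan like this is cheap but approximate: it cannot follow calls into other methods, which is one reason a real analyzer may do more.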
What it captures: Every ApplicationController and ActionController::API descendant. Route context is prepended to the source — each controller gets a header block showing which HTTP verb + path maps to each action. Before/after filter chains are resolved per action.
Key details:
- Uses ApplicationController.descendants (and ActionController::API.descendants if present)
- Route map is built from Rails.application.routes at initialization time
- Routes are prepended to source_code as a comment header, not just in metadata
- Each action becomes an :action chunk with its applicable filters and route

Edge cases:
- API-only controllers (ActionController::API descendants) are included when the gem is present

Example output (abbreviated):
{
"type": "controller",
"identifier": "OrdersController",
"metadata": {
"actions": ["index", "show", "create", "update"],
"routes": [
{ "verb": "GET", "path": "/orders", "action": "index" },
{ "verb": "POST", "path": "/orders", "action": "create" }
],
"filters": {
"before": ["authenticate_user!", "set_order"],
"after": ["track_event"]
}
}
}
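
Prepending route context to controller source could be done along these lines. The header layout and the prepend_route_header helper are assumptions chosen for illustration; the extractor's real header format may differ.

```ruby
# Build a comment header from route entries and prepend it to the source.
# Returns the source unchanged when the controller has no routes.
def prepend_route_header(source, routes)
  return source if routes.empty?

  header = routes.map do |route|
    format("#   %-6s %-12s => %s", route[:verb], route[:path], route[:action])
  end
  (["# == Routes"] + header + ["#", source]).join("\n")
end

source = "class OrdersController < ApplicationController\nend"
routes = [
  { verb: "GET", path: "/orders", action: "index" },
  { verb: "POST", path: "/orders", action: "create" }
]
puts prepend_route_header(source, routes)
```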
What it captures: Service objects, interactors, operations, commands, and use cases — the "business logic layer." Discovers them by scanning conventional directories for Ruby files.
Key details:
- Scans app/services, app/interactors, app/operations, app/commands, app/use_cases
- Extracts entry points (call, perform, execute, run), custom error classes, and dependency references

Example output (abbreviated):
{
"type": "service",
"identifier": "CheckoutService",
"metadata": {
"entry_points": ["call"],
"custom_errors": ["CheckoutService::PaymentFailedError"],
"dependencies": ["Order", "PaymentProcessor"]
}
}
What it captures: ActiveJob workers and Sidekiq workers. Scans job directories, then supplements with ApplicationJob.descendants for anything discovered at runtime but not found via files.
Key details:
- Scans app/jobs, app/workers, app/sidekiq

Example output (abbreviated):
{
"type": "job",
"identifier": "ProcessOrderJob",
"metadata": {
"queue": "default",
"retry_on": ["Stripe::APIError"],
"perform_args": ["order_id"],
"adapter": "ActiveJob"
}
}
What it captures: ActionMailer classes with their mailer actions, defaults, template paths, callbacks, and helper usage.
Key details:
- Class-based discovery (ActionMailer::Base.descendants)
- Captures default from:, layout, and per-action subject patterns

What it captures: Rails initializers (config/initializers/**/*.rb) and environment files (config/environments/*.rb). Also extracts a behavioral profile from the resolved Rails.application.config values at runtime.
Key details:
- BehavioralProfile introspects live config using respond_to?/defined? guards: a missing config section produces nil, not an error
- Produces a :behavioral_profile unit per environment

What it captures: Every route in the Rails routing table via Rails.application.routes.routes. Each route becomes its own ExtractedUnit.
Key details:
- Uses runtime introspection rather than parsing the config/routes.rb AST
- Identifier format: "VERB /path" (e.g., "POST /orders")

Example output (abbreviated):
{
"type": "route",
"identifier": "POST /orders",
"metadata": {
"controller": "orders",
"action": "create",
"route_name": "orders"
}
}
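
Building the "VERB /path" identifier can be sketched as below. The Route struct and route_identifier helper are hypothetical; stripping the optional (.:format) suffix that Rails appends to path specs is an assumption inferred from the "POST /orders" example above.

```ruby
# Hypothetical stand-in for a Rails route entry.
Route = Struct.new(:verb, :path, :controller, :action, keyword_init: true)

# Combine verb and path, dropping the trailing "(.:format)" segment.
def route_identifier(route)
  path = route.path.sub(/\(\.:format\)\z/, "")
  "#{route.verb} #{path}"
end

routes = [
  Route.new(verb: "POST", path: "/orders(.:format)", controller: "orders", action: "create"),
  Route.new(verb: "GET", path: "/orders/:id(.:format)", controller: "orders", action: "show"),
  Route.new(verb: "POST", path: "/orders(.:format)", controller: "orders", action: "create")
]

# The duplicate entry collapses, in the spirit of the dedupe phase.
puts routes.map { |route| route_identifier(route) }.uniq.inspect
```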
What it captures: The full Rack middleware stack as a single ordered unit. Useful for understanding request preprocessing and which middleware is active.
Key details:
What it captures: Phlex component classes (Phlex::HTML, Phlex::SVG subclasses) from app/components. Extracts slots, initialize parameters, sub-component references, Stimulus controller names, and route helper usage.
Key details:
- view_template method

What it captures: ViewComponent classes from app/components. Extracts slots, template paths, preview class references, and collection rendering support.
Key details:
- Template paths (e.g., ButtonComponent → button_component.html.erb)
- A preview class reference is recorded when <ComponentName>Preview is found in spec/components/previews/ or test/components/previews/

Edge cases:
- Phlex components and ViewComponents can share app/components, and the orchestrator uses separate extractors for each. A Phlex component won't be extracted by ViewComponentExtractor and vice versa (the filtering is by superclass, not file name)

What it captures: ERB view templates from app/views. Extracts render calls (partials and components), instance variable references, and helper method usage.
Key details:
What it captures: Decorator, presenter, and form object classes from app/decorators, app/presenters, and app/form_objects.
Key details:
- EXTRACTION_DIRECTORIES for eager loading

What it captures: ActiveSupport::Concern modules from app/models/concerns and app/controllers/concerns.
Key details:
- Scans app/models/concerns, app/controllers/concerns
- Captures the ClassMethods block, instance methods, and class methods added by the concern

What it captures: Plain Ruby objects in app/models that are not ActiveRecord (non-AR classes, excluding concerns).
Key details:
- Scans app/models for files that don't define an ActiveRecord::Base descendant
- app/models, domain structs

What it captures: Serializer classes for ActiveModelSerializers, Blueprinter, Alba, and Draper. Auto-detects which serialization gems are loaded.
Key details:
- Each gem is checked with defined? before attempting extraction

What it captures: Custom ActiveModel::Validator subclasses with their validation rules.
Key details:
- Captures validate method logic and the attribute being validated

What it captures: SimpleDelegator subclasses that wrap a model. Records the wrapped model class, all public methods, and the delegation chain.
What it captures: graphql-ruby types, mutations, queries, and resolvers. Produces four distinct unit types from one extractor.
Key details:
- Scans app/graphql with runtime introspection via GraphQL::Schema.types when available, falls back to file discovery
- Unit types: graphql_type, graphql_mutation, graphql_resolver, graphql_query
- Captures authorization (authorized?), and dependencies on models/services
- extract_graphql_file

Example output (abbreviated):
{
"type": "graphql_type",
"identifier": "Types::UserType",
"metadata": {
"fields": [
{ "name": "id", "type": "ID!", "description": null },
{ "name": "email", "type": "String!" }
],
"authorized_by": "pundit"
}
}
What it captures: Pundit policy classes with their action methods (index?, show?, create?, update?, destroy?, and custom predicates).
Key details:
- Policy-to-model mapping (UserPolicy → User)
- Captures the resolve method when present

What it captures: Domain policy classes (non-Pundit) with decision methods and eligibility rules. Covers plain Ruby objects used for authorization decisions.
Key details:
- Scans app/policies for files not identified as Pundit policies

What it captures: Mounted Rails engines via runtime introspection. Records mount points and route counts for each engine.
Key details:
- Uses Rails::Engine.subclasses at runtime, which finds both gem-mounted and in-repo engines

What it captures: Locale files from config/locales with the full translation key hierarchy.
Key details:
- Scans config/locales/**/*.{yml,yaml}

What it captures: ActionCable channel classes with stream subscriptions, subscribed/unsubscribed hooks, broadcast patterns, and action methods.
Key details:
- Uses ActionCable::Channel::Base.descendants
- subscribed, and any broadcast_to calls

What it captures: Scheduled job definitions from cron-style config files. Supports multiple scheduling backends.
Key details:
- Reads config/recurring.yml (Solid Queue), config/sidekiq_cron.yml (Sidekiq Cron), config/schedule.rb (Whenever)

What it captures: Rake tasks from lib/tasks/*.rake. Extracts namespaces, task names, descriptions, prerequisites (:depends_on), and the task body.
Key details:
- Parses .rake files statically, so no Rails boot is required for parsing
- Uses block_opener? for depth tracking; if/unless only match at line start to avoid counting trailing modifiers as blocks
- Handles nested namespaces (namespace :data do namespace :import do task :users)

What it captures: ActiveRecord migration files from db/migrate. Extracts DDL metadata, affected tables, risk indicators, and reversibility.
Key details:
- Scans db/migrate/*.rb
- Risk indicators include irreversible operations (remove_column without type) and execute calls with raw SQL
- Framework tables (schema_migrations, active_storage_blobs, etc.) are excluded from model dependency links

Example output (abbreviated):
{
"type": "migration",
"identifier": "AddStatusToOrders",
"metadata": {
"version": "20240115120000",
"tables_affected": ["orders"],
"operations": [
{ "type": "add_column", "table": "orders", "column": "status", "column_type": "string" }
],
"reversible": true,
"risk_level": "low"
}
}
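
Two of the risk indicators mentioned above can be approximated with a text scan. This is a rough sketch: migration_risk_indicators and its indicator names are hypothetical, and how Woods maps indicators to a risk_level is not shown here.

```ruby
# Rough text scan over migration source; a sketch, not a Ruby parser.
def migration_risk_indicators(source)
  indicators = []
  # remove_column with only table and column cannot be rolled back,
  # because Rails has no column type to restore.
  if source.match?(/^\s*remove_column\s+:\w+,\s*:\w+[ \t]*$/)
    indicators << :remove_column_without_type
  end
  # execute runs raw SQL that Rails cannot invert automatically.
  indicators << :raw_sql if source.match?(/^\s*execute[\s(]/)
  indicators
end

risky = <<~RUBY
  class DropLegacyFlag < ActiveRecord::Migration[7.1]
    def change
      remove_column :orders, :legacy_flag
      execute "UPDATE orders SET status = 'pending' WHERE status IS NULL"
    end
  end
RUBY

puts migration_risk_indicators(risky).inspect
```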
What it captures: SQL views from db/views following the Scenic gem convention.
Key details:
- View versions follow the Scenic filename convention (_vNN suffix)

What it captures: State machine DSL definitions using AASM, Statesman, or the state_machines gem.
Key details:
- Checks defined? for each DSL constant
- ScheduledJobExtractor): cannot be used in the incremental file-based dispatch map

What it captures: Event publish/subscribe patterns using ActiveSupport::Notifications or Wisper.
Key details:
- Collects publish/instrument calls, then subscribe/on calls, then merges them

What it captures: Cache usage patterns across controllers, models, and ERB view templates.
Key details:
- Scans controllers, models, and .erb view files
- Detects cache blocks, Rails.cache.fetch, expire_fragment, TTLs, and cache keys
- The file_type parameter on extract_caching_file defaults to nil (auto-detected from path)

What it captures: FactoryBot factory definitions including traits, associations, and lazy attribute blocks.
Key details:
- Scans spec/factories and test/factories

What it captures: Test file-to-subject mappings with test counts, describe/context hierarchy, and test framework detection.
Key details:
- Scans spec/ and test/ directories
- Maps test files to subjects by path (spec/models/user_spec.rb → User)
- Test files live outside app/ so no eager loading is needed

What it captures: Ruby files from lib/: utility modules, standalone libraries, and infrastructure code.
Key details:
- Excludes lib/tasks/ (covered by RakeTaskExtractor) and lib/generators/

What it captures: High-value Rails framework source and gem source files, pinned to the exact versions in Gemfile.lock.
Key details:
- Uses Gem.loaded_specs, so paths depend on the installed gem location
- Covers activerecord (associations, callbacks, validations, relation, enum, transactions), actionpack (controller metal, callbacks, rendering, redirecting), activesupport (callbacks, concern, configurable, delegation)
- Additional gems can be indexed with config.add_gem "devise", paths: [...]
- A question about what has_many supports returns the actual source for the installed Rails version

All 34 extractors run during a full extraction. The config.extractors array controls which unit types are considered by the retrieval pipeline (embedding and search scope), not which extractors run during extraction.
To customize the retrieval scope:
# config/initializers/woods.rb
Woods.configure do |config|
# Default retrieval scope (13 types)
config.extractors = %i[
models controllers services components view_components
jobs mailers graphql serializers managers policies validators
rails_source
]
# Add more types to retrieval scope
config.extractors += %i[concerns routes migrations]
# Or restrict to a focused subset
config.extractors = %i[models controllers services]
# Index additional gem source files
config.add_gem "devise", paths: ["lib/devise/models"], priority: :high
end
To add a custom gem to be indexed by RailsSourceExtractor:
config.add_gem "pundit", paths: ["lib/pundit"], priority: :medium
Every extractor produces ExtractedUnit objects with this schema:
| Field | Type | Description |
|---|---|---|
| type | Symbol | Unit category: :model, :controller, :service, :job, :mailer, :component, :view_component, :graphql_type, :graphql_mutation, :graphql_resolver, :graphql_query, :serializer, :manager, :policy, :validator, :concern, :route, :middleware, :i18n, :pundit_policy, :configuration, :engine, :view_template, :migration, :action_cable_channel, :scheduled_job, :rake_task, :state_machine, :event, :decorator, :database_view, :caching, :factory, :test_mapping, :rails_source, :poro, :lib |
| identifier | String | Unique key for this unit. Usually the class name (e.g., "User", "OrdersController") or a descriptive string for non-class units (e.g., "POST /orders") |
| file_path | String | Relative path to the source file (e.g., "app/models/user.rb"). Relative to Rails.root after normalization. |
| namespace | String or nil | Module namespace if the class is nested (e.g., "Admin" for Admin::DashboardController) |
| source_code | String | The full source code, potentially enriched: models have concerns inlined and schema prepended; controllers have a route context header prepended |
| metadata | Hash | Type-specific structured data: associations, callbacks, actions, fields, etc. Keys and structure vary by extractor |
| dependencies | Array<Hash> | Forward edges: [{ type: :model, target: "User", via: "belongs_to" }, ...] |
| dependents | Array<Hash> | Reverse edges: populated in the second pass. [{ type: :controller, identifier: "OrdersController" }, ...] |
| chunks | Array<Hash> | Semantic sub-sections for large units. Each chunk: { chunk_index:, identifier:, content:, content_hash:, estimated_tokens: } |
| estimated_tokens | Integer | Approximate token count for source_code + metadata.to_json using 4.0 chars/token. Computed, not stored. |
When written to disk, units also include:
| Field | Description |
|---|---|
| extracted_at | ISO 8601 timestamp of extraction |
| source_hash | SHA-256 of source_code for change detection |
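
The two computed values, estimated_tokens and source_hash, follow directly from their descriptions. The sketch below is a minimal illustration: the helper names mirror the field names but are not Woods' methods, and rounding up is an assumption since the doc only specifies 4.0 chars/token.

```ruby
require "digest"
require "json"

CHARS_PER_TOKEN = 4.0

# Approximate token count for source plus serialized metadata.
def estimated_tokens(source_code, metadata)
  ((source_code.length + metadata.to_json.length) / CHARS_PER_TOKEN).ceil
end

# SHA-256 hex digest of the source, usable for change detection.
def source_hash(source_code)
  Digest::SHA256.hexdigest(source_code)
end

source = "class User < ApplicationRecord\nend\n"
metadata = { associations: [], callbacks: [] }

puts estimated_tokens(source, metadata)
puts source_hash(source) == source_hash(source.dup) # same content, same hash
```

Comparing a stored source_hash against a freshly computed one is enough to decide whether a unit needs re-extraction.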
If the host app is a git repo, the following are added to metadata[:git] after extraction:
| Field | Description |
|---|---|
| last_modified | ISO 8601 date of last commit touching this file |
| last_author | Name of the author who last modified the file |
| commit_count | Total commit count for this file (past 365 days) |
| contributors | Top 5 contributors by commit count: [{ name:, commits: }] |
| recent_commits | Last 5 commits: [{ sha:, message:, date:, author: }] |
| change_frequency | :new, :hot, :active, :stable, or :dormant |
{
"type": "model",
"identifier": "User",
"file_path": "app/models/user.rb",
"namespace": null,
"source_code": "# == Schema Information\n# id :bigint not null, pk\n# email :string not null\n# created_at :datetime\n#\nclass User < ApplicationRecord\n has_many :orders\n validates :email, presence: true, uniqueness: true\nend\n\n# ┌───────────────────────────────────────────────────────────────────┐\n# │ Included from: Searchable │\n# └───────────────────────────────────────────────────────────────────┘\n# module Searchable\n# extend ActiveSupport::Concern\n# ...\n# end\n# ──────────────────────── End Searchable ───────────────────────────",
"metadata": {
"associations": [{ "type": "has_many", "name": "orders", "target": "Order" }],
"validations": [{ "attribute": "email", "type": "presence", "options": {}, "conditions": {} }, { "attribute": "email", "type": "uniqueness", "options": {}, "conditions": {} }],
"callbacks": [],
"scopes": [],
"inlined_concerns": ["Searchable"],
"git": {
"last_modified": "2024-11-20T14:32:00Z",
"last_author": "Alice",
"commit_count": 23,
"change_frequency": "active"
}
},
"dependencies": [
{ "type": "model", "target": "Order", "via": "has_many" }
],
"dependents": [
{ "type": "controller", "identifier": "UsersController" }
],
"chunks": [
{
"chunk_index": 0,
"identifier": "User#chunk_0",
"content": "# Unit: User (model)\n# File: app/models/user.rb\n# ---\nclass User < ApplicationRecord\n has_many :orders\n ...",
"content_hash": "abc123...",
"estimated_tokens": 312
}
],
"extracted_at": "2024-11-21T09:15:00Z",
"source_hash": "def456..."
}