Configuration Reference


All configuration is done via the Woods.configure block, typically in config/initializers/woods.rb.

ruby
Woods.configure do |config|
  config.output_dir = Rails.root.join('tmp/woods')
  config.max_context_tokens = 8000
  # ...
end

Common Configuration Patterns

CI-Only Extraction (Subset of Extractors)

ruby
Woods.configure do |config|
  config.output_dir = Rails.root.join('tmp/woods')

  # In CI, extract only models, controllers, and services for faster builds
  config.extractors = %i[models controllers services] if ENV['CI']
end

Docker Extraction with Environment-Based Paths

ruby
Woods.configure do |config|
  # Inside Docker, /app is the Rails root
  config.output_dir = ENV.fetch('WOODS_OUTPUT_DIR', Rails.root.join('tmp/woods'))
end

Environment-Conditional Embedding Provider

ruby
Woods.configure do |config|
  # Use OpenAI in production/CI where the API key is set,
  # fall back to Ollama for local development (free, no API key needed)
  if ENV['OPENAI_API_KEY']
    config.embedding_provider = :openai
    config.embedding_model = 'text-embedding-3-small'
    config.embedding_options = { api_key: ENV['OPENAI_API_KEY'] }
  else
    config.embedding_provider = :ollama
    config.embedding_options = {
      model: 'nomic-embed-text',
      host: ENV.fetch('OLLAMA_URL', 'http://localhost:11434')
    }
  end
end

Core Options

Role values:

  • User-settable: a direct Woods.configure { |c| c.<option> = ... } writes the value verbatim.
  • Preset-derived: set by Builder.preset_config(:local | :postgresql | :production) as a group. You can override any preset value afterwards in the configure block — later writes win.
  • Computed: derived from other options at read time (or at build_* time by Woods::Builder). Writing directly has no effect; change the inputs instead.

| Option | Type | Default | Role | Description |
| --- | --- | --- | --- | --- |
| output_dir | Pathname/String | Rails.root.join('tmp/woods') | user-settable | Directory where extracted data is written |
| extractors | Array<Symbol> | [:models, :controllers, :services, ...] | user-settable | List of enabled extractors (see Extractors below) |
| pretty_json | Boolean | true | user-settable | Format extracted JSON with indentation |
| max_context_tokens | Integer | 8000 | user-settable | Maximum tokens for retrieval context windows |
| similarity_threshold | Float | 0.7 | user-settable | Minimum similarity score (0.0-1.0) for retrieval results |
| context_format | Symbol | :markdown | user-settable | Output format for retrieval: :claude, :markdown, :plain, :json |
| include_framework_sources | Boolean | true | user-settable | Extract Rails and gem source code |
| concurrent_extraction | Boolean | false | user-settable | Enable parallel extraction (experimental) |
| vector_store / metadata_store / graph_store / embedding_provider | Symbol | | preset-derived | Adapter types. Set by presets; override individually to mix stacks. |
| chars-per-token ratio (used by ContextAssembler, TextPreparer, Builder, cost_model) | Float | 4.0 (OpenAI) / 1.5 (Ollama) | computed | Derived from the active embedding provider via Woods::TokenUtils.chars_per_token_for(...). Not directly user-settable; change embedding_provider to change the ratio. |
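To see why the computed ratio matters, the sketch below estimates how many tokens the same text consumes under each provider's ratio from the table above. The helper name estimate_tokens and the CHARS_PER_TOKEN constant are hypothetical and not part of Woods; the real value comes from Woods::TokenUtils.chars_per_token_for(...), whose internals are not shown here.

```ruby
# Hypothetical helper, not part of Woods: estimate a token count from a
# chars-per-token ratio. The ratios are the defaults listed in the table above.
CHARS_PER_TOKEN = { openai: 4.0, ollama: 1.5 }.freeze

def estimate_tokens(text, provider)
  ratio = CHARS_PER_TOKEN.fetch(provider) # raises KeyError for unknown providers
  (text.length / ratio).ceil
end

source = 'x' * 6000
puts estimate_tokens(source, :openai) # 1500
puts estimate_tokens(source, :ollama) # 4000
```

Under these assumptions, a 6000-character unit fits easily inside an 8000-token budget with either provider, but it uses well over twice as much of that budget with the Ollama ratio.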

Embedding Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| embedding_provider | Symbol | | Embedding backend: :openai or :ollama |
| embedding_model | String | 'text-embedding-3-small' | Model name for the embedding provider |
| embedding_options | Hash | nil | Provider-specific options (see below) |

OpenAI Embeddings

ruby
config.embedding_provider = :openai
config.embedding_model = 'text-embedding-3-small'
config.embedding_options = {
  api_key: ENV['OPENAI_API_KEY'],
  dimensions: 1536
}

Ollama Embeddings

ruby
config.embedding_provider = :ollama
config.embedding_options = {
  model: 'nomic-embed-text',
  host: 'http://localhost:11434'
  # num_ctx: 2048  # Optional override — see below
}

The provider reads model:, host:, and num_ctx: from embedding_options. num_ctx is auto-selected from a per-model registry (nomic-embed-text → 2048, bge-m3 → 8192, snowflake-arctic-embed2 → 8192, mxbai-embed-large → 512, all-minilm → 256). Unknown models fall back to 2048, matching Ollama's conservative embedding default. Set num_ctx: explicitly only when running a model with a known-larger native context that isn't in the registry yet.

Why num_ctx is capped at the native context. Ollama has an open regression (ollama/ollama#14186) where options.num_ctx does not lift the effective ceiling on /api/embed for models whose native context is smaller than the override. Woods advertises the native ceiling so the chunker sizes inputs to what Ollama will actually accept.
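The selection rule described above can be sketched as a lookup with a fallback. This is an illustration only: the constant and method names are hypothetical, and the assumption that an explicit num_ctx: always takes precedence over the registry is mine, not something the text confirms.

```ruby
# Illustrative sketch, not the Woods implementation. The registry values are
# the ones listed above; the names below are hypothetical.
NUM_CTX_REGISTRY = {
  'nomic-embed-text' => 2048,
  'bge-m3' => 8192,
  'snowflake-arctic-embed2' => 8192,
  'mxbai-embed-large' => 512,
  'all-minilm' => 256
}.freeze
DEFAULT_NUM_CTX = 2048 # fallback for models missing from the registry

def resolve_num_ctx(options)
  # Assumption: an explicit num_ctx: wins; otherwise use the registry,
  # then the conservative default.
  options[:num_ctx] || NUM_CTX_REGISTRY.fetch(options[:model], DEFAULT_NUM_CTX)
end

puts resolve_num_ctx(model: 'mxbai-embed-large')            # 512
puts resolve_num_ctx(model: 'some-unlisted-model')          # 2048
puts resolve_num_ctx(model: 'some-unlisted-model', num_ctx: 4096) # 4096
```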

Optional exact tokenization. Install the tokenizers gem alongside Woods to get BERT WordPiece token counting. Without it, Woods falls back to a chars/token ratio, which under-counts dense Ruby source (CamelCase constants, callback DSLs) and can silently over-pack chunks. Recommended for any Ollama setup.

ruby
# Gemfile (optional)
gem 'tokenizers', '~> 0.5'

See EMBEDDING_MODELS.md for the full model comparison and the procedure for adding a new model to the registry.

Storage Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| vector_store | Symbol | | Vector backend: :in_memory, :pgvector, :qdrant |
| vector_store_options | Hash | nil | Backend-specific connection options |
| metadata_store | Symbol | | Metadata backend: :in_memory, :sqlite |
| metadata_store_options | Hash | nil | Backend-specific options |
| graph_store | Symbol | | Graph backend: :in_memory |

pgvector (PostgreSQL)

ruby
config.vector_store = :pgvector
config.vector_store_options = {
  connection: ActiveRecord::Base.connection,
  dimensions: 1536
}

Requires the pgvector extension. Run the generator to create migrations:

bash
bundle exec rails generate woods:pgvector
bundle exec rails db:migrate

Qdrant

ruby
config.vector_store = :qdrant
config.vector_store_options = {
  url: 'http://localhost:6333',
  collection: 'woods',
  dimensions: 1536
}

SQLite Metadata

ruby
config.metadata_store = :sqlite
config.metadata_store_options = {
  database: Rails.root.join('tmp/woods/metadata.sqlite3').to_s
}

Requires the sqlite3 gem in your host bundle. Rails apps backed by MySQL or PostgreSQL won't have it by default — selecting :sqlite without it raises Woods::ConfigurationError with install instructions. For MySQL/Postgres-only hosts, use :in_memory (below) unless cross-process metadata persistence matters.
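One way to avoid the error on hosts without sqlite3 is to choose the metadata store based on whether the gem is installed. The sketch below is a minimal illustration using RubyGems' Gem::Specification.find_all_by_name; the helper names are hypothetical and this is not something Woods does for you.

```ruby
# Hypothetical helpers for an initializer; not part of Woods.
def sqlite3_available?
  # find_all_by_name returns an empty array when no version of the gem is
  # visible to the current process (under Bundler, to the current bundle).
  Gem::Specification.find_all_by_name('sqlite3').any?
end

def metadata_store_for(sqlite_available)
  sqlite_available ? :sqlite : :in_memory
end

puts metadata_store_for(sqlite3_available?)

# In config/initializers/woods.rb this could be used as:
#   config.metadata_store = metadata_store_for(sqlite3_available?)
```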

In-Memory Metadata

ruby
config.metadata_store = :in_memory

Pure-Ruby hash-backed store. No external dependencies, no persistence — vectors and metadata both live in the building process and die with it. The _index.json manifest under output_dir is the durable metadata for the index MCP server, so this is a reasonable default for hosts that don't bundle sqlite3.

Deployment Shapes

Woods supports three deployment shapes — pick the preset that matches yours.

| Shape | When | Preset |
| --- | --- | --- |
| Single-process | Embed + query in one Ruby VM (dev console, tests, rails runner scripts). Simplest. | :local |
| Shared filesystem | Rake task runs woods:embed, separate woods-mcp server reads the dump. Common with MCP sidecars. | :shared_filesystem |
| Distributed | Vectors live in an external service (pgvector / Qdrant) queried by both the embed process and the MCP server. Highest durability, highest ops cost. | :postgresql or :production |

Shape 2 setup (:shared_filesystem)

ruby
Woods.configure_with_preset(:shared_filesystem) do |config|
  config.output_dir = Rails.root.join('tmp/woods')
  config.embedding_options = {
    model: 'nomic-embed-text',
    host:  ENV.fetch('WOODS_OLLAMA_URL', 'http://localhost:11434')
  }
end

The embed run writes woods.json + dumps/<ISO8601>/vectors.bin + metadata.msgpack under output_dir. The MCP server reads them at boot — no sqlite3 gem required, no pgvector/Qdrant service needed. Dump retention defaults to the last 3 (configurable via config.dump_retention_count).
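As a rough illustration of what "keep the last 3" means for timestamped dump directories, the sketch below picks the directories that fall outside the retention window. How Woods actually prunes dumps is not described here; the method name and the exact timestamp format are assumptions.

```ruby
# Illustrative only: choose which dump directories fall outside a retention
# window. Assumes all names use the same ISO 8601 layout and time zone, so
# that plain string sorting matches chronological order.
def dumps_to_prune(dump_names, keep: 3)
  sorted = dump_names.sort
  sorted.first([sorted.size - keep, 0].max)
end

names = %w[
  2024-05-03T10:00:00Z
  2024-05-01T10:00:00Z
  2024-05-04T10:00:00Z
  2024-05-02T10:00:00Z
]
p dumps_to_prune(names) # ["2024-05-01T10:00:00Z"]
```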

Requirements:

  • output_dir must be set and readable by both the embed process and the MCP server.
  • The MCP server must know the same output_dir (pass via woods-mcp <DIR> or set WOODS_DIR).
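A quick pre-flight check for the two requirements above might look like the following. It is a sketch under stated assumptions: the helper names are hypothetical, and the precedence (command-line argument, then WOODS_DIR, then the current directory) is inferred from the Environment Variables table rather than taken from the woods-mcp source.

```ruby
# Hypothetical pre-flight helpers; not part of Woods or woods-mcp.
def resolve_woods_dir(cli_arg = nil, env = ENV)
  # Assumed precedence: explicit argument, then WOODS_DIR, then Dir.pwd.
  candidates = [cli_arg, env['WOODS_DIR']].compact.reject(&:empty?)
  candidates.first || Dir.pwd
end

def usable_output_dir?(dir)
  File.directory?(dir) && File.readable?(dir)
end

dir = resolve_woods_dir(ARGV.first)
warn "Output directory #{dir} is missing or unreadable" unless usable_output_dir?(dir)
```

Running the same check as the user that owns the embed process and as the user that owns the MCP server confirms that both can read the directory.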

Presets

For quick setup, use named presets that configure storage + embedding together:

ruby
# Local development — no external services needed (requires sqlite3 gem)
Woods.configure_with_preset(:local)
# → in_memory vectors, SQLite metadata, in_memory graph, Ollama embeddings

# Shared filesystem — rake embed → separate MCP server reads the dump.
# No sqlite3 gem needed; works on MySQL/Postgres-only hosts.
Woods.configure_with_preset(:shared_filesystem)
# → in_memory everything + Snapshotter-based persistence via output_dir

# PostgreSQL — requires pgvector extension and OpenAI API key
Woods.configure_with_preset(:postgresql)
# → pgvector vectors, SQLite metadata, in_memory graph, OpenAI embeddings

# Production — requires Qdrant server and OpenAI API key
Woods.configure_with_preset(:production)
# → Qdrant vectors, SQLite metadata, in_memory graph, OpenAI embeddings

Presets can be overridden:

ruby
Woods.configure_with_preset(:local) do |config|
  config.max_context_tokens = 16000
  config.embedding_model = 'mxbai-embed-large'
end

Pipeline Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| precompute_flows | Boolean | false | Pre-compute per-action request flow maps during extraction |
| extract_navigation_edges | Boolean | true | Extract link_to, redirect_to, and form_action navigation edges from views and controllers |
| enable_snapshots | Boolean | false | Enable temporal snapshots (requires migrations 004+005) |

Session Tracer Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| session_tracer_enabled | Boolean | false | Enable session tracing middleware |
| session_store | Object | nil | Store backend: FileStore, RedisStore, or SolidCacheStore |
| session_id_proc | Proc | nil | Custom proc to extract session ID from requests |
| session_exclude_paths | Array<String> | [] | Path patterns to exclude from tracing |

ruby
config.session_tracer_enabled = true
config.session_store = Woods::SessionTracer::FileStore.new(
  Rails.root.join('tmp/session_traces')
)
config.session_exclude_paths = ['/health', '/metrics', '/assets']

Gem Indexing

Register additional gems to extract source from:

ruby
config.add_gem 'devise', paths: ['lib/devise/models'], priority: :high
config.add_gem 'pundit', paths: ['lib/pundit'], priority: :medium
config.add_gem 'sidekiq', paths: ['lib/sidekiq/worker', 'lib/sidekiq/job'], priority: :high

Priority levels (:low, :medium, :high) affect retrieval ranking when framework source is relevant to a query.

Extractors

The extractors config accepts an array of symbols. Default set:

ruby
config.extractors = %i[
  models controllers services components view_components
  jobs mailers graphql serializers managers policies validators
  rails_source
]

Additional extractors available (not in default set):

| Symbol | Extractor | What it adds |
| --- | --- | --- |
| :concerns | ConcernExtractor | ActiveSupport::Concern modules |
| :routes | RouteExtractor | Rails routes (auto-included) |
| :middleware | MiddlewareExtractor | Rack middleware stack |
| :i18n | I18nExtractor | Locale translation files |
| :pundit_policies | PunditExtractor | Pundit authorization policies |
| :configurations | ConfigurationExtractor | Rails initializers + behavioral profile |
| :engines | EngineExtractor | Mounted Rails engines |
| :view_templates | ViewTemplateExtractor | ERB view templates |
| :migrations | MigrationExtractor | ActiveRecord migrations |
| :action_cable_channels | ActionCableExtractor | ActionCable channels |
| :scheduled_jobs | ScheduledJobExtractor | Recurring/scheduled jobs |
| :rake_tasks | RakeTaskExtractor | Rake task definitions |
| :state_machines | StateMachineExtractor | AASM/Statesman state machines |
| :events | EventExtractor | Event publish/subscribe patterns |
| :decorators | DecoratorExtractor | Decorators, presenters, form objects |
| :database_views | DatabaseViewExtractor | SQL views (Scenic) |
| :caching | CachingExtractor | Cache usage patterns |
| :factories | FactoryExtractor | FactoryBot factory definitions |
| :test_mappings | TestMappingExtractor | Test file → subject class mapping |
| :poros | PoroExtractor | Plain Ruby objects in app/models |
| :libs | LibExtractor | Ruby files in lib/ |
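Because config.extractors takes a plain array of symbols, opting into extra extractors comes down to array arithmetic. The sketch below builds the list outside of Woods so it runs on its own. DEFAULT_EXTRACTORS is a local copy of the default set shown above, not a constant exported by the gem.

```ruby
# Local copy of the default set listed above (hypothetical constant name).
DEFAULT_EXTRACTORS = %i[
  models controllers services components view_components
  jobs mailers graphql serializers managers policies validators
  rails_source
].freeze

# Opt into a few of the additional extractors from the table.
extra = %i[concerns migrations rake_tasks]
enabled = (DEFAULT_EXTRACTORS + extra).uniq

puts enabled.size # 16

# In config/initializers/woods.rb the result would be assigned as:
#   Woods.configure { |config| config.extractors = enabled }
```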

Console MCP Options

These options configure the Console MCP server (live database queries via MCP). See CONSOLE_MCP_SETUP.md for the full deployment guide including defense layers.

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| console_mcp_enabled | Boolean | false | Master switch. When false, the Railtie does not mount the Console MCP middleware. |
| console_mcp_token | String | ENV['WOODS_CONSOLE_MCP_TOKEN'] or nil | Bearer token required on every HTTP request. Required in production: the Railtie raises Woods::ConfigurationError when console_mcp_enabled is true but no token is set. In non-prod without a token the middleware refuses to mount (warn + skip). Generate with SecureRandom.hex(32). |
| console_mcp_allowed_origins | Array<String> | %w[http://localhost http://127.0.0.1 http://[::1]] | OriginGuard allowlist. Port is stripped before comparison, so http://localhost matches any localhost port. Override for tunneled / internal-dashboard access. |
| console_mcp_path | String | /mcp/console | URL path the Rack middleware responds on. |
| console_embedded_read_tools | Boolean | false | Enable the Tier 4 read tools console_sql / console_query in embedded (Rack) mode. Bridge-mode deployments always expose them. |
| console_blocked_tables | Array<String> | Woods::DEFAULT_CONSOLE_BLOCKED_TABLES | TableGate denylist (case-insensitive). Bare names match every schema; qualified names (schema.table) match exactly. |
| console_redacted_columns | Array<String> | Woods::DEFAULT_CONSOLE_REDACTED_COLUMNS | Column names whose values are replaced with [REDACTED] in responses. |
| console_redacted_key_values | Array<Hash> | [] | EAV-style redaction patterns. Each entry: { key_column:, value_column:, sensitive_keys: [] }. |
| console_credential_defense_enabled | Boolean | true | Layer 5 toggle for the CredentialScanner. Leave on unless you have a specific reason to disable. |
| console_credential_rotation_warning | Boolean | true | Emit a structured log warning when any Rails credentials file is modified after process start. |
| console_unsafe_eval_enabled | Boolean | false | Gate for console_eval. Off by default; no execution path is currently wired. |
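The port-stripping behaviour described for console_mcp_allowed_origins can be pictured with the standard library's URI parser. This is a sketch of the idea only, not the OriginGuard implementation: the method names are hypothetical, and comparing scheme plus host is an assumption about what "port is stripped" implies.

```ruby
require 'uri'

# Default allowlist from the table above.
ALLOWED_ORIGINS = %w[http://localhost http://127.0.0.1 http://[::1]].freeze

# Reduce an origin to scheme://host, dropping any port.
# Returns nil when the string cannot be parsed as a URI.
def strip_port(origin)
  uri = URI.parse(origin)
  "#{uri.scheme}://#{uri.host}"
rescue URI::InvalidURIError
  nil
end

def origin_allowed?(origin, allowlist = ALLOWED_ORIGINS)
  normalized = strip_port(origin.to_s)
  !normalized.nil? && allowlist.any? { |allowed| strip_port(allowed) == normalized }
end

puts origin_allowed?('http://localhost:3000') # true
puts origin_allowed?('http://[::1]:8080')     # true
puts origin_allowed?('http://evil.example')   # false
```

Under this reading, https://localhost would not match the default list, because the scheme differs. An entry for it would have to be added explicitly.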

Environment Variables

These variables are read by the gem and its MCP servers at runtime. They complement (not replace) the configure block — most exist so the MCP servers can self-configure when no explicit config is available.

| Variable | Read by | Default | Purpose |
| --- | --- | --- | --- |
| WOODS_DIR | woods-mcp bootstrapper | Dir.pwd | Path to the extraction output directory. |
| WOODS_SEARCH_MAX_SCAN | woods-mcp search tool | 500 | Cap on the number of unit files loaded during a phase-2 (metadata/source_code) search. When the cap is hit, the response includes partial: true. Set empty or unset to use the default. |
| WOODS_SNAPSHOTS | woods-mcp bootstrapper | unset | Set to "true" to force-enable temporal snapshot storage, even without a pre-existing SQLite database. |
| OPENAI_API_KEY | woods-mcp bootstrapper | | When set and no embedding provider is configured, the MCP server auto-enables OpenAI-backed semantic search with in-memory stores. |
| OLLAMA_BASE_URL | woods-mcp bootstrapper auto-detect | http://localhost:11434 | Base URL the bootstrapper probes (GET /api/tags, 500ms timeout) when no embedding provider is configured. A reachable Ollama instance auto-enables local semantic search. |
| OLLAMA_EMBED_MODEL | woods-mcp bootstrapper auto-detect | nomic-embed-text | Model to use when Ollama is auto-detected. |

The woods-mcp bootstrapper emits a one-line STDERR banner at startup indicating whether semantic search is enabled and which provider is active. If no key/instance is found, pattern search still works and codebase_retrieve surfaces an actionable fix message.
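The "empty or unset means default" rule for WOODS_SEARCH_MAX_SCAN in the table above can be sketched as follows. The method name is hypothetical, and raising on a non-numeric value is a choice made for this example; how woods-mcp treats malformed input is not documented here.

```ruby
DEFAULT_SEARCH_MAX_SCAN = 500 # default from the table above

# Hypothetical reader: treats unset and blank values as "use the default".
# Integer(raw, 10) raises ArgumentError on non-numeric input.
def search_max_scan(env = ENV)
  raw = env['WOODS_SEARCH_MAX_SCAN'].to_s.strip
  raw.empty? ? DEFAULT_SEARCH_MAX_SCAN : Integer(raw, 10)
end

puts search_max_scan({})                                 # 500
puts search_max_scan('WOODS_SEARCH_MAX_SCAN' => '')      # 500
puts search_max_scan('WOODS_SEARCH_MAX_SCAN' => '2000')  # 2000
```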

Database Compatibility

All storage options work with both MySQL and PostgreSQL hosts, with two caveats:

  • pgvector — PostgreSQL only (requires the pgvector extension)
  • SQLite metadata store — uses a standalone SQLite database file, independent of your app's database

See BACKEND_MATRIX.md for the full compatibility matrix.