Model Extraction Fixes — Signal Quality for AI Consumption

Context

The extraction layer produces ExtractedUnit JSON files that downstream systems (embedding pipeline, retrieval, context assembly, agent tools) consume. Five issues in model extraction were degrading the quality of that output in ways that directly impact AI usefulness. A sixth issue — broken callback condition extraction — was fixed alongside.

These fixes target the extraction-to-embedding boundary: the point where raw Rails introspection becomes structured data that embeddings and LLM context windows must work with. Every fix below improves either retrieval precision, token efficiency, or semantic accuracy of the model units that dominate most Rails codebases.

Fixes and Their Impact on AI Consumption

1. All Models Now Get Semantic Chunks

What changed: Removed the needs_chunking? gate that required estimated_tokens > 1500 before build_chunks would run. All models now produce semantic chunks unconditionally.

Why it matters for retrieval: The chunking strategy in CONTEXT_AND_CHUNKING.md explicitly states that models get split into summary, associations, callbacks, and validations chunks. These chunks are what the embedding pipeline indexes — a model with chunks: [] is invisible to chunk-level retrieval. A small model like Comment (3 LOC, one belongs_to) still has meaningful associations and validations that agents need to find.

Model	Before	After
Post (bare)	`chunks: []`	1 chunk (summary)
Comment (belongs_to + validation)	`chunks: []`	3 chunks (summary, associations, validations)

Downstream effect: Queries like "what validates Comment" or "Comment associations" now have dedicated chunk embeddings to match against, instead of requiring full-unit retrieval and LLM parsing.
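A minimal sketch of ungated chunk building, to make the before/after table concrete. The method name build_chunks comes from the change description; the hash-shaped unit, the chunk fields (type, content), and the summary text are assumptions for illustration, not the extractor's actual data model.

```ruby
# Hypothetical sketch: every model gets a summary chunk, plus one chunk per
# non-empty section. There is no estimated_tokens gate in front of this.
def build_chunks(unit)
  chunks = [{ type: "summary", content: "#{unit[:name]} model" }]

  %i[associations callbacks validations].each do |section|
    entries = unit.fetch(section, [])
    next if entries.empty? # skip sections the model does not use

    chunks << { type: section.to_s, content: entries.join("\n") }
  end

  chunks
end

# Illustrative inputs mirroring the table above.
post = { name: "Post" }
comment = {
  name: "Comment",
  associations: ["belongs_to :post"],
  validations: ["validates :body, presence: true"]
}

post_chunk_types = build_chunks(post).map { |chunk| chunk[:type] }
comment_chunk_types = build_chunks(comment).map { |chunk| chunk[:type] }
```

In this sketch a bare model still yields its summary chunk, so it is never absent from a chunk-level index.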

2. Validation and Callback Conditions Are Serializable

What changed:

Validation conditions now pass through condition_label(): Symbol → ":name", Proc → "Proc", String → as-is.
Callback conditions are now extracted via the @if/@unless instance variables instead of the non-existent cb.options[:if] (removed in Rails 4.2).

Why it matters for context windows: Raw Proc#inspect output like "#<Proc:0x0000000122a828c0 /path/to/file.rb:42>" wastes tokens and confuses LLMs. Memory addresses are meaningless noise. "Proc" communicates "there's a conditional" without the garbage. Symbols like :published? are even better — they tell the LLM exactly which method gates the behavior.

The callback fix also eliminates a NoMethodError that the rescue block was swallowing, which left callback conditions nil on every Rails 7+ app without any visible failure.

Field	Before	After
Validation `:if`	`"#<Proc:0x0000000122a828c0...>"`	`["Proc"]`
Validation `:if` (symbol)	`[:published?]` (worked)	`[":published?"]` (consistent string format)
Callback `:if`	`nil` (broken — `cb.options` doesn't exist)	`[":published?"]` or `["Proc"]`
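The mapping above can be sketched as follows. condition_label is the name used in this document; the fallback branch, the callback_conditions helper, and the FakeCallback stand-in (used so the example runs without Rails) are assumptions for illustration.

```ruby
# Sketch of the serializable condition mapping described above.
def condition_label(condition)
  case condition
  when Symbol then ":#{condition}" # names the method that gates the behavior
  when Proc   then "Proc"          # no memory address, no file path
  when String then condition       # passed through as-is
  else condition.class.name        # assumption: fallback for other types
  end
end

# Hypothetical helper: read conditions from the @if/@unless instance
# variables rather than from a non-existent options hash.
def callback_conditions(callback, kind)
  raw = callback.instance_variable_get("@#{kind}")
  Array(raw).map { |condition| condition_label(condition) }
end

# Stand-in for a Rails callback object, for illustration only.
class FakeCallback
  def initialize(if_conditions: [], unless_conditions: [])
    @if = if_conditions
    @unless = unless_conditions
  end
end

callback = FakeCallback.new(if_conditions: [:published?, -> { true }])
if_labels = callback_conditions(callback, :if)
unless_labels = callback_conditions(callback, :unless)
```

Returning plain strings keeps the output JSON-safe and stable across runs, since no object identity leaks into the serialized unit.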

3. Implicit belongs_to Validators Tagged

What changed: Validators auto-generated by belongs_to associations are now tagged with implicit_belongs_to: true.

Why it matters for agents: When an agent asks "what validations did the developer add to Comment?", implicit framework-generated validators (every belongs_to adds a presence validator by default) are noise. The tag lets retrieval and context formatting distinguish developer intent from framework defaults. An agent tool can filter these out or present them separately.
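On the consuming side, separating the two groups could look like the sketch below. Only the implicit_belongs_to key comes from this document; the other validation fields (attribute, type) and the split_validations helper are hypothetical.

```ruby
# Hypothetical consumer-side helper: split extracted validations into
# developer-written ones and framework-generated ones using the tag.
def split_validations(validations)
  implicit, explicit = validations.partition do |validation|
    validation["implicit_belongs_to"] == true
  end
  { developer: explicit, framework: implicit }
end

# Illustrative extracted metadata for a Comment model.
validations = [
  { "attribute" => "post", "type" => "presence", "implicit_belongs_to" => true },
  { "attribute" => "body", "type" => "presence" }
]

groups = split_validations(validations)
developer_attributes = groups[:developer].map { |validation| validation["attribute"] }
framework_attributes = groups[:framework].map { |validation| validation["attribute"] }
```

An agent tool could answer "what validations did the developer add?" from the developer group alone and mention the framework group separately if asked.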

4. STI Detection Requires Type Column

What changed:

is_sti_base now requires both descends_from_active_record? AND the inheritance_column (usually type) to exist in the table.
Added is_sti_child field for completeness.

Why it matters for retrieval accuracy: descends_from_active_record? returns true for every model that directly inherits from ApplicationRecord — which is almost every model. The old output marked Post, Comment, User, etc. all as is_sti_base: true. This is semantically wrong and would cause retrieval queries about STI ("which models use single-table inheritance?") to return the entire codebase. Now only models with an actual type column qualify.
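The tightened check can be sketched with a stand-in model class. descends_from_active_record?, inheritance_column, and column_names are real ActiveRecord class methods; ModelDouble, the sti_flags helper, and the exact is_sti_child rule are assumptions, since this document does not spell out how the child flag is computed.

```ruby
# Stand-in for an ActiveRecord model class so the sketch runs without Rails.
ModelDouble = Struct.new(:name, :descends, :inheritance_column, :column_names,
                         keyword_init: true) do
  def descends_from_active_record?
    descends
  end
end

# Hypothetical helper: a model is only an STI base if the inheritance
# column actually exists in its table.
def sti_flags(model)
  has_type_column = model.column_names.include?(model.inheritance_column)
  {
    is_sti_base: model.descends_from_active_record? && has_type_column,
    # Assumption: a child is a model that does not descend directly from
    # the abstract base and lives in a table with the inheritance column.
    is_sti_child: !model.descends_from_active_record? && has_type_column
  }
end

post = ModelDouble.new(name: "Post", descends: true,
                       inheritance_column: "type", column_names: %w[id title])
vehicle = ModelDouble.new(name: "Vehicle", descends: true,
                          inheritance_column: "type", column_names: %w[id type])
car = ModelDouble.new(name: "Car", descends: false,
                      inheritance_column: "type", column_names: %w[id type])

post_flags = sti_flags(post)
vehicle_flags = sti_flags(vehicle)
car_flags = sti_flags(car)
```

Under these assumptions an ordinary model such as Post is neither base nor child, so an STI query no longer matches it.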

5. Framework-Generated Methods Filtered

What changed: Added AR_INTERNAL_METHOD_PATTERNS constant that filters out:

_run_save_callbacks, _validators, _reflections (underscore-prefixed internals)
autosave_associated_records_for_* (generated per-association)
validate_associated_records_for_* (generated per-association)
before_add_for_*, after_remove_for_* (collection callbacks)

Why it matters for token budgets: A model with 5 associations generates 10+ internal methods (autosave_associated_records_for_comments, validate_associated_records_for_comments, etc.). These methods exist in the instance_methods array, consuming embedding tokens and diluting the signal of actual developer-defined methods. Filtering them reduces noise in the API surface metadata and makes the instance methods list useful for "what can I call on this model?" queries.
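A minimal sketch of the filter, assuming the constant is a list of regular expressions. The constant name comes from this document, but the patterns below are inferred from the listed method names and are not the actual definition; the collection-callback pattern also covers after_add_for_* and before_remove_for_*, which is an assumption beyond the two examples listed. The developer_methods helper is hypothetical.

```ruby
# Assumed contents, inferred from the method names listed above.
AR_INTERNAL_METHOD_PATTERNS = [
  /\A_/,                                   # underscore-prefixed internals
  /\Aautosave_associated_records_for_/,    # generated per association
  /\Avalidate_associated_records_for_/,    # generated per association
  /\A(before|after)_(add|remove)_for_/     # collection callbacks
].freeze

# Hypothetical helper: keep only methods that match none of the patterns.
def developer_methods(method_names)
  method_names.reject do |name|
    AR_INTERNAL_METHOD_PATTERNS.any? { |pattern| pattern.match?(name.to_s) }
  end
end

raw_methods = %i[
  publish!
  _run_save_callbacks
  autosave_associated_records_for_comments
  validate_associated_records_for_comments
  before_add_for_comments
  after_remove_for_comments
  word_count
]

filtered_methods = developer_methods(raw_methods)
```

Anchoring each pattern at the start of the name avoids dropping developer methods that merely contain one of these fragments.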

6. Token Estimation Includes Metadata

What changed: estimated_tokens now sums source_code.length / 4 and metadata.to_json.length / 4. Empty metadata ({}) adds nothing.

Why it matters for chunking and context assembly: The token budget system described in RETRIEVAL_ARCHITECTURE.md uses estimated_tokens to decide how many units fit in a context window. Source-only estimation undercounted the host app's models by roughly 3x (see the table below). A model with 10 associations, 15 validations, and 20 callbacks carries substantial metadata that the old estimate ignored. Accurate token estimates prevent context window overflow during assembly.

Model	Before (source only)	After (source + metadata)
Post	109 tokens	351 tokens
Comment	136 tokens	388 tokens
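The estimate can be sketched directly from the formula above. The method signature is an assumption (the real method may read these values from the unit itself), and the sample metadata is illustrative.

```ruby
require "json"

# Sketch of the combined estimate: roughly four characters per token,
# applied separately to the source and to the serialized metadata.
def estimated_tokens(source_code, metadata)
  (source_code.length / 4) + (metadata.to_json.length / 4)
end

source = "a" * 40 # 40 characters of source, about 10 tokens

# Under integer division, "{}" (two characters) contributes zero tokens,
# which matches the note that empty metadata adds nothing.
empty_estimate = estimated_tokens(source, {})
rich_estimate = estimated_tokens(source, { "associations" => ["post"] })
```

The four-characters-per-token ratio is a heuristic, so the result is a budget estimate rather than an exact tokenizer count.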

Connection to Architecture Layers

These fixes sit at the extraction layer but have ripple effects through every downstream system:

Layer	Document	How These Fixes Help
Chunking	CONTEXT_AND_CHUNKING.md	All models now produce chunks. Token estimates are accurate for budget calculations. Chunk content is clean (no Proc#inspect, no framework noise).
Embedding	RETRIEVAL_ARCHITECTURE.md	Cleaner chunk text produces more focused embeddings. STI fields are accurate for metadata filtering.
Retrieval	RETRIEVAL_ARCHITECTURE.md	Metadata filters (`is_sti_base`, `implicit_belongs_to`) enable precise query narrowing. Chunk-level retrieval works for all models, not just large ones.
Context Assembly	CONTEXT_AND_CHUNKING.md	Accurate `estimated_tokens` prevents context window overflow. Serializable conditions render cleanly in LLM prompts.
Agent Tools	AGENTIC_STRATEGY.md	`implicit_belongs_to` tag supports "show me developer validations" queries. Filtered instance methods give clean API surface answers.

Verification Results

All changes verified against the host app (Post + Comment models, 3 controllers, 2 jobs, 1 mailer):

Unit tests: 162 examples, 0 failures
Integration tests: 87 examples, 0 failures
Rake extraction: 129 units, 20 chunks (was 16 — gained 4 model chunks)
Validation: woods:validate passes

Backlog Status

These fixes address new items discovered during host app extraction review. They are independent of the optimization backlog, which tracks items #1-29 (batches 1-7 fully resolved, 4 deferred). The token estimation change partially addresses deferred item #21 (token accuracy) by including metadata weight — the tiktoken_ruby approach remains deferred.