docs/design/MODEL_EXTRACTION_FIXES.md
The extraction layer produces ExtractedUnit JSON files that downstream systems (embedding pipeline, retrieval, context assembly, agent tools) consume. Five issues in model extraction were degrading the quality of that output in ways that directly impact AI usefulness. A sixth issue — broken callback condition extraction — was fixed alongside.
These fixes target the extraction-to-embedding boundary: the point where raw Rails introspection becomes structured data that embeddings and LLM context windows must work with. Every fix below improves at least one of retrieval precision, token efficiency, or semantic accuracy for the model units that dominate most Rails codebases.
What changed: Removed the needs_chunking? gate that required estimated_tokens > 1500 before build_chunks would run. All models now produce semantic chunks unconditionally.
Why it matters for retrieval: The chunking strategy in CONTEXT_AND_CHUNKING.md explicitly states that models get split into summary, associations, callbacks, and validations chunks. These chunks are what the embedding pipeline indexes — a model with chunks: [] is invisible to chunk-level retrieval. A small model like Comment (3 LOC, one belongs_to) still has meaningful associations and validations that agents need to find.
| Model | Before | After |
|---|---|---|
| Post (bare) | chunks: [] | 1 chunk (summary) |
| Comment (belongs_to + validation) | chunks: [] | 3 chunks (summary, associations, validations) |
Downstream effect: Queries like "what validates Comment" or "Comment associations" now have dedicated chunk embeddings to match against, instead of requiring full-unit retrieval and LLM parsing.
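A minimal sketch of unconditional chunking, assuming a unit is a plain hash with :name and :metadata keys. The hash shape and chunk contents here are hypothetical and only illustrate the behavior in the table above; the real build_chunks may differ.

```ruby
# Hypothetical sketch: every model gets a summary chunk, plus one chunk
# per non-empty metadata section. There is no token-count gate.
def build_chunks(unit)
  chunks = [{ type: "summary", content: "#{unit[:name]} model" }]

  %i[associations callbacks validations].each do |section|
    items = unit.fetch(:metadata, {}).fetch(section, [])
    next if items.empty? # skip sections the model does not define

    chunks << { type: section.to_s, content: items.join("\n") }
  end

  chunks
end

post = { name: "Post", metadata: {} }
comment = {
  name: "Comment",
  metadata: {
    associations: ["belongs_to :post"],
    validations: ["validates :body, presence: true"]
  }
}

post_chunks = build_chunks(post)       # summary only
comment_chunks = build_chunks(comment) # summary, associations, validations
```

Because the summary chunk is always emitted, even a bare model stays visible to chunk-level retrieval.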
What changed:
- Conditions are serialized through condition_label(): Symbol → ":name", Proc → "Proc", String → as-is.
- Callback conditions are now read from the @if/@unless instance variables instead of the non-existent cb.options[:if] (removed in Rails 4.2).

Why it matters for context windows: Raw Proc#inspect output like "#<Proc:0x0000000122a828c0 /path/to/file.rb:42>" wastes tokens and confuses LLMs, because memory addresses carry no meaning. "Proc" communicates "there's a conditional" without the noise. Symbols like :published? are even better: they tell the LLM exactly which method gates the behavior.
The callback fix also eliminates a silent NoMethodError that was being swallowed by the rescue block, meaning callback conditions were silently nil on every Rails 7+ app.
| Field | Before | After |
|---|---|---|
| Validation :if | "#<Proc:0x0000000122a828c0...>" | ["Proc"] |
| Validation :if (symbol) | [:published?] (worked) | [":published?"] (consistent string format) |
| Callback :if | nil (broken: cb.options doesn't exist) | [":published?"] or ["Proc"] |
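The mapping above can be sketched in plain Ruby. This is an illustration under assumptions, not the extractor's actual code: the fallback branch, the condition_labels and callback_conditions helpers, and the FakeCallback stand-in are hypothetical, and the exact contents of @if/@unless can vary between Rails versions.

```ruby
# Hypothetical sketch: turn one condition into a stable, token-cheap string.
def condition_label(condition)
  case condition
  when Symbol then ":#{condition}"
  when Proc   then "Proc"               # never emit Proc#inspect
  when String then condition
  else condition.class.name             # assumption: fall back to the class name
  end
end

# Normalize nil, a single condition, or an array of conditions.
def condition_labels(conditions)
  Array(conditions).map { |c| condition_label(c) }
end

# Read conditions from instance variables rather than cb.options.
def callback_conditions(callback)
  {
    if: condition_labels(callback.instance_variable_get(:@if)),
    unless: condition_labels(callback.instance_variable_get(:@unless))
  }
end

# Stand-in for a callback object, used only to exercise the sketch.
class FakeCallback
  def initialize(if_conditions, unless_conditions)
    @if = if_conditions
    @unless = unless_conditions
  end
end

callback = FakeCallback.new([:published?, -> { true }], nil)
labels = callback_conditions(callback)
```

Returning an array of strings in every case keeps the JSON shape uniform, so downstream formatters do not need to branch on type.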
What changed: Validators auto-generated by belongs_to associations are now tagged with implicit_belongs_to: true.
Why it matters for agents: When an agent asks "what validations did the developer add to Comment?", implicit framework-generated validators (every belongs_to adds a presence validator by default) are noise. The tag lets retrieval and context formatting distinguish developer intent from framework defaults. An agent tool can filter these out or present them separately.
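As a rough illustration of how a consumer might use the tag, the sketch below splits validation metadata into framework-generated and developer-written groups. Only implicit_belongs_to comes from the extractor output described above; the other hash keys are hypothetical.

```ruby
# Hypothetical validation metadata for Comment. Only the
# implicit_belongs_to key is taken from the extractor's output.
validations = [
  { attribute: :post, kind: "presence", implicit_belongs_to: true },
  { attribute: :body, kind: "presence", implicit_belongs_to: false },
  { attribute: :body, kind: "length",   implicit_belongs_to: false }
]

# Separate framework defaults from developer intent.
implicit, explicit = validations.partition { |v| v[:implicit_belongs_to] }

developer_attributes = explicit.map { |v| v[:attribute] }.uniq
```

An agent tool could answer "what validations did the developer add?" from the explicit group alone and list the implicit group separately if asked.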
What changed:
- is_sti_base now requires both that descends_from_active_record? returns true and that the inheritance_column (usually type) exists in the table.
- Added an is_sti_child field for completeness.

Why it matters for retrieval accuracy: descends_from_active_record? returns true for every model that directly inherits from ApplicationRecord, which is almost every model. The old output marked Post, Comment, User, etc. all as is_sti_base: true. This is semantically wrong and would cause retrieval queries about STI ("which models use single-table inheritance?") to return the entire codebase. Now only models with an actual type column qualify.
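The tightened check can be sketched against a stand-in model class. FakeModel and sti_flags are hypothetical names; the three methods they rely on (descends_from_active_record?, inheritance_column, column_names) mirror real ActiveRecord class methods. The is_sti_child rule shown here is an assumption, since the text above does not spell it out.

```ruby
# Stand-in for an ActiveRecord model class, exposing only the three
# class-level methods the check needs.
class FakeModel
  attr_reader :inheritance_column, :column_names

  def initialize(descends:, column_names:, inheritance_column: "type")
    @descends = descends
    @column_names = column_names
    @inheritance_column = inheritance_column
  end

  def descends_from_active_record?
    @descends
  end
end

# Hypothetical sketch of the STI flags.
def sti_flags(model)
  has_column = model.column_names.include?(model.inheritance_column)
  {
    # Base: top of its own hierarchy AND the table can store a type.
    is_sti_base: model.descends_from_active_record? && has_column,
    # Assumption: a child is any model that does not descend directly.
    is_sti_child: !model.descends_from_active_record?
  }
end

post    = FakeModel.new(descends: true,  column_names: %w[id title])
vehicle = FakeModel.new(descends: true,  column_names: %w[id type])
car     = FakeModel.new(descends: false, column_names: %w[id type])
```

With these inputs, only vehicle is flagged as an STI base, while an ordinary model like post is no longer reported as one.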
What changed: Added AR_INTERNAL_METHOD_PATTERNS constant that filters out:
- _run_save_callbacks, _validators, _reflections (underscore-prefixed internals)
- autosave_associated_records_for_* (generated per-association)
- validate_associated_records_for_* (generated per-association)
- before_add_for_*, after_remove_for_* (collection callbacks)

Why it matters for token budgets: A model with 5 associations generates 10+ internal methods (autosave_associated_records_for_comments, validate_associated_records_for_comments, etc.). These methods exist in the instance_methods array, consuming embedding tokens and diluting the signal of actual developer-defined methods. Filtering them reduces noise in the API surface metadata and makes the instance methods list useful for "what can I call on this model?" queries.
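One way such a filter could look is shown below. The constant name comes from the text above, but the specific regular expressions and the developer_methods helper are assumptions chosen to match the listed categories.

```ruby
# Assumed patterns covering the categories listed above; the real
# constant may use different expressions.
AR_INTERNAL_METHOD_PATTERNS = [
  /\A_/,                                  # underscore-prefixed internals
  /\Aautosave_associated_records_for_/,   # generated per association
  /\Avalidate_associated_records_for_/,   # generated per association
  /\A(before|after)_(add|remove)_for_/    # collection callbacks
].freeze

# Hypothetical helper: keep only methods that match no internal pattern.
def developer_methods(method_names)
  method_names.reject do |name|
    AR_INTERNAL_METHOD_PATTERNS.any? { |pattern| pattern.match?(name.to_s) }
  end
end

raw_methods = %i[
  publish!
  _run_save_callbacks
  autosave_associated_records_for_comments
  validate_associated_records_for_comments
  before_add_for_comments
  word_count
]

clean_methods = developer_methods(raw_methods)
```

Anchoring each pattern with \A keeps the filter from dropping developer methods that merely contain one of these substrings.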
What changed: estimated_tokens now sums source_code.length / 4 and metadata.to_json.length / 4. Empty metadata ({}) adds nothing.
Why it matters for chunking and context assembly: The token budget system described in RETRIEVAL_ARCHITECTURE.md uses estimated_tokens to decide how many units fit in a context window. Source-only estimation undercounts by 2-3x for metadata-rich models — a model with 10 associations, 15 validations, and 20 callbacks has substantial metadata that the old estimate ignored. Accurate token estimates prevent context window overflow during assembly.
| Model | Before (source only) | After (source + metadata) |
|---|---|---|
| Post | 109 tokens | 351 tokens |
| Comment | 136 tokens | 388 tokens |
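The arithmetic can be sketched directly from the description above: integer division of each character count by 4, then a sum. The method signature is an assumption, and the inputs are made-up values rather than the Post or Comment data from the table.

```ruby
require "json"

# Sketch of the estimate: roughly 4 characters per token, applied to both
# the source text and the JSON-serialized metadata.
def estimated_tokens(source_code, metadata)
  source_tokens = source_code.length / 4
  metadata_tokens = metadata.to_json.length / 4 # "{}" is 2 chars, so 0 tokens
  source_tokens + metadata_tokens
end

source = "x" * 40 # 40 characters, about 10 tokens
metadata = { associations: ["comments"], validations: ["presence of body"] }

source_only = estimated_tokens(source, {})
with_metadata = estimated_tokens(source, metadata)
```

Because integer division rounds down, an empty metadata hash contributes zero tokens, which matches the "adds nothing" behavior described above.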
These fixes sit at the extraction layer but have ripple effects through every downstream system:
| Layer | Document | How These Fixes Help |
|---|---|---|
| Chunking | CONTEXT_AND_CHUNKING.md | All models now produce chunks. Token estimates are accurate for budget calculations. Chunk content is clean (no Proc#inspect, no framework noise). |
| Embedding | RETRIEVAL_ARCHITECTURE.md | Cleaner chunk text produces more focused embeddings. STI fields are accurate for metadata filtering. |
| Retrieval | RETRIEVAL_ARCHITECTURE.md | Metadata filters (is_sti_base, implicit_belongs_to) enable precise query narrowing. Chunk-level retrieval works for all models, not just large ones. |
| Context Assembly | CONTEXT_AND_CHUNKING.md | Accurate estimated_tokens prevents context window overflow. Serializable conditions render cleanly in LLM prompts. |
| Agent Tools | AGENTIC_STRATEGY.md | implicit_belongs_to tag supports "show me developer validations" queries. Filtered instance methods give clean API surface answers. |
All changes verified against the host app (Post + Comment models, 3 controllers, 2 jobs, 1 mailer):
- woods:validate passes

These fixes address new items discovered during host app extraction review. They are independent of the optimization backlog, which tracks items #1-29 (batches 1-7 fully resolved, 4 deferred). The token estimation change partially addresses deferred item #21 (token accuracy) by including metadata weight; the tiktoken_ruby approach remains deferred.