Back to Source Monitor

Phase 04: Smart Scrape Recommendations — Research

.vbw-planning/milestones/ui-fixes-and-smart-scraping/phases/04-smart-scrape-recommendations/.context-lead.md

0.13.05.9 KB
Original Source

Phase 04 Context (Compiled)

Goal

Not available

Success Criteria

Not available

Requirements (Not available)

No matching requirements found

Active Decisions

None

Codebase Map Available

Codebase mapping exists in .vbw-planning/codebase/. Key files:

  • ARCHITECTURE.md
  • CONCERNS.md
  • PATTERNS.md
  • DEPENDENCIES.md
  • STRUCTURE.md
  • CONVENTIONS.md
  • TESTING.md
  • STACK.md

Read ARCHITECTURE.md, CONCERNS.md, and STRUCTURE.md first to bootstrap codebase understanding.

Research Findings

Phase 04: Smart Scrape Recommendations — Research

Findings

1. Source Model Structure

File: app/models/source_monitor/source.rb

  • Key attributes: scraping_enabled (boolean), scraper_adapter (string, validates presence), scrape_settings (JSONB), min_scrape_interval (optional)
  • Existing scopes & methods:
    • active scope for filtering
    • due_for_fetch class method for scheduling
    • avg_word_count method (lines 134-139) — currently computes average of scraped word counts only
    • Ransacker columns for avg_feed_words and avg_scraped_words (lines 83-99) — query ItemContent for feed/scraped word counts by source
  • Counter caches: items_count auto-maintained via has_many association

2. Configuration DSL

File: lib/source_monitor/configuration.rb

  • SourceMonitor.configure { |c| c.attr = value } pattern
  • @scraping = ScrapingSettings.new already initialized
  • ScrapingSettings has: max_in_flight_per_source, max_bulk_batch_size, min_scrape_interval
  • Pattern: attr_accessor + DEFAULT_* constant + reset! method

3. Dashboard Structure

Files: app/controllers/source_monitor/dashboard_controller.rb, lib/source_monitor/dashboard/queries.rb

  • Dashboard renders via queries.stats, queries.recent_activity, queries.quick_actions, queries.job_metrics, queries.upcoming_fetch_schedule
  • StatsQuery returns: { total_sources, active_sources, failed_sources, total_items, fetches_today, health_distribution }
  • Widget structure: bordered card with header + divider + list/content

4. Sources Index Structure

Files: app/controllers/source_monitor/sources_controller.rb + view

  • Uses Ransack for search (searchable_with mixin)
  • Index action: builds @q (Ransack query), computes @avg_feed_word_counts and @avg_scraped_word_counts as hashes (source_id -> avg)
  • Row partial renders with item_activity_rates, word count maps
  • Row has dropdown menu with View/Edit/Delete actions

5. Scraping Pipeline

Files: app/jobs/source_monitor/scrape_item_job.rb, lib/source_monitor/scraping/bulk_source_scraper.rb, lib/source_monitor/scraping/item_scraper.rb

  • ScrapeItemJob: Checks source.scraping_enabled?, respects min_scrape_interval, calls ItemScraper
  • BulkSourceScraper: Selections: :current, :unscraped, :all. Returns Result struct with: status, selection, attempted_count, enqueued_count, already_enqueued_count, failure_count, failure_details, messages, rate_limited
  • ItemScraper: Resolves adapter via AdapterResolver, calls adapter, persists result

6. Item Model & Word Count Computation

Files: app/models/source_monitor/item.rb, app/models/source_monitor/item_content.rb

  • Item has has_one :item_content (autosave: true)
  • ItemContent stores feed_word_count and scraped_word_count (computed)
  • Computation: feed_word_count from item.content, scraped_word_count from item_content.scraped_content
  • Both computed in before_save hook

7. Analytics Patterns

File: lib/source_monitor/analytics/sources_index_metrics.rb

  • Takes base_scope, result_scope, search_params
  • Computes distribution via SourceFetchIntervalDistribution
  • Results cached via private attributes
  • Used in controller index action to populate display data

8. Existing Bulk Action Patterns

File: app/controllers/source_monitor/source_bulk_scrapes_controller.rb

  • Single source bulk scrape: POST /sources/:source_id/bulk_scrape
  • Params: { bulk_scrape: { selection: :current/:unscraped/:all } }
  • Responds with Turbo Stream, uses SourceTurboResponses mixin

9. Routes

File: config/routes.rb

  • Sources resources with nested resource :bulk_scrape, only: :create
  • Dashboard: get "/dashboard" -> dashboard#index

10. Test Patterns

  • Uses create_source! factory, mocks services, tests Turbo Stream responses
  • Asserts turbo-stream tags with dom_id(source, :row)

Relevant Patterns

  1. Configuration: Add scrape_recommendation_threshold to ScrapingSettings with DEFAULT constant + reset
  2. Dashboard widget: Follow existing card pattern (header + divider + content)
  3. Word count filtering: Leverage existing Ransack ransackers (avg_feed_words)
  4. Bulk enablement: Follow bulk scrape pattern with Turbo Stream responder
  5. Presenter pattern: Source row uses locals for computed data
  6. Analytics class: Follow SourcesIndexMetrics pattern for recommendation computation

Risks

  1. Word count accuracy: avg_feed_words Ransacker joins ItemContent — need all items to have content records
  2. Bulk action feedback: BulkSourceScraper can rate-limit — UI must handle partial enqueuing
  3. Modal confirmation UX: Needs Stimulus controller for toggle
  4. Test-first scrape page: New route/view — must follow CRUD-everything pattern

Recommendations

  1. Add to ScrapingSettings with default 200 words
  2. Add Source.scrape_candidates scope using avg_feed_words threshold
  3. Dashboard widget: new card between stats and recent activity
  4. Index badge: warning indicator on rows where avg_feed_words < threshold
  5. Bulk selection: Stimulus checkboxes + new controller action
  6. Test-first: New route POST /sources/:id/test_scrape -> comparison page
  7. Create SourceMonitor::Analytics::ScrapeRecommendations for candidate computation
  8. Confirmation modal via Stimulus controller