
Configuration + Model Scope + Scrape Candidate Query

.vbw-planning/milestones/ui-fixes-and-smart-scraping/phases/04-smart-scrape-recommendations/04-PLAN-01.md


Plan 01: Configuration + Model Scope + Scrape Candidate Query

Overview

Add the backend foundation for smart scrape recommendations: a configurable word count threshold, a model scope to find candidate sources, and a query object that computes recommendation data for the dashboard and sources index.


Task 1: Add scrape_recommendation_threshold to ScrapingSettings

Files:

  • lib/source_monitor/configuration/scraping_settings.rb

Description: Add scrape_recommendation_threshold attribute to ScrapingSettings following the existing pattern:

  • Add DEFAULT_SCRAPE_RECOMMENDATION_THRESHOLD = 200 constant
  • Add attr_accessor :scrape_recommendation_threshold (or attr_reader, if the existing attributes pair a reader with a hand-written setter)
  • Initialize in reset! method
  • Add a setter with numeric normalization (use normalize_numeric since it's an integer word count)
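
A minimal sketch of how the four bullets above could fit together. The namespace and the other existing attributes are omitted, and the body of normalize_numeric is a stand-in written for this example (nil and blank strings become nil, everything else is coerced with to_i); the real helper in ScrapingSettings may behave differently. The sketch uses attr_reader plus an explicit writer so the writer is only defined once; follow whatever the existing attributes do.

```ruby
# Sketch only: namespace and the other ScrapingSettings attributes are omitted.
class ScrapingSettings
  DEFAULT_SCRAPE_RECOMMENDATION_THRESHOLD = 200

  # Reader only; the writer below normalizes its input.
  attr_reader :scrape_recommendation_threshold

  def initialize
    reset!
  end

  # Restores the default threshold.
  def reset!
    @scrape_recommendation_threshold = DEFAULT_SCRAPE_RECOMMENDATION_THRESHOLD
  end

  def scrape_recommendation_threshold=(value)
    @scrape_recommendation_threshold = normalize_numeric(value)
  end

  private

  # Assumed behavior, standing in for the existing helper:
  # nil and blank strings become nil, anything else is coerced with to_i.
  def normalize_numeric(value)
    return nil if value.nil?
    return nil if value.respond_to?(:strip) && value.strip.empty?

    value.to_i
  end
end

settings = ScrapingSettings.new
settings.scrape_recommendation_threshold        # => 200
settings.scrape_recommendation_threshold = "150"
settings.scrape_recommendation_threshold        # => 150
```

Under this stand-in, assigning nil leaves the threshold as nil, which the scope in Task 2 would treat as 0 (via to_i) and therefore return no candidates.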

Tests:

  • test/lib/source_monitor/configuration_test.rb -- Add tests for:
    • Default value is 200
    • Setting custom value via config.scraping.scrape_recommendation_threshold = 150
    • Reset restores default
    • Nil and empty-string values are handled without raising an error

Task 2: Add Source.scrape_candidates scope

Files:

  • app/models/source_monitor/source.rb

Description: Add a class method scrape_candidates that returns active sources meeting all of the following conditions:

  • Average feed word count (from ItemContent) is below the configured threshold
  • Source has at least one item with a feed_word_count (to avoid flagging empty sources)
  • Scraping is NOT already enabled

Use the existing avg_feed_words ransacker SQL pattern as reference for the subquery. Accept an optional threshold parameter that defaults to SourceMonitor.config.scraping.scrape_recommendation_threshold.

ruby
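def self.scrape_candidates(threshold: SourceMonitor.config.scraping.scrape_recommendation_threshold)
  threshold_value = threshold.to_i
  # A nil, zero, or negative threshold disables recommendations entirely.
  return none if threshold_value <= 0

  # Sources whose items average fewer feed words than the threshold.
  # Items without a feed_word_count are ignored, so a source with no
  # counted items never appears in the subquery.
  low_word_count_source_ids = <<~SQL
    SELECT i.source_id
    FROM #{Item.table_name} i
    INNER JOIN #{ItemContent.table_name} ic ON ic.item_id = i.id
    WHERE ic.feed_word_count IS NOT NULL
    GROUP BY i.source_id
    HAVING AVG(ic.feed_word_count) < ?
  SQL

  active
    .where(scraping_enabled: false)
    .where("#{table_name}.id IN (#{low_word_count_source_ids})", threshold_value)
end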
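
To make the selection rules concrete, here is a plain-Ruby sketch that applies the same conditions to in-memory data, with no database involved. SourceRow, its fields, and scrape_candidate_ids are hypothetical stand-ins for the Source, Item, and ItemContent rows; the sketch only illustrates which sources the SQL is expected to keep.

```ruby
# Hypothetical in-memory stand-in for a source plus the feed word counts of
# its items (nil means the item has no feed_word_count recorded).
SourceRow = Struct.new(:id, :active, :scraping_enabled, :feed_word_counts, keyword_init: true)

def scrape_candidate_ids(rows, threshold:)
  threshold_value = threshold.to_i
  return [] if threshold_value <= 0

  selected = rows.select do |row|
    next false unless row.active && !row.scraping_enabled

    counts = row.feed_word_counts.compact # mirrors "feed_word_count IS NOT NULL"
    next false if counts.empty?           # no counted items: never a candidate

    counts.sum.to_f / counts.size < threshold_value # mirrors HAVING AVG(...) < ?
  end

  selected.map(&:id)
end

ROWS = [
  SourceRow.new(id: 1, active: true,  scraping_enabled: false, feed_word_counts: [40, 60]),       # average 50
  SourceRow.new(id: 2, active: true,  scraping_enabled: false, feed_word_counts: [900, 1100]),    # average 1000
  SourceRow.new(id: 3, active: true,  scraping_enabled: true,  feed_word_counts: [10]),           # already scraping
  SourceRow.new(id: 4, active: false, scraping_enabled: false, feed_word_counts: [10]),           # inactive
  SourceRow.new(id: 5, active: true,  scraping_enabled: false, feed_word_counts: [nil]),          # nothing counted
  SourceRow.new(id: 6, active: true,  scraping_enabled: false, feed_word_counts: [150, nil, 250]) # average 200
].freeze

scrape_candidate_ids(ROWS, threshold: 200) # => [1]
```

Source 6 averages exactly 200 and is excluded, because the comparison is a strict less-than; a boundary case like this is worth covering in the scope tests.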

Tests:

  • test/models/source_monitor/source_test.rb -- Add tests for:
    • Returns sources below threshold with scraping disabled
    • Excludes sources with scraping already enabled
    • Excludes inactive sources
    • Excludes sources with no items/content
    • Excludes sources above threshold
    • Respects custom threshold parameter
    • Returns empty when threshold is 0 or negative

Task 3: Create Analytics::ScrapeRecommendations query object

Files:

  • lib/source_monitor/analytics/scrape_recommendations.rb

Description: Create a query object following the SourcesIndexMetrics pattern that computes scrape recommendation data:

ruby
module SourceMonitor
  module Analytics
    # Computes scrape recommendation data for the dashboard and sources index.
    class ScrapeRecommendations
      def initialize(threshold: SourceMonitor.config.scraping.scrape_recommendation_threshold)
        @threshold = threshold.to_i
      end

      # Number of sources currently recommended for scraping (memoized).
      def candidates_count
        @candidates_count ||= Source.scrape_candidates(threshold: threshold).count
      end

      # IDs of the recommended sources (memoized).
      def candidate_ids
        @candidate_ids ||= Source.scrape_candidates(threshold: threshold).pluck(:id)
      end

      # True when the given source ID is among the candidates.
      def candidate?(source_id)
        candidate_ids.include?(source_id)
      end

      private

      attr_reader :threshold
    end
  end
end

Make the class loadable by adding autoload :ScrapeRecommendations to the Analytics module (covered by Task 4), or load it with require_relative.

Tests:

  • test/lib/source_monitor/analytics/scrape_recommendations_test.rb -- Add tests for:
    • candidates_count returns correct count
    • candidate_ids returns correct IDs
    • candidate? returns true/false correctly
    • Results are memoized (same object returned on repeat calls)
    • Respects threshold parameter
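
One way the memoization check could be exercised without a database is sketched below. The Source stub and FakeRelation are hypothetical test doubles written for this example, and the ScrapeRecommendations class is a trimmed copy of the one above (no config default, no candidates_count) so the snippet runs on its own.

```ruby
module SourceMonitor
  # Hypothetical stand-in for an ActiveRecord relation.
  class FakeRelation
    def initialize(ids)
      @ids = ids
    end

    def pluck(_column)
      @ids.dup
    end
  end

  # Hypothetical stub: records how often the scope is queried.
  class Source
    class << self
      attr_accessor :ids, :query_count

      def scrape_candidates(threshold:)
        self.query_count += 1
        FakeRelation.new(ids)
      end
    end
  end

  module Analytics
    # Trimmed copy of the plan's class, enough to show the memoization.
    class ScrapeRecommendations
      def initialize(threshold:)
        @threshold = threshold.to_i
      end

      def candidate_ids
        @candidate_ids ||= Source.scrape_candidates(threshold: @threshold).pluck(:id)
      end

      def candidate?(source_id)
        candidate_ids.include?(source_id)
      end
    end
  end
end

SourceMonitor::Source.ids = [3, 7]
SourceMonitor::Source.query_count = 0

recommendations = SourceMonitor::Analytics::ScrapeRecommendations.new(threshold: 200)
first  = recommendations.candidate_ids
second = recommendations.candidate_ids

first.equal?(second)              # => true, the same Array object both times
SourceMonitor::Source.query_count # => 1, the scope was queried only once
recommendations.candidate?(7)     # => true
recommendations.candidate?(8)     # => false
```

In the real test file, Minitest's assert_same on two consecutive candidate_ids calls would express the same "same object returned" check.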

Task 4: Register autoload for ScrapeRecommendations

Files:

  • lib/source_monitor.rb (add autoload declaration only)

Description: Add autoload entry for SourceMonitor::Analytics::ScrapeRecommendations in the Analytics module section, following the existing autoload pattern used throughout lib/source_monitor.rb.

Tests:

  • Verify the class can be loaded: assert_kind_of Class, SourceMonitor::Analytics::ScrapeRecommendations
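
For reference, the sketch below shows how Module#autoload behaves, using a temporary file as a stand-in for lib/source_monitor/analytics/scrape_recommendations.rb. The temporary directory and the absolute path are assumptions made so the snippet runs on its own; the real entry in lib/source_monitor.rb would use whatever path form the existing autoload declarations use.

```ruby
require "tmpdir"

module SourceMonitor
  module Analytics
  end
end

Dir.mktmpdir do |dir|
  # Stand-in for the real class file.
  path = File.join(dir, "scrape_recommendations.rb")
  File.write(path, <<~RUBY)
    module SourceMonitor
      module Analytics
        class ScrapeRecommendations
        end
      end
    end
  RUBY

  # Register the constant; the file is not loaded yet.
  SourceMonitor::Analytics.autoload(:ScrapeRecommendations, path)
  SourceMonitor::Analytics.autoload?(:ScrapeRecommendations) # => path

  # The first reference to the constant triggers the require.
  SourceMonitor::Analytics::ScrapeRecommendations.is_a?(Class) # => true
  SourceMonitor::Analytics.autoload?(:ScrapeRecommendations)   # => nil
end
```

Because loading is deferred until first reference, the assert_kind_of check above is what actually proves the registered path resolves to a file that defines the class.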