.vbw-planning/milestones/polish-and-reliability/phases/05-source-enhancements/.context-dev.md
Phase 05 Context

Goal

Not available

Skills Reference

Codebase Map Available

Codebase mapping exists in .vbw-planning/codebase/. Key files:

  • ARCHITECTURE.md
  • CONCERNS.md
  • PATTERNS.md
  • DEPENDENCIES.md
  • STRUCTURE.md
  • CONVENTIONS.md
  • TESTING.md
  • STACK.md

Read CONVENTIONS.md, PATTERNS.md, STRUCTURE.md, and DEPENDENCIES.md first to bootstrap codebase understanding.

Changed Files (Delta)

  • .vbw-planning/ROADMAP.md
  • .vbw-planning/STATE.md

Code Slices

.vbw-planning/ROADMAP.md (119 lines, first 30 shown)

# Roadmap

## Milestone: polish-and-reliability

### Phases

1. [x] **Backend Fixes** -- Fix browser User-Agent default, health check status transitions, and smarter scrape rate limiting
2. [x] **Favicon Support** -- Automatically save source favicons via Active Storage with background fetch job
3. [x] **Toast Stacking** -- Cap visible toast notifications with click-to-expand for bulk operation UX
4. [x] **Bug Fixes & Polish** -- Fix OPML import warning, toast positioning, dashboard alignment, source deletion, and published column
5. [ ] **Source Enhancements** -- Add pagination/filtering for sources, per-source scrape rate limiting, and word count metrics

### Phase Details

#### Phase 1: Backend Fixes

**Goal:** Fix three independent backend issues: bot-blocked feeds due to User-Agent, health check not updating status, and overly aggressive scrape limiting.

**Requirements:**
- REQ-UA-01: Change default User-Agent from "SourceMonitor/VERSION" to a browser-like string
- REQ-HC-01: After a successful manual health check on a declining/critical/warning source, trigger SourceHealthMonitor re-evaluation or directly transition status to "improving"
- REQ-SL-01: Refine max_in_flight_per_source to only count actively-running scrape jobs (not queued ones)

**Success Criteria:**
- [ ] Default UA string resembles a real browser (e.g., Mozilla/5.0 compatible)
- [ ] Successful manual health check on a declining source transitions it to improving
- [ ] Scrape limit counts only actively-running jobs, queued items don't count toward the cap
- [ ] All existing tests pass, new tests cover changed behavior
- [ ] RuboCop zero offenses, Brakeman zero warnings

.vbw-planning/STATE.md (32 lines)

# State

## Current Position

- **Milestone:** polish-and-reliability
- **Phase:** 5 -- Source Enhancements
- **Status:** Planned
- **Progress:** 80%
- **Plans:** 3

## Decisions

| Decision | Date | Context |
|----------|------|---------|
| Active Storage for favicons | 2026-02-20 | has_one_attached with guard, consistent with ItemContent pattern |
| Smarter scrape limit | 2026-02-20 | Count only running jobs, not queued; keeps safety but removes false bottleneck |
| Browser-like default UA | 2026-02-20 | Simple global fix for bot-blocked feeds like Uber |
| Health check triggers status update | 2026-02-20 | Successful manual health check should transition declining -> improving |
| Toast cap + hover expand | 2026-02-20 | Max 3 visible, +N more badge, hover to see all |

## Todos

- [x] Fix deprecation: `rails/tasks/statistics.rake` removed from Rakefile (2026-02-21)

## Metrics

- **Started:** 2026-02-20
- **Phases:** 5
- **Tests at start:** 1033

## Blockers
None

Active Plan


phase: 5
plan: 2
title: Per-Source Scrape Rate Limiting
wave: 1
depends_on: []
must_haves:
  • "migration adds min_scrape_interval column (decimal, seconds) to sourcemon_sources with default nil"
  • "ScrapingSettings has min_scrape_interval attribute with DEFAULT_MIN_SCRAPE_INTERVAL = 1.0 (seconds)"
  • "Enqueuer derives last-scrape timestamp from scrape_logs MAX(started_at) per source"
  • "when rate-limited, ScrapeItemJob re-enqueues itself with set(wait:) for remaining interval"
  • "per-source min_scrape_interval overrides global ScrapingSettings.min_scrape_interval when present"
  • "all existing enqueuer and scrape_item_job tests pass, new tests cover rate limit behavior"
  • "RuboCop zero offenses"
skills_used:
  • sm-engine-migration
  • sm-configuration-setting

Objective

Add time-based per-source scrape rate limiting. The system derives the last scrape timestamp from scrape_logs MAX(started_at) per source. When a scrape is attempted too soon, the job re-enqueues itself with a delay equal to the remaining interval. Each source can override the global minimum interval via a new min_scrape_interval column.

Context

  • @ lib/source_monitor/scraping/enqueuer.rb -- current rate limiting checks in-flight count only; need to add time-based check
  • @ lib/source_monitor/configuration/scraping_settings.rb -- current settings: max_in_flight_per_source, max_bulk_batch_size
  • @ app/jobs/source_monitor/scrape_item_job.rb -- performs scrape; needs re-enqueue-with-delay logic
  • @ app/models/source_monitor/scrape_log.rb -- has started_at column, belongs_to source
  • @ app/models/source_monitor/source.rb -- will get min_scrape_interval column (but model file not modified -- just migration)
  • @ .claude/skills/sm-engine-migration/SKILL.md -- migration conventions (sourcemon_ prefix)
  • @ .claude/skills/sm-configuration-setting/SKILL.md -- config setting conventions

Tasks

Task 1: Add min_scrape_interval column to sources

Files: db/migrate/TIMESTAMP_add_min_scrape_interval_to_sources.rb

Create migration adding min_scrape_interval (decimal, precision: 10, scale: 2, null: true, default: nil) to sourcemon_sources. No index needed -- this is a per-record configuration value, not a query filter. The nil default means "use global setting".
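A minimal sketch of that migration, assuming a Rails 7.x migration superclass; the real file name and version suffix come from the engine's migration generator, and TIMESTAMP stays a placeholder.

```ruby
# db/migrate/TIMESTAMP_add_min_scrape_interval_to_sources.rb
# Sketch only -- match the migration version to the engine's Rails version.
class AddMinScrapeIntervalToSources < ActiveRecord::Migration[7.1]
  def change
    # nil default means "fall back to the global ScrapingSettings value"
    add_column :sourcemon_sources, :min_scrape_interval, :decimal,
               precision: 10, scale: 2, null: true, default: nil
  end
end
```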

Task 2: Add min_scrape_interval to ScrapingSettings

Files: lib/source_monitor/configuration/scraping_settings.rb

Add attr_accessor :min_scrape_interval with DEFAULT_MIN_SCRAPE_INTERVAL = 1.0 (seconds). Add setter with normalize_numeric validation (same pattern as existing settings). Reset to default in reset!. This is the global fallback when a source's min_scrape_interval is nil.
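A sketch of the new setting in isolation. The surrounding class and the exact `normalize_numeric` signature are assumptions based on the pattern described above; the real helper from the existing settings should be used instead of the placeholder shown here.

```ruby
module SourceMonitor
  module Configuration
    class ScrapingSettings
      DEFAULT_MIN_SCRAPE_INTERVAL = 1.0 # seconds

      attr_reader :min_scrape_interval

      def min_scrape_interval=(value)
        # Coerce/validate the same way the other numeric settings do.
        @min_scrape_interval = normalize_numeric(value)
      end

      def reset!
        @min_scrape_interval = DEFAULT_MIN_SCRAPE_INTERVAL
        # ... existing settings are reset here as well ...
      end

      private

      # Stand-in for the engine's existing coercion helper (assumed).
      def normalize_numeric(value)
        value.nil? ? nil : Float(value)
      end
    end
  end
end
```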

Task 3: Add time-based rate check to Enqueuer

Files: lib/source_monitor/scraping/enqueuer.rb

Add private method time_rate_limited? that:

  1. Resolves effective interval: source.min_scrape_interval || SourceMonitor.config.scraping.min_scrape_interval
  2. Returns [false, nil] if interval is nil or <= 0
  3. Queries source.scrape_logs.maximum(:started_at) for last scrape time
  4. Returns [false, nil] if no prior scrape
  5. Calculates elapsed = Time.current - last_scrape_at
  6. If elapsed < interval: returns [true, { wait_seconds: (interval - elapsed).ceil, interval:, last_scrape_at: }]
  7. Otherwise returns [false, nil]

In #enqueue, call time_rate_limited? AFTER the existing rate_limit_exhausted? check (inside the lock block). If rate-limited, set time_limited = true and time_limit_info.

After the lock block, if time_limited: instead of returning a failure, re-enqueue the job with delay via job_class.set(wait: info[:wait_seconds].seconds).perform_later(item.id) and return a new Result with status: :deferred and descriptive message. Add deferred? method to Result struct.
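A sketch of the private check plus the post-lock deferral, following steps 1-7 above. The `job_class`, `time_limited`, and `time_limit_info` locals and the `Result.new` keyword arguments are assumptions about the existing Enqueuer internals; the shorthand hash keys require Ruby 3.1+.

```ruby
# Inside SourceMonitor::Scraping::Enqueuer (sketch, not the engine's actual code)
def time_rate_limited?(source)
  interval = source.min_scrape_interval ||
             SourceMonitor.config.scraping.min_scrape_interval
  return [false, nil] if interval.nil? || interval <= 0

  last_scrape_at = source.scrape_logs.maximum(:started_at)
  return [false, nil] if last_scrape_at.nil?

  elapsed = Time.current - last_scrape_at
  return [false, nil] unless elapsed < interval

  [true, { wait_seconds: (interval - elapsed).ceil, interval:, last_scrape_at: }]
end

# After the lock block in #enqueue (Result fields assumed):
if time_limited
  job_class.set(wait: time_limit_info[:wait_seconds].seconds).perform_later(item.id)
  return Result.new(status: :deferred,
                    message: "Scrape deferred #{time_limit_info[:wait_seconds]}s by per-source rate limit")
end
```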

Task 4: Add re-enqueue with delay to ScrapeItemJob

Files: app/jobs/source_monitor/scrape_item_job.rb

Modify #perform to check time-based rate limit before scraping. Add early check: resolve effective interval, query source.scrape_logs.maximum(:started_at), calculate elapsed. If too soon: clear in-flight state, re-enqueue self with self.class.set(wait: remaining.seconds).perform_later(item_id), log the deferral, and return early. This ensures even directly-enqueued jobs (bypassing Enqueuer) respect rate limits.
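A sketch of that early guard in `#perform`. The `SourceMonitor::Item` lookup and `clear_in_flight_state` are placeholders for the job's existing helpers, not its real private API.

```ruby
module SourceMonitor
  class ScrapeItemJob < ApplicationJob
    def perform(item_id)
      item = SourceMonitor::Item.find(item_id)
      source = item.source

      interval = source.min_scrape_interval ||
                 SourceMonitor.config.scraping.min_scrape_interval
      if interval && interval.positive?
        last_scrape_at = source.scrape_logs.maximum(:started_at)
        elapsed = last_scrape_at && Time.current - last_scrape_at
        if elapsed && elapsed < interval
          remaining = (interval - elapsed).ceil
          clear_in_flight_state(item) # placeholder for existing in-flight bookkeeping
          self.class.set(wait: remaining.seconds).perform_later(item_id)
          Rails.logger.info("[SourceMonitor] deferred scrape of item #{item_id} for #{remaining}s")
          return
        end
      end

      # ... existing scrape logic continues unchanged ...
    end
  end
end
```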

Task 5: Write rate limiting tests

Files: test/lib/source_monitor/scraping/enqueuer_test.rb, test/jobs/source_monitor/scrape_item_job_test.rb

Enqueuer tests: (1) allows scrape when no prior scrape exists, (2) allows scrape when elapsed > interval, (3) returns deferred status when elapsed < interval with correct wait_seconds, (4) per-source interval overrides global, (5) nil/zero interval disables time rate limiting, (6) deferred result re-enqueues job with delay.

ScrapeItemJob tests: (1) performs scrape when not rate-limited, (2) re-enqueues with delay when rate-limited, (3) clears in-flight state on deferral.

Run full test suite to verify no regressions.
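One possible shape for the deferral test, assuming minitest with `ActiveJob::TestHelper`. The fixture name, record attributes, and the Enqueuer constructor are illustrative guesses, not the suite's actual helpers.

```ruby
require "test_helper"

class EnqueuerTimeRateLimitTest < ActiveSupport::TestCase
  include ActiveJob::TestHelper

  test "returns deferred and re-enqueues with delay when scraped too recently" do
    SourceMonitor.config.scraping.min_scrape_interval = 10.0
    source = sources(:default)                            # assumed fixture
    source.scrape_logs.create!(started_at: 2.seconds.ago) # attributes illustrative
    item = source.items.first

    result = SourceMonitor::Scraping::Enqueuer.new(item: item).enqueue # constructor assumed

    assert result.deferred?
    assert_enqueued_with(job: SourceMonitor::ScrapeItemJob, args: [item.id])
  end
end
```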

Verification

bin/rails test test/lib/source_monitor/scraping/enqueuer_test.rb test/jobs/source_monitor/scrape_item_job_test.rb
bin/rails test
bin/rubocop lib/source_monitor/scraping/enqueuer.rb lib/source_monitor/configuration/scraping_settings.rb app/jobs/source_monitor/scrape_item_job.rb

Success Criteria

  • Per-source scrape rate limiting derives last-scrape from scrape_logs MAX(started_at)
  • When rate-limited, job is re-enqueued with delay (remaining interval)
  • Source.min_scrape_interval overrides global ScrapingSettings.min_scrape_interval
  • Default global interval is 1 second
  • Nil/zero interval disables time rate limiting
  • All tests pass, RuboCop zero offenses

Research Findings

Phase 5 Research: Source Enhancements

Issue 1: Sources Pagination and Column Filtering

Files: app/controllers/source_monitor/sources_controller.rb (lines 20-41), app/views/source_monitor/sources/index.html.erb

Problem: No pagination — @sources = @q.result returns all sources. No page/limit applied. Paginator class already exists at SourceMonitor::Pagination::Paginator (used in ItemsController).

Fix: Add PER_PAGE constant and paginator call (same pattern as ItemsController; see the controller sketch below). Add prev/next page controls to the view (same pattern as items/index.html.erb lines 122-144). Add column filtering controls (status, health status, etc.).

Existing pagination pattern (ItemsController):

  • PER_PAGE = 50
  • @paginator = SourceMonitor::Pagination::Paginator.new(scope: @q.result, page: params[:page], per_page: PER_PAGE)
  • View uses @paginator.records, @paginator.prev_page, @paginator.next_page
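A sketch of the controller side of that fix, reusing the Paginator call shape listed above; the rest of the index action (search setup, ordering) is abbreviated.

```ruby
module SourceMonitor
  class SourcesController < ApplicationController
    PER_PAGE = 50

    def index
      # @q is the search object this action already builds.
      @paginator = SourceMonitor::Pagination::Paginator.new(
        scope: @q.result,
        page: params[:page],
        per_page: PER_PAGE
      )
      @sources = @paginator.records
      # The view renders prev/next controls from @paginator.prev_page / @paginator.next_page.
    end
  end
end
```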

Issue 2: Scraping Rate Limit Per Source

Files: lib/source_monitor/scraping/enqueuer.rb, lib/source_monitor/configuration/scraping_settings.rb, app/jobs/source_monitor/scrape_item_job.rb

Problem: Per-source in-flight limit exists but is disabled by default (DEFAULT_MAX_IN_FLIGHT = nil). No time-based rate limiting exists (e.g., max 1 request per second per source).

Current mechanism:

  • ScrapingSettings.max_in_flight_per_source — counts in-flight scrape jobs per source
  • Enqueuer#rate_limit_exhausted? — checks count vs limit
  • Default is nil (no limit)

Fix: Add time-based rate limiting: track last scrape timestamp per source (via Source#last_scraped_at or scrape_logs), enforce minimum interval between scrapes (default 1 second). Add min_scrape_interval config setting to ScrapingSettings.
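For illustration, how the two settings would interact once implemented; the assignment below uses the `SourceMonitor.config.scraping` path already referenced in the plan, and the override precedence matches the success criteria.

```ruby
# Global floor: at most one scrape every 2 seconds per source.
SourceMonitor.config.scraping.min_scrape_interval = 2.0

# Per-source override takes precedence when present; nil falls back to the global value.
source.update!(min_scrape_interval: 0.5)
```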

Issue 3: Word Count Metrics

Files: app/models/source_monitor/item_content.rb, app/models/source_monitor/item.rb, app/models/source_monitor/source.rb, app/views/source_monitor/items/index.html.erb, app/views/source_monitor/sources/index.html.erb, app/views/source_monitor/sources/_row.html.erb

Problem: No word_count column exists anywhere in the schema. Items have scraped_content in item_contents but no word count stored.

Fix (multi-step; steps 2 and 4 are sketched after this list):

  1. Migration: Add word_count integer to sourcemon_item_contents table
  2. Model callback: Compute word_count in ItemContent when scraped_content is assigned: self.word_count = scraped_content.to_s.split.size
  3. Backfill: Populate for existing records
  4. Source average: Add average_word_count method on Source via joins query
  5. Items index view: Add "Words" column showing item.item_content&.word_count
  6. Sources index view: Add "Avg Words" column
  7. Source show items table: Add "Words" column
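A sketch of steps 2 and 4, assuming `ItemContent belongs_to :item` and `Source has_many :items`. The overridden setter matches the "when scraped_content is assigned" wording; a before_save callback would work equally well.

```ruby
module SourceMonitor
  class ItemContent < ApplicationRecord
    belongs_to :item

    # Step 2: recompute whenever scraped_content is assigned.
    def scraped_content=(value)
      super
      self.word_count = value.to_s.split.size
    end
  end

  class Source < ApplicationRecord
    has_many :items

    # Step 4: average stored word_count across this source's scraped items.
    def average_word_count
      SourceMonitor::ItemContent.where(item: items).average(:word_count)&.round
    end
  end
end
```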