.vbw-planning/milestones/polish-and-reliability/phases/05-source-enhancements/.context-dev.md
Not available
Codebase mapping exists in .vbw-planning/codebase/. Key files:
ARCHITECTURE.md, CONCERNS.md, PATTERNS.md, DEPENDENCIES.md, STRUCTURE.md, CONVENTIONS.md, TESTING.md, STACK.md. Read CONVENTIONS.md, PATTERNS.md, STRUCTURE.md, and DEPENDENCIES.md first to bootstrap codebase understanding.
.vbw-planning/ROADMAP.md
.vbw-planning/STATE.md

.vbw-planning/ROADMAP.md (119 lines, first 30 shown)

# Roadmap
## Milestone: polish-and-reliability
### Phases
1. [x] **Backend Fixes** -- Fix browser User-Agent default, health check status transitions, and smarter scrape rate limiting
2. [x] **Favicon Support** -- Automatically save source favicons via Active Storage with background fetch job
3. [x] **Toast Stacking** -- Cap visible toast notifications with click-to-expand for bulk operation UX
4. [x] **Bug Fixes & Polish** -- Fix OPML import warning, toast positioning, dashboard alignment, source deletion, and published column
5. [ ] **Source Enhancements** -- Add pagination/filtering for sources, per-source scrape rate limiting, and word count metrics
### Phase Details
#### Phase 1: Backend Fixes
**Goal:** Fix three independent backend issues: bot-blocked feeds due to User-Agent, health check not updating status, and overly aggressive scrape limiting.
**Requirements:**
- REQ-UA-01: Change default User-Agent from "SourceMonitor/VERSION" to a browser-like string
- REQ-HC-01: After a successful manual health check on a declining/critical/warning source, trigger SourceHealthMonitor re-evaluation or directly transition status to "improving"
- REQ-SL-01: Refine max_in_flight_per_source to only count actively-running scrape jobs (not queued ones)
**Success Criteria:**
- [ ] Default UA string resembles a real browser (e.g., Mozilla/5.0 compatible)
- [ ] Successful manual health check on a declining source transitions it to improving
- [ ] Scrape limit counts only actively-running jobs; queued items don't count toward the cap
- [ ] All existing tests pass, new tests cover changed behavior
- [ ] RuboCop zero offenses, Brakeman zero warnings
.vbw-planning/STATE.md (32 lines)

# State
## Current Position
- **Milestone:** polish-and-reliability
- **Phase:** 5 -- Source Enhancements
- **Status:** Planned
- **Progress:** 80%
- **Plans:** 3
## Decisions
| Decision | Date | Context |
|----------|------|---------|
| Active Storage for favicons | 2026-02-20 | has_one_attached with guard, consistent with ItemContent pattern |
| Smarter scrape limit | 2026-02-20 | Count only running jobs, not queued; keeps safety but removes false bottleneck |
| Browser-like default UA | 2026-02-20 | Simple global fix for bot-blocked feeds like Uber |
| Health check triggers status update | 2026-02-20 | Successful manual health check should transition declining -> improving |
| Toast cap + hover expand | 2026-02-20 | Max 3 visible, +N more badge, hover to see all |
## Todos
- [x] Fix deprecation: `rails/tasks/statistics.rake` removed from Rakefile (2026-02-21)
## Metrics
- **Started:** 2026-02-20
- **Phases:** 5
- **Tests at start:** 1033
## Blockers
None
phase: 5
plan: 2
title: Per-Source Scrape Rate Limiting
wave: 1
depends_on: []
must_haves:
Add time-based per-source scrape rate limiting. The system derives the last scrape timestamp from scrape_logs MAX(started_at) per source. When a scrape is attempted too soon, the job re-enqueues itself with a delay equal to the remaining interval. Each source can override the global minimum interval via a new min_scrape_interval column.
- @ lib/source_monitor/scraping/enqueuer.rb -- current rate limiting checks in-flight count only; need to add time-based check
- @ lib/source_monitor/configuration/scraping_settings.rb -- current settings: max_in_flight_per_source, max_bulk_batch_size
- @ app/jobs/source_monitor/scrape_item_job.rb -- performs scrape; needs re-enqueue-with-delay logic
- @ app/models/source_monitor/scrape_log.rb -- has started_at column, belongs_to source
- @ app/models/source_monitor/source.rb -- will get min_scrape_interval column (but model file not modified -- just migration)
- @ .claude/skills/sm-engine-migration/SKILL.md -- migration conventions (sourcemon_ prefix)
- @ .claude/skills/sm-configuration-setting/SKILL.md -- config setting conventions

Files: db/migrate/TIMESTAMP_add_min_scrape_interval_to_sources.rb
Create migration adding min_scrape_interval (decimal, precision: 10, scale: 2, null: true, default: nil) to sourcemon_sources. No index needed -- this is a per-record configuration value, not a query filter. The nil default means "use global setting".
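A sketch of that migration, assuming the engine's `sourcemon_` table prefix and a current Rails migration version (the `[8.0]` tag and class name are assumptions; the timestamped filename comes from the generator):

```ruby
# db/migrate/TIMESTAMP_add_min_scrape_interval_to_sources.rb (sketch)
class AddMinScrapeIntervalToSources < ActiveRecord::Migration[8.0]
  def change
    # nil default means "fall back to the global scraping.min_scrape_interval"
    add_column :sourcemon_sources, :min_scrape_interval,
               :decimal, precision: 10, scale: 2, null: true, default: nil
  end
end
```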
Files: lib/source_monitor/configuration/scraping_settings.rb
Add attr_accessor :min_scrape_interval with DEFAULT_MIN_SCRAPE_INTERVAL = 1.0 (seconds). Add setter with normalize_numeric validation (same pattern as existing settings). Reset to default in reset!. This is the global fallback when a source's min_scrape_interval is nil.
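A minimal pure-Ruby sketch of the setting's shape; `ScrapingSettingsSketch` and its `normalize_numeric` are stand-ins for the gem's actual class and existing normalization helper, whose exact behavior this plan does not spell out:

```ruby
# Sketch of the new setting: default, reset!, and a normalizing setter.
class ScrapingSettingsSketch
  DEFAULT_MIN_SCRAPE_INTERVAL = 1.0

  attr_reader :min_scrape_interval

  def initialize
    reset!
  end

  def min_scrape_interval=(value)
    @min_scrape_interval = normalize_numeric(value)
  end

  def reset!
    @min_scrape_interval = DEFAULT_MIN_SCRAPE_INTERVAL
  end

  private

  # Stand-in for the existing helper: nil disables the override,
  # anything else must coerce to a positive number.
  def normalize_numeric(value)
    return nil if value.nil?

    number = Float(value)
    raise ArgumentError, "min_scrape_interval must be positive" unless number.positive?

    number
  end
end
```

A `nil` value at this level keeps the global default; a `nil` on the source record falls through to this global value.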
Files: lib/source_monitor/scraping/enqueuer.rb
Add private method time_rate_limited? that:
1. Resolves the effective interval: source.min_scrape_interval || SourceMonitor.config.scraping.min_scrape_interval
2. Returns [false, nil] if the interval is nil or <= 0
3. Queries source.scrape_logs.maximum(:started_at) for the last scrape time
4. Returns [false, nil] if there is no prior scrape
5. Computes elapsed = Time.current - last_scrape_at
6. If elapsed < interval, returns [true, { wait_seconds: (interval - elapsed).ceil, interval:, last_scrape_at: }]
7. Otherwise returns [false, nil]

In #enqueue, call time_rate_limited? AFTER the existing rate_limit_exhausted? check (inside the lock block). If rate-limited, set time_limited = true and time_limit_info.
After the lock block, if time_limited: instead of returning a failure, re-enqueue the job with delay via job_class.set(wait: info[:wait_seconds].seconds).perform_later(item.id) and return a new Result with status: :deferred and descriptive message. Add deferred? method to Result struct.
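The check itself is pure arithmetic, so it can be sketched as a standalone function; the real method lives on Enqueuer and reads source.scrape_logs.maximum(:started_at), while this sketch takes plain Time values instead of Time.current:

```ruby
# Sketch of time_rate_limited?: returns [limited?, info].
# interval is the effective per-source or global minimum, in seconds.
def time_rate_limited?(interval:, last_scrape_at:, now: Time.now)
  return [false, nil] if interval.nil? || interval <= 0  # limiting disabled
  return [false, nil] if last_scrape_at.nil?             # no prior scrape

  elapsed = now - last_scrape_at
  return [false, nil] unless elapsed < interval          # interval has passed

  [true, { wait_seconds: (interval - elapsed).ceil,
           interval: interval,
           last_scrape_at: last_scrape_at }]
end
```

Rounding the wait up with `.ceil` guarantees the re-enqueued job never fires before the interval has fully elapsed.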
Files: app/jobs/source_monitor/scrape_item_job.rb
Modify #perform to check time-based rate limit before scraping. Add early check: resolve effective interval, query source.scrape_logs.maximum(:started_at), calculate elapsed. If too soon: clear in-flight state, re-enqueue self with self.class.set(wait: remaining.seconds).perform_later(item_id), log the deferral, and return early. This ensures even directly-enqueued jobs (bypassing Enqueuer) respect rate limits.
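The guard's control flow can be sketched with the Rails pieces injected as lambdas, so the decision logic is visible on its own; `reenqueue` stands in for `self.class.set(wait: remaining.seconds).perform_later(item_id)`, and `clear_in_flight`/`scrape` for the job's actual collaborators:

```ruby
# Sketch of the early-return guard at the top of #perform.
def perform_or_defer(interval:, last_scrape_at:, now:, scrape:, reenqueue:, clear_in_flight:)
  if interval && interval.positive? && last_scrape_at
    elapsed = now - last_scrape_at
    if elapsed < interval
      remaining = (interval - elapsed).ceil
      clear_in_flight.call      # release the in-flight slot for the retry
      reenqueue.call(remaining) # re-enqueue self with the remaining delay
      return :deferred
    end
  end

  scrape.call
  :performed
end
```

Clearing the in-flight state before re-enqueueing matters: otherwise the deferred job would count against max_in_flight_per_source while it waits.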
Files: test/lib/source_monitor/scraping/enqueuer_test.rb, test/jobs/source_monitor/scrape_item_job_test.rb
Enqueuer tests: (1) allows scrape when no prior scrape exists, (2) allows scrape when elapsed > interval, (3) returns deferred status when elapsed < interval with correct wait_seconds, (4) per-source interval overrides global, (5) nil/zero interval disables time rate limiting, (6) deferred result re-enqueues job with delay.
ScrapeItemJob tests: (1) performs scrape when not rate-limited, (2) re-enqueues with delay when rate-limited, (3) clears in-flight state on deferral.
Run full test suite to verify no regressions.
bin/rails test test/lib/source_monitor/scraping/enqueuer_test.rb test/jobs/source_monitor/scrape_item_job_test.rb
bin/rails test
bin/rubocop lib/source_monitor/scraping/enqueuer.rb lib/source_monitor/configuration/scraping_settings.rb app/jobs/source_monitor/scrape_item_job.rb
Files: app/controllers/source_monitor/sources_controller.rb (lines 20-41), app/views/source_monitor/sources/index.html.erb
Problem: No pagination -- @sources = @q.result returns all sources. No page/limit applied. A Paginator class already exists at SourceMonitor::Pagination::Paginator (used in ItemsController).
Fix: Add PER_PAGE constant and paginator call (same pattern as ItemsController). Add prev/next page controls to the view (same pattern as items/index.html.erb lines 122-144). Add column filtering controls (status, health status, etc.).
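The offset arithmetic behind that pattern can be sketched as a tiny pure-Ruby paginator; `PaginatorSketch` is an assumed minimal equivalent of the real SourceMonitor::Pagination::Paginator (whose internals this plan does not show), with an array standing in for the Ransack relation:

```ruby
# Minimal stand-in showing the records/prev_page/next_page math.
class PaginatorSketch
  def initialize(scope:, page:, per_page:)
    @scope = scope
    @per_page = per_page
    @page = [page.to_i, 1].max # nil or garbage params fall back to page 1
  end

  def records
    @scope[(@page - 1) * @per_page, @per_page] || []
  end

  def prev_page
    @page > 1 ? @page - 1 : nil
  end

  def next_page
    @page * @per_page < @scope.size ? @page + 1 : nil
  end
end
```

The nil returns from prev_page/next_page are what let the view hide the controls at either boundary.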
Existing pagination pattern (ItemsController):
PER_PAGE = 50
@paginator = SourceMonitor::Pagination::Paginator.new(scope: @q.result, page: params[:page], per_page: PER_PAGE)
@paginator.records, @paginator.prev_page, @paginator.next_page

Files: lib/source_monitor/scraping/enqueuer.rb, lib/source_monitor/configuration/scraping_settings.rb, app/jobs/source_monitor/scrape_item_job.rb
Problem: Per-source in-flight limit exists but is disabled by default (DEFAULT_MAX_IN_FLIGHT = nil). No time-based rate limiting exists (e.g., max 1 request per second per source).
Current mechanism:
- ScrapingSettings.max_in_flight_per_source -- counts in-flight scrape jobs per source
- Enqueuer#rate_limit_exhausted? -- checks count vs limit

Fix: Add time-based rate limiting: track the last scrape timestamp per source (via Source#last_scraped_at or scrape_logs) and enforce a minimum interval between scrapes (default 1 second). Add a min_scrape_interval config setting to ScrapingSettings.
Files: app/models/source_monitor/item_content.rb, app/models/source_monitor/item.rb, app/models/source_monitor/source.rb, app/views/source_monitor/items/index.html.erb, app/views/source_monitor/sources/index.html.erb, app/views/source_monitor/sources/_row.html.erb
Problem: No word_count column exists anywhere in the schema. Items have scraped_content in item_contents but no word count stored.
Fix (multi-step):
1. Add a word_count integer column to the sourcemon_item_contents table
2. Set word_count in ItemContent when scraped_content is assigned: self.word_count = scraped_content.to_s.split.size
3. Add an average_word_count method on Source via a joins query
4. Display item.item_content&.word_count in the views
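The counting and averaging logic is plain Ruby and can be sketched outside the models; in the real code the first method would run in an ItemContent setter/callback and the second would be a joins-based query on Source, so these standalone functions are illustrative stand-ins:

```ruby
# Whitespace-split word count, matching the plan's
# scraped_content.to_s.split.size formula (nil-safe via to_s).
def word_count_for(scraped_content)
  scraped_content.to_s.split.size
end

# Average over items that actually have a stored count; nil when none do,
# mirroring what a SQL AVG over a nullable column would return.
def average_word_count(word_counts)
  counted = word_counts.compact
  return nil if counted.empty?

  (counted.sum.to_f / counted.size).round(1)
end
```

Skipping nil counts (rather than treating them as zero) keeps unscraped items from dragging a source's average down.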