.vbw-planning/milestones/ui-fixes-and-smart-scraping/phases/02-feed-reliability/.context-dev.md

Phase 02 Context

Goal

Not available

Codebase Map Available

Codebase mapping exists in .vbw-planning/codebase/. Key files:

  • ARCHITECTURE.md
  • CONCERNS.md
  • PATTERNS.md
  • DEPENDENCIES.md
  • STRUCTURE.md
  • CONVENTIONS.md
  • TESTING.md
  • STACK.md

Read CONVENTIONS.md, PATTERNS.md, STRUCTURE.md, and DEPENDENCIES.md first to bootstrap codebase understanding.

Changed Files (Delta)

  • .vbw-planning/STATE.md
  • Gemfile.lock

Code Slices

.vbw-planning/STATE.md (25 lines)

# State

**Project:** SourceMonitor
**Milestone:** ui-fixes-and-smart-scraping
**Phase:** 02 (Feed Reliability)
**Plans:** 4
**Progress:** 25%
**Status:** Planned

## Decisions

| Decision | Date | Context |
|----------|------|---------|
| Active Storage for favicons | 2026-02-20 | has_one_attached with guard, consistent with ItemContent pattern |
| Smarter scrape limit | 2026-02-20 | Count only running jobs, not queued; keeps safety but removes false bottleneck |
| Browser-like default UA | 2026-02-20 | Simple global fix for bot-blocked feeds like Uber |
| Health check triggers status update | 2026-02-20 | Successful manual health check should transition declining -> improving |
| Toast cap + hover expand | 2026-02-20 | Max 3 visible, +N more badge, hover to see all |

## Todos

- [x] Fix deprecation: `rails/tasks/statistics.rake` removed from Rakefile (2026-02-21)

## Blockers
None

Gemfile.lock (426 lines, first 30 shown)

PATH
  remote: .
  specs:
    source_monitor (0.10.2)
      cssbundling-rails (~> 1.4)
      faraday (~> 2.9)
      faraday-follow_redirects (~> 0.4)
      faraday-gzip (~> 3.0)
      faraday-retry (~> 2.2)
      feedjira (>= 3.2, < 5.0)
      jsbundling-rails (~> 1.3)
      nokolexbor (~> 0.5)
      rails (>= 8.0.3, < 10.0)
      ransack (~> 4.2)
      ruby-readability (~> 0.7)
      solid_cable (>= 3.0, < 4.0)
      solid_queue (>= 0.3, < 3.0)
      turbo-rails (~> 2.0)

GEM
  remote: https://rubygems.org/
  specs:
    action_text-trix (2.1.16)
      railties
    actioncable (8.1.2)
      actionpack (= 8.1.2)
      activesupport (= 8.1.2)
      nio4r (~> 2.0)
      websocket-driver (>= 0.6.1)
      zeitwerk (~> 2.6)

Active Plan


phase: "02"
plan: "01"
title: "Error Categorization and Blocked Feed Detection"
wave: 1
depends_on: []
must_haves:

  • "BlockedError subclass of FetchError with CODE='blocked'"
  • "HTML body sniffing in FeedFetcher#parse_feed detects Cloudflare/login walls before Feedjira"
  • "error_category string column on FetchLog (network, parse, blocked, auth, unknown)"
  • "SourceUpdater maps error class to error_category when creating fetch logs"
  • "RetryPolicy handles BlockedError with appropriate retry/circuit policy"
  • "Tests for all new error paths"

Plan 01: Error Categorization and Blocked Feed Detection

Summary

Add structured error categorization to the fetch pipeline. Introduce BlockedError for Cloudflare/login wall detection by sniffing HTML response bodies before passing to Feedjira. Add error_category column to FetchLog for coarse filtering. Update RetryPolicy to handle blocked errors appropriately.

Tasks

Task 1: Add BlockedError to error hierarchy

Files to modify:

  • lib/source_monitor/fetching/fetch_error.rb

Steps:

  1. Add BlockedError < FetchError with CODE = "blocked" after the existing ParsingError class
  2. Add AuthenticationError < FetchError with CODE = "authentication" for 401/403 responses that are NOT blocked pages
  3. BlockedError should accept an optional blocked_by keyword (e.g., "cloudflare", "login_wall", "captcha", "unknown") stored as an attribute
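
The three steps above could look roughly like the sketch below. The FetchError base shown here is a simplified stand-in (the real class also carries original_error, response, and http_status), and the default message and the placeholder CODE on the base class are assumptions made for illustration.

```ruby
# Simplified stand-in for the existing base class (assumption: the real
# FetchError exposes more attributes than shown here).
class FetchError < StandardError
  CODE = "fetch_error" # placeholder value, not taken from the codebase

  # Resolve CODE on the concrete subclass.
  def code
    self.class::CODE
  end
end

# Raised when a response looks like a block page rather than a feed.
class BlockedError < FetchError
  CODE = "blocked"

  attr_reader :blocked_by

  # blocked_by is optional: "cloudflare", "login_wall", "captcha", or "unknown".
  def initialize(message = "Feed request was blocked", blocked_by: "unknown")
    super(message)
    @blocked_by = blocked_by
  end
end

# Raised for 401/403 responses that are not block pages.
class AuthenticationError < FetchError
  CODE = "authentication"
end

error = BlockedError.new(blocked_by: "cloudflare")
puts "#{error.code} via #{error.blocked_by}" # prints "blocked via cloudflare"
```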

Task 2: Add HTML body sniffing to detect blocked responses

Files to modify:

  • lib/source_monitor/fetching/feed_fetcher.rb

Steps:

  1. Add a new private method detect_blocked_response(body, response) that checks the response body BEFORE calling Feedjira.parse
  2. Detection markers for Cloudflare:
    • <title>Just a moment</title> or <title>Attention Required</title>
    • cf-challenge or cf-browser-verification in body
    • __cf_chl_ in body
    • data-ray= attribute (Cloudflare Ray ID)
  3. Detection markers for login walls / auth walls:
    • <title>Log in</title> or <title>Sign in</title> (case-insensitive)
    • Response has text/html content-type AND body starts with <!DOCTYPE html or <html AND contains <form with password input
  4. Detection markers for CAPTCHA:
    • g-recaptcha or h-captcha in body
  5. Modify parse_feed(body, response) to call detect_blocked_response(body, response) first. If detection returns a blocked_by value, raise BlockedError with that value instead of attempting Feedjira parse
  6. Keep a size limit on sniffing -- only inspect first 4KB of body to avoid scanning huge feeds
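
A minimal sketch of the detection steps above, in plain Ruby. It takes the content type as a string instead of the response object, because the response shape is not shown here. The constant names are hypothetical, and the Cloudflare title patterns tolerate trailing text inside the title tag, which is an assumption beyond the exact markers listed.

```ruby
SNIFF_LIMIT = 4096 # only the first 4KB of the body is inspected (step 6)

CLOUDFLARE_MARKERS = [
  %r{<title>\s*(Just a moment|Attention Required)[^<]*</title>}i,
  /cf-challenge/,
  /cf-browser-verification/,
  /__cf_chl_/,
  /data-ray=/
].freeze
CAPTCHA_MARKERS = [/g-recaptcha/, /h-captcha/].freeze
LOGIN_TITLE = %r{<title>\s*(Log in|Sign in)\s*</title>}i
HTML_START = /\A\s*(<!DOCTYPE html|<html)/i
PASSWORD_FORM = /<form\b.*<input\b[^>]*type=["']?password/mi

# Returns a blocked_by string, or nil when no block marker is found.
def detect_blocked_response(body, content_type)
  # byteslice may cut a multibyte character in half; scrub drops the remnant.
  head = body.to_s.byteslice(0, SNIFF_LIMIT).to_s.scrub("")

  return "cloudflare" if CLOUDFLARE_MARKERS.any? { |marker| head.match?(marker) }
  return "captcha" if CAPTCHA_MARKERS.any? { |marker| head.match?(marker) }
  return "login_wall" if head.match?(LOGIN_TITLE)

  # HTML document with a password form counts as a login wall.
  html = content_type.to_s.include?("text/html") && head.match?(HTML_START)
  return "login_wall" if html && head.match?(PASSWORD_FORM)

  nil
end
```

In parse_feed, a non-nil return value would be passed to BlockedError as blocked_by before Feedjira is ever called. The check order here (Cloudflare, then CAPTCHA, then login wall) is a design choice of the sketch: Cloudflare challenge pages may embed a CAPTCHA widget, so the more specific label wins.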

Task 3: Add error_category column to FetchLog

Files to create:

  • db/migrate/TIMESTAMP_add_error_category_to_fetch_logs.rb

Files to modify:

  • app/models/source_monitor/fetch_log.rb

Steps:

  1. Create migration adding error_category string column to sourcemon_fetch_logs (nullable, no default)
  2. Add index on error_category for filtering
  3. In FetchLog model, add validation: validates :error_category, inclusion: { in: %w[network parse blocked auth unknown], allow_nil: true }
  4. Add scope: scope :by_category, ->(category) { where(error_category: category) }
  5. Add error_category to ransackable_attributes if the method exists on FetchLog

Task 4: Map error classes to categories in SourceUpdater

Files to modify:

  • lib/source_monitor/fetching/feed_fetcher/source_updater.rb

Steps:

  1. Add a class method or constant ERROR_CATEGORY_MAP that maps error classes to category strings:
    • TimeoutError -> "network"
    • ConnectionError -> "network"
    • HTTPError -> categorize by status: 401/403 -> "auth", others -> "network"
    • ParsingError -> "parse"
    • BlockedError -> "blocked"
    • AuthenticationError -> "auth"
    • UnexpectedResponseError -> "unknown"
    • FetchError (base) -> "unknown"
  2. Add private method categorize_error(error) that uses the map
  3. Modify create_fetch_log to include error_category: categorize_error(error) when error is present
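
One possible shape for the mapping, as a self-contained sketch. The error classes are stubbed inside a hypothetical CategorySketch module so the example runs on its own; in the real code they already exist under the fetching namespace. HTTPError is handled separately because its category depends on the status code rather than on the class alone.

```ruby
module CategorySketch
  # Stubs standing in for the existing error hierarchy (assumption).
  class FetchError < StandardError
    attr_reader :http_status

    def initialize(message = nil, http_status: nil)
      super(message)
      @http_status = http_status
    end
  end

  class TimeoutError < FetchError; end
  class ConnectionError < FetchError; end
  class HTTPError < FetchError; end
  class ParsingError < FetchError; end
  class BlockedError < FetchError; end
  class AuthenticationError < FetchError; end
  class UnexpectedResponseError < FetchError; end

  ERROR_CATEGORY_MAP = {
    TimeoutError => "network",
    ConnectionError => "network",
    ParsingError => "parse",
    BlockedError => "blocked",
    AuthenticationError => "auth",
    UnexpectedResponseError => "unknown"
  }.freeze

  # Returns the category string for an error, or nil when there is no error.
  def self.categorize_error(error)
    return nil if error.nil?

    if error.is_a?(HTTPError)
      return [401, 403].include?(error.http_status) ? "auth" : "network"
    end

    # is_a? lets subclasses of a mapped class inherit its category.
    match = ERROR_CATEGORY_MAP.find { |klass, _category| error.is_a?(klass) }
    match ? match.last : "unknown"
  end
end

error = CategorySketch::HTTPError.new("forbidden", http_status: 403)
puts CategorySketch.categorize_error(error) # prints "auth"
```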

Task 5: Add BlockedError to RetryPolicy

Files to modify:

  • lib/source_monitor/fetching/retry_policy.rb

Steps:

  1. Add :blocked key to DEFAULTS: blocked: { attempts: 1, wait: 1.hour, circuit_wait: 4.hours } -- blocked feeds are unlikely to resolve quickly, so the circuit opens aggressively
  2. Update policy_key method to return :blocked for BlockedError
  3. Add :authentication key: authentication: { attempts: 1, wait: 1.hour, circuit_wait: 4.hours } -- auth failures also unlikely to self-resolve
  4. Update policy_key to return :authentication for AuthenticationError
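
The two new policy entries might be wired up as in the sketch below. It runs without Rails, so durations are written in seconds instead of 1.hour and 4.hours. The RetrySketch module, the policy_for helper, and the :unexpected fallback key are hypothetical; the existing DEFAULTS entries are not shown in this document and are omitted.

```ruby
module RetrySketch
  HOUR = 3600 # seconds; the plan uses ActiveSupport durations instead

  # Stubs standing in for the real error classes (assumption).
  class FetchError < StandardError; end
  class BlockedError < FetchError; end
  class AuthenticationError < FetchError; end

  # Only the two new keys are listed here.
  DEFAULTS = {
    blocked: { attempts: 1, wait: 1 * HOUR, circuit_wait: 4 * HOUR },
    authentication: { attempts: 1, wait: 1 * HOUR, circuit_wait: 4 * HOUR }
  }.freeze

  # Maps an error instance to its policy key.
  def self.policy_key(error)
    case error
    when BlockedError then :blocked
    when AuthenticationError then :authentication
    else :unexpected # hypothetical fallback; the real method covers more types
    end
  end

  # Hypothetical helper: looks up the config for an error, nil if not covered.
  def self.policy_for(error)
    DEFAULTS[policy_key(error)]
  end
end

policy = RetrySketch.policy_for(RetrySketch::BlockedError.new)
puts policy[:circuit_wait] / RetrySketch::HOUR # prints 4
```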

Task 6: Update FeedFetcher perform_fetch to re-raise BlockedError

Files to modify:

  • lib/source_monitor/fetching/feed_fetcher.rb

Steps:

  1. In perform_fetch, add BlockedError and AuthenticationError to the list of errors that are re-raised directly (line 81): rescue TimeoutError, ConnectionError, HTTPError, ParsingError, BlockedError, AuthenticationError => error

Task 7: Tests

Files to create:

  • test/lib/source_monitor/fetching/blocked_error_test.rb
  • test/lib/source_monitor/fetching/html_detection_test.rb

Files to modify:

  • test/lib/source_monitor/fetching/feed_fetcher_test.rb
  • test/lib/source_monitor/fetching/retry_policy_test.rb
  • test/lib/source_monitor/fetching/source_updater_test.rb (or wherever source_updater tests live)
  • test/models/source_monitor/fetch_log_test.rb

Steps:

  1. blocked_error_test.rb: Test BlockedError has CODE="blocked", accepts blocked_by keyword, inherits from FetchError
  2. html_detection_test.rb: Test detect_blocked_response with:
    • Cloudflare challenge HTML -> returns "cloudflare"
    • Login wall HTML with password form -> returns "login_wall"
    • CAPTCHA page -> returns "captcha"
    • Valid RSS/Atom XML -> returns nil (no block detected)
    • HTML page without block markers -> returns nil
    • Large body only inspects first 4KB
  3. feed_fetcher_test.rb: Add test that when HTTP 200 returns CF challenge HTML, result is :failed with BlockedError (not ParsingError)
  4. retry_policy_test.rb: Add tests for :blocked and :authentication policy keys
  5. source_updater_test.rb: Test that error_category is correctly set on fetch_log for each error type
  6. fetch_log_test.rb: Test error_category validation, by_category scope

Acceptance Criteria

  • Cloudflare challenge HTML (200 OK) raises BlockedError, NOT ParsingError
  • FetchLog records include error_category for failed fetches
  • Categories correctly map: network errors -> "network", parse errors -> "parse", CF blocks -> "blocked", 401/403 -> "auth"
  • RetryPolicy applies aggressive circuit break (4h) for blocked feeds
  • Valid XML/RSS/Atom feeds are NOT falsely detected as blocked
  • All new code has test coverage
  • bin/rubocop passes with zero offenses
  • bin/rails test passes

Research Findings

Phase 02: Feed Reliability -- Research

Findings

1. Fetch Pipeline Architecture

  • FetchRunner (lib/source_monitor/fetching/fetch_runner.rb) coordinates fetches with PG advisory locks
  • FeedFetcher (lib/source_monitor/fetching/feed_fetcher.rb) performs HTTP request, parses with Feedjira, processes entries
  • AdvisoryLock (lib/source_monitor/fetching/advisory_lock.rb) wraps pg_try_advisory_lock — non-blocking, raises NotAcquiredError immediately
  • FetchRunner catches NotAcquiredError and re-raises as ConcurrencyError
  • FetchFeedJob (app/jobs/source_monitor/fetch_feed_job.rb) retries ConcurrencyError 5 times with 30s wait — this is the problematic path for force-fetch

2. Force-Fetch Flow

  • Triggered by SourceRetriesController#create → FetchRunner.enqueue(source_id, force: true)
  • enqueue sets fetch_status: "queued" and enqueues FetchFeedJob.perform_later(source.id, force: true)
  • FetchFeedJob#perform passes force: true to FetchRunner.new — but force flag only affects circuit breaker check (skip_due_to_circuit), not lock behavior
  • When lock is busy, ConcurrencyError triggers retry_on (5 attempts, 30s each) — user sees "failed" after 2.5 min of retries

3. Error Hierarchy

  • FetchError base class with original_error, response, code, http_status
  • Subclasses: TimeoutError, ConnectionError, HTTPError, ParsingError, UnexpectedResponseError
  • Each has a CODE constant (e.g., "timeout", "connection", "parsing")
  • No Cloudflare/Blocked category exists: CF responses that return 200 with an HTML challenge page go through parse_feed → Feedjira.parse, which fails and is wrapped as ParsingError (misleading)

4. Response Handling Gap

  • FeedFetcher#parse_feed (line 212-216): calls Feedjira.parse(body) with no content-type or body inspection
  • When Cloudflare returns a 200 with an HTML challenge page, Feedjira raises "No valid XML parser" → wrapped as ParsingError
  • No HTML detection: response body is not checked for Cloudflare markers (<title>Just a moment</title>, cf-challenge, etc.)
  • This is the root cause of "No valid XML parser" errors for CF-blocked feeds

5. Health Status System

  • SourceHealthMonitor (lib/source_monitor/health/source_health_monitor.rb) uses rolling success rate from recent fetch_logs
  • Status values: healthy, warning, critical, declining, improving, auto_paused
  • Auto-pause based on success rate threshold (configurable), not consecutive failure count
  • Source fields: health_status, health_status_changed_at, rolling_success_rate, auto_paused_until, auto_paused_at
  • Gap: Discussion decided on "5 consecutive failures" auto-pause, but current system uses rate-based threshold

6. Fetch Log Storage

  • FetchLog model with success, items_created, items_updated, items_failed, http_response_headers
  • Created by SourceUpdater#create_fetch_log — stores response details, duration, error info
  • No structured error category field — errors stored as raw error class/message

7. HTTP Client

  • SourceMonitor::HTTP.client uses Faraday with retry middleware (4 retries by default)
  • Default UA: "Mozilla/5.0 (compatible; SourceMonitor/#{VERSION})" — already browser-like from Phase 1 decision
  • Supports custom headers per source via source.custom_headers
  • Conditional GET support: If-None-Match (etag), If-Modified-Since (last_modified)
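
As an illustration of the conditional GET headers mentioned above, the sketch below builds the two request headers from stored values. The conditional_headers helper is hypothetical, and it assumes last_modified is held as a Time; the real client may store and format these values differently.

```ruby
require "time" # provides Time#httpdate

# Builds conditional request headers from previously stored validators.
def conditional_headers(etag: nil, last_modified: nil)
  headers = {}
  headers["If-None-Match"] = etag if etag
  headers["If-Modified-Since"] = last_modified.httpdate if last_modified
  headers
end

headers = conditional_headers(
  etag: '"abc123"',
  last_modified: Time.utc(2026, 2, 20, 12, 0, 0)
)
headers.each { |name, value| puts "#{name}: #{value}" }
```

A server that honors these headers can answer 304 Not Modified with no body, which skips parsing entirely.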

8. Retry Policy

  • RetryPolicy maps error types to retry configs (attempts, wait, circuit_wait)
  • Parsing errors: 1 attempt, 30min wait, 2hr circuit — appropriate for genuine parse failures
  • CF-blocked feeds hitting parsing policy get circuit-broken after 1 retry (good, but wrong category)

Relevant Patterns

  • Error hierarchy: Extend FetchError for new error types (add BlockedError)
  • Source fields: Add via migration, use update_columns for status updates (matches existing pattern)
  • Health monitor: Uses consecutive_failures helper already (line 183) — can leverage for auto-pause by count
  • FetchLog syncs: after_save :sync_log_entry creates unified LogEntry — new fields propagate
  • Fetch status lifecycle: "idle" → "queued" → "fetching" → "idle"/"failed"
  • Circuit breaker pattern: Already implemented with fetch_circuit_opened_at, fetch_circuit_until

Risks

  1. Content-type detection false positives: Some feeds serve valid XML with text/html content type. Must check body content, not just content-type header.
  2. Auto-pause by consecutive failures vs rate: Current health monitor uses rate-based threshold. Need to decide whether to modify existing system or add parallel consecutive-failure check. Recommend adding consecutive_fetch_failures counter to Source, increment on failure, reset on success — simpler than changing rate-based system.
  3. Migration complexity: Adding error_category to FetchLog requires migration + backfill consideration (existing logs won't have category).
  4. Force-fetch UX: Skipping with "already in progress" message needs a Turbo Stream response to surface the message to the user immediately.

Recommendations

Plan 1: Error Categorization + HTML Detection

  • Add BlockedError < FetchError with CODE="blocked"
  • Add HTML body sniffing in FeedFetcher#parse_feed — before Feedjira, check for CF markers
  • Add error_category string column to FetchLog (network, parse, blocked, auth, unknown)
  • Categorize in SourceUpdater#create_fetch_log based on error class

Plan 2: Force-Fetch Lock Contention

  • Modify FetchFeedJob: when force: true, don't retry_on ConcurrencyError — rescue and return with user-facing message
  • Update FetchRunner.enqueue or add check: if source.fetch_status == "fetching", skip with message
  • Return "Fetch already in progress" via toast/Turbo Stream
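
The enqueue-side check could be sketched as follows. SketchSource and enqueue_fetch are hypothetical stand-ins for the Source model and FetchRunner.enqueue, and the returned hash is just one way to carry the message to the controller; the job enqueue itself is left as a comment.

```ruby
# Stand-in for the Source record (assumption: only fetch_status matters here).
SketchSource = Struct.new(:id, :fetch_status, keyword_init: true)

# Skips enqueueing when a fetch is already running and reports why.
def enqueue_fetch(source, force: false)
  if source.fetch_status == "fetching"
    return { enqueued: false, message: "Fetch already in progress" }
  end

  source.fetch_status = "queued"
  # The real method would call FetchFeedJob.perform_later(source.id, force: force).
  { enqueued: true, message: nil }
end

busy = SketchSource.new(id: 1, fetch_status: "fetching")
puts enqueue_fetch(busy, force: true)[:message] # prints "Fetch already in progress"
```

A check like this narrows the window but does not close it, since the status can change between the check and the job run. The job would still need to rescue ConcurrencyError for forced fetches, as the first bullet describes.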

Plan 3: Cloudflare Light Bypass

  • Before raising BlockedError: try common workarounds
    • Cookie jar persistence (re-request with cookies from initial response)
    • Alternate UA strings (rotate through a small list)
    • Add Cache-Control: no-cache header
  • If all workarounds fail: raise BlockedError, set "blocked" badge on source

Plan 4: Auto-Pause by Consecutive Failures

  • Add consecutive_fetch_failures integer to Source (migration)
  • Increment on failure, reset to 0 on success (in SourceUpdater or FetchRunner)
  • When count >= 5: trigger auto-pause (set auto_paused_until, auto_paused_at)
  • Notification: toast + log entry when auto-pause triggers
  • Integrate with existing health status transitions (declining → auto_paused)
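
The counter logic for Plan 4 can be sketched in plain Ruby. PausableSource and record_fetch_result are hypothetical names, and the 24-hour pause length is an assumption, since the pause duration is not specified above.

```ruby
AUTO_PAUSE_THRESHOLD = 5        # consecutive failures before pausing
AUTO_PAUSE_SECONDS = 24 * 3600  # assumed pause length; not specified in the plan

# Stand-in for the Source fields involved (assumption).
PausableSource = Struct.new(
  :consecutive_fetch_failures, :auto_paused_at, :auto_paused_until, :health_status,
  keyword_init: true
)

# Updates the failure counter and triggers auto-pause at the threshold.
def record_fetch_result(source, success:, now: Time.now)
  if success
    source.consecutive_fetch_failures = 0
    return source
  end

  source.consecutive_fetch_failures += 1
  at_threshold = source.consecutive_fetch_failures >= AUTO_PAUSE_THRESHOLD
  if at_threshold && source.auto_paused_until.nil?
    source.auto_paused_at = now
    source.auto_paused_until = now + AUTO_PAUSE_SECONDS
    source.health_status = "auto_paused"
  end
  source
end

source = PausableSource.new(consecutive_fetch_failures: 0, health_status: "declining")
5.times { record_fetch_result(source, success: false) }
puts source.health_status # prints "auto_paused"
```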