.vbw-planning/milestones/ui-fixes-and-smart-scraping/phases/02-feed-reliability/01-PLAN.md
Add structured error categorization to the fetch pipeline. Introduce BlockedError for Cloudflare/login wall detection by sniffing HTML response bodies before passing to Feedjira. Add error_category column to FetchLog for coarse filtering. Update RetryPolicy to handle blocked errors appropriately.
Files to modify:
lib/source_monitor/fetching/fetch_error.rb
Steps:
- Add BlockedError < FetchError with CODE = "blocked" after the existing ParsingError class
- Add AuthenticationError < FetchError with CODE = "authentication" for 401/403 responses that are NOT blocked pages
- BlockedError accepts a blocked_by keyword (e.g., "cloudflare", "login_wall", "captcha", "unknown") stored as an attribute

Files to modify:
lib/source_monitor/fetching/feed_fetcher.rb
Steps:
- Add detect_blocked_response(body, response) that checks the response body BEFORE calling Feedjira.parse. Detection patterns:
  - Cloudflare: <title>Just a moment</title> or <title>Attention Required</title>
  - Cloudflare: cf-challenge or cf-browser-verification in body
  - Cloudflare: __cf_chl_ in body
  - Cloudflare: data-ray= attribute (Cloudflare Ray ID)
  - Login wall: <title>Log in</title> or <title>Sign in</title> (case-insensitive)
  - Login wall: text/html content-type AND body starts with <!DOCTYPE html or <html AND contains <form with password input
  - CAPTCHA: g-recaptcha or h-captcha in body
- Update parse_feed(body, response) to call detect_blocked_response(body, response) first. If detection returns a blocked_by value, raise BlockedError with that value instead of attempting the Feedjira parse

Files to create:
db/migrate/TIMESTAMP_add_error_category_to_fetch_logs.rb

Files to modify:
app/models/source_monitor/fetch_log.rb
Steps:
- Add error_category string column to sourcemon_fetch_logs (nullable, no default)
- Add an index on error_category for filtering
- Add validates :error_category, inclusion: { in: %w[network parse blocked auth unknown], allow_nil: true }
- Add scope :by_category, ->(category) { where(error_category: category) }
- Add error_category to ransackable_attributes if the method exists on FetchLog

Files to modify:
lib/source_monitor/fetching/feed_fetcher/source_updater.rb
Steps:
- Add ERROR_CATEGORY_MAP that maps error classes to category strings:
  - TimeoutError -> "network"
  - ConnectionError -> "network"
  - HTTPError -> categorize by status: 401/403 -> "auth", others -> "network"
  - ParsingError -> "parse"
  - BlockedError -> "blocked"
  - AuthenticationError -> "auth"
  - UnexpectedResponseError -> "unknown"
  - FetchError (base) -> "unknown"
- Add categorize_error(error) that uses the map
- Update create_fetch_log to include error_category: categorize_error(error) when error is present

Files to modify:
lib/source_monitor/fetching/retry_policy.rb
Steps:
- Add :blocked key to DEFAULTS: blocked: { attempts: 1, wait: 1.hour, circuit_wait: 4.hours }. Blocked feeds are unlikely to resolve quickly, so the circuit breaks aggressively
- Update the policy_key method to return :blocked for BlockedError
- Add :authentication key: authentication: { attempts: 1, wait: 1.hour, circuit_wait: 4.hours }. Auth failures are also unlikely to self-resolve
- Update policy_key to return :authentication for AuthenticationError

Files to modify:
lib/source_monitor/fetching/feed_fetcher.rb
Steps:
- In perform_fetch, add BlockedError and AuthenticationError to the list of errors that are re-raised directly (line 81): rescue TimeoutError, ConnectionError, HTTPError, ParsingError, BlockedError, AuthenticationError => error

Files to create:
test/lib/source_monitor/fetching/blocked_error_test.rb
test/lib/source_monitor/fetching/html_detection_test.rb

Files to modify:
test/lib/source_monitor/fetching/feed_fetcher_test.rb
test/lib/source_monitor/fetching/retry_policy_test.rb
test/lib/source_monitor/fetching/source_updater_test.rb (or wherever source_updater tests live)
test/models/source_monitor/fetch_log_test.rb
Steps:
- Blocked HTML responses raise BlockedError, NOT ParsingError
- FetchLog records include error_category for failed fetches
- RetryPolicy applies aggressive circuit break (4h) for blocked feeds
- bin/rubocop passes with zero offenses
- bin/rails test passes
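
The error classes and the category mapping from the fetch_error.rb and source_updater.rb steps could be sketched roughly as follows. This is a minimal illustration, not the project's code: the FetchSketch module, the constructor signatures, and the status reader on HTTPError are assumptions, and the existing classes in fetch_error.rb may look different.

```ruby
# Hypothetical namespace so the sketch does not clash with real constants.
module FetchSketch
  # Stand-ins for the existing hierarchy (assumed shapes).
  class FetchError < StandardError; end
  class TimeoutError < FetchError; end
  class ConnectionError < FetchError; end
  class ParsingError < FetchError; end
  class UnexpectedResponseError < FetchError; end

  class HTTPError < FetchError
    attr_reader :status # assumed: HTTPError exposes the response status

    def initialize(message = "HTTP error", status: nil)
      super(message)
      @status = status
    end
  end

  # New: raised when the body looks like a block page instead of a feed.
  class BlockedError < FetchError
    CODE = "blocked"
    attr_reader :blocked_by

    def initialize(message = nil, blocked_by: "unknown")
      super(message || "Blocked by #{blocked_by}")
      @blocked_by = blocked_by
    end
  end

  # New: 401/403 responses that are not block pages.
  class AuthenticationError < FetchError
    CODE = "authentication"
  end

  ERROR_CATEGORY_MAP = {
    TimeoutError => "network",
    ConnectionError => "network",
    ParsingError => "parse",
    BlockedError => "blocked",
    AuthenticationError => "auth",
    UnexpectedResponseError => "unknown"
  }.freeze

  # Returns a category string, or nil when there is no error.
  def self.categorize_error(error)
    return nil if error.nil?

    # HTTPError is split by status rather than looked up in the map.
    if error.is_a?(HTTPError)
      return [401, 403].include?(error.status) ? "auth" : "network"
    end

    _, category = ERROR_CATEGORY_MAP.find { |klass, _| error.is_a?(klass) }
    category || "unknown" # base FetchError and anything unmapped
  end
end
```

Handling HTTPError outside the map keeps the map a plain class-to-string lookup; storing a lambda in the map would be an equally reasonable choice.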
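
The body-sniffing step for feed_fetcher.rb might look something like the sketch below. It is only an illustration under stated assumptions: the plan's method takes (body, response), but since the response object's interface is not shown here, this version takes a content-type string instead, and the BlockDetectionSketch module and constant names are hypothetical. The title patterns match on a prefix so that variants such as "Just a moment..." are also caught.

```ruby
# Hypothetical module; in the plan this would be a method on the feed fetcher.
module BlockDetectionSketch
  CLOUDFLARE_PATTERNS = [
    /<title>\s*(Just a moment|Attention Required)/i,
    /cf-challenge|cf-browser-verification/,
    /__cf_chl_/,
    /data-ray=/ # Cloudflare Ray ID attribute
  ].freeze

  LOGIN_TITLE   = /<title>\s*(Log in|Sign in)/i
  HTML_START    = /\A\s*(<!DOCTYPE html|<html)/i
  PASSWORD_FORM = /<form\b.*?<input\b[^>]*type=["']?password/im
  CAPTCHA       = /g-recaptcha|h-captcha/

  # Returns a blocked_by string ("cloudflare", "login_wall", "captcha")
  # or nil when the body does not look like a block page.
  def self.detect_blocked_response(body, content_type)
    return nil if body.nil? || body.empty?

    return "cloudflare" if CLOUDFLARE_PATTERNS.any? { |pattern| body.match?(pattern) }
    return "login_wall" if body.match?(LOGIN_TITLE)

    html_page = content_type.to_s.include?("text/html") && body.match?(HTML_START)
    return "login_wall" if html_page && body.match?(PASSWORD_FORM)

    return "captcha" if body.match?(CAPTCHA)

    nil
  end
end
```

A caller in parse_feed would raise BlockedError with the returned value when it is non-nil, and hand the body to Feedjira otherwise.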
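
For the retry_policy.rb steps, a rough sketch of the two new policy entries and the policy_key dispatch follows. The plan writes durations as 1.hour and 4.hours (ActiveSupport); plain seconds are used here so the snippet runs without Rails. The RetryPolicySketch module, the stub error classes, and the :unexpected fallback key are assumptions made for illustration, not the existing RetryPolicy.

```ruby
# Hypothetical stand-in for the relevant parts of RetryPolicy.
module RetryPolicySketch
  HOUR = 60 * 60 # seconds; the plan itself uses ActiveSupport durations

  # Stub error classes so the sketch is self-contained.
  class BlockedError < StandardError; end
  class AuthenticationError < StandardError; end

  # Only the two entries the plan adds; existing keys are omitted.
  DEFAULTS = {
    blocked:        { attempts: 1, wait: 1 * HOUR, circuit_wait: 4 * HOUR },
    authentication: { attempts: 1, wait: 1 * HOUR, circuit_wait: 4 * HOUR }
  }.freeze

  # Maps an error to the key of its retry policy.
  def self.policy_key(error)
    case error
    when BlockedError then :blocked
    when AuthenticationError then :authentication
    else :unexpected # assumed fallback key, not taken from the plan
    end
  end
end

policy = RetryPolicySketch::DEFAULTS[RetryPolicySketch.policy_key(RetryPolicySketch::BlockedError.new)]
puts policy[:circuit_wait] # seconds the circuit stays open for a blocked feed
```

With a single attempt and a four-hour circuit wait, a blocked or unauthenticated feed is not retried in quick succession.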