.vbw-planning/milestones/ui-fixes-and-smart-scraping/phases/02-feed-reliability/.context-dev.md
Not available
Codebase mapping exists in .vbw-planning/codebase/. Key files:
- ARCHITECTURE.md
- CONCERNS.md
- PATTERNS.md
- DEPENDENCIES.md
- STRUCTURE.md
- CONVENTIONS.md
- TESTING.md
- STACK.md

Read CONVENTIONS.md, PATTERNS.md, STRUCTURE.md, and DEPENDENCIES.md first to bootstrap codebase understanding.

- .vbw-planning/STATE.md
- Gemfile.lock

.vbw-planning/STATE.md (25 lines)

# State
**Project:** SourceMonitor
**Milestone:** ui-fixes-and-smart-scraping
**Phase:** 02 (Feed Reliability)
**Plans:** 4
**Progress:** 25%
**Status:** Planned
## Decisions
| Decision | Date | Context |
|----------|------|---------|
| Active Storage for favicons | 2026-02-20 | has_one_attached with guard, consistent with ItemContent pattern |
| Smarter scrape limit | 2026-02-20 | Count only running jobs, not queued; keeps safety but removes false bottleneck |
| Browser-like default UA | 2026-02-20 | Simple global fix for bot-blocked feeds like Uber |
| Health check triggers status update | 2026-02-20 | Successful manual health check should transition declining -> improving |
| Toast cap + hover expand | 2026-02-20 | Max 3 visible, +N more badge, hover to see all |
## Todos
- [x] Fix deprecation: `rails/tasks/statistics.rake` removed from Rakefile (2026-02-21)
## Blockers
None
Gemfile.lock (426 lines, first 30 shown)

PATH
  remote: .
  specs:
    source_monitor (0.10.2)
      cssbundling-rails (~> 1.4)
      faraday (~> 2.9)
      faraday-follow_redirects (~> 0.4)
      faraday-gzip (~> 3.0)
      faraday-retry (~> 2.2)
      feedjira (>= 3.2, < 5.0)
      jsbundling-rails (~> 1.3)
      nokolexbor (~> 0.5)
      rails (>= 8.0.3, < 10.0)
      ransack (~> 4.2)
      ruby-readability (~> 0.7)
      solid_cable (>= 3.0, < 4.0)
      solid_queue (>= 0.3, < 3.0)
      turbo-rails (~> 2.0)

GEM
  remote: https://rubygems.org/
  specs:
    action_text-trix (2.1.16)
      railties
    actioncable (8.1.2)
      actionpack (= 8.1.2)
      activesupport (= 8.1.2)
      nio4r (~> 2.0)
      websocket-driver (>= 0.6.1)
      zeitwerk (~> 2.6)
phase: "02"
plan: "01"
title: "Error Categorization and Blocked Feed Detection"
wave: 1
depends_on: []
must_haves:
Add structured error categorization to the fetch pipeline. Introduce BlockedError for Cloudflare/login wall detection by sniffing HTML response bodies before passing to Feedjira. Add error_category column to FetchLog for coarse filtering. Update RetryPolicy to handle blocked errors appropriately.
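The body-sniffing step summarized above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the `FetchError` and `BlockedError` classes here are simplified stand-ins, the regex constants are hypothetical names, and the helper takes a plain `content_type` string where the plan passes the full `response` object. The markers and their order follow the plan's detection list.

```ruby
# Hypothetical stand-ins for the project's error classes. The real FetchError
# also carries original_error, response, code, and http_status.
class FetchError < StandardError; end

class BlockedError < FetchError
  CODE = "blocked"
  attr_reader :blocked_by

  def initialize(message = nil, blocked_by: "unknown")
    @blocked_by = blocked_by
    super(message || "Feed blocked by #{blocked_by}")
  end
end

# Marker patterns taken from the plan; the constant names are assumptions.
CLOUDFLARE_MARKERS = [
  %r{<title>\s*(Just a moment|Attention Required)}i,
  /cf-challenge|cf-browser-verification/,
  /__cf_chl_/,
  /data-ray=/
].freeze
LOGIN_TITLE = %r{<title>\s*(Log in|Sign in)\s*</title>}i
HTML_START = /\A\s*(<!DOCTYPE html|<html)/i
PASSWORD_FORM = /<form\b.*?<input\b[^>]*type=["']?password/im
CAPTCHA_MARKERS = /g-recaptcha|h-captcha/

# Returns a blocked_by string ("cloudflare", "login_wall", "captcha")
# or nil when the body does not look like a block page.
def detect_blocked_response(body, content_type)
  return nil if body.nil? || body.empty?
  return "cloudflare" if CLOUDFLARE_MARKERS.any? { |re| body.match?(re) }
  return "login_wall" if body.match?(LOGIN_TITLE)

  html_page = content_type.to_s.include?("text/html") && body.match?(HTML_START)
  return "login_wall" if html_page && body.match?(PASSWORD_FORM)
  return "captcha" if body.match?(CAPTCHA_MARKERS)

  nil
end

# Detection runs first; only unblocked bodies would reach the feed parser.
def parse_feed(body, content_type)
  blocked_by = detect_blocked_response(body, content_type)
  raise BlockedError.new(blocked_by: blocked_by) if blocked_by

  # In the real pipeline, Feedjira.parse(body) would run here.
  body
end
```

Because the markers are matched anywhere in the body, a legitimate feed whose content happens to quote one of them would also be flagged; the real implementation may want to restrict the checks to bodies that look like HTML.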
Files to modify:
- `lib/source_monitor/fetching/fetch_error.rb`

Steps:
- Add `BlockedError < FetchError` with `CODE = "blocked"` after the existing `ParsingError` class
- Add `AuthenticationError < FetchError` with `CODE = "authentication"` for 401/403 responses that are NOT blocked pages
- `BlockedError` takes a `blocked_by` keyword (e.g., "cloudflare", "login_wall", "captcha", "unknown") stored as an attribute

Files to modify:
- `lib/source_monitor/fetching/feed_fetcher.rb`

Steps:
- Add `detect_blocked_response(body, response)` that checks the response body BEFORE calling `Feedjira.parse`
- `<title>Just a moment</title>` or `<title>Attention Required</title>`
- `cf-challenge` or `cf-browser-verification` in body
- `__cf_chl_` in body
- `data-ray=` attribute (Cloudflare Ray ID)
- `<title>Log in</title>` or `<title>Sign in</title>` (case-insensitive)
- `text/html` content-type AND body starts with `<!DOCTYPE html` or `<html` AND contains `<form` with password input
- `g-recaptcha` or `h-captcha` in body
- Update `parse_feed(body, response)` to call `detect_blocked_response(body, response)` first. If detection returns a `blocked_by` value, raise `BlockedError` with that value instead of attempting the Feedjira parse

Files to create:
- `db/migrate/TIMESTAMP_add_error_category_to_fetch_logs.rb`

Files to modify:
- `app/models/source_monitor/fetch_log.rb`

Steps:
- Add `error_category` string column to `sourcemon_fetch_logs` (nullable, no default)
- Add an index on `error_category` for filtering
- Add `validates :error_category, inclusion: { in: %w[network parse blocked auth unknown], allow_nil: true }`
- Add `scope :by_category, ->(category) { where(error_category: category) }`
- Add `error_category` to `ransackable_attributes` if the method exists on `FetchLog`

Files to modify:
- `lib/source_monitor/fetching/feed_fetcher/source_updater.rb`

Steps:
- Add `ERROR_CATEGORY_MAP` that maps error classes to category strings:
  - `TimeoutError` -> "network"
  - `ConnectionError` -> "network"
  - `HTTPError` -> categorize by status: 401/403 -> "auth", others -> "network"
  - `ParsingError` -> "parse"
  - `BlockedError` -> "blocked"
  - `AuthenticationError` -> "auth"
  - `UnexpectedResponseError` -> "unknown"
  - `FetchError` (base) -> "unknown"
- Add `categorize_error(error)` that uses the map
- Update `create_fetch_log` to include `error_category: categorize_error(error)` when error is present

Files to modify:
- `lib/source_monitor/fetching/retry_policy.rb`

Steps:
- Add `:blocked` key to `DEFAULTS`: `blocked: { attempts: 1, wait: 1.hour, circuit_wait: 4.hours }`. Blocked feeds are unlikely to resolve quickly, so the circuit break is aggressive
- Update the `policy_key` method to return `:blocked` for `BlockedError`
- Add `:authentication` key: `authentication: { attempts: 1, wait: 1.hour, circuit_wait: 4.hours }`. Auth failures are also unlikely to self-resolve
- Update `policy_key` to return `:authentication` for `AuthenticationError`

Files to modify:
- `lib/source_monitor/fetching/feed_fetcher.rb`

Steps:
- In `perform_fetch`, add `BlockedError` and `AuthenticationError` to the list of errors that are re-raised directly (line 81): `rescue TimeoutError, ConnectionError, HTTPError, ParsingError, BlockedError, AuthenticationError => error`

Files to create:
- `test/lib/source_monitor/fetching/blocked_error_test.rb`
- `test/lib/source_monitor/fetching/html_detection_test.rb`

Files to modify:
- `test/lib/source_monitor/fetching/feed_fetcher_test.rb`
- `test/lib/source_monitor/fetching/retry_policy_test.rb`
- `test/lib/source_monitor/fetching/source_updater_test.rb` (or wherever source_updater tests live)
- `test/models/source_monitor/fetch_log_test.rb`

Steps:
- `BlockedError`, NOT `ParsingError`
- `FetchLog` records include `error_category` for failed fetches
- `RetryPolicy` applies aggressive circuit break (4h) for blocked feeds
- `bin/rubocop` passes with zero offenses
- `bin/rails test` passes

- `FetchRunner` (`lib/source_monitor/fetching/fetch_runner.rb`) coordinates fetches with PG advisory locks
- `FeedFetcher` (`lib/source_monitor/fetching/feed_fetcher.rb`) performs HTTP request, parses with Feedjira, processes entries
- `AdvisoryLock` (`lib/source_monitor/fetching/advisory_lock.rb`) wraps `pg_try_advisory_lock`: non-blocking, raises `NotAcquiredError` immediately
- `FetchRunner` catches `NotAcquiredError` and re-raises as `ConcurrencyError`
- `FetchFeedJob` (`app/jobs/source_monitor/fetch_feed_job.rb`) retries `ConcurrencyError` 5 times with 30s wait; this is the problematic path for force-fetch

- `SourceRetriesController#create` → `FetchRunner.enqueue(source_id, force: true)`
- `enqueue` sets `fetch_status: "queued"` and enqueues `FetchFeedJob.perform_later(source.id, force: true)`
- `FetchFeedJob#perform` passes `force: true` to `FetchRunner.new`, but the force flag only affects the circuit breaker check (`skip_due_to_circuit`), not lock behavior
- `ConcurrencyError` triggers `retry_on` (5 attempts, 30s each); user sees "failed" after 2.5 min of retries

- `FetchError` base class with `original_error`, `response`, `code`, `http_status`
- `TimeoutError`, `ConnectionError`, `HTTPError`, `ParsingError`, `UnexpectedResponseError`
- `CODE` constant (e.g., "timeout", "connection", "parsing")

- `parse_feed` → `Feedjira.parse` fails → `ParsingError` (misleading)
- `FeedFetcher#parse_feed` (line 212-216): calls `Feedjira.parse(body)` with no content-type or body inspection
- `ParsingError`
- `<title>Just a moment</title>`, `cf-challenge`, etc.)

- `SourceHealthMonitor` (`lib/source_monitor/health/source_health_monitor.rb`) uses rolling success rate from recent `fetch_logs`
- `healthy`, `warning`, `critical`, `declining`, `improving`, `auto_paused`
- `health_status`, `health_status_changed_at`, `rolling_success_rate`, `auto_paused_until`, `auto_paused_at`

- `FetchLog` model with `success`, `items_created`, `items_updated`, `items_failed`, `http_response_headers`
- `SourceUpdater#create_fetch_log`: stores response details, duration, error info

- `SourceMonitor::HTTP.client` uses Faraday with retry middleware (4 retries by default)
- `"Mozilla/5.0 (compatible; SourceMonitor/#{VERSION})"`: already browser-like from Phase 1 decision
- `source.custom_headers`
- `If-None-Match` (etag), `If-Modified-Since` (last_modified)

- `RetryPolicy` maps error types to retry configs (attempts, wait, circuit_wait)
- `FetchError` for new error types (add `BlockedError`)
- `update_columns` for status updates (matches existing pattern)
- `consecutive_failures` helper already (line 183); can leverage for auto-pause by count
- `after_save :sync_log_entry` creates unified `LogEntry`; new fields propagate
- `fetch_circuit_opened_at`, `fetch_circuit_until`

- `text/html` content type. Must check body content, not just content-type header.
- `consecutive_fetch_failures` counter to `Source`, increment on failure, reset on success; simpler than changing rate-based system.

- `BlockedError < FetchError` with `CODE="blocked"`
- `FeedFetcher#parse_feed`: before Feedjira, check for CF markers
- `error_category` enum to `FetchLog` (network, parse, blocked, auth, unknown)
- `SourceUpdater#create_fetch_log` based on error class
- `FetchFeedJob`: when `force: true`, don't `retry_on ConcurrencyError`; rescue and return with user-facing message
- `FetchRunner.enqueue` or add check: if `source.fetch_status == "fetching"`, skip with message
- `Cache-Control: no-cache` header
- `consecutive_fetch_failures` integer to `Source` (migration)
- `auto_paused_until`, `auto_paused_at`)
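
The error-class-to-category mapping that the plan and the notes above call for (`error_category` set in `SourceUpdater#create_fetch_log` based on error class) might look roughly like the sketch below. It is an assumption-laden illustration: the `FetchingSketch` module, the simplified error classes, and the `AUTH_STATUSES` constant are hypothetical, and the real `FetchError` hierarchy carries more state than is shown here.

```ruby
# Hypothetical namespace so the stand-in classes do not collide with
# any real constants; the project's classes live elsewhere.
module FetchingSketch
  # Simplified base error: only http_status is modeled here.
  class FetchError < StandardError
    attr_reader :http_status

    def initialize(message = nil, http_status: nil)
      super(message)
      @http_status = http_status
    end
  end

  class TimeoutError < FetchError; end
  class ConnectionError < FetchError; end
  class HTTPError < FetchError; end
  class ParsingError < FetchError; end
  class BlockedError < FetchError; end
  class AuthenticationError < FetchError; end
  class UnexpectedResponseError < FetchError; end

  # Category strings match the plan's allowed values:
  # network, parse, blocked, auth, unknown.
  ERROR_CATEGORY_MAP = {
    TimeoutError => "network",
    ConnectionError => "network",
    HTTPError => "network",
    ParsingError => "parse",
    BlockedError => "blocked",
    AuthenticationError => "auth",
    UnexpectedResponseError => "unknown"
  }.freeze

  # Assumed constant name; the plan only says 401/403 map to "auth".
  AUTH_STATUSES = [401, 403].freeze

  # Returns nil for a successful fetch (no error), otherwise a category string.
  def self.categorize_error(error)
    return nil if error.nil?
    return "auth" if error.is_a?(HTTPError) && AUTH_STATUSES.include?(error.http_status)

    match = ERROR_CATEGORY_MAP.find { |klass, _category| error.is_a?(klass) }
    match ? match.last : "unknown"
  end
end
```

Falling back to "unknown" covers the base `FetchError` and any error class not in the map, which keeps the value inside the inclusion list the plan validates on `FetchLog`.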