Research: Fetch Throughput & Small Server Defaults

.vbw-planning/milestones/polish-and-reliability/phases/06-fetch-throughput-defaults/.context-lead.md

Phase 06 Context (Compiled)

Goal

Not available

Success Criteria

Not available

Requirements (Not available)

No matching requirements found

Active Decisions

None

Codebase Map Available

Codebase mapping exists in .vbw-planning/codebase/. Key files:

  • ARCHITECTURE.md
  • CONCERNS.md
  • PATTERNS.md
  • DEPENDENCIES.md
  • STRUCTURE.md
  • CONVENTIONS.md
  • TESTING.md
  • STACK.md

Read ARCHITECTURE.md, CONCERNS.md, and STRUCTURE.md first to bootstrap codebase understanding.

Research Findings

Source: Debug Investigation (3 parallel debuggers, all HIGH confidence)

Finding 1: Queue Saturation (Debugger H1)

Current state:

  • fetch_queue_concurrency defaults to 2 (lib/source_monitor/configuration.rb:40)
  • DEFAULT_BATCH_SIZE = 100 in Scheduler (lib/source_monitor/scheduler.rb:8)
  • ScheduleFetchesJob runs every minute (test/dummy/config/recurring.yml)
  • Each fetch is I/O-bound: 5-15s per request (15s HTTP timeout, 5s open timeout)
  • ALL job types share the same fetch queue: FetchFeedJob, ScheduleFetchesJob, SourceHealthCheckJob, ImportOpmlJob, FaviconFetchJob, DownloadContentImagesJob, LogCleanupJob, ItemCleanupJob
  • Solid Queue's limits_concurrency is not used; the only concurrency control is a per-source advisory lock
  • Advisory lock contention causes a 30-second retry wait (fetch_feed_job.rb:5,11-13)

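Math: With concurrency=2 and ~2s average fetch time, throughput is ~60 jobs/min (2 workers × 30 fetches/min each). With up to 100 jobs enqueued per one-minute scheduler cycle, the backlog can grow by ~40 jobs/min for as long as that many sources stay due. At the 5-15s per-request range noted above, throughput drops to roughly 8-24 jobs/min and the backlog grows faster.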
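The arithmetic can be checked with a small back-of-envelope model. This is a sketch, not code from the gem: the method names are illustrative, and the inputs (concurrency 2, 100 jobs per one-minute cycle, 2s/5s/15s fetch times) are the figures quoted in this finding.

```ruby
# Back-of-envelope throughput model for the fetch queue.
# Method names are illustrative; the inputs are the figures quoted in Finding 1.

# Jobs a pool of workers can finish per minute at a given average job duration.
def throughput_per_minute(concurrency:, avg_seconds:)
  concurrency * (60.0 / avg_seconds)
end

# Net backlog change per one-minute scheduler cycle (positive means the queue grows).
def backlog_growth_per_minute(enqueued_per_minute:, concurrency:, avg_seconds:)
  enqueued_per_minute - throughput_per_minute(concurrency: concurrency, avg_seconds: avg_seconds)
end

[2, 5, 15].each do |avg_seconds|
  rate = throughput_per_minute(concurrency: 2, avg_seconds: avg_seconds)
  growth = backlog_growth_per_minute(enqueued_per_minute: 100, concurrency: 2, avg_seconds: avg_seconds)
  puts format("avg %2ds: %5.1f jobs/min processed, backlog %+6.1f jobs/min", avg_seconds, rate, growth)
end
```

The model ignores advisory-lock retries and the non-fetch jobs sharing the queue, both of which would lower effective throughput further.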

Finding 2: Thundering Herd (Debugger H2)

Current state:

  • ImportOpmlJob#build_attributes does NOT set next_fetch_at -- all imported sources start as NULL (immediately due)
  • SourcesController#create also has no next_fetch_at initialization
  • Scheduler treats NULL as immediately due (table[:next_fetch_at].eq(nil).or(table[:next_fetch_at].lteq(now)))
  • Fixed-interval path has ZERO jitter: Time.current + fixed_minutes.minutes exactly
  • Adaptive jitter is ±10% (JITTER_PERCENT = 0.1) but insufficient when base times are nearly identical
  • Scheduler enqueues all 100 due sources in a tight loop with no delay between them
  • No queue-aware scheduling: adaptive interval never checks how many other sources are already scheduled
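
To illustrate the zero-jitter problem on the fixed-interval path, here is a minimal sketch of a jittered variant. The helper names and the injectable random source are assumptions for illustration; only JITTER_PERCENT = 0.1 and the `Time.current + fixed_minutes.minutes` behaviour come from the investigation. Plain `Time` arithmetic is used so the snippet runs without Rails.

```ruby
# Hypothetical helpers: spread a fixed interval by +/- jitter_percent.
# JITTER_PERCENT mirrors the value quoted above; everything else is illustrative.
JITTER_PERCENT = 0.1

# Returns base_seconds shifted by a uniformly random offset within +/- jitter_percent.
def jittered_interval_seconds(base_seconds, jitter_percent: JITTER_PERCENT, random: Random.new)
  spread = base_seconds * jitter_percent
  base_seconds + random.rand(-spread..spread)
end

# Fixed-interval next fetch time with jitter applied, instead of an exact offset.
def next_fetch_at(now, fixed_minutes, random: Random.new)
  now + jittered_interval_seconds(fixed_minutes * 60, random: random)
end

now = Time.now
puts next_fetch_at(now, 60) # somewhere between 54 and 66 minutes from now
```

Percentage jitter alone does not separate sources that share the same base time and a short interval, which is why the import path likely needs an explicit stagger as well.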

Finding 3: Stale Processing Status (Debugger H3)

Current state:

  • update_source_state! (fetch_runner.rb:83-91) rescues ALL StandardError including DB update failures
  • No ensure block guarantees fetch_status reset from "fetching" to "idle"/"failed"
  • FollowUpHandler#call has no error handling -- exceptions propagate past mark_complete!
  • StalledFetchReconciler recovers after 10 minutes (STALE_QUEUE_TIMEOUT = 10.minutes)
  • User's screenshot showed sources 9-10 minutes overdue -- exactly at the reconciler threshold

Findings

Existing Configuration Hooks (host app can already override)

  • config.fetch_queue_concurrency -- defaults to 2
  • config.fetch_queue_name / config.scrape_queue_name -- queue names
  • ENV["SOURCE_MONITOR_FETCH_CONCURRENCY"] -- env var override in example config
  • config.fetching.adaptive_enabled -- toggle adaptive intervals
  • config.fetching.increase_factor / decrease_factor -- interval tuning
  • config.fetching.min_interval / max_interval -- interval bounds
  • config.http.timeout / config.http.open_timeout -- HTTP timeouts
  • config.http.max_retries -- retry count

Missing Configuration Hooks

  • No batch size configuration (hardcoded DEFAULT_BATCH_SIZE = 100)
  • No stale queue timeout configuration (hardcoded 10.minutes)
  • No jitter percentage configuration (hardcoded JITTER_PERCENT = 0.1)
  • No option to stagger initial fetch times on import

Relevant Patterns

  1. Configuration DSL pattern: All settings use SourceMonitor.configure { |config| ... } -- new knobs should follow the same pattern via settings sub-objects
  2. FetchingSettings (lib/source_monitor/configuration/fetching_settings.rb): Already has adaptive interval knobs; batch size and jitter should live here
  3. SchedulerSettings: Does not exist yet -- scheduler has hardcoded constants
  4. AdvisoryLock pattern: Per-source locking prevents duplicate fetches but doesn't help with throughput

Risks

  1. Memory on 1-CPU/2GB: Raising concurrency too far can exhaust memory. Each Solid Queue worker thread holds a DB connection. The sweet spot is likely 3-5 for this hardware.
  2. Connection pool: More concurrent workers need larger DB connection pool. Default pool is usually 5 -- must coordinate with Solid Queue config.
  3. Backward compatibility: Changing defaults could affect existing host apps. All changes should be opt-in or conservative new defaults.

Recommendations

Priority 1: Fix error handling (correctness)

  • Split rescue in update_source_state! to only catch broadcast errors
  • Add ensure block in FetchRunner#run for status safety net
  • Add rescue in FollowUpHandler#call
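
A minimal sketch of the ensure safety net, under stated assumptions: `Source` and `SketchFetchRunner` are stand-ins written for this example and are not the real model or FetchRunner. Only the status strings ("fetching", "idle", "failed") come from the findings.

```ruby
# Minimal stand-in for a source record; the real model differs.
Source = Struct.new(:fetch_status)

# Illustrative runner, not the real FetchRunner. The ensure block resets the
# status if anything raised before the normal completion path could update it.
class SketchFetchRunner
  def initialize(source)
    @source = source
  end

  def run
    @source.fetch_status = "fetching"
    yield @source
    @source.fetch_status = "idle"
  ensure
    # Safety net: never leave the source stuck in "fetching".
    @source.fetch_status = "failed" if @source.fetch_status == "fetching"
  end
end

source = Source.new("idle")
begin
  SketchFetchRunner.new(source).run { raise "feed unreachable" }
rescue RuntimeError
  # The error still propagates to the caller; only the status is repaired.
end
puts source.fetch_status
```

The guard on `"fetching"` means the ensure block leaves a status alone when the normal path already set it, so it only acts when an exception skipped the regular update.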

Priority 2: Add scheduling jitter/stagger (throughput)

  • Add jitter to fixed-interval path
  • Stagger next_fetch_at during OPML import
  • Make JITTER_PERCENT configurable via FetchingSettings
  • Stagger job enqueuing in Scheduler (spread across the minute window)
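
The stagger items can be sketched with one helper that produces evenly spaced offsets. The helper name and the 60-second window are assumptions for illustration; the 100-source figure is the current batch size.

```ruby
# Hypothetical helper: evenly spaced offsets across a window, so that `count`
# items do not all become due (or get enqueued) at the same instant.
def stagger_offsets(count, window_seconds:)
  return [] if count <= 0

  step = window_seconds.to_f / count
  Array.new(count) { |index| index * step }
end

# Example: spread 100 imported sources across a 60-second scheduler window.
now = Time.now
first_fetch_times = stagger_offsets(100, window_seconds: 60).map { |offset| now + offset }
puts first_fetch_times.first
puts first_fetch_times.last
```

The same offsets could serve both cases: as initial next_fetch_at values during OPML import, or as per-job delays passed through Active Job's `set(wait: ...)` when the Scheduler enqueues a batch.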

Priority 3: Optimize defaults for small servers (configuration)

  • Keep default fetch_queue_concurrency at 2 -- it is appropriate for 1-CPU/2GB
  • Lower DEFAULT_BATCH_SIZE from 100 to 25 and make configurable
  • Lower STALE_QUEUE_TIMEOUT from 10 to 5 minutes
  • Separate utility jobs (cleanup, favicon) from fetch queue

Priority 4: Document scaling guidance

  • Add comments/docs showing how to tune for different server sizes
  • Provide example configurations for small/medium/large deployments