Research: Fetch Throughput & Small Server Defaults

.vbw-planning/milestones/polish-and-reliability/phases/06-fetch-throughput-defaults/06-RESEARCH.md

Source: Debug Investigation (3 parallel debuggers, all HIGH confidence)

Finding 1: Queue Saturation (Debugger H1)

Current state:

  • fetch_queue_concurrency defaults to 2 (lib/source_monitor/configuration.rb:40)
  • DEFAULT_BATCH_SIZE = 100 in Scheduler (lib/source_monitor/scheduler.rb:8)
  • ScheduleFetchesJob runs every minute (test/dummy/config/recurring.yml)
  • Each fetch is I/O-bound: 5-15s per request (15s HTTP timeout, 5s open timeout)
  • ALL job types share the same fetch queue: FetchFeedJob, ScheduleFetchesJob, SourceHealthCheckJob, ImportOpmlJob, FaviconFetchJob, DownloadContentImagesJob, LogCleanupJob, ItemCleanupJob
  • Solid Queue's limits_concurrency is not used; the only concurrency control is a per-source advisory lock
  • Advisory lock contention causes a 30-second retry wait (fetch_feed_job.rb:5,11-13)

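Math: With concurrency=2 and ~2s average fetch time, throughput is about 2 × (60 / 2) = 60 jobs/min. The scheduler can enqueue up to 100 jobs per one-minute cycle, so whenever a full batch is due the backlog grows by roughly 40 jobs/min. At the 5-15s per-request range noted above, throughput drops to about 8-24 jobs/min and the backlog grows faster.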
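
The arithmetic above can be reproduced with a small back-of-the-envelope model. This is only a sketch: the method name `jobs_per_minute` and the constant `BATCH_SIZE_PER_CYCLE` are hypothetical, and the model assumes a full batch of 100 sources is due every cycle and ignores advisory lock retries and other job types sharing the queue.

```ruby
# Back-of-the-envelope throughput model for a single fetch queue.
# Numbers come from the findings above; names are illustrative only.
BATCH_SIZE_PER_CYCLE = 100 # jobs enqueued per one-minute scheduler cycle (assumes a full batch)

# Jobs a queue can complete per minute when every worker thread is busy.
def jobs_per_minute(concurrency:, avg_fetch_seconds:)
  concurrency * (60.0 / avg_fetch_seconds)
end

[2, 5, 15].each do |seconds|
  capacity = jobs_per_minute(concurrency: 2, avg_fetch_seconds: seconds)
  growth   = BATCH_SIZE_PER_CYCLE - capacity
  puts format("avg fetch %2ds: capacity %5.1f jobs/min, backlog growth %5.1f jobs/min",
              seconds, capacity, growth)
end
```

Under these assumptions the backlog grows at every fetch duration in the 2-15s range, which matches the saturation described in this finding.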

Finding 2: Thundering Herd (Debugger H2)

Current state:

  • ImportOpmlJob#build_attributes does NOT set next_fetch_at -- all imported sources start as NULL (immediately due)
  • SourcesController#create also has no next_fetch_at initialization
  • Scheduler treats NULL as immediately due (table[:next_fetch_at].eq(nil).or(table[:next_fetch_at].lteq(now)))
  • Fixed-interval path has ZERO jitter: Time.current + fixed_minutes.minutes exactly
  • Adaptive jitter is ±10% (JITTER_PERCENT = 0.1) but insufficient when base times are nearly identical
  • Scheduler enqueues all 100 due sources in a tight loop with no delay between them
  • No queue-aware scheduling: adaptive interval never checks how many other sources are already scheduled
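
To illustrate the zero-jitter problem on the fixed-interval path, here is a minimal sketch of symmetric jitter applied to a fixed interval. The helper `jittered_next_fetch_at` and its parameters are hypothetical (not the gem's API); the 0.1 default mirrors JITTER_PERCENT from the finding, and plain `Time` arithmetic stands in for `Time.current + fixed_minutes.minutes`.

```ruby
# Hypothetical helper: add symmetric jitter to a fixed fetch interval.
# Not part of the gem; jitter_percent mirrors JITTER_PERCENT = 0.1.
def jittered_next_fetch_at(now:, fixed_minutes:, jitter_percent: 0.1, rng: Random.new)
  base_seconds = fixed_minutes * 60
  spread       = base_seconds * jitter_percent
  offset       = rng.rand(-spread..spread) # uniform float in [-spread, +spread]
  now + base_seconds + offset
end

now = Time.now
# Five sources sharing the same 60-minute interval no longer land on the same instant.
5.times do
  next_at = jittered_next_fetch_at(now: now, fixed_minutes: 60)
  puts((next_at - now).round) # seconds until next fetch, between 3240 and 3960
end
```

Jitter of this kind only spreads sources relative to their interval length, so it does not help sources whose next_fetch_at is NULL; those still need an explicit initial stagger.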

Finding 3: Stale Processing Status (Debugger H3)

Current state:

  • update_source_state! (fetch_runner.rb:83-91) rescues ALL StandardError including DB update failures
  • No ensure block guarantees fetch_status reset from "fetching" to "idle"/"failed"
  • FollowUpHandler#call has no error handling -- exceptions propagate past mark_complete!
  • StalledFetchReconciler recovers after 10 minutes (STALE_QUEUE_TIMEOUT = 10.minutes)
  • User's screenshot showed sources 9-10 minutes overdue -- exactly at the reconciler threshold

Findings

Existing Configuration Hooks (host app can already override)

  • config.fetch_queue_concurrency -- defaults to 2
  • config.fetch_queue_name / config.scrape_queue_name -- queue names
  • ENV["SOURCE_MONITOR_FETCH_CONCURRENCY"] -- env var override in example config
  • config.fetching.adaptive_enabled -- toggle adaptive intervals
  • config.fetching.increase_factor / decrease_factor -- interval tuning
  • config.fetching.min_interval / max_interval -- interval bounds
  • config.http.timeout / config.http.open_timeout -- HTTP timeouts
  • config.http.max_retries -- retry count

Missing Configuration Hooks

  • No batch size configuration (hardcoded DEFAULT_BATCH_SIZE = 100)
  • No stale queue timeout configuration (hardcoded 10.minutes)
  • No jitter percentage configuration (hardcoded JITTER_PERCENT = 0.1)
  • No option to stagger initial fetch times on import

Relevant Patterns

  1. Configuration DSL pattern: All settings use SourceMonitor.configure { |config| ... } -- new knobs should follow the same pattern via settings sub-objects
  2. FetchingSettings (lib/source_monitor/configuration/fetching_settings.rb): Already has adaptive interval knobs; batch size and jitter should live here
  3. SchedulerSettings: Does not exist yet -- scheduler has hardcoded constants
  4. AdvisoryLock pattern: Per-source locking prevents duplicate fetches but doesn't help with throughput

Risks

  1. Memory on 1-CPU/2GB: Setting concurrency too high will exhaust memory. Each Solid Queue worker thread holds a DB connection. Sweet spot is likely 3-5 for this hardware.
  2. Connection pool: More concurrent workers need larger DB connection pool. Default pool is usually 5 -- must coordinate with Solid Queue config.
  3. Backward compatibility: Changing defaults could affect existing host apps. All changes should be opt-in or conservative new defaults.
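
The connection pool risk can be checked with simple arithmetic. The sketch below assumes Solid Queue's documented guidance that worker threads should not exceed the queue database's pool size minus 2 (two connections reserved for polling and heartbeat). The helper names are hypothetical, and other consumers of the same pool (for example web server threads) are not modeled.

```ruby
# Rough sizing check between worker threads and the DB connection pool.
# RESERVED_CONNECTIONS = 2 is an assumption based on Solid Queue's guidance
# (polling + heartbeat); helper names are illustrative only.
RESERVED_CONNECTIONS = 2

# Largest thread count a worker can safely use with the given pool size.
def max_worker_threads(pool_size)
  [pool_size - RESERVED_CONNECTIONS, 0].max
end

# Smallest pool size that accommodates the given worker thread count.
def pool_size_for(threads)
  threads + RESERVED_CONNECTIONS
end

puts max_worker_threads(5) # a pool of 5 leaves room for 3 worker threads
puts pool_size_for(5)      # 5 worker threads need a pool of at least 7
```

By this estimate, the upper end of the 3-5 concurrency range would not fit a pool of 5 without also raising the pool size.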

Recommendations

Priority 1: Fix error handling (correctness)

  • Split rescue in update_source_state! to only catch broadcast errors
  • Add ensure block in FetchRunner#run for status safety net
  • Add rescue in FollowUpHandler#call
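
A minimal sketch of the ensure-based safety net follows. It does not use the gem's classes: `Source` is a stand-in struct for the real record, and `FetchRunnerSketch` and its `fetcher:` argument are hypothetical. The status strings "fetching", "idle", and "failed" are taken from the finding above.

```ruby
# Stand-in for a source record; the real model is a database-backed object.
Source = Struct.new(:fetch_status)

# Hypothetical runner showing a narrow rescue plus an ensure safety net.
class FetchRunnerSketch
  def initialize(source, fetcher:)
    @source = source
    @fetcher = fetcher
  end

  def run
    @source.fetch_status = "fetching"
    @fetcher.call(@source)
    @source.fetch_status = "idle"
  rescue StandardError
    @source.fetch_status = "failed"
    raise # let the job layer see the error instead of swallowing it
  ensure
    # Safety net: whatever happened above, never leave the source in "fetching".
    @source.fetch_status = "failed" if @source.fetch_status == "fetching"
  end
end

source = Source.new("idle")
begin
  FetchRunnerSketch.new(source, fetcher: ->(_source) { raise "boom" }).run
rescue RuntimeError
  # the error still propagates for retry handling
end
puts source.fetch_status # prints "failed"
```

The ensure branch matters for exceptions outside StandardError, where the rescue clause never runs. In the real runner the status write is itself a DB update that can fail, so StalledFetchReconciler would remain the last line of defense.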

Priority 2: Add scheduling jitter/stagger (throughput)

  • Add jitter to fixed-interval path
  • Stagger next_fetch_at during OPML import
  • Make JITTER_PERCENT configurable via FetchingSettings
  • Stagger job enqueuing in Scheduler (spread across the minute window)
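
Spreading a batch across the minute window could look like the sketch below. The helper `stagger_offsets` is hypothetical and evenly spaced delays are only one possible policy. The `set(wait:)` call in the comment is the standard Active Job mechanism for delayed enqueuing, though the exact arguments FetchFeedJob takes are not confirmed here.

```ruby
# Hypothetical helper: evenly spaced delays (in seconds) for a batch of jobs,
# so a batch is spread over the scheduler window instead of enqueued at once.
def stagger_offsets(count, window_seconds: 60)
  return [] if count <= 0

  step = window_seconds.to_f / count
  Array.new(count) { |index| (index * step).round(2) }
end

offsets = stagger_offsets(4)
p offsets # prints [0.0, 15.0, 30.0, 45.0]

# In the scheduler, each offset could become the job's delay, for example:
#   FetchFeedJob.set(wait: offset.seconds).perform_later(source_id)
# (argument shape assumed; requires Active Job and Active Support)
```

The same offsets could seed next_fetch_at during OPML import by adding each one to the import time, so imported sources do not all start as immediately due.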

Priority 3: Optimize defaults for small servers (configuration)

  • Keep default fetch_queue_concurrency at 2 -- it's actually appropriate for 1-CPU/2GB
  • Lower DEFAULT_BATCH_SIZE from 100 to 25 and make configurable
  • Lower STALE_QUEUE_TIMEOUT from 10 to 5 minutes
  • Separate utility jobs (cleanup, favicon) from fetch queue

Priority 4: Document scaling guidance

  • Add comments/docs showing how to tune for different server sizes
  • Provide example configurations for small/medium/large deployments