Research: Fetch Throughput & Small Server Defaults

Source: Debug Investigation (3 parallel debuggers, all HIGH confidence)

Finding 1: Queue Saturation (Debugger H1)

Current state:

fetch_queue_concurrency defaults to 2 (lib/source_monitor/configuration.rb:40)
DEFAULT_BATCH_SIZE = 100 in Scheduler (lib/source_monitor/scheduler.rb:8)
ScheduleFetchesJob runs every minute (test/dummy/config/recurring.yml)
Each fetch is I/O-bound: a slow or unresponsive feed can hold a worker for 5-15s (15s HTTP timeout, 5s open timeout)
ALL job types share the same fetch queue: FetchFeedJob, ScheduleFetchesJob, SourceHealthCheckJob, ImportOpmlJob, FaviconFetchJob, DownloadContentImagesJob, LogCleanupJob, ItemCleanupJob
Solid Queue's limits_concurrency is not used; the only concurrency control is a per-source advisory lock
Advisory lock contention causes a 30-second retry wait (fetch_feed_job.rb:5,11-13)

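Math: With concurrency=2 and ~2s avg fetch time, throughput is 2 x (60s / 2s) = ~60 jobs/min. The scheduler can enqueue up to 100 jobs every minute, so whenever a full batch is due the backlog grows by up to ~40 jobs/min, and faster still when fetches approach the 5-15s timeouts.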
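The arithmetic above can be checked with a short sketch. The helper names (throughput_per_minute, backlog_growth_per_minute) are illustrative only and do not exist in the codebase; the inputs are the figures quoted in this finding, and the model assumes every worker thread stays busy.

```ruby
# Jobs the worker pool can finish per minute, assuming every job takes
# avg_fetch_seconds and all threads stay busy (an idealized model).
def throughput_per_minute(concurrency:, avg_fetch_seconds:)
  concurrency * (60.0 / avg_fetch_seconds)
end

# Net jobs added to the backlog each minute when the scheduler enqueues a
# full batch every cycle. Clamped at zero: a drained queue does not go negative.
def backlog_growth_per_minute(batch_size:, concurrency:, avg_fetch_seconds:)
  drained = throughput_per_minute(concurrency: concurrency, avg_fetch_seconds: avg_fetch_seconds)
  [batch_size - drained, 0.0].max
end

# Compare the ~2s average with the 5s and 15s timeout cases.
[2, 5, 15].each do |seconds|
  rate = throughput_per_minute(concurrency: 2, avg_fetch_seconds: seconds)
  growth = backlog_growth_per_minute(batch_size: 100, concurrency: 2, avg_fetch_seconds: seconds)
  puts format("%2ds avg fetch: %5.1f jobs/min drained, backlog +%.1f jobs/min", seconds, rate, growth)
end
```

Under this model a full batch of 100 outpaces two workers even at a 2s average, and the gap widens sharply as fetch times approach the timeouts.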

Finding 2: Thundering Herd (Debugger H2)

Current state:

ImportOpmlJob#build_attributes does NOT set next_fetch_at -- all imported sources start as NULL (immediately due)
SourcesController#create also has no next_fetch_at initialization
Scheduler treats NULL as immediately due (table[:next_fetch_at].eq(nil).or(table[:next_fetch_at].lteq(now)))
Fixed-interval path has ZERO jitter: Time.current + fixed_minutes.minutes exactly
Adaptive jitter is ±10% (JITTER_PERCENT = 0.1) but insufficient when base times are nearly identical
Scheduler enqueues all 100 due sources in a tight loop with no delay between them
No queue-aware scheduling: adaptive interval never checks how many other sources are already scheduled
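As a rough illustration of what jitter on the fixed-interval path might look like, here is a minimal sketch. The helper jittered_interval_seconds is hypothetical and not part of the codebase; it reuses the existing ±10% figure and uses plain Ruby Time rather than Time.current so it runs outside Rails.

```ruby
# Hypothetical helper: delay in seconds until the next fetch for a source on
# a fixed interval, randomized so that sources sharing the same interval do
# not all come due at the same instant.
# jitter_percent mirrors the existing JITTER_PERCENT = 0.1 (i.e. +/-10%).
def jittered_interval_seconds(fixed_minutes, jitter_percent: 0.1, rng: Random.new)
  base = fixed_minutes * 60.0
  spread = base * jitter_percent
  return base if spread.zero?

  # Random#rand with a Float range returns a Float within the bounds.
  base + rng.rand(-spread..spread)
end

rng = Random.new(42) # seeded only so the sketch is repeatable
delays = Array.new(5) { jittered_interval_seconds(60, rng: rng) }
next_fetch_times = delays.map { |delay| Time.now + delay }
puts next_fetch_times
```

For a 60-minute interval this spreads sources over a 12-minute window (54 to 66 minutes). It does not help sources whose next_fetch_at is NULL, since those are due immediately regardless.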

Finding 3: Stale Processing Status (Debugger H3)

Current state:

update_source_state! (fetch_runner.rb:83-91) rescues ALL StandardError including DB update failures
No ensure block guarantees that fetch_status is reset from "fetching" to "idle"/"failed"
FollowUpHandler#call has no error handling -- exceptions propagate past mark_complete!
StalledFetchReconciler recovers after 10 minutes (STALE_QUEUE_TIMEOUT = 10.minutes)
User's screenshot showed sources 9-10 minutes overdue -- exactly at the reconciler threshold

Findings

Existing Configuration Hooks (host app can already override)

config.fetch_queue_concurrency -- defaults to 2
config.fetch_queue_name / config.scrape_queue_name -- queue names
ENV["SOURCE_MONITOR_FETCH_CONCURRENCY"] -- env var override in example config
config.fetching.adaptive_enabled -- toggle adaptive intervals
config.fetching.increase_factor / decrease_factor -- interval tuning
config.fetching.min_interval / max_interval -- interval bounds
config.http.timeout / config.http.open_timeout -- HTTP timeouts
config.http.max_retries -- retry count

Missing Configuration Hooks

No batch size configuration (hardcoded DEFAULT_BATCH_SIZE = 100)
No stale queue timeout configuration (hardcoded 10.minutes)
No jitter percentage configuration (hardcoded JITTER_PERCENT = 0.1)
No option to stagger initial fetch times on import

Relevant Patterns

Configuration DSL pattern: All settings use SourceMonitor.configure { |config| ... } -- new knobs should follow the same pattern via settings sub-objects
FetchingSettings (lib/source_monitor/configuration/fetching_settings.rb): Already has adaptive interval knobs; batch size and jitter should live here
SchedulerSettings: Does not exist yet -- scheduler has hardcoded constants
AdvisoryLock pattern: Per-source locking prevents duplicate fetches but doesn't help with throughput
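To make the missing SchedulerSettings object concrete, here is one possible shape, written as plain Ruby so it runs on its own. Everything in it is an assumption: the class does not exist yet, the attribute names are placeholders, and the timeout is expressed in plain seconds rather than an ActiveSupport duration.

```ruby
# Hypothetical settings object; SchedulerSettings does not exist in the
# codebase yet. Defaults mirror the constants that are currently hardcoded.
class SchedulerSettings
  DEFAULT_BATCH_SIZE = 100           # same value as DEFAULT_BATCH_SIZE in Scheduler
  DEFAULT_STALE_QUEUE_TIMEOUT = 600  # seconds; same as STALE_QUEUE_TIMEOUT = 10.minutes

  attr_reader :batch_size, :stale_queue_timeout

  def initialize
    reset!
  end

  # Restores the defaults, which is handy between test cases.
  def reset!
    @batch_size = DEFAULT_BATCH_SIZE
    @stale_queue_timeout = DEFAULT_STALE_QUEUE_TIMEOUT
  end

  def batch_size=(value)
    @batch_size = positive_integer!(value, :batch_size)
  end

  def stale_queue_timeout=(value)
    @stale_queue_timeout = positive_integer!(value, :stale_queue_timeout)
  end

  private

  # Accepts integers and numeric strings (e.g. from ENV); rejects zero and negatives.
  def positive_integer!(value, name)
    integer = Integer(value)
    raise ArgumentError, "#{name} must be positive" unless integer.positive?

    integer
  end
end

settings = SchedulerSettings.new
settings.batch_size = 25
puts settings.batch_size
```

Keeping the current constants as defaults means host apps that never touch the new settings see no behavior change.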

Risks

Memory on 1-CPU/2GB: Raising concurrency too far can exhaust memory. Each Solid Queue worker thread holds a DB connection. Sweet spot is likely 3-5 for this hardware.
Connection pool: More concurrent workers need a larger DB connection pool. Default pool is usually 5 -- must coordinate with Solid Queue config.
Backward compatibility: Changing defaults could affect existing host apps. All changes should be opt-in or conservative new defaults.

Recommendations

Priority 1: Fix error handling (correctness)

Split rescue in update_source_state! to only catch broadcast errors
Add ensure block in FetchRunner#run for status safety net
Add rescue in FollowUpHandler#call
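The first two items could look roughly like the sketch below. It uses stand-in classes (SketchSource, FetchRunnerSketch) and injected callables rather than the real FetchRunner, whose internals are not shown here, so treat it as an outline of the pattern and not as the actual change.

```ruby
# Hypothetical stand-ins; the real FetchRunner and source model differ.
SketchSource = Struct.new(:fetch_status)

class FetchRunnerSketch
  def initialize(source, fetcher:, broadcaster:)
    @source = source
    @fetcher = fetcher          # callable that performs the fetch
    @broadcaster = broadcaster  # callable that pushes UI updates
  end

  def run
    @source.fetch_status = "fetching"
    @fetcher.call(@source)
    update_source_state!("idle")
  rescue StandardError
    update_source_state!("failed")
    raise
  ensure
    # Safety net: never leave the source in "fetching", whatever happened above.
    @source.fetch_status = "failed" if @source.fetch_status == "fetching"
  end

  private

  def update_source_state!(status)
    @source.fetch_status = status # a persistence error here would propagate
    begin
      @broadcaster.call(@source)  # only broadcast failures are swallowed
    rescue StandardError => e
      warn "broadcast failed: #{e.message}"
    end
  end
end

source = SketchSource.new("idle")
runner = FetchRunnerSketch.new(source, fetcher: ->(_s) {}, broadcaster: ->(_s) { raise "socket closed" })
runner.run
puts source.fetch_status
```

The ensure block matters for errors that bypass rescue StandardError; without it those would leave the status stuck until StalledFetchReconciler runs.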

Priority 2: Add scheduling jitter/stagger (throughput)

Add jitter to fixed-interval path
Stagger next_fetch_at during OPML import
Make JITTER_PERCENT configurable via FetchingSettings
Stagger job enqueuing in Scheduler (spread across the minute window)
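One simple way to stagger, sketched under assumptions: compute evenly spaced offsets across the scheduling window and apply one per source. The helper stagger_offsets is hypothetical; the same offsets could be used as enqueue delays in the Scheduler or added to the import time to seed next_fetch_at.

```ruby
# Hypothetical helper: evenly spaced delays (in seconds) for `count` jobs
# across `window_seconds`. The first job runs immediately and the last one
# starts before the window ends.
def stagger_offsets(count, window_seconds: 60.0)
  return [] if count <= 0

  step = window_seconds.to_f / count
  Array.new(count) { |index| (index * step).round(2) }
end

offsets = stagger_offsets(4, window_seconds: 60)
puts offsets.inspect
# => [0.0, 15.0, 30.0, 45.0]
```

In the Scheduler, each offset could be passed to Active Job's set(wait: ...) before perform_later on FetchFeedJob; the exact job arguments are not shown in this document and would need to match the existing call. Even spacing is deterministic, so it could be combined with random jitter if several schedulers run at once.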

Priority 3: Optimize defaults for small servers (configuration)

Keep default fetch_queue_concurrency at 2 -- it's actually appropriate for 1-CPU/2GB
Lower DEFAULT_BATCH_SIZE from 100 to 25 and make configurable
Lower STALE_QUEUE_TIMEOUT from 10 to 5 minutes
Separate utility jobs (cleanup, favicon) from fetch queue

Priority 4: Document scaling guidance

Add comments/docs showing how to tune for different server sizes
Provide example configurations for small/medium/large deployments