SourceMonitor Engine - Project Overview & Capabilities

Project Overview

SourceMonitor Engine is a production-ready Rails 8 mountable engine for aggregating, monitoring, and managing RSS/Atom/JSON feeds at scale. Built with Rails 8 defaults and minimal dependencies, it provides a complete solution for feed ingestion, content scraping, and real-time monitoring.

Core Philosophy

Rails 8 Native: Leverages Solid Queue, Solid Cache, and Turbo
Minimal Dependencies: Rails + Tailwind + essential feed/HTTP gems only
Testable Vertical Slices: Every phase delivers working, testable functionality
Host App Extensibility: Hooks and callbacks for custom behavior
Production Ready: Built-in observability, error recovery, and performance optimization
TDD & Full Test Coverage: Uses Minitest and Rails testing best practices to maximize reliability

Technology Stack

Framework: Rails 8
Background Jobs: Solid Queue (Rails 8 default)
Caching: Solid Cache (Rails 8 default)
Real-time Updates: Turbo (Rails 8 default)
Styling: Tailwind CSS
HTTP Client: Faraday with retry middleware
Feed Parsing: Feedjira
Content Extraction: Ruby Readability
Testing: Minitest with VCR/WebMock

Core Capabilities

1. Feed Source Management

Add and Configure Multiple Sources

Create unlimited feed sources with custom settings
Configure per-source fetch intervals (hourly to weekly)
Set custom headers for authentication/API keys
Store source metadata in flexible JSONB fields
Validate and normalize feed URLs automatically
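
The validation and normalization step above can be sketched with Ruby's standard library. This is only an illustration: normalize_feed_url is a hypothetical helper name, and the exact rules the engine applies are not documented in this overview.

```ruby
require "uri"

# Hypothetical helper (not the engine's documented API).
# Returns a normalized URL string, or nil when the input is not a usable HTTP(S) URL.
def normalize_feed_url(raw)
  uri = URI.parse(raw.to_s.strip)
  # URI::HTTPS is a subclass of URI::HTTP, so this accepts both schemes.
  return nil unless uri.is_a?(URI::HTTP) && uri.host && !uri.host.empty?

  uri.host = uri.host.downcase          # host names are case-insensitive
  uri.path = "/" if uri.path.empty?     # treat "https://example.com" and ".../" as the same feed
  uri.fragment = nil                    # fragments are never sent to the server
  uri.to_s
rescue URI::InvalidURIError
  nil
end

puts normalize_feed_url(" HTTP://Example.COM/feed.xml#top ").inspect
puts normalize_feed_url("ftp://example.com/feed.xml").inspect
```

Returning nil rather than raising lets a model validation turn a bad URL into a form error instead of an exception.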

Source Health Monitoring

Track success/failure rates over time
Automatic backoff for failing sources
Auto-pause sources after repeated failures
Auto-recovery detection and resume
Visual health status indicators in UI

Source Status Control

Manually pause/resume individual sources
Bulk enable/disable operations
Schedule future activation times
Track last fetch timestamp and duration
Monitor HTTP status codes and errors

2. Feed Fetching & Processing

Intelligent HTTP Fetching

Automatic feed format detection (RSS 0.9-2.0, Atom, JSON Feed)
Conditional GET with ETag and Last-Modified support
304 Not Modified handling to reduce bandwidth
Configurable timeouts and retry strategies
Per-host rate limiting to respect server limits
Automatic redirect following with limits
Gzip compression support
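
As a rough sketch of the conditional GET flow listed above: the validators saved from the previous response are sent back as If-None-Match and If-Modified-Since, and a 304 means the stored copy is still current. CachedValidators, conditional_headers, and interpret_response are hypothetical names, and headers are modeled as a plain Hash with exact-case keys rather than an HTTP client's case-insensitive header object.

```ruby
# Hypothetical stand-in for the ETag / Last-Modified values stored on a source.
CachedValidators = Struct.new(:etag, :last_modified, keyword_init: true)

# Build the request headers for a conditional GET from the stored validators.
def conditional_headers(validators)
  headers = {}
  headers["If-None-Match"] = validators.etag if validators.etag
  headers["If-Modified-Since"] = validators.last_modified if validators.last_modified
  headers
end

# Decide what to do with a response, given its status code and headers.
# Returns [action, validators_to_store].
def interpret_response(status, response_headers, validators)
  # 304 Not Modified: no body was sent, so keep the validators we already have.
  return [:not_modified, validators] if status == 304

  fresh = CachedValidators.new(
    etag: response_headers["ETag"],
    last_modified: response_headers["Last-Modified"]
  )
  [:parse_body, fresh]
end

stored = CachedValidators.new(etag: '"abc123"', last_modified: "Wed, 01 Jan 2025 00:00:00 GMT")
puts conditional_headers(stored).inspect
puts interpret_response(304, {}, stored).first
```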

Scheduled Fetching

Fixed interval scheduling (every N hours)
Adaptive scheduling based on posting frequency
Exponential backoff for failing sources
Jitter to prevent thundering herd
SELECT FOR UPDATE SKIP LOCKED so concurrent workers never claim the same source
Manual fetch triggering from UI or API
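
Exponential backoff with jitter, as listed above, can be sketched in a few lines. The helper name backoff_delay and all default values (one-hour base, one-week cap, 10% jitter) are assumptions for illustration, not the engine's actual settings.

```ruby
# Hypothetical helper: seconds to wait before retrying a failing source.
#   failure_count - consecutive failures so far (0 means no failures yet)
#   base          - delay in seconds when failure_count is 0 (assumed default)
#   cap           - upper bound on the un-jittered delay (assumed default)
#   jitter        - fraction of the delay used as a random +/- spread
def backoff_delay(failure_count, base: 3600, cap: 7 * 24 * 3600, jitter: 0.1, rng: Random.new)
  exponential = [base * (2**failure_count), cap].min
  spread = exponential * jitter
  # rng.rand is in [0, 1), so the offset falls in [-spread, +spread).
  exponential + spread * (rng.rand * 2 - 1)
end

(0..4).each do |failures|
  puts "after #{failures} failures: #{backoff_delay(failures, jitter: 0).round}s"
end
```

The jitter term keeps sources that failed at the same moment (for example during a shared outage) from all retrying at the same moment later.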

Item Creation & Deduplication

GUID-based duplicate prevention
Content fingerprint fallback for feeds without GUIDs
Idempotent upsert operations
Duplicate detection across refetches
Counter cache updates for performance
Comprehensive metadata extraction
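
The GUID-first, fingerprint-fallback rule above might look like the following. This is a sketch under assumptions: dedup_key is a hypothetical name, entries are plain hashes, and the fields hashed into the fingerprint (URL, title, content) are a guess at a reasonable choice rather than the engine's documented behavior.

```ruby
require "digest"

# Hypothetical helper: returns the key used to detect duplicate items.
def dedup_key(entry)
  guid = entry[:guid].to_s.strip
  return "guid:#{guid}" unless guid.empty?

  # No GUID: hash a whitespace-normalized view of the entry instead.
  basis = [entry[:url], entry[:title], entry[:content]]
          .map { |value| value.to_s.strip.gsub(/\s+/, " ") }
          .join("\n")
  "fp:#{Digest::SHA256.hexdigest(basis)}"
end

# In-memory stand-in for an idempotent upsert: refetching the same entry is a no-op.
store = {}
entry = { url: "https://example.com/post", title: "Hello", content: "Body  text" }
2.times { store[dedup_key(entry)] ||= entry }
puts store.size
```

In a database-backed version, the same key would typically be enforced by a unique index so that concurrent fetches cannot insert the same item twice.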

3. Content Metadata Extraction

Standard Feed Fields

Title, URL, GUID, canonical URL
Author name and authors array
Publication and update timestamps
Summary and full content (HTML/text)
Language and copyright information

Extended Metadata

Categories, tags, and keywords
Media enclosures (podcasts, videos, images)
Media thumbnails and content objects
Comments URL and count
Custom metadata in JSONB storage
Multiple namespace support (DC, Media RSS, etc.)

Feed-Level Metadata

Feed format and version detection
Feed-level metadata storage
HTTP response headers capture
Feed size tracking
Item counts per fetch

4. Content Scraping

Full Article Extraction

Toggle content scraping per source
Automatic or manual scraping triggers
Multiple storage layers (raw HTML + extracted content)
Readability-based content extraction
Custom CSS selector support
Configurable JavaScript rendering support

Scraper Adapters

Extensible adapter interface
Built-in Readability adapter
Custom adapter support via configuration
Adapter-specific settings per source
Fallback strategies for extraction failures

Scraping Control

Auto-scrape new items on fetch
Manual scrape individual items from UI
Bulk scraping operations
Scrape retry logic with backoff
Per-source scraping configuration
Scraping status tracking per item

5. Admin Interface

Dashboard

Real-time statistics overview
Source counts by status (active, paused, failed)
Recent activity feed (latest fetches and scrapes)
Quick action buttons for common tasks
Health metrics at a glance
Links to job monitoring (Mission Control)

Source Management Views

Table listing with status indicators
Detailed source page with all settings
Create/edit forms with validation
Fetch history display per source
Items list per source
Scraping configuration panel
Manual fetch/scrape triggers

Item Browser

Paginated item listing across all sources
Filter items by source
Search by title
Sort by publication date
Item detail view with all content versions
Display feed content, scraped HTML, and extracted content
Scraping status indicators
Manual scrape buttons

Log Viewers

Fetch logs with success/failure indicators
Scrape logs with adapter information
Chronological activity timeline
Error message display with full backtraces
Performance metrics (duration, size)
Filter logs by success/failure
Detailed log view with all metadata

6. Background Job Processing

Solid Queue Integration

Rails 8 native background processing
Configurable queue priorities
Job retry with exponential backoff
Advisory locks for concurrent fetch prevention
Job status tracking in UI
Mission Control dashboard integration

Job Types

FetchFeedJob: Fetch and process feed sources
ScrapeItemJob: Extract full article content
SchedulerJob: Find and enqueue due sources (optional)
ItemCleanupJob: Remove old items per retention policy
LogCleanupJob: Clean up old fetch/scrape logs

Job Features

Idempotent job design
Error handling with structured logging
Automatic retry on transient failures
Circuit breaker for persistent failures
Performance tracking per job
Callback hooks after job completion

7. Scheduling System

Flexible Scheduler Architecture

Single entry point: Scheduler.run
Invokable via rake task, cron, or systemd timer
Optional recurring job via Solid Queue
Manual trigger from dashboard
Database-level locking for distributed environments

Scheduling Strategies

Fixed Interval: Fetch every N hours
Adaptive: Adjust based on posting frequency
Exponential Backoff: Increase interval after failures
Jitter: Random delays to distribute load
Backoff Until: Honor temporary suspension times

Source Scheduling

Per-source next_fetch_at timestamps
Due source query with indexed lookups
Configurable fetch interval per source
Pause/resume affects scheduling
Manual override via UI

8. Real-time Updates

Turbo Streams

Live dashboard updates without polling
Real-time item count updates
Fetch status changes broadcast instantly
New item notifications
Log streaming to UI
Job completion notifications

Progressive Enhancement

Turbo Drive for fast page transitions
Turbo Frames for isolated updates
Stimulus controllers for interactivity
Graceful degradation without JavaScript
No WebSocket configuration required (Rails 8 default)

Interactive UI Components

Auto-refresh dashboard
Infinite scroll for item lists
Toggle switches for source settings
Loading states for async actions
Toast notifications for feedback

9. Data Management

Retention Policies

Configurable retention by age (days)
Retention by maximum item count per source
Global and per-source retention settings
Soft delete option for items
Cascade delete for source removal

Cleanup Automation

Scheduled cleanup jobs via Solid Queue
Manual cleanup via rake tasks
Log retention separate from items
Orphaned data cleanup
Performance-optimized bulk deletes

Data Integrity

Foreign key constraints
Unique constraints on GUID and fingerprint
Database-level validations
Transaction support for atomic operations
Counter cache consistency

10. Observability

ActiveSupport Notifications

source_monitor.fetch.start - Before fetch begins
source_monitor.fetch.finish - After fetch completes
source_monitor.scrape.start - Before scraping
source_monitor.scrape.finish - After scraping
Custom event support for host apps

Health Monitoring

/health endpoint with system status
Database connection checks
Job queue health monitoring
Source health aggregation
Performance metrics collection

Logging

Structured fetch logs with full metadata
Scrape logs with adapter details
Error capture with backtraces
HTTP response headers storage
Duration tracking for all operations
Success/failure statistics

Metrics & Analytics

Fetch success/failure rates over time
Scraping performance metrics
Job queue depth monitoring
Error pattern analysis
Time-series data with Chart.js
Exportable metrics data

11. Error Recovery

Smart Retry Logic

Per-error-type retry strategies
Exponential backoff with jitter
Maximum retry limits
Circuit breaker pattern
Automatic interval adjustment on failure

Self-Healing Features

Auto-pause failing sources
Auto-recovery detection
Backoff period enforcement
Failure count tracking
Manual retry from UI

Error Tracking

Structured error class hierarchy
Full error messages and backtraces
Error timestamps and context
Integration points for error tracking services
Alert threshold configuration

Alerting System

Configurable alert thresholds
Webhook notification support
Error tracking service integration
Alert management UI
Per-source alert configuration

12. Host Application Integration

Configuration DSL

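SourceMonitor.configure do |config|
  # Units are not stated in this overview; seconds for timeouts and hours for the interval are assumed.
  config.fetch_timeout = 30           # limit for a single feed fetch
  config.scrape_timeout = 60          # limit for a single article scrape
  config.user_agent = "MyApp Bot"     # User-Agent header sent with requests
  config.default_fetch_interval = 6   # used when a source sets no interval of its own
  config.retention_days = 30          # items older than this are eligible for cleanup
end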

Event Callbacks

after_item_created - Process new items
after_item_scraped - Handle scraped content
after_fetch_completed - React to fetch events
Custom item processors
Integration with host app workflows
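
How a host app registers these callbacks is not shown in this overview, so the following is only a sketch of the general pattern: a small registry that stores blocks per event and runs them with a payload. CallbackRegistry, its on and run methods, and the payload shape are all hypothetical; only the three event names come from the list above.

```ruby
# Hypothetical callback registry illustrating the hook pattern.
class CallbackRegistry
  EVENTS = %i[after_item_created after_item_scraped after_fetch_completed].freeze

  def initialize
    @handlers = Hash.new { |hash, event| hash[event] = [] }
  end

  # Register a block to run when the event fires.
  def on(event, &block)
    raise ArgumentError, "unknown event: #{event}" unless EVENTS.include?(event)
    raise ArgumentError, "a block is required" unless block

    @handlers[event] << block
    self
  end

  # Run every handler; one failing handler must not block the others.
  def run(event, payload)
    @handlers[event].each do |handler|
      handler.call(payload)
    rescue StandardError => e
      warn "#{event} handler failed: #{e.class}: #{e.message}"
    end
  end
end

registry = CallbackRegistry.new
registry.on(:after_item_created) { |payload| puts "new item: #{payload[:title]}" }
registry.run(:after_item_created, { title: "Hello" })
```

Rescuing inside the loop reflects a common design choice for engines: a bug in host-app code should be logged, not allowed to fail the fetch that triggered it.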

Model Extensions

Override model methods via concerns
Add custom validations
Extend with STI for source types
Custom scopes and queries
Add polymorphic associations

Custom Fields

Use metadata JSONB for custom data
Table name prefixing support
Migration generation for custom columns
Flexible schema extensions

13. Performance & Scalability

Database Optimization

Strategic indexing on all queries
Counter cache for item counts
Efficient pagination with Kaminari/Pagy
N+1 query elimination
Batch insert operations
Query performance monitoring

Caching Strategy

Solid Cache for Rails 8
Fragment caching for expensive views
Dashboard statistics caching
Feed response caching with TTL
Cache invalidation on updates

Concurrent Processing

Parallel source fetching
Advisory locks prevent duplicate work
SELECT FOR UPDATE SKIP LOCKED
Job queue prioritization
Batch operations for efficiency

Scalability Features

Horizontal scaling support
Distributed job processing
Database connection pooling
Efficient memory usage
Tested with 1000+ sources

14. Security

Input Validation

URL validation and normalization
HTML sanitization
SQL injection prevention via ActiveRecord
CSRF protection (Rails default)
Mass assignment protection

SSRF Protection

Private IP range blocking
Allowlist/denylist support
Request timeout enforcement
Redirect limit enforcement
SSL/TLS verification required

Authentication & Authorization

Host app authentication integration
Before action filters for access control
Role-based permissions support
API token authentication support
Configurable authorization callbacks

Security Headers

Content Security Policy
X-Frame-Options
X-Content-Type-Options
Strict-Transport-Security
Security audit via Brakeman

15. Installation & Setup

Generator Tasks

One-command installation
Automatic migration generation
Route mounting with configuration
Initializer creation
Tailwind CSS setup
Solid Queue configuration

Configuration Options

Mount path customization
Namespace isolation
Custom controller inheritance
Layout customization
Helper method configuration

Example Applications

Basic integration example
Advanced customization example
Custom adapter example
Docker deployment configuration
Production deployment guides

Data Models

Source

Complete feed source configuration and state tracking.

Key Fields:

Feed URL, website URL, name
Active status, fetch interval, next fetch time
HTTP caching (ETag, Last-Modified)
Error tracking (failure count, last error, backoff)
Scraping settings (enabled, auto-scrape, adapter, custom CSS)
Retention policies (days, max items)
Custom headers, metadata (JSONB)

Item

Aggregated content with multiple storage layers.

Key Fields:

GUID, content fingerprint
Title, URL, canonical URL
Author(s), publication timestamps
Summary, full content (from feed)
Scraped HTML, extracted content
Scraping status and timestamp
Categories, tags, keywords (JSONB)
Media enclosures, thumbnails (JSONB)
Comments URL/count, metadata (JSONB)

FetchLog

Complete audit trail of fetch operations.

Key Fields:

Success/failure status
Items created/updated/failed counts
HTTP status, response headers (JSONB)
Duration, feed size, items in feed
Error class, message, backtrace
Job ID, timestamps

ScrapeLog

Audit trail of scraping operations.

Key Fields:

Success/failure status
Scraper adapter used
HTTP status, duration, content length
Error class, message
Metadata (JSONB)

Extension Points

Custom Scrapers

Implement the Scrapers::Base interface to add new extraction methods:

Custom HTML parsing logic
JavaScript rendering support
API-based content retrieval
Specialized content extraction
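
The Scrapers::Base interface itself is not spelled out in this overview, so the sketch below invents a stand-in to show the shape such an adapter might take: a base class with one method to override and a result object reporting success or failure. ScraperSketch, its Result struct, the call(html) signature, and ArticleTagScraper are all hypothetical. The regex-based extraction keeps the example dependency-free; a real adapter would use a proper HTML parser.

```ruby
module ScraperSketch
  # Hypothetical result object returned by every adapter.
  Result = Struct.new(:success, :content, :error, keyword_init: true)

  # Hypothetical base class; the engine's real Scrapers::Base may differ.
  class Base
    def call(_html)
      raise NotImplementedError, "#{self.class} must implement #call"
    end
  end

  # Example adapter: returns the text inside the first <article> element.
  class ArticleTagScraper < Base
    ARTICLE = %r{<article\b[^>]*>(.*?)</article>}mi

    def call(html)
      match = ARTICLE.match(html.to_s)
      return Result.new(success: false, error: "no <article> element found") unless match

      # Replace tags with spaces, then collapse runs of whitespace.
      text = match[1].gsub(/<[^>]+>/, " ").gsub(/\s+/, " ").strip
      Result.new(success: true, content: text)
    end
  end
end

html = "<html><body><article class='post'><h1>Hi</h1><p>Some  text</p></article></body></html>"
result = ScraperSketch::ArticleTagScraper.new.call(html)
puts result.content
```

Returning a result object instead of raising makes it straightforward to record a failed attempt in a scrape log and fall back to another adapter.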

Event Handlers

Hook into the feed lifecycle:

Send items to external systems
Trigger notifications
Update search indexes
Generate summaries with AI
Content classification

Custom Processing

Extend item processing:

Content transformation
Metadata enrichment
Categorization/tagging
Spam filtering
Quality scoring

UI Customization

Modify the admin interface:

Custom layouts and styling
Additional views and actions
Integration with host app navigation
White-label branding
Stimulus controllers ship as ES modules via Importmap (source_monitor/application); host apps can override controller registrations with importmap.rb
Dropdown interactions use stimulus-use transitions when available and automatically fall back to class toggling when the module is not pinned
bin/rails app:source_monitor:assets:build and app:source_monitor:assets:verify keep Tailwind builds current; verification runs before rails test

Testing Support

Test Infrastructure

Minitest configuration
VCR cassettes for HTTP mocking
WebMock for request stubbing
System tests with Capybara
Factory support for test data

Test Coverage

Unit tests for all services
Integration tests for workflows
System tests for UI interactions
Contract tests for adapters
Performance benchmarks

Edge Cases

Malformed feed handling
Missing GUID scenarios
Invalid date parsing
Network timeout simulation
Concurrent access tests

Production Readiness

Deployment Support

Docker configuration
Environment variable configuration
Database migration management
Background worker setup
Monitoring integration

Operational Features

Health check endpoints
Metrics collection
Log aggregation support
Error tracking integration
Performance monitoring

Documentation

Comprehensive README
Installation guide
Configuration reference
API documentation
Troubleshooting guide
Deployment best practices

Release Strategy

Semantic Versioning: Follow MAJOR.MINOR.PATCH. Breaking changes bump MAJOR, additive features bump MINOR, bugfixes and documentation-only changes bump PATCH.
Release Cadence: Target monthly MINOR releases with PATCH releases on demand for urgent fixes.
Release Checklist:
1. rbenv exec bundle exec rails test
2. rbenv exec bundle exec rubocop
3. rbenv exec bundle exec rake app:source_monitor:assets:verify
4. rbenv exec bundle exec gem build source_monitor.gemspec
5. Update CHANGELOG.md with release notes and tag commit (git tag vX.Y.Z)
6. Push tag and publish gem (rbenv exec gem push pkg/source_monitor-X.Y.Z.gem)
7. Announce release in project README/CHANGELOG summary as needed

Use Cases

Content Aggregation

News aggregation platforms
Blog aggregators
Podcast directories
Video feed aggregation
Multi-source content hubs

Monitoring & Tracking

Competitor content monitoring
Brand mention tracking
Industry news monitoring
Topic-specific content tracking
Research feed aggregation

Content Processing

Feed-to-email newsletters
Social media auto-posting
Content curation systems
Search index population
ML training data collection

Internal Tools

Company blog aggregation
Team knowledge bases
Documentation aggregation
Product update tracking
Industry research dashboards

Getting Started

1. Add to Gemfile: gem 'source_monitor', then run bundle install
2. Run: rails g source_monitor:install
3. Run: rails db:migrate
4. Mount in routes: mount SourceMonitor::Engine => "/source_monitor"
5. Visit: http://localhost:3000/source_monitor
6. Add your first source and click "Fetch Now"

The engine is production-ready from day one with sensible defaults and can be customized extensively for specific needs.