Link Checker for Markdown Files

This directory contains a link checker that validates every link in Markdown files and verifies that linked anchors exist.

Features

✅ Comprehensive Link Detection: Finds links in [text](url), <url>, and bare URL formats
📁 Relative Link Checking: Validates relative file paths (e.g., docs/images/logo.svg)
🔗 Anchor Verification: Validates that anchor links (#section) actually exist in the target page
🚀 Concurrent Processing: Multi-threaded checking for faster execution
⚙️ Configurable: Supports exclusion lists and custom settings
🤖 CI/CD Integration: GitHub Action for automated weekly checks
📊 Detailed Reporting: Clear output with line numbers and failure reasons

Files

check_links.py - Main link checker script
link-check-config.json - Configuration file
requirements-linkcheck.txt - Python dependencies
test_link_checker.py - Test script
README-linkcheck.md - This documentation

Installation

bash

# Install dependencies
uv pip install -r scripts/requirements-linkcheck.txt

# Make script executable (Linux/Mac)
chmod +x scripts/check_links.py

Usage

Basic Usage

bash

# Check all markdown files in current directory (failures only)
python scripts/check_links.py

# Check with verbose output (show all links)
python scripts/check_links.py --verbose

# Check specific directory
python scripts/check_links.py docs/

# Use custom configuration
python scripts/check_links.py --config my-config.json

Command Line Options

bash

python scripts/check_links.py [directory] [options]

Options:
  --config FILE         Configuration file (default: scripts/link-check-config.json)
  --timeout SECONDS     Request timeout (overrides config)
  --max-workers N       Concurrent workers (overrides config)
  --delay SECONDS       Delay between requests (overrides config)
  --verbose, -v         Show all links (including successful ones)
  --help                Show help message

Configuration File

The link-check-config.json file allows you to customize the checker behavior:

json

{
  "exclude_urls": [
    "https://example.com/placeholder",
    "https://localhost",
    "http://localhost"
  ],
  "exclude_link_patterns": [
    ".*\\.local.*",
    ".*127\\.0\\.0\\.1.*",
    ".*\\$\\{.*\\}.*"
  ],
  "exclude_directories": [
    "bin", "deps", "tests", ".github"
  ],
  "timeout": 15,
  "max_workers": 5,
  "delay": 0.2,
  "user_agent": "Mozilla/5.0 (compatible; RediSearch-LinkChecker/1.0)"
}

Configuration Options:

exclude_urls: List of exact URLs to skip
exclude_link_patterns: List of regex patterns; any link whose resolved URL or path matches a pattern is skipped
- Note: Patterns starting with ^/ match absolute file system paths (the resolved form of relative links)
- Note: Other patterns match anywhere in a URL (absolute links)
exclude_directories: List of directory names to skip when scanning for markdown files
timeout: Request timeout in seconds
max_workers: Number of concurrent threads
delay: Delay between requests in seconds (be respectful to servers)
user_agent: User agent string for requests
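
A minimal sketch of how these settings could be combined with the command-line overrides described earlier. The load_settings helper and the DEFAULTS table are hypothetical names introduced for illustration; the default values simply mirror the sample configuration and are not taken from check_links.py.

```python
import json
from pathlib import Path

# Assumed defaults that mirror the sample configuration above;
# the real script may define different ones.
DEFAULTS = {
    "exclude_urls": [],
    "exclude_link_patterns": [],
    "exclude_directories": [],
    "timeout": 15,
    "max_workers": 5,
    "delay": 0.2,
}


def load_settings(config_path, overrides=None):
    """Merge defaults, the JSON config file, and CLI overrides (highest priority)."""
    settings = dict(DEFAULTS)
    path = Path(config_path)
    if path.is_file():
        settings.update(json.loads(path.read_text(encoding="utf-8")))
    # Skip overrides left unset (argparse stores None for options not given).
    for key, value in (overrides or {}).items():
        if value is not None:
            settings[key] = value
    return settings
```

With this layering, a value passed as --timeout wins over the config file, and the config file wins over the built-in default.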

GitHub Action

The link checker runs automatically:

Weekly: Every Sunday at 20:20 UTC (with benchmarks)
On PRs: When markdown files, link checker script, dependencies, or workflow are modified
On-demand: Add the check-links label to any PR to trigger validation
Manual: Can be triggered manually from the GitHub Actions tab

Workflow Features

🔄 Automatic Issue Creation: Creates GitHub issues for broken links found in weekly runs
💬 PR Comments: Comments on PRs when link check fails
📁 Artifact Upload: Saves detailed logs for failed runs
⚡ Smart Throttling: Uses conservative settings to avoid overwhelming servers
🏷️ Label Trigger: Add check-links label to any PR to run validation on-demand

Using the Label Trigger

To run link checking on any PR (even if it doesn't modify markdown files):

Add the label: Go to the PR and add the check-links label
Workflow runs: The link checker will automatically run
Results: Check the Actions tab for results and any PR comments

Testing

Run the test script to verify functionality:

bash

python scripts/test_link_checker.py

This creates temporary markdown files with various link types and tests the checker.

How It Works

Discovery: Recursively finds all .md files in the specified directory
Extraction: Uses regex patterns to extract links from markdown content
Classification: Determines if links are absolute URLs or relative file paths
Validation:
- Absolute URLs: Makes HTTP requests to verify accessibility
- Relative Paths: Checks file system existence relative to the markdown file
Anchor Check: For links with anchors, parses the target HTML to verify that the anchor exists
Reporting: Provides detailed results with line numbers, link types, and error messages
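
The extraction and classification steps above can be sketched roughly as follows. The two regular expressions are simplified assumptions, not the patterns used by check_links.py: they cover [text](url) links and <url> autolinks but not bare URLs, and the function names are hypothetical.

```python
import re
from urllib.parse import urlparse

# Simplified patterns for illustration; the real script's patterns may differ.
INLINE_LINK = re.compile(r"\[[^\]]*\]\(\s*([^)\s]+)[^)]*\)")
AUTOLINK = re.compile(r"<(https?://[^>\s]+)>")


def extract_links(markdown_text):
    """Return (line_number, target) pairs for inline links and <url> autolinks."""
    found = []
    for line_number, line in enumerate(markdown_text.splitlines(), start=1):
        for pattern in (INLINE_LINK, AUTOLINK):
            for match in pattern.finditer(line):
                found.append((line_number, match.group(1)))
    return found


def classify(target):
    """Label a link target as 'absolute', 'anchor', or 'relative'."""
    if urlparse(target).scheme in ("http", "https"):
        return "absolute"
    if target.startswith("#"):
        return "anchor"  # points at a heading in the same file
    return "relative"
```

Recording the line number at extraction time is what allows the final report to point at the exact location of a broken link.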

Anchor Verification

The checker validates anchors by:

Parsing the HTML content of the target page
Looking for elements with a matching id attribute, including GitHub-style header anchors (<h1 id="anchor">)
Checking for <a name="anchor"> tags
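
A rough sketch of the anchor lookup described above. The checker itself lists beautifulsoup4 as its HTML parser (see Dependencies); this example swaps in the standard library's html.parser so that it runs without extra packages, and the AnchorCollector and anchor_exists names are hypothetical.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect every id attribute and every <a name="..."> value in a page."""

    def __init__(self):
        super().__init__()
        self.anchors = set()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # Any element with an id is a valid fragment target.
        if attributes.get("id"):
            self.anchors.add(attributes["id"])
        # Legacy named anchors: <a name="...">.
        if tag == "a" and attributes.get("name"):
            self.anchors.add(attributes["name"])


def anchor_exists(html, anchor):
    """Return True if `anchor` (without the leading '#') is a target in `html`."""
    collector = AnchorCollector()
    collector.feed(html)
    return anchor in collector.anchors
```

Because only the static HTML is inspected, anchors that a page creates with JavaScript after loading would not be found by an approach like this.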

Smart Request Handling

The link checker uses a hybrid approach for maximum reliability and efficiency:

Primary Method - HTTP Session:

Uses requests.Session() for connection pooling and faster performance
Maintains cookies and headers across requests
Supports full anchor verification by parsing HTML content
Efficient for checking multiple links from the same domain

Fallback Method - cURL:

Automatically falls back to curl when sites block automated requests
Uses browser-like headers to bypass bot detection
Handles sites that specifically block the Python requests library
Examples: Package registries (crates.io, npm), some corporate sites

Example scenarios:

✅ github.com links → Fast session-based checking with anchor verification
✅ crates.io links → Fallback to curl when requests are blocked
✅ docs.rs links → Session works, full anchor checking available
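
The hybrid approach can be sketched as below, under several assumptions: the primary argument stands in for the session-based check and is any callable returning an HTTP status code, the set of status codes treated as "blocked" is a guess, and the function names are hypothetical. The curl flags themselves are standard curl options.

```python
import os
import subprocess

# Assumed: statuses that suggest bot blocking rather than a truly broken link.
BLOCKED_STATUSES = {403, 429}


def check_with_curl(url, timeout=15, user_agent="Mozilla/5.0"):
    """Return the HTTP status code reported by curl, or None if curl fails."""
    command = [
        "curl", "--silent", "--location", "--head",
        "--max-time", str(timeout),
        "--user-agent", user_agent,
        "--output", os.devnull,
        "--write-out", "%{http_code}",
        url,
    ]
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout + 5
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    code = result.stdout.strip()
    # curl prints 000 when no HTTP response was received.
    return int(code) if code.isdigit() and code != "000" else None


def check_url(url, primary, fallback=check_with_curl):
    """Try the primary checker; use the fallback if it raises or looks blocked."""
    try:
        status = primary(url)
    except Exception:
        status = None
    if status is None or status in BLOCKED_STATUSES:
        fallback_status = fallback(url)
        if fallback_status is not None:
            status = fallback_status
    return status is not None and status < 400
```

Note that a plain 404 from the primary check is reported as broken without invoking curl, so the fallback only costs time on links that look blocked.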

Exclusion Pattern Logic

The exclude_link_patterns setting works differently depending on the link type:

For Relative Links (resolved to absolute file paths):

^/.*/node_modules/.* - Excludes /path/to/project/node_modules/package.json
^/.*/build/.* - Excludes /path/to/project/build/output.js
Won't affect URLs like https://redis.io/docs/build/guide

For Absolute URLs:

.*\.local.* (written as .*\\.local.* in the JSON file) - Excludes https://myapp.local/api
.*127\.0\.0\.1.* (written as .*127\\.0\\.0\\.1.* in the JSON file) - Excludes http://127.0.0.1:8080
Won't affect file paths like /home/user/build/file.txt
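
The examples above can be reproduced with a short sketch. It assumes the patterns are applied with re.search to the resolved link target; whether check_links.py uses search or match is not stated here, and is_excluded is a hypothetical name.

```python
import re

# Patterns as they look after JSON decoding (single backslashes).
PATTERNS = [
    r"^/.*/node_modules/.*",
    r"^/.*/build/.*",
    r".*\.local.*",
    r".*127\.0\.0\.1.*",
]
COMPILED = [re.compile(pattern) for pattern in PATTERNS]


def is_excluded(resolved_target, compiled=COMPILED):
    """Return True if any exclusion pattern matches the resolved URL or path."""
    return any(pattern.search(resolved_target) for pattern in compiled)


for target in (
    "/path/to/project/node_modules/package.json",
    "https://redis.io/docs/build/guide",
    "https://myapp.local/api",
):
    print(target, "->", "skip" if is_excluded(target) else "check")
```

The ^/ prefix is what keeps the build pattern from matching https://redis.io/docs/build/guide: a URL starts with its scheme, never with a slash.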

Best Practices

For Documentation Authors

Use Descriptive Link Text: Avoid "click here" or generic text
Test Locally: Run the checker before committing changes
Keep Links Current: Regularly review and update external links
Use Relative Links: For internal documentation, prefer relative paths
Check File Paths: Ensure relative links point to existing files
Verify Images: Make sure image links point to actual image files

For Maintainers

Review Weekly Reports: Check automated issues for broken links
Update Exclusions: Add problematic but valid URLs to the exclusion list
Monitor Performance: Adjust delay and max_workers if needed
Keep Dependencies Updated: Regularly update Python packages

Troubleshooting

Common Issues

False Positives: Some sites block automated requests

Solution: The checker automatically tries curl as a fallback. Sites that still block it can be added to exclude_urls

Timeouts: Slow or unreliable sites

Solution: Increase the timeout value or exclude the URL

Rate Limiting: Too many requests too quickly

Solution: Increase delay or reduce max_workers

Anchor Not Found: Valid anchor reported as missing

Solution: Check whether the site generates its anchors with JavaScript. Such anchors are absent from the fetched HTML, so the URL may need to be excluded
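
To illustrate how delay and max_workers interact when tuning for rate limits, here is a small sketch. It assumes the delay is enforced globally across all worker threads, which may not match how check_links.py applies it; Throttle and check_all are hypothetical names.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class Throttle:
    """Space out calls so request starts are at least `delay` seconds apart."""

    def __init__(self, delay):
        self.delay = delay
        self._lock = threading.Lock()
        self._next_start = 0.0

    def wait(self):
        # Reserve the next free start slot under the lock, then sleep outside it.
        with self._lock:
            now = time.monotonic()
            start = max(now, self._next_start)
            self._next_start = start + self.delay
        time.sleep(max(0.0, start - now))


def check_all(urls, check, max_workers=5, delay=0.2):
    """Run `check` on every URL with bounded concurrency and a global delay."""
    throttle = Throttle(delay)

    def task(url):
        throttle.wait()
        return url, check(url)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(task, urls))
```

In this model, delay caps how often requests start, while max_workers caps how many slow requests can be in flight at once, so raising delay is the more direct remedy for rate limiting.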

Debug Mode

For detailed debugging, modify the script to add verbose logging:

python

import logging
logging.basicConfig(level=logging.DEBUG)

Contributing

When modifying the link checker:

Test changes with test_link_checker.py
Update configuration examples if adding new options
Update this README for new features
Test the GitHub Action in a fork before merging

Dependencies

requests: HTTP client library
beautifulsoup4: HTML parsing for anchor verification
lxml: Fast XML/HTML parser (optional but recommended)

All dependencies are pinned in requirements-linkcheck.txt for reproducible builds.