scripts/README-linkcheck.md
This directory contains a comprehensive link checker that validates all links in Markdown files, including anchor verification.
Features:
- Checks links in [text](url), <url>, and bare URL formats
- Checks relative links to local files (e.g. docs/images/logo.svg)
- Verifies that anchors (e.g. #section) actually exist in the target page
Files:
- check_links.py - Main link checker script
- link-check-config.json - Configuration file
- requirements-linkcheck.txt - Python dependencies
- test_link_checker.py - Test script
- README-linkcheck.md - This documentation
# Install dependencies
uv pip install -r scripts/requirements-linkcheck.txt
# Make script executable (Linux/Mac)
chmod +x scripts/check_links.py
# Check all markdown files in current directory (failures only)
python scripts/check_links.py
# Check with verbose output (show all links)
python scripts/check_links.py --verbose
# Check specific directory
python scripts/check_links.py docs/
# Use custom configuration
python scripts/check_links.py --config my-config.json
python scripts/check_links.py [directory] [options]
Options:
  --config FILE        Configuration file (default: scripts/link-check-config.json)
  --timeout SECONDS    Request timeout (overrides config)
  --max-workers N      Concurrent workers (overrides config)
  --delay SECONDS      Delay between requests (overrides config)
  --verbose, -v        Show all links (including successful ones)
  --help               Show help message
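As a rough illustration of how the options above could map onto a parser, here is a minimal sketch using `argparse`. The function names `build_parser` and `effective_settings` are hypothetical and are not taken from `check_links.py`; the sketch only assumes the flags and the "overrides config" behavior listed above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented options (not the script's actual code)."""
    parser = argparse.ArgumentParser(
        prog="check_links.py",
        description="Check links in Markdown files",
    )
    parser.add_argument("directory", nargs="?", default=".",
                        help="Directory to scan (defaults to the current directory)")
    parser.add_argument("--config", metavar="FILE",
                        default="scripts/link-check-config.json",
                        help="Configuration file")
    # Defaults of None let us tell "not given" apart from an explicit value.
    parser.add_argument("--timeout", metavar="SECONDS", type=float, default=None)
    parser.add_argument("--max-workers", metavar="N", type=int, default=None)
    parser.add_argument("--delay", metavar="SECONDS", type=float, default=None)
    parser.add_argument("--verbose", "-v", action="store_true",
                        help="Show all links, including successful ones")
    return parser


def effective_settings(args: argparse.Namespace, config: dict) -> dict:
    """Start from the config file values; command-line values win when given."""
    settings = dict(config)
    for key in ("timeout", "max_workers", "delay"):
        value = getattr(args, key)
        if value is not None:
            settings[key] = value
    return settings


if __name__ == "__main__":
    cli_args = build_parser().parse_args(["docs/", "--timeout", "30", "-v"])
    print(effective_settings(cli_args, {"timeout": 15, "max_workers": 5, "delay": 0.2}))
```

Using `None` as the default for the override flags is one way to keep the config file authoritative unless a flag is passed explicitly.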
The link-check-config.json file allows you to customize the checker behavior:
{
  "exclude_urls": [
    "https://example.com/placeholder",
    "https://localhost",
    "http://localhost"
  ],
  "exclude_link_patterns": [
    ".*\\.local.*",
    ".*127\\.0\\.0\\.1.*",
    ".*\\$\\{.*\\}.*"
  ],
  "exclude_directories": [
    "bin", "deps", "tests", ".github"
  ],
  "timeout": 15,
  "max_workers": 5,
  "delay": 0.2,
  "user_agent": "Mozilla/5.0 (compatible; RediSearch-LinkChecker/1.0)"
}
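To make the exclusion settings concrete, the sketch below parses a trimmed copy of the configuration above and applies `exclude_urls` and `exclude_link_patterns` to a few links. The helper names `load_exclusions` and `is_excluded` are hypothetical, and the use of `re.search` is an assumption; the actual script may match patterns differently.

```python
import json
import re

# Trimmed copy of the example configuration. The raw string keeps the
# doubled backslashes exactly as they appear in the JSON file.
CONFIG_TEXT = r"""
{
  "exclude_urls": ["https://example.com/placeholder", "https://localhost", "http://localhost"],
  "exclude_link_patterns": [".*\\.local.*", ".*127\\.0\\.0\\.1.*", ".*\\$\\{.*\\}.*"]
}
"""


def load_exclusions(text: str):
    """Return (set of exact URLs, list of compiled regex patterns)."""
    config = json.loads(text)
    exact = set(config.get("exclude_urls", []))
    patterns = [re.compile(p) for p in config.get("exclude_link_patterns", [])]
    return exact, patterns


def is_excluded(link: str, exact: set, patterns: list) -> bool:
    """A link is skipped on an exact match or when any pattern matches it."""
    if link in exact:
        return True
    # Assumption: patterns are applied with re.search.
    return any(p.search(link) for p in patterns)


if __name__ == "__main__":
    exact_urls, compiled = load_exclusions(CONFIG_TEXT)
    for candidate in ("https://myapp.local/api", "https://redis.io/docs"):
        print(candidate, is_excluded(candidate, exact_urls, compiled))
```

Note that JSON decoding turns `\\.` into the regex `\.`, which is why the backslashes are doubled in the configuration file.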
Configuration Options:
- exclude_urls: List of exact URLs to skip
- exclude_link_patterns: List of regex patterns for resolved link URLs/paths to skip
  - Patterns starting with ^/ match absolute file system paths (for relative links)
- exclude_directories: List of directory names to skip when scanning for markdown files
- timeout: Request timeout in seconds
- max_workers: Number of concurrent threads
- delay: Delay between requests (be respectful to servers)
- user_agent: User agent string for requests
The link checker runs automatically; validation can also be triggered on demand:
- Add the check-links label to any PR to trigger validation
To run link checking on any PR (even if it doesn't modify markdown files):
- Add the check-links label
Run the test script to verify functionality:
python scripts/test_link_checker.py
This creates temporary markdown files with various link types and tests the checker.
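For a sense of what "various link types" means in practice, the following sketch pulls the three link formats ([text](url), <url>, and bare URLs) out of a Markdown string. The regular expressions and the `extract_links` helper are illustrative assumptions, not the patterns used by `check_links.py`, and they ignore edge cases such as trailing punctuation on bare URLs or links inside code blocks.

```python
import re

# Illustrative patterns only; the real checker may use different ones.
INLINE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")   # [text](url)
AUTOLINK = re.compile(r"<(https?://[^>\s]+)>")     # <url>
BARE = re.compile(r"https?://[^\s<>()]+")          # bare URL


def extract_links(markdown: str) -> list:
    """Collect inline links, then autolinks, then whatever bare URLs remain."""
    links = []
    rest = markdown
    for pattern in (INLINE, AUTOLINK):
        links.extend(pattern.findall(rest))
        # Blank out what was already matched so it is not counted again as a bare URL.
        rest = pattern.sub(" ", rest)
    links.extend(BARE.findall(rest))
    return links


if __name__ == "__main__":
    sample = (
        "See [the guide](docs/guide.md#setup), <https://example.com/a>, "
        "and https://example.com/b for details."
    )
    print(extract_links(sample))
```

Removing each matched form before looking for the next one is a simple way to avoid reporting the same URL twice.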
The checker scans for .md files in the specified directory.
The checker validates anchors by looking in the target page for:
- id attributes
- <a name="anchor"> tags
- Heading anchors (e.g. <h1 id="anchor">)
The link checker uses a hybrid approach for maximum reliability and efficiency:
Primary Method - HTTP Session:
- Uses requests.Session() for connection pooling and faster performance
Fallback Method - cURL:
- Falls back to curl when sites block automated requests from the requests library
Example scenarios:
- github.com links → Fast session-based checking with anchor verification
- crates.io links → Fallback to curl when requests are blocked
- docs.rs links → Session works, full anchor checking available
The exclude_link_patterns work differently for different link types:
For Relative Links (resolved to absolute file paths):
- ^/.*/node_modules/.* - Excludes /path/to/project/node_modules/package.json
- ^/.*/build/.* - Excludes /path/to/project/build/output.js
- These patterns do not match URLs such as https://redis.io/docs/build/guide
For Absolute URLs:
- .*\\.local.* - Excludes https://myapp.local/api
- .*127\\.0\\.0\\.1.* - Excludes http://127.0.0.1:8080
- These patterns do not match file paths such as /home/user/build/file.txt
Adjust delay and max_workers if needed.
False Positives: Some sites block automated requests
- The checker uses curl as fallback, but you can also add persistent blockers to exclude_urls
Timeouts: Slow or unreliable sites
- Increase the timeout value or exclude the URL
Rate Limiting: Too many requests too quickly
- Increase delay or reduce max_workers
Anchor Not Found: Valid anchor reported as missing
For detailed debugging, modify the script to add verbose logging:
import logging
logging.basicConfig(level=logging.DEBUG)
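When chasing an "Anchor Not Found" report, it can help to log which anchors were actually found in the target page. The sketch below is a self-contained illustration built on the standard library's `html.parser`; the `AnchorCollector` class and `anchor_exists` function are hypothetical and do not come from `check_links.py`. It looks for the anchor forms described earlier: `id` attributes and `<a name="anchor">` tags.

```python
import logging
from html.parser import HTMLParser

logging.basicConfig(level=logging.DEBUG)
log = logging.getLogger("linkcheck.anchors")  # hypothetical logger name


class AnchorCollector(HTMLParser):
    """Collects every id attribute and every <a name="..."> value in a page."""

    def __init__(self):
        super().__init__()
        self.anchors = set()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if attributes.get("id"):
            self.anchors.add(attributes["id"])
        if tag == "a" and attributes.get("name"):
            self.anchors.add(attributes["name"])


def anchor_exists(html: str, anchor: str) -> bool:
    """Return True if the anchor (with or without a leading '#') is defined in the HTML."""
    collector = AnchorCollector()
    collector.feed(html)
    # The debug line shows every candidate, which makes a mismatch easy to spot.
    log.debug("looking for %r among %s", anchor, sorted(collector.anchors))
    return anchor.lstrip("#") in collector.anchors


if __name__ == "__main__":
    page = '<h1 id="install">Install</h1><a name="usage"></a>'
    print(anchor_exists(page, "#install"))
    print(anchor_exists(page, "#missing"))
```

With the log level set to DEBUG, each lookup prints the full list of anchors found, so a case or spelling difference between the link and the page becomes visible.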
When modifying the link checker:
- Verify your changes with test_link_checker.py
All dependencies are pinned in requirements-linkcheck.txt for reproducible builds.