optional-skills/research/osint-investigation/references/sources/wayback.md
The Internet Archive's Wayback Machine has captured more than 900 billion web pages since 1996. The CDX server API indexes those captures by URL, timestamp, and content hash. Free, anonymous, no auth.
- CDX search: https://web.archive.org/cdx/search/cdx
- Replay: https://web.archive.org/web/<timestamp>/<url>
- Save: https://web.archive.org/save/<url> (different API)

Key fields emitted by fetch_wayback.py:

| Column | Type | Description |
|---|---|---|
| url | str | Original URL captured |
| timestamp | str | YYYYMMDDHHMMSS (CDX format) |
| wayback_url | str | Direct replay URL |
| mimetype | str | Content-type at capture |
| status | str | HTTP status (typically 200) |
| digest | str | SHA1 of capture content (collapse-friendly) |
| length | str | Byte length of capture |
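
Since every field arrives as a string, downstream code usually has to parse the timestamp itself. The sketch below shows one way to do that and how a replay URL can be rebuilt from `timestamp` and `url`. The helper names are hypothetical, and treating the timestamp as UTC is an assumption here, not something the script is known to guarantee.

```python
from datetime import datetime, timezone


def parse_cdx_timestamp(ts: str) -> datetime:
    """Parse a 14-digit CDX timestamp (YYYYMMDDHHMMSS); UTC is assumed."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


def replay_url(ts: str, url: str) -> str:
    """Rebuild a replay URL from the pattern /web/<timestamp>/<url>."""
    return f"https://web.archive.org/web/{ts}/{url}"


captured = parse_cdx_timestamp("20210315123000")
print(captured.isoformat())
print(replay_url("20210315123000", "https://example.com/page"))
```
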
The Wayback CDX is most useful as a content-recovery layer when other sources point to URLs that no longer exist.
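
As a rough illustration of that recovery step, the sketch below picks the most recent capture with status 200 for a dead URL. It assumes the output CSV carries a header row with the column names from the table above; the function name and the sample rows (including the placeholder digests) are made up for the example.

```python
import csv
import io


def latest_ok_capture(rows, target_url):
    """Return the newest row for target_url with status 200, or None."""
    candidates = [
        r for r in rows
        if r["url"] == target_url and r["status"] == "200"
    ]
    if not candidates:
        return None
    # Fixed-width YYYYMMDDHHMMSS strings sort chronologically as plain text.
    return max(candidates, key=lambda r: r["timestamp"])


# Illustrative data only; a real run would read data/wb.csv instead.
sample_csv = """url,timestamp,wayback_url,mimetype,status,digest,length
https://example.com/page,20190101000000,https://web.archive.org/web/20190101000000/https://example.com/page,text/html,200,AAAA,1200
https://example.com/page,20210601000000,https://web.archive.org/web/20210601000000/https://example.com/page,text/html,200,BBBB,1300
https://example.com/page,20230101000000,https://web.archive.org/web/20230101000000/https://example.com/page,text/html,404,CCCC,300
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))
best = latest_ok_capture(rows, "https://example.com/page")
print(best["wayback_url"] if best else "no usable capture")
```

The 2023 capture is skipped because it recorded a 404, so the 2021 capture is the one selected.
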
Path: scripts/fetch_wayback.py
# All captures of a specific URL
python3 SKILL_DIR/scripts/fetch_wayback.py --url "https://example.com/page" \
--out data/wb.csv
# All captures of a host
python3 SKILL_DIR/scripts/fetch_wayback.py --url "example.com" \
--match host --out data/wb.csv
# All captures of a domain + subdomains
python3 SKILL_DIR/scripts/fetch_wayback.py --url "example.com" \
--match domain --out data/wb.csv
# Only unique-content captures within a date window
python3 SKILL_DIR/scripts/fetch_wayback.py --url "example.com" \
--match host --collapse digest \
--from-date 2020-01-01 --to-date 2023-12-31 \
--out data/wb.csv
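
For orientation, the last command plausibly corresponds to a CDX query like the one built below. This is a sketch, not the script's actual code: the parameter names (`url`, `matchType`, `collapse`, `from`, `to`, `output`) are those of the public CDX server API, but the mapping from the script's flags to them, including stripping dashes from the dates, is an assumption.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(url, match_type="exact", collapse=None,
                    from_date=None, to_date=None):
    """Build a CDX search URL (hypothetical helper, flag mapping assumed)."""
    params = {"url": url, "matchType": match_type, "output": "json"}
    if collapse:
        params["collapse"] = collapse
    if from_date:
        # CDX accepts 1-14 digit timestamps, so 2020-01-01 becomes 20200101.
        params["from"] = from_date.replace("-", "")
    if to_date:
        params["to"] = to_date.replace("-", "")
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


query = build_cdx_query(
    "example.com",
    match_type="host",
    collapse="digest",
    from_date="2020-01-01",
    to_date="2023-12-31",
)
print(query)
```
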