Wayback Machine — Internet Archive CDX

optional-skills/research/osint-investigation/references/sources/wayback.md

1. Summary

The Internet Archive's Wayback Machine has captured more than 900 billion web pages since 1996. The CDX server API indexes those captures by URL, timestamp, and content hash. Access is free and anonymous, with no authentication required.

2. Access Methods

  • CDX server: https://web.archive.org/cdx/search/cdx
  • Wayback URL: https://web.archive.org/web/<timestamp>/<url>
  • Save Page Now (write): https://web.archive.org/save/<url> (different API)
  • Auth: None
  • Rate limit: Generous; be polite (~1 req/s)
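
As a rough illustration of the endpoints listed above, the following Python sketch builds a CDX query URL and a replay URL. The function names (`build_cdx_url`, `build_replay_url`, `fetch_cdx`) are hypothetical and are not taken from `fetch_wayback.py`; the query parameters (`url`, `output`, `matchType`, `from`, `to`, `collapse`, `limit`) are standard CDX server parameters. The one-second pause is only one way to honour the "be polite" guidance.

```python
import json
import time
import urllib.request
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"
REPLAY_PREFIX = "https://web.archive.org/web"


def build_cdx_url(url, match_type=None, from_ts=None, to_ts=None,
                  collapse=None, limit=None):
    """Build a CDX query URL that asks for JSON output (hypothetical helper)."""
    params = {"url": url, "output": "json"}
    if match_type:
        params["matchType"] = match_type  # exact | prefix | host | domain
    if from_ts:
        params["from"] = from_ts          # 1 to 14 digits of YYYYMMDDHHMMSS
    if to_ts:
        params["to"] = to_ts
    if collapse:
        params["collapse"] = collapse     # e.g. "digest"
    if limit:
        params["limit"] = str(limit)
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def build_replay_url(timestamp, url):
    """Build a replay URL of the form /web/<timestamp>/<url>."""
    return f"{REPLAY_PREFIX}/{timestamp}/{url}"


def fetch_cdx(query_url, pause_s=1.0):
    """Fetch one CDX response, then pause to stay near ~1 request per second."""
    with urllib.request.urlopen(query_url, timeout=30) as resp:
        rows = json.load(resp)
    time.sleep(pause_s)
    return rows
```

With `output=json`, the response is a list of rows whose first row holds the field names, so `fetch_cdx(build_cdx_url("example.com", match_type="host", limit=5))` would return a header row followed by up to five capture rows.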

3. Data Schema

Key fields emitted by fetch_wayback.py:

| Column | Type | Description |
|---|---|---|
| url | str | Original URL captured |
| timestamp | str | YYYYMMDDHHMMSS (CDX format) |
| wayback_url | str | Direct replay URL |
| mimetype | str | Content-type at capture |
| status | str | HTTP status (typically 200) |
| digest | str | SHA1 of capture content (collapse-friendly) |
| length | str | Byte length of capture |
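
To make the schema concrete, here is a minimal sketch of how a CDX JSON response could be mapped onto these columns. It assumes the default CDX field names (`urlkey`, `timestamp`, `original`, `mimetype`, `statuscode`, `digest`, `length`) and assumes that `url` corresponds to `original` and `status` to `statuscode`; the actual mapping inside `fetch_wayback.py` is not shown in this document. The helper names and the sample row are illustrative only.

```python
from datetime import datetime, timezone


def rows_to_records(cdx_json):
    """Convert CDX JSON output (header row + data rows) into schema dicts."""
    if not cdx_json:
        return []  # an empty response means no captures matched
    header, *rows = cdx_json
    records = []
    for row in rows:
        raw = dict(zip(header, row))
        records.append({
            "url": raw["original"],
            "timestamp": raw["timestamp"],
            "wayback_url": (
                f"https://web.archive.org/web/{raw['timestamp']}/{raw['original']}"
            ),
            "mimetype": raw["mimetype"],
            "status": raw["statuscode"],
            "digest": raw["digest"],
            "length": raw["length"],
        })
    return records


def parse_timestamp(ts):
    """Parse a 14-digit CDX timestamp; Wayback timestamps are treated as UTC."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


# Illustrative response; the digest value is made up for the example.
sample = [
    ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
    ["com,example)/page", "20200315120000", "https://example.com/page",
     "text/html", "200", "EXAMPLEDIGEST", "1234"],
]
records = rows_to_records(sample)
```

Every value stays a string, matching the `str` types in the table; convert `length` or `timestamp` only at the point of use.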

4. Coverage

  • 1996 → present
  • ~900B+ captures across ~700M domains
  • Updated continuously by automated crawls + manual saves
  • Some domains have aggressive coverage (news), others sparse (private)

5. Cross-Reference Potential

  • Wikipedia ↔ Recover pages cited as references that have since disappeared
  • News URLs ↔ Original article content when present-day URLs 404
  • Corporate websites ↔ Historical "About" pages, executive bios that have been scrubbed

The Wayback CDX is most useful as a content-recovery layer when other sources point to URLs that no longer exist.
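
For content recovery, one common step is choosing which capture to read when a URL has many. The sketch below picks the capture closest to a date of interest from records shaped like the Data Schema columns. The function name `closest_capture` and the `ok_only` flag are hypothetical, and filtering to status `200` is an assumption about what counts as a usable capture.

```python
from datetime import datetime


def closest_capture(records, target_date, ok_only=True):
    """Return the record whose timestamp is nearest to target_date (YYYY-MM-DD)."""
    target = datetime.strptime(target_date, "%Y-%m-%d")
    candidates = [r for r in records if not ok_only or r["status"] == "200"]
    if not candidates:
        return None

    def distance(record):
        captured = datetime.strptime(record["timestamp"], "%Y%m%d%H%M%S")
        return abs(captured - target)

    return min(candidates, key=distance)


# Illustrative records; only the fields used here are filled in.
captures = [
    {"timestamp": "20190101000000", "status": "200"},
    {"timestamp": "20200601000000", "status": "404"},
    {"timestamp": "20210101000000", "status": "200"},
]
best = closest_capture(captures, "2020-05-01")
```

Here the nearest capture in time returned a 404, so with `ok_only=True` the sketch falls back to the 2021 capture, which is closer to the target date than the 2019 one.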

6. Data Quality

  • robots.txt-blocked domains may have spotty or no coverage
  • Captures vary in completeness (HTML may be saved without CSS/JS)
  • Some content is excluded by domain owner request (DMCA, etc.)
  • Coverage of "deep links" (URLs with query strings) is uneven
  • Time resolution is per-capture, not continuous — gaps are common
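
Because captures are discrete events, it can help to measure the gaps before drawing conclusions about when a page changed. The following sketch lists intervals between consecutive captures that exceed a threshold. The name `capture_gaps` and the 30-day default are arbitrary choices for illustration.

```python
from datetime import datetime, timedelta


def capture_gaps(timestamps, min_days=30):
    """Return (start, end, days) for gaps of at least min_days between captures."""
    parsed = sorted(datetime.strptime(ts, "%Y%m%d%H%M%S") for ts in timestamps)
    gaps = []
    for earlier, later in zip(parsed, parsed[1:]):
        delta = later - earlier
        if delta >= timedelta(days=min_days):
            gaps.append((earlier, later, delta.days))
    return gaps


gaps = capture_gaps(["20200101000000", "20200110000000", "20200601000000"])
```

A change first seen in the capture that ends a long gap could have happened at any point inside that gap, so the gap length bounds how precisely the change can be dated.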

7. Acquisition Script

Path: scripts/fetch_wayback.py

```bash
# All captures of a specific URL
python3 SKILL_DIR/scripts/fetch_wayback.py --url "https://example.com/page" \
    --out data/wb.csv

# All captures of a host
python3 SKILL_DIR/scripts/fetch_wayback.py --url "example.com" \
    --match host --out data/wb.csv

# All captures of a domain + subdomains
python3 SKILL_DIR/scripts/fetch_wayback.py --url "example.com" \
    --match domain --out data/wb.csv

# Only unique-content captures within a date window
python3 SKILL_DIR/scripts/fetch_wayback.py --url "example.com" \
    --match host --collapse digest \
    --from-date 2020-01-01 --to-date 2023-12-31 \
    --out data/wb.csv
```
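
The effect of collapsing on `digest` can be sketched locally. The CDX server's `collapse` parameter drops rows whose field value repeats the row immediately before it, so identical content that reappears later, after a different version, is still listed. The sketch below mimics that adjacent-only behaviour; `collapse_adjacent` is a hypothetical helper and says nothing about how `fetch_wayback.py` handles its `--collapse` flag internally.

```python
from itertools import groupby


def collapse_adjacent(records, field="digest"):
    """Keep the first record of each run of consecutive equal field values."""
    return [next(group) for _, group in groupby(records, key=lambda r: r[field])]


# Illustrative capture history: content A, unchanged, then B, then back to A.
history = [
    {"timestamp": "20200101000000", "digest": "A"},
    {"timestamp": "20200201000000", "digest": "A"},
    {"timestamp": "20200301000000", "digest": "B"},
    {"timestamp": "20200401000000", "digest": "A"},
]
collapsed = collapse_adjacent(history)
```

The result keeps three records (A, B, A), which is what makes the collapsed list useful as a change log: each remaining row marks a point where the content differed from the previous capture.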

8. Legal & Licensing

  • Internet Archive captures are made under fair-use research provisions
  • Replay URLs are stable references — citing them is encouraged
  • Internet Archive non-profit terms of use govern content
  • Some content is rights-restricted; replay may be blocked even if the CDX entry shows it as captured

9. References