Wikipedia + Wikidata

optional-skills/research/osint-investigation/references/sources/wikipedia.md

2026.6.5 · 4.1 KB

1. Summary

Wikipedia is the canonical narrative-bio source for notable people, places, and organizations. Wikidata is its structured-data counterpart: ~110M items, each with claims, dates, identifiers, and cross-references to external authorities (VIAF, ISNI, ORCID, GRID, etc.).

Together they form a high-precision entity-resolution layer: the notability bar keeps most entities out, but anything that clears it is well cross-referenced.

2. Access Methods

  • Wikipedia OpenSearch: https://en.wikipedia.org/w/api.php?action=opensearch
  • Wikipedia REST summary: https://en.wikipedia.org/api/rest_v1/page/summary/<title>
  • Wikidata Action API: https://www.wikidata.org/w/api.php?action=wbgetentities
  • Wikidata SPARQL: https://query.wikidata.org/sparql (more powerful but aggressively rate-limited)
  • Auth: None, but a meaningful User-Agent is required

Set HERMES_OSINT_UA to something identifying, e.g. your-app/1.0 (<contact email>). Wikimedia may return HTTP 429 to requests that send a generic UA.
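The endpoints above can be exercised with nothing but the standard library. The following is a minimal sketch, not the code of fetch_wikipedia.py: the helper names (build_headers, opensearch_url, summary_url, fetch_json) and the fallback UA string are hypothetical, and it assumes HERMES_OSINT_UA is read as a plain environment variable.

```python
import json
import os
import urllib.parse
import urllib.request

# Hypothetical fallback; replace with a UA that really identifies your tool.
FALLBACK_UA = "your-app/1.0 (contact details here)"


def build_headers() -> dict:
    """Build request headers, preferring the HERMES_OSINT_UA environment variable."""
    ua = os.environ.get("HERMES_OSINT_UA", "").strip() or FALLBACK_UA
    return {"User-Agent": ua, "Accept": "application/json"}


def opensearch_url(query: str, limit: int = 5) -> str:
    """URL for a fuzzy title search via the Wikipedia OpenSearch action."""
    params = {"action": "opensearch", "search": query, "limit": limit, "format": "json"}
    return "https://en.wikipedia.org/w/api.php?" + urllib.parse.urlencode(params)


def summary_url(title: str) -> str:
    """URL for the REST summary of one article; spaces become underscores."""
    encoded = urllib.parse.quote(title.replace(" ", "_"), safe="")
    return "https://en.wikipedia.org/api/rest_v1/page/summary/" + encoded


def fetch_json(url: str, timeout: float = 15.0):
    """GET a URL with the identifying headers and decode the JSON body."""
    request = urllib.request.Request(url, headers=build_headers())
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling fetch_json(opensearch_url("Microsoft")) performs a live request, so it is best kept behind whatever throttling the caller already applies.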

3. Data Schema

Key fields emitted by fetch_wikipedia.py:

| Column | Type | Description |
|---|---|---|
| source | str | wikipedia or wikipedia+wikidata |
| label | str | Wikipedia article title |
| description | str | Short Wikidata description |
| qid | str | Wikidata QID (e.g. Q2283 for Microsoft) |
| wikipedia_title, wikipedia_url | str | Article identifier + URL |
| wikidata_url | str | Wikidata entity URL |
| instance_of | str | What kind of thing it is (P31) |
| country | str | Country (P17 for orgs/places, P27 for people) |
| occupation | str | P106 |
| employer | str | P108 |
| date_of_birth | str | P569, YYYY-MM-DD |
| place_of_birth | str | P19 |
| summary | str | Wikipedia REST extract (~1000 chars) |

The fetch script uses Wikidata's Action API (NOT SPARQL) for structured facts — far more lenient on rate limits.
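To illustrate how property IDs such as P31 or P569 map onto flat columns, here is a rough sketch of reading a wbgetentities response. It is an assumption about the general approach, not the script's actual code: wbgetentities_url, claim_values, and to_row are hypothetical names, and SAMPLE is a hand-written, heavily trimmed response whose values are illustrative only.

```python
from urllib.parse import urlencode


def wbgetentities_url(qid: str, lang: str = "en") -> str:
    """URL for one entity's labels, descriptions, and claims via the Action API."""
    params = {
        "action": "wbgetentities",
        "ids": qid,
        "props": "labels|descriptions|claims",
        "languages": lang,
        "format": "json",
    }
    return "https://www.wikidata.org/w/api.php?" + urlencode(params)


def claim_values(entity: dict, prop: str) -> list:
    """Return the main values of one property, skipping novalue/somevalue snaks."""
    values = []
    for statement in entity.get("claims", {}).get(prop, []):
        snak = statement.get("mainsnak", {})
        if snak.get("snaktype") != "value":
            continue
        value = snak["datavalue"]["value"]
        if isinstance(value, dict) and "id" in value:
            values.append(value["id"])  # item reference: a QID, not a label
        elif isinstance(value, dict) and "time" in value:
            # "+1952-03-11T00:00:00Z" -> "1952-03-11"; precision is ignored here
            values.append(value["time"].lstrip("+")[:10])
        else:
            values.append(str(value))
    return values


def to_row(entity: dict) -> dict:
    """Flatten an entity into a few of the columns listed in the schema table."""
    def first(prop: str) -> str:
        found = claim_values(entity, prop)
        return found[0] if found else ""

    return {
        "qid": entity.get("id", ""),
        "description": entity.get("descriptions", {}).get("en", {}).get("value", ""),
        "instance_of": first("P31"),
        "country": first("P17") or first("P27"),
        "date_of_birth": first("P569"),
    }


def _item(qid: str) -> dict:
    """Build a minimal item-valued statement for the sample below."""
    return {"mainsnak": {"snaktype": "value", "datavalue": {"value": {"id": qid}}}}


# Hand-written, trimmed sample shaped like a wbgetentities response.
SAMPLE = {"entities": {"Q42": {
    "id": "Q42",
    "descriptions": {"en": {"language": "en", "value": "English author and humourist"}},
    "claims": {
        "P31": [_item("Q5")],
        "P27": [_item("Q145")],
        "P569": [{"mainsnak": {"snaktype": "value", "datavalue": {
            "value": {"time": "+1952-03-11T00:00:00Z", "precision": 11}}}}],
    },
}}}

row = to_row(SAMPLE["entities"]["Q42"])
```

Item-valued properties come back as QIDs in this sketch; turning them into readable labels would need a second lookup, which is omitted here.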

4. Coverage

  • Wikipedia EN: ~7M articles
  • Wikidata: ~110M items, ~1.5B statements
  • Updated continuously; abuse filters and bots run constantly
  • High notability bar — most private individuals are not in Wikipedia

5. Cross-Reference Potential

  • All sources: label (entity identity resolution)
  • SEC EDGAR: label (public companies)
  • CourtListener: label (parties to notable litigation)
  • Wikidata external identifiers (not currently in this fetcher's output) link to VIAF, ISNI, ORCID, GRID, GitHub, Twitter, IMDb, ...

Join key: Wikidata QID is canonical. Wikipedia titles are stable for most articles but can be renamed.
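Joining on the QID rather than the title can be sketched as follows. This is one possible approach under stated assumptions: normalize_qid and merge_by_qid are hypothetical helpers, rows are plain dicts with a qid field as in the schema above, and rows without a usable QID are simply dropped.

```python
import re

QID_RE = re.compile(r"Q[1-9][0-9]*")


def normalize_qid(value):
    """Accept 'q2283', ' Q2283 ', or an entity URL; return 'Q2283' or None."""
    candidate = (value or "").strip().rsplit("/", 1)[-1].upper()
    return candidate if QID_RE.fullmatch(candidate) else None


def merge_by_qid(*sources):
    """Merge rows from several sources, keyed by QID; the first non-empty value wins."""
    merged = {}
    for rows in sources:
        for row in rows:
            qid = normalize_qid(row.get("qid"))
            if qid is None:
                continue  # no canonical key, so the row cannot be joined safely
            target = merged.setdefault(qid, {"qid": qid})
            for key, value in row.items():
                if key != "qid" and value and not target.get(key):
                    target[key] = value
    return merged


# Two rows for the same entity under different titles still collapse into one.
first_run = [{"qid": "Q2283", "label": "Microsoft", "country": ""}]
second_run = [
    {"qid": "https://www.wikidata.org/wiki/Q2283", "label": "Microsoft Corporation",
     "country": "United States"},
    {"qid": "", "label": "Unknown LLC"},
]
merged = merge_by_qid(first_run, second_run)
```

Letting the first non-empty value win is a deliberate simplification; a real pipeline might prefer to keep every conflicting value with its source.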

6. Data Quality

  • Notability filter — only notable entities (criteria vary by topic)
  • Recency lag — current events take days to weeks to be reflected
  • POV / vandalism — moderated, but edits between sweeps can be bad
  • Living-persons biographies have stricter sourcing requirements
  • Wikidata claims have qualifiers and references — the fetch script doesn't currently export them
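For readers who need the qualifiers and references that the fetcher leaves out, the sketch below shows where they sit inside a single statement of the Wikidata JSON. The function names are hypothetical, and STATEMENT is a hand-written example with illustrative values rather than a real claim.

```python
def snak_value(snak: dict):
    """Reduce a snak to a simple value: a QID, an ISO date, or a string."""
    if snak.get("snaktype") != "value":
        return None
    value = snak["datavalue"]["value"]
    if isinstance(value, dict):
        if "id" in value:
            return value["id"]
        if "time" in value:
            return value["time"].lstrip("+")[:10]
    return str(value)


def describe_statement(statement: dict) -> dict:
    """Summarize one statement: main value, rank, qualifiers, and reference count."""
    qualifiers = {
        prop: [snak_value(snak) for snak in snaks]
        for prop, snaks in statement.get("qualifiers", {}).items()
    }
    return {
        "value": snak_value(statement.get("mainsnak", {})),
        "rank": statement.get("rank", "normal"),
        "qualifiers": qualifiers,
        "reference_count": len(statement.get("references", [])),
    }


# Hand-written example: an employer (P108) statement with a start-time
# qualifier (P580) and one reference carrying a reference URL (P854).
STATEMENT = {
    "mainsnak": {"snaktype": "value", "datavalue": {"value": {"id": "Q2283"}}},
    "rank": "normal",
    "qualifiers": {
        "P580": [{"snaktype": "value",
                  "datavalue": {"value": {"time": "+2000-01-01T00:00:00Z"}}}],
    },
    "references": [
        {"snaks": {"P854": [{"snaktype": "value",
                             "datavalue": {"value": "https://example.org/source"}}]}},
    ],
}

summary = describe_statement(STATEMENT)
```

A statement with zero references, or with a deprecated rank, is a reasonable signal to treat the claim with more caution.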

7. Acquisition Script

Path: scripts/fetch_wikipedia.py

```bash
# Look up a notable entity
python3 SKILL_DIR/scripts/fetch_wikipedia.py --query "Microsoft" --out data/wp.csv

# A specific person
python3 SKILL_DIR/scripts/fetch_wikipedia.py --query "Bill Gates" --out data/wp_bg.csv

# Skip the Wikidata enrichment for speed
python3 SKILL_DIR/scripts/fetch_wikipedia.py --query "Microsoft" --no-wikidata \
    --limit 5 --out data/wp.csv
```

OpenSearch matching is fuzzy: --limit 5 returns the top 5 Wikipedia article matches. Each match is enriched with its QID and structured facts unless --no-wikidata is passed.
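Because the matches are fuzzy, the output usually needs a pass to pick the intended entity. The following sketch assumes the CSV has a header row with the column names from the schema section; read_rows and rank_matches are hypothetical helpers, and the CSV text is made-up sample output.

```python
import csv
import io


def read_rows(lines) -> list:
    """Parse CSV text (a file object or any iterable of lines) into dict rows."""
    return list(csv.DictReader(lines))


def rank_matches(rows: list, query: str) -> list:
    """Order rows so exact label matches come first, then rows with a QID."""
    wanted = query.casefold().strip()

    def sort_key(indexed):
        position, row = indexed
        exact = (row.get("label") or "").casefold().strip() == wanted
        has_qid = bool(row.get("qid"))
        # False sorts before True, so negate the properties we prefer.
        return (not exact, not has_qid, position)

    return [row for _, row in sorted(enumerate(rows), key=sort_key)]


# Made-up sample resembling a --limit 3 run for the query "Microsoft".
SAMPLE_CSV = """source,label,qid
wikipedia,Microsoft Office,
wikipedia+wikidata,Microsoft,Q2283
wikipedia,Microsoft Windows,
"""

rows = read_rows(io.StringIO(SAMPLE_CSV))
best = rank_matches(rows, "microsoft")[0]
```

With a real output file, the same call works on an open file handle, e.g. open("data/wp.csv", newline="", encoding="utf-8").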

8. Licensing

  • Wikipedia text: CC BY-SA 4.0 / GFDL
  • Wikidata claims: CC0 (public domain)
  • API ToS: respect rate limits, identify your agent
  • Commercial use allowed with attribution

9. References