optional-skills/research/osint-investigation/references/sources/wikipedia.md
Wikipedia is the canonical narrative-bio source for notable people, places, and organizations. Wikidata is its structured-data counterpart: ~110M items, each with claims, dates, identifiers, and cross-references to external authorities (VIAF, ISNI, ORCID, GRID, etc.).
Together they form a high-precision entity-resolution layer: the notability bar for inclusion is real, but anything that clears it is well cross-referenced.
- `https://en.wikipedia.org/w/api.php?action=opensearch`
- `https://en.wikipedia.org/api/rest_v1/page/summary/<title>`
- `https://www.wikidata.org/w/api.php?action=wbgetentities`
- `https://query.wikidata.org/sparql` (more powerful but aggressively rate-limited)

Set `HERMES_OSINT_UA` to something identifying (e.g. `your-app/1.0 ([email protected])`). Wikimedia returns HTTP 429 to generic UAs.
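As a rough illustration of the User-Agent requirement, the sketch below builds (but does not send) an OpenSearch request that carries the value of `HERMES_OSINT_UA`. The function name `build_opensearch_request` and the fallback UA string are hypothetical and chosen for this example only; this is not necessarily how `fetch_wikipedia.py` constructs its requests.

```python
import os
import urllib.parse
import urllib.request

OPENSEARCH_URL = "https://en.wikipedia.org/w/api.php"


def build_opensearch_request(query: str, limit: int = 5) -> urllib.request.Request:
    """Build (without sending) an OpenSearch request with an identifying UA.

    Hypothetical helper for illustration; the fallback UA is a placeholder.
    """
    user_agent = os.environ.get("HERMES_OSINT_UA", "your-app/1.0 (set HERMES_OSINT_UA)")
    params = urllib.parse.urlencode(
        {
            "action": "opensearch",  # fuzzy title search
            "search": query,
            "limit": limit,
            "format": "json",
        }
    )
    return urllib.request.Request(
        f"{OPENSEARCH_URL}?{params}",
        headers={"User-Agent": user_agent},
    )


if __name__ == "__main__":
    req = build_opensearch_request("Microsoft")
    print(req.full_url)
    # urllib normalizes header names, so the key is stored as "User-agent".
    print(req.get_header("User-agent"))
```

Passing the returned object to `urllib.request.urlopen` would perform the actual call; keeping construction separate makes it easy to check the UA before any traffic is sent.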
Key fields emitted by fetch_wikipedia.py:
| Column | Type | Description |
|---|---|---|
| source | str | wikipedia or wikipedia+wikidata |
| label | str | Wikipedia article title |
| description | str | Short Wikidata description |
| qid | str | Wikidata QID (e.g. Q2283 for Microsoft) |
| wikipedia_title, wikipedia_url | str | Article identifier + URL |
| wikidata_url | str | Wikidata entity URL |
| instance_of | str | What kind of thing it is (P31) |
| country | str | Country (P17 for orgs/places, P27 for people) |
| occupation | str | P106 |
| employer | str | P108 |
| date_of_birth | str | P569, YYYY-MM-DD |
| place_of_birth | str | P19 |
| summary | str | Wikipedia REST extract (~1000 chars) |
The fetch script uses Wikidata's Action API (not SPARQL) for structured facts, because the Action API is far more lenient on rate limits.
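To make the property IDs in the table concrete, here is a minimal sketch of pulling values out of a `wbgetentities` response. It assumes the standard Wikibase JSON layout (`claims` → property → `mainsnak` → `datavalue` → `value`). The helper names and the hand-written `SAMPLE_ENTITY` are illustrative assumptions, not taken from `fetch_wikipedia.py`, and the sketch returns raw QIDs rather than resolved labels.

```python
from typing import Any, Optional

# Hand-written fragment shaped like one entity from a wbgetentities response.
SAMPLE_ENTITY: dict[str, Any] = {
    "claims": {
        "P31": [{"mainsnak": {"snaktype": "value", "datavalue": {
            "type": "wikibase-entityid",
            "value": {"entity-type": "item", "id": "Q5"}}}}],
        "P569": [{"mainsnak": {"snaktype": "value", "datavalue": {
            "type": "time",
            "value": {"time": "+1955-10-28T00:00:00Z", "precision": 11}}}}],
    }
}


def first_claim_value(entity: dict[str, Any], prop: str) -> Optional[Any]:
    """Return the first concrete value for a property, or None if absent."""
    for claim in entity.get("claims", {}).get(prop, []):
        snak = claim.get("mainsnak", {})
        # "novalue" / "somevalue" snaks carry no datavalue, so skip them.
        if snak.get("snaktype") == "value":
            return snak["datavalue"]["value"]
    return None


def item_id(entity: dict[str, Any], prop: str) -> Optional[str]:
    """QID of an item-valued property such as P31 or P108."""
    value = first_claim_value(entity, prop)
    return value.get("id") if isinstance(value, dict) else None


def date_ymd(entity: dict[str, Any], prop: str) -> Optional[str]:
    """YYYY-MM-DD part of a time-valued property such as P569."""
    value = first_claim_value(entity, prop)
    if not isinstance(value, dict) or "time" not in value:
        return None
    # Drop the leading sign and the time-of-day suffix.
    return value["time"].lstrip("+")[:10]


if __name__ == "__main__":
    print(item_id(SAMPLE_ENTITY, "P31"), date_ymd(SAMPLE_ENTITY, "P569"))
```

Dates with coarser precision (year-only, for example) come back with zeroed month and day fields, so a real consumer would likely want to check `precision` before treating the string as a full date.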
- label (entity identity resolution)
- label (public companies)
- label (parties to notable litigation)

Join key: Wikidata QID is canonical. Wikipedia titles are stable for most articles but can be renamed.
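A join on the QID might look like the sketch below. It assumes both row sets are lists of dicts with a `qid` key. The `ticker` column and the `join_on_qid` helper are hypothetical, standing in for whatever the other dataset provides.

```python
def join_on_qid(left_rows, right_rows):
    """Inner-join two lists of dict rows on the Wikidata QID.

    Rows with a missing or empty qid are never matched. On column-name
    collisions the left row's value wins.
    """
    right_by_qid = {row["qid"]: row for row in right_rows if row.get("qid")}
    joined = []
    for row in left_rows:
        match = right_by_qid.get(row.get("qid"))
        if match is not None:
            joined.append({**match, **row})
    return joined


if __name__ == "__main__":
    wp_rows = [
        {"qid": "Q2283", "label": "Microsoft"},
        {"qid": "", "label": "Unresolved article"},
    ]
    # "ticker" is a made-up column for illustration.
    other_rows = [{"qid": "Q2283", "ticker": "MSFT"}]
    print(join_on_qid(wp_rows, other_rows))
```

Keying on the QID rather than the title means a renamed Wikipedia article does not silently drop out of the join.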
Path: scripts/fetch_wikipedia.py
```bash
# Look up a notable entity
python3 SKILL_DIR/scripts/fetch_wikipedia.py --query "Microsoft" --out data/wp.csv

# A specific person
python3 SKILL_DIR/scripts/fetch_wikipedia.py --query "Bill Gates" --out data/wp_bg.csv

# Skip the Wikidata enrichment for speed
python3 SKILL_DIR/scripts/fetch_wikipedia.py --query "Microsoft" --no-wikidata \
    --limit 5 --out data/wp.csv
```
OpenSearch matching is fuzzy: `--limit 5` returns the top 5 Wikipedia article matches. Each match is enriched with the QID and structured facts unless `--no-wikidata` is passed.
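Because fuzzy matching can return several candidates, it may help to separate enriched rows from plain ones before using the output. The sketch below assumes the CSV header uses the column names from the table above and that enriched rows carry `wikipedia+wikidata` in `source`. The helper name and the inline sample data are hypothetical.

```python
import csv
import io

# Made-up sample shaped like the script's output, trimmed to three columns.
SAMPLE_CSV = (
    "source,label,qid\n"
    "wikipedia+wikidata,Microsoft,Q2283\n"
    "wikipedia,Microsoft Windows,\n"
)


def split_by_enrichment(csv_file):
    """Split rows into (enriched, plain) based on the `source` column."""
    enriched, plain = [], []
    for row in csv.DictReader(csv_file):
        if row.get("source") == "wikipedia+wikidata":
            enriched.append(row)
        else:
            plain.append(row)
    return enriched, plain


if __name__ == "__main__":
    enriched, plain = split_by_enrichment(io.StringIO(SAMPLE_CSV))
    print(len(enriched), "enriched;", len(plain), "plain")
```

For a real run, the same function could be given `open("data/wp.csv", newline="", encoding="utf-8")` instead of the in-memory sample.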