GDELT — Global News Monitoring

optional-skills/research/osint-investigation/references/sources/gdelt.md

1. Summary

GDELT (Global Database of Events, Language, and Tone) monitors world news in 100+ languages with full-text indexing. Updated every 15 minutes. ~2015 → present, ~1B+ articles indexed. Free anonymous access.

GDELT is broader than Google News (more international, more long-tail sources), and its articles are indexed by tone/sentiment, themes (GKG theme codes such as ECON_BANKRUPTCY; CAMEO codes belong to the separate Events stream), people, and organizations.

2. Access Methods

  • DOC 2.0 API: https://api.gdeltproject.org/api/v2/doc/doc
  • Events / GKG 2.0: https://api.gdeltproject.org/api/v2/events/events
  • Auth: None
  • Rate limit: 1 request per 5 seconds for the DOC API — strict

The fetch script automatically retries after a 6-second sleep when a 429 is received.
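The retry behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code in fetch_gdelt.py: the function name `fetch_with_retry`, the `max_retries` cap, and the injectable `opener` and `sleep` parameters are assumptions made for the example.

```python
import time
import urllib.error
import urllib.request

DOC_API = "https://api.gdeltproject.org/api/v2/doc/doc"
RETRY_SLEEP_SECONDS = 6  # one second above the 5-second rate limit


def fetch_with_retry(url, opener=urllib.request.urlopen, sleep=time.sleep,
                     max_retries=3):
    """Fetch url, sleeping and retrying when the server answers HTTP 429.

    opener, sleep, and max_retries are hypothetical knobs for this sketch.
    """
    for attempt in range(max_retries + 1):
        try:
            with opener(url) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            # Only rate-limit responses are retried; other errors propagate.
            if err.code != 429 or attempt == max_retries:
                raise
            sleep(RETRY_SLEEP_SECONDS)
```

Passing `opener` and `sleep` as parameters keeps the retry logic testable without touching the network or actually waiting.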

3. Data Schema

Key fields emitted by fetch_gdelt.py:

| Column | Type | Description |
|---|---|---|
| title | str | Article title |
| url | str | Article URL |
| seen_date | str | When GDELT first saw the article (UTC) |
| domain | str | Publisher domain |
| language | str | Source language |
| source_country | str | 2-letter country code |
| tone | str | GDELT-computed tone score (negative = negative coverage) |
| social_image | str | Open Graph image URL when available |
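Because every column, including tone, arrives as a string, downstream code has to convert it before comparing values. The sketch below assumes the output is a CSV with a header row matching the column names above; the helper names `load_articles` and `negative_coverage`, the -5.0 threshold, and the sample rows are hypothetical.

```python
import csv
import io


def load_articles(csv_text):
    """Parse CSV text with the columns listed above; tone becomes float or None."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        try:
            row["tone"] = float(row.get("tone", ""))
        except (TypeError, ValueError):
            row["tone"] = None  # blank or malformed tone value
        rows.append(row)
    return rows


def negative_coverage(rows, threshold=-5.0):
    """Keep rows whose tone is below the (assumed) threshold."""
    return [r for r in rows if r["tone"] is not None and r["tone"] < threshold]


# Hypothetical sample data, only for illustration.
SAMPLE = (
    "title,url,domain,tone\n"
    "Firm files for bankruptcy,https://example.com/a,example.com,-7.2\n"
    "Firm opens new office,https://example.com/b,example.com,2.1\n"
    "Untitled,https://example.com/c,example.com,\n"
)

articles = load_articles(SAMPLE)
flagged = negative_coverage(articles)
```

Rows with a missing tone are kept with `None` rather than dropped, so a blank score is never mistaken for neutral coverage.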

4. Coverage

  • Worldwide news in 100+ languages
  • ~2015 → present (Events back to 1979 via a separate stream)
  • Update frequency: 15 minutes
  • Bias: heavily Anglophone in volume but very wide source list overall

5. Cross-Reference Potential

  • All sources ↔ title / url (news context for any subject)
  • Wikipedia ↔ event timeline for notable entities
  • Wayback Machine ↔ recover articles whose URLs have died
  • OFAC SDN ↔ news context for sanctions designations
  • SEC EDGAR ↔ news context for 8-K material events

Join key: entity name appearing in article title or full-text. GDELT also extracts named entities into a separate stream (GKG) not exposed by this fetcher — query GDELT directly for entity-level filtering.
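A join on entity name against article titles might look like the sketch below. It is only an illustration of the join key described above: the function names are hypothetical, matching is a case-insensitive whole-phrase search, and it covers titles only, since the fetcher's columns do not include full text.

```python
import re


def title_mentions(title, entity):
    """True when entity appears in title as a whole phrase, ignoring case."""
    pattern = r"(?<!\w)" + re.escape(entity) + r"(?!\w)"
    return re.search(pattern, title, flags=re.IGNORECASE) is not None


def join_on_entity(entities, articles):
    """Map each entity name to the articles whose title mentions it."""
    return {
        entity: [a for a in articles if title_mentions(a["title"], entity)]
        for entity in entities
    }
```

The word-boundary lookarounds keep a short name from matching inside a longer word, which matters for company names that are also common substrings.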

6. Data Quality

  • Title extraction is automated and can be wrong (sometimes captures the site name + delimiter + article title; sometimes a generic page title)
  • Sentiment / tone is computed by GDELT, not source-supplied
  • Some domains are oversampled (newswires, aggregators)
  • Source country is inferred from domain registration / TLD — can be wrong for international news sites with country-neutral domains
  • Article URLs can rot — pair with Wayback Machine to preserve content
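For the link-rot point above, one lightweight option is to store a Wayback Machine replay URL next to each article URL. The sketch below relies on the `https://web.archive.org/web/<timestamp>/<url>` replay pattern, which resolves to the capture closest to the timestamp; the function name is hypothetical, and how seen_date is parsed into a datetime is left to the caller because its exact string format is not specified here.

```python
from datetime import datetime, timezone


def wayback_url(article_url, seen):
    """Build a Wayback replay URL for the capture nearest to `seen`.

    `seen` is a datetime; naive values are assumed to already be in UTC.
    """
    if seen.tzinfo is None:
        seen = seen.replace(tzinfo=timezone.utc)
    stamp = seen.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"https://web.archive.org/web/{stamp}/{article_url}"


example = wayback_url(
    "https://example.com/story",
    datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc),
)
```

Building the URL does not guarantee that a capture exists; it only points at the nearest one if the page was ever archived.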

7. Acquisition Script

Path: scripts/fetch_gdelt.py

```bash
# Recent news mentioning an entity
python3 SKILL_DIR/scripts/fetch_gdelt.py --query "Nous Research" \
    --timespan 6m --out data/gdelt.csv

# Phrase-exact (use double quotes inside single quotes for the shell)
python3 SKILL_DIR/scripts/fetch_gdelt.py --query '"Dillon Rolnick"' \
    --timespan 1y --out data/gdelt.csv

# Filter to a country / language
python3 SKILL_DIR/scripts/fetch_gdelt.py --query "Microsoft" \
    --source-country US --source-lang English --out data/gdelt.csv

# Date range
python3 SKILL_DIR/scripts/fetch_gdelt.py --query "Microsoft" \
    --start 2024-01-01 --end 2024-12-31 --out data/gdelt.csv
```

GDELT supports its own query operators: phrase quoting, OR inside parentheses, a leading minus sign to exclude a term (space-separated terms are ANDed implicitly), sourcecountry:US, theme:ECON_BANKRUPTCY, tone<-5, etc. See https://blog.gdeltproject.org/gdelt-doc-2-0-api-debuts/ for syntax.
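To query GDELT directly with these operators, the query string has to be URL-encoded into a DOC API request. The sketch below is an assumption about how such a request could be assembled, not a description of what fetch_gdelt.py does internally; the parameter names `query`, `mode`, `format`, `maxrecords`, and `timespan` follow the DOC 2.0 API post linked above, and `build_doc_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

DOC_API = "https://api.gdeltproject.org/api/v2/doc/doc"


def build_doc_url(query, timespan=None, max_records=75):
    """Return a DOC API article-list URL for a raw GDELT query string."""
    params = {
        "query": query,          # operators stay inside the query string
        "mode": "ArtList",       # list of matching articles
        "format": "json",
        "maxrecords": max_records,
    }
    if timespan:
        params["timespan"] = timespan  # e.g. "6m" for six months
    return DOC_API + "?" + urlencode(params)


url = build_doc_url('"Nous Research" sourcecountry:US tone<-5', timespan="6m")
```

`urlencode` takes care of the quotes, colon, and `<` characters, so the operators can be written exactly as they appear in the GDELT syntax. Requests built this way are still subject to the rate limit in section 2.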

8. Licensing and Use

  • GDELT data is provided free for academic and journalistic use
  • Article URLs link out to original publishers — copyright remains with the publisher
  • GDELT is NOT a content archive; it's a metadata index

9. References