GDELT — Global News Monitoring

1. Summary

GDELT (Global Database of Events, Language, and Tone) monitors world news in 100+ languages with full-text indexing. It updates every 15 minutes, covers roughly 2015 to the present (over 1 billion articles indexed), and allows free anonymous access.

GDELT covers a wider range of sources than Google News (more international, more long-tail) and is indexed by tone/sentiment, themes (GKG theme codes; CAMEO codes classify events in the separate Events stream), people, and organizations.

2. Access Methods

DOC 2.0 API: https://api.gdeltproject.org/api/v2/doc/doc
Events / GKG 2.0: https://api.gdeltproject.org/api/v2/events/events
Auth: None
Rate limit: 1 request per 5 seconds for the DOC API, strictly enforced

The fetch script automatically retries after a 6-second sleep when a 429 is received.
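
The retry behaviour described above could look roughly like the following. This is a minimal sketch under stated assumptions, not the actual code in fetch_gdelt.py: the name `get_with_retry`, the `max_retries` cap, and the injectable `fetch` and `sleep` callables are hypothetical and exist only to keep the example self-contained.

```python
import time
from typing import Callable, Tuple

RETRY_SLEEP_SECONDS = 6  # one second longer than the 5-second DOC API limit


def get_with_retry(
    fetch: Callable[[str], Tuple[int, str]],
    url: str,
    max_retries: int = 3,  # hypothetical cap; the real script may differ
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch(url); on HTTP 429, sleep a fixed interval and try again."""
    for attempt in range(max_retries + 1):
        status, body = fetch(url)
        if status == 429:
            # Rate-limited: wait before the next attempt, unless none remain.
            if attempt < max_retries:
                sleep(RETRY_SLEEP_SECONDS)
            continue
        if status != 200:
            raise RuntimeError(f"unexpected HTTP status {status}")
        return body
    raise RuntimeError(f"still rate-limited after {max_retries} retries")
```

Here `fetch` is any callable returning a `(status_code, body)` pair, so the retry policy can be exercised without touching the network.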

3. Data Schema

Key fields emitted by fetch_gdelt.py:

Column	Type	Description
`title`	str	Article title
`url`	str	Article URL
`seen_date`	str	When GDELT first saw the article (UTC)
`domain`	str	Publisher domain
`language`	str	Source language
`source_country`	str	2-letter country code
`tone`	str	GDELT-computed tone score (negative = negative coverage)
`social_image`	str	Open Graph image URL when available
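
Because every column, including `tone`, is emitted as a string, downstream code has to convert values before comparing them. The sketch below assumes the output is a CSV file with a header row matching the column names above; the helper names `parse_tone` and `read_articles` are hypothetical, and the sample rows are illustrative, not real GDELT output.

```python
import csv
import io
from typing import Dict, Iterator, Optional, TextIO


def parse_tone(raw: str) -> Optional[float]:
    """Convert the string `tone` column to a float, or None if blank or malformed."""
    try:
        return float(raw)
    except (TypeError, ValueError):
        return None


def read_articles(handle: TextIO) -> Iterator[Dict[str, object]]:
    """Yield one dict per CSV row, with `tone` converted to a number."""
    for row in csv.DictReader(handle):
        record: Dict[str, object] = dict(row)
        record["tone"] = parse_tone(row.get("tone", ""))
        yield record


# Illustrative rows only (assumed header layout, made-up values).
sample = io.StringIO(
    "title,url,seen_date,domain,language,source_country,tone,social_image\n"
    "Example headline,https://example.com/a,,example.com,English,US,-3.2,\n"
    "Another headline,https://example.com/b,,example.com,English,US,,\n"
)

# Keep only articles whose tone is present and negative.
negative = [
    r["title"]
    for r in read_articles(sample)
    if r["tone"] is not None and r["tone"] < 0
]
print(negative)
```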

4. Coverage

Worldwide news in 100+ languages
~2015 → present (Events back to 1979 via a separate stream)
Update frequency: 15 minutes
Bias: heavily Anglophone in volume but very wide source list overall

5. Cross-Reference Potential

All sources ↔ title / url (news context for any subject)
Wikipedia ↔ event timeline for notable entities
Wayback Machine ↔ recover articles whose URLs have died
OFAC SDN ↔ news context for sanctions designations
SEC EDGAR ↔ news context for 8-K material events

Join key: entity name appearing in article title or full-text. GDELT also extracts named entities into a separate stream (GKG) not exposed by this fetcher — query GDELT directly for entity-level filtering.
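
A title-based join of the kind described above can be approximated with a case-insensitive whole-phrase match. This is a rough sketch, not part of the fetcher: `title_mentions` and `link_entities` are hypothetical helper names, and matching on titles alone will miss articles that mention the entity only in the body text.

```python
import re
from typing import Dict, Iterable, List


def title_mentions(title: str, entity: str) -> bool:
    """True if `entity` appears in `title` as a whole phrase, ignoring case."""
    # The lookarounds stop "Microsoft" from matching inside "Microsoftware".
    pattern = r"(?<!\w)" + re.escape(entity) + r"(?!\w)"
    return re.search(pattern, title, flags=re.IGNORECASE) is not None


def link_entities(titles: Iterable[str], entities: Iterable[str]) -> Dict[str, List[str]]:
    """Map each entity name to the list of titles that mention it."""
    entity_list = list(entities)
    matches: Dict[str, List[str]] = {name: [] for name in entity_list}
    for title in titles:
        for name in entity_list:
            if title_mentions(title, name):
                matches[name].append(title)
    return matches
```

The entity list could come from any of the sources named above, for example names taken from OFAC SDN entries or SEC EDGAR filers.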

6. Data Quality

Title extraction is automated and can be wrong (sometimes captures the site name + delimiter + article title; sometimes a generic page title)
Sentiment / tone is computed by GDELT, not source-supplied
Some domains are oversampled (newswires, aggregators)
Source country is inferred from domain registration / TLD — can be wrong for international news sites with country-neutral domains
Article URLs can rot — pair with Wayback Machine to preserve content
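
For the URL-rot case, one option is the Internet Archive's availability endpoint (`https://archive.org/wayback/available`), which reports the closest archived snapshot for a URL. The sketch below is an illustration under that assumption and is not part of fetch_gdelt.py; the function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

WAYBACK_AVAILABLE = "https://archive.org/wayback/available"


def wayback_lookup_url(article_url: str) -> str:
    """Build the availability-API request URL for one article URL."""
    return WAYBACK_AVAILABLE + "?" + urllib.parse.urlencode({"url": article_url})


def closest_snapshot(payload: dict) -> Optional[str]:
    """Extract the archived snapshot URL from an availability response, if any."""
    closest = payload.get("archived_snapshots", {}).get("closest") or {}
    if closest.get("available"):
        return closest.get("url")
    return None


def find_snapshot(article_url: str, timeout: float = 10.0) -> Optional[str]:
    """Query the Wayback Machine; returns None when no snapshot is reported."""
    with urllib.request.urlopen(wayback_lookup_url(article_url), timeout=timeout) as resp:
        return closest_snapshot(json.load(resp))
```

Only `find_snapshot` touches the network; the other two functions are pure, so the request building and response parsing can be checked offline.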

7. Acquisition Script

Path: scripts/fetch_gdelt.py

```bash
# Recent news mentioning an entity
python3 SKILL_DIR/scripts/fetch_gdelt.py --query "Nous Research" \
    --timespan 6m --out data/gdelt.csv

# Phrase-exact (use double quotes inside single quotes for the shell)
python3 SKILL_DIR/scripts/fetch_gdelt.py --query '"Dillon Rolnick"' \
    --timespan 1y --out data/gdelt.csv

# Filter to a country / language
python3 SKILL_DIR/scripts/fetch_gdelt.py --query "Microsoft" \
    --source-country US --source-lang English --out data/gdelt.csv

# Date range
python3 SKILL_DIR/scripts/fetch_gdelt.py --query "Microsoft" \
    --start 2024-01-01 --end 2024-12-31 --out data/gdelt.csv
```

GDELT supports its own query operators: phrase quoting, AND/OR/NOT, sourcecountry:US, theme:ECON_BANKRUPTCY, tone<-5, etc. See https://blog.gdeltproject.org/gdelt-doc-2-0-api-debuts/ for syntax.
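
To make the operator syntax concrete, the sketch below assembles a DOC 2.0 request URL directly, combining inline operators in the `query` parameter with the API's `mode`, `format`, `maxrecords`, `startdatetime`, and `enddatetime` parameters. How fetch_gdelt.py maps its own flags onto these parameters is not documented here, so `build_doc_url` and its arguments are hypothetical and should be read as one plausible mapping.

```python
from datetime import date
from typing import Optional
from urllib.parse import urlencode

DOC_API = "https://api.gdeltproject.org/api/v2/doc/doc"


def build_doc_url(
    terms: str,
    source_country: Optional[str] = None,
    theme: Optional[str] = None,
    start: Optional[date] = None,
    end: Optional[date] = None,
    max_records: int = 75,
) -> str:
    """Assemble a DOC 2.0 article-list URL from search terms and optional filters."""
    # Inline operators are appended to the free-text query, separated by spaces.
    parts = [terms]
    if source_country:
        parts.append(f"sourcecountry:{source_country}")
    if theme:
        parts.append(f"theme:{theme}")
    params = {
        "query": " ".join(parts),
        "mode": "artlist",
        "format": "json",
        "maxrecords": str(max_records),
    }
    # The API expects timestamps as YYYYMMDDHHMMSS.
    if start:
        params["startdatetime"] = start.strftime("%Y%m%d") + "000000"
    if end:
        params["enddatetime"] = end.strftime("%Y%m%d") + "235959"
    return DOC_API + "?" + urlencode(params)


print(build_doc_url('"Nous Research"', source_country="US",
                    start=date(2024, 1, 1), end=date(2024, 12, 31)))
```

Any request built this way is still subject to the rate limit in section 2.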

8. Legal & Licensing

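GDELT data is provided free of charge for academic, commercial, and governmental use; the project's terms of use ask for a citation and a link back to the GDELT Project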
Article URLs link out to original publishers — copyright remains with the publisher
GDELT is NOT a content archive; it's a metadata index

9. References

DOC 2.0 API: https://blog.gdeltproject.org/gdelt-doc-2-0-api-debuts/
Themes & query syntax: https://blog.gdeltproject.org/gkg-2-0-our-global-knowledge-graph-2-0-amazing-data-at-your-fingertips/
Project home: https://www.gdeltproject.org/