Word Lens data attributions

apps/readest-app/data/wordlens/ATTRIBUTION.md

Note: The committed en-zh.json (top 30,000 by COCA frequency) and zh-en.json (top 12,000, HSK-ranked) are frequency-trimmed derivatives generated from the datasets below via scripts/build-wordlens-data.mjs. The full source corpora (ECDICT ~66 MB, CC-CEDICT) are not committed; regenerate the packs with the commands below. Glosses are cleaned (POS tags, […] annotations, and CC-CEDICT CL: clauses stripped) and capped to the first 1–2 short senses.
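The gloss cleanup described in the note can be illustrated with a small sketch. This is not the code in scripts/build-wordlens-data.mjs: the function name `cleanGloss`, the list of POS abbreviations, the sense separators, and the `maxLen` cutoff for a "short" sense are all assumptions made for illustration.

```javascript
// Hypothetical sketch of gloss cleaning; not the actual build script logic.
// Assumed: senses are separated by "/", ";", "；" or a newline, and POS tags
// look like "n." or "vt." at the start of a sense.
const POS_TAG = /^\s*(?:n|v|vt|vi|adj|adv|a|prep|conj|pron|int|num|art)\.\s*/i;

function cleanGloss(raw, maxSenses = 2, maxLen = 24) {
  return raw
    .split(/[\/;；\n]/)                 // split the raw entry into candidate senses
    .map((sense) =>
      sense
        .replace(/\[[^\]]*\]/g, '')     // drop [...] annotations
        .replace(POS_TAG, '')           // drop a leading POS tag
        .trim()
    )
    .filter((sense) => sense.length > 0)          // drop empty fragments
    .filter((sense) => !sense.startsWith('CL:'))  // drop CC-CEDICT classifier clauses
    .filter((sense) => sense.length <= maxLen)    // keep only short senses
    .slice(0, maxSenses);                         // cap to the first few senses
}
```

For example, `cleanGloss('/apple/CL:個|个[ge4]/')` returns `['apple']` under these assumptions. A fixed POS list is used rather than a generic "letters followed by a dot" pattern so that abbreviations inside real glosses are less likely to be stripped by mistake.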

The en↔中文 packs use dedicated dictionaries, which give the highest quality for that pair; all other pairs (es/fr/de/pt/it/ru ↔ en) use the lightweight WikDict + FrequencyWords stack.

Sources

The committed *.json packs are frequency-trimmed derivatives generated by scripts/build-wordlens-data.mjs. Redistribution complies with the above licenses; the CC-BY-SA packs (everything except en-zh.json) carry their license and attribution in each pack's meta. Per-pack sizes: 0.3–2.6 MB.
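As a rough illustration of what "frequency-trimmed" means here, the sketch below walks a frequency list from most to least frequent and keeps the first `limit` words that have a dictionary entry. The function name `trimByFrequency`, the `Map`-based dictionary shape, and the choice to count only words that have a gloss are assumptions; the build script may rank or count differently.

```javascript
// Hypothetical sketch of frequency trimming; not the actual build script logic.
// Assumed input: one "<word> <count>" pair per line, most frequent first
// (the layout used by FrequencyWords lists).
function trimByFrequency(frequencyText, dictionary, limit) {
  const pack = new Map();
  for (const line of frequencyText.split('\n')) {
    if (pack.size >= limit) break;              // stop once the pack is full
    const word = line.trim().split(/\s+/)[0];   // first column is the word
    if (!word || pack.has(word)) continue;      // skip blank lines and repeats
    const gloss = dictionary.get(word);
    if (gloss === undefined) continue;          // skip words with no entry
    pack.set(word, gloss);
  }
  return pack;
}

// Usage with toy data:
const dictionary = new Map([['the', '这'], ['cat', '猫'], ['dog', '狗']]);
const top2 = trimByFrequency('the 100\ncat 50\nxyzzy 40\ndog 30\n', dictionary, 2);
console.log([...top2.keys()]); // [ 'the', 'cat' ]
```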

Regenerating the assets (maintainer step)

```bash
# English → 中文 (ECDICT csv) and 中文 → English (CC-CEDICT + HSK)
node scripts/build-wordlens-data.mjs en-zh /path/to/ecdict.csv 30000
node scripts/build-wordlens-data.mjs zh-en /path/to/cedict.txt /path/to/hsk.json 12000

# Any other pair (one side English) from WikDict + FrequencyWords. Needs the `sqlite3`
# CLI. Download <pair>.sqlite3 from https://download.wikdict.com/dictionaries/sqlite/2/
# and <src>_50k.txt from FrequencyWords/content/2018/<src>/.
node scripts/build-wordlens-data.mjs build-wikdict es en es_50k.txt es-en.sqlite3 20000
node scripts/build-wordlens-data.mjs build-wikdict en es en_50k.txt en-es.sqlite3 20000
# (repeat for fr/de/pt/it/ru ↔ en)

# For maximum coverage, the kaikki Wiktionary dump can be used instead of WikDict:
node scripts/build-wordlens-data.mjs build es en es_50k.txt /path/to/es-extract.jsonl 20000
```
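To avoid typing the WikDict command for every pair by hand, the "repeat for fr/de/pt/it/ru ↔ en" step could be enumerated with a short script. This is only a sketch: the helper name `wikdictCommands` is hypothetical, and it assumes the downloaded `<src>_50k.txt` and `<src>-<dst>.sqlite3` files sit in the current directory and follow the same naming as the es/en examples above.

```javascript
// Hypothetical helper that prints one build-wikdict command per direction.
// File names mirror the es/en examples; adjust paths to wherever the
// downloads actually live.
const LANGS = ['es', 'fr', 'de', 'pt', 'it', 'ru'];

function wikdictCommands(limit = 20000) {
  const commands = [];
  for (const lang of LANGS) {
    // Each language is built in both directions: <lang> → en and en → <lang>.
    for (const [src, dst] of [[lang, 'en'], ['en', lang]]) {
      commands.push(
        `node scripts/build-wordlens-data.mjs build-wikdict ${src} ${dst} ` +
          `${src}_50k.txt ${src}-${dst}.sqlite3 ${limit}`
      );
    }
  }
  return commands;
}

console.log(wikdictCommands().join('\n'));
```

The script only prints the commands, so they can be reviewed before being pasted into a shell or piped to one.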