apps/readest-app/data/wordlens/ATTRIBUTION.md
Note: The committed `en-zh.json` (top 30,000 by COCA frequency) and `zh-en.json` (top 12,000, HSK-ranked) are frequency-trimmed derivatives generated from the datasets below via `scripts/build-wordlens-data.mjs`. The full source corpora (ECDICT ~66 MB, CC-CEDICT) are not committed; regenerate them with the commands below. Glosses are cleaned (POS tags, `[…]` annotations, and CC-CEDICT `CL:` clauses stripped) and capped to the first 1–2 short senses.
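The gloss cleanup described in the note can be sketched roughly as follows. This is an illustration only, not the code in `scripts/build-wordlens-data.mjs`: the helper name `cleanGloss`, the POS tag list, and the assumption that senses are separated by `/` or `;` are all hypothetical.

```javascript
// Hypothetical helper; the real build script may clean glosses differently.
// Assumes senses are separated by "/" or ";" and POS tags look like "n." or "vt.".
function cleanGloss(raw, maxSenses = 2) {
  const cleaned = raw
    // Drop bracketed annotations such as pinyin or usage notes: [ben3], [informal]
    .replace(/\[[^\]]*\]/g, '')
    // Drop CC-CEDICT classifier clauses, e.g. "CL:本,冊|册"
    .replace(/\bCL:[^/;]*/g, '')
    // Drop leading part-of-speech tags (assumed tag list)
    .replace(/\b(?:n|v|vt|vi|adj|adv|prep|conj|pron|int)\.\s*/g, '');

  return cleaned
    .split(/[\/;]/)            // split into individual senses
    .map((s) => s.trim())
    .filter((s) => s.length > 0)
    .slice(0, maxSenses)       // cap to the first few senses
    .join('; ');
}

console.log(cleanGloss('/book/letter/CL:本[ben3],冊|册[ce4]/')); // book; letter
console.log(cleanGloss('n. 书; 书籍; 卷'));                       // 书; 书籍
```

The sketch caps by sense count only; a length limit per sense (the "short" in "1–2 short senses") would need an extra filter.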
The en↔中文 packs use dedicated dictionaries (highest quality there); all other pairs (es/fr/de/pt/it/ru ↔ en) use the lightweight WikDict + FrequencyWords stack.
corriendo→correr), compiled from Wiktionary + open lemma data. https://github.com/michmech/lemmatization-lists (English form→lemma comes from ECDICT's `exchange` field instead.)

The committed `*.json` packs are frequency-trimmed derivatives generated by `scripts/build-wordlens-data.mjs`. Redistribution complies with the above licenses; the CC-BY-SA packs (everything except `en-zh.json`) carry their license + attribution in each pack's `meta`. Per-pack sizes: 0.3–2.6 MB.
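For the English form→lemma mapping mentioned above, a minimal sketch of reading an ECDICT-style `exchange` value might look like this. It assumes the `type:form/type:form` layout, where type `0` gives the lemma of an inflected headword and type `1` describes the relation rather than a form. The function names are hypothetical and the actual build script may handle this differently.

```javascript
// Split an exchange string such as "d:perceived/p:perceived/3:perceives"
// into a Map of type -> form. Malformed parts are skipped.
function parseExchange(exchange) {
  const fields = new Map();
  for (const part of exchange.split('/')) {
    const idx = part.indexOf(':');
    if (idx <= 0) continue;
    fields.set(part.slice(0, idx), part.slice(idx + 1));
  }
  return fields;
}

// Record form -> lemma pairs for one dictionary row (hypothetical helper).
function addLemmaEntries(word, exchange, formToLemma) {
  const fields = parseExchange(exchange);
  if (fields.has('0')) {
    // The headword is itself an inflected form; type "0" names its lemma.
    formToLemma.set(word, fields.get('0'));
    return;
  }
  for (const [type, form] of fields) {
    if (type === '1') continue; // relation marker, not a word form
    if (form && form !== word && !formToLemma.has(form)) {
      formToLemma.set(form, word);
    }
  }
}

const formToLemma = new Map();
addLemmaEntries('perceive', 'd:perceived/p:perceived/3:perceives/i:perceiving', formToLemma);
addLemmaEntries('went', '0:go/1:p', formToLemma);
console.log(formToLemma.get('perceiving'), formToLemma.get('went')); // perceive go
```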
# English → 中文 (ECDICT csv) and 中文 → English (CC-CEDICT + HSK)
node scripts/build-wordlens-data.mjs en-zh /path/to/ecdict.csv 30000
node scripts/build-wordlens-data.mjs zh-en /path/to/cedict.txt /path/to/hsk.json 12000
# Any other pair (one side English) from WikDict + FrequencyWords. Needs the `sqlite3`
# CLI. Download <pair>.sqlite3 from https://download.wikdict.com/dictionaries/sqlite/2/
# and <src>_50k.txt from FrequencyWords/content/2018/<src>/.
node scripts/build-wordlens-data.mjs build-wikdict es en es_50k.txt es-en.sqlite3 20000
node scripts/build-wordlens-data.mjs build-wikdict en es en_50k.txt en-es.sqlite3 20000
# (repeat for fr/de/pt/it/ru ↔ en)
# For maximum coverage, the kaikki Wiktionary dump can be used instead of WikDict:
node scripts/build-wordlens-data.mjs build es en es_50k.txt /path/to/es-extract.jsonl 20000
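The trailing number in each command (30000, 12000, 20000) appears to be the entry cap used for frequency trimming. As a rough sketch of that idea, assuming FrequencyWords-style input with one `word count` pair per line, the trimming could work as below. `parseFrequencyList` and `trimToTopN` are hypothetical names, not functions from the build script.

```javascript
// Parse lines like "de 14459520" into words ordered by descending count.
// Blank or malformed lines are ignored.
function parseFrequencyList(text) {
  return text
    .split(/\r?\n/)
    .map((line) => line.trim().split(/\s+/))
    .filter((cols) => cols.length === 2 && /^\d+$/.test(cols[1]))
    .map(([word, count]) => ({ word, count: Number(count) }))
    .sort((a, b) => b.count - a.count)
    .map((entry) => entry.word);
}

// Keep at most `limit` of the most frequent words that have a gloss.
function trimToTopN(rankedWords, glosses, limit) {
  const kept = [];
  for (const word of rankedWords) {
    if (kept.length >= limit) break;
    const gloss = glosses.get(word);
    if (!gloss) continue; // frequent word with no dictionary entry
    kept.push([word, gloss]);
  }
  return Object.fromEntries(kept);
}

const ranked = parseFrequencyList('de 100\nla 90\n\nque 95\n');
const glosses = new Map([['de', 'of'], ['la', 'the']]);
console.log(JSON.stringify(trimToTopN(ranked, glosses, 5))); // {"de":"of","la":"the"}
```

Words without a gloss do not count toward the limit in this sketch, so a pack may contain fewer than `limit` entries when the dictionary is sparse.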