Word Lens data attributions

Note: The committed en-zh.json (top 30,000 entries by COCA frequency) and zh-en.json (top 12,000 entries, HSK-ranked) are frequency-trimmed derivatives generated from the datasets below by scripts/build-wordlens-data.mjs. The full source corpora (ECDICT, about 66 MB, and CC-CEDICT) are not committed; regenerate the packs with the commands under "Regenerating the assets". Glosses are cleaned (POS tags, […] annotations, and CC-CEDICT CL: classifier clauses are stripped) and capped at the first one or two short senses.
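The gloss cleanup described above can be sketched roughly as follows. This is an illustration only, not the code in scripts/build-wordlens-data.mjs: the function name cleanGloss, the POS tag list, the sense separators, and the 24-character cutoff for a "short" sense are all assumptions.

```javascript
// Hypothetical helper, not taken from scripts/build-wordlens-data.mjs.
// Assumes senses are separated by a literal "\n" sequence, a newline, "/", or ";".
const POS_TAG = /^\s*(?:n|v|vt|vi|adj|adv|a|ad|prep|conj|pron|interj|int|num|art|aux)\.\s*/i;

function cleanGloss(raw, maxSenses = 2, maxLen = 24) {
  return raw
    .split(/\\n|\n|\/|;|；/)
    .map((sense) =>
      sense
        .replace(POS_TAG, "")          // leading POS tag such as "n." or "vt."
        .replace(/\[[^\]]*\]/g, "")    // [...] annotations (pinyin, usage notes)
        .replace(/\bCL:.*$/, "")       // CC-CEDICT classifier clause
        .trim()
    )
    .filter((sense) => sense.length > 0 && sense.length <= maxLen) // keep short senses only
    .slice(0, maxSenses);              // cap at the first one or two senses
}

console.log(cleanGloss("/book/CL:本[ben3],冊|册[ce4]/")); // [ 'book' ]
```

The explicit POS list is deliberately conservative, so ordinary glosses that merely begin with a short abbreviation are left alone.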

The en↔中文 packs use dedicated dictionaries, which give the highest-quality glosses for that pair; all other pairs (es/fr/de/pt/it/ru ↔ en) use the lightweight WikDict + FrequencyWords stack.

Sources

ECDICT (English→中文 glosses, frequency ranks, inflections) — MIT License. https://github.com/skywind3000/ECDICT
CC-CEDICT (中文→English glosses) — CC-BY-SA 4.0. https://cc-cedict.org/
HSK vocabulary levels (中文 difficulty ranking) — open dataset. https://github.com/drkameleon/complete-hsk-vocabulary
WikDict (bilingual glosses for es/fr/de/pt/it/ru ↔ en) — CC-BY-SA 3.0, derived from DBnary / Wiktionary. https://www.wikdict.com/
FrequencyWords (difficulty ranking; word-frequency lists from OpenSubtitles/OPUS) — CC-BY-SA 4.0. https://github.com/hermitdave/FrequencyWords
lemmatization-lists (form→lemma mappings used to lemmatize non-English source words, e.g. Spanish corriendo→correr) — compiled from Wiktionary + open lemma data. https://github.com/michmech/lemmatization-lists (English form→lemma comes from ECDICT's exchange field instead.)
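As a rough illustration of the form→lemma step for non-English source words, the sketch below builds a lookup table from a lemmatization list. It assumes each line is tab-separated with the lemma in the first column and the inflected form in the second, so check the file you download. The names parseLemmaList and lemmatize are hypothetical and are not taken from the build script.

```javascript
// Hypothetical helpers; assumes "lemma<TAB>form" lines, one pair per line.
function parseLemmaList(text) {
  const formToLemma = new Map();
  for (const line of text.replace(/^\uFEFF/, "").split(/\r?\n/)) {
    const [lemma, form] = line.split("\t");
    if (!lemma || !form) continue;               // skip blank or malformed lines
    const key = form.trim().toLowerCase();
    if (!formToLemma.has(key)) {
      formToLemma.set(key, lemma.trim().toLowerCase()); // first mapping wins
    }
  }
  return formToLemma;
}

// Falls back to the (lowercased) word itself when no mapping is known.
function lemmatize(word, formToLemma) {
  const key = word.toLowerCase();
  return formToLemma.get(key) ?? key;
}

const table = parseLemmaList("correr\tcorriendo\ncorrer\tcorre\n");
console.log(lemmatize("Corriendo", table)); // correr
```

Keeping the first mapping for an ambiguous form is one simple policy; a real build might instead prefer the lemma with the better frequency rank.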

The committed *.json packs (frequency-trimmed derivatives generated by scripts/build-wordlens-data.mjs) are redistributed in compliance with the licenses above. Each CC-BY-SA pack (every pack except en-zh.json) carries its license and attribution in its meta. Pack sizes range from 0.3 to 2.6 MB.

Regenerating the assets (maintainer step)

# English → 中文 (ECDICT csv) and 中文 → English (CC-CEDICT + HSK)
node scripts/build-wordlens-data.mjs en-zh /path/to/ecdict.csv 30000
node scripts/build-wordlens-data.mjs zh-en /path/to/cedict.txt /path/to/hsk.json 12000

# Any other pair (one side English) from WikDict + FrequencyWords. Needs the `sqlite3`
# CLI. Download <pair>.sqlite3 from https://download.wikdict.com/dictionaries/sqlite/2/
# and <src>_50k.txt from FrequencyWords/content/2018/<src>/.
node scripts/build-wordlens-data.mjs build-wikdict es en es_50k.txt es-en.sqlite3 20000
node scripts/build-wordlens-data.mjs build-wikdict en es en_50k.txt en-es.sqlite3 20000
# (repeat for fr/de/pt/it/ru ↔ en)

# For maximum coverage, the kaikki Wiktionary dump can be used instead of WikDict:
node scripts/build-wordlens-data.mjs build es en es_50k.txt /path/to/es-extract.jsonl 20000
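The trailing number in each command (30000, 12000, 20000) appears to be the entry cap for the resulting pack. A minimal sketch of that kind of frequency trimming is shown below, assuming a FrequencyWords-style list with one "word count" pair per line, most frequent first. The function names and the shape of the output entries are hypothetical and do not reflect the actual pack format.

```javascript
// Hypothetical helpers; assumes "word count" lines sorted by descending frequency.
function parseFrequencyList(text) {
  const ranks = new Map();
  for (const line of text.split(/\r?\n/)) {
    const [word] = line.trim().split(/\s+/);
    if (word && !ranks.has(word)) ranks.set(word, ranks.size + 1); // rank 1 = most frequent
  }
  return ranks;
}

// Keeps the `limit` most frequent words that actually have a gloss.
function trimToTop(glosses, ranks, limit) {
  return [...ranks.entries()]                    // Map preserves insertion (rank) order
    .filter(([word]) => glosses.has(word))
    .slice(0, limit)
    .map(([word, rank]) => ({ word, rank, gloss: glosses.get(word) }));
}

const ranks = parseFrequencyList("de 100\nla 90\nque 80\n");
const glosses = new Map([["que", "that"], ["de", "of"]]);
console.log(trimToTop(glosses, ranks, 2).map((e) => e.word)); // [ 'de', 'que' ]
```

Filtering before slicing means words without a gloss do not use up slots in the cap, which keeps the pack close to the requested size.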