chargpt

projects/chargpt/readme.md

chargpt trains a character-level language model.

We support three settings: 1 convenience setting and 2 "benchmark" settings that have academic literature results:

  • a user-specified input.txt file that we train an LM on (e.g. get tiny-shakespeare (1.1MB of data) here)
  • TODO text8: also derived from Wikipedia text but all XML is removed and is lowercased to only 26 characters of English plus spaces
  • TODO enwik8 benchmark ("Hutter Prize"), first 100M bytes of a Wikipedia XML dump, with 205 unique tokens
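
To make "character-level" concrete, here is a minimal sketch of how a text file such as input.txt could be turned into training examples. It is an illustration only, not the project's actual code: the helper names `build_vocab`, `encode`, `decode`, and `get_example` and the `block_size` parameter are assumptions made for this example.

```python
# Hypothetical sketch of character-level tokenization; all helper names here
# are illustrative assumptions, not functions taken from this project.

def build_vocab(text):
    """Map each unique character to an integer id (sorted for determinism)."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}
    return stoi, itos

def encode(text, stoi):
    """Convert a string into a list of integer character ids."""
    return [stoi[ch] for ch in text]

def decode(ids, itos):
    """Convert a list of integer character ids back into a string."""
    return "".join(itos[i] for i in ids)

def get_example(ids, idx, block_size):
    """Return (x, y), where y is x shifted one character to the right."""
    chunk = ids[idx : idx + block_size + 1]
    return chunk[:-1], chunk[1:]

# In practice the text would be read from a file such as input.txt;
# a short inline string keeps this sketch self-contained.
text = "hello world"
stoi, itos = build_vocab(text)
ids = encode(text, stoi)
x, y = get_example(ids, 0, block_size=4)
```

Each target in `y` is the character that follows the corresponding position in `x`, so the model learns to predict the next character. The vocabulary size is simply the number of distinct characters in the data, which is why the benchmarks above are described by their character or token counts.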