data/datasets/gutenberg/README.md
A collection of 7907 non-English ebooks (about 75-80% of all the ES, DE, FR,
NL, IT, PT and HU books available on the site) and 48 285 English ebooks (80%+)
from the Project Gutenberg site, with metadata removed. The two datasets are
`gutenberg_multilang` and `gutenberg_english`.
| LANG | EBOOKS |
|---|---|
| EN | 48 285 |
| FR | 2863 |
| DE | 1735 |
| NL | 904 |
| ES | 717 |
| IT | 692 |
| PT | 501 |
| HU | 495 |
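
As a quick consistency check, the per-language counts in the table can be summed and compared with the 7907 non-English total quoted in the introduction. The dictionary below only restates the table; the variable names are illustrative, and the split into English and non-English is presumed to mirror the two datasets.

```python
# Ebook counts per language, copied from the table above.
ebooks_per_language = {
    "EN": 48285,
    "FR": 2863,
    "DE": 1735,
    "NL": 904,
    "ES": 717,
    "IT": 692,
    "PT": 501,
    "HU": 495,
}

# Sum every language except English (presumably the multilingual dataset).
multilang_total = sum(
    count for lang, count in ebooks_per_language.items() if lang != "EN"
)
english_total = ebooks_per_language["EN"]

print(multilang_total)  # 7907
print(english_total)    # 48285
```
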
The METADATA column contains catalogue meta information on each book as a serialized JSON:
| key | original column |
|---|---|
| language | - |
| text_id | Text# unique book identifier on Project Gutenberg as int |
| title | Title of the book as string |
| issued | Issued date as string |
| authors | Authors as string, comma separated sometimes with dates |
| subjects | Subjects as string, various formats |
| locc | LoCC code as string |
| bookshelves | Bookshelves as string, optional |
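
Because METADATA is stored as a serialized JSON string, it has to be decoded before individual keys can be read. A minimal sketch using Python's standard `json` module follows. The sample record is invented for illustration and only reuses the key names from the table above; the helper name `parse_metadata` is hypothetical, and the exact value formats in the real dataset (for example the casing of the language code) may differ.

```python
import json

# Hypothetical METADATA value: the keys follow the table above,
# the values are invented for illustration only.
raw_metadata = json.dumps({
    "language": "fr",          # assumed format of the language code
    "text_id": 12345,
    "title": "Example Title",
    "issued": "2001-01-01",
    "authors": "Doe, Jane, 1800-1870",
    "subjects": "Fiction",
    "locc": "PQ",
    "bookshelves": None,       # optional field
})


def parse_metadata(raw: str) -> dict:
    """Decode a serialized METADATA string into a dict."""
    meta = json.loads(raw)
    # bookshelves is optional, so normalise a missing or empty value to None.
    meta["bookshelves"] = meta.get("bookshelves") or None
    return meta


meta = parse_metadata(raw_metadata)
print(meta["text_id"], meta["title"])
```
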
Please READ the site's TOS before running the crawler Notebook and follow these instructions:
- Use `catalog()` to access the list of available E-books. For more information, visit: https://www.gutenberg.org/ebooks/feeds.html
- Run the `parse()` function on each text.

NOTE: the crawler will create parquet files that are different from the current dataset format (the resulting dataframe will contain Text + all catalogue metadata columns).
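
The note above implies that the crawler output keeps each catalogue field as its own column, whereas the published datasets pack those fields into a single METADATA JSON string. A rough sketch of converting one layout into the other is shown below. The column names (`Text`, `Text#`, `Title`, and so on), the output key `TEXT`, and the helper `to_dataset_row` are assumptions based on the metadata table, not names confirmed by the notebook. The language is passed in separately because the table lists no source column for it.

```python
import json

# Assumed mapping from catalogue column names to METADATA keys.
# These column names are hypothetical; check the notebook's actual output.
COLUMN_TO_KEY = {
    "Text#": "text_id",
    "Title": "title",
    "Issued": "issued",
    "Authors": "authors",
    "Subjects": "subjects",
    "LoCC": "locc",
    "Bookshelves": "bookshelves",
}


def to_dataset_row(crawler_row: dict, language: str) -> dict:
    """Pack flat catalogue columns into one serialized METADATA field."""
    metadata = {"language": language}
    for column, key in COLUMN_TO_KEY.items():
        # Missing columns (e.g. the optional Bookshelves) become None.
        metadata[key] = crawler_row.get(column)
    # text_id is described as an int in the metadata table.
    if metadata["text_id"] is not None:
        metadata["text_id"] = int(metadata["text_id"])
    return {
        "TEXT": crawler_row.get("Text", ""),
        "METADATA": json.dumps(metadata),
    }


# Invented example row, for illustration only.
row = {
    "Text": "Sample body text",
    "Text#": "12345",
    "Title": "Example Title",
    "Issued": "2001-01-01",
    "Authors": "Doe, Jane",
    "Subjects": "Fiction",
    "LoCC": "PQ",
}
converted = to_dataset_row(row, language="fr")
print(converted["METADATA"])
```
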
How was the data generated?
`project_gutenberg_crawler.ipynb` was used to download the raw HTML code for
each eBook based on its Text# id in the Gutenberg catalogue (if available).

Copyright notice: