Dataset Card for Project Gutenberg - Multilanguage eBooks

A collection of 7907 non-English ebooks (about 75-80% of all the ES, DE, FR, NL, IT, PT and HU books available on the site) and 48 285 English ebooks (80%+) from the Project Gutenberg site, with metadata removed. The collection is split into two datasets: gutenberg_multilang and gutenberg_english.

LANG	EBOOKS
EN	48 285
FR	2863
DE	1735
NL	904
ES	717
IT	692
PT	501
HU	495

The METADATA column contains the catalogue metadata for each book as a serialized JSON string:

key	original column
language	-
text_id	Text# unique book identifier on Project Gutenberg as int
title	Title of the book as string
issued	Issued date as string
authors	Authors as string, comma separated, sometimes with dates
subjects	Subjects as string, various formats
locc	LoCC code as string
bookshelves	Bookshelves as string, optional
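
As a minimal sketch of how one METADATA cell could be read back, the example below deserializes the JSON string with the standard library. The record itself is invented for illustration (only the key names come from the table above), and parse_metadata is a hypothetical helper, not part of the dataset tooling.

```python
import json

# Hypothetical METADATA value: the keys follow the table above,
# the values are invented for illustration only.
raw_metadata = json.dumps({
    "language": "FR",
    "text_id": 12345,
    "title": "Example Title",
    "issued": "2005-01-01",
    "authors": "Dupont, Jean, 1800-1870",
    "subjects": "Fiction",
    "locc": "PQ",
    "bookshelves": "",
})


def parse_metadata(raw: str) -> dict:
    """Deserialize one METADATA cell and normalise the optional field."""
    meta = json.loads(raw)
    # bookshelves is optional, so treat a missing or empty value as None
    meta["bookshelves"] = meta.get("bookshelves") or None
    return meta


meta = parse_metadata(raw_metadata)
print(meta["text_id"], meta["title"])
```

Because authors and subjects are stored as free-form strings, any further splitting (for example on commas) should be treated as best effort.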

Source data

Please READ the site's TOS before running the crawler notebook, and follow these instructions:

The website will IP ban crawlers that go through each book's metadata page separately. Instead, use catalog() to access the list of available ebooks. For more information, visit: https://www.gutenberg.org/ebooks/feeds.html
You can avoid running the crawler altogether by mirroring the entire Project Gutenberg database or using one of their FTP servers, and then calling the parse() function on each text.
For more on robot access see: https://www.gutenberg.org/policy/robot_access.html
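
To illustrate the catalogue-first approach described above, here is a rough sketch that selects book ids from a catalogue file instead of visiting individual metadata pages. It does not reproduce the notebook's catalog() function. The column names (Text#, Type, Language and so on) are assumptions based on the "original column" names in this card, and the sample rows are invented; check them against the real file described on the feeds page.

```python
import csv
import io

# Tiny stand-in for a downloaded catalogue file. Column names are assumed,
# and the rows are invented for illustration.
SAMPLE_CATALOG = """Text#,Type,Issued,Title,Language,Authors,Subjects,LoCC,Bookshelves
1,Text,1971-12-01,Example Title A,en,"Doe, Jane",Fiction,PS,
2,Text,2004-05-01,Example Title B,fr,"Dupont, Jean, 1800-1870",Fiction,PQ,
3,Sound,2006-01-01,Example Title C,fr,"Dupont, Jean",Fiction,PQ,
"""


def select_text_ids(catalog_csv: str, languages: set[str]) -> list[int]:
    """Return the Text# ids of text ebooks written in one of the given languages."""
    reader = csv.DictReader(io.StringIO(catalog_csv))
    return [
        int(row["Text#"])
        for row in reader
        if row["Type"] == "Text" and row["Language"] in languages
    ]


french_ids = select_text_ids(SAMPLE_CATALOG, {"fr"})
print(french_ids)  # [2]
```

Working from a single catalogue download keeps the number of requests low, which is the point of the instruction above.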

NOTE: the crawler will create parquet files that are different from the current dataset format (the resulting dataframe will contain Text + all catalogue metadata columns).

How was the data generated?

project_gutenberg_crawler.ipynb was used to download the raw HTML for each eBook, based on its Text# id in the Gutenberg catalogue (if available).
The metadata and the body of text are not clearly separated, so a parser included in the notebook attempts to split them and then removes transcriber's notes and ebook-related information from the body. Text clearly marked as copyrighted, or malformed, was skipped and not collected.
The cleaned body (TEXT) and the catalogue METADATA are then saved as a parquet file, with all columns stored as strings.
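
The splitting step can be sketched roughly as follows. This is not the notebook's parser: it assumes plain-text input with "*** START OF ..." and "*** END OF ..." marker lines, uses a deliberately simple, hypothetical copyright pattern, and does not attempt to remove transcriber's notes.

```python
import re

# Assumed marker format; real files vary, so treat these patterns as a starting point.
START_RE = re.compile(
    r"^\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.MULTILINE | re.IGNORECASE,
)
END_RE = re.compile(
    r"^\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.MULTILINE | re.IGNORECASE,
)
# Hypothetical copyright check, far simpler than a production rule would be.
COPYRIGHT_RE = re.compile(r"\bcopyrighted\b", re.IGNORECASE)


def split_ebook(raw: str):
    """Return (header, body), or None when the text should be skipped."""
    start = START_RE.search(raw)
    end = END_RE.search(raw)
    if start is None or end is None or end.start() < start.end():
        return None  # malformed: markers missing or out of order
    header = raw[:start.start()]
    if COPYRIGHT_RE.search(header):
        return None  # header marks the book as copyrighted
    body = raw[start.end():end.start()].strip()
    return header.strip(), body


sample = (
    "Title: Example\nLanguage: English\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Chapter 1\nIt was a quiet morning.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "License text follows."
)
result = split_ebook(sample)
print(result)
```

Returning None for both malformed and copyrighted texts mirrors the skip-and-do-not-collect behaviour described above; a caller that needs to tell the two cases apart would have to return a reason as well.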

Copyright notice:

Some of the books are copyrighted! The crawler skipped every book with an English copyright header, detected with a regular expression, but check the metadata of each book manually to make sure it is okay to use in your country. More information on copyright: https://www.gutenberg.org/help/copyright.html and https://www.gutenberg.org/policy/permission.html
Project Gutenberg makes the following request for books used without their metadata: books obtained from the Project Gutenberg site should have the following legal note next to them: "This eBook is for the use of anyone anywhere in the United States and most other parts of the world at no cost and with almost no restrictions whatsoever. You may copy it, give it away or re-use it under the terms of the Project Gutenberg License included with this eBook or online at www.gutenberg.org. If you are not located in the United States, you will have to check the laws of the country where you are located before using this eBook."
