Dataset Card for Project Gutenberg - Multilanguage eBooks

A collection of 7907 non-English ebooks (about 75-80% of all the ES, DE, FR, NL, IT, PT and HU books available on the site) and 48 285 English ebooks (80%+ of those available) from the Project Gutenberg site, with metadata removed. The two datasets are: gutenberg_multilang and gutenberg_english.

| LANG | EBOOKS |
| ---- | ------ |
| EN   | 48 285 |
| FR   | 2863   |
| DE   | 1735   |
| NL   | 904    |
| ES   | 717    |
| IT   | 692    |
| PT   | 501    |
| HU   | 495    |

The METADATA column contains catalogue meta information on each book as a serialized JSON:

| key         | original column                                          |
| ----------- | -------------------------------------------------------- |
| language    | -                                                        |
| text_id     | Text# unique book identifier on Project Gutenberg as int |
| title       | Title of the book as string                              |
| issued      | Issued date as string                                    |
| authors     | Authors as string, comma separated sometimes with dates  |
| subjects    | Subjects as string, various formats                      |
| locc        | LoCC code as string                                      |
| bookshelves | Bookshelves as string, optional                          |
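
As a minimal sketch of reading one METADATA cell, the snippet below decodes the serialized JSON with the standard library. The sample values are invented for illustration, and `parse_metadata` is a hypothetical helper, not part of the dataset tooling.

```python
import json

# Hypothetical METADATA value: the keys follow the table above,
# the values are invented for illustration only.
raw_metadata = json.dumps({
    "language": "fr",
    "text_id": 12345,
    "title": "Example Title",
    "issued": "2005-01-01",
    "authors": "Doe, Jane, 1850-1900",
    "subjects": "Fiction",
    "locc": "PQ",
})


def parse_metadata(serialized: str) -> dict:
    """Decode one METADATA cell into a dict (hypothetical helper)."""
    meta = json.loads(serialized)
    # text_id is described as an int; coerce defensively in case it
    # was stored as a string.
    meta["text_id"] = int(meta["text_id"])
    # bookshelves is optional, so fall back to an empty string.
    meta.setdefault("bookshelves", "")
    return meta


meta = parse_metadata(raw_metadata)
print(meta["text_id"], meta["title"])
```

Because authors and subjects are free-form strings, any further splitting (for example on commas) would need to tolerate the embedded dates and varying formats noted in the table.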

Source data

Please READ the site's TOS before running the crawler Notebook and follow these instructions:

  • The website will IP-ban crawlers that go through each book's metadata page separately. Instead, use catalog() to access the list of available eBooks. For more information, visit: https://www.gutenberg.org/ebooks/feeds.html
  • You can avoid running the crawler by mirroring the entire database of Project Gutenberg or by using one of their FTPs instead, and then calling the parse() function on each text
  • For more on robot access see: https://www.gutenberg.org/policy/robot_access.html

NOTE: the crawler will create parquet files that are different from the current dataset format (the resulting dataframe will contain Text + all catalogue metadata columns).
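
The catalogue-first approach described above can be sketched roughly as follows: read the catalogue once, then pick the Text# ids to fetch, rather than visiting each book's metadata page. This is not the notebook's catalog() implementation. The column names (`Text#`, `Type`, `Language`, and so on) are an assumption about the catalogue CSV layout, the rows are invented, and `select_text_ids` is a hypothetical helper. An inline sample stands in for the real feed so that nothing is downloaded.

```python
import csv
import io

# Inline stand-in for the catalogue CSV. Column names are an assumption
# about the feed layout; every row is invented for illustration.
SAMPLE_CATALOG = """Text#,Type,Issued,Title,Language,Authors,Subjects,LoCC,Bookshelves
1,Text,2001-01-01,Example One,en,"Doe, Jane",Fiction,PS,
2,Text,2002-02-02,Exemple Deux,fr,"Dupont, Jean",Fiction,PQ,
3,Sound,2003-03-03,Example Three,fr,"Doe, John",Music,M,
"""

# Languages of the multilanguage dataset, lower-cased for comparison.
WANTED_LANGUAGES = {"es", "de", "fr", "nl", "it", "pt", "hu"}


def select_text_ids(catalog_file, languages):
    """Return the Text# ids of text entries in the given languages."""
    reader = csv.DictReader(catalog_file)
    return [
        int(row["Text#"])
        for row in reader
        if row["Type"] == "Text" and row["Language"].lower() in languages
    ]


ids = select_text_ids(io.StringIO(SAMPLE_CATALOG), WANTED_LANGUAGES)
print(ids)
```

With a real catalogue file, the same function could be given an open file handle instead of the `io.StringIO` sample.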

How was the data generated?

  • project_gutenberg_crawler.ipynb was used to download the raw HTML code for each eBook based on Text# id in the Gutenberg catalogue (if available)
  • The metadata and the body of text are not clearly separated, so a parser included in the notebook attempts to split them and then removes transcriber's notes and eBook-related information from the body of text (texts clearly marked as copyrighted or malformed were skipped and not collected)
  • The body of cleaned TEXT as well as the catalogue METADATA is then saved as a parquet file, with all columns being strings
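
To illustrate the splitting step, here is a minimal sketch that separates a header from the body using the "*** START OF ..." and "*** END OF ..." marker lines found in many Project Gutenberg plain-text files. It is not the notebook's parser: it assumes plain text rather than raw HTML, assumes those markers are present, does not strip transcriber's notes, and `split_book` is a hypothetical name.

```python
import re

# Assumed marker lines; real files vary, so a production parser would
# need more patterns than these two.
START_RE = re.compile(
    r"^\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)
END_RE = re.compile(
    r"^\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)


def split_book(raw: str):
    """Return (header, body), or None when the markers are missing."""
    start = START_RE.search(raw)
    end = END_RE.search(raw)
    if start is None or end is None or end.start() < start.end():
        # Treat the text as malformed and skip it.
        return None
    header = raw[: start.start()].strip()
    body = raw[start.end() : end.start()].strip()
    return header, body


sample = (
    "Title: Example Title\n"
    "Author: Jane Doe\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE TITLE ***\n"
    "Chapter 1\n"
    "It was a dark night.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE TITLE ***\n"
    "License text follows.\n"
)
result = split_book(sample)
print(result)
```

Returning None for texts without both markers mirrors the idea of skipping malformed books instead of guessing where the body begins.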

Copyright notice:

  • Some of the books are copyrighted! The crawler skipped all books with an English copyright header by using a regular expression, but check the metadata of each book manually to make sure it is okay to use in your country! More information on copyright: https://www.gutenberg.org/help/copyright.html and https://www.gutenberg.org/policy/permission.html
  • Project Gutenberg has the following request when using books without metadata: books obtained from the Project Gutenberg site should have the following legal note next to them: "This eBook is for the use of anyone anywhere in the United States and most other parts of the world at no cost and with almost no restrictions whatsoever. You may copy it, give it away or re-use it under the terms of the Project Gutenberg License included with this eBook or online at www.gutenberg.org. If you are not located in the United States, you will have to check the laws of the country where you are located before using this eBook."
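
The regular-expression copyright filter mentioned above could look roughly like the sketch below. The crawler's actual pattern is not given here, so the wording matched by `COPYRIGHT_RE` is an assumption, and `looks_copyrighted` is a hypothetical helper. A check like this only catches English headers, which is why a manual review of the metadata is still needed.

```python
import re

# Assumed header wording: "copyrighted" followed shortly by
# "Project Gutenberg". This is NOT the crawler's actual pattern.
COPYRIGHT_RE = re.compile(
    r"\bcopyrighted\b.{0,40}\bproject gutenberg\b",
    re.IGNORECASE | re.DOTALL,
)


def looks_copyrighted(header: str) -> bool:
    """Return True when the header matches the assumed copyright wording."""
    return COPYRIGHT_RE.search(header) is not None


flagged = looks_copyrighted(
    "This is a COPYRIGHTED Project Gutenberg eBook, Details Below"
)
clean = looks_copyrighted(
    "This eBook is for the use of anyone anywhere in the United States"
)
print(flagged, clean)
```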