BookCorpus

Size	7,000 self-published books - 985 million words
Creator(s)	University of Toronto and MIT
Date of release	2015

BookCorpus (sometimes also called the Toronto Book Corpus) is a dataset consisting of the text of around 7,000 self-published books scraped from the indie ebook distribution website Smashwords.^[1] It was the main corpus used to train OpenAI's initial GPT model,^[2] and it has been used as training data for other early large language models, including Google's BERT.^[3] The dataset contains around 985 million words, and its books span a range of genres, including romance, science fiction, and fantasy.^[3]

The corpus was introduced in a 2015 paper by researchers from the University of Toronto and MIT titled "Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books".^[4] The authors described it as consisting of "free books written by yet unpublished authors," but this description is inaccurate: the books had already been published by self-published ("indie") authors, who priced them at free. The books were downloaded without the consent or permission of Smashwords or Smashwords authors, in violation of the Smashwords Terms of Service.^[5] The dataset was initially hosted on a University of Toronto webpage.^[5] An official version of the original dataset is no longer publicly available, though at least one substitute, BookCorpusOpen, has been created.^[1] Although the original 2015 paper did not document the source, the site from which the books were scraped is now known to be Smashwords.^[5]^[1]

[1]

[2]

[3]

[4]

[5]

References