BookCorpus

Book dataset

From Wikipedia, the free encyclopedia

BookCorpus (sometimes called the Toronto Book Corpus) is a dataset consisting of the text of around 7,000 self-published books scraped from the indie ebook distribution website Smashwords.[1] It was the main corpus used to train OpenAI's initial GPT model,[2] and has been used as training data for other early large language models, including Google's BERT.[3] The dataset contains around 985 million words, and its books span a range of genres, including romance, science fiction, and fantasy.[3]

The corpus was introduced in a 2015 paper by researchers from the University of Toronto and MIT titled "Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books".[4] The authors described it as consisting of "free books written by yet unpublished authors," but this description is inaccurate: the books had been released by self-published ("indie") authors who priced them at free, and they were downloaded without the consent or permission of Smashwords or its authors, in violation of the Smashwords Terms of Service.[5] The dataset was initially hosted on a University of Toronto webpage.[5] An official version of the original dataset is no longer publicly available, though at least one substitute, BookCorpusOpen, has been created.[1] Though not documented in the original 2015 paper, the site from which the corpus's books were scraped is now known to be Smashwords.[5][1]
