BookCorpus (sometimes also called the Toronto Book Corpus) is a dataset consisting of the text of around 11,000 books by unpublished authors, scraped from the Internet. It was the main corpus used to train OpenAI's initial GPT model,[1] and has been used as training data for other early large language models, including Google's BERT.[2] The dataset contains around 985 million words, and its books span a range of genres, including romance, science fiction, and fantasy.[2]

The corpus was introduced in a 2015 paper by researchers from the University of Toronto and MIT titled "Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books". The authors described it as consisting of "free books written by yet unpublished authors".[3][4] The dataset was initially hosted on a University of Toronto webpage.[4] An official version of the original dataset is no longer publicly available, though at least one substitute, BookCorpusOpen, has been created.[5] Though not documented in the original 2015 paper, the site from which the corpus's books were scraped is now known to be Smashwords.[4][5]