Common Crawl
Nonprofit web crawling and archive organization From Wikipedia, the free encyclopedia
Common Crawl is a nonprofit 501(c)(3) organization that crawls the web and freely provides its archives and datasets to the public.[1][2]
Common Crawl was founded by Gil Elbaz.[1][2] It is funded by the Elbaz Family Foundation Trust and significant donations from the AI industry.[3]
Content archived by Common Crawl is mirrored[4][better source needed] and made available online[5][non-primary source needed] in the Wayback Machine. The archives are used by researchers and by AI companies, which use them to train large language models.[3]
In November 2025, an investigation by The Atlantic revealed that Common Crawl lied when it claimed that it respected paywalls in its scraping and honored publishers' requests to have their content removed from its databases.[6][3]
History
Advisors to the non-profit have included Peter Norvig and Joi Ito.[7]
By 2013, sites such as TinEye were building their products on top of Common Crawl.[8]
As of 2016, the Common Crawl dataset includes copyrighted work and is distributed from the US under fair use claims. Researchers in other countries have made use of techniques such as shuffling sentences or referencing the Common Crawl dataset to work around copyright law in other legal jurisdictions.[9][better source needed]
A filtered version of Common Crawl was used to train OpenAI's GPT-3 language model, announced in 2020.[10][better source needed] In 2023, Common Crawl began receiving significant financial support from AI companies, including Anthropic and OpenAI, each of which donated $250,000.[3]
As of 2024, Common Crawl had been cited in more than 10,000 academic studies.[11]
In November 2025, an investigation by technology journalist Alex Reisner for The Atlantic revealed that Common Crawl lied when it claimed that it respected paywalls in its scraping and honored publishers' requests to have their content removed from its databases.[3] The public search function on its website also returned misleading results: it showed no entries for websites that had requested removal of their archives, when in fact those sites were still included in the scrapes used by AI companies.[3]
Colossal Clean Crawled Corpus
Google's version of Common Crawl is called the Colossal Clean Crawled Corpus, or C4 for short. It was constructed in 2019 for training the T5 language model series.[12][better source needed] There are some concerns over copyrighted content in C4.[13] One study found that 45% of its content was now explicitly restricted by websites that do not want it scraped without compensation for purposes such as AI training by for-profit companies.[11]