Common Crawl

Type of business	501(c)(3) non-profit
Founded	2007
Headquarters	San Francisco, California; Los Angeles, California, United States
Founder	Gil Elbaz
Key people	Peter Norvig, Rich Skrenta, Eva Ho
URL	commoncrawl.org
Content license	Apache 2.0 (software)

Common Crawl is a nonprofit 501(c)(3) organization that crawls the web and freely provides its archives and datasets to the public.^[1]^[2]


Common Crawl was founded by Gil Elbaz.^[1]^[2] It is funded by the Elbaz Family Foundation Trust and significant donations from the AI industry.^[3]

Content archived by Common Crawl is mirrored^[4] and made available online^[5] in the Wayback Machine. It is used by researchers and by AI companies, which use it to train large language models.^[3]

In November 2025, an investigation by The Atlantic revealed that Common Crawl lied when it claimed that it respected paywalls in its scraping and that it honored requests from publishers to have their content removed from its databases.^[6]^[3]

[1]

[2]

[3]

[4]

[5]

[6]
