Common Crawl is a nonprofit organization that maintains a freely accessible repository of web crawl data collected since 2008. The corpus spans petabytes and more than 300 billion pages, hosted on Amazon Web Services and on academic cloud platforms.
Users can analyze the data directly on Amazon's cloud platform, download datasets in whole or in part, or look up individual pages through the Common Crawl URL Index. The corpus includes raw HTML, metadata extracts, and text extracts, and is widely used for research, machine learning, and AI applications.
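As a rough illustration of how the URL Index can be used, the sketch below builds a query URL for the index server and parses the kind of newline-delimited JSON record the index returns. The crawl label `CC-MAIN-2023-50` and the exact field names are assumptions based on the CDX index format; the sample record is fabricated for illustration, and real lookups should consult the index server for the current list of crawls.

```python
import json
from urllib.parse import urlencode

# Base endpoint of the Common Crawl URL (CDX) index server.
INDEX_HOST = "https://index.commoncrawl.org"

def build_index_query(crawl_id: str, url_pattern: str) -> str:
    """Build a query URL against one crawl's index.

    `crawl_id` (e.g. "CC-MAIN-2023-50") is an example label, not a
    guaranteed-current crawl; `url_pattern` may use wildcards such
    as "example.org/*" to match many pages under a host.
    """
    params = urlencode({"url": url_pattern, "output": "json"})
    return f"{INDEX_HOST}/{crawl_id}-index?{params}"

# The index server answers with one JSON object per line; a record
# looks roughly like this (field names assumed from the CDX format).
# "filename", "offset", and "length" locate the page inside a WARC
# archive file, so a single page can be fetched with an HTTP Range
# request instead of downloading the whole archive.
sample_line = (
    '{"urlkey": "org,example)/", "timestamp": "20231201000000", '
    '"url": "https://example.org/", "status": "200", '
    '"filename": "crawl-data/CC-MAIN-2023-50/segments/.../warc/....warc.gz", '
    '"offset": "1234", "length": "5678"}'
)

query = build_index_query("CC-MAIN-2023-50", "example.org/*")
record = json.loads(sample_line)
print(query)
print(record["url"], record["offset"], record["length"])
```

Because each index record points at a byte range inside a specific archive file, tooling built on the index typically fetches only those bytes rather than whole multi-gigabyte files.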