Currently when we're crawling with CommonCrawl data, we use a fetcher for robots.txt that always returns 404 for any request. But we could use the CommonCrawl robots.txt data to better simulate what a real crawl would get.
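As a rough sketch of the idea (not the project's actual fetcher), a lookup-based robots fetcher could answer from stored CommonCrawl data and fall back to the current 404 behavior when a host is missing. The name `fetch_robots` and the `robots_by_host` mapping are hypothetical, introduced only for illustration.

```python
from urllib.parse import urlsplit


def fetch_robots(robots_by_host, url):
    """Return (status, body) for the robots.txt of the URL's host.

    robots_by_host is a hypothetical mapping of lowercased host to
    (HTTP status, robots.txt body) built from the CommonCrawl data.
    Hosts with no stored response get the current behavior: a 404.
    """
    host = urlsplit(url).netloc.lower()
    return robots_by_host.get(host, (404, b""))
```

Keeping the 404 fallback means hosts absent from the robots data behave exactly as they do today.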
This data is stored separately from the actual crawl data. In the latest crawl, these are gzipped WARC files located at:
s3://commoncrawl/crawl-data/CC-MAIN-2017-39/segments/<segment>/robotstxt/
There are 19 segments for the 2017-39 crawl, and each such path above has about 720 files, with names like CC-MAIN-20170920194628-20170920214628-00719.warc.gz.
Each such file is around 2MB compressed, and 5MB uncompressed. So the estimate of total data size is about 68GB (uncompressed).
The one file I examined had 1663 responses, so it looks like there are maybe 23M robots responses in total.
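The estimates above can be reproduced with a few lines of arithmetic. This only restates the figures quoted in this issue, and it assumes the one examined file is representative of all the others.

```python
# Figures quoted in this issue; the per-file numbers come from a single sample file.
SEGMENTS = 19
FILES_PER_SEGMENT = 720
UNCOMPRESSED_MB_PER_FILE = 5
RESPONSES_PER_FILE = 1663

total_files = SEGMENTS * FILES_PER_SEGMENT
total_uncompressed_gb = total_files * UNCOMPRESSED_MB_PER_FILE / 1000
total_responses = total_files * RESPONSES_PER_FILE

print(total_files)            # 13680 files
print(total_uncompressed_gb)  # 68.4 GB uncompressed
print(total_responses)        # 22749840, roughly 23M responses
```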
Unfortunately there's no index, so we'd have to do some significant data crunching to turn this into something useable.
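One possible shape for that data crunching is sketched below, using only the Python standard library. It assumes each record is a standard WARC record (a `WARC/` version line, headers, a blank line, then `Content-Length` bytes of payload) and that the payload of a `response` record is a raw HTTP response. The function names are hypothetical, and a real job would need to run across all files and handle malformed records.

```python
import gzip
from urllib.parse import urlsplit


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"unexpected line: {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the WARC header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        payload = stream.read(int(headers.get("content-length", 0)))
        yield headers, payload


def index_robots(stream):
    """Map lowercased host -> (HTTP status, robots.txt body) for response records.

    A later record for the same host overwrites an earlier one, so http and
    https variants of a host are not distinguished in this sketch.
    """
    index = {}
    for headers, payload in iter_warc_records(stream):
        if headers.get("warc-type") != "response":
            continue
        host = urlsplit(headers.get("warc-target-uri", "")).netloc.lower()
        head, _, body = payload.partition(b"\r\n\r\n")
        parts = head.split(b"\r\n", 1)[0].split()
        status = int(parts[1].decode("ascii")) if len(parts) > 1 and parts[1].isdigit() else None
        index[host] = (status, body)
    return index


def index_robots_file(path):
    """Index one gzipped WARC file; gzip.open reads concatenated gzip members."""
    with gzip.open(path, "rb") as stream:
        return index_robots(stream)
```

An in-memory dict is only workable per file; for the full set of roughly 23M responses the same records would more likely be written out to a keyed store, which is left open here.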
The WARC data contains records that look like this (for valid responses):