Use CommonCrawl robots.txt data if we're in common crawl mode #44

Open

kkrugler opened this issue Sep 20, 2017 · 0 comments
Currently, when we're crawling with CommonCrawl data, we use a robots.txt fetcher that always returns a 404 for every request. But we could use the CommonCrawl robots.txt data to better simulate what a real crawl would get.
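
For illustration only, here's a minimal sketch (in Python, with a made-up fetcher interface rather than the project's actual classes) of what such a fetcher could look like once the robots.txt payloads have been extracted into a host-keyed lookup:

```python
# Hypothetical sketch (not the project's actual API): a robots.txt "fetcher"
# that serves pre-extracted CommonCrawl robots.txt content instead of
# unconditionally answering 404.
from urllib.parse import urlparse

class CommonCrawlRobotsFetcher:
    def __init__(self, robots_by_host):
        # robots_by_host: dict mapping host -> robots.txt bytes, built
        # offline from the CommonCrawl robotstxt/ WARC files.
        self.robots_by_host = robots_by_host

    def fetch(self, robots_url):
        host = urlparse(robots_url).netloc
        content = self.robots_by_host.get(host)
        if content is None:
            # Same behavior as today when we have no data: pretend 404.
            return 404, b""
        return 200, content
```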

This data is stored separately from the actual crawl data. In the latest crawl, these are gzipped WARC files located at:

s3://commoncrawl/crawl-data/CC-MAIN-2017-39/segments/<segment>/robotstxt/

There are 19 segments for the 2017-39 crawl, and each such path above has about 720 files, with names like CC-MAIN-20170920194628-20170920214628-00719.warc.gz.
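
For what it's worth, enumerating those files is straightforward with an anonymous S3 listing; here's a rough sketch using boto3 (the bucket and prefix come from the path above, everything else is illustrative):

```python
# Sketch: enumerate the robots.txt WARC files for the 2017-39 crawl.
# Assumes anonymous read access to the public "commoncrawl" S3 bucket.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
paginator = s3.get_paginator("list_objects_v2")

# Find the per-segment prefixes, then list each segment's robotstxt/ files.
segments = s3.list_objects_v2(
    Bucket="commoncrawl",
    Prefix="crawl-data/CC-MAIN-2017-39/segments/",
    Delimiter="/",
)["CommonPrefixes"]

robots_keys = []
for seg in segments:
    prefix = seg["Prefix"] + "robotstxt/"
    for page in paginator.paginate(Bucket="commoncrawl", Prefix=prefix):
        robots_keys.extend(obj["Key"] for obj in page.get("Contents", []))

print(len(robots_keys), "robots.txt WARC files")
```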

Each such file is around 2MB compressed and 5MB uncompressed, so the total data size is roughly 68GB uncompressed (19 segments × ~720 files × ~5MB).

The one file I examined had 1663 responses, so it looks like there are maybe 23M robots.txt responses in total (19 × 720 × 1663 ≈ 22.7M).

Unfortunately there's no index, so we'd have to do some significant data crunching to turn this into something usable.
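
To give a sense of the crunching involved, here's a rough sketch (using the warcio package, which is just one convenient choice of WARC reader) that pulls the successful robots.txt responses out of a single robotstxt WARC file and keys them by host; the real job would do this across all of the files and persist the result in some indexed form:

```python
# Sketch: extract 200-status robots.txt payloads from one robotstxt WARC file
# and key them by host. Assumes the warcio package is available.
from urllib.parse import urlparse
from warcio.archiveiterator import ArchiveIterator

def robots_by_host(warc_path):
    result = {}
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            if record.http_headers.get_statuscode() != "200":
                continue
            uri = record.rec_headers.get_header("WARC-Target-URI")
            host = urlparse(uri).netloc
            result[host] = record.content_stream().read()
    return result

index = robots_by_host("CC-MAIN-20170920194628-20170920214628-00719.warc.gz")
print(len(index), "hosts with robots.txt")
```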

The WARC data contains records that look like this (for valid responses):

```
WARC/1.0
WARC-Type: response
WARC-Date: 2017-09-20T19:47:49Z
WARC-Record-ID: <urn:uuid:40762295-3b8b-4b61-8a36-b740a861675b>
Content-Length: 670
Content-Type: application/http; msgtype=response
WARC-Warcinfo-ID: <urn:uuid:5d5dc166-c888-4784-9cc3-1f24b050a33f>
WARC-Concurrent-To: <urn:uuid:4d6a2942-9f48-4415-9d69-d9e91bd0d3ea>
WARC-IP-Address: 122.10.65.165
WARC-Target-URI: http://06j93p.fbcnz.com/robots.txt
WARC-Payload-Digest: sha1:YSNCCOGJQ3TBD65CXDNULZQ2EQHMVAXF
WARC-Block-Digest: sha1:E2X4ERHREMDPMZBJ34KLXSUQQ45FGYSI
WARC-Identified-Payload-Type: text/plain

HTTP/1.1 200 OK
Content-Type: text/plain
Last-Modified: Thu, 18 Aug 2016 05:08:30 GMT
Accept-Ranges: bytes
ETag: "50918a8fef9d11:0"
Server: Microsoft-IIS/7.5
X-Powered-By: ASP.NET
Date: Wed, 20 Sep 2017 19:52:24 GMT
Connection: close
Content-Length: 404

User-agent: Baiduspider
Allow: /
User-agent: Googlebot
Disallow: /
User-agent: Bingbot
Disallow: /
User-agent: Slurp
Disallow: /
User-agent: Teoma
Disallow: /
User-agent: ia_archiver
Disallow: /
User-agent: twiceler
Disallow: /
User-agent: MSNBot
Disallow: /
User-agent: Scrubby
Disallow: /
User-agent: Robozilla
Disallow: /
User-agent: Gigabot
Disallow: /
User-agent: *
Disallow:
```
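
As a sanity check, the payload above behaves as expected when run through a standard robots.txt parser. Here's a quick check using Python's urllib.robotparser (the project itself would presumably use crawler-commons' SimpleRobotRulesParser, so this is just illustrative):

```python
# Quick check: parse a trimmed copy of the robots.txt payload shown above
# and confirm which agents are allowed.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: Baiduspider
Allow: /
User-agent: Googlebot
Disallow: /
User-agent: *
Disallow:
"""  # trimmed to a few of the entries shown above

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("Baiduspider", "http://06j93p.fbcnz.com/"))   # True
print(rp.can_fetch("Googlebot", "http://06j93p.fbcnz.com/"))     # False
print(rp.can_fetch("SomeOtherBot", "http://06j93p.fbcnz.com/"))  # True
```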