Use CommonCrawl robots.txt data if we're in common crawl mode #44

Open

kkrugler opened this issue Sep 20, 2017 · 0 comments
Currently, when we're crawling with CommonCrawl data, we use a robots.txt fetcher that always returns a 404 for every request. But we could use the CommonCrawl robots.txt data to better simulate what a real crawl would get.
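
For illustration only, here's a minimal sketch (in Python, with a made-up fetcher interface rather than the project's actual classes) of what such a fetcher could look like once the robots.txt payloads have been extracted into a host-keyed lookup:

```python
# Hypothetical sketch (not the project's actual API): a robots.txt "fetcher"
# that serves pre-extracted CommonCrawl robots.txt content instead of
# unconditionally answering 404.
from urllib.parse import urlparse

class CommonCrawlRobotsFetcher:
    def __init__(self, robots_by_host):
        # robots_by_host: dict mapping host -> robots.txt bytes, built
        # offline from the CommonCrawl robotstxt/ WARC files.
        self.robots_by_host = robots_by_host

    def fetch(self, robots_url):
        host = urlparse(robots_url).netloc
        content = self.robots_by_host.get(host)
        if content is None:
            # Same behavior as today when we have no data: pretend 404.
            return 404, b""
        return 200, content
```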

This data is stored separately from the actual crawl data. In the latest crawl, these are gzipped WARC files located at:

s3://commoncrawl/crawl-data/CC-MAIN-2017-39/segments/<segment>/robotstxt/

There are 19 segments for the 2017-39 crawl, and each such path above has about 720 files, with names like CC-MAIN-20170920194628-20170920214628-00719.warc.gz.
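
For what it's worth, enumerating those files is straightforward with an anonymous S3 listing; here's a rough sketch using boto3 (the bucket and prefix come from the path above, everything else is illustrative):

```python
# Sketch: enumerate the robots.txt WARC files for the 2017-39 crawl.
# Assumes anonymous read access to the public "commoncrawl" S3 bucket.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
paginator = s3.get_paginator("list_objects_v2")

# Find the per-segment prefixes, then list each segment's robotstxt/ files.
segments = s3.list_objects_v2(
    Bucket="commoncrawl",
    Prefix="crawl-data/CC-MAIN-2017-39/segments/",
    Delimiter="/",
)["CommonPrefixes"]

robots_keys = []
for seg in segments:
    prefix = seg["Prefix"] + "robotstxt/"
    for page in paginator.paginate(Bucket="commoncrawl", Prefix=prefix):
        robots_keys.extend(obj["Key"] for obj in page.get("Contents", []))

print(len(robots_keys), "robots.txt WARC files")
```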

Each such file is around 2MB compressed and 5MB uncompressed, so the total data size is roughly 68GB uncompressed (19 segments × ~720 files × ~5MB).

The one file I examined had 1663 responses, so it looks like there are maybe 23M robots.txt responses in total (19 × 720 × 1663 ≈ 22.7M).

Unfortunately there's no index, so we'd have to do some significant data crunching to turn this into something usable.
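
To give a sense of the crunching involved, here's a rough sketch (using the warcio package, which is just one convenient choice of WARC reader) that pulls the successful robots.txt responses out of a single robotstxt WARC file and keys them by host; the real job would do this across all of the files and persist the result in some indexed form:

```python
# Sketch: extract 200-status robots.txt payloads from one robotstxt WARC file
# and key them by host. Assumes the warcio package is available.
from urllib.parse import urlparse
from warcio.archiveiterator import ArchiveIterator

def robots_by_host(warc_path):
    result = {}
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            if record.http_headers.get_statuscode() != "200":
                continue
            uri = record.rec_headers.get_header("WARC-Target-URI")
            host = urlparse(uri).netloc
            result[host] = record.content_stream().read()
    return result

index = robots_by_host("CC-MAIN-20170920194628-20170920214628-00719.warc.gz")
print(len(index), "hosts with robots.txt")
```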

The WARC data contains records that look like this (for valid responses):

```
WARC/1.0
WARC-Type: response
WARC-Date: 2017-09-20T19:47:49Z
WARC-Record-ID: <urn:uuid:40762295-3b8b-4b61-8a36-b740a861675b>
Content-Length: 670
Content-Type: application/http; msgtype=response
WARC-Warcinfo-ID: <urn:uuid:5d5dc166-c888-4784-9cc3-1f24b050a33f>
WARC-Concurrent-To: <urn:uuid:4d6a2942-9f48-4415-9d69-d9e91bd0d3ea>
WARC-IP-Address: 122.10.65.165
WARC-Target-URI: http://06j93p.fbcnz.com/robots.txt
WARC-Payload-Digest: sha1:YSNCCOGJQ3TBD65CXDNULZQ2EQHMVAXF
WARC-Block-Digest: sha1:E2X4ERHREMDPMZBJ34KLXSUQQ45FGYSI
WARC-Identified-Payload-Type: text/plain

HTTP/1.1 200 OK
Content-Type: text/plain
Last-Modified: Thu, 18 Aug 2016 05:08:30 GMT
Accept-Ranges: bytes
ETag: "50918a8fef9d11:0"
Server: Microsoft-IIS/7.5
X-Powered-By: ASP.NET
Date: Wed, 20 Sep 2017 19:52:24 GMT
Connection: close
Content-Length: 404

User-agent: Baiduspider
Allow: /
User-agent: Googlebot
Disallow: /
User-agent: Bingbot
Disallow: /
User-agent: Slurp
Disallow: /
User-agent: Teoma
Disallow: /
User-agent: ia_archiver
Disallow: /
User-agent: twiceler
Disallow: /
User-agent: MSNBot
Disallow: /
User-agent: Scrubby
Disallow: /
User-agent: Robozilla
Disallow: /
User-agent: Gigabot
Disallow: /
User-agent: *
Disallow:
```
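
As a sanity check, the payload above behaves as expected when run through a standard robots.txt parser. Here's a quick check using Python's urllib.robotparser (the project itself would presumably use crawler-commons' SimpleRobotRulesParser, so this is just illustrative):

```python
# Quick check: parse a trimmed copy of the robots.txt payload shown above
# and confirm which agents are allowed.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: Baiduspider
Allow: /
User-agent: Googlebot
Disallow: /
User-agent: *
Disallow:
"""  # trimmed to a few of the entries shown above

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("Baiduspider", "http://06j93p.fbcnz.com/"))   # True
print(rp.can_fetch("Googlebot", "http://06j93p.fbcnz.com/"))     # False
print(rp.can_fetch("SomeOtherBot", "http://06j93p.fbcnz.com/"))  # True
```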