For each page in CommonCrawl we store the following metadata:
- The location of the raw download in S3. This is given as a URL plus offset and length.
- The languages discovered by CLD2 in the raw text of the page. This is given as a list of (Language / Number of bytes) pairs.
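As a rough mental model, the two items above can be pictured as one small record per page. The class below is only an illustrative sketch: the name `PageMetadata`, its field names, and the `dominant_language` helper are hypothetical and are not taken from the DataCollection code.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class PageMetadata:
    """Hypothetical container for the metadata kept per page."""
    filename: str  # URL of the raw download in S3
    offset: int    # byte offset of the page record within that file
    length: int    # length of the page record in bytes
    # (language, number of bytes) pairs as reported by CLD2
    languages: List[Tuple[str, int]] = field(default_factory=list)

    def dominant_language(self) -> Optional[str]:
        """Return the language covering the most bytes, or None if unknown."""
        if not self.languages:
            return None
        return max(self.languages, key=lambda pair: pair[1])[0]
```

For example, a page with `[("en", 900), ("fi", 2100)]` would report `"fi"` as its dominant language.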
Example: crawl 2015_40
For recent crawls we use the data files from the common crawl index. These are just 300 gzipped files containing key-value pairs.
mkdir 2015_40
cd 2015_40
for i in `seq -w 0 299`; do wget -c https://commoncrawl.s3.amazonaws.com/cc-index/collections/CC-MAIN-2015-40/indexes/cdx-00${i}.gz; done
Restart the above until all files are completed.
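One possible way to decide whether another restart is needed is to check that every expected file exists and decompresses to the end. The helper below is a sketch under that assumption; the function name `incomplete_index_files` is made up for illustration, and the file naming simply mirrors the `cdx-00000.gz` to `cdx-00299.gz` pattern of the download loop.

```python
import gzip
import os
import zlib


def incomplete_index_files(directory, count=300):
    """Return names of expected cdx files that are missing or unreadable."""
    bad = []
    for i in range(count):
        name = "cdx-%05d.gz" % i
        path = os.path.join(directory, name)
        # Missing or empty files have clearly not been downloaded yet.
        if not os.path.exists(path) or os.path.getsize(path) == 0:
            bad.append(name)
            continue
        try:
            with gzip.open(path, "rb") as handle:
                # Read to the end; a truncated gzip stream raises an error.
                while handle.read(1 << 20):
                    pass
        except (OSError, EOFError, zlib.error):
            bad.append(name)
    return bad
```

Calling `incomplete_index_files("2015_40")` returns an empty list once everything is in place. Note that a download cut off exactly at a gzip member boundary would not be detected this way, so comparing file sizes against the server remains the safer check.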
cd ~/net/build
git clone git@github.com:facebook/rocksdb.git
cd rocksdb
PORTABLE=1 make shared_lib static_lib
mkdir -p /home/buck/net/cc/meta/db/2015_40
pv 2015_40/cdx-00???.gz | \
gzip -cd | \
python ~/net/build/DataCollection/metadata/metadatabase.py --cdx 2015_40 cdx | \
/home/buck/net/build/DataCollection/metadata/rocksdb/insertkv /home/buck/net/cc/meta/db/2015_40/
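For orientation, here is a minimal sketch of how one line of a decompressed index file could be read. It assumes the layout of the Common Crawl index as I understand it (a SURT-style key, a timestamp, and a JSON object, separated by single spaces). The sample line is made up, and `parse_cdx_line` is a hypothetical helper, not the logic of metadatabase.py; check a real file before relying on the field names.

```python
import json


def parse_cdx_line(line):
    """Split one index line into (key, timestamp, JSON fields)."""
    key, timestamp, payload = line.rstrip("\n").split(" ", 2)
    return key, timestamp, json.loads(payload)


# Made-up sample line, for illustration only.
sample = (
    'com,example)/ 20151001000000 '
    '{"url": "http://example.com/", '
    '"filename": "crawl-data/example.warc.gz", '
    '"offset": "1234", "length": "567"}'
)

key, timestamp, fields = parse_cdx_line(sample)
# The location of a page is the file name plus offset and length.
location = (fields["filename"], int(fields["offset"]), int(fields["length"]))
```

The offset and length are converted to integers here because, in this assumed layout, they arrive as JSON strings.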
Install pyrocksdb following these instructions: http://pyrocksdb.readthedocs.org/en/latest/installation.html Instead of
make shared_lib
Run
PORTABLE=1 make shared_lib
as above to make the binary independent of the underlying CPU revision
Run the server with
/home/buck/net/build/DataCollection/metadata/md_server.py /PATH_TO_DBS/db/rdb_201*/ -ip 129.215.197.184 -port 8080
(change IP and Port)
The language distribution is computed by feeding raw text dumps (the .wet files) into CLD2.
Example for May 2015 crawl:
# Get filelist
mkdir 2015_22
cd 2015_22
wget https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2015-22/wet.paths.gz
# Convert to HTTPS URLs
zcat wet.paths.gz | sed 's/^/https:\/\/commoncrawl.s3.amazonaws.com\//' > wet.paths.http
# Make subdirectories
for f in `zcat wet.paths.gz | cut -d '/' -f 5 | sort | uniq`; do mkdir -p $f; done;
# Collect monolingual data
# Set number of jobs to roughly half the number of cores, e.g. -j8 on a 16-core machine.
cat wet.paths.http | \
parallel --nice 19 --progress --sshloginfile -j 12 --wd `pwd` /path/to/DataCollection/metadata/extract_monolingual.sh
This will take a few days even on a multicore machine. Re-run the last line to make sure all files are properly processed. Finished files will not be processed again.
Compile the updatekv executable in DataCollection/metadata/leveldb.
xzcat 2015_22/*.internal.warc.wet.gz.langsplit.xz | \
/path/to/DataCollection/metadata/langstats2kv.py 2015_22 | \
gzip -9 > 2015_22.kv.gz
The metadata API allows querying with a partial URL prefix and (optionally) a crawl name.
$ curl "http://data.statmt.org:8080/query_prefix?url=hettahuskies&crawl=2013_20&pretty=1&max_results=2"
{
"query_path": "",
"query_crawl": "2013_20",
"locations": {
"http://hettahuskies.com/": [
{
"filename": "https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2013-20/segments/1368699684236/warc/CC-MAIN-20130516102124-00033-ip-10-60-113-184.ec2.internal.warc.gz",
"length": "4194",
"mime": "UNKNOWN",
"offset": "127252350",
"crawl": "2013_20"
}
],
"http://hettahuskies.com/Location/Areaattractions/aaintro.php": [
{
"filename": "https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2013-20/segments/1368700212265/warc/CC-MAIN-20130516103012-00009-ip-10-60-113-184.ec2.internal.warc.gz",
"length": "4140",
"mime": "UNKNOWN",
"offset": "123806691",
"crawl": "2013_20"
}
]
},
"query_domain": "hettahuskies",
"db_key": "hettahuskies http://hettahuskies",
"skipped_keys": []
}
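The same query can be assembled from Python. The sketch below only builds the request URL and unpacks a response of the shape shown above; it makes no network call. The helper names `build_prefix_query` and `iter_locations` are hypothetical, and the parameter names are the ones visible in the curl example.

```python
import json
from urllib.parse import urlencode

API_ROOT = "http://data.statmt.org:8080"  # host taken from the curl example


def build_prefix_query(url_prefix, crawl=None, max_results=None, pretty=False):
    """Build a query_prefix URL; optional parameters are left out if unset."""
    params = [("url", url_prefix)]
    if crawl is not None:
        params.append(("crawl", crawl))
    if pretty:
        params.append(("pretty", 1))
    if max_results is not None:
        params.append(("max_results", max_results))
    return API_ROOT + "/query_prefix?" + urlencode(params)


def iter_locations(response_text):
    """Yield (page_url, filename, offset, length) from a JSON response."""
    locations = json.loads(response_text)["locations"]
    for page_url, entries in locations.items():
        for entry in entries:
            yield (page_url, entry["filename"],
                   int(entry["offset"]), int(entry["length"]))
```

`build_prefix_query("hettahuskies", crawl="2013_20", max_results=2, pretty=True)` reproduces the URL used in the curl call, and the response body can then be passed to `iter_locations`.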
If 'crawl' is not specified, we get results from all crawls at once. To list the available crawls, use the crawls endpoint:
$ curl "http://data.statmt.org:8080/crawls?pretty=1"
{
"crawls": [
"2012",
"2013_20",
"2014_15",
"2014_23",
"2014_35",
"2014_41",
"2014_42",
"2014_49",
"2014_52",
"2015_06",
"2015_11",
"2015_14",
"2015_18",
"2015_22",
"2015_27"
]
}
The 'locations' field holds URL-location pairs that point into Amazon S3, where the full pages are stored. Each location gives the full URL of the source file together with the offset and length of the page record within it. For example, to get the last entry (http://hettahuskies.com/Location/Areaattractions/aaintro.php), compute the byte range as (offset, offset + length - 1) and download it using e.g. curl:
$ curl -r 123806691-123810830 https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2013-20/segments/1368700212265/warc/CC-MAIN-20130516103012-00009-ip-10-60-113-184.ec2.internal.warc.gz | \
zcat | head -n 25
WARC/1.0
WARC-Type: response
WARC-Date: 2013-05-21T16:36:12Z
WARC-Record-ID: <urn:uuid:fa940850-e840-4dbc-9dda-8a12a983ea7f>
Content-Length: 11265
Content-Type: application/http; msgtype=response
WARC-Warcinfo-ID: <urn:uuid:0e0dc031-834b-4214-8734-436e417840b7>
WARC-Concurrent-To: <urn:uuid:1c9a8321-669c-44af-9d26-162048519f6d>
WARC-IP-Address: 85.13.248.54
WARC-Target-URI: http://hettahuskies.com/Location/Areaattractions/aaintro.php
WARC-Payload-Digest: sha1:FJYCE63U6QKI5EH7N5NVMBQB2GJA4TRE
WARC-Block-Digest: sha1:SBSHBG6E2ED3BPT5CU3QLMFLSF5WXF5G
HTTP/1.0 200 OK
Date: Tue, 21 May 2013 16:36:05 GMT
Content-Type: text/html; charset=UTF-8
Connection: close
Server: Apache
X-Powered-By: PHP/5.2.17
<U+FEFF>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html [...]
The format is: WARC header, empty line, HTTP response header, empty line, HTML content. See download_candidates.py for Python download code that uses connection pools.
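The range arithmetic and the record layout can be sketched in a few lines of Python. This is not the code from download_candidates.py; `byte_range` and `split_record` are hypothetical helpers, and the split assumes that both header blocks end with a CRLF-terminated empty line, which may not hold for every server response.

```python
import gzip


def byte_range(offset, length):
    """Inclusive (first, last) byte positions of a record in the source file."""
    return offset, offset + length - 1


def split_record(raw_gzip_bytes):
    """Decompress one record and split it into its three parts.

    Assumes headers are separated by an empty line with CRLF line endings.
    """
    data = gzip.decompress(raw_gzip_bytes)
    warc_header, rest = data.split(b"\r\n\r\n", 1)
    http_header, body = rest.split(b"\r\n\r\n", 1)
    return warc_header, http_header, body


# Offset and length of the aaintro.php entry from the query result.
first, last = byte_range(123806691, 4140)
```

Here `first` and `last` match the `123806691-123810830` range passed to curl; over plain HTTP the same request corresponds to a `Range: bytes=123806691-123810830` header. The downloaded bytes form a complete gzip stream on their own, which is why they can be decompressed without the rest of the file.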
Results from the metadata API are limited to 10000 per request by default to keep the response size reasonable. Set max_results to a higher value to raise this limit. For batch processing we can access the DB locally and use the C++ interface.