Commit 100c923
Add commoncrawl robots code
kylematoba committed Nov 21, 2024
1 parent 0b81df3 commit 100c923
Showing 18 changed files with 954 additions and 0 deletions.
80 changes: 80 additions & 0 deletions commoncrawl_robots/README_km.md

# Raw `robots.txt` files, 10.2021 - 10.2024
Kyle Matoba, 01.11.2024

This dataset contains the `robots.txt` files from all commoncrawl `response` records between October 2021 and October 2024.

Here are the constituent crawls, with the number of 200-like responses and the time period covered.
```
crawl name           # rows   first datetime         last datetime
CC-MAIN-2021-43    88150767   2021-10-15T19:25:17Z - 2021-10-28T22:35:51Z
CC-MAIN-2021-49    86901347   2021-11-26T22:41:26Z - 2021-12-09T15:24:47Z
CC-MAIN-2022-05    81774170   2022-01-16T09:32:03Z - 2022-01-29T15:17:10Z
CC-MAIN-2022-21    88321885   2022-05-16T04:14:19Z - 2022-05-29T13:38:39Z
CC-MAIN-2022-27    86843004   2022-06-24T21:39:49Z - 2022-07-07T18:40:45Z
CC-MAIN-2022-33    89962948   2022-08-07T15:09:49Z - 2022-08-20T07:24:07Z
CC-MAIN-2022-40    82487425   2022-09-24T15:16:04Z - 2022-10-08T00:04:49Z
CC-MAIN-2022-49    87284951   2022-11-26T08:07:55Z - 2022-12-10T10:42:38Z
CC-MAIN-2023-06    78733681   2023-01-26T21:09:11Z - 2023-02-09T14:25:02Z
CC-MAIN-2023-14    81150267   2023-03-20T08:35:35Z - 2023-04-02T13:50:18Z
CC-MAIN-2023-23    85454018   2023-05-27T22:35:39Z - 2023-06-11T02:30:16Z
CC-MAIN-2023-40    90317330   2023-09-21T07:37:33Z - 2023-10-05T04:20:01Z
CC-MAIN-2023-50   101028224   2023-11-28T08:35:09Z - 2023-12-12T00:04:04Z
CC-MAIN-2024-10    94733131   2024-02-20T21:11:18Z - 2024-03-05T15:40:46Z
CC-MAIN-2024-18    92143840   2024-04-12T10:14:20Z - 2024-04-25T16:02:22Z
CC-MAIN-2024-22    97644816   2024-05-17T23:31:46Z - 2024-05-31T00:00:41Z
CC-MAIN-2024-26    99715321   2024-06-12T14:04:48Z - 2024-06-25T23:27:59Z
CC-MAIN-2024-30    94443660   2024-07-12T09:42:39Z - 2024-07-25T20:50:31Z
CC-MAIN-2024-33    90738661   2024-08-02T23:45:30Z - 2024-08-16T05:58:25Z
CC-MAIN-2024-38    91196581   2024-09-07T09:59:21Z - 2024-09-21T04:50:53Z
CC-MAIN-2024-42    90254842   2024-10-03T09:40:44Z - 2024-10-16T09:06:42Z
```

It consists of 1,879,108,073 rows, about 1.3 TB, in the format described below.

Below I document the steps to recreate it.

# Download commoncrawl dumps

- Get the `aws` command, via `pip install awscli`.
- Then run some variant of `aws s3 ls s3://commoncrawl/crawl-data/CC-MAIN-2024-42/segments/ --recursive | awk '{print $4}' | grep 'robotstxt/.*\.warc\.gz$' | xargs -P 100 -I {} aws s3 cp s3://commoncrawl/{} .`
- It is free to access, cf. `https://commoncrawl.org/get-started`. Note that the number of parallel downloaders (the `-P` argument above) should not be too high, otherwise you risk being throttled by AWS. I found that I was not rate limited with `-P 100`, but ymmv.
- Lately each crawl contains about 90,000 warc files, each around 2 MB, so on the order of 180 GB per crawl.
- Note that in principle it is possible to adapt the pipeline to read directly from S3; see the bottom of https://resiliparse.chatnoir.eu/en/stable/man/fastwarc.html and the sketch below. We didn't know this beforehand, but if starting from scratch it's probably the better approach.
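
For reference, here is a hedged sketch of what streaming a single robots warc straight from S3 could look like with `boto3` and `fastwarc`, in the spirit of the linked docs. It is untested here; the object key is a placeholder and anonymous access to the public bucket is assumed.

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config
from fastwarc.warc import ArchiveIterator, WarcRecordType

# Anonymous client for the public commoncrawl bucket.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
key = "crawl-data/CC-MAIN-2024-42/segments/<segment>/robotstxt/<file>.warc.gz"  # placeholder key
body = s3.get_object(Bucket="commoncrawl", Key=key)["Body"]

# fastwarc iterates file-like objects and, per its docs, detects gzip compression itself.
for record in ArchiveIterator(body, record_types=WarcRecordType.response, parse_http=True):
    print(record.headers["WARC-Target-URI"], record.http_headers.status_code)
    break
```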


# Build a database per dump
It would be nice to build a giant key-value store from the union of all folders in a parallelized fashion.
However, concurrent writes to sqlite do not work well.
Thus, we construct a distinct database for each crawl as an intermediate step.
This is done with the `build_db.py` script; `build_calls.py` (below) submits one Slurm job per crawl to run it.
It features a simple resumption mechanism in case of interruption, tracked in the `metadata` subfolder.

Each database contains a single table `robots`, which has five columns:
- `url`: string, e.g. `http://0v.tittrtb.cn/robots.txt`.
- `timestamp`: string (should be an int), seconds since the epoch in UTC; you can convert a timestamp `t` to a tz-aware datetime with `recovered = dt.datetime.fromtimestamp(t, dt.UTC)`, where `dt` is the `datetime` module (`import datetime as dt`).
- `status_code`: string, the HTTP status code of the response.
- `response`: string (bytes), the body of the fetched `robots.txt`, e.g. `b'User-agent: * \r\nDisallow: /plus/ad_js.php\r\nDisallow: /plus/advancedsearch.php\r\nDisallow: /plus/car.php\r\nDisallow: /plus/carbuyaction.php\r\nDisallow: /plus/shops_buyaction.php\r\nDisallow: /plus/erraddsave.php\r\nDisallow: /plus/posttocar.php\r\nDisallow: /plus/disdls.php\r\nDisallow: /plus/feedback_js.php\r\nDisallow: /plus/mytag_js.php\r\nDisallow: /plus/rss.php\r\nDisallow: /plus/search.php\r\nDisallow: /plus/recommend.php\r\nDisallow: /plus/stow.php\r\nDisallow: /plus/count.php\r\nDisallow: /include\r\nDisallow: /templets'`
- `provenance`: string, the name of the warc file from which the record was extracted, e.g. `CC-MAIN-20211015192439-20211015222439-00005`.

(Note that in the SQLite schema created by `build_db.py`, the status code and the body land in columns named `response` and `content`, respectively.)

We keep only 200-like responses.

There is no key encoded into the database, but logically it is keyed by `(url, timestamp)`.
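
As a quick sanity check, here is a minimal sketch of looking up all stored copies of one site's `robots.txt` in a per-crawl database; the database path and url are hypothetical, and the timestamp decoding assumes it is stored as epoch seconds as documented above.

```python
import datetime as dt
import sqlite3

db_path = "databases/CC-MAIN-2024-42_v4.db"  # hypothetical location
url = "http://example.com/robots.txt"        # hypothetical key

with sqlite3.connect(db_path) as conn:
    rows = conn.execute(
        "SELECT timestamp, provenance FROM robots WHERE url = ? ORDER BY timestamp;",
        (url,),
    ).fetchall()

for timestamp, provenance in rows:
    # Assuming epoch seconds (UTC), as documented above.
    recovered = dt.datetime.fromtimestamp(float(timestamp), dt.UTC)
    print(recovered.isoformat(), provenance)
```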

# Concatenate databases

Next, concatenate the databases together. I found this to take 7-8 minutes per database inserted. It could be sped up by concatenating pairs in parallel.
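
For concreteness, one way to do the concatenation is with SQLite's `ATTACH`; this is a hedged sketch rather than the exact script used here, the paths are hypothetical, and the schema mirrors the `CREATE TABLE` in `build_db.py` below.

```python
import glob
import sqlite3

out_db = "out/everything_v4.db"                    # hypothetical output path
crawl_dbs = sorted(glob.glob("databases/*_v4.db"))

with sqlite3.connect(out_db) as conn:
    conn.execute("PRAGMA journal_mode=OFF;")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS robots "
        "(url TEXT, timestamp TEXT, response TEXT, content TEXT, provenance TEXT);"
    )
    for db in crawl_dbs:
        # Attach each per-crawl database and copy its rows into the big table.
        conn.execute("ATTACH DATABASE ? AS src;", (db,))
        conn.execute("INSERT INTO robots SELECT * FROM src.robots;")
        conn.commit()
        conn.execute("DETACH DATABASE src;")
```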

# Fix large database
We are interested in the most recent `robots.txt` per url, so we run a query on the resulting database that keeps only the entry with the latest date. The exact SQL is in `build_latest_robots.py`; in particular, it creates a new table called `latest_robots`, which contains around 250,000,000 rows.
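
As a hedged sketch of the idea (not necessarily the query actually used in `build_latest_robots.py`), SQLite's documented bare-column behaviour with `MAX()` lets the latest row per `url` be materialized like this; the path is hypothetical and the column names follow `build_db.py`.

```python
import sqlite3

db_path = "out/everything_v4.db"  # hypothetical path
with sqlite3.connect(db_path) as conn:
    # For each url, MAX(timestamp) picks the latest copy and the remaining
    # (bare) columns are taken from that same row.
    conn.execute("""
        CREATE TABLE IF NOT EXISTS latest_robots AS
        SELECT url, MAX(timestamp) AS timestamp, response, content, provenance
        FROM robots
        GROUP BY url;
    """)
    conn.commit()
```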

# Use database to build the last entry per `url`

To do in the next run (already implemented, just needs to be run):
- Use a local, not an absolute, path in `provenance`.
- Store the timestamp and the return code as integers.

55 changes: 55 additions & 0 deletions commoncrawl_robots/analyze_db.py

import os
import argparse
import functools
import sqlite3
from multiprocessing import Pool
from collections import Counter


def get_num_rows(db_fullfilename: str) -> int:
    with sqlite3.connect(db_fullfilename, timeout=30.0) as conn:
        cursor = conn.cursor()
        # max(rowid) is much cheaper than COUNT(*) on tables of this size.
        num_rows = cursor.execute("SELECT MAX(rowid) FROM robots;").fetchall()[0][0]
        return num_rows


def get_first_rows(db_fullfilename: str) -> list:
    with sqlite3.connect(db_fullfilename, timeout=30.0) as conn:
        cursor = conn.cursor()
        response = cursor.execute("SELECT * FROM robots LIMIT 3;").fetchall()
        return response


if __name__ == "__main__":
    version = 4
    base_dir = "/store/swissai/a06/datasets_raw/commoncrawl_kyle/"
    db_dir = os.path.join(base_dir, "databases")
    out_dir = os.path.join(base_dir, "out")

    if False:  # flip to True to inspect each per-crawl database
        db_files = list(filter(lambda _: _.endswith(f"_v{version}.db"), os.listdir(db_dir)))
        print(db_files)
        for db_file in db_files:
            db_fullfilename = os.path.join(db_dir, db_file)
            print(db_file, get_num_rows(db_fullfilename), get_first_rows(db_fullfilename))

    out_filename = f"everything_v{version}.db"
    out_fullfilename = os.path.join(base_dir, "out", out_filename)
    print(out_fullfilename)

    # query = "SELECT name FROM sqlite_master WHERE type='table';"
    query = "SELECT * FROM robots LIMIT 3;"
    with sqlite3.connect(out_fullfilename, timeout=30.0) as conn:
        cursor = conn.cursor()
        cursor.execute(query)
        print(cursor.fetchall())

    print(f"Total rows = {get_num_rows(out_fullfilename)}")

30 changes: 30 additions & 0 deletions commoncrawl_robots/analyze_robots_latest.py
import os
import datetime as dt
import argparse
import functools
import sqlite3
from multiprocessing import Pool
from collections import Counter

if __name__ == "__main__":
    version = 4
    out_dir = "/store/swissai/a06/datasets_raw/commoncrawl_kyle/out"
    db_fullfilename = os.path.join(out_dir, f"everything_v{version}.db")

    print(dt.datetime.now(dt.UTC))
    with sqlite3.connect(db_fullfilename, timeout=30.0) as conn:
        cursor = conn.cursor()
        response = cursor.execute("SELECT * FROM latest_robots LIMIT 3;").fetchall()
        print(response)

        num_rows = cursor.execute("SELECT MAX(rowid) FROM latest_robots;").fetchall()[0][0]
        print(num_rows)

        num_rows = cursor.execute("SELECT COUNT(*) FROM latest_robots;").fetchall()
        print(num_rows)

        cursor.close()

    print(dt.datetime.now(dt.UTC))
    print("done")

66 changes: 66 additions & 0 deletions commoncrawl_robots/build_calls.py
import os
import time

all_idents = [
    "CC-MAIN-2024-42",
    "CC-MAIN-2024-38",
    "CC-MAIN-2024-33",
    "CC-MAIN-2024-30",
    "CC-MAIN-2024-26",
    "CC-MAIN-2024-22",
    "CC-MAIN-2024-18",
    "CC-MAIN-2024-10",
    "CC-MAIN-2023-50",
    "CC-MAIN-2023-40",
    "CC-MAIN-2023-23",
    "CC-MAIN-2023-14",
    "CC-MAIN-2023-06",
    "CC-MAIN-2022-49",
    "CC-MAIN-2022-40",
    "CC-MAIN-2022-33",
    "CC-MAIN-2022-27",
    "CC-MAIN-2022-21",
    "CC-MAIN-2022-05",
    "CC-MAIN-2021-49",
    "CC-MAIN-2021-43"
]
all_idents = list(sorted(all_idents))


if __name__ == "__main__":
    basedir = "/store/swissai/a06/datasets_raw/commoncrawl_kyle"
    logs_dir = os.path.join(basedir, "logs")
    run_dir = os.path.join(basedir, "run")
    db_version = 4

    for ident in all_idents:
        job_name = f"{ident}".replace("CC-MAIN-", "")
        ident_logs_dir = os.path.join(logs_dir, ident)

        os.makedirs(ident_logs_dir, exist_ok=True)
        output_file = os.path.join(ident_logs_dir, "output_%x_%j.log")
        error_file = os.path.join(ident_logs_dir, "output_%x_%j.err")

        sbatch_file = f"""#!/bin/bash
#SBATCH --job-name={job_name}
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=05:59:00
#SBATCH --output={output_file}
#SBATCH --error={error_file}
#SBATCH --no-requeue
#SBATCH --mem=460000
#SBATCH --uenv=prgenv-gnu/24.7:v3
##uenv start --view=default prgenv-gnu/24.7:v3
source ~/localenv312/bin/activate
python3 {basedir}/commoncrawl_robots/build_db.py --crawl_fullfilename={basedir}/{ident} --out_dir={basedir}/databases --db_version={db_version}
"""
        sbatch_fullfilename = os.path.join(run_dir, f"{ident}.sbatch")
        with open(sbatch_fullfilename, "w") as f:
            f.write(sbatch_file)

        os.system(f"sbatch {sbatch_fullfilename}")
        time.sleep(.15)
    print("Done")

144 changes: 144 additions & 0 deletions commoncrawl_robots/build_db.py
# https://docs.python.org/3/library/sqlite3.html
# https://www.sqlitetutorial.net/sqlite-python/creating-database/
# https://www.ionos.com/digitalguide/websites/web-development/sqlite3-python/
import os
import argparse
import logging
import sqlite3
import datetime as dt
from typing import Iterator

import tqdm
import fastwarc

import log_config

logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
logger.addHandler(log_config.get_standard_streamhandler())


def is_okay_statusline(statusline: str) -> bool:
    return statusline.startswith("200")


def extract_fast(warc_fullfilename: str) -> Iterator[tuple]:
    # Yield (url, timestamp, status_code, content, provenance) tuples from one warc file.
    ok_codes = [200]

    _, provenance = os.path.split(warc_fullfilename)
    stream = fastwarc.stream_io.GZipStream(fastwarc.stream_io.FileStream(warc_fullfilename, 'rb'))
    record_types = fastwarc.warc.WarcRecordType.response
    func_filter = lambda _: (_ is not None) and \
                            (_.http_headers is not None) and \
                            (_.http_headers.status_code in ok_codes)
    archive_iterator = fastwarc.warc.ArchiveIterator(stream,
                                                     # func_filter=func_filter,
                                                     record_types=record_types,
                                                     parse_http=True)
    for idx, record in enumerate(archive_iterator):
        timestamp = record.record_date
        url = record.headers['WARC-Target-URI']
        status_code = record.http_headers.status_code
        if status_code in ok_codes:
            content = record.reader.read()
        else:
            content = ""

        row = (url, timestamp, status_code, content, provenance)
        yield row


def initialize_db(db_fullfilename):
    with sqlite3.connect(db_fullfilename, timeout=30.0) as conn:
        conn.execute('PRAGMA journal_mode=OFF;')
        logger.info(f"Opened SQLite database with version {sqlite3.sqlite_version}.")
        cursor = conn.cursor()
        command = """
        CREATE TABLE IF NOT EXISTS robots (
            url TEXT,
            timestamp TEXT,
            response TEXT,    -- receives the HTTP status code
            content TEXT,     -- receives the robots.txt body
            provenance TEXT
        );
        """
        cursor.execute(command)

        cursor.execute('PRAGMA journal_mode=OFF;')
        result = cursor.fetchone()
        logger.info(f"Journal mode set to: {result[0]}")
        cursor.close()
        conn.commit()


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--crawl_fullfilename', type=str, required=True)
    parser.add_argument('--out_dir', type=str, required=True)
    parser.add_argument('--db_version', type=str, required=True)
    args = parser.parse_args()

    crawl_fullfilename = args.crawl_fullfilename
    out_dir = args.out_dir
    os.makedirs(out_dir, exist_ok=True)
    db_version = args.db_version

    commit_every = 5

    _, crawl_ident = os.path.split(crawl_fullfilename)
    db_filename = f"{crawl_ident}_v{db_version}.db"
    db_fullfilename = os.path.join(out_dir, db_filename)
    initialize_db(db_fullfilename)
    warc_list = sorted(os.listdir(crawl_fullfilename))
    total_file_num = len(warc_list)
    metadata_filename = f"{crawl_ident}_v{db_version}.txt"
    metadata_dir = os.path.join(out_dir, "metadata")
    os.makedirs(metadata_dir, exist_ok=True)
    metadata_fullfilename = os.path.join(metadata_dir, metadata_filename)

    if os.path.exists(metadata_fullfilename):
        with open(metadata_fullfilename, 'r') as f:
            line = f.readline()
        line_split = line.split(",")
        assert 2 == len(line_split)
        last_file_num = int(line_split[0])
        read_total_file_num = int(line_split[1])
        assert total_file_num == read_total_file_num
        logger.info(f"Found metadata file {last_file_num} / {total_file_num}")
    else:
        last_file_num = 0

    with sqlite3.connect(db_fullfilename, timeout=30.0) as conn:
        for idx, warc_filename in tqdm.tqdm(enumerate(warc_list), total=total_file_num):
            if idx < last_file_num:
                continue
            warc_fullfilename = os.path.join(crawl_fullfilename, warc_filename)
            # values_old = extract(warc_fullfilename)
            values = extract_fast(warc_fullfilename)
            if False:  # debugging: compare against a reference extractor
                tp0 = next(values_old)
                tp1 = next(values)
                print(tp0)
                print(tp1)
                timestamp = tp1[1].timestamp()
                recovered = dt.datetime.fromtimestamp(timestamp, dt.UTC)
            cursor = conn.cursor()
            cursor.executemany("INSERT OR REPLACE INTO robots VALUES (?, ?, ?, ?, ?)", values)
            cursor.close()

            if idx % commit_every == 0:
                conn.commit()

                # Record how many files are fully processed and committed, so a
                # resumed run starts at the next unprocessed warc file.
                with open(metadata_fullfilename, 'w') as f:
                    line_to_write = f"{idx + 1},{total_file_num}"
                    f.write(line_to_write)

        cursor = conn.cursor()
        num_rows = cursor.execute("SELECT COUNT(*) FROM robots;").fetchall()[0][0]

    logger.info(f"Done building {db_fullfilename}")
    logger.info(f"{len(warc_list)} warc files")
    logger.info(f"{num_rows} total records")
    logger.info(f"{int(num_rows / len(warc_list))} records / warc file")
    logger.info("Done")