forked from huggingface/datatrove
Commit 100c923 (1 parent: 0b81df3): 18 changed files with 954 additions and 0 deletions.

@@ -0,0 +1,80 @@

# Raw robots file, 10.2021 - 10.2024
Kyle Matoba, 01.11.2024

This dataset contains the `robots.txt` files from all commoncrawl `response` records between October 2021 and October 2024.

Here are the constituent crawls, with the number of 200-like responses and the time period covered.
```
crawl name         # rows      first datetime         last datetime
CC-MAIN-2021-43    88150767    2021-10-15T19:25:17Z - 2021-10-28T22:35:51Z
CC-MAIN-2021-49    86901347    2021-11-26T22:41:26Z - 2021-12-09T15:24:47Z
CC-MAIN-2022-05    81774170    2022-01-16T09:32:03Z - 2022-01-29T15:17:10Z
CC-MAIN-2022-21    88321885    2022-05-16T04:14:19Z - 2022-05-29T13:38:39Z
CC-MAIN-2022-27    86843004    2022-06-24T21:39:49Z - 2022-07-07T18:40:45Z
CC-MAIN-2022-33    89962948    2022-08-07T15:09:49Z - 2022-08-20T07:24:07Z
CC-MAIN-2022-40    82487425    2022-09-24T15:16:04Z - 2022-10-08T00:04:49Z
CC-MAIN-2022-49    87284951    2022-11-26T08:07:55Z - 2022-12-10T10:42:38Z
CC-MAIN-2023-06    78733681    2023-01-26T21:09:11Z - 2023-02-09T14:25:02Z
CC-MAIN-2023-14    81150267    2023-03-20T08:35:35Z - 2023-04-02T13:50:18Z
CC-MAIN-2023-23    85454018    2023-05-27T22:35:39Z - 2023-06-11T02:30:16Z
CC-MAIN-2023-40    90317330    2023-09-21T07:37:33Z - 2023-10-05T04:20:01Z
CC-MAIN-2023-50    101028224   2023-11-28T08:35:09Z - 2023-12-12T00:04:04Z
CC-MAIN-2024-10    94733131    2024-02-20T21:11:18Z - 2024-03-05T15:40:46Z
CC-MAIN-2024-18    92143840    2024-04-12T10:14:20Z - 2024-04-25T16:02:22Z
CC-MAIN-2024-22    97644816    2024-05-17T23:31:46Z - 2024-05-31T00:00:41Z
CC-MAIN-2024-26    99715321    2024-06-12T14:04:48Z - 2024-06-25T23:27:59Z
CC-MAIN-2024-30    94443660    2024-07-12T09:42:39Z - 2024-07-25T20:50:31Z
CC-MAIN-2024-33    90738661    2024-08-02T23:45:30Z - 2024-08-16T05:58:25Z
CC-MAIN-2024-38    91196581    2024-09-07T09:59:21Z - 2024-09-21T04:50:53Z
CC-MAIN-2024-42    90254842    2024-10-03T09:40:44Z - 2024-10-16T09:06:42Z
```

It consists of 1,879,108,073 rows, about 1.3 TB, in the format described below.

Below I document the steps to recreate it.

# Download commoncrawl dumps

- Get the `aws` command, via `pip install awscli`.
- Then run some variant of `aws s3 ls s3://commoncrawl/crawl-data/CC-MAIN-2024-42/segments/ --recursive | awk '{print $4}' | grep 'robotstxt/.*\.warc\.gz$' | xargs -P 100 -I {} aws s3 cp s3://commoncrawl/{} .`
- It is free to access, cf. `https://commoncrawl.org/get-started`. Note that the number of parallel downloaders (the `-P` argument above) should not be too high, otherwise you risk being throttled by AWS. In my experience 100 parallel downloads mostly worked but was occasionally rate limited, so ymmv.
- Lately, each crawl contains about 90,000 WARC files, each of which is around 2 MB.
- Note that in principle it is possible to adapt the pipeline to read directly from S3; see the bottom of https://resiliparse.chatnoir.eu/en/stable/man/fastwarc.html. We didn't know this beforehand, but if starting from scratch it is probably the better approach; a rough sketch follows this list.
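A minimal, untested sketch of that direct-from-S3 approach with `boto3` and FastWARC. The object key below is a placeholder, and it assumes AWS credentials are configured just as for the `aws s3 cp` command above:

```python
import boto3
from fastwarc.warc import ArchiveIterator, WarcRecordType

# Placeholder key: substitute any robotstxt/*.warc.gz path returned by `aws s3 ls`.
BUCKET = "commoncrawl"
KEY = "crawl-data/CC-MAIN-2024-42/segments/<segment>/robotstxt/<file>.warc.gz"

s3 = boto3.client("s3")
body = s3.get_object(Bucket=BUCKET, Key=KEY)["Body"]  # streaming, file-like object

# FastWARC accepts file-like objects and detects the gzip compression on its own.
for record in ArchiveIterator(body, record_types=WarcRecordType.response, parse_http=True):
    if record.http_headers is not None and record.http_headers.status_code == 200:
        print(record.headers["WARC-Target-URI"], len(record.reader.read()))
```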

# Build a database per dump

It would be nice to build a giant key-value store from the union of all folders in a parallelized fashion.
However, concurrent writes to sqlite do not work well.
Thus, we construct a distinct database for each crawl as an intermediate step.
This is done with the `build_db.py` script.
It features a simple resumption mechanism, in case of interruption, via checkpoint files in the `metadata` subfolder.
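For reference, the Slurm launcher further down in this commit runs it once per crawl, as `python3 build_db.py --crawl_fullfilename=<directory holding one crawl's warc files> --out_dir=<databases directory> --db_version=4`, and the checkpoint it writes is a single line of the form `<index of last processed warc file>,<total number of warc files>`.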

Each database contains a single table `robots`, which has five columns:
- `url`: string, e.g. `http://0v.tittrtb.cn/robots.txt`.
- `timestamp`: string (should be an int), seconds from the epoch in UTC; you can convert a timestamp `t` to a tz-aware datetime with `recovered = dt.datetime.fromtimestamp(t, dt.UTC)`, where `dt` is `import datetime as dt`.
- `status_code`: string (should be an int), the HTTP status code of the response.
- `response`: string (bytes), the content of the webpage, e.g. `b'User-agent: * \r\nDisallow: /plus/ad_js.php\r\nDisallow: /plus/advancedsearch.php\r\nDisallow: /plus/car.php\r\nDisallow: /plus/carbuyaction.php\r\nDisallow: /plus/shops_buyaction.php\r\nDisallow: /plus/erraddsave.php\r\nDisallow: /plus/posttocar.php\r\nDisallow: /plus/disdls.php\r\nDisallow: /plus/feedback_js.php\r\nDisallow: /plus/mytag_js.php\r\nDisallow: /plus/rss.php\r\nDisallow: /plus/search.php\r\nDisallow: /plus/recommend.php\r\nDisallow: /plus/stow.php\r\nDisallow: /plus/count.php\r\nDisallow: /include\r\nDisallow: /templets'`
- `provenance`: string, the name of the WARC file from which the record was extracted, e.g. `CC-MAIN-20211015192439-20211015222439-00005`.

We keep only 200-like responses.

There is no key encoded into the database, but logically it is keyed by `(url, timestamp)`.
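For example, a minimal sketch of reading back the history of one URL from one of these per-crawl databases, using the column names documented above (the filename is a placeholder):

```python
import sqlite3

db_fullfilename = "CC-MAIN-2024-42_v4.db"    # placeholder
url = "http://0v.tittrtb.cn/robots.txt"      # example url from above

with sqlite3.connect(db_fullfilename, timeout=30.0) as conn:
    # There is no index on url, so this is a full table scan.
    rows = conn.execute(
        "SELECT timestamp, provenance, response FROM robots WHERE url = ? ORDER BY timestamp;",
        (url,),
    ).fetchall()

# One row per capture of this URL, in chronological order.
for timestamp, provenance, response in rows:
    print(timestamp, provenance, len(response))
```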

# Concatenate databases

Next, concatenate the databases together. I found this to take 7-8 minutes per inserted database. It could be sped up by concatenating pairs in parallel.
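The concatenation script itself is not reproduced in this section; as a sketch, the standard sqlite pattern for it is `ATTACH` plus `INSERT INTO ... SELECT`, using the same schema that `build_db.py` creates (paths below are illustrative):

```python
import sqlite3

out_fullfilename = "out/everything_v4.db"   # illustrative
per_crawl_dbs = [
    "databases/CC-MAIN-2021-43_v4.db",      # illustrative, one per crawl
    "databases/CC-MAIN-2021-49_v4.db",
]

with sqlite3.connect(out_fullfilename, timeout=30.0) as conn:
    conn.execute("CREATE TABLE IF NOT EXISTS robots "
                 "(url TEXT, timestamp TEXT, response TEXT, content TEXT, provenance TEXT);")
    for db in per_crawl_dbs:
        conn.execute("ATTACH DATABASE ? AS src;", (db,))
        conn.execute("INSERT INTO robots SELECT * FROM src.robots;")  # the slow step
        conn.commit()
        conn.execute("DETACH DATABASE src;")
```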

# Fix large database
We are interested in the last `robots.txt` per url. So we run a query on the resulting database to keep only the entry with the latest date for each url. The exact SQL call is in `build_latest_robots.py`; in particular, it creates a new table called `latest_robots`, which contains around 250,000,000 rows.
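Since `build_latest_robots.py` is not reproduced in this section, here is a sketch of one way to express "keep the newest row per `url`" in sqlite (the path is illustrative, and ties on `(url, timestamp)` are not deduplicated):

```python
import sqlite3

latest_query = """
CREATE TABLE IF NOT EXISTS latest_robots AS
SELECT r.*
FROM robots AS r
JOIN (SELECT url, MAX(timestamp) AS max_ts FROM robots GROUP BY url) AS m
  ON r.url = m.url AND r.timestamp = m.max_ts;
"""

with sqlite3.connect("out/everything_v4.db", timeout=30.0) as conn:  # illustrative path
    conn.execute(latest_query)  # a long-running full scan over ~1.9B rows
    conn.commit()
```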

# Use the database to build the last entry per `url`

To do on the next run (already implemented, just needs to be run):
- Store a local, not absolute, path in `provenance`.
- Use an integer timestamp and an integer status code (see the sketch below).
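For the integer columns, the intended conversion is roughly the following sketch (`record` stands for a FastWARC response record, as iterated over in `build_db.py`):

```python
import datetime as dt

def to_int_fields(record) -> tuple:
    """Integer timestamp and status code for a FastWARC response record (sketch)."""
    timestamp = int(record.record_date.timestamp())      # seconds since the epoch
    status_code = int(record.http_headers.status_code)   # e.g. 200
    return timestamp, status_code

def to_datetime(timestamp: int) -> dt.datetime:
    """Recover a tz-aware datetime from the stored integer timestamp."""
    return dt.datetime.fromtimestamp(timestamp, dt.UTC)
```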

@@ -0,0 +1,55 @@

import os
import argparse
import functools
import sqlite3
from multiprocessing import Pool
from collections import Counter


def get_num_rows(db_fullfilename: str) -> int:
    with sqlite3.connect(db_fullfilename, timeout=30.0) as conn:
        # print(f"Opened SQLite database with version {sqlite3.sqlite_version}.")
        cursor = conn.cursor()
        # max(rowid) is a fast proxy for the row count; COUNT(1) / COUNT(*) are exact but much slower.
        # num_rows = cursor.execute("SELECT COUNT(1) FROM robots;").fetchall()[0][0]
        # num_rows = cursor.execute("SELECT COUNT(*) FROM robots;").fetchall()[0][0]
        num_rows = cursor.execute("SELECT MAX(rowid) FROM robots;").fetchall()[0][0]
        return num_rows


def get_first_rows(db_fullfilename: str) -> list:
    with sqlite3.connect(db_fullfilename, timeout=30.0) as conn:
        cursor = conn.cursor()
        # sqlite3 connections have no `.tables` attribute; list tables via sqlite_master instead.
        print(cursor.execute("SELECT name FROM sqlite_master WHERE type='table';").fetchall())
        response = cursor.execute("SELECT * FROM robots LIMIT 3;").fetchall()
        return response


if __name__ == "__main__":
    version = 4
    base_dir = "/store/swissai/a06/datasets_raw/commoncrawl_kyle/"
    db_dir = os.path.join(base_dir, "databases")
    out_dir = os.path.join(base_dir, "out")

    # Flip to True to inspect each per-crawl database individually.
    if False:
        db_files = list(filter(lambda _: _.endswith(f"_v{version}.db"), os.listdir(db_dir)))
        print(db_files)
        for db_file in db_files:
            db_fullfilename = os.path.join(db_dir, db_file)
            print(db_file, get_num_rows(db_fullfilename), get_first_rows(db_fullfilename))

    out_filename = f"everything_v{version}.db"
    out_fullfilename = os.path.join(base_dir, "out", out_filename)
    print(out_fullfilename)

    # query = "SELECT name FROM sqlite_master WHERE type='table';"
    query = "SELECT * FROM robots LIMIT 3;"
    with sqlite3.connect(out_fullfilename, timeout=30.0) as conn:
        cursor = conn.cursor()
        cursor.execute(query)
        print(cursor.fetchall())

    print(f"Total rows = {get_num_rows(out_fullfilename)}")

@@ -0,0 +1,30 @@

import os
import datetime as dt
import argparse
import functools
import sqlite3
from multiprocessing import Pool
from collections import Counter

if __name__ == "__main__":
    version = 4
    out_dir = "/store/swissai/a06/datasets_raw/commoncrawl_kyle/out"
    db_fullfilename = os.path.join(out_dir, f"everything_v{version}.db")

    print(dt.datetime.now(dt.UTC))
    with sqlite3.connect(db_fullfilename, timeout=30.0) as conn:
        cursor = conn.cursor()
        response = cursor.execute("SELECT * FROM latest_robots LIMIT 3;").fetchall()
        print(response)

        # max(rowid) is a fast upper bound on the row count.
        num_rows = cursor.execute("SELECT MAX(rowid) FROM latest_robots;").fetchall()[0][0]
        print(num_rows)

        # COUNT(*) is exact, but requires a full scan of the table.
        num_rows = cursor.execute("SELECT COUNT(*) FROM latest_robots;").fetchall()
        print(num_rows)

        cursor.close()

    print(dt.datetime.now(dt.UTC))
    print("done")

@@ -0,0 +1,66 @@

import os
import time

all_idents = [
    "CC-MAIN-2024-42",
    "CC-MAIN-2024-38",
    "CC-MAIN-2024-33",
    "CC-MAIN-2024-30",
    "CC-MAIN-2024-26",
    "CC-MAIN-2024-22",
    "CC-MAIN-2024-18",
    "CC-MAIN-2024-10",
    "CC-MAIN-2023-50",
    "CC-MAIN-2023-40",
    "CC-MAIN-2023-23",
    "CC-MAIN-2023-14",
    "CC-MAIN-2023-06",
    "CC-MAIN-2022-49",
    "CC-MAIN-2022-40",
    "CC-MAIN-2022-33",
    "CC-MAIN-2022-27",
    "CC-MAIN-2022-21",
    "CC-MAIN-2022-05",
    "CC-MAIN-2021-49",
    "CC-MAIN-2021-43",
]
all_idents = sorted(all_idents)


if __name__ == "__main__":
    basedir = "/store/swissai/a06/datasets_raw/commoncrawl_kyle"
    logs_dir = os.path.join(basedir, "logs")
    run_dir = os.path.join(basedir, "run")
    db_version = 4

    # Write and submit one sbatch script per crawl.
    for ident in all_idents:
        job_name = ident.replace("CC-MAIN-", "")
        ident_logs_dir = os.path.join(logs_dir, ident)

        os.makedirs(ident_logs_dir, exist_ok=True)
        output_file = os.path.join(ident_logs_dir, "output_%x_%j.log")
        error_file = os.path.join(ident_logs_dir, "output_%x_%j.err")

        sbatch_file = f"""#!/bin/bash
#SBATCH --job-name={job_name}
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=05:59:00
#SBATCH --output={output_file}
#SBATCH --error={error_file}
#SBATCH --no-requeue
#SBATCH --mem=460000
#SBATCH --uenv=prgenv-gnu/24.7:v3
##uenv start --view=default prgenv-gnu/24.7:v3
source ~/localenv312/bin/activate
python3 {basedir}/commoncrawl_robots/build_db.py --crawl_fullfilename={basedir}/{ident} --out_dir={basedir}/databases --db_version={db_version}
"""
        sbatch_fullfilename = os.path.join(run_dir, f"{ident}.sbatch")
        with open(sbatch_fullfilename, "w") as f:
            f.write(sbatch_file)

        os.system(f"sbatch {sbatch_fullfilename}")
        time.sleep(.15)  # avoid hammering the scheduler
    print("Done")

@@ -0,0 +1,144 @@

# https://docs.python.org/3/library/sqlite3.html
# https://www.sqlitetutorial.net/sqlite-python/creating-database/
# https://www.ionos.com/digitalguide/websites/web-development/sqlite3-python/
import os
import argparse
import logging
import sqlite3
import datetime as dt

import tqdm
import fastwarc

import log_config

logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
logger.addHandler(log_config.get_standard_streamhandler())


def is_okay_statusline(statusline: str) -> bool:
    return statusline.startswith("200")


def extract_fast(warc_fullfilename: str):
    """Yield (url, timestamp, status_code, content, provenance) rows from one WARC file."""
    ok_codes = [200]

    _, provenance = os.path.split(warc_fullfilename)
    stream = fastwarc.stream_io.GZipStream(fastwarc.stream_io.FileStream(warc_fullfilename, 'rb'))
    record_types = fastwarc.warc.WarcRecordType.response
    func_filter = lambda _: (_ is not None) and \
                            (_.http_headers is not None) and \
                            (_.http_headers.status_code in ok_codes)
    archive_iterator = fastwarc.warc.ArchiveIterator(stream,
                                                     # func_filter=func_filter,
                                                     record_types=record_types,
                                                     parse_http=True)
    for idx, record in enumerate(archive_iterator):
        timestamp = record.record_date
        url = record.headers['WARC-Target-URI']
        status_code = record.http_headers.status_code
        if status_code in ok_codes:
            content = record.reader.read()
        else:
            content = ""

        row = (url, timestamp, status_code, content, provenance)
        yield row


def initialize_db(db_fullfilename):
    with sqlite3.connect(db_fullfilename, timeout=30.0) as conn:
        conn.execute('PRAGMA journal_mode=OFF;')
        logger.info(f"Opened SQLite database with version {sqlite3.sqlite_version}.")
        cursor = conn.cursor()
        command = """
        CREATE TABLE IF NOT EXISTS robots (
            url TEXT,
            timestamp TEXT,
            response TEXT,
            content TEXT,
            provenance TEXT
        );
        """
        cursor.execute(command)

        cursor.execute('PRAGMA journal_mode=OFF;')
        result = cursor.fetchone()
        logger.info(f"Journal mode set to: {result[0]}")
        cursor.close()
        conn.commit()


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--crawl_fullfilename', type=str, required=True)
    parser.add_argument('--out_dir', type=str, required=True)
    parser.add_argument('--db_version', type=str, required=True)
    args = parser.parse_args()

    crawl_fullfilename = args.crawl_fullfilename
    out_dir = args.out_dir
    os.makedirs(out_dir, exist_ok=True)
    db_version = args.db_version

    commit_every = 5

    _, crawl_ident = os.path.split(crawl_fullfilename)
    db_filename = f"{crawl_ident}_v{db_version}.db"
    db_fullfilename = os.path.join(out_dir, db_filename)
    initialize_db(db_fullfilename)
    warc_list = sorted(os.listdir(crawl_fullfilename))
    total_file_num = len(warc_list)
    metadata_filename = f"{crawl_ident}_v{db_version}.txt"
    metadata_dir = os.path.join(out_dir, "metadata")
    os.makedirs(metadata_dir, exist_ok=True)
    metadata_fullfilename = os.path.join(metadata_dir, metadata_filename)

    # Resume from the last checkpointed warc file index, if a metadata file exists.
    if os.path.exists(metadata_fullfilename):
        with open(metadata_fullfilename, 'r') as f:
            line = f.readline()
            line_split = line.split(",")
            assert 2 == len(line_split)
            last_file_num = int(line_split[0])
            read_total_file_num = int(line_split[1])
            assert total_file_num == read_total_file_num
        logger.info(f"Found metadata file {last_file_num} / {total_file_num}")
    else:
        last_file_num = 0

    with sqlite3.connect(db_fullfilename, timeout=30.0) as conn:
        for idx, warc_filename in tqdm.tqdm(enumerate(warc_list), total=total_file_num):
            if idx < last_file_num:
                continue
            warc_fullfilename = os.path.join(crawl_fullfilename, warc_filename)
            # values_old = extract(warc_fullfilename)
            values = extract_fast(warc_fullfilename)
            if False:
                # Debugging toggle: compare the first row against the older extractor.
                tp0 = next(values_old)
                tp1 = next(values)
                print(tp0)
                print(tp1)
                timestamp = tp1[1].timestamp()
                recovered = dt.datetime.fromtimestamp(timestamp, dt.UTC)
            cursor = conn.cursor()
            # The datetime in the timestamp field is stored as text via sqlite3's default adapter.
            cursor.executemany("INSERT OR REPLACE INTO robots VALUES (?, ?, ?, ?, ?)", values)
            cursor.close()

            if idx % commit_every == 0:
                conn.commit()

                # Checkpoint progress so an interrupted job can resume.
                with open(metadata_fullfilename, 'w') as f:
                    line_to_write = f"{idx},{total_file_num}"
                    f.write(line_to_write)

        cursor = conn.cursor()
        num_rows = cursor.execute("SELECT COUNT(*) FROM robots;").fetchall()[0][0]

    logger.info(f"Done building {db_fullfilename}")
    logger.info(f"{len(warc_list)} warc files")
    logger.info(f"{num_rows} total records")
    logger.info(f"{int(num_rows / len(warc_list))} records / warc file")
    logger.info("Done")