forked from huggingface/datatrove
Commit 100c923 (1 parent: 0b81df3): 18 changed files with 954 additions and 0 deletions.

@@ -0,0 +1,80 @@

# Raw robots file, 10.2021 - 10.2024
Kyle Matoba, 01.11.2024

This dataset contains the `robots.txt` files from all commoncrawl `response` records between October 2021 and October 2024.

Here are the constituent crawls, with the number of 200-like responses and the time period covered.
```
crawl name         # rows      first datetime         last datetime
CC-MAIN-2021-43    88150767    2021-10-15T19:25:17Z - 2021-10-28T22:35:51Z
CC-MAIN-2021-49    86901347    2021-11-26T22:41:26Z - 2021-12-09T15:24:47Z
CC-MAIN-2022-05    81774170    2022-01-16T09:32:03Z - 2022-01-29T15:17:10Z
CC-MAIN-2022-21    88321885    2022-05-16T04:14:19Z - 2022-05-29T13:38:39Z
CC-MAIN-2022-27    86843004    2022-06-24T21:39:49Z - 2022-07-07T18:40:45Z
CC-MAIN-2022-33    89962948    2022-08-07T15:09:49Z - 2022-08-20T07:24:07Z
CC-MAIN-2022-40    82487425    2022-09-24T15:16:04Z - 2022-10-08T00:04:49Z
CC-MAIN-2022-49    87284951    2022-11-26T08:07:55Z - 2022-12-10T10:42:38Z
CC-MAIN-2023-06    78733681    2023-01-26T21:09:11Z - 2023-02-09T14:25:02Z
CC-MAIN-2023-14    81150267    2023-03-20T08:35:35Z - 2023-04-02T13:50:18Z
CC-MAIN-2023-23    85454018    2023-05-27T22:35:39Z - 2023-06-11T02:30:16Z
CC-MAIN-2023-40    90317330    2023-09-21T07:37:33Z - 2023-10-05T04:20:01Z
CC-MAIN-2023-50    101028224   2023-11-28T08:35:09Z - 2023-12-12T00:04:04Z
CC-MAIN-2024-10    94733131    2024-02-20T21:11:18Z - 2024-03-05T15:40:46Z
CC-MAIN-2024-18    92143840    2024-04-12T10:14:20Z - 2024-04-25T16:02:22Z
CC-MAIN-2024-22    97644816    2024-05-17T23:31:46Z - 2024-05-31T00:00:41Z
CC-MAIN-2024-26    99715321    2024-06-12T14:04:48Z - 2024-06-25T23:27:59Z
CC-MAIN-2024-30    94443660    2024-07-12T09:42:39Z - 2024-07-25T20:50:31Z
CC-MAIN-2024-33    90738661    2024-08-02T23:45:30Z - 2024-08-16T05:58:25Z
CC-MAIN-2024-38    91196581    2024-09-07T09:59:21Z - 2024-09-21T04:50:53Z
CC-MAIN-2024-42    90254842    2024-10-03T09:40:44Z - 2024-10-16T09:06:42Z
```

It consists of 1,879,108,073 rows, about 1.3 TB, in the format described below.

Below I document the steps to recreate it.

# Download commoncrawl dumps

- Get the `aws` command, via `pip install awscli`.
- Then run some variant of `aws s3 ls s3://commoncrawl/crawl-data/CC-MAIN-2024-42/segments/ --recursive | awk '{print $4}' | grep 'robotstxt/.*\.warc\.gz$' | xargs -P 100 -I {} aws s3 cp s3://commoncrawl/{} .`
- It is free to access, cf. `https://commoncrawl.org/get-started`. Note that the number of parallel downloaders (the `-P` argument above) should not be too high, otherwise you risk being throttled by AWS. In my experience 100 parallel downloads mostly worked but was occasionally rate limited, so ymmv.
- Lately, each crawl contains about 90,000 WARC files, each of which is around 2 MB.
- Note that in principle it is possible to adapt the pipeline to read directly from S3; see the bottom of https://resiliparse.chatnoir.eu/en/stable/man/fastwarc.html. We didn't know this beforehand, but if starting from scratch it is probably the better approach; a rough sketch follows this list.
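A minimal, untested sketch of that direct-from-S3 approach with `boto3` and FastWARC. The object key below is a placeholder, and it assumes AWS credentials are configured just as for the `aws s3 cp` command above:

```python
import boto3
from fastwarc.warc import ArchiveIterator, WarcRecordType

# Placeholder key: substitute any robotstxt/*.warc.gz path returned by `aws s3 ls`.
BUCKET = "commoncrawl"
KEY = "crawl-data/CC-MAIN-2024-42/segments/<segment>/robotstxt/<file>.warc.gz"

s3 = boto3.client("s3")
body = s3.get_object(Bucket=BUCKET, Key=KEY)["Body"]  # streaming, file-like object

# FastWARC accepts file-like objects and detects the gzip compression on its own.
for record in ArchiveIterator(body, record_types=WarcRecordType.response, parse_http=True):
    if record.http_headers is not None and record.http_headers.status_code == 200:
        print(record.headers["WARC-Target-URI"], len(record.reader.read()))
```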

# Build a database per dump

It would be nice to build a giant key-value store from the union of all folders in a parallelized fashion.
However, concurrent writes to sqlite do not work well.
Thus, we construct a distinct database for each crawl as an intermediate step.
This is done with the `build_db.py` script.
It features a simple resumption mechanism, in case of interruption, via checkpoint files in the `metadata` subfolder.
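For reference, the Slurm launcher further down in this commit runs it once per crawl, as `python3 build_db.py --crawl_fullfilename=<directory holding one crawl's warc files> --out_dir=<databases directory> --db_version=4`, and the checkpoint it writes is a single line of the form `<index of last processed warc file>,<total number of warc files>`.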

Each database contains a single table `robots`, which has five columns:
- `url`: string, e.g. `http://0v.tittrtb.cn/robots.txt`.
- `timestamp`: string (should be an int), seconds from the epoch in UTC; you can convert a timestamp `t` to a tz-aware datetime with `recovered = dt.datetime.fromtimestamp(t, dt.UTC)`, where `dt` is `import datetime as dt`.
- `status_code`: string (should be an int), the HTTP status code of the response.
- `response`: string (bytes), the content of the webpage, e.g. `b'User-agent: * \r\nDisallow: /plus/ad_js.php\r\nDisallow: /plus/advancedsearch.php\r\nDisallow: /plus/car.php\r\nDisallow: /plus/carbuyaction.php\r\nDisallow: /plus/shops_buyaction.php\r\nDisallow: /plus/erraddsave.php\r\nDisallow: /plus/posttocar.php\r\nDisallow: /plus/disdls.php\r\nDisallow: /plus/feedback_js.php\r\nDisallow: /plus/mytag_js.php\r\nDisallow: /plus/rss.php\r\nDisallow: /plus/search.php\r\nDisallow: /plus/recommend.php\r\nDisallow: /plus/stow.php\r\nDisallow: /plus/count.php\r\nDisallow: /include\r\nDisallow: /templets'`
- `provenance`: string, the name of the WARC file from which the record was extracted, e.g. `CC-MAIN-20211015192439-20211015222439-00005`.

We keep only 200-like responses.

There is no key encoded into the database, but logically it is keyed by `(url, timestamp)`.
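For example, a minimal sketch of reading back the history of one URL from one of these per-crawl databases, using the column names documented above (the filename is a placeholder):

```python
import sqlite3

db_fullfilename = "CC-MAIN-2024-42_v4.db"    # placeholder
url = "http://0v.tittrtb.cn/robots.txt"      # example url from above

with sqlite3.connect(db_fullfilename, timeout=30.0) as conn:
    # There is no index on url, so this is a full table scan.
    rows = conn.execute(
        "SELECT timestamp, provenance, response FROM robots WHERE url = ? ORDER BY timestamp;",
        (url,),
    ).fetchall()

# One row per capture of this URL, in chronological order.
for timestamp, provenance, response in rows:
    print(timestamp, provenance, len(response))
```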

# Concatenate databases

Next, concatenate the databases together. I found this to take 7-8 minutes per inserted database. It could be sped up by concatenating pairs in parallel.
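The concatenation script itself is not reproduced in this section; as a sketch, the standard sqlite pattern for it is `ATTACH` plus `INSERT INTO ... SELECT`, using the same schema that `build_db.py` creates (paths below are illustrative):

```python
import sqlite3

out_fullfilename = "out/everything_v4.db"   # illustrative
per_crawl_dbs = [
    "databases/CC-MAIN-2021-43_v4.db",      # illustrative, one per crawl
    "databases/CC-MAIN-2021-49_v4.db",
]

with sqlite3.connect(out_fullfilename, timeout=30.0) as conn:
    conn.execute("CREATE TABLE IF NOT EXISTS robots "
                 "(url TEXT, timestamp TEXT, response TEXT, content TEXT, provenance TEXT);")
    for db in per_crawl_dbs:
        conn.execute("ATTACH DATABASE ? AS src;", (db,))
        conn.execute("INSERT INTO robots SELECT * FROM src.robots;")  # the slow step
        conn.commit()
        conn.execute("DETACH DATABASE src;")
```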

# Fix large database
We are interested in the last `robots.txt` per url. So we run a query on the resulting database to keep only the entry with the latest date for each url. The exact SQL call is in `build_latest_robots.py`; in particular, it creates a new table called `latest_robots`, which contains around 250,000,000 rows.
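Since `build_latest_robots.py` is not reproduced in this section, here is a sketch of one way to express "keep the newest row per `url`" in sqlite (the path is illustrative, and ties on `(url, timestamp)` are not deduplicated):

```python
import sqlite3

latest_query = """
CREATE TABLE IF NOT EXISTS latest_robots AS
SELECT r.*
FROM robots AS r
JOIN (SELECT url, MAX(timestamp) AS max_ts FROM robots GROUP BY url) AS m
  ON r.url = m.url AND r.timestamp = m.max_ts;
"""

with sqlite3.connect("out/everything_v4.db", timeout=30.0) as conn:  # illustrative path
    conn.execute(latest_query)  # a long-running full scan over ~1.9B rows
    conn.commit()
```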

# Use the database to build the last entry per `url`

To do on the next run (already implemented, just needs to be run):
- Store a local, not absolute, path in `provenance`.
- Use an integer timestamp and an integer status code (see the sketch below).
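For the integer columns, the intended conversion is roughly the following sketch (`record` stands for a FastWARC response record, as iterated over in `build_db.py`):

```python
import datetime as dt

def to_int_fields(record) -> tuple:
    """Integer timestamp and status code for a FastWARC response record (sketch)."""
    timestamp = int(record.record_date.timestamp())      # seconds since the epoch
    status_code = int(record.http_headers.status_code)   # e.g. 200
    return timestamp, status_code

def to_datetime(timestamp: int) -> dt.datetime:
    """Recover a tz-aware datetime from the stored integer timestamp."""
    return dt.datetime.fromtimestamp(timestamp, dt.UTC)
```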

@@ -0,0 +1,55 @@

import os
import argparse
import functools
import sqlite3
from multiprocessing import Pool
from collections import Counter


def get_num_rows(db_fullfilename: str) -> int:
    with sqlite3.connect(db_fullfilename, timeout=30.0) as conn:
        # print(f"Opened SQLite database with version {sqlite3.sqlite_version}.")
        cursor = conn.cursor()
        # max(rowid) is a fast proxy for the row count; COUNT(1) / COUNT(*) are exact but much slower.
        # num_rows = cursor.execute("SELECT COUNT(1) FROM robots;").fetchall()[0][0]
        # num_rows = cursor.execute("SELECT COUNT(*) FROM robots;").fetchall()[0][0]
        num_rows = cursor.execute("SELECT MAX(rowid) FROM robots;").fetchall()[0][0]
        return num_rows


def get_first_rows(db_fullfilename: str) -> list:
    with sqlite3.connect(db_fullfilename, timeout=30.0) as conn:
        cursor = conn.cursor()
        # sqlite3 connections have no `.tables` attribute; list tables via sqlite_master instead.
        print(cursor.execute("SELECT name FROM sqlite_master WHERE type='table';").fetchall())
        response = cursor.execute("SELECT * FROM robots LIMIT 3;").fetchall()
        return response


if __name__ == "__main__":
    version = 4
    base_dir = "/store/swissai/a06/datasets_raw/commoncrawl_kyle/"
    db_dir = os.path.join(base_dir, "databases")
    out_dir = os.path.join(base_dir, "out")

    # Flip to True to inspect each per-crawl database individually.
    if False:
        db_files = list(filter(lambda _: _.endswith(f"_v{version}.db"), os.listdir(db_dir)))
        print(db_files)
        for db_file in db_files:
            db_fullfilename = os.path.join(db_dir, db_file)
            print(db_file, get_num_rows(db_fullfilename), get_first_rows(db_fullfilename))

    out_filename = f"everything_v{version}.db"
    out_fullfilename = os.path.join(base_dir, "out", out_filename)
    print(out_fullfilename)

    # query = "SELECT name FROM sqlite_master WHERE type='table';"
    query = "SELECT * FROM robots LIMIT 3;"
    with sqlite3.connect(out_fullfilename, timeout=30.0) as conn:
        cursor = conn.cursor()
        cursor.execute(query)
        print(cursor.fetchall())

    print(f"Total rows = {get_num_rows(out_fullfilename)}")

@@ -0,0 +1,30 @@

import os
import datetime as dt
import argparse
import functools
import sqlite3
from multiprocessing import Pool
from collections import Counter

if __name__ == "__main__":
    version = 4
    out_dir = "/store/swissai/a06/datasets_raw/commoncrawl_kyle/out"
    db_fullfilename = os.path.join(out_dir, f"everything_v{version}.db")

    print(dt.datetime.now(dt.UTC))
    with sqlite3.connect(db_fullfilename, timeout=30.0) as conn:
        cursor = conn.cursor()
        response = cursor.execute("SELECT * FROM latest_robots LIMIT 3;").fetchall()
        print(response)

        # max(rowid) is a fast upper bound on the row count.
        num_rows = cursor.execute("SELECT MAX(rowid) FROM latest_robots;").fetchall()[0][0]
        print(num_rows)

        # COUNT(*) is exact, but requires a full scan of the table.
        num_rows = cursor.execute("SELECT COUNT(*) FROM latest_robots;").fetchall()
        print(num_rows)

        cursor.close()

    print(dt.datetime.now(dt.UTC))
    print("done")

@@ -0,0 +1,66 @@

import os
import time

all_idents = [
    "CC-MAIN-2024-42",
    "CC-MAIN-2024-38",
    "CC-MAIN-2024-33",
    "CC-MAIN-2024-30",
    "CC-MAIN-2024-26",
    "CC-MAIN-2024-22",
    "CC-MAIN-2024-18",
    "CC-MAIN-2024-10",
    "CC-MAIN-2023-50",
    "CC-MAIN-2023-40",
    "CC-MAIN-2023-23",
    "CC-MAIN-2023-14",
    "CC-MAIN-2023-06",
    "CC-MAIN-2022-49",
    "CC-MAIN-2022-40",
    "CC-MAIN-2022-33",
    "CC-MAIN-2022-27",
    "CC-MAIN-2022-21",
    "CC-MAIN-2022-05",
    "CC-MAIN-2021-49",
    "CC-MAIN-2021-43",
]
all_idents = sorted(all_idents)


if __name__ == "__main__":
    basedir = "/store/swissai/a06/datasets_raw/commoncrawl_kyle"
    logs_dir = os.path.join(basedir, "logs")
    run_dir = os.path.join(basedir, "run")
    db_version = 4

    # Write and submit one sbatch script per crawl.
    for ident in all_idents:
        job_name = ident.replace("CC-MAIN-", "")
        ident_logs_dir = os.path.join(logs_dir, ident)

        os.makedirs(ident_logs_dir, exist_ok=True)
        output_file = os.path.join(ident_logs_dir, "output_%x_%j.log")
        error_file = os.path.join(ident_logs_dir, "output_%x_%j.err")

        sbatch_file = f"""#!/bin/bash
#SBATCH --job-name={job_name}
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=05:59:00
#SBATCH --output={output_file}
#SBATCH --error={error_file}
#SBATCH --no-requeue
#SBATCH --mem=460000
#SBATCH --uenv=prgenv-gnu/24.7:v3
##uenv start --view=default prgenv-gnu/24.7:v3
source ~/localenv312/bin/activate
python3 {basedir}/commoncrawl_robots/build_db.py --crawl_fullfilename={basedir}/{ident} --out_dir={basedir}/databases --db_version={db_version}
"""
        sbatch_fullfilename = os.path.join(run_dir, f"{ident}.sbatch")
        with open(sbatch_fullfilename, "w") as f:
            f.write(sbatch_file)

        os.system(f"sbatch {sbatch_fullfilename}")
        time.sleep(.15)  # avoid hammering the scheduler
    print("Done")

@@ -0,0 +1,144 @@

# https://docs.python.org/3/library/sqlite3.html
# https://www.sqlitetutorial.net/sqlite-python/creating-database/
# https://www.ionos.com/digitalguide/websites/web-development/sqlite3-python/
import os
import argparse
import logging
import sqlite3
import datetime as dt

import tqdm
import fastwarc

import log_config

logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
logger.addHandler(log_config.get_standard_streamhandler())


def is_okay_statusline(statusline: str) -> bool:
    return statusline.startswith("200")


def extract_fast(warc_fullfilename: str):
    """Yield (url, timestamp, status_code, content, provenance) rows from one WARC file."""
    ok_codes = [200]

    _, provenance = os.path.split(warc_fullfilename)
    stream = fastwarc.stream_io.GZipStream(fastwarc.stream_io.FileStream(warc_fullfilename, 'rb'))
    record_types = fastwarc.warc.WarcRecordType.response
    func_filter = lambda _: (_ is not None) and \
                            (_.http_headers is not None) and \
                            (_.http_headers.status_code in ok_codes)
    archive_iterator = fastwarc.warc.ArchiveIterator(stream,
                                                     # func_filter=func_filter,
                                                     record_types=record_types,
                                                     parse_http=True)
    for idx, record in enumerate(archive_iterator):
        timestamp = record.record_date
        url = record.headers['WARC-Target-URI']
        status_code = record.http_headers.status_code
        if status_code in ok_codes:
            content = record.reader.read()
        else:
            content = ""

        row = (url, timestamp, status_code, content, provenance)
        yield row


def initialize_db(db_fullfilename):
    with sqlite3.connect(db_fullfilename, timeout=30.0) as conn:
        conn.execute('PRAGMA journal_mode=OFF;')
        logger.info(f"Opened SQLite database with version {sqlite3.sqlite_version}.")
        cursor = conn.cursor()
        command = """
        CREATE TABLE IF NOT EXISTS robots (
            url TEXT,
            timestamp TEXT,
            response TEXT,
            content TEXT,
            provenance TEXT
        );
        """
        cursor.execute(command)

        cursor.execute('PRAGMA journal_mode=OFF;')
        result = cursor.fetchone()
        logger.info(f"Journal mode set to: {result[0]}")
        cursor.close()
        conn.commit()


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--crawl_fullfilename', type=str, required=True)
    parser.add_argument('--out_dir', type=str, required=True)
    parser.add_argument('--db_version', type=str, required=True)
    args = parser.parse_args()

    crawl_fullfilename = args.crawl_fullfilename
    out_dir = args.out_dir
    os.makedirs(out_dir, exist_ok=True)
    db_version = args.db_version

    commit_every = 5

    _, crawl_ident = os.path.split(crawl_fullfilename)
    db_filename = f"{crawl_ident}_v{db_version}.db"
    db_fullfilename = os.path.join(out_dir, db_filename)
    initialize_db(db_fullfilename)
    warc_list = sorted(os.listdir(crawl_fullfilename))
    total_file_num = len(warc_list)
    metadata_filename = f"{crawl_ident}_v{db_version}.txt"
    metadata_dir = os.path.join(out_dir, "metadata")
    os.makedirs(metadata_dir, exist_ok=True)
    metadata_fullfilename = os.path.join(metadata_dir, metadata_filename)

    # Resume from the last checkpointed warc file index, if a metadata file exists.
    if os.path.exists(metadata_fullfilename):
        with open(metadata_fullfilename, 'r') as f:
            line = f.readline()
            line_split = line.split(",")
            assert 2 == len(line_split)
            last_file_num = int(line_split[0])
            read_total_file_num = int(line_split[1])
            assert total_file_num == read_total_file_num
        logger.info(f"Found metadata file {last_file_num} / {total_file_num}")
    else:
        last_file_num = 0

    with sqlite3.connect(db_fullfilename, timeout=30.0) as conn:
        for idx, warc_filename in tqdm.tqdm(enumerate(warc_list), total=total_file_num):
            if idx < last_file_num:
                continue
            warc_fullfilename = os.path.join(crawl_fullfilename, warc_filename)
            # values_old = extract(warc_fullfilename)
            values = extract_fast(warc_fullfilename)
            if False:
                # Debugging toggle: compare the first row against the older extractor.
                tp0 = next(values_old)
                tp1 = next(values)
                print(tp0)
                print(tp1)
                timestamp = tp1[1].timestamp()
                recovered = dt.datetime.fromtimestamp(timestamp, dt.UTC)
            cursor = conn.cursor()
            # The datetime in the timestamp field is stored as text via sqlite3's default adapter.
            cursor.executemany("INSERT OR REPLACE INTO robots VALUES (?, ?, ?, ?, ?)", values)
            cursor.close()

            if idx % commit_every == 0:
                conn.commit()

                # Checkpoint progress so an interrupted job can resume.
                with open(metadata_fullfilename, 'w') as f:
                    line_to_write = f"{idx},{total_file_num}"
                    f.write(line_to_write)

        cursor = conn.cursor()
        num_rows = cursor.execute("SELECT COUNT(*) FROM robots;").fetchall()[0][0]

    logger.info(f"Done building {db_fullfilename}")
    logger.info(f"{len(warc_list)} warc files")
    logger.info(f"{num_rows} total records")
    logger.info(f"{int(num_rows / len(warc_list))} records / warc file")
    logger.info("Done")