1 parent 22acf9c, commit b0262a8
Showing 10 changed files with 475 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
data/ |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,73 @@ | ||
# Sudachi Benchmark | ||
|
||
Runs Sudachi on large text corpora to measure execution speed and detect bugs. | |
|
||
## Base Scripts | ||
|
||
### benchmark_setup.sh | ||
|
||
Builds Sudachi and the Sudachi dictionaries. | |
|
||
- Extracts the built `sudachi-executable-[VERSION].zip` under `../build/distributions/sudachi/` | |
- Builds `system_small.dic`, `system_core.dic`, and `system_full.dic` under `data/` | |
- Stores the downloaded Sudachi dictionary data under `data/dictdata/` | |
|
||
command: `benchmark_setup.sh [dict_version]` | ||
|
||
- `dict_version`: Sudachi dictionary version (default "20240716") | |
|
||
### benchmark_run.sh | ||
|
||
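Analyzes the given text file with each dictionary type and split mode. | |
Analysis results are written to `/dev/null`; the target file and start/end times are appended to `data/benchmark.log`. | |
|
||
command: `benchmark_run.sh corpus_file [task]` | |
|
||
- `corpus_file`: text file to analyze | |
- `task`: label recorded in the log (default "benchmark") | |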
|
||
### benchmark_multithread.sh | ||
|
||
Runs multiple threads concurrently, each analyzing the given text file. | |
Analysis results are written to `/dev/null`; the target file and start/end times are appended to `data/benchmark.log`. | |
|
||
command: `benchmark_multithread.sh corpus_file [num_thread [dict_type]]` | ||
|
||
- `corpus_file`: text file to analyze | |
- `num_thread`: number of threads to create (default 3) | |
- `dict_type`: dictionary type to use (default "small") | |
|
||
## Scripts | ||
|
||
### kyoto-leads-corpus.sh | ||
|
||
Downloads the [Kyoto University Web Document Leads Corpus](https://github.com/ku-nlp/KWDLC), then runs setup and run. | |
|
||
command: `kyoto-leads-corpus.sh` | ||
|
||
### jawikipedia.sh | ||
|
||
Downloads the [Japanese Wikipedia dump data](https://ja.wikipedia.org/wiki/Wikipedia:%E3%83%87%E3%83%BC%E3%82%BF%E3%83%99%E3%83%BC%E3%82%B9%E3%83%80%E3%82%A6%E3%83%B3%E3%83%AD%E3%83%BC%E3%83%89), then runs setup and run. | |
Because the dump is very large, only the specified size from the beginning is used. | |
|
||
- Data is stored under `data/jawiki_[DUMP_DATE]/`. | |
|
||
command: `jawikipedia.sh [dump_date [size]]` | ||
|
||
- `dump_date`: date the dump was generated (default "20240801") | |
- `size`: size of the text to use (default 100M) | |
|
||
### commoncrawl.sh | ||
|
||
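Downloads [CommonCrawl](https://commoncrawl.org/get-started) data, then runs setup and run. | |
Because the data is very large, only the specified number of pages is used. | |
|
||
Because this corpus serves as a non-Japanese sample, no language identification is performed, and the raw HTML is extracted and used as-is. | |
|
||
- Data is stored under `data/cc[CRAWL_DATE]/`. | |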
|
||
command: `commoncrawl.sh [crawl_date [file_index [num_records]]]` | ||
|
||
- `crawl_date`: date the crawl data was generated (CC-MAIN-\*, default "2024-33") | |
- `file_index`: line number, within the warc.paths file, of the WARC file to use (default 1) | |
- `num_records`: number of records to use, taken from the beginning of the target WARC (default 1000) | |
  - For 2024-33, 1000 records amount to roughly 50M |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,39 @@ | ||
#!/bin/bash | ||
# Analyze the given file with multiple threads running concurrently. | |
# Assumes `benchmark_setup.sh` has been run beforehand. | |
|
||
set -eux | |
# Resolve the corpus path before changing directory so relative paths still work. | |
CORPUS_FILE=$(readlink -f "$1") | |
NUM_THREAD=${2:-3} | |
DICT_TYPE=${3:-"small"} | |
|
||
DIR=$(dirname "$(readlink -f "$0")") | |
cd "${DIR}/.." | |
|
||
SUDACHI_VERSION=$(./gradlew properties --console=plain -q | grep "^version:" | awk '{print $2}') | |
|
||
# Build code | ||
BUILD_DIR="$DIR/../build/distributions" | ||
JAR_FILE="$BUILD_DIR/sudachi/sudachi-${SUDACHI_VERSION}.jar" | ||
SRC_ROOT="${DIR}/src" | ||
SRC_DIR="${SRC_ROOT}/com/worksap/nlp/sudachi/benchmark" | ||
SRC_NAME="TokenizeMultiThread" | ||
|
||
if [ ! -e "${SRC_DIR}/${SRC_NAME}.class" ]; then | |
    javac -cp "${JAR_FILE}" "${SRC_DIR}/${SRC_NAME}.java" | |
fi | |
|
||
# Run | ||
cd "${DIR}" | |
DATA_DIR="$DIR/data" | |
LOGFILE="$DATA_DIR/benchmark.log" | |
|
||
echo "$(date), $SUDACHI_VERSION, multithread ${NUM_THREAD}, ${DICT_TYPE}, begin" >> "$LOGFILE" | |
ls -l "$CORPUS_FILE" >> "$LOGFILE" | |
|
||
java -Dfile.encoding=UTF-8 -cp "${SRC_ROOT}:${JAR_FILE}" \ | |
    "com.worksap.nlp.sudachi.benchmark.${SRC_NAME}" \ | |
    --systemDict "${DATA_DIR}/system_${DICT_TYPE}.dic" \ | |
    -p "$NUM_THREAD" "$CORPUS_FILE" > /dev/null | |
|
||
echo "$(date), $SUDACHI_VERSION, multithread ${NUM_THREAD}, ${DICT_TYPE}, end" >> "$LOGFILE" |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,34 @@ | ||
#!/bin/bash | ||
# Tokenize given file, with each of small/core/full dict and A/B/C mode. | ||
|
||
set -eux | |
# Resolve the corpus path before changing directory so relative paths still work. | |
CORPUS_FILE=$(readlink -f "$1") | |
TASK=${2:-"benchmark"} | |
|
||
DIR=$(dirname "$(readlink -f "$0")") | |
cd "${DIR}/.." | |
|
||
SUDACHI_VERSION=$(./gradlew properties --console=plain -q | grep "^version:" | awk '{print $2}') | |
|
||
# Run benchmark | ||
DATA_DIR=$DIR/data | ||
JAR_DIR="$DIR/../build/distributions/sudachi" | ||
LOGFILE="$DATA_DIR/benchmark.log" | ||
|
||
DICT_TYPES=("small" "core" "full") | ||
SPLIT_MODES=("A" "B" "C") | ||
|
||
echo "" >> "$LOGFILE" | |
echo "$(date), $SUDACHI_VERSION, $TASK, begin" >> "$LOGFILE" | |
ls -l "$CORPUS_FILE" >> "$LOGFILE" | |
for TYPE in "${DICT_TYPES[@]}"; do | |
    DICT_FILE="$DATA_DIR/system_${TYPE}.dic" | |
    for MODE in "${SPLIT_MODES[@]}"; do | |
        echo "$(date), $TYPE, $MODE, begin" >> "$LOGFILE" | |
        java -Dfile.encoding=UTF-8 -jar "$JAR_DIR/sudachi-${SUDACHI_VERSION}.jar" \ | |
            --systemDict "$DICT_FILE" -m "${MODE}" -a \ | |
            "$CORPUS_FILE" > /dev/null | |
        echo "$(date), $TYPE, $MODE, end" >> "$LOGFILE" | |
    done | |
done | |
echo "$(date), $SUDACHI_VERSION, $TASK, end" >> "$LOGFILE" |
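The begin/end lines that the benchmark scripts append to `data/benchmark.log` can be paired to obtain elapsed times. The following is a minimal Python sketch of that pairing, not part of the commit. It assumes the timestamps are ISO 8601 (for example, if the scripts called `date -Iseconds` instead of plain `date`); with the default `date` output a different timestamp parser would be needed. The function name `elapsed_seconds`, the version `0.7.4`, and the file path in the sample are made up for illustration.

```python
from datetime import datetime


def elapsed_seconds(log_lines):
    """Pair 'begin'/'end' log lines by label; return (label, seconds) tuples."""
    started = {}
    results = []
    for line in log_lines:
        fields = [f.strip() for f in line.split(",")]
        # Skip blank lines and the `ls -l` line, which have no begin/end marker.
        if len(fields) < 3 or fields[-1] not in ("begin", "end"):
            continue
        try:
            # ASSUMPTION: ISO 8601 timestamps, not the default `date` format.
            stamp = datetime.fromisoformat(fields[0])
        except ValueError:
            continue
        label = ", ".join(fields[1:-1])
        if fields[-1] == "begin":
            started[label] = stamp
        elif label in started:
            delta = stamp - started.pop(label)
            results.append((label, delta.total_seconds()))
    return results


# Hypothetical sample in the shape written by benchmark_run.sh.
sample = [
    "2024-08-15T10:00:00+09:00, 0.7.4, benchmark, begin",
    "-rw-r--r-- 1 user user 1024 Aug 15 09:59 /tmp/leads.txt",
    "2024-08-15T10:00:00+09:00, small, A, begin",
    "2024-08-15T10:00:42+09:00, small, A, end",
    "2024-08-15T10:01:30+09:00, 0.7.4, benchmark, end",
]
for label, seconds in elapsed_seconds(sample):
    print(f"{label}: {seconds:.0f}s")
```

Because the label is everything between the timestamp and the begin/end marker, the same pairing also covers the five-field lines written by `benchmark_multithread.sh`.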
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,62 @@ | ||
#!/bin/bash | ||
# Build Sudachi and build small/core/full dictionary with it. | ||
|
||
set -eux | ||
DIR=$(dirname "$(readlink -f "$0")") | ||
cd "${DIR}/.." | ||
|
||
SUDACHI_VERSION=$(./gradlew properties --console=plain -q | grep "^version:" | awk '{print $2}') | |
|
||
DICT_VERSION=${1:-"20240716"} | ||
|
||
# Build Sudachi | ||
./gradlew build | ||
BUILD_DIR="$DIR/../build/distributions" | ||
JAR_DIR="$BUILD_DIR/sudachi" | ||
if [ -e "$JAR_DIR" ]; then | |
    rm -r "$JAR_DIR" | |
fi | |
unzip -d "$JAR_DIR" "$BUILD_DIR/sudachi-executable-$SUDACHI_VERSION.zip" | ||
|
||
# Get dictionary data | ||
DATA_DIR=$DIR/data | ||
DICT_DIR=$DIR/data/dictdata | ||
mkdir -p "$DICT_DIR" | ||
|
||
RAW_DICT_BASEURL="http://sudachi.s3-website-ap-northeast-1.amazonaws.com/sudachidict-raw" | ||
|
||
DICT_FILES=("small_lex" "core_lex" "notcore_lex") | ||
for TYPE in "${DICT_FILES[@]}"; do | |
    if [ ! -e "$DICT_DIR/${TYPE}.csv" ]; then | |
        ZIPFILE=${TYPE}.zip | |
        if [ ! -e "$DICT_DIR/$ZIPFILE" ]; then | |
            wget "$RAW_DICT_BASEURL/$DICT_VERSION/$ZIPFILE" -P "$DICT_DIR" | |
        fi | |
        unzip -d "$DICT_DIR" "$DICT_DIR/$ZIPFILE" | |
    fi | |
done | |
|
||
MATRIX_FILE="matrix.def" | ||
if [ ! -e "$DICT_DIR/$MATRIX_FILE" ]; then | |
    ZIPFILE=${MATRIX_FILE}.zip | |
    # Check for the archive where wget stores it, not in the current directory. | |
    if [ ! -e "$DICT_DIR/$ZIPFILE" ]; then | |
        wget "$RAW_DICT_BASEURL/$ZIPFILE" -P "$DICT_DIR" | |
    fi | |
    unzip -d "$DICT_DIR" "$DICT_DIR/$ZIPFILE" | |
fi | |
|
||
# Build dictionary | ||
DICT_TYPES=("small" "core" "full") | ||
|
||
for i in $(seq 0 2); do | |
    TYPE=${DICT_TYPES[$i]} | |
    DICT_FILE="$DATA_DIR/system_${TYPE}.dic" | |
    if [ ! -e "$DICT_FILE" ]; then | |
        # Each type takes the first i+1 lexicons: small, then +core, then +notcore. | |
        FILES=() | |
        for v in "${DICT_FILES[@]:0:$((i + 1))}"; do FILES+=("$DICT_DIR/${v}.csv"); done | |
        java -Dfile.encoding=UTF-8 -cp "$JAR_DIR/sudachi-${SUDACHI_VERSION}.jar" \ | |
            com.worksap.nlp.sudachi.dictionary.DictionaryBuilder \ | |
            -o "$DICT_FILE" \ | |
            -m "$DICT_DIR/$MATRIX_FILE" \ | |
            "${FILES[@]}" | |
    fi | |
done |
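The dictionary build loop in this script feeds a cumulative set of lexicon CSVs to each dictionary type: `small` uses `small_lex`, `core` adds `core_lex`, and `full` adds `notcore_lex`. A minimal Python sketch of that selection follows, purely as an illustration of the slicing. The helper name `lexicon_inputs` is hypothetical, and the default directory mirrors the script's `data/dictdata`.

```python
# Lexicon names and dictionary types, in the same order as in benchmark_setup.sh.
DICT_FILES = ["small_lex", "core_lex", "notcore_lex"]
DICT_TYPES = ["small", "core", "full"]


def lexicon_inputs(dict_dir="data/dictdata"):
    """Map each dictionary type to the lexicon CSVs it is built from."""
    return {
        dict_type: [f"{dict_dir}/{name}.csv" for name in DICT_FILES[: i + 1]]
        for i, dict_type in enumerate(DICT_TYPES)
    }


for dict_type, files in lexicon_inputs().items():
    print(dict_type, files)
```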
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,36 @@ | ||
#!/bin/bash | ||
# Run benchmark with CommonCrawl (raw HTML) | ||
|
||
set -eux | ||
DIR=$(dirname "$(readlink -f "$0")") | ||
|
||
CRAWL_DATE=${1:-"2024-33"} | ||
LINE=${2:-"1"} # use n-th file in path file | ||
NUM_RECORDS=${3:-"1000"} # take first n records | ||
|
||
# Download CommonCrawl | ||
DATA_DIR="$DIR/data/cc${CRAWL_DATE}" | ||
mkdir -p "$DATA_DIR" | ||
|
||
CCURL="https://data.commoncrawl.org" | ||
BASEURL="${CCURL}/crawl-data/CC-MAIN-${CRAWL_DATE}" | ||
|
||
PATHFILE="${DATA_DIR}/warc.paths" | ||
if [ ! -e "${PATHFILE}" ]; then | |
    curl -L "${BASEURL}/warc.paths.gz" | gzip -dc > "$PATHFILE" | |
fi | |
|
||
CORPUS_WARC="$DATA_DIR/${LINE}.warc" | ||
FILEURL="${CCURL}/$(head -n "${LINE}" "${PATHFILE}" | tail -n 1)" | |
if [ ! -e "${CORPUS_WARC}" ]; then | |
    curl -L "$FILEURL" | gzip -dc > "$CORPUS_WARC" | |
fi | |
|
||
# extract HTML | |
CORPUS_FILE="$DATA_DIR/${LINE}.txt" | |
# Call the extractor via $DIR so the script works from any working directory. | |
python "$DIR/process_warc.py" -i "${CORPUS_WARC}" -o "${CORPUS_FILE}" -n "${NUM_RECORDS}" | |
|
||
# setup & run | ||
"$DIR/benchmark_setup.sh" | |
"$DIR/benchmark_run.sh" "$CORPUS_FILE" "commoncrawl_${CRAWL_DATE}_${LINE}_${NUM_RECORDS}" |
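The `file_index` argument selects one line of `warc.paths` (1-based), and the script prefixes that line with the CommonCrawl data host to form the download URL. A minimal Python sketch of that selection is shown below, as an illustration only. The function name `warc_url` and the sample paths are hypothetical; the base URL is the one used in the script.

```python
def warc_url(paths_text, line_no, base="https://data.commoncrawl.org"):
    """Return the download URL for the line_no-th (1-based) entry of warc.paths."""
    lines = paths_text.splitlines()
    if not 1 <= line_no <= len(lines):
        raise ValueError(f"line_no must be between 1 and {len(lines)}")
    return f"{base}/{lines[line_no - 1]}"


# Hypothetical two-line warc.paths content.
sample_paths = "crawl-data/example/a.warc.gz\ncrawl-data/example/b.warc.gz\n"
print(warc_url(sample_paths, 2))
```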
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,36 @@ | ||
#!/bin/bash | ||
# Run benchmark with Japanese Wikipedia (first 100M of extracted text by default) | |
|
||
set -eux | ||
DIR=$(dirname "$(readlink -f "$0")") | ||
|
||
DUMP_DATE=${1:-"20240801"} | ||
SIZE=${2:-"100M"} | ||
|
||
# Download Wikipedia dump (ja) | ||
DATA_DIR=$DIR/data/jawiki_${DUMP_DATE} | ||
mkdir -p "$DATA_DIR" | ||
|
||
## full dump is too large (>15GB), take first split. | ||
BASEURL="https://dumps.wikimedia.org/jawiki/${DUMP_DATE}" | ||
FILEURL="${BASEURL}/jawiki-${DUMP_DATE}-pages-articles1.xml-p1p114794.bz2" | ||
CORPUS_XML="$DATA_DIR/jawiki_${DUMP_DATE}_1.xml" | ||
|
||
if [ ! -e "$CORPUS_XML" ]; then | |
    curl -L "$FILEURL" | bzip2 -dc > "$CORPUS_XML" | |
fi | |
|
||
# extract | ||
CORPUS_FILE="$DATA_DIR/wiki_00" | ||
|
||
## assumes wikiextractor is installed (https://github.com/attardi/wikiextractor) | |
if [ ! -e "$CORPUS_FILE" ]; then | |
    python -m wikiextractor.WikiExtractor "$CORPUS_XML" -o "$DATA_DIR" -b "${SIZE}" | |
    mv "$DATA_DIR"/AA/* "$DATA_DIR" | |
    rm -r "$DATA_DIR/AA" | |
fi | |
|
||
# setup & run | ||
"$DIR/benchmark_setup.sh" | |
"$DIR/benchmark_run.sh" "$CORPUS_FILE" "jawiki_${DUMP_DATE}_${SIZE}" | |
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,18 @@ | ||
#!/bin/bash | ||
# Run benchmark with Kyoto Leads Corpus | ||
|
||
set -eux | ||
DIR=$(dirname "$(readlink -f "$0")") | ||
|
||
# Download Kyoto Leads corpus original texts | ||
DATA_DIR=$DIR/data | ||
mkdir -p "$DATA_DIR" | ||
|
||
CORPUS_FILE="$DATA_DIR/leads.txt" | ||
if [ ! -e "$CORPUS_FILE" ]; then | |
    curl -L https://github.com/ku-nlp/KWDLC/releases/download/release_1_0/leads.org.txt.gz | gzip -dc > "$CORPUS_FILE" | |
fi | |
|
||
# Setup & run | ||
"$DIR/benchmark_setup.sh" | |
"$DIR/benchmark_run.sh" "$CORPUS_FILE" "kyoto-leads" |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,53 @@ | ||
""" | ||
Extract HTML from .warc file. | ||
""" | ||
|
||
import argparse as ap | ||
from pathlib import Path | ||
from tqdm import tqdm | ||
|
||
from warcio.archiveiterator import ArchiveIterator | ||
|
||
|
||
def parse_args() -> ap.Namespace: | |
    parser = ap.ArgumentParser() | |
    parser.add_argument("-i", "--input", type=Path, required=True, | |
                        help="input warc file") | |
    parser.add_argument("-o", "--output", type=Path, default="output.txt", | |
                        help="output text file") | |
    parser.add_argument("-n", "--num-records", type=int, default=None, | |
                        help="how many records to dump. dump all if not set.") | |
|
||
    args = parser.parse_args() | |
    return args | |
|
||
|
||
def main(): | |
    args = parse_args() | |
|
||
    with args.input.open("rb") as fi, args.output.open("wb") as fo: | |
        count = 0 | |
        for record in tqdm(ArchiveIterator(fi)): | |
            if (args.num_records is not None) and (count >= args.num_records): | |
                break | |
|
||
            try: | |
                if record.rec_type != "response": | |
                    continue | |
                contenttype = record.http_headers.get_header("Content-Type") | |
                # Skip records with no Content-Type header or non-HTML content. | |
                if not contenttype or not contenttype.startswith("text/html"): | |
                    continue | |
|
||
                # dump raw html | |
                content = record.content_stream().read() | |
                fo.write(content) | |
|
||
                count += 1 | |
            except Exception: | |
                # Skip malformed records, but let KeyboardInterrupt propagate. | |
                continue | |
    print(f"count: {count}") | |
|
||
|
||
if __name__ == "__main__": | |
    main() |
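The script keeps a record when its `Content-Type` header starts with `text/html`. As a possible variation, not what the commit does, the check could be factored into a small helper that also tolerates a missing header and mixed-case values. The sketch below assumes a hypothetical helper named `is_html`; it compares only the media type, so it is slightly stricter than the prefix check above.

```python
def is_html(content_type):
    """Return True when a Content-Type header value denotes HTML."""
    if not content_type:
        # Header missing or empty: treat as non-HTML.
        return False
    # Drop parameters such as "; charset=UTF-8" and normalize case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "text/html"


for value in ("text/html; charset=UTF-8", "Text/HTML", "application/json", None):
    print(repr(value), is_html(value))
```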