Switch to ICU tokenizer #939
Open
firefoxci-taskcluster / merge-corpus-ru-en
succeeded
Nov 22, 2024 in 18m 16s
FirefoxCI (pull_request)
merge corpus for ru-en
Task Status
Started: 2024-11-22T23:07:44.790Z
Resolved: 2024-11-22T23:08:47.157Z
Task Execution Time: 1 minute, 2 seconds, 367 milliseconds
Task Status: completed
Reason Resolved: completed
RunId: 0
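The Task Execution Time above is the difference between the Started and Resolved timestamps. A minimal Python sketch of that arithmetic follows; the helper name `execution_time` is illustrative and not part of Taskcluster, and the timestamp layout is assumed from the two fields shown here.

```python
from datetime import datetime, timedelta

# Timestamp layout assumed from the Started/Resolved fields (UTC, trailing "Z").
TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def execution_time(started: str, resolved: str) -> timedelta:
    """Return the elapsed time between two UTC timestamps in TS_FORMAT."""
    return datetime.strptime(resolved, TS_FORMAT) - datetime.strptime(started, TS_FORMAT)


# Values copied from the Task Status fields.
elapsed = execution_time("2024-11-22T23:07:44.790Z", "2024-11-22T23:08:47.157Z")
minutes, seconds = divmod(elapsed.seconds, 60)
milliseconds = elapsed.microseconds // 1000
print(f"{minutes} minute, {seconds} seconds, {milliseconds} milliseconds")
```

This prints "1 minute, 2 seconds, 367 milliseconds", matching the reported value.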
Artifacts
- public/build/corpus.en.zst
- public/build/corpus.ru.zst
- public/build/corpus.sample.txt
- public/build/corpus.stats.json
- public/logs/live_backing.log
- public/logs/live.log
[taskcluster 2024-11-22 23:07:44.845Z] Task ID: CMk3dMWqQrOhiX5_lJExuQ
[taskcluster 2024-11-22 23:07:44.845Z] Worker ID: 8661717521703678052
[taskcluster 2024-11-22 23:07:44.845Z] Worker Group: us-central1-c
[taskcluster 2024-11-22 23:07:44.845Z] Worker Node Type: projects/887720501152/machineTypes/n2-highmem-32
[taskcluster 2024-11-22 23:07:44.845Z] Worker Pool: translations-1/b-linux-large-gcp-300gb
[taskcluster 2024-11-22 23:07:44.845Z] Worker Version: 38.0.5
[taskcluster 2024-11-22 23:07:44.845Z] Public IP: 35.239.96.222
[taskcluster 2024-11-22 23:07:44.845Z] Hostname: translations-1-b-linux-large-gcp-300gb-pb6e-lxgsmwgez0rqf6luw
[taskcluster 2024-11-22 23:07:44.845Z] using cache "translations-level-1-checkouts-v3-7afeb851dd97df8f3607-KnyIE1GvSz67R9mjL97Now" -> /builds/worker/checkouts
[taskcluster 2024-11-22 23:07:48.504Z] Downloading artifact "public/image.tar.zst" from task ID: KnyIE1GvSz67R9mjL97Now.
[taskcluster 2024-11-22 23:07:53.507Z] Download Progress: 71.03%
[taskcluster 2024-11-22 23:07:56.328Z] Downloaded artifact successfully.
[taskcluster 2024-11-22 23:07:56.328Z] Downloaded 287.207 mb
[taskcluster 2024-11-22 23:07:56.329Z] Decompressing downloaded image
[taskcluster 2024-11-22 23:07:58.439Z] Loading docker image from downloaded archive.
[taskcluster 2024-11-22 23:08:29.834Z] Image 'public/image.tar.zst' from task 'KnyIE1GvSz67R9mjL97Now' loaded. Using image ID sha256:d31e1900b8212f46ff27eab4217df610f5d7a124bb4975b4b8ea07a64443f3ba.
[taskcluster 2024-11-22 23:08:30.007Z] === Task Starting ===
[setup 2024-11-22T23:08:31.021Z] run-task started in /builds/worker
[setup 2024-11-22T23:08:31.021Z] Invoked by command: --firefox_translations_training-checkout=/builds/worker/checkouts/vcs/ -- bash -c pip install -r $VCS_PATH/pipeline/clean/requirements/merge.txt && export PYTHONPATH=$PYTHONPATH:$VCS_PATH && python3 $VCS_PATH/pipeline/clean/merge-corpus.py --src ru --trg en --artifacts $TASK_WORKDIR/artifacts --name corpus --max_lines 1000 --datasets_glob "$MOZ_FETCHES_DIR/*.zst"
[setup 2024-11-22T23:08:31.021Z] Python version: 3.10.12
[cache 2024-11-22T23:08:31.023Z] cache /builds/worker/checkouts is empty; writing requirements: gid=1000 uid=1000 version=1
[volume 2024-11-22T23:08:31.023Z] volume /builds/worker/checkouts is a cache
[setup 2024-11-22T23:08:31.023Z] running as worker:worker
[vcs 2024-11-22T23:08:31.023Z] executing ['git', 'config', '--global', '--add', 'safe.directory', '/builds/worker/checkouts/vcs']
[vcs 2024-11-22T23:08:31.026Z] executing ['git', 'clone', 'https://github.com/mozilla/translations', '/builds/worker/checkouts/vcs']
[vcs 2024-11-22T23:08:31.027Z] Cloning into '/builds/worker/checkouts/vcs'...
[vcs 2024-11-22T23:08:32.980Z] executing ['git', 'fetch', '--tags', '--force', 'https://github.com/mozilla/translations', 'icu_tokenizer']
[vcs 2024-11-22T23:08:33.153Z] From https://github.com/mozilla/translations
[vcs 2024-11-22T23:08:33.153Z] * branch icu_tokenizer -> FETCH_HEAD
[vcs 2024-11-22T23:08:33.162Z] executing ['git', 'fetch', '--no-tags', 'https://github.com/mozilla/translations', 'icu_tokenizer']
[vcs 2024-11-22T23:08:33.347Z] From https://github.com/mozilla/translations
[vcs 2024-11-22T23:08:33.347Z] * branch icu_tokenizer -> FETCH_HEAD
[vcs 2024-11-22T23:08:33.356Z] executing ['git', 'checkout', '-f', '-B', 'icu_tokenizer', 'd585a63a6abc04ece83e26ce51a0caa2f7fa21e6']
[vcs 2024-11-22T23:08:34.069Z] Switched to a new branch 'icu_tokenizer'
[vcs 2024-11-22T23:08:34.092Z] executing ['git', 'submodule', 'init']
[vcs 2024-11-22T23:08:34.114Z] Submodule '3rd_party/browsermt-marian-dev' (https://github.com/browsermt/marian-dev) registered for path '3rd_party/browsermt-marian-dev'
[vcs 2024-11-22T23:08:34.115Z] Submodule 'extract-lex' (https://github.com/marian-nmt/extract-lex) registered for path '3rd_party/extract-lex'
[vcs 2024-11-22T23:08:34.115Z] Submodule 'fast_align' (https://github.com/clab/fast_align) registered for path '3rd_party/fast_align'
[vcs 2024-11-22T23:08:34.116Z] Submodule '3rd_party/kenlm' (https://github.com/kpu/kenlm) registered for path '3rd_party/kenlm'
[vcs 2024-11-22T23:08:34.116Z] Submodule '3rd_party/marian-dev' (https://github.com/marian-nmt/marian-dev) registered for path '3rd_party/marian-dev'
[vcs 2024-11-22T23:08:34.117Z] Submodule '3rd_party/preprocess' (https://github.com/kpu/preprocess.git) registered for path '3rd_party/preprocess'
[vcs 2024-11-22T23:08:34.117Z] Submodule 'inference/3rd_party/browsermt-marian-dev' (https://github.com/browsermt/marian-dev) registered for path 'inference/3rd_party/browsermt-marian-dev'
[vcs 2024-11-22T23:08:34.118Z] Submodule 'inference/3rd_party/emsdk' (https://github.com/emscripten-core/emsdk.git) registered for path 'inference/3rd_party/emsdk'
[vcs 2024-11-22T23:08:34.118Z] Submodule 'inference/3rd_party/ssplit-cpp' (https://github.com/browsermt/ssplit-cpp) registered for path 'inference/3rd_party/ssplit-cpp'
[vcs 2024-11-22T23:08:34.119Z] executing ['git', 'submodule', 'update', '--force']
[vcs 2024-11-22T23:08:34.143Z] Cloning into '/builds/worker/checkouts/vcs/3rd_party/browsermt-marian-dev'...
[vcs 2024-11-22T23:08:35.321Z] Cloning into '/builds/worker/checkouts/vcs/3rd_party/extract-lex'...
[vcs 2024-11-22T23:08:35.605Z] Cloning into '/builds/worker/checkouts/vcs/3rd_party/fast_align'...
[vcs 2024-11-22T23:08:35.949Z] Cloning into '/builds/worker/checkouts/vcs/3rd_party/kenlm'...
[vcs 2024-11-22T23:08:36.559Z] Cloning into '/builds/worker/checkouts/vcs/3rd_party/marian-dev'...
[vcs 2024-11-22T23:08:38.070Z] Cloning into '/builds/worker/checkouts/vcs/3rd_party/preprocess'...
[vcs 2024-11-22T23:08:38.550Z] Cloning into '/builds/worker/checkouts/vcs/inference/3rd_party/browsermt-marian-dev'...
[vcs 2024-11-22T23:08:39.698Z] Cloning into '/builds/worker/checkouts/vcs/inference/3rd_party/emsdk'...
[vcs 2024-11-22T23:08:40.198Z] Cloning into '/builds/worker/checkouts/vcs/inference/3rd_party/ssplit-cpp'...
[vcs 2024-11-22T23:08:40.620Z] Submodule path '3rd_party/browsermt-marian-dev': checked out '11c6ae7c46be21ef96ed10c60f28022fa968939f'
[vcs 2024-11-22T23:08:40.631Z] Submodule path '3rd_party/extract-lex': checked out '42fa605b53f32eaf6c6e0b5677255c21c91b3d49'
[vcs 2024-11-22T23:08:40.643Z] Submodule path '3rd_party/fast_align': checked out 'cab1e9aac8d3bb02ff5ae58218d8d225a039fa11'
[vcs 2024-11-22T23:08:40.672Z] Submodule path '3rd_party/kenlm': checked out 'bbf4fc511266c5d4515047055d7bdec659a6e158'
[vcs 2024-11-22T23:08:40.788Z] Submodule path '3rd_party/marian-dev': checked out 'e8a1a2530fb84cbff7383302ebca393e5875c441'
[vcs 2024-11-22T23:08:40.809Z] Submodule path '3rd_party/preprocess': checked out '64307314b4d5a9a0bd529b5c1036b0710d995eec'
[vcs 2024-11-22T23:08:40.884Z] Submodule path 'inference/3rd_party/browsermt-marian-dev': checked out '2781d735d4a10dca876d61be587afdab2726293c'
[vcs 2024-11-22T23:08:40.905Z] Submodule path 'inference/3rd_party/emsdk': checked out '2346baa7bb44a4a0571cc75f1986ab9aaa35aa03'
[vcs 2024-11-22T23:08:40.921Z] Submodule path 'inference/3rd_party/ssplit-cpp': checked out 'a311f9865ade34db1e8e080e6cc146f55dafb067'
[vcs 2024-11-22T23:08:40.921Z] cleaning git checkout...
[vcs 2024-11-22T23:08:40.921Z] executing ['git', 'clean', '-nxdff']
[vcs 2024-11-22T23:08:40.925Z] removing []
[vcs 2024-11-22T23:08:40.925Z] successfully cleaned git checkout!
[vcs 2024-11-22T23:08:40.927Z] TinderboxPrint:<a href='https://github.com/mozilla/translations/commit/d585a63a6abc04ece83e26ce51a0caa2f7fa21e6' title='Built from translations commit d585a63a6abc04ece83e26ce51a0caa2f7fa21e6'>d585a63a6abc04ece83e26ce51a0caa2f7fa21e6</a>
[setup 2024-11-22T23:08:40.927Z] MOZ_FETCHES_DIR is /builds/worker/fetches
[fetches 2024-11-22T23:08:40.927Z] fetching artifacts
[fetches 2024-11-22T23:08:40.927Z] executing ['/usr/bin/python3', '-u', '/usr/local/bin/fetch-content', 'task-artifacts']
attempt 1/5
Downloading https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/AIY73J0JRXanRkyp5XJp2A/artifacts/public/build/ELRC-web_acquired_data_related_to_scientific_resea_78c4de.en.zst to /builds/worker/fetches/ELRC-web_acquired_data_related_to_scientific_resea_78c4de.en.zst
attempt 1/5
Downloading https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/AIY73J0JRXanRkyp5XJp2A/artifacts/public/build/ELRC-web_acquired_data_related_to_scientific_resea_78c4de.ru.zst to /builds/worker/fetches/ELRC-web_acquired_data_related_to_scientific_resea_78c4de.ru.zst
attempt 1/5
Downloading https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/AIY73J0JRXanRkyp5XJp2A/artifacts/public/build/ELRC-web_acquired_data_related_to_scientific_resea_78c4de.en.zst
Downloading https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/e1usRcXfTzCAagS0_21b9A/artifacts/public/build/ELRC-3075-wikipedia_health_v1.en.zst to /builds/worker/fetches/ELRC-3075-wikipedia_health_v1.en.zst
attempt 1/5
Downloading https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/e1usRcXfTzCAagS0_21b9A/artifacts/public/build/ELRC-3075-wikipedia_health_v1.en.zst
Downloading https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/AIY73J0JRXanRkyp5XJp2A/artifacts/public/build/ELRC-web_acquired_data_related_to_scientific_resea_78c4de.ru.zst
Downloading https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/e1usRcXfTzCAagS0_21b9A/artifacts/public/build/ELRC-3075-wikipedia_health_v1.ru.zst to /builds/worker/fetches/ELRC-3075-wikipedia_health_v1.ru.zst
attempt 1/5
Downloading https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/E6OGzex0TUuRpJ9kzAhVyg/artifacts/public/build/ada83_v1.en.zst to /builds/worker/fetches/ada83_v1.en.zst
attempt 1/5
Downloading https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/e1usRcXfTzCAagS0_21b9A/artifacts/public/build/ELRC-3075-wikipedia_health_v1.ru.zst
Downloading https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/E6OGzex0TUuRpJ9kzAhVyg/artifacts/public/build/ada83_v1.ru.zst to /builds/worker/fetches/ada83_v1.ru.zst
Downloading https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/E6OGzex0TUuRpJ9kzAhVyg/artifacts/public/build/ada83_v1.en.zst
Downloading https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/E6OGzex0TUuRpJ9kzAhVyg/artifacts/public/build/ada83_v1.ru.zst
attempt 1/5
Downloading https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/Kd3wDfigT_-rzZx25C5y0A/artifacts/public/build/gcp_pytest-dataset_a0017e.en.zst to /builds/worker/fetches/gcp_pytest-dataset_a0017e.en.zst
attempt 1/5
Downloading https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/Kd3wDfigT_-rzZx25C5y0A/artifacts/public/build/gcp_pytest-dataset_a0017e.ru.zst to /builds/worker/fetches/gcp_pytest-dataset_a0017e.ru.zst
Downloading https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/Kd3wDfigT_-rzZx25C5y0A/artifacts/public/build/gcp_pytest-dataset_a0017e.en.zst
attempt 1/5
Downloading https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/RcSp41SmREiuG3yT1h6UHg/artifacts/public/build/dedupe.tar.zst to /builds/worker/fetches/dedupe.tar.zst
Downloading https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/Kd3wDfigT_-rzZx25C5y0A/artifacts/public/build/gcp_pytest-dataset_a0017e.ru.zst
Downloading https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/RcSp41SmREiuG3yT1h6UHg/artifacts/public/build/dedupe.tar.zst
https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/RcSp41SmREiuG3yT1h6UHg/artifacts/public/build/dedupe.tar.zst resolved to 133246 bytes with sha256 6b021bdc0013dbd8e676afda47ef0a8eab66813947095922d605028bff5eb4a4 in 0.096s
Verified size of https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/RcSp41SmREiuG3yT1h6UHg/artifacts/public/build/dedupe.tar.zst
Extracting /builds/worker/fetches/dedupe.tar.zst to /builds/worker/fetches
/builds/worker/fetches/dedupe.tar.zst extracted in 0.003s
Removing /builds/worker/fetches/dedupe.tar.zst
https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/e1usRcXfTzCAagS0_21b9A/artifacts/public/build/ELRC-3075-wikipedia_health_v1.ru.zst resolved to 217818 bytes with sha256 ba83e0374222d835cd622d0aca582769b4647b5ad9d8139b6e87731eedc14911 in 0.139s
Verified size of https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/e1usRcXfTzCAagS0_21b9A/artifacts/public/build/ELRC-3075-wikipedia_health_v1.ru.zst
https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/AIY73J0JRXanRkyp5XJp2A/artifacts/public/build/ELRC-web_acquired_data_related_to_scientific_resea_78c4de.en.zst resolved to 46555 bytes with sha256 6a65093e36a78b7d51c7dc05bc006e554d1b0543bd61ffd86366699f2b7e4b94 in 0.166s
Verified size of https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/AIY73J0JRXanRkyp5XJp2A/artifacts/public/build/ELRC-web_acquired_data_related_to_scientific_resea_78c4de.en.zst
https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/E6OGzex0TUuRpJ9kzAhVyg/artifacts/public/build/ada83_v1.ru.zst resolved to 164352 bytes with sha256 ff886a99c18146f398bd8413875f9edfca4e5f435da9159b57fa1654908ba60b in 0.164s
Verified size of https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/E6OGzex0TUuRpJ9kzAhVyg/artifacts/public/build/ada83_v1.ru.zst
https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/AIY73J0JRXanRkyp5XJp2A/artifacts/public/build/ELRC-web_acquired_data_related_to_scientific_resea_78c4de.ru.zst resolved to 69084 bytes with sha256 ef3fddedf0dc4703e2e3e40ea3ed3a08330c99bf7acafe9adc08750a4de2a18f in 0.169s
Verified size of https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/AIY73J0JRXanRkyp5XJp2A/artifacts/public/build/ELRC-web_acquired_data_related_to_scientific_resea_78c4de.ru.zst
https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/e1usRcXfTzCAagS0_21b9A/artifacts/public/build/ELRC-3075-wikipedia_health_v1.en.zst resolved to 153565 bytes with sha256 2baf27cd62c99605d2a98f234021c5f61ed36c802663daebc43370ae49d747ba in 0.173s
Verified size of https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/e1usRcXfTzCAagS0_21b9A/artifacts/public/build/ELRC-3075-wikipedia_health_v1.en.zst
https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/E6OGzex0TUuRpJ9kzAhVyg/artifacts/public/build/ada83_v1.en.zst resolved to 109864 bytes with sha256 13b7d6a907c0be2c53e25235d3fc7a6a866d5013276c1eb95af481e9a117aab7 in 0.185s
Verified size of https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/E6OGzex0TUuRpJ9kzAhVyg/artifacts/public/build/ada83_v1.en.zst
https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/Kd3wDfigT_-rzZx25C5y0A/artifacts/public/build/gcp_pytest-dataset_a0017e.ru.zst resolved to 755 bytes with sha256 37b8ecdecfaaefbbcc0555a44d8f1201c09d9e0c58e5a1ee8fec159ddee93375 in 0.220s
Verified size of https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/Kd3wDfigT_-rzZx25C5y0A/artifacts/public/build/gcp_pytest-dataset_a0017e.ru.zst
https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/Kd3wDfigT_-rzZx25C5y0A/artifacts/public/build/gcp_pytest-dataset_a0017e.en.zst resolved to 539 bytes with sha256 1584216350b567c80f2e92b665881d19b4466596d4c848af460c1c16bfd193ab in 0.238s
Verified size of https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/Kd3wDfigT_-rzZx25C5y0A/artifacts/public/build/gcp_pytest-dataset_a0017e.en.zst
PERFHERDER_DATA: {"framework": {"name": "build_metrics"}, "suites": [{"name": "fetch_content", "value": 0.24614094300000033, "lowerIsBetter": true, "shouldAlert": false, "subtests": []}]}
[fetches 2024-11-22T23:08:41.263Z] finished fetching artifacts
[task 2024-11-22T23:08:41.263Z] executing ['bash', '-c', 'pip install -r $VCS_PATH/pipeline/clean/requirements/merge.txt && export PYTHONPATH=$PYTHONPATH:$VCS_PATH && python3 $VCS_PATH/pipeline/clean/merge-corpus.py --src ru --trg en --artifacts $TASK_WORKDIR/artifacts --name corpus --max_lines 1000 --datasets_glob "$MOZ_FETCHES_DIR/*.zst"']
[task 2024-11-22T23:08:41.638Z] WARNING: The directory '/builds/worker/.cache/pip' or its parent directory is not owned or is not writable by the current user. The cache has been disabled. Check the permissions and owner of that directory. If executing pip with sudo, you should use sudo's -H flag.
[task 2024-11-22T23:08:41.639Z] Defaulting to user installation because normal site-packages is not writeable
[task 2024-11-22T23:08:41.817Z] Collecting certifi==2024.7.4
[task 2024-11-22T23:08:41.903Z] Downloading certifi-2024.7.4-py3-none-any.whl (162 kB)
[task 2024-11-22T23:08:41.939Z] ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 163.0/163.0 KB 5.1 MB/s eta 0:00:00
[task 2024-11-22T23:08:42.170Z] Collecting charset-normalizer==3.3.2
[task 2024-11-22T23:08:42.184Z] Downloading charset_normalizer-3.3.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (142 kB)
[task 2024-11-22T23:08:42.201Z] ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 142.1/142.1 KB 8.9 MB/s eta 0:00:00
[task 2024-11-22T23:08:42.236Z] Collecting idna==3.7
[task 2024-11-22T23:08:42.249Z] Downloading idna-3.7-py3-none-any.whl (66 kB)
[task 2024-11-22T23:08:42.256Z] ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 66.8/66.8 KB 12.7 MB/s eta 0:00:00
[task 2024-11-22T23:08:42.464Z] Collecting psutil==6.0.0
[task 2024-11-22T23:08:42.478Z] Downloading psutil-6.0.0-cp36-abi3-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (290 kB)
[task 2024-11-22T23:08:42.513Z] ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 290.5/290.5 KB 8.7 MB/s eta 0:00:00
[task 2024-11-22T23:08:42.577Z] Collecting requests==2.31.0
[task 2024-11-22T23:08:42.591Z] Downloading requests-2.31.0-py3-none-any.whl (62 kB)
[task 2024-11-22T23:08:42.597Z] ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 62.6/62.6 KB 15.3 MB/s eta 0:00:00
[task 2024-11-22T23:08:42.670Z] Collecting urllib3==2.2.2
[task 2024-11-22T23:08:42.683Z] Downloading urllib3-2.2.2-py3-none-any.whl (121 kB)
[task 2024-11-22T23:08:42.697Z] ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 121.4/121.4 KB 9.7 MB/s eta 0:00:00
[task 2024-11-22T23:08:42.777Z] Installing collected packages: urllib3, psutil, idna, charset-normalizer, certifi, requests
[task 2024-11-22T23:08:43.183Z] Successfully installed certifi-2024.7.4 charset-normalizer-3.3.2 idna-3.7 psutil-6.0.0 requests-2.31.0 urllib3-2.2.2
[task 2024-11-22T23:08:43.397Z] [merge-corpus] - /builds/worker/fetches/ELRC-3075-wikipedia_health_v1.en.zst (153.6 KB)
[task 2024-11-22T23:08:43.397Z] [merge-corpus] - /builds/worker/fetches/ELRC-3075-wikipedia_health_v1.ru.zst (217.8 KB)
[task 2024-11-22T23:08:43.397Z] [merge-corpus] - /builds/worker/fetches/ELRC-web_acquired_data_related_to_scientific_resea_78c4de.en.zst (46.6 KB)
[task 2024-11-22T23:08:43.397Z] [merge-corpus] - /builds/worker/fetches/ELRC-web_acquired_data_related_to_scientific_resea_78c4de.ru.zst (69.1 KB)
[task 2024-11-22T23:08:43.397Z] [merge-corpus] - /builds/worker/fetches/ada83_v1.en.zst (109.9 KB)
[task 2024-11-22T23:08:43.397Z] [merge-corpus] - /builds/worker/fetches/ada83_v1.ru.zst (164.4 KB)
[task 2024-11-22T23:08:43.397Z] [merge-corpus] - /builds/worker/fetches/gcp_pytest-dataset_a0017e.en.zst (539.0 B)
[task 2024-11-22T23:08:43.397Z] [merge-corpus] - /builds/worker/fetches/gcp_pytest-dataset_a0017e.ru.zst (755.0 B)
[task 2024-11-22T23:08:43.397Z] [merge-corpus] Parallel datasets:
[task 2024-11-22T23:08:43.398Z] [downloads] Reading lines from: /builds/worker/fetches/ELRC-3075-wikipedia_health_v1.ru.zst
[task 2024-11-22T23:08:43.398Z] [merge-corpus] Reading dataset /builds/worker/fetches/ELRC-3075-wikipedia_health_v1.ru.zst
[task 2024-11-22T23:08:43.398Z] [downloads] Reading lines from: /builds/worker/fetches/ELRC-3075-wikipedia_health_v1.en.zst
[task 2024-11-22T23:08:43.398Z] [merge-corpus] Reading dataset /builds/worker/fetches/ELRC-3075-wikipedia_health_v1.en.zst
[task 2024-11-22T23:08:43.425Z] [downloads] Reading lines from: /builds/worker/fetches/ELRC-web_acquired_data_related_to_scientific_resea_78c4de.ru.zst
[task 2024-11-22T23:08:43.425Z] [merge-corpus] Reading dataset /builds/worker/fetches/ELRC-web_acquired_data_related_to_scientific_resea_78c4de.ru.zst
[task 2024-11-22T23:08:43.426Z] [downloads] Reading lines from: /builds/worker/fetches/ELRC-web_acquired_data_related_to_scientific_resea_78c4de.en.zst
[task 2024-11-22T23:08:43.426Z] [merge-corpus] Reading dataset /builds/worker/fetches/ELRC-web_acquired_data_related_to_scientific_resea_78c4de.en.zst
[task 2024-11-22T23:08:43.434Z] [downloads] Reading lines from: /builds/worker/fetches/ada83_v1.ru.zst
[task 2024-11-22T23:08:43.434Z] [merge-corpus] Reading dataset /builds/worker/fetches/ada83_v1.ru.zst
[task 2024-11-22T23:08:43.435Z] [downloads] Reading lines from: /builds/worker/fetches/ada83_v1.en.zst
[task 2024-11-22T23:08:43.435Z] [merge-corpus] Reading dataset /builds/worker/fetches/ada83_v1.en.zst
[task 2024-11-22T23:08:43.462Z] [downloads] Reading lines from: /builds/worker/fetches/gcp_pytest-dataset_a0017e.ru.zst
[task 2024-11-22T23:08:43.462Z] [merge-corpus] Reading dataset /builds/worker/fetches/gcp_pytest-dataset_a0017e.ru.zst
[task 2024-11-22T23:08:43.462Z] [downloads] Reading lines from: /builds/worker/fetches/gcp_pytest-dataset_a0017e.en.zst
[task 2024-11-22T23:08:43.463Z] [merge-corpus] Reading dataset /builds/worker/fetches/gcp_pytest-dataset_a0017e.en.zst
[task 2024-11-22T23:08:43.469Z] [merge-corpus] Stream in:
[task 2024-11-22T23:08:43.469Z] [merge-corpus] - /builds/worker/artifacts/corpus.ru.zst
[task 2024-11-22T23:08:43.469Z] [merge-corpus] - /builds/worker/artifacts/corpus.en.zst
[task 2024-11-22T23:08:43.469Z] [merge-corpus] Write a 10,000 line sample of the merged corpus:
[task 2024-11-22T23:08:43.469Z] [merge-corpus] - /builds/worker/artifacts/corpus.sample.txt
[fetches 2024-11-22T23:08:43.493Z] removing /builds/worker/fetches
[fetches 2024-11-22T23:08:43.493Z] finished
[taskcluster 2024-11-22 23:08:45.592Z] === Task Finished ===
[taskcluster 2024-11-22 23:08:46.511Z] Successful task run with exit code: 0 completed in 61.667 seconds
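Most lines in this log share a "[component timestamp] message" prefix, with the taskcluster lines using a space between date and time and the run-task lines using "T". The following is a minimal Python sketch for splitting such lines into fields; the names `LogLine` and `parse_line` are hypothetical, and the pattern is inferred only from the lines in this log, not from any Taskcluster specification.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional

# Pattern inferred from this log: "[component DATE(T| )TIME.mmmZ] message".
LINE_RE = re.compile(
    r"^\[(?P<component>[\w-]+) "
    r"(?P<date>\d{4}-\d{2}-\d{2})[T ](?P<time>\d{2}:\d{2}:\d{2}\.\d{3})Z\] "
    r"(?P<message>.*)$"
)


class LogLine(NamedTuple):
    component: str       # e.g. "taskcluster", "vcs", "task"
    timestamp: datetime  # naive datetime; the log's "Z" suffix means UTC
    message: str


def parse_line(line: str) -> Optional[LogLine]:
    """Parse one prefixed log line; return None for unprefixed lines."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    stamp = datetime.strptime(
        f"{match['date']}T{match['time']}", "%Y-%m-%dT%H:%M:%S.%f"
    )
    return LogLine(match["component"], stamp, match["message"])


# Both timestamp variants from this log parse the same way.
final = parse_line(
    "[taskcluster 2024-11-22 23:08:46.511Z] Successful task run with exit code: 0 completed in 61.667 seconds"
)
stream = parse_line("[task 2024-11-22T23:08:43.469Z] [merge-corpus] Stream in:")
print(final.component, final.timestamp.isoformat(timespec="milliseconds"))
print(stream.component, stream.message)
```

Lines without the bracketed prefix, such as the "attempt 1/5" and "Downloading ..." output from fetch-content, return None and can be attached to the preceding prefixed line if needed.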