Switch to ICU tokenizer #939

Open
wants to merge 6 commits into
base: main

Use ICU system package

d585a63
firefoxci-taskcluster / merge-mono-trg-en succeeded Nov 22, 2024 in 18m 26s

FirefoxCI (pull_request)

merge mono for en

Details

View task in Taskcluster | View logs in Taskcluster | View task group in Taskcluster

Task Status

Started: 2024-11-22T23:08:50.792Z
Resolved: 2024-11-22T23:08:58.433Z
Task Execution Time: 7 seconds, 641 milliseconds
Task Status: completed
Reason Resolved: completed
RunId: 0

Artifacts

- public/build/mono.en.sample.txt
- public/build/mono.en.stats.json
- public/build/mono.en.zst
- public/logs/live_backing.log
- public/logs/live.log


[taskcluster 2024-11-22 23:08:50.893Z] Task ID: Cz9fCCvAT3-ivThmX5E9Vw
[taskcluster 2024-11-22 23:08:50.893Z] Worker ID: 8661717521703678052
[taskcluster 2024-11-22 23:08:50.893Z] Worker Group: us-central1-c
[taskcluster 2024-11-22 23:08:50.893Z] Worker Node Type: projects/887720501152/machineTypes/n2-highmem-32
[taskcluster 2024-11-22 23:08:50.893Z] Worker Pool: translations-1/b-linux-large-gcp-300gb
[taskcluster 2024-11-22 23:08:50.893Z] Worker Version: 38.0.5
[taskcluster 2024-11-22 23:08:50.893Z] Public IP: 35.239.96.222
[taskcluster 2024-11-22 23:08:50.893Z] Hostname: translations-1-b-linux-large-gcp-300gb-pb6e-lxgsmwgez0rqf6luw
[taskcluster 2024-11-22 23:08:50.893Z] using cache "translations-level-1-checkouts-v3-7afeb851dd97df8f3607-KnyIE1GvSz67R9mjL97Now" -> /builds/worker/checkouts

[taskcluster 2024-11-22 23:08:51.520Z] Image 'public/image.tar.zst' from task 'KnyIE1GvSz67R9mjL97Now' loaded.  Using image ID sha256:d31e1900b8212f46ff27eab4217df610f5d7a124bb4975b4b8ea07a64443f3ba.
[taskcluster 2024-11-22 23:08:51.576Z] === Task Starting ===
[setup 2024-11-22T23:08:52.066Z] run-task started in /builds/worker
[setup 2024-11-22T23:08:52.066Z] Invoked by command: --firefox_translations_training-checkout=/builds/worker/checkouts/vcs/ -- bash -c pip install -r $VCS_PATH/pipeline/clean/requirements/merge.txt && export PYTHONPATH=$PYTHONPATH:$VCS_PATH && python3 $VCS_PATH/pipeline/clean/merge-mono.py --parallel_corpus $MOZ_FETCHES_DIR/corpus/corpus.en.zst --output $TASK_WORKDIR/artifacts/mono.en.zst --max_sentences 1000 --datasets_glob "$MOZ_FETCHES_DIR/*.zst"
[setup 2024-11-22T23:08:52.066Z] Python version: 3.10.12
[cache 2024-11-22T23:08:52.067Z] cache /builds/worker/checkouts exists; requirements: gid=1000 uid=1000 version=1
[volume 2024-11-22T23:08:52.067Z] volume /builds/worker/checkouts is a cache
[setup 2024-11-22T23:08:52.067Z] running as worker:worker
[vcs 2024-11-22T23:08:52.068Z] executing ['git', 'config', '--global', '--add', 'safe.directory', '/builds/worker/checkouts/vcs']
[vcs 2024-11-22T23:08:52.070Z] executing ['git', 'fetch', '--tags', '--force', 'https://github.com/mozilla/translations', 'icu_tokenizer']
[vcs 2024-11-22T23:08:52.291Z] From https://github.com/mozilla/translations
[vcs 2024-11-22T23:08:52.291Z]  * branch            icu_tokenizer -> FETCH_HEAD
[vcs 2024-11-22T23:08:52.299Z] executing ['git', 'fetch', '--no-tags', 'https://github.com/mozilla/translations', 'icu_tokenizer']
[vcs 2024-11-22T23:08:52.470Z] From https://github.com/mozilla/translations
[vcs 2024-11-22T23:08:52.470Z]  * branch            icu_tokenizer -> FETCH_HEAD
[vcs 2024-11-22T23:08:52.478Z] executing ['git', 'checkout', '-f', '-B', 'icu_tokenizer', 'd585a63a6abc04ece83e26ce51a0caa2f7fa21e6']
[vcs 2024-11-22T23:08:52.484Z] Reset branch 'icu_tokenizer'
[vcs 2024-11-22T23:08:52.505Z] executing ['git', 'submodule', 'init']
[vcs 2024-11-22T23:08:52.527Z] executing ['git', 'submodule', 'update', '--force']
[vcs 2024-11-22T23:08:52.629Z] Submodule path '3rd_party/browsermt-marian-dev': checked out '11c6ae7c46be21ef96ed10c60f28022fa968939f'
[vcs 2024-11-22T23:08:52.642Z] Submodule path '3rd_party/extract-lex': checked out '42fa605b53f32eaf6c6e0b5677255c21c91b3d49'
[vcs 2024-11-22T23:08:52.656Z] Submodule path '3rd_party/fast_align': checked out 'cab1e9aac8d3bb02ff5ae58218d8d225a039fa11'
[vcs 2024-11-22T23:08:52.687Z] Submodule path '3rd_party/kenlm': checked out 'bbf4fc511266c5d4515047055d7bdec659a6e158'
[vcs 2024-11-22T23:08:52.820Z] Submodule path '3rd_party/marian-dev': checked out 'e8a1a2530fb84cbff7383302ebca393e5875c441'
[vcs 2024-11-22T23:08:52.842Z] Submodule path '3rd_party/preprocess': checked out '64307314b4d5a9a0bd529b5c1036b0710d995eec'
[vcs 2024-11-22T23:08:52.922Z] Submodule path 'inference/3rd_party/browsermt-marian-dev': checked out '2781d735d4a10dca876d61be587afdab2726293c'
[vcs 2024-11-22T23:08:52.943Z] Submodule path 'inference/3rd_party/emsdk': checked out '2346baa7bb44a4a0571cc75f1986ab9aaa35aa03'
[vcs 2024-11-22T23:08:52.960Z] Submodule path 'inference/3rd_party/ssplit-cpp': checked out 'a311f9865ade34db1e8e080e6cc146f55dafb067'
[vcs 2024-11-22T23:08:52.960Z] cleaning git checkout...
[vcs 2024-11-22T23:08:52.960Z] executing ['git', 'clean', '-nxdff']
[vcs 2024-11-22T23:08:52.964Z] removing ['/builds/worker/checkouts/vcs/pipeline/common/__pycache__/']
[vcs 2024-11-22T23:08:52.964Z] successfully cleaned git checkout!
[vcs 2024-11-22T23:08:52.966Z] TinderboxPrint:<a href='https://github.com/mozilla/translations/commit/d585a63a6abc04ece83e26ce51a0caa2f7fa21e6' title='Built from translations commit d585a63a6abc04ece83e26ce51a0caa2f7fa21e6'>d585a63a6abc04ece83e26ce51a0caa2f7fa21e6</a>
[setup 2024-11-22T23:08:52.966Z] MOZ_FETCHES_DIR is /builds/worker/fetches
[fetches 2024-11-22T23:08:52.966Z] fetching artifacts
[fetches 2024-11-22T23:08:52.966Z] executing ['/usr/bin/python3', '-u', '/usr/local/bin/fetch-content', 'task-artifacts']
attempt 1/5
Downloading https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/YgHPpWXiQ1GZN_fLFdNjQw/artifacts/public/build/news_2007.en.zst to /builds/worker/fetches/news_2007.en.zst
attempt 1/5
Downloading https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/OBONr18VQ9SHFhHjY4htrA/artifacts/public/build/tldr-pages_v2023-08-29.en.zst to /builds/worker/fetches/tldr-pages_v2023-08-29.en.zst
attempt 1/5
Downloading https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/OBONr18VQ9SHFhHjY4htrA/artifacts/public/build/tldr-pages_v2023-08-29.en.zst
Downloading https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/CMk3dMWqQrOhiX5_lJExuQ/artifacts/public/build/corpus.en.zst to /builds/worker/fetches/corpus/corpus.en.zst
Downloading https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/YgHPpWXiQ1GZN_fLFdNjQw/artifacts/public/build/news_2007.en.zst
Downloading https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/CMk3dMWqQrOhiX5_lJExuQ/artifacts/public/build/corpus.en.zst
https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/YgHPpWXiQ1GZN_fLFdNjQw/artifacts/public/build/news_2007.en.zst resolved to 29313 bytes with sha256 f642aabdea0bf2f16cb56f55f7d4bf9fae0581a5370e1569c54d0efedaf8f817 in 0.139s
Verified size of https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/YgHPpWXiQ1GZN_fLFdNjQw/artifacts/public/build/news_2007.en.zst
https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/CMk3dMWqQrOhiX5_lJExuQ/artifacts/public/build/corpus.en.zst resolved to 34237 bytes with sha256 207b47693583af53ef835458b6bf2adc2e16be77ad5fba9f00e70514563b5276 in 0.142s
Verified size of https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/CMk3dMWqQrOhiX5_lJExuQ/artifacts/public/build/corpus.en.zst
Extracting /builds/worker/fetches/corpus/corpus.en.zst to /builds/worker/fetches/corpus
https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/OBONr18VQ9SHFhHjY4htrA/artifacts/public/build/tldr-pages_v2023-08-29.en.zst resolved to 7959 bytes with sha256 fddc904b2dbe829f7bf5f090fa89812dec1a67b1922811819986b9ba7970df81 in 0.216s
Verified size of https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/OBONr18VQ9SHFhHjY4htrA/artifacts/public/build/tldr-pages_v2023-08-29.en.zst
PERFHERDER_DATA: {"framework": {"name": "build_metrics"}, "suites": [{"name": "fetch_content", "value": 0.2196996050000024, "lowerIsBetter": true, "shouldAlert": false, "subtests": []}]}
[fetches 2024-11-22T23:08:53.275Z] finished fetching artifacts
[task 2024-11-22T23:08:53.276Z] executing ['bash', '-c', 'pip install -r $VCS_PATH/pipeline/clean/requirements/merge.txt && export PYTHONPATH=$PYTHONPATH:$VCS_PATH && python3 $VCS_PATH/pipeline/clean/merge-mono.py --parallel_corpus $MOZ_FETCHES_DIR/corpus/corpus.en.zst --output $TASK_WORKDIR/artifacts/mono.en.zst --max_sentences 1000 --datasets_glob "$MOZ_FETCHES_DIR/*.zst"']
[task 2024-11-22T23:08:53.658Z] WARNING: The directory '/builds/worker/.cache/pip' or its parent directory is not owned or is not writable by the current user. The cache has been disabled. Check the permissions and owner of that directory. If executing pip with sudo, you should use sudo's -H flag.
[task 2024-11-22T23:08:53.658Z] Defaulting to user installation because normal site-packages is not writeable
[task 2024-11-22T23:08:53.818Z] Collecting certifi==2024.7.4
[task 2024-11-22T23:08:53.922Z]   Downloading certifi-2024.7.4-py3-none-any.whl (162 kB)
[task 2024-11-22T23:08:53.972Z]      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 163.0/163.0 KB 3.3 MB/s eta 0:00:00
[task 2024-11-22T23:08:54.189Z] Collecting charset-normalizer==3.3.2
[task 2024-11-22T23:08:54.211Z]   Downloading charset_normalizer-3.3.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (142 kB)
[task 2024-11-22T23:08:54.237Z]      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 142.1/142.1 KB 5.7 MB/s eta 0:00:00
[task 2024-11-22T23:08:54.264Z] Collecting idna==3.7
[task 2024-11-22T23:08:54.284Z]   Downloading idna-3.7-py3-none-any.whl (66 kB)
[task 2024-11-22T23:08:54.294Z]      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 66.8/66.8 KB 7.5 MB/s eta 0:00:00
[task 2024-11-22T23:08:54.505Z] Collecting psutil==6.0.0
[task 2024-11-22T23:08:54.527Z]   Downloading psutil-6.0.0-cp36-abi3-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (290 kB)
[task 2024-11-22T23:08:54.575Z]      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 290.5/290.5 KB 6.3 MB/s eta 0:00:00
[task 2024-11-22T23:08:54.633Z] Collecting requests==2.31.0
[task 2024-11-22T23:08:54.654Z]   Downloading requests-2.31.0-py3-none-any.whl (62 kB)
[task 2024-11-22T23:08:54.661Z]      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 62.6/62.6 KB 10.6 MB/s eta 0:00:00
[task 2024-11-22T23:08:54.726Z] Collecting urllib3==2.2.2
[task 2024-11-22T23:08:54.746Z]   Downloading urllib3-2.2.2-py3-none-any.whl (121 kB)
[task 2024-11-22T23:08:54.767Z]      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 121.4/121.4 KB 6.4 MB/s eta 0:00:00
[task 2024-11-22T23:08:54.853Z] Installing collected packages: urllib3, psutil, idna, charset-normalizer, certifi, requests
[task 2024-11-22T23:08:55.443Z] Successfully installed certifi-2024.7.4 charset-normalizer-3.3.2 idna-3.7 psutil-6.0.0 requests-2.31.0 urllib3-2.2.2
[task 2024-11-22T23:08:55.660Z] [merge-mono] Monolingual datasets:
[task 2024-11-22T23:08:55.660Z] [merge-mono]  - /builds/worker/fetches/tldr-pages_v2023-08-29.en.zst (8.0 KB)
[task 2024-11-22T23:08:55.660Z] [merge-mono]  - /builds/worker/fetches/news_2007.en.zst (29.3 KB)
[task 2024-11-22T23:08:55.660Z] [merge-mono]  - 37.3 KB total
[task 2024-11-22T23:08:55.660Z] [merge-mono] Parallel corpus:
[task 2024-11-22T23:08:55.660Z] [merge-mono]  - /builds/worker/fetches/corpus/corpus.en.zst (29.3 KB)
[task 2024-11-22T23:08:55.660Z] [memory] 26.3 MB
[task 2024-11-22T23:08:55.660Z] [merge-mono] Compute hashes of the parallel data: /builds/worker/fetches/news_2007.en.zst
[task 2024-11-22T23:08:55.667Z] [memory] 27.0 MB (+675.8 KB)
[task 2024-11-22T23:08:55.667Z] [merge-mono] Deduplicated and shuffling lines in memory.
[task 2024-11-22T23:08:55.667Z] [downloads] Reading lines from: /builds/worker/fetches/tldr-pages_v2023-08-29.en.zst
[task 2024-11-22T23:08:55.669Z] [downloads] Reading lines from: /builds/worker/fetches/news_2007.en.zst
[task 2024-11-22T23:08:55.675Z] [memory] 27.2 MB (+266.2 KB)
[task 2024-11-22T23:08:55.675Z] [merge-mono] Write the final file: /builds/worker/artifacts/mono.en.zst
[task 2024-11-22T23:08:55.681Z] [memory] 28.6 MB (+1.4 MB)
[task 2024-11-22T23:08:55.681Z] [merge-mono] Write a 10,000 line sample of the final: /builds/worker/artifacts/mono.en.sample.txt
[task 2024-11-22T23:08:55.688Z] [memory] 27.6 MB (-987.1 KB)
[task 2024-11-22T23:08:55.688Z] [merge-mono] Saved the stats: /builds/worker/artifacts/mono.en.stats.json
[task 2024-11-22T23:08:55.688Z] [merge-mono] Done: Merging monolingual datasets
[task 2024-11-22T23:08:55.688Z] [memory] 27.6 MB (+0 B)
[fetches 2024-11-22T23:08:55.705Z] removing /builds/worker/fetches
[fetches 2024-11-22T23:08:55.706Z] finished
[taskcluster 2024-11-22 23:08:56.481Z] === Task Finished ===
[taskcluster 2024-11-22 23:08:57.101Z] Successful task run with exit code: 0 completed in 6.209 seconds
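
For readers following the [merge-mono] lines in the log, here is a minimal sketch of the steps they describe: hash the parallel corpus, drop monolingual lines that duplicate it or each other, shuffle in memory, and cap the result at --max_sentences. It is an illustration only and is not taken from pipeline/clean/merge-mono.py, which may work differently. The names line_hash and merge_mono, the use of SHA-256, and the fixed shuffle seed are all assumptions made for this example.

```python
import hashlib
import random
from typing import Iterable, List


def line_hash(line: str) -> bytes:
    """Hash a stripped line so duplicates can be tracked without storing the text.

    SHA-256 is an assumption for this sketch; the real script may use another hash.
    """
    return hashlib.sha256(line.strip().encode("utf-8")).digest()


def merge_mono(
    parallel_lines: Iterable[str],
    mono_datasets: Iterable[Iterable[str]],
    max_sentences: int,
    seed: int = 0,
) -> List[str]:
    """Merge monolingual datasets, excluding lines already in the parallel corpus."""
    # Step 1: compute hashes of the parallel data.
    seen = {line_hash(line) for line in parallel_lines}

    # Step 2: read every monolingual dataset, skipping empty lines and duplicates.
    merged: List[str] = []
    for dataset in mono_datasets:
        for line in dataset:
            line = line.strip()
            if not line:
                continue
            digest = line_hash(line)
            if digest in seen:
                continue
            seen.add(digest)
            merged.append(line)

    # Step 3: shuffle in memory (seeded here so the sketch is reproducible),
    # then keep at most max_sentences lines.
    random.Random(seed).shuffle(merged)
    return merged[:max_sentences]


if __name__ == "__main__":
    parallel = ["Hello world.", "Shared line."]
    mono = [
        ["Shared line.", "First mono.", "First mono."],
        ["Second mono.", "", "Hello world."],
    ]
    for sentence in merge_mono(parallel, mono, max_sentences=1000):
        print(sentence)
```

In this task the cap was 1000 sentences (--max_sentences 1000), and the inputs were the two .zst datasets listed under "Monolingual datasets"; reading and writing zstd-compressed files is left out of the sketch.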