From 383b595d8a950d2fd104f28a14614765b3a42e4a Mon Sep 17 00:00:00 2001
From: Benjamin Minixhofer
Date: Mon, 24 Jun 2024 21:05:46 +0200
Subject: [PATCH 1/4] Update README.md

---
 README.md | 21 ++++++++++-----------
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/README.md b/README.md
index 81db432b..4b09f6c2 100644
--- a/README.md
+++ b/README.md
@@ -1,8 +1,11 @@
-# Segment any Text: Robust, Efficient and Adaptable Sentence Segmentation
+wtpsplit🪓
+Segment any text quickly, and adaptably⚡
 
-Code for the paper [Segment Any Text: A Universal Approach for Robust, Efficient and Adaptable Sentence Segmentation](TODO) by Markus Frohmann, Igor Sterner, Benjamin Minixhofer, Ivan Vulić and Markus Schedl.
+This repository allows you to segment text into sentences or other semantic units. It implements the models from:
+- **SaT** — [Segment Any Text: A Universal Approach for Robust, Efficient and Adaptable Sentence Segmentation](TODO) by Markus Frohmann, Igor Sterner, Benjamin Minixhofer, Ivan Vulić and Markus Schedl (**state-of-the-art, encouraged**).
+- **WtP** — [Where’s the Point? Self-Supervised Multilingual Punctuation-Agnostic Sentence Segmentation](https://aclanthology.org/2023.acl-long.398/) by Benjamin Minixhofer, Jonas Pfeiffer and Ivan Vulić (*previous version, maintained for reproducibility*).
 
-This repository contains `wtpsplit`, a package for robust, efficient and adaptable sentence segmentation across 85 languages, as well as the code and configs to reproduce the **state-of-the-art** results in 8 distinct corpora and 85 languages demonstrated in our Segment any Text [paper](TODO).
+The namesake WtP is maintained for reproducibility. Our new followup SaT provides robust, efficient and adaptable sentence segmentation across 85 languages at higher performance and less compute cost. Check out the **state-of-the-art** results in 8 distinct corpora and 85 languages demonstrated in the [Segment any Text paper](TODO).
 
 ![System Figure](./configs/system-fig.png)
 
@@ -346,13 +349,13 @@ Ensure to install packages from `requirements.txt` beforehand.
 
 For details, we refer to our [paper](TODO).
 
-## Citation
+## Citations
 
 If you find `wtpsplit` and our `SaT` models useful, please kindly cite our paper:
 ```
 @inproceedings{TODO,}
 ```
-If you use WtP models, cite:
+For the library and the WtP models, please cite:
 ```
 @inproceedings{minixhofer-etal-2023-wheres,
     title = "Where{'}s the Point? Self-Supervised Multilingual Punctuation-Agnostic Sentence Segmentation",
@@ -374,10 +377,6 @@ If you use WtP models, cite:
 
 This research was funded in whole or in part by the Austrian Science Fund (FWF): P36413, P33526, and DFH-23, and by the State of Upper Austria and the Federal Ministry of Education, Science, and Research, through grants LIT-2021-YOU-215. In addition, Ivan Vulic and Benjamin Minixhofer ´have been supported through the Royal Society University Research Fellowship ‘Inclusive and Sustainable Language Technology for a Truly Multilingual World’ (no 221137) awarded to Ivan Vulic.´ This research has also been supported with Cloud TPUs from Google’s TPU Research Cloud (TRC). This work was also supported by compute credits from a Cohere For AI Research Grant, these grants are designed to support academic partners conducting research with the goal of releasing scientific artifacts and data for good projects. We also thank Simone Teufel for fruitful discussions.
 
+---
 
-## Previous Version
-
-*This repository previously contained `nnsplit` and `wtpsplit`, the precursors to `segment-any-text`. We still support all functionality of `wtpsplit`. Moreover, you can still use the `nnsplit` branch (or the `nnsplit` PyPI releases) for the old version, however, this is highly discouraged and not maintained! Please let us know if you have a usecase which `nnsplit` can solve but `segment-any-test` can not.*
-
-## Final Words
-We hope this repo is useful. For any questions, please create an issue or send an email to markus.frohmann@gmail.com, and I will get back to you as soon as possible.
\ No newline at end of file
+For any questions, please create an issue or send an email to markus.frohmann@gmail.com, and I will get back to you as soon as possible.
From fe7213d6186312b8a64abfbe5383759a9b4344f2 Mon Sep 17 00:00:00 2001
From: Benjamin Minixhofer
Date: Mon, 24 Jun 2024 21:07:19 +0200
Subject: [PATCH 2/4] Create README_WTP.md

---
 README_WTP.md | 326 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 326 insertions(+)
 create mode 100644 README_WTP.md

diff --git a/README_WTP.md b/README_WTP.md
new file mode 100644
index 00000000..832a0db3
--- /dev/null
+++ b/README_WTP.md
@@ -0,0 +1,326 @@
+# WtP usage in wtpsplit (Legacy)
+
+This doc details how to use the old `WtP` models. You should probably use [SaT](./README.md) instead.
+
+## Usage
+
+```python
+from wtpsplit import WtP
+
+wtp = WtP("wtp-bert-mini")
+# optionally run on GPU for better performance
+# also supports TPUs via e.g. wtp.to("xla:0"), in that case pass `pad_last_batch=True` to wtp.split
+wtp.half().to("cuda")
+
+# returns ["Hello ", "This is a test."]
+wtp.split("Hello This is a test.")
+
+# returns an iterator yielding a list of sentences for every text
+# do this instead of calling wtp.split on every text individually for much better performance
+wtp.split(["Hello This is a test.", "And some more texts..."])
+
+# if you're using a model with language adapters, also pass a `lang_code`
+wtp.split("Hello This is a test.", lang_code="en")
+
+# depending on your usecase, adaptation to e.g. the Universal Dependencies style may give better results
+# this always requires a language code
+wtp.split("Hello This is a test.", lang_code="en", style="ud")
+```
+
+## ONNX support
+
+You can enable ONNX inference for the `wtp-bert-*` models:
+
+```python
+wtp = WtP("wtp-bert-mini", ort_providers=["CUDAExecutionProvider"])
+```
+
+This requires `onnxruntime` and `onnxruntime-gpu`. It should give a good speedup on GPU!
+
+```python
+>>> from wtpsplit import WtP
+>>> texts = ["This is a sentence. This is another sentence."] * 1000
+
+# PyTorch GPU
+>>> model = WtP("wtp-bert-mini")
+>>> model.half().to("cuda")
+>>> %timeit list(model.split(texts))
+272 ms ± 16.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
+
+# onnxruntime GPU
+>>> model = WtP("wtp-bert-mini", ort_providers=["CUDAExecutionProvider"])
+>>> %timeit list(model.split(texts))
+198 ms ± 1.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
+```
+
+Notes:
+- The `wtp-canine-*` models are currently not supported with ONNX because the pooling done by CANINE is not trivial to export. Ideas to solve this are very welcome!
+- This does not work with Python 3.7 because `onnxruntime` does not support the opset we need for py37.
+
+
+## Available Models
+
+Pro tips: I recommend `wtp-bert-mini` for speed-sensitive applications, otherwise `wtp-canine-s-12l`. The `*-no-adapters` models provide a good tradeoff between speed and performance. You should *probably not* use `wtp-bert-tiny`.
+
+| Model | English Score | English Score (adapted) | Multilingual Score | Multilingual Score (adapted) |
+|:-----------------------------------------------------------------------|-----:|-----:|-----:|-----:|
+| [wtp-bert-tiny](https://huggingface.co/benjamin/wtp-bert-tiny) | 83.8 | 91.9 | 79.5 | 88.6 |
+| [wtp-bert-mini](https://huggingface.co/benjamin/wtp-bert-mini) | 91.8 | 95.9 | 84.3 | 91.3 |
+| [wtp-canine-s-1l](https://huggingface.co/benjamin/wtp-canine-s-1l) | 94.5 | 96.5 | 86.7 | 92.8 |
+| [wtp-canine-s-1l-no-adapters](https://huggingface.co/benjamin/wtp-canine-s-1l-no-adapters) | 93.1 | 96.4 | 85.1 | 91.8 |
+| [wtp-canine-s-3l](https://huggingface.co/benjamin/wtp-canine-s-3l) | 94.4 | 96.8 | 86.7 | 93.4 |
+| [wtp-canine-s-3l-no-adapters](https://huggingface.co/benjamin/wtp-canine-s-3l-no-adapters) | 93.8 | 96.4 | 86 | 92.3 |
+| [wtp-canine-s-6l](https://huggingface.co/benjamin/wtp-canine-s-6l) | 94.5 | 97.1 | 87 | 93.6 |
+| [wtp-canine-s-6l-no-adapters](https://huggingface.co/benjamin/wtp-canine-s-6l-no-adapters) | 94.4 | 96.8 | 86.4 | 92.8 |
+| [wtp-canine-s-9l](https://huggingface.co/benjamin/wtp-canine-s-9l) | 94.8 | 97 | 87.7 | 93.8 |
+| [wtp-canine-s-9l-no-adapters](https://huggingface.co/benjamin/wtp-canine-s-9l-no-adapters) | 94.3 | 96.9 | 86.6 | 93 |
+| [wtp-canine-s-12l](https://huggingface.co/benjamin/wtp-canine-s-12l) | 94.7 | 97.1 | 87.9 | 94 |
+| [wtp-canine-s-12l-no-adapters](https://huggingface.co/benjamin/wtp-canine-s-12l-no-adapters) | 94.5 | 97 | 87.1 | 93.2 |
+
+The scores are macro-average F1 score across all available datasets for "English", and macro-average F1 score across all datasets and languages for "Multilingual". "adapted" means adaptation via WtP Punct; check out the paper for details.
+
+For comparison, here are the English scores of some other tools:
+
+| Model | English Score |
+|:-----------------------------------------------------------------------|-----:|
+| SpaCy (sentencizer) | 86.8 |
+| PySBD | 69.8 |
+| SpaCy (dependency parser) | 93.1 |
+| Ersatz | 91.6 |
+| Punkt (`nltk.sent_tokenize`) | 92.5 |
+
+### Paragraph Segmentation
+
+Since WtP models are trained to predict newline probability, they can segment text into paragraphs in addition to sentences.
+
+```python
+# returns a list of paragraphs, each containing a list of sentences
+# adjust the paragraph threshold via the `paragraph_threshold` argument.
+wtp.split(text, do_paragraph_segmentation=True)
+```
+
+### Adaptation
+
+WtP can adapt to the Universal Dependencies, OPUS100 or Ersatz corpus segmentation style in many languages by punctuation adaptation (*preferred*) or threshold adaptation.
+
+#### Punctuation Adaptation
+
+```python
+# this requires a `lang_code`
+# check the paper or `wtp.mixtures` for supported styles
+wtp.split(text, lang_code="en", style="ud")
+```
+
+This also allows changing the threshold, but inherently has higher threshold values, since it is no longer the newline probability that is being thresholded:
+
+```python
+wtp.split(text, lang_code="en", style="ud", threshold=0.7)
+```
+
+To get the default threshold for a style:
+```python
+wtp.get_threshold("en", "ud", return_punctuation_threshold=True)
+```
+
+#### Threshold Adaptation
+```python
+threshold = wtp.get_threshold("en", "ud")
+
+wtp.split(text, threshold=threshold)
+```
+
+### Advanced Usage
+
+__Get the newline or sentence boundary probabilities for a text:__
+
+```python
+# returns newline probabilities (supports batching!)
+wtp.predict_proba(text)
+
+# returns sentence boundary probabilities for the given style
+wtp.predict_proba(text, lang_code="en", style="ud")
+```
+
+__Load a WtP model in [HuggingFace `transformers`](https://github.com/huggingface/transformers):__
+
+```python
+# import wtpsplit.models to register the custom models
+# (character-level BERT w/ hash embeddings and canine with language adapters)
+import wtpsplit.models
+from transformers import AutoModelForTokenClassification
+
+model = AutoModelForTokenClassification.from_pretrained("benjamin/wtp-bert-mini") # or some other model name
+```
+
+__** NEW ** Adapt to your own corpus using WtP_Punct:__
+
+Clone the repository:
+
+```
+git clone https://github.com/bminixhofer/wtpsplit
+cd wtpsplit
+```
+
+Create your data:
+```python
+import torch
+
+torch.save(
+    {
+        "en": {
+            "sentence": {
+                "dummy-dataset": {
+                    "meta": {
+                        "train_data": ["train sentence 1", "train sentence 2"],
+                    },
+                    "data": [
+                        "test sentence 1",
+                        "test sentence 2",
+                    ]
+                }
+            }
+        }
+    },
+    "dummy-dataset.pth"
+)
+```
+
+Run adaptation:
+
+```
+python3 wtpsplit/evaluation/adapt.py --model_path=benjamin/wtp-bert-mini --eval_data_path dummy-dataset.pth --include_langs=en
+```
+
+This should print something like
+
+```
+en dummy-dataset U=0.500 T=0.667 PUNCT=0.667
+100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 30.52it/s]
+Wrote mixture to /Users/bminixhofer/Documents/wtpsplit/wtpsplit/.cache/wtp-bert-mini.skops
+Wrote results to /Users/bminixhofer/Documents/wtpsplit/wtpsplit/.cache/wtp-bert-mini_intrinsic_results.json
+```
+
+i.e. run adaptation on your data and save the mixtures and evaluation results. You can then load and use the mixture like this:
+
+```python
+from wtpsplit import WtP
+import skops.io as sio
+
+wtp = WtP(
+    "wtp-bert-mini",
+    mixtures=sio.load(
+        "wtpsplit/.cache/wtp-bert-mini.skops",
+        ["numpy.float32", "numpy.float64", "sklearn.linear_model._logistic.LogisticRegression"],
+    ),
+)
+
+wtp.split("your text here", lang_code="en", style="dummy-dataset")
+```
+
+... and adjust the dataset name, language and model in the above to your needs.
+
+## Reproducing the paper
+
+`configs/` contains the configs for the runs from the paper. We trained on a TPUv3-8. Launch training like this:
+
+```
+python wtpsplit/train/train.py configs/.json
+```
+
+In addition:
+- `wtpsplit/data_acquisition` contains the code for obtaining evaluation data and raw text from the mC4 corpus.
+- `wtpsplit/evaluation` contains the code for:
+  - intrinsic evaluation (i.e. sentence segmentation results) via `adapt.py`. The raw intrinsic results in JSON format are also at `evaluation_results/`
+  - extrinsic evaluation on Machine Translation in `extrinsic.py`
+  - baseline (PySBD, nltk, etc.) intrinsic evaluation in `intrinsic_baselines.py`
+  - punctuation annotation experiments in `punct_annotation.py` and `punct_annotation_wtp.py`
+
+## Supported Languages
+
+| iso | Name |
+|:----|:-----------------------|
+| af | Afrikaans |
+| am | Amharic |
+| ar | Arabic |
+| az | Azerbaijani |
+| be | Belarusian |
+| bg | Bulgarian |
+| bn | Bengali |
+| ca | Catalan |
+| ceb | Cebuano |
+| cs | Czech |
+| cy | Welsh |
+| da | Danish |
+| de | German |
+| el | Greek |
+| en | English |
+| eo | Esperanto |
+| es | Spanish |
+| et | Estonian |
+| eu | Basque |
+| fa | Persian |
+| fi | Finnish |
+| fr | French |
+| fy | Western Frisian |
+| ga | Irish |
+| gd | Scottish Gaelic |
+| gl | Galician |
+| gu | Gujarati |
+| ha | Hausa |
+| he | Hebrew |
+| hi | Hindi |
+| hu | Hungarian |
+| hy | Armenian |
+| id | Indonesian |
+| ig | Igbo |
+| is | Icelandic |
+| it | Italian |
+| ja | Japanese |
+| jv | Javanese |
+| ka | Georgian |
+| kk | Kazakh |
+| km | Central Khmer |
+| kn | Kannada |
+| ko | Korean |
+| ku | Kurdish |
+| ky | Kirghiz |
+| la | Latin |
+| lt | Lithuanian |
+| lv | Latvian |
+| mg | Malagasy |
+| mk | Macedonian |
+| ml | Malayalam |
+| mn | Mongolian |
+| mr | Marathi |
+| ms | Malay |
+| mt | Maltese |
+| my | Burmese |
+| ne | Nepali |
+| nl | Dutch |
+| no | Norwegian |
+| pa | Panjabi |
+| pl | Polish |
+| ps | Pushto |
+| pt | Portuguese |
+| ro | Romanian |
+| ru | Russian |
+| si | Sinhala |
+| sk | Slovak |
+| sl | Slovenian |
+| sq | Albanian |
+| sr | Serbian |
+| sv | Swedish |
+| ta | Tamil |
+| te | Telugu |
+| tg | Tajik |
+| th | Thai |
+| tr | Turkish |
+| uk | Ukrainian |
+| ur | Urdu |
+| uz | Uzbek |
+| vi | Vietnamese |
+| xh | Xhosa |
+| yi | Yiddish |
+| yo | Yoruba |
+| zh | Chinese |
+| zu | Zulu |
From 970a8c488efdc5d8c92fff784e5ca39b982c64ce Mon Sep 17 00:00:00 2001
From: Benjamin Minixhofer
Date: Mon, 24 Jun 2024 21:07:57 +0200
Subject: [PATCH 3/4] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 4b09f6c2..3aaed206 100644
--- a/README.md
+++ b/README.md
@@ -87,7 +87,7 @@ wtp = WtP("wtp-bert-mini")
 wtp.split("This is a test This is another test.")
 ```
 
-For more details on WtP and reproduction details, see the `wtp` branch.
+For more details on WtP and reproduction details, see the [WtP doc](./README_WTP.md).
 
 ## Paragraph Segmentation
 
From 93b79859cf05a91989c3a226652913e8fb2c3d60 Mon Sep 17 00:00:00 2001
From: Benjamin Minixhofer
Date: Mon, 24 Jun 2024 21:09:52 +0200
Subject: [PATCH 4/4] Update README.md

---
 README.md | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 3aaed206..5671fdac 100644
--- a/README.md
+++ b/README.md
@@ -259,6 +259,9 @@ In addition:
 Ensure to install packages from `requirements.txt` beforehand.
 
 ## Supported Languages
+
+ Table with supported languages
+
 | iso | Name |
 |:----|:-----------------------|
 | af | Afrikaans |
@@ -347,7 +350,9 @@ Ensure to install packages from `requirements.txt` beforehand.
 | zh | Chinese |
 | zu | Zulu |
 
-For details, we refer to our [paper](TODO).
+
+
+For details, please see the [paper](TODO).
 
 ## Citations