Add SaT #117

Closed · wants to merge 285 commits
d3473a4
Merge branch 'efficient-transfer' of https://github.com/bminixhofer/w…
markus583 Jan 11, 2024
c29f6a9
fix standard case_corruption_prob
markus583 Jan 11, 2024
7a81a45
Merge branch 'efficient-transfer' of https://github.com/bminixhofer/w…
markus583 Jan 11, 2024
1fd7730
add auxp config
markus583 Jan 12, 2024
ed61eeb
Merge branch 'efficient-transfer' of https://github.com/bminixhofer/w…
markus583 Jan 12, 2024
e72c358
update intrinsic.py, add violin plot
markus583 Jan 12, 2024
7f9c7b8
add limited lookahead via shifted causal attn mask
markus583 Jan 13, 2024
6e959e2
Merge branch 'efficient-transfer' of https://github.com/bminixhofer/w…
markus583 Jan 13, 2024
9ca7955
update plot & threshold
markus583 Jan 13, 2024
98c46e7
Merge branch 'efficient-transfer' of https://github.com/bminixhofer/w…
markus583 Jan 14, 2024
2214f4c
add lowercase eval
markus583 Jan 14, 2024
dfba4de
Merge branch 'efficient-transfer' of https://github.com/bminixhofer/w…
markus583 Jan 14, 2024
914d5fc
actually use lowercase eval
markus583 Jan 14, 2024
08b64b8
add timing benchmark
markus583 Jan 20, 2024
42464e1
refactor Case Corruption
markus583 Jan 20, 2024
b51f79a
Merge branch 'efficient-transfer' of https://github.com/bminixhofer/w…
markus583 Jan 20, 2024
6231a99
eval ASR-style & pairwise
markus583 Jan 20, 2024
a9bd73c
fix split
markus583 Jan 20, 2024
bd02a88
fix stride
markus583 Jan 20, 2024
9678605
remove short seqs in pairwise eval
markus583 Jan 20, 2024
f83af3a
return model back to tpu
markus583 Jan 20, 2024
9f41a85
use correct labels in pairwise eval
markus583 Jan 21, 2024
fd5aa86
add some lyrics stuff
markus583 Feb 2, 2024
f9be951
add lower & rmp intrinsic eval
markus583 Feb 2, 2024
2443bcc
fix corruption
markus583 Feb 3, 2024
eaa1968
adapt to current codebase
markus583 Feb 4, 2024
1e6b237
fix rmp labels
markus583 Feb 5, 2024
48116f6
Merge branch 'efficient-transfer' of https://github.com/bminixhofer/w…
markus583 Feb 5, 2024
d0d737e
actually fix rmp eval
markus583 Feb 5, 2024
2cf1198
pairwise eval (inference mode)
markus583 Feb 9, 2024
03bbb60
proper pairwise eval during training
markus583 Feb 12, 2024
8a8bc17
fix non-pair eval
markus583 Feb 14, 2024
f35ca5c
tiny fix
markus583 Feb 15, 2024
e5ee9d2
update gitignore
markus583 Feb 15, 2024
dc68dca
proper lookahead
markus583 Feb 18, 2024
dc835ec
first peft implementation
markus583 Feb 21, 2024
d4abc80
fix module saving
markus583 Feb 21, 2024
4366677
skip if no logits
markus583 Feb 22, 2024
c346a88
train adapters in parallel on TPUs
markus583 Feb 26, 2024
0fee97d
cleanup
markus583 Feb 26, 2024
0cd453a
add meta clf for ft
markus583 Feb 27, 2024
934381d
add lora config
markus583 Feb 28, 2024
6459683
add some stuff for adp
markus583 Mar 2, 2024
153485a
fix meta-clf head loading
markus583 Mar 2, 2024
0c0116d
update default threshold
markus583 Mar 3, 2024
1773c47
add corruption to adp training
markus583 Mar 4, 2024
45b892b
add corruptions & domain setup
markus583 Mar 6, 2024
4ac568c
fixes
markus583 Mar 6, 2024
5c89af1
optionally, skip eval loss
markus583 Mar 6, 2024
c6361d9
update non-parallel adp training
markus583 Mar 7, 2024
985c6c2
more convenient eval
markus583 Mar 7, 2024
b847d0b
fix name
markus583 Mar 7, 2024
e8800e5
add chunkwise eval function (for lyrics etc.)
markus583 Mar 9, 2024
df34a60
Merge branch 'efficient-transfer' of https://github.com/bminixhofer/w…
markus583 Mar 9, 2024
32b9c14
ADP train with single samples
markus583 Mar 9, 2024
61ef0de
simplify score calculation
markus583 Mar 10, 2024
f804804
fix corruption
markus583 Mar 12, 2024
e380c2a
flatten list, add strip flag
markus583 Mar 26, 2024
76d58ca
undo parallel training changes
markus583 Mar 26, 2024
87ddd27
strip flag
markus583 Mar 26, 2024
33a9fc6
add tqdm feature
markus583 Mar 27, 2024
d59964c
add option for simple full FT
markus583 Mar 27, 2024
3d12df4
add full FT save
markus583 Mar 27, 2024
88c7310
add config
markus583 Mar 27, 2024
fa5cb3a
add full model ft support
markus583 Mar 28, 2024
e2eae1c
generalize pairwise eval to k-mer eval
markus583 Mar 28, 2024
19aa4b2
use train set!
markus583 Mar 28, 2024
ab776f8
full FT compat
markus583 Mar 29, 2024
b057630
indent
markus583 Mar 29, 2024
52b6a48
fix eval selection
markus583 Mar 29, 2024
d6747a0
Merge branch 'efficient-transfer' of https://github.com/bminixhofer/w…
markus583 Mar 29, 2024
b12ee45
expand kmers
markus583 Mar 29, 2024
aec7fc2
indent!
markus583 Mar 29, 2024
8ded82c
safer .to_cpu
markus583 Mar 29, 2024
a37f2e2
eval mldbW for mldb models
markus583 Mar 31, 2024
2ad7843
add subsampling + fix auto-eval (?)
markus583 Apr 1, 2024
cdcaeba
minor fixes
markus583 Apr 1, 2024
31254c3
fix subsampling indexing
markus583 Apr 2, 2024
284e925
threshold investigation
markus583 Apr 4, 2024
888c1c9
Merge branch 'efficient-transfer' of https://github.com/bminixhofer/w…
markus583 Apr 4, 2024
99c6b9c
fix threshold investigation
markus583 Apr 4, 2024
d6088f1
fix + propagate threshold
markus583 Apr 5, 2024
ac4380d
fix model loading
markus583 Apr 5, 2024
ce2d5aa
enable threshold analyses
markus583 Apr 5, 2024
e08ab57
fix f return
markus583 Apr 5, 2024
4b74828
update
markus583 Apr 8, 2024
5fba936
keep logits
markus583 Apr 8, 2024
127dd5c
reuse logits
markus583 Apr 9, 2024
62b6d7b
re-use logits
markus583 Apr 9, 2024
a334575
Merge branch 'efficient-transfer' of https://github.com/bminixhofer/w…
markus583 Apr 9, 2024
02dba86
eval + train ADP on Igor's models
markus583 Apr 13, 2024
eadfc37
fix eval steps
markus583 Apr 13, 2024
b89c6b3
add extra langs csv
markus583 Apr 13, 2024
61661e1
return more metrics
markus583 Apr 13, 2024
23e95a8
Merge branch 'efficient-transfer' of https://github.com/bminixhofer/w…
markus583 Apr 13, 2024
b370546
fix decision function for Igor's models
markus583 Apr 14, 2024
dd1e5fc
same fix for listwise
markus583 Apr 14, 2024
a96105a
also fix adp training w/ Igor's models.
markus583 Apr 14, 2024
447f54d
Merge branch 'efficient-transfer' of https://github.com/bminixhofer/w…
markus583 Apr 14, 2024
f97c280
add possibility for clf_from_scratch
markus583 Apr 14, 2024
c14a893
fix punct if no full logits
markus583 Apr 15, 2024
7febe9a
add option for lookahead only in first N layers
markus583 Apr 16, 2024
62dc41b
Merge branch 'efficient-transfer' of https://github.com/bminixhofer/w…
markus583 Apr 16, 2024
b794432
add clf from scratch
markus583 Apr 21, 2024
7c608c1
add possibility to corrupt entire chunks (v1)
markus583 Apr 21, 2024
41e198d
update configs
markus583 Apr 21, 2024
53edc76
fix return
markus583 Apr 21, 2024
42eb4f3
no pairwise eval + new corrupted data
markus583 Apr 22, 2024
3d1a8cc
add full corruption param
markus583 Apr 23, 2024
6b9e2b8
add asr strategy
markus583 Apr 23, 2024
37824a4
add mixed corruption strategy
markus583 Apr 23, 2024
baf398a
ensure same bs on all TPUs
markus583 Apr 23, 2024
86a7f10
update corrupt-asr
markus583 Apr 24, 2024
296e495
fix data path
markus583 Apr 24, 2024
b56e9f3
log more
markus583 Apr 25, 2024
3c69061
add detok!?
markus583 Apr 25, 2024
e5844cc
update xlmr labels
markus583 Apr 26, 2024
b11dafe
add label token hierarchy
markus583 Apr 26, 2024
0708454
skip corrupted evals (for now)
markus583 Apr 26, 2024
3b066b3
feats
markus583 Apr 27, 2024
2214fac
adapt window
markus583 Apr 27, 2024
21ee4f8
eval corrupted again
markus583 Apr 28, 2024
15d65b6
v3-3 backup
markus583 Apr 28, 2024
0600c17
draft t
markus583 Apr 29, 2024
ec60916
fix eval data load
markus583 Apr 29, 2024
2d4f8ed
cleanup, use label idx
markus583 Apr 29, 2024
96fa39c
no remove
markus583 Apr 29, 2024
1ba048a
no corrupted eval!
markus583 Apr 29, 2024
cecb294
clean-up
markus583 Apr 29, 2024
62a1aff
fix labeling!?
markus583 Apr 29, 2024
d3cea0c
Merge branch 'efficient-transfer' of https://github.com/bminixhofer/w…
markus583 Apr 29, 2024
f0e1f52
fix fix fix!
markus583 Apr 30, 2024
0c5cae4
remove ds at correct pos
markus583 Apr 30, 2024
b4b1704
fix no sents
markus583 Apr 30, 2024
8d9e870
use old tokenization
markus583 May 1, 2024
2294c39
add non-whitespace oversampling
markus583 May 1, 2024
d87d4d5
update run.sh
markus583 May 1, 2024
ad453bc
Merge branch 'efficient-transfer' of https://github.com/bminixhofer/w…
markus583 May 1, 2024
187fa8c
optionally, store indices
markus583 May 2, 2024
14f4283
adapt to intrinsic.py
markus583 May 3, 2024
a9f7fd9
fix
markus583 May 3, 2024
8cc90c0
use new data, less verbose
markus583 May 4, 2024
1e327b1
handle short-seqs
markus583 May 5, 2024
08c4b45
fix case: len(input_tokens) == 511
markus583 May 5, 2024
4c8f7ba
helper
markus583 May 5, 2024
96d247d
code for legal data baseline
markus583 May 7, 2024
39a66eb
update for newest data
markus583 May 7, 2024
f45976f
update to recent changes
markus583 May 7, 2024
6d19922
fix too short sequences for block creation + silence chars
markus583 May 8, 2024
87781b2
fix v2
markus583 May 8, 2024
158af0e
handle nllb train
markus583 May 9, 2024
9335d0c
add ted2020 shared task data + eval
markus583 May 9, 2024
23e0977
fixes
markus583 May 9, 2024
0a7d432
minor stuff.
markus583 May 10, 2024
bad5b8f
short seqs: no packing
markus583 May 10, 2024
983ff29
fix
markus583 May 10, 2024
0f7722a
fix list of lists subsampling
markus583 May 10, 2024
e6cd196
sync
markus583 May 10, 2024
41b7960
only load trained ADP
markus583 May 10, 2024
6b65157
new data + set lora config
markus583 May 10, 2024
5ddb9f0
fix naming
markus583 May 11, 2024
b119b5e
skip en legal laws
markus583 May 11, 2024
a27bd89
strip + new data
markus583 May 11, 2024
4f27691
eval: skip parts of ds, exclude every kth, xlmr compat
markus583 May 12, 2024
66b97f4
some fixes
markus583 May 12, 2024
57545b1
add entity (for v4)
markus583 May 12, 2024
70a9aea
LLM + new defaults
markus583 May 12, 2024
bdbc5ef
Merge branch 'efficient-transfer' of https://github.com/bminixhofer/w…
markus583 May 12, 2024
53277dc
docs: only legal + shuffle
markus583 May 12, 2024
0f544ef
also shuffle in forward
markus583 May 12, 2024
312a5c2
also shuffle here
markus583 May 12, 2024
bc514cf
v2
markus583 May 14, 2024
1d5d66a
add avg
markus583 May 14, 2024
34de88a
skip non/empty lists
markus583 May 14, 2024
2e80a9d
fix xlmr base + qol
markus583 May 14, 2024
e6a88f8
only extra-subsample legal
markus583 May 14, 2024
53fe503
use ALL samples; baselines
markus583 May 14, 2024
4198125
fix naming
markus583 May 14, 2024
2211bec
better E handling
markus583 May 14, 2024
b971b1c
add LLM eval sentence fn
markus583 May 14, 2024
f7554d3
k = 0
markus583 May 15, 2024
7ad9a3e
final eval setup?
markus583 May 15, 2024
1825000
also do legal asr corruptions
markus583 May 15, 2024
c4a4ff3
feats
markus583 May 15, 2024
a4f91af
don't store logits if no punct applied
markus583 May 15, 2024
4fcb260
Merge branch 'efficient-transfer' of https://github.com/bminixhofer/w…
markus583 May 15, 2024
4362896
no print
markus583 May 15, 2024
f235542
proper dict assignment
markus583 May 15, 2024
2c99884
cs lang code for canine
markus583 May 16, 2024
9feb8d5
new skip
markus583 May 16, 2024
9994910
Merge branch 'efficient-transfer' of https://github.com/bminixhofer/w…
markus583 May 16, 2024
edab657
fix imports + legal en
markus583 May 16, 2024
8809327
fix ted filter
markus583 May 17, 2024
545beca
fix filter, cf. intrinsic
markus583 May 17, 2024
fd6716e
update data pth, idcs
markus583 May 18, 2024
2f06d84
baselines
markus583 May 19, 2024
27e56f0
no subsample
markus583 May 19, 2024
29e4d06
fix append
markus583 May 19, 2024
98443b7
finally fix indices
markus583 May 19, 2024
d875c77
add xlmr+lora
markus583 May 20, 2024
6a09358
finally fix idcs
markus583 May 20, 2024
5a99038
quick xlm-r small model fix
markus583 May 21, 2024
f46b995
fix num layers
markus583 May 21, 2024
79598b1
fix short seq inclusion
markus583 May 21, 2024
c5a5cb9
update legal baseline
markus583 May 23, 2024
a20ce91
adapt to new data config
markus583 May 23, 2024
1a11667
xlmr + adapters compat
markus583 May 24, 2024
2b77d40
fix xlmr 3l eval
markus583 May 25, 2024
e6a31e7
fix idcs count
markus583 Jun 4, 2024
b4e70c5
up
markus583 Jun 5, 2024
07c78dc
prepare for v2
markus583 Jun 16, 2024
c6983d5
integrate igor's code
markus583 Jun 16, 2024
a4f9e20
fix some lint errors
markus583 Jun 16, 2024
c995b70
fix try:
markus583 Jun 16, 2024
1add5b5
f
markus583 Jun 16, 2024
3f3ad11
ignore bare except
markus583 Jun 16, 2024
fceca3f
noqa
markus583 Jun 16, 2024
c61973f
test
markus583 Jun 16, 2024
4d222f5
fix reqs
markus583 Jun 16, 2024
45de2ca
reqs?
markus583 Jun 16, 2024
6a76f73
fix np!!!
markus583 Jun 16, 2024
0a2f530
update
markus583 Jun 16, 2024
15651e2
regular sigmoid
markus583 Jun 16, 2024
08b34c0
refvert for now
markus583 Jun 16, 2024
d7af259
add results + configs
markus583 Jun 17, 2024
dfb154d
update reqs.txt
markus583 Jun 17, 2024
e7606d3
try onnx export
markus583 Jun 17, 2024
fb0e05f
update readme stuff
markus583 Jun 18, 2024
84ae09d
typooo
markus583 Jun 18, 2024
8f787c6
git rm
markus583 Jun 18, 2024
adc23b7
git rm v2
markus583 Jun 18, 2024
cc4871f
add eval results
markus583 Jun 19, 2024
08a6ed9
fix forward call + xlmr
markus583 Jun 19, 2024
478540d
rm
markus583 Jun 19, 2024
f3dd236
fix adp version
markus583 Jun 19, 2024
08432dc
transformers version?
markus583 Jun 19, 2024
48e2c90
t
markus583 Jun 19, 2024
a9e3012
loc
markus583 Jun 19, 2024
fc27c9a
bump?
markus583 Jun 19, 2024
d63e21e
monkey patch adapters
markus583 Jun 19, 2024
7fa1075
rm old stuff
markus583 Jun 19, 2024
8248ec5
more rm
markus583 Jun 19, 2024
af0963a
fix duplicates
markus583 Jun 19, 2024
310a077
add py3.12
markus583 Jun 19, 2024
383b595
Update README.md
bminixhofer Jun 24, 2024
fe7213d
Create README_WTP.md
bminixhofer Jun 24, 2024
970a8c4
Update README.md
bminixhofer Jun 24, 2024
93b7985
Update README.md
bminixhofer Jun 24, 2024
1750a28
Merge pull request #116 from bminixhofer/doc-update
markus583 Jun 25, 2024
376005b
speed-up List[str]
markus583 Jun 25, 2024
2 changes: 1 addition & 1 deletion .github/workflows/python.yml
Original file line number Diff line number Diff line change
@@ -8,7 +8,7 @@ jobs:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.8", "3.9", "3.10", "3.11"]
python-version: ["3.8", "3.9", "3.10", "3.11", "3.12"]

steps:
- uses: actions/checkout@v3
13 changes: 12 additions & 1 deletion .gitignore
@@ -9,4 +9,15 @@ external
*.egg-info
notebooks
final_outputs
.cache*
.cache*
data_subset/**
*.pth
*.zip
*.bin
*.html
*.csv
*.png
*.txt
*.log
xlmr-*/**
**/checkpoint-*/**
2 changes: 1 addition & 1 deletion LICENSE
@@ -1,6 +1,6 @@
MIT License

Copyright (c) 2023 Benjamin Minixhofer
Copyright (c) 2024 Benjamin Minixhofer, Markus Frohmann, Igor Sterner

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
284 changes: 176 additions & 108 deletions README.md

Large diffs are not rendered by default.

326 changes: 326 additions & 0 deletions README_WTP.md
@@ -0,0 +1,326 @@
# WtP usage in wtpsplit (Legacy)

This doc details how to use the old `WtP` models. You should probably use [SaT](./README.md) instead.

## Usage

```python
from wtpsplit import WtP

wtp = WtP("wtp-bert-mini")
# optionally run on GPU for better performance
# also supports TPUs via e.g. wtp.to("xla:0"), in that case pass `pad_last_batch=True` to wtp.split
wtp.half().to("cuda")

# returns ["Hello ", "This is a test."]
wtp.split("Hello This is a test.")

# returns an iterator yielding a list of sentences for every text
# do this instead of calling wtp.split on every text individually for much better performance
wtp.split(["Hello This is a test.", "And some more texts..."])

# if you're using a model with language adapters, also pass a `lang_code`
wtp.split("Hello This is a test.", lang_code="en")

# depending on your usecase, adaptation to e.g. the Universal Dependencies style may give better results
# this always requires a language code
wtp.split("Hello This is a test.", lang_code="en", style="ud")
```

## ONNX support

You can enable ONNX inference for the `wtp-bert-*` models:

```python
wtp = WtP("wtp-bert-mini", ort_providers=["CUDAExecutionProvider"])
```

This requires `onnxruntime` (and `onnxruntime-gpu` for GPU inference). It should give a good speedup on GPU:

```python
>>> from wtpsplit import WtP
>>> texts = ["This is a sentence. This is another sentence."] * 1000

# PyTorch GPU
>>> model = WtP("wtp-bert-mini")
>>> model.half().to("cuda")
>>> %timeit list(model.split(texts))
272 ms ± 16.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# onnxruntime GPU
>>> model = WtP("wtp-bert-mini", ort_providers=["CUDAExecutionProvider"])
>>> %timeit list(model.split(texts))
198 ms ± 1.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

Notes:
- The `wtp-canine-*` models are currently not supported with ONNX because the pooling done by CANINE is not trivial to export. Ideas to solve this are very welcome!
- This does not work with Python 3.7 because `onnxruntime` does not support the opset we need for py37.


## Available Models

Pro tips: I recommend `wtp-bert-mini` for speed-sensitive applications, otherwise `wtp-canine-s-12l`. The `*-no-adapters` models provide a good tradeoff between speed and performance. You should *probably not* use `wtp-bert-tiny`.

| Model | English Score | English Score<br>(adapted) | Multilingual Score | Multilingual Score<br>(adapted) |
|:-----------------------------------------------------------------------|-----:|-----:|-----:|-----:|
| [wtp-bert-tiny](https://huggingface.co/benjamin/wtp-bert-tiny) | 83.8 | 91.9 | 79.5 | 88.6 |
| [wtp-bert-mini](https://huggingface.co/benjamin/wtp-bert-mini) | 91.8 | 95.9 | 84.3 | 91.3 |
| [wtp-canine-s-1l](https://huggingface.co/benjamin/wtp-canine-s-1l) | 94.5 | 96.5 | 86.7 | 92.8 |
| [wtp-canine-s-1l-no-adapters](https://huggingface.co/benjamin/wtp-canine-s-1l-no-adapters) | 93.1 | 96.4 | 85.1 | 91.8 |
| [wtp-canine-s-3l](https://huggingface.co/benjamin/wtp-canine-s-3l) | 94.4 | 96.8 | 86.7 | 93.4 |
| [wtp-canine-s-3l-no-adapters](https://huggingface.co/benjamin/wtp-canine-s-3l-no-adapters) | 93.8 | 96.4 | 86.0 | 92.3 |
| [wtp-canine-s-6l](https://huggingface.co/benjamin/wtp-canine-s-6l) | 94.5 | 97.1 | 87.0 | 93.6 |
| [wtp-canine-s-6l-no-adapters](https://huggingface.co/benjamin/wtp-canine-s-6l-no-adapters) | 94.4 | 96.8 | 86.4 | 92.8 |
| [wtp-canine-s-9l](https://huggingface.co/benjamin/wtp-canine-s-9l) | 94.8 | 97.0 | 87.7 | 93.8 |
| [wtp-canine-s-9l-no-adapters](https://huggingface.co/benjamin/wtp-canine-s-9l-no-adapters) | 94.3 | 96.9 | 86.6 | 93.0 |
| [wtp-canine-s-12l](https://huggingface.co/benjamin/wtp-canine-s-12l) | 94.7 | 97.1 | 87.9 | 94.0 |
| [wtp-canine-s-12l-no-adapters](https://huggingface.co/benjamin/wtp-canine-s-12l-no-adapters) | 94.5 | 97.0 | 87.1 | 93.2 |

The scores are the macro-average F1 score across all available datasets for "English", and the macro-average F1 score across all datasets and languages for "Multilingual". "adapted" means adaptation via WtP Punct; check out the paper for details.

For comparison, here are the English scores of some other tools:

| Model | English Score
|:-----------------------------------------------------------------------|-----:|
| SpaCy (sentencizer) | 86.8 |
| PySBD | 69.8 |
| SpaCy (dependency parser) | 93.1 |
| Ersatz | 91.6 |
| Punkt (`nltk.sent_tokenize`) | 92.5 |

### Paragraph Segmentation

Since WtP models are trained to predict newline probability, they can segment text into paragraphs in addition to sentences.

```python
# returns a list of paragraphs, each containing a list of sentences
# adjust the paragraph threshold via the `paragraph_threshold` argument.
wtp.split(text, do_paragraph_segmentation=True)
```
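The nested return shape can be illustrated in plain Python (hypothetical values, not actual model output):

```python
# hypothetical return shape of do_paragraph_segmentation=True:
# a list of paragraphs, each a list of sentences
paragraphs = [
    ["Hello ", "This is a test. "],
    ["A new paragraph starts here."],
]

# flatten back to a plain sentence list if paragraph boundaries are not needed
sentences = [s for p in paragraphs for s in p]
```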

### Adaptation

WtP can adapt to the Universal Dependencies, OPUS100 or Ersatz corpus segmentation style in many languages by punctuation adaptation (*preferred*) or threshold adaptation.

#### Punctuation Adaptation

```python
# this requires a `lang_code`
# check the paper or `wtp.mixtures` for supported styles
wtp.split(text, lang_code="en", style="ud")
```

This also allows changing the threshold, but the appropriate threshold values are inherently higher since it is no longer the raw newline probability being thresholded:

```python
wtp.split(text, lang_code="en", style="ud", threshold=0.7)
```

To get the default threshold for a style:
```python
wtp.get_threshold("en", "ud", return_punctuation_threshold=True)
```

#### Threshold Adaptation
```python
threshold = wtp.get_threshold("en", "ud")

wtp.split(text, threshold=threshold)
```

### Advanced Usage

__Get the newline or sentence boundary probabilities for a text:__

```python
# returns newline probabilities (supports batching!)
wtp.predict_proba(text)

# returns sentence boundary probabilities for the given style
wtp.predict_proba(text, lang_code="en", style="ud")
```
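As a rough sketch of what thresholding such probabilities amounts to (plain NumPy, with hypothetical probability values, not actual model output):

```python
import numpy as np

# hypothetical per-character newline probabilities (not actual model output)
probs = np.array([0.01, 0.02, 0.90, 0.03, 0.01, 0.85])

threshold = 0.5
# indices whose probability exceeds the threshold are treated as sentence ends
boundaries = np.where(probs > threshold)[0]
```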

__Load a WtP model in [HuggingFace `transformers`](https://github.com/huggingface/transformers):__

```python
# import wtpsplit.models to register the custom models
# (character-level BERT w/ hash embeddings and canine with language adapters)
import wtpsplit.models
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained("benjamin/wtp-bert-mini") # or some other model name
```

__**NEW:** Adapt to your own corpus using WtP_Punct:__

Clone the repository:

```
git clone https://github.com/bminixhofer/wtpsplit
cd wtpsplit
```

Create your data:
```python
import torch

torch.save(
    {
        "en": {
            "sentence": {
                "dummy-dataset": {
                    "meta": {
                        "train_data": ["train sentence 1", "train sentence 2"],
                    },
                    "data": [
                        "test sentence 1",
                        "test sentence 2",
                    ],
                }
            }
        }
    },
    "dummy-dataset.pth",
)
```
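The nesting above follows the pattern language → task ("sentence") → dataset name → {"meta", "data"}. A plain-Python sketch of that schema (no torch required):

```python
# mirrors the structure passed to torch.save above
eval_data = {
    "en": {
        "sentence": {
            "dummy-dataset": {
                "meta": {"train_data": ["train sentence 1", "train sentence 2"]},
                "data": ["test sentence 1", "test sentence 2"],
            }
        }
    }
}

# sanity-check the nesting: language -> task -> dataset -> meta/data
dataset = eval_data["en"]["sentence"]["dummy-dataset"]
assert set(dataset) == {"meta", "data"}
```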

Run adaptation:

```
python3 wtpsplit/evaluation/adapt.py --model_path=benjamin/wtp-bert-mini --eval_data_path dummy-dataset.pth --include_langs=en
```

This should print something like

```
en dummy-dataset U=0.500 T=0.667 PUNCT=0.667
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 30.52it/s]
Wrote mixture to /Users/bminixhofer/Documents/wtpsplit/wtpsplit/.cache/wtp-bert-mini.skops
Wrote results to /Users/bminixhofer/Documents/wtpsplit/wtpsplit/.cache/wtp-bert-mini_intrinsic_results.json
```

i.e. it runs adaptation on your data and saves the mixtures and evaluation results. You can then load and use the mixture like this:

```python
from wtpsplit import WtP
import skops.io as sio

wtp = WtP(
    "wtp-bert-mini",
    mixtures=sio.load(
        "wtpsplit/.cache/wtp-bert-mini.skops",
        ["numpy.float32", "numpy.float64", "sklearn.linear_model._logistic.LogisticRegression"],
    ),
)

wtp.split("your text here", lang_code="en", style="dummy-dataset")
```

Adjust the dataset name, language, and model in the above to your needs.

## Reproducing the paper

`configs/` contains the configs for the runs from the paper. We trained on a TPUv3-8. Launch training like this:

```
python wtpsplit/train/train.py configs/<config_name>.json
```

In addition:
- `wtpsplit/data_acquisition` contains the code for obtaining evaluation data and raw text from the mC4 corpus.
- `wtpsplit/evaluation` contains the code for:
- intrinsic evaluation (i.e. sentence segmentation results) via `adapt.py`. The raw intrinsic results in JSON format are also at `evaluation_results/`
- extrinsic evaluation on Machine Translation in `extrinsic.py`
- baseline (PySBD, nltk, etc.) intrinsic evaluation in `intrinsic_baselines.py`
- punctuation annotation experiments in `punct_annotation.py` and `punct_annotation_wtp.py`

## Supported Languages

| iso | Name |
|:----|:-----------------------|
| af | Afrikaans |
| am | Amharic |
| ar | Arabic |
| az | Azerbaijani |
| be | Belarusian |
| bg | Bulgarian |
| bn | Bengali |
| ca | Catalan |
| ceb | Cebuano |
| cs | Czech |
| cy | Welsh |
| da | Danish |
| de | German |
| el | Greek |
| en | English |
| eo | Esperanto |
| es | Spanish |
| et | Estonian |
| eu | Basque |
| fa | Persian |
| fi | Finnish |
| fr | French |
| fy | Western Frisian |
| ga | Irish |
| gd | Scottish Gaelic |
| gl | Galician |
| gu | Gujarati |
| ha | Hausa |
| he | Hebrew |
| hi | Hindi |
| hu | Hungarian |
| hy | Armenian |
| id | Indonesian |
| ig | Igbo |
| is | Icelandic |
| it | Italian |
| ja | Japanese |
| jv | Javanese |
| ka | Georgian |
| kk | Kazakh |
| km | Central Khmer |
| kn | Kannada |
| ko | Korean |
| ku | Kurdish |
| ky | Kirghiz |
| la | Latin |
| lt | Lithuanian |
| lv | Latvian |
| mg | Malagasy |
| mk | Macedonian |
| ml | Malayalam |
| mn | Mongolian |
| mr | Marathi |
| ms | Malay |
| mt | Maltese |
| my | Burmese |
| ne | Nepali |
| nl | Dutch |
| no | Norwegian |
| pa | Panjabi |
| pl | Polish |
| ps | Pushto |
| pt | Portuguese |
| ro | Romanian |
| ru | Russian |
| si | Sinhala |
| sk | Slovak |
| sl | Slovenian |
| sq | Albanian |
| sr | Serbian |
| sv | Swedish |
| ta | Tamil |
| te | Telugu |
| tg | Tajik |
| th | Thai |
| tr | Turkish |
| uk | Ukrainian |
| ur | Urdu |
| uz | Uzbek |
| vi | Vietnamese |
| xh | Xhosa |
| yi | Yiddish |
| yo | Yoruba |
| zh | Chinese |
| zu | Zulu |
22 changes: 22 additions & 0 deletions configs/SM/sat_sm_12l.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,22 @@
{
    "output_dir": "sat_sm_12l",
    "lim_lookahead": false,
    "block_size": 256,
    "no_sm_corruption": false,
    "overwrite_output_dir": true,
    "evaluation_strategy": "steps",
    "eval_steps": 250,
    "report_to": "wandb",
    "learning_rate": 0.00003,
    "warmup_steps": 500,
    "per_device_train_batch_size": 128,
    "per_device_eval_batch_size": 128,
    "weight_decay": 0.01,
    "push_to_hub": false,
    "save_total_limit": 1,
    "save_strategy": "steps",
    "save_steps": 1000,
    "load_best_model_at_end": false,
    "max_steps": 20000,
    "num_layers": 12
}