fast-aug is a library for fast text augmentation, available for both Rust and Python as fast-aug.
It is designed with a focus on performance and real-time usage (e.g. during training), while providing a wide range of text augmentation methods.
Please refer to the respective READMEs for details:
- rust - see fast_aug/README.md;
- python - see bindings/python/README.md.
Flow
- ChanceAugmenter
- SelectorAugmenter
- SequentialAugmenter
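To make the intent of the flow augmenters concrete, here is a minimal pure-Python sketch of the three composition patterns their names suggest: apply with some probability, pick one of several, and apply all in order. This is not the fast-aug API. The function names `chance`, `selector` and `sequential`, their parameters, and the exact semantics are assumptions inferred from the augmenter names; see the respective READMEs for the real interfaces.

```python
import random
from typing import Callable, Sequence

# Hypothetical type alias for this sketch: an augmenter maps text to text.
Augment = Callable[[str], str]


def chance(augment: Augment, p: float, rng: random.Random) -> Augment:
    """Apply `augment` with probability `p`, otherwise return the text unchanged."""
    def run(text: str) -> str:
        return augment(text) if rng.random() < p else text
    return run


def selector(augments: Sequence[Augment], rng: random.Random) -> Augment:
    """Apply exactly one augmenter, chosen uniformly at random on each call."""
    def run(text: str) -> str:
        return rng.choice(augments)(text)
    return run


def sequential(augments: Sequence[Augment]) -> Augment:
    """Apply every augmenter in order, feeding each output into the next."""
    def run(text: str) -> str:
        for augment in augments:
            text = augment(text)
        return text
    return run


# Illustrative pipeline built from plain string methods as stand-in augmenters.
rng = random.Random(0)
pipeline = sequential([
    chance(str.upper, p=0.5, rng=rng),
    selector([str.strip, str.title], rng=rng),
])
print(pipeline("  some tweet text  "))
```

Passing an explicit `random.Random` instance keeps the sketch reproducible, which is useful when comparing augmented outputs between runs.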
Text
- RandomWordsAugmenter
- Base - swaps/deletions
- Insertions/Substitutions (from alphabet)
- RandomCharsAugmenter
- Base - swaps/deletions
- Insertions/Substitutions (from provided list)
- Insertions/Substitutions (from vocab by language tag)
- RandomSpellingAugmenter
- RandomKeyboardAugmenter
- RandomEmbeddingsAugmenter
- RandomTfIdfAugmenter
- RandomPosAugmenter
- EmojiNormalizer
- Keep labels (e.g. POS tags) unchanged
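As an illustration of the "swaps/deletions" base operations listed above, the following sketch implements word-level deletion and adjacent-word swapping in plain Python. It is an assumption-laden example, not the library's implementation: the function names, the whitespace tokenization, and the rule of always keeping at least one word are all hypothetical choices made for this sketch.

```python
import random


def random_word_delete(text: str, p: float, rng: random.Random) -> str:
    """Drop each whitespace-separated word independently with probability `p`."""
    words = text.split()
    kept = [w for w in words if rng.random() >= p]
    # Assumption for this sketch: never return an empty string for
    # non-empty input, so keep one randomly chosen word instead.
    if words and not kept:
        kept = [rng.choice(words)]
    return " ".join(kept)


def random_word_swap(text: str, n_swaps: int, rng: random.Random) -> str:
    """Swap `n_swaps` randomly chosen pairs of adjacent words."""
    words = text.split()
    if len(words) < 2:
        return text  # nothing to swap
    for _ in range(n_swaps):
        i = rng.randrange(len(words) - 1)
        words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)


rng = random.Random(42)
print(random_word_delete("the quick brown fox jumps", p=0.3, rng=rng))
print(random_word_swap("the quick brown fox jumps", n_swaps=2, rng=rng))
```

The same two operations carry over to characters by treating each word as a sequence of characters instead of treating the text as a sequence of words.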
Models and utils
- Models lazy loading
- At creation time
- At first use
- Background after creation
- candle support for DL models loading
- HF loading
- ONNX loading
- Optimizations (fp16/int8/int4/layers/etc)
- GPU support
- TF-IDF model
- json file loading
- sklearn model loading
- Alphabet model
- Language Vocab model
- Embeddings model
- fasttext model loading
- word2vec model loading
- WordNet model
- English
- German
- More?
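The three lazy-loading options listed above (at creation time, at first use, in the background after creation) can be sketched with a small wrapper around a loader function. This is only an illustration of the idea under stated assumptions: the `LazyModel` class, its `mode` values, and the `get` method are hypothetical names and do not describe how fast-aug loads models.

```python
import threading
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class LazyModel(Generic[T]):
    """Hold a model that is loaded eagerly, on first use, or in the background."""

    def __init__(self, loader: Callable[[], T], mode: str = "first_use") -> None:
        if mode not in {"creation", "first_use", "background"}:
            raise ValueError(f"unknown mode: {mode}")
        self._loader = loader
        self._lock = threading.Lock()
        self._loaded = False
        self._model: Optional[T] = None
        if mode == "creation":
            # Load immediately; the constructor blocks until the model is ready.
            self.get()
        elif mode == "background":
            # Start loading right away without blocking the caller.
            threading.Thread(target=self.get, daemon=True).start()

    def get(self) -> T:
        """Return the model, loading it at most once even across threads."""
        with self._lock:
            if not self._loaded:
                self._model = self._loader()
                self._loaded = True
            return self._model  # type: ignore[return-value]


# Stand-in loader; a real one would read a vocabulary or embeddings file.
model = LazyModel(lambda: {"hello": 1, "world": 2}, mode="background")
print(model.get()["hello"])
```

In background mode, a call to `get` that arrives before loading has finished simply waits on the lock, so callers never observe a partially loaded model.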
Rust
- Formatting
- rustfmt
- clippy
- rust flamegraph profiling
- Unit tests
- Integration tests
- CI build and tests
- CI publish to crates.io
Python
- Custom Python Augmenter class (user provided to use in pipelines)
- Bindings with
- Base pyo3 bindings
- maturin auto build from pyproject.toml
- Stubs (.pyi) files generation
- Auto generate stubs on maturin build
- Text
- Flow
- Auto generate return type in stubs, see pyo3 issue
- flamegraph profiling
- Optimizations - see this
- Integration tests
- CI build and tests
- CI publish to pypi
Clone the repository:
git clone git@github.com:k4black/fast-aug.git
cd fast-aug
For rust library development:
For python bindings development:
- All rust library prerequisites
cd bindings/python && python -m venv .venv
- pip >= 23.1 to use --config-settings, see pip issue
The Makefile contains all the commands needed for development.
make help
- *-rust - all targets related to rust library (fast_aug/ folder)
- *-python - all targets related to python bindings (bindings/python/ folder)
All text benchmarks are run on the tweet_eval dataset - sentiment task, test set, 12k rows.
cat test_data/tweet_eval_sentiment_test_text.txt | wc
12284 182576 1156877
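The three numbers printed by `wc` are the line, word and byte counts of the benchmark file. A rough Python equivalent is sketched below for readers without `wc` at hand; the helper name `wc` is chosen for this example, and the word count may differ slightly from the real tool for unusual whitespace or locale settings.

```python
from pathlib import Path
from typing import Tuple


def wc(data: bytes) -> Tuple[int, int, int]:
    """Return (newline count, whitespace-separated word count, byte count)."""
    return data.count(b"\n"), len(data.split()), len(data)


# Path taken from the command above; only read the file if it is present.
path = Path("test_data/tweet_eval_sentiment_test_text.txt")
if path.exists():
    print(*wc(path.read_bytes()))
```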
This project and respective libraries are licensed under the MIT License - see the LICENSE file for details.