k4black/fast-aug

fast-aug

fast-aug is a library for fast text augmentation, available for both Rust and Python as fast-aug.
It is designed with a focus on performance and real-time usage (e.g. augmenting batches during training), while providing a wide range of text augmentation methods.


Please refer to the respective READMEs of the Rust library and the Python bindings for details.

Features and TODO

Flow

  • ChanceAugmenter
  • SelectorAugmenter
  • SequentialAugmenter
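
The three flow augmenters above compose other augmenters. As a rough illustration of the idea only, here is a minimal plain-Python sketch. It does not use the fast-aug API: the `Augmenter` alias and the `chance`, `selector` and `sequential` helpers are hypothetical names, and the semantics (apply with a probability, pick one uniformly at random, apply all in order) are assumptions inferred from the class names.

```python
import random
from typing import Callable, Sequence

# Hypothetical type alias: an augmenter is any function from text to text.
Augmenter = Callable[[str], str]


def chance(aug: Augmenter, probability: float, rng: random.Random) -> Augmenter:
    """Apply `aug` with the given probability, otherwise return the input unchanged."""
    def apply(text: str) -> str:
        return aug(text) if rng.random() < probability else text
    return apply


def selector(augs: Sequence[Augmenter], rng: random.Random) -> Augmenter:
    """Pick one augmenter uniformly at random on every call (assumed semantics)."""
    def apply(text: str) -> str:
        return rng.choice(augs)(text)
    return apply


def sequential(augs: Sequence[Augmenter]) -> Augmenter:
    """Apply every augmenter in order, feeding each output into the next."""
    def apply(text: str) -> str:
        for aug in augs:
            text = aug(text)
        return text
    return apply


rng = random.Random(0)
pipeline = sequential([
    chance(str.upper, 1.0, rng),  # probability 1.0: always applied
    chance(str.strip, 0.0, rng),  # probability 0.0: never applied
    selector([lambda t: t + "!", lambda t: t + "?"], rng),
])
result = pipeline(" hello ")  # upper-cased, not stripped, then "!" or "?" appended
```

Passing the random generator explicitly keeps the pipeline reproducible, which is useful when comparing augmentation settings between training runs.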

Text

  • RandomWordsAugmenter
    • Base - swaps/deletions
    • Insertions/Substitutions (from alphabet)
  • RandomCharsAugmenter
    • Base - swaps/deletions
    • Insertions/Substitutions (from provided list)
    • Insertions/Substitutions (from vocab by language tag)
  • RandomSpellingAugmenter
  • RandomKeyboardAugmenter
  • RandomEmbeddingsAugmenter
  • RandomTfIdfAugmenter
  • RandomPosAugmenter
  • EmojiNormalizer
  • Keep labels (e.g. POS tags) unchanged
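
To make the "Base - swaps/deletions" entries concrete, the following is a minimal character-level sketch in plain Python. It is not the fast-aug implementation or API: the function names and the per-word `probability` parameter are hypothetical, chosen only to illustrate what a random swap and a random deletion do to a word.

```python
import random


def swap_adjacent_chars(word: str, rng: random.Random) -> str:
    """Swap one randomly chosen pair of adjacent characters."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def delete_random_char(word: str, rng: random.Random) -> str:
    """Delete one randomly chosen character; single-character words are kept as-is."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word))
    return word[:i] + word[i + 1:]


def augment_chars(text: str, probability: float, rng: random.Random) -> str:
    """Apply a swap or a deletion to each word with the given probability.

    Words are split on whitespace and re-joined with single spaces,
    so the original spacing is not preserved in this sketch.
    """
    out = []
    for word in text.split():
        if rng.random() < probability:
            op = rng.choice([swap_adjacent_chars, delete_random_char])
            word = op(word, rng)
        out.append(word)
    return " ".join(out)
```

A word-level variant would follow the same pattern, operating on the list of words instead of the list of characters.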

Models and utils

  • Models lazy loading
    • At creation time
    • At first use
    • Background after creation
  • candle support for DL models loading
    • HF loading
    • ONNX loading
    • Optimizations (fp16/int8/int4/layers/etc)
    • GPU support
  • TF-IDF model
    • json file loading
    • sklearn model loading
  • Alphabet model
  • Language Vocab model
  • Embeddings model
    • fasttext model loading
    • word2vec model loading
  • WordNet model
    • English
    • German
    • More?
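
The three lazy-loading modes listed above (at creation time, at first use, in the background after creation) can be sketched as follows. This is an illustration under stated assumptions, not code from fast-aug: `LazyModel`, its `mode` values and the `loader` callback are hypothetical names.

```python
import threading
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class LazyModel(Generic[T]):
    """Hypothetical wrapper that defers an expensive load; not part of fast-aug."""

    def __init__(self, loader: Callable[[], T], mode: str = "first_use") -> None:
        if mode not in ("creation", "first_use", "background"):
            raise ValueError(f"unknown mode: {mode}")
        self._loader = loader
        self._lock = threading.Lock()
        self._value: Optional[T] = None
        self._loaded = False
        if mode == "creation":
            # Load eagerly, inside the constructor.
            self.get()
        elif mode == "background":
            # Start loading right away without blocking the caller.
            threading.Thread(target=self.get, daemon=True).start()
        # mode == "first_use": nothing happens until get() is called.

    def get(self) -> T:
        """Return the model, loading it at most once across all threads."""
        with self._lock:
            if not self._loaded:
                self._value = self._loader()
                self._loaded = True
            return self._value
```

In the background mode a call to `get()` that arrives before loading has finished simply waits on the lock, so callers never observe a half-loaded model.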

Rust

  • Formatting
    • rustfmt
    • clippy
  • rust flamegraph profiling
  • Unit tests
  • Integration tests
  • CI build and tests
  • CI publish to crates.io

Python

  • Custom Python Augmenter class (user provided to use in pipelines)
  • Bindings with
    • Base pyo3 bindings
    • maturin auto build from pyproject.toml
    • Stubs (.pyi) files generation
    • Auto generate stubs on maturin build
    • Text
    • Flow
  • Auto generate return type in stubs, see pyo3 issue
  • flamegraph profiling
  • Optimizations - see this
  • Integration tests
  • CI build and tests
  • CI publish to pypi

Development

Prerequisites

Clone the repository:

git clone git@github.com:k4black/fast-aug.git
cd fast-aug

For rust library development:

For python bindings development:

  • All rust library prerequisites
  • cd bindings/python && python -m venv .venv
  • pip >= 23.1 to use --config-settings, see pip issue

Make

The Makefile contains all the commands needed for development.

make help
  • *-rust - all targets related to rust library (fast_aug/ folder)
  • *-python - all targets related to python bindings (bindings/python/ folder)

Benchmarks

All text benchmarks are run on the test set of the sentiment task of the tweet_eval dataset (about 12k rows).

cat test_data/tweet_eval_sentiment_test_text.txt | wc
12284  182576 1156877
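
For a quick throughput check on the same file, a timing helper along these lines may be enough. It is a rough sketch, not the project's benchmark harness: `rows_per_second` is a hypothetical helper, and `augment` stands for any callable that maps a string to a string.

```python
import time
from typing import Callable, Iterable


def rows_per_second(augment: Callable[[str], str], rows: Iterable[str]) -> float:
    """Time one pass of `augment` over `rows` and return throughput in rows per second."""
    rows = list(rows)  # materialize first so file reading is not part of the timing
    start = time.perf_counter()
    for row in rows:
        augment(row)
    elapsed = time.perf_counter() - start
    return len(rows) / elapsed if elapsed > 0 else float("inf")
```

To use it on the benchmark data, read the lines of test_data/tweet_eval_sentiment_test_text.txt into a list and pass them as `rows`. A single pass is noisy, so repeating the measurement and taking the best or median value gives a more stable number.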

License

This project and respective libraries are licensed under the MIT License - see the LICENSE file for details.