torchFastText : Efficient text classification with PyTorch

A flexible PyTorch implementation of FastText for text classification with support for categorical features.

Features

Supports text classification with the FastText architecture
Handles both text and categorical features
N-gram tokenization
Flexible optimizer and scheduler options
GPU and CPU support
Model checkpointing and early stopping
Prediction and model explanation capabilities
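To make the n-gram tokenization feature concrete, here is a minimal sketch of FastText-style character n-grams hashed into a fixed number of buckets. It is not taken from this package: the helper names `char_ngrams` and `bucket_id` are hypothetical, and the FNV-1a hash is an illustrative choice. The parameters `min_n`, `max_n` and `num_buckets` mirror the constructor arguments shown in the Quick Start.

```python
def char_ngrams(word: str, min_n: int = 3, max_n: int = 6) -> list[str]:
    """Return all character n-grams of `word`, padded with '<' and '>' boundary markers."""
    padded = f"<{word}>"
    return [
        padded[i : i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(padded) - n + 1)
    ]


def bucket_id(ngram: str, num_buckets: int = 1_000_000) -> int:
    """Map an n-gram to a bucket index with a 32-bit FNV-1a hash (illustrative choice)."""
    h = 2166136261
    for byte in ngram.encode("utf-8"):
        h = ((h ^ byte) * 16777619) & 0xFFFFFFFF
    return h % num_buckets


# Hypothetical example: n-grams of length 3 and 4 for the word "text".
ngrams = char_ngrams("text", min_n=3, max_n=4)
ids = [bucket_id(g, num_buckets=1000) for g in ngrams]
```

Hashing keeps the embedding table at a fixed size (`num_buckets` rows) regardless of how many distinct n-grams appear in the corpus, at the cost of occasional collisions.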

Installation

pip install torchFastText

Key Components

build(): Constructs the FastText model architecture
train(): Trains the model with built-in callbacks and logging
predict(): Generates class predictions
predict_and_explain(): Provides predictions with feature attributions

Subpackages

preprocess: Preprocesses text input using the nltk and unidecode libraries.
explainability: Simple methods to visualize feature attributions at the word and letter levels, using the captum library.

Run pip install torchFastText[preprocess] or pip install torchFastText[explainability] to install these optional dependencies.

Quick Start

from torchFastText import torchFastText

# Initialize the model
model = torchFastText(
    num_buckets=1000000,
    embedding_dim=100,
    min_count=5,
    min_n=3,
    max_n=6,
    len_word_ngrams=True,
    sparse=True
)

# Train the model
model.train(
    X_train=train_data,
    y_train=train_labels,
    X_val=val_data,
    y_val=val_labels,
    num_epochs=10,
    batch_size=64
)

# Make predictions
predictions = model.predict(test_data)

Here, train_data is an array of shape $(N, d)$: the first column holds the text as strings, and the remaining columns hold the tokenized categorical variables as integers.
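As an illustration of that layout, the sketch below builds a small array with one text column and two categorical columns. The sample texts and the variables `cat_a` and `cat_b` are made up for the example, and the use of a NumPy array with `dtype=object` is an assumption about a convenient way to mix strings and integers, not a documented requirement of the package.

```python
import numpy as np

# Hypothetical raw texts (first column of the array).
texts = [
    "cheap flights to paris",
    "python list comprehension",
    "best pizza dough recipe",
]

# Two hypothetical categorical variables, already encoded as integers.
cat_a = [0, 1, 2]
cat_b = [1, 0, 1]

# dtype=object lets strings and integers share a single (N, d) array.
train_data = np.empty((len(texts), 3), dtype=object)
train_data[:, 0] = texts
train_data[:, 1] = cat_a
train_data[:, 2] = cat_b

print(train_data.shape)  # (N, d) with N = 3 samples and d = 3 columns
```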

Please make sure y_train contains every possible label at least once.
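A quick check along these lines can catch missing labels before training. The helper `missing_labels` is hypothetical and not part of the package, and it assumes labels are encoded as the integers 0 to `num_classes - 1`.

```python
import numpy as np


def missing_labels(y_train, num_classes: int) -> list[int]:
    """Return the labels in 0..num_classes-1 that never occur in y_train."""
    present = set(np.unique(np.asarray(y_train)).tolist())
    return sorted(set(range(num_classes)) - present)


# Hypothetical labels: class 1 is absent from the training set.
absent = missing_labels([0, 2, 2, 0], num_classes=3)
if absent:
    print(f"Labels missing from y_train: {absent}")
```

If the returned list is not empty, one option is to resample or re-split the data so that each class appears in the training set.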

Dependencies

PyTorch Lightning
NumPy

Documentation

For detailed usage and examples, please refer to the experiments notebook. Use pip install -r requirements.txt after cloning the repository to install the necessary dependencies (some are specific to the notebook).

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

MIT

References

Inspired by the original FastText paper [1] and implementation.

[1] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification

@InProceedings{joulin2017bag,
  title={Bag of Tricks for Efficient Text Classification},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
  booktitle={Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers},
  month={April},
  year={2017},
  publisher={Association for Computational Linguistics},
  pages={427--431},
}

