HimanshuMittal01/bagmodels

BagModels

BagModels is a collection of bag-of-words (BoW) machine learning algorithms. It currently includes Okapi BM25, with more to come.

BM25 is a text retrieval function that can find similar documents, or rank search results over a set of documents, based on the query terms appearing in each document, irrespective of how close those terms are to each other. It is an improved and more general version of the TF-IDF weighting scheme used in NLP.
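
To make the idea concrete, here is a minimal pure-Python sketch of the standard Okapi BM25 scoring formula. It is not the implementation used by bagmodels, and it is not expected to reproduce the values returned by model.similarity. The function name bm25_scores, the whitespace tokenization, the smoothed IDF variant, and the defaults k1=1.5 and b=0.75 are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Score every document in `corpus` against `query` with Okapi BM25.

    Illustrative only: assumes a non-empty corpus and whitespace tokens.
    """
    docs = [doc.lower().split() for doc in corpus]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs

    # Document frequency: the number of documents containing each term.
    doc_freq = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        term_freq = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue  # terms absent from the document contribute nothing
            df = doc_freq[term]
            # Smoothed IDF; the "+ 1" inside the log keeps it non-negative.
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
            tf = term_freq[term]
            # k1 controls term-frequency saturation, b controls length normalization.
            norm = tf + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Term order and proximity play no role here: each query term contributes independently, which is the bag-of-words property described above. A document that contains none of the query terms scores exactly 0.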

Installation

It can be installed using pip:

pip install bagmodels

Getting started

Basic usage

import re
from bagmodels import BM25

# Load corpus
corpus = [
    "Yo, I love NLP model",
    "I like algorithms",
    "I love ML!",
]

# Clean manually if needed, or pass a custom tokenizer to BM25
corpus = [re.sub(r",|!", " ", doc).strip() for doc in corpus]

# Initialize model
model = BM25(corpus=corpus)

# Similarity
model.similarity("I love NLP model", "I like NLP model") # 0.775
model.similarity("I love blah", "I love algorithms") # 0.446
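
The cleaning step above can also be written as a small tokenizer function. The sketch below is a hypothetical helper (simple_tokenize is not part of bagmodels), and since the source does not show how a custom tokenizer is passed to BM25, that part is deliberately left out.

```python
import re


def simple_tokenize(text):
    """Lowercase `text` and return its alphanumeric tokens in order."""
    # Punctuation such as "," and "!" is dropped because it never matches.
    return re.findall(r"[a-z0-9]+", text.lower())


simple_tokenize("I love ML!")  # ['i', 'love', 'ml']
```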

Save and reuse models

# libraries imported and corpus loaded as in the previous example
model = BM25(corpus=corpus)

# write to save path
model.save("output/bm25_v1.jbl")

# load again
model = BM25.load("output/bm25_v1.jbl")

# add documents if required
model.resume(corpus=additional_corpus)

# use the reloaded model as before, e.g. to compare two documents
model.similarity(doc_a, doc_b)
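
A pairwise similarity function like the one above can be turned into a simple search by scoring every document against a query and sorting. The following is a sketch under assumptions: rank_documents and jaccard are hypothetical helpers, not part of bagmodels, and jaccard is only a stand-in so the example runs without a trained model. In practice model.similarity could presumably be passed as the similarity argument, since it takes two documents and returns a score.

```python
def rank_documents(query, documents, similarity):
    """Return (score, document) pairs sorted from most to least similar."""
    scored = [(similarity(query, doc), doc) for doc in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def jaccard(a, b):
    """Stand-in similarity: shared words divided by total distinct words."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


docs = ["I love NLP model", "I like algorithms", "I love ML"]
for score, doc in rank_documents("I love NLP", docs, jaccard):
    print(f"{score:.2f}  {doc}")
```

This calls the similarity function once per document, so it suits small corpora; it says nothing about how bagmodels handles retrieval internally.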

Coming soon

Please feel free to open an issue to request a feature or discuss any changes. Pull requests are most welcome.

I am actively working on adding the following:

  • OkapiBM25
  • BM25 variations
  • MultiThreading
