simstring-cuda

A simple implementation of string search against a small-to-midsize set of strings (a few million at most) using torch and GPU acceleration. The metric is cosine similarity of character 3-grams. This is meant to be a poor man's version of simstring: it does not scale to anywhere near the DB sizes simstring handles, and it implements none of simstring's finer tricks. On the other hand, it is easy to install. All it needs is sklearn and torch.

If the queries are grouped into batches of a few hundred, a lookup against a DB of 1.4M strings from wikidata runs at about 0.004 s per query string on a relatively dated GPU.
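To make the metric concrete, here is a minimal pure-Python sketch of cosine similarity over character 3-gram counts. It is only an illustration: the function names (char_ngrams, cosine, top_k) are made up for this example, and the package itself relies on sklearn and torch and may weight or pad the n-grams differently.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count the overlapping character n-grams of a string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity of two n-gram count vectors stored as Counters."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # strings shorter than n have no n-grams
    return dot / (norm_a * norm_b)


def top_k(query, strings, k=10):
    """Return the k (score, string) pairs most similar to the query, best first."""
    q = char_ngrams(query)
    scored = [(cosine(q, char_ngrams(s)), s) for s in strings]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


hits = top_k("helsinki", ["turku", "helsingin", "helsinki"], k=2)
for score, string in hits:
    print(f"{score:.3f}\t{string}")
```

This loop compares the query against every string one at a time, which is exactly the work that a GPU matrix product over the whole index does in a single step.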

Installation

python3 setup.py install

or

python3 setup.py bdist_wheel

The wheel file is written to dist/SimString_cuda-0.1.0-py3-none-any.whl, and you can then install it anywhere you want with pip3 install path/to/wheel.whl. Installing with pip has the advantage of adding the command-line executable simscuda to your path.

Usage

Throughout this section, strings refers to a list of strings to index.

Make an index and save it:

import simstringcuda as ssc
ssc_idx=ssc.build_index(strings)
ssc.save_index(ssc_idx,filename)

Load a saved index:

ssc_idx=ssc.load_index(filename)
# Optional: place the index on the GPU so that all searches run there. You can skip this if your DB holds only a small number of strings.
# This method passes all of its arguments to the torch .cuda() call.
ssc_idx.cuda()

Lookup some strings:

For the GPU to make any sense, queries should preferably be grouped into batches of a few hundred or so, depending on your GPU memory. The limiting factor on memory is that a matrix of index × query is created. If your lookup runs out of GPU memory, make smaller query batches.

queries=["my","query","strings","there","can","be","many"]
res=ssc.lookup(queries,ssc_idx,10) #find top-10 hits for every query string
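The batching advice can be sketched with two small helpers. Both names (batched, score_matrix_bytes) are hypothetical and not part of the package. The memory estimate assumes the index × query matrix is dense and stored as float32, which this README does not confirm, so treat the number as a rough upper bound.

```python
def batched(items, batch_size=256):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def score_matrix_bytes(n_index, n_queries, bytes_per_value=4):
    """Rough size of a dense index x query matrix (float32 assumed by default)."""
    return n_index * n_queries * bytes_per_value


# 1.4M indexed strings against a batch of 300 queries, in gigabytes
print(score_matrix_bytes(1_400_000, 300) / 1e9)

# Usage with the lookup call shown above (needs the package and a loaded index):
# results = [ssc.lookup(batch, ssc_idx, 10) for batch in batched(all_queries, 256)]
```

The return structure of ssc.lookup is not described here, so the usage line simply collects one result per batch without assuming its shape.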

Command-line usage:

The command simscuda gets installed for you by pip, so installing the package this way is probably the best option.

pip3 install path/to/builtwheel.whl
simscuda -h

Create an index out of all strings in a file and store it in the file index.fi:

bzcat strings.fi.bz2 | simscuda -c index.fi

Look up the first 1000 of these strings again:

bzcat strings.fi.bz2 | head -n 1000 | simscuda index.fi

And get the output in JSONL format for easier processing later:

bzcat strings.fi.bz2 | head -n 1000 | simscuda --jsonl index.fi > out.jsonl
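For the later processing step, a JSONL file can be read one record per line with the standard library. The helper name read_jsonl is made up for this example. The fields that simscuda writes into each record are not documented here, so the sketch makes no assumption about them and simply yields whatever each line parses to.

```python
import json


def read_jsonl(path):
    """Yield one parsed JSON value per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


# Usage, assuming out.jsonl was produced by the command above:
# for record in read_jsonl("out.jsonl"):
#     print(record)
```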

fginter/simstring-cuda
