A flexible implementation of several count-min sketch variants.
A count-min sketch is a probabilistic counting data structure: to save space, it accepts a known, bounded probability of counting error. This lets you summarize a data stream in a fixed amount of space, rather than growing without bound as more data arrives.
For a longer description, and some of the relevant papers, see: https://sites.google.com/site/countminsketch/
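To make the idea concrete, here is a minimal, illustrative count-min sketch in pure Python. It is not the minsketch implementation: the class name `TinyCountMinSketch`, the salted BLAKE2b hashing, and the textbook sizing rule (width = ceil(e / epsilon), depth = ceil(ln(1 / delta))) are assumptions made for this sketch, and the library may size and hash its tables differently.

```python
import hashlib
import math


class TinyCountMinSketch:
    """Illustrative count-min sketch; NOT the minsketch implementation."""

    def __init__(self, delta, epsilon):
        # Textbook sizing (an assumption here, not taken from minsketch):
        # width = ceil(e / epsilon), depth = ceil(ln(1 / delta)).
        self.width = math.ceil(math.e / epsilon)
        self.depth = math.ceil(math.log(1.0 / delta))
        self.table = [[0] * self.width for _ in range(self.depth)]

    def _cells(self, item):
        # One column per row, derived from a BLAKE2b digest salted by row index.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                repr(item).encode("utf-8"), digest_size=8,
                salt=row.to_bytes(8, "little")).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def update(self, item, count=1):
        # Add the count to one counter in every row.
        for row, col in self._cells(item):
            self.table[row][col] += count

    def query(self, item):
        # Collisions can only inflate a counter, so the minimum across rows
        # is the tightest available estimate; it never undercounts.
        return min(self.table[row][col] for row, col in self._cells(item))


tiny = TinyCountMinSketch(delta=0.01, epsilon=0.01)
for word in ["a", "b", "a", "c", "a"]:
    tiny.update(word)
print(tiny.width, tiny.depth, tiny.query("a"))
```

The table size depends only on delta and epsilon, not on how many items are seen, which is where the space saving comes from. With positive updates only, a query can overestimate a count but never underestimate it.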
Here's an example usage:
import minsketch
from functools import partial
from timeit import timeit
# Delta and epsilon are uncertainty measures:
# count-min sketch estimates are within epsilon of the true value
# with probability 1 - delta
delta = 10 ** -5
epsilon = 0.001
# This implementation supports several different backing table classes
# For most purposes, the Python array makes for a good default:
table_class = minsketch.sketch_tables.ArrayBackedSketchTable
# If you are willing to support only positive updates (no deletions),
# you should use conservative updates, as this significantly reduces error:
update_strategy = minsketch.update_strategy.ConservativeUpdateStrategy
# If you have a reason to believe you might want lossy updating, you
# should benchmark one of the lossy updating schemes on your data:
lossy_strategy = minsketch.lossy_strategy.NoLossyUpdateStrategy
# So far, the hash-pair based implementation appears to outperform
# the universal hash family based one:
sketch = minsketch.double_hashing.HashPairCMSketch(
    delta, epsilon, table_class=table_class,
    update_strategy=update_strategy, lossy_strategy=lossy_strategy)
# Update the sketch with some data
from numpy import random
data = random.randint(0, 1000, 100000)
print(timeit(partial(sketch.update, data), number=1))
# Query the ten most common elements:
print(sketch.most_common(10))
# For a performance boost, you can also use a Counter-Sketch Hybrid
hybrid = minsketch.counter_sketch_hybrid.SketchCounterHybrid(
    minsketch.double_hashing.HashPairCMSketch(
        delta, epsilon, table_class=table_class,
        update_strategy=update_strategy, lossy_strategy=lossy_strategy))
print(timeit(partial(hybrid.update, data), number=1))
print(hybrid.most_common(10))
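To see why conservative updates reduce error, the following standalone sketch compares the plain update rule with the textbook conservative rule on a deliberately small table. It does not use minsketch; the helper names (`cells`, `plain_update`, `conservative_update`, `total_overcount`) are hypothetical, and it is an assumption that `ConservativeUpdateStrategy` follows the same textbook rule.

```python
import hashlib
import random
from collections import Counter


def cells(item, depth, width):
    """Yield one (row, column) pair per row, using a row-salted BLAKE2b digest."""
    for row in range(depth):
        digest = hashlib.blake2b(
            repr(item).encode("utf-8"), digest_size=8,
            salt=row.to_bytes(8, "little")).digest()
        yield row, int.from_bytes(digest, "little") % width


def plain_update(table, item, count=1):
    # Plain rule: increment the item's counter in every row.
    for r, c in cells(item, len(table), len(table[0])):
        table[r][c] += count


def conservative_update(table, item, count=1):
    # Conservative rule: raise each counter only as far as needed,
    # i.e. new value = max(old value, current estimate + count).
    hit = list(cells(item, len(table), len(table[0])))
    target = min(table[r][c] for r, c in hit) + count
    for r, c in hit:
        table[r][c] = max(table[r][c], target)


def query(table, item):
    return min(table[r][c] for r, c in cells(item, len(table), len(table[0])))


def total_overcount(update, stream, depth=3, width=16):
    """Sum of (estimate - true count) over all distinct items in the stream."""
    table = [[0] * width for _ in range(depth)]
    for item in stream:
        update(table, item)
    truth = Counter(stream)
    return sum(query(table, item) - n for item, n in truth.items())


rng = random.Random(0)  # fixed seed so the comparison is repeatable
stream = [rng.randrange(200) for _ in range(5000)]
plain_error = total_overcount(plain_update, stream)
conservative_error = total_overcount(conservative_update, stream)
print(plain_error, conservative_error)
```

With the same hash functions, every counter under the conservative rule is at most its value under the plain rule, so estimates can only get tighter while still never undercounting. The rule relies on counts only going up, which is why it rules out deletions.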
See the documentation at http://minsketch.readthedocs.io/en/latest/