GitHub - ezekielnewren/hibp_rust: Manage the hundreds of millions of ntlm hashes that are available on https://haveibeenpwned.com/

Processing a stream of hashes

Bloom Filter Pre-Filtering
- only add the element to the queue if it might be in the dataset
Sort Queue
- after a certain threshold is reached sort the queue for the next step
Search the elements of the queue
- use adaptive interpolation and binary search to determine if/where an element is located
- retain the bounds of adaptive interpolation and adjust it as elements of the queue are processed
Repeat until end of Stream
- go back to step 1 until all elements have been processed

The explicit switching between filtering which elements are even considered and fully determining if/where the element is in the dataset is done to optimize cache efficiency. During the filtering step, memory for that step will reside in the caches and will not be evicted by the explicit searches. Once the searching starts the memory involved in finding the elements will primarily be what's in the cache. I believe that this process could potentially be run in parallel for even better performance, or maybe suffer from resource conflict because this problem is mostly memory bound.

Initalize bloom filter upon use rather than pre-compute. A bloom filter may not be worth while at all if it's often saturated or the bit array is so large that cacheing is ineffective.

Name		Name	Last commit message	Last commit date
Latest commit History 96 Commits
hibp		hibp
hibp_benchmark		hibp_benchmark
hibp_core		hibp_core
.gitignore		.gitignore
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

About

Releases

Packages

Languages

ezekielnewren/hibp_rust

Folders and files

Latest commit

History

Repository files navigation

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages