Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use binary search to get seq from offset #200

Merged
merged 3 commits into from
Mar 29, 2022
Merged

use binary search to get seq from offset #200

merged 3 commits into from
Mar 29, 2022

Conversation

staltz
Copy link
Member

@staltz staltz commented Mar 29, 2022

Context

I'm preparing for #199 and one thing I realized is that I'll need a quick way of getting the seq from any given offset, e.g. during compaction there is going to be unshiftedOffset and I'd like to discover unshiftedSeq.

Problem

We don't have a helper function for that and currently we're just doing a linear for-loop to look for the offset.

Another problem is that the "end" of the for-loop is tarr.length, which I think is wrong because the tarr grows much more than the count does (like it doubles in size every time it "grows" but most of it is still unused). Instead we need to use index.count.

Solution

Introduce getSeqFromOffset that does a binary search and uses indexes['seq'].count as the end marker.

This should also have a performance improvement for those cases when the jitdb indexes are many and large but we just need to update the tip of the index. This way we can avoid scanning the whole index in O(n) and can do O(log n). I'm not sure we have any benchmark for this but should be a pretty common thing to happen in production.

@staltz staltz requested a review from arj03 March 29, 2022 12:49
@github-actions
Copy link

Benchmark results

Part Speed Heap Change Samples
Count 1 big index (3rd run) 0.31ms ± 0.04ms -5.65 kB ± 15.37 kB 52
Create an index twice concurrently 579.51ms ± 2.25ms 31.17 kB ± 61.11 kB 95
Load core indexes 0.83ms ± 0.01ms 93.75 B ± 183.36 B 9274
Load two indexes concurrently 720.66ms ± 13.28ms -90.18 kB ± 139.83 kB 17
Paginate 10 results 25.18ms ± 1.26ms 9.95 kB ± 18.95 kB 28
Paginate 20000 msgs with pageSize=5 6177.78ms ± 77.45ms -123.13 kB ± 4086.27 kB 5
Paginate 20000 msgs with pageSize=500 594.5ms ± 23.94ms -1.52 kB ± 712.31 kB 20
Query 1 big index (1st run) 811.31ms ± 11.76ms 1.67 kB ± 94.22 kB 67
Query 1 big index (2nd run) 281.26ms ± 5.11ms -30.71 kB ± 29.98 kB 47
Query 3 indexes (1st run) 1137.74ms ± 16.27ms 24.41 kB ± 149 kB 47
Query 3 indexes (2nd run) 216.3ms ± 3.1ms -96.49 kB ± 122.63 kB 54
Query a prefix map (1st run) 308.65ms ± 4.08ms 49.13 kB ± 513.65 kB 20
Query a prefix map (2nd run) 14.32ms ± 0.42ms -3.9 kB ± 16.56 kB 24

@staltz
Copy link
Member Author

staltz commented Mar 29, 2022

Here's previous benchmarks to compare with: #197

But there's no noticeable difference. But do we have a benchmark to measure this case? (A big index receives a bit of updates)

@arj03
Copy link
Member

arj03 commented Mar 29, 2022

This looks like a great optimization. I'm wondering if we can't we use bsb that is already included?

@staltz
Copy link
Member Author

staltz commented Mar 29, 2022

Good point! I switched to using bsb.eq in getSeqFromOffset, and I took a look at the source code, it's basically the same, so that's good. Now the PR is a net negative in lines of code changed. 😎

I also added another commit use seq index.count instead of index.tarr.length.

@github-actions
Copy link

Benchmark results

Part Speed Heap Change Samples
Count 1 big index (3rd run) 0.41ms ± 0.04ms -3.24 kB ± 15.44 kB 44
Create an index twice concurrently 757.32ms ± 7ms 46.4 kB ± 73.81 kB 72
Load core indexes 1.46ms ± 0.02ms 94.27 B ± 248.4 B 6504
Load two indexes concurrently 709.12ms ± 16.92ms 57.29 kB ± 142.61 kB 17
Paginate 10 results 24.43ms ± 1.05ms 972 B ± 18893.48 B 28
Paginate 20000 msgs with pageSize=5 5856.57ms ± 250.37ms -441.06 kB ± 4406.04 kB 5
Paginate 20000 msgs with pageSize=500 455.57ms ± 8.72ms 8.86 kB ± 669.42 kB 22
Query 1 big index (1st run) 822.93ms ± 8.8ms -5.34 kB ± 84.71 kB 66
Query 1 big index (2nd run) 264.16ms ± 1.51ms 15.44 kB ± 31.24 kB 52
Query 3 indexes (1st run) 854.47ms ± 9.14ms -7.25 kB ± 108.83 kB 64
Query 3 indexes (2nd run) 282.62ms ± 1.95ms -27.61 kB ± 172.07 kB 41
Query a prefix map (1st run) 272.8ms ± 5.3ms 10.09 kB ± 179.72 kB 22
Query a prefix map (2nd run) 16.26ms ± 0.45ms 1.2 kB ± 18.66 kB 21

Copy link
Member

@arj03 arj03 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent. This is a good improvement.

@staltz staltz merged commit a786bc2 into master Mar 29, 2022
@staltz staltz deleted the binary-search-seq branch March 29, 2022 14:32
@github-actions
Copy link

Benchmark results

Part Speed Heap Change Samples
Count 1 big index (3rd run) 0.35ms ± 0.03ms -5.7 kB ± 18.5 kB 51
Create an index twice concurrently 585.93ms ± 16.1ms 1.02 kB ± 52.52 kB 94
Load core indexes 1.21ms ± 0.01ms 98.8 B ± 208.69 B 7478
Load two indexes concurrently 700.31ms ± 15.45ms 90.45 kB ± 100.59 kB 18
Paginate 10 results 37.91ms ± 2.34ms 47.24 kB ± 42.95 kB 22
Paginate 20000 msgs with pageSize=5 6090.69ms ± 60.36ms 2.98 MB ± 2.67 MB 5
Paginate 20000 msgs with pageSize=500 540.77ms ± 27.94ms 565.28 kB ± 582.67 kB 20
Query 1 big index (1st run) 819.2ms ± 7.73ms -33.3 kB ± 75.36 kB 66
Query 1 big index (2nd run) 313.45ms ± 2.57ms -6.5 kB ± 33.52 kB 43
Query 3 indexes (1st run) 1026.27ms ± 9.03ms 37.59 kB ± 137.64 kB 53
Query 3 indexes (2nd run) 226.9ms ± 2.62ms -30 kB ± 201.89 kB 52
Query a prefix map (1st run) 294.02ms ± 4.54ms 28.23 kB ± 44.66 kB 21
Query a prefix map (2nd run) 13.75ms ± 0.55ms -7.06 kB ± 14.91 kB 24

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants