Faster sort #123
Conversation
Need to link the db2 benchmark issue here. I'll look at the tests; it was working fine locally.
Failing test:
Let's see if it is better now.
Right, the problem is that we take the min of the arrival and declared timestamps, and if you are unlucky the arrival timestamp on 2 of them will be the same. So the test doesn't really work.
Benchmark results
Very neat idea with a priority queue! I'll review carefully later; I might be able to help with … By the way, I also thought that the sorting could be done gradually. If we are doing paginate with pageSize=1 every time, then basically we are just looking for maximums. I know that insertion sort allows us to do it gradually, but I wonder if there are other, faster algorithms that do it gradually. Then we could spread that sorting workload across the many paginate calls.
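For what it's worth, a priority queue already gives exactly this gradual behavior: building the heap is O(n), and each paginate call then only pays O(log n) per item it actually emits. A minimal sketch with FastPriorityQueue (the names `records` and `timestamp` are illustrative, not the actual jitdb code):

```js
const FastPriorityQueue = require('fastpriorityqueue')

// Illustrative unsorted result set; stands in for the query results.
const records = [{ timestamp: 3 }, { timestamp: 1 }, { timestamp: 2 }]

// Max-first queue: the comparator returns true when `a` should be
// polled before `b`, so this orders by descending timestamp.
const queue = new FastPriorityQueue((a, b) => a.timestamp > b.timestamp)
queue.heapify(records) // O(n) one-time build, no full sort

// Each call pops only pageSize items, costing O(pageSize * log n)
// and spreading the sorting workload across the paginate calls.
function nextPage(pageSize) {
  const page = []
  while (page.length < pageSize && !queue.isEmpty()) {
    page.push(queue.poll()) // poll() removes the current maximum
  }
  return page
}
```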
I put this into the browser and have been clicking around. Maybe it's just psychological, but this feels snappier than before. The numbers are also good. Public is now 40 ms.
Great news!
tweak usage of fastpriorityqueue
Should have mentioned this in the other one, but with the latest changes from @staltz this sometimes goes as low as 164 ms (first page) and 180 ms (total, 2x25 messages). It is a bit all over the place, but somewhere between 164 and 240 ms for the first one.
Benchmark results
Strange that this seems to be slower on the jitdb benchmark; I will try to run that locally.
I see the same here locally. For reference, Paginate 1 big index was 269 ms before.
So strange... (I'm on mobile)
In the esc meeting right now, but I'll have a look at the benchmarks here to see what is up, and see if we can get them better aligned with real-world tests.
Yeah, I'll look into the asyncAOL stream bugs in the meantime.
I think these numbers are real: the "paginate 1 huge index" benchmark is going through 4000 msgs, and each of those takes roughly 10 ms, sometimes less, which individually sounds like a little, but in aggregate it's indeed 20s+.
Right, 5 * 4000. So the question is: is that a good benchmark to have? I mean, is it a realistic case? I'm thinking something like 25 * 10 is maybe a better benchmark to have.
On the master branch, I put some markers to measure the performance of doing the cache lookup and then a simple …
Let me take a look whether any of the queries I do in Manyverse looks like this. But I'm inclined to say that yes, it matters, because to build ssb-threads we want to go through several … I do like the FastPriorityQueue idea, I just think we need to be careful with it. So let's not merge just yet.
Ok, maybe you can increase the chunk size a bit?
Hmm, that's a decent idea. I wonder if I could make a pull-stream operator that makes this transparent, so that I can pull one item at a time while underneath it's actually using pagination with a large page size. I did a quick study on the "paginate one huge index" benchmark, with total items = 20k in every scenario: sensible page sizes are in the hundreds, and 500 was a sweet spot.
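One hedged way that "transparent" operator could look, sketched as a plain pull-stream source (the `getPage` callback is hypothetical, not a jitdb API): it fetches one large page at a time and hands out single items from a buffer.

```js
// Hypothetical chunked reader: `getPage(cb)` is assumed to call back
// with the next page as an array, and with an empty array when done.
function chunkedSource(getPage) {
  let buffer = []
  let done = false
  // A pull-stream source: function (abort, cb) with cb(end, data).
  return function read(abort, cb) {
    if (abort) return cb(abort)
    if (buffer.length > 0) return cb(null, buffer.shift())
    if (done) return cb(true)
    getPage((err, page) => {
      if (err) return cb(err)
      if (!page || page.length === 0) {
        done = true
        return cb(true) // end of stream
      }
      buffer = page.slice()
      cb(null, buffer.shift())
    })
  }
}
```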
I approve this PR, but I'd like to add a benchmark that uses pageSize=5 and another that uses pageSize=1000, just so that we keep tracking the performance of these different ranges of values.
Please do, I'm not in a super rush to get this merged. It's a bit of a trade-off here. I like the idea of a chunked reader; I wonder if we could do something general enough to work in the async iterator case as well. Almost sounds like someone must have thought of that problem before 🤔
Now that I think about it, I don't think we need a new operator, we can just do:

```js
pull(
  query(
    fromDB(db),
    and(equal(seekType, 'post', { indexType: 'type' })),
    paginate(PAGESIZE),
    toPullStream()
  ),
  pull.take(NUMPAGES),
  pull.map(pull.values), // this
  pull.flatten(), // and this
  pull.drain(op)
)
```

For async iterators, people can do it manually or use an async iterable library such as https://github.com/ReactiveX/IxJS. We don't need to cover those use cases, because JITDB's responsibility ends after the …
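The manual async-iterator version is indeed small; a sketch assuming the query yields one page (an array) per iteration:

```js
// Flatten an async iterable of pages (arrays) into single items.
async function* flattenPages(pages) {
  for await (const page of pages) {
    yield* page
  }
}

// Illustrative usage, assuming some async iterable `pagedResults`:
// for await (const msg of flattenPages(pagedResults)) op(msg)
```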
Benchmark results
Great PR, great collaboration, and great that most of our hurdles are now behind us!
Benchmark results
This is a PR to improve the performance of paginated queries with a large number of results. The main bottleneck was sorting the result set before doing a slice. I tried a lot of different sorting techniques, including radix sort, but they all had a drawback that made them not quite right. Instead I thought: our old friend Lemire must have thought about this problem, and lo and behold, he has: We can sort and slice...
The idea is to use a priority queue instead of sorting the array, and to pluck elements from the queue. I was a bit concerned about the clone() call in the sortedCache, but that turns out to still be worth it.
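To make the before/after concrete, a rough sketch of the two approaches (illustrative names, not the actual jitdb internals):

```js
const FastPriorityQueue = require('fastpriorityqueue')

// Before: sort the entire result set just to slice out one page,
// paying O(n log n) no matter how small the page is.
function pageBySort(records, offset, pageSize) {
  const sorted = records.slice().sort((a, b) => b.timestamp - a.timestamp)
  return sorted.slice(offset, offset + pageSize)
}

// After: keep a heapified queue cached, clone() it per query so the
// cache survives, and poll only offset + pageSize items, i.e.
// O((offset + pageSize) * log n) instead of a full sort.
function pageByQueue(cachedQueue, offset, pageSize) {
  const queue = cachedQueue.clone()
  for (let i = 0; i < offset; i++) queue.poll() // skip earlier pages
  const page = []
  while (page.length < pageSize && !queue.isEmpty()) {
    page.push(queue.poll())
  }
  return page
}
```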
| Query | Old | New |
| --- | --- | --- |
| public (post + contact) | 545 ms | 214 ms |
| public (post + contact), page 2 (total for pages 1+2) | 551 ms | 241 ms |
| private (post + contact) | 50 ms | 50 ms |