
Benchmark analysis #108

Open
staltz opened this issue Jan 2, 2021 · 33 comments
staltz commented Jan 2, 2021

Not really an issue, I just want some place where I can report my analysis on the benchmark runs.

On my computer npm run benchmark-no-create from #107 reported 8030ms for the initial indexing test.

In a profiler, the big picture looks like this:

Screenshot from 2021-01-02 22-40-00

There are roughly 4 parts there:

  • Setup: require everything, set up secret-stack plugins, all that stuff (looks like spiky chaos on the left)
  • Migrate: that intensive and constant workload up until ~3200ms is just the migrate plugin
  • Base indexing: after some silence, around ~3500ms, db2 begins indexing the base index
  • Query: around ~7900ms, the JITDB query starts: it runs createMissingIndexes, which needs a log.stream, and that log.stream performs decrypt and so forth; the final spike before the silence is getBitsetForOperation etc

Stats

  • Setup took approx 350ms
  • Migrate took approx 2915ms
  • Base indexing took approx 4390ms
  • Query took approx 440ms

Ideas

Separate benchmark reports for migration

Migration's duration should be measured separately from base indexing.

Don't strictly require indexing base index before db.query()

If db.query() only uses JITDB, we shouldn't block it from running until the base index is done. In the example in benchmark/index.js, we don't use the base index, so we should get the db.query results much earlier.

Allow customizing when decrypt should run inside log.stream

For the db.query in this example, we didn't need to consider private messages. There should be a way of specifying that we don't need decrypt inside log.stream when indexing something in JITDB. In the ideal situation, JITDB would automatically detect whether it needs unboxing, and pass that down as a parameter to log.stream, but a manual opt-in might also be okay.

Push-streams could/should be paused

If we're doing asynchronous work upon receiving an item from a push-stream, we should pause the push-stream until the async work is done, and then resume it. This would use the backpressure capabilities of push-stream. Probably we're not in deep trouble if we don't do this, but in some corner cases, without backpressure we might accumulate tasks and incoming data until we exhaust the computer's RAM, which can have adverse effects elsewhere.
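The pause/resume idea can be sketched with a toy push-stream. This is not the real push-stream module's API; `pipe`, `pause`, and `resume` here are simplified stand-ins, and the `setImmediate` callback stands in for the async decrypt/indexing work:

```javascript
// Simplified stand-in for a push-stream source with backpressure.
// NOT the real push-stream API; it only illustrates the idea of
// pausing the producer while async work on an item is in flight.
class TinyPushStream {
  constructor(items) {
    this.items = items
    this.i = 0
    this.paused = false
    this.sink = null
  }
  pipe(sink) {
    this.sink = sink
    this._drain()
  }
  pause() { this.paused = true }
  resume() {
    this.paused = false
    this._drain()
  }
  _drain() {
    // push items synchronously until paused or exhausted
    while (!this.paused && this.i < this.items.length) {
      this.sink(this.items[this.i++])
    }
  }
}

const stream = new TinyPushStream([1, 2, 3])
const results = []

stream.pipe((item) => {
  stream.pause() // stop the producer while we do async work
  setImmediate(() => { // stand-in for decrypt/indexing work
    results.push(item * 10)
    stream.resume() // producer continues only after the work is done
  })
})
```

Each item is fully processed before the next one is emitted, so at most one item's worth of work is in flight at a time.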

cc @arj03

staltz commented Jan 3, 2021

> Allow customizing when decrypt should run inside log.stream

I tested this idea, running a db.query() without decrypt takes 350ms (first JITDB run, i.e. creates the core indexes and bitvector index from scratch). With decrypt it takes 1300ms. That's 3.7x heavier.

My idea is basically: pass a boolean decrypt: true down in the operators, and then JITDB will call log.stream({ decrypt: true }) if it detects that any index in the query had decrypt: true. E.g. db.query(and(type('post', {decrypt: false})), etc) would pass the opts down to JITDB, and JITDB would pass those opts back to ssb-db2/log.js, which then would not run decrypt.

This way we let the query developer choose wisely when to do decryption. The default (false or true) can be discussed.
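A sketch of how that flag could flow from the operators down to the log stream. The operator and option names here are hypothetical stand-ins, not the actual jitdb/ssb-db2 API:

```javascript
// Hypothetical sketch of threading a `decrypt` flag from query
// operators down to log.stream. All names are made up for illustration.
function type(value, opts = {}) {
  return { indexName: 'type_' + value, decrypt: opts.decrypt === true }
}

function and(...ops) {
  return { op: 'AND', ops }
}

// jitdb side: only ask log.stream to decrypt if some index needs it
function runQuery(query, logStream) {
  const needsDecrypt = query.ops.some((op) => op.decrypt)
  return logStream({ decrypt: needsDecrypt })
}

// stand-in for ssb-db2's log.stream
function fakeLogStream(opts) {
  return opts.decrypt ? 'streamed with decrypt' : 'streamed without decrypt'
}

const cheap = runQuery(and(type('post', { decrypt: false })), fakeLogStream)
const full = runQuery(and(type('post', { decrypt: true })), fakeLogStream)
```

The point of the design is that the decision stays per-query: a query that never touches private messages skips the 3.7x-heavier decrypt pass entirely.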

cc @arj03

arj03 commented Jan 3, 2021

We can do a flag, but this is a one-time cost and it's shared between the indexes in ssb-db2 and jitdb, so in the overall picture I don't think it will matter that much.

arj03 commented Jan 3, 2021

Also will respond more in full tomorrow :)

arj03 commented Jan 4, 2021

> Don't strictly require indexing base index before db.query()

Sounds like a good idea. It would still need a log.drain.

> I tested this idea, running a db.query() without decrypt

If it will help manyverse I'll be happy to accept such a PR, but I have been testing ssb-browser and I wouldn't use it there.

arj03 commented Jan 4, 2021

> Push-streams could/should be paused

Hmm yeah, would be interesting to test. The main thing I have left for the port of ssb-browser to db2 is that when I do an initial sync, the browser just spins for quite a while afterwards. Outside the browser things are fine. I still need to see what is going on, but I'll keep this in mind. Thanks for mentioning it.

staltz commented Jan 4, 2021

@arj03 Alright, I thought this through after our video call, and here's what I figured out:

Yes, it's true that the heavyweight decrypt is run only once, and subsequent queries should be much lighter. But from the user's perspective, it would be good if some queries took priority over others, with their results shown ASAP while lower-priority queries are processed in the background. And all of this is about initial indexing, which is the thing we're on a mission to speed up; obviously even ssb-db1 performs fine after initial indexing.

Concurrency

So for a query like "give me all votes for this post", the user doesn't care about private messages, so any work that is not absolutely necessary should not block the query in question. Because we put decrypt inside log.stream, it always applies, regardless of the query. So my point is mostly: how can we avoid doing work that we don't need to do now, even though we certainly need to do it soonish? Those workloads are: (1) base index initial indexing, and (2) private.decrypt.

I did a quick check, and if I turn off (1) or (2), then the benchmark looks like this:

| Scenario | Duration |
| --- | --- |
| db.query for 1 post, concurrently with (1) and (2) | 2300ms |
| db.query for 1 post, concurrently with (1) | 1060ms |
| db.query for 1 post, concurrently with (2) | 1800ms |
| db.query for 1 post | 430ms |

I'm not sure what the best solution is, but I have two ideas:

Priority system

We could assign different priority numbers to queries, and then schedule those with higher numbers to run first. This could be too complex to build, especially when workers exist.
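A toy version of the priority idea. This is purely illustrative; a real implementation would have to cooperate with the async indexing code, and the task names below are invented:

```javascript
// Toy priority scheduler: higher number runs first, one task at a time.
class PriorityScheduler {
  constructor() {
    this.queue = []
    this.running = false
  }
  add(priority, task) {
    this.queue.push({ priority, task })
    this.queue.sort((a, b) => b.priority - a.priority)
    this._next()
  }
  _next() {
    if (this.running || this.queue.length === 0) return
    this.running = true
    const { task } = this.queue.shift()
    task(() => {
      this.running = false
      this._next()
    })
  }
}

const order = []
const sched = new PriorityScheduler()
// The base-index task is already running when the user query arrives,
// but the user query then preempts the rest of the queue.
sched.add(1, (done) => setImmediate(() => { order.push('base index'); done() }))
sched.add(1, (done) => setImmediate(() => { order.push('decrypt pass'); done() }))
sched.add(9, (done) => setImmediate(() => { order.push('user query'); done() }))
```

Here the high-priority user query runs before the remaining low-priority background work, which is the user-visible effect we want during initial indexing.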

Threads (workers)

I am not sure this would work 100%, because our workloads could be I/O bound, but since unboxing is CPU heavy, it could help to run the decrypt in log.stream inside a worker. Both the browser and Node.js have workers: WebWorker and worker_threads, respectively.

arj03 commented Jan 4, 2021

Would be interesting to check the threads idea.

If you want to pursue the decrypt idea, it would be natural to do that in the isPublic operator. As long as the implementation is straightforward I don't see a problem adding that.

staltz commented Jan 4, 2021

I don't think I'll try the "decrypt only if the boolean opt is true" idea, not yet. I'll do some experiments with the threads idea, but I'll have to make a Manyverse release today and tomorrow.

staltz commented Jan 18, 2021

I ran this benchmark again (just for my own sake, to keep these notes later for study), but this time on my production mobile Manyverse and building all sorts of indexes (see list below). The durations indicate "time until first content displays".

| Scenario | Duration |
| --- | --- |
| Initial indexing, concurrently with (1) and (2) | 2m32s |
| Initial indexing, concurrently with (1) | 1m44s |
| Initial indexing, concurrently with (2) | 2m09s |
| Initial indexing | 1m20s |
  • base
  • canDecrypt
  • encrypted
  • key.32prefix
  • meta_.index
  • seq.index
  • sequence.index
  • timestamp.index
  • value_author.32prefix
  • value_content_fork.32prefix
  • value_content_root.32prefix
  • value_content_type_content.index
  • value_content_type_post.index

staltz commented Jan 29, 2021

By the way, to analyze memory use, it's useful to run the Chrome inspector for Node.js. There, I took a "heap snapshot" that showed all allocated strings, objects, and buffers. Turns out there's approx. 330 MB of TypedArrays, and most of those are cached blocks in aligned-block-file or async-AOL. Makes sense.

arj03 commented Jan 29, 2021

We could try fiddling with this. Have some good benchmarks now so should be easier to see.

staltz commented Jan 29, 2021

You kind of nerd-sniped me. (By the way, I work on these side experiments while Manyverse is compiling or the app is indexing.)

"1024 / 1000" means async-AOL with cache size 1024 and aligned-block-file with cache size 1000

| Test | 1024 / 1000 | 512 / 500 | 256 / 250 | 128 / 125 |
| --- | --- | --- | --- | --- |
| migration (using ssb-db) | 2762ms | 2734ms | 2740ms | 2775ms |
| migration (alone) | 2675ms | 2712ms | 2674ms | 2728ms |
| mem usage without indexes | 434.34 MB = 30.14 MB + etc | 400.20 MB = 24.03 MB + etc | 368.01 MB = 27.51 MB + etc | 355.83 MB = 31.47 MB + etc |
| initial indexing | 2040ms | 2098ms | 2198ms | 2097ms |
| initial indexing compat | 443ms | 405ms | 640ms | 366ms |
| Two indexes updating concurrently | 764ms | 786ms | 763ms | 875ms |
| key one initial | 471ms | 477ms | 460ms | 438ms |
| key two | 1ms | 0ms | 0ms | 1ms |
| key one again | 1ms | 1ms | 0ms | 1ms |
| latest root posts | 281ms | 322ms | 299ms | 286ms |
| latest posts | 13ms | 13ms | 13ms | 22ms |
| votes one initial | 366ms | 430ms | 395ms | 402ms |
| votes again | 1ms | 0ms | 0ms | 0ms |
| hasRoot | 238ms | 305ms | 279ms | 272ms |
| hasRoot again | 0ms | 0ms | 0ms | 0ms |
| author one posts | 238ms | 286ms | 249ms | 272ms |
| author two posts | 61ms | 80ms | 110ms | 120ms |
| dedicated author one posts | 231ms | 247ms | 287ms | 308ms |
| dedicated author one posts again | 0ms | 0ms | 1ms | 0ms |
| maximum RAM used | 521.02 MB = 56.60 MB + etc | 525.47 MB = 53.75 MB + etc | 399.65 MB = 54.60 MB + etc | 385.50 MB = 57.64 MB + etc |

arj03 commented Jan 29, 2021

When you remove everything so the old log is only used for streaming, disabling the cache in aligned-block-file will make a ton of sense. It will be both faster and use less memory.

staltz commented Feb 4, 2021

I'm trying to measure disk I/O performance on desktop and on mobile to compare. I'm not sure if I did this correctly, because Android and desktop Linux don't have the same tools, but here's what I did:

On desktop I ran `fio --name TEST --eta-newline=5s --filename=temp.file --rw=read --size=2g --io_size=10g --blocksize=1024k --ioengine=libaio --fsync=10000 --iodepth=32 --direct=1 --numjobs=1 --runtime=60 --group_reporting` (this measures sequential reads) and got 1038 MiB/s.

On Android I used an app called Disk Speed / Performance Test and got Read 189 MB/s and Write 34 MB/s.

Not sure if I can trust these numbers, but it seems like mobile is 5.5x slower than desktop. If it takes ~3 sec for desktop to scan the log, it should take 17 sec for mobile to scan it.
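As a back-of-the-envelope check, the 5.5x and ~17s figures follow from the measured throughputs above (ignoring the MiB vs MB unit mismatch between the two tools):

```javascript
// Reproducing the rough estimate from the measured read throughputs.
const desktopReadMiBs = 1038 // fio sequential read, desktop
const mobileReadMBs = 189    // Disk Speed app, Android

const slowdown = desktopReadMiBs / mobileReadMBs // ≈ 5.5x
const desktopScanSeconds = 3
const mobileScanSeconds = desktopScanSeconds * slowdown // ≈ 16.5s
```

So the "~17 sec" log-scan estimate for mobile is just the desktop scan time multiplied by the throughput ratio.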

staltz commented Feb 9, 2021

DB2 Benchmarks running in Manyverse

Nothing else running. Total msg count: 1015957

| Part | Duration | Duration scaled to 100k msgs | Latest PR with 100k msgs | Ratio (scaled / PR) |
| --- | --- | --- | --- | --- |
| Initial indexing | 83529ms | 8221ms | 2141ms | 3.83x |
| Initial indexing compat | 21181ms | 2084ms | 529ms | 3.93x |
| Two indexes updating concurrently | 17155ms | 1688ms | 1016ms | 1.66x |
| key one initial | 14073ms | 1385ms | 614ms | 2.25x |
| key two | 15ms | 2ms | 0ms | |
| key one again | 8ms | 1ms | 0ms | |
| latest root posts | 11843ms | 1165ms | 347ms | 3.35x |
| latest posts | 519ms | 51ms | 9ms | 5.66x |
| votes one initial | 15168ms | 1492ms | 504ms | 2.96x |
| votes again | 6ms | 1ms | 1ms | |
| hasRoot | 11227ms | 1105ms | 298ms | 3.70x |
| hasRoot again | 2ms | 0ms | 0ms | |
| author one posts | 12736ms | 1253ms | 287ms | 4.36x |
| author two posts | 5066ms | 498ms | 55ms | 9.05x |
| dedicated author one posts | 11687ms | 1150ms | 279ms | 4.12x |
| dedicated author one posts again | 5ms | 1ms | 0ms | |

Mobile seems on average 4.07x slower than the GitHub Actions server, which is realistically close to the "5.5x slower" disk I/O.
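The "scaled to 100k msgs" column appears to be plain linear scaling by message count (truncated to whole milliseconds, which matches the table's values):

```javascript
// Scaling a measured duration from the full log (1015957 msgs, the
// total stated above) down to a 100k-msg log, assuming linear cost.
const totalMsgs = 1015957
const scaleTo100k = (ms) => Math.floor((ms * 100000) / totalMsgs)

const initialIndexing = scaleTo100k(83529) // 8221ms, as in the table
const initialCompat = scaleTo100k(21181)   // 2084ms, as in the table
```

Linear scaling is only a first-order approximation, of course; indexing cost isn't guaranteed to be perfectly proportional to message count.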

staltz commented Feb 15, 2021

On mobile:

ssb:db2:aboutSelf loaded full cache in 799ms

Not a lot, but almost 1 second. I can imagine that not loading the whole thing might speed up startup.

arj03 commented Feb 15, 2021

Hmm yeah, might be one of those things that are hard to benchmark on a computer compared to a phone. Interested to hear how it is without descriptions.

staltz commented Feb 15, 2021

Found out that loading the key__map.32prefixmap index takes a long while (on mobile):

jitdb readFile prefix map /data/user/0/se.manyver/files/.ssb/db2/indexes/key__map.32prefixmap in 1732ms
jitdb parsed prefix map JSON /data/user/0/se.manyver/files/.ssb/db2/indexes/key__map.32prefixmap in 1576ms

3300ms in total; the 1st part is loading the file via atomically-universal, the 2nd part is the JSON.parse.

What would the performance look like if we did it in leveldb? Perhaps load time would be smaller, and query time would be larger?
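The JSON.parse cost can be reproduced in isolation with synthetic data. The entry count and key shape here are made up; the real key__map file is a much larger prefix map:

```javascript
// Synthetic look at why JSON.parse of a big prefix map is slow: build a
// map with many short keys, stringify it, and time the parse.
const entries = 200000
const map = {}
for (let i = 0; i < entries; i++) {
  map['key' + i] = i
}
const json = JSON.stringify(map)

const t0 = process.hrtime.bigint()
const parsed = JSON.parse(json)
const parseMs = Number(process.hrtime.bigint() - t0) / 1e6

const keyCount = Object.keys(parsed).length
console.log(`parsed ${keyCount} entries in ${parseMs.toFixed(1)}ms`)
```

A leveldb-backed index avoids this up-front parse (lookups hit disk lazily), which is the load-time vs query-time trade-off being asked about.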

arj03 commented Feb 15, 2021

Hmm. That is a good data point, I'll have to check that to see.

arj03 commented Feb 15, 2021

Maybe we should pair on it tomorrow?

staltz commented Feb 15, 2021

Yes!

staltz commented Feb 15, 2021

Well, I have a PR that adds a leveldb index for that, and it seems to speed things up rather a lot. PR incoming; it requires a jitdb change first, though.

arj03 commented Feb 15, 2021

I wonder if it will be one of those: fs faster in browser, level faster in node ;-)

staltz commented Feb 16, 2021

I found another thing that maybe indicates some opportunity to optimize.

This is what manyverse mobile startup looks like:

Screenshot from 2021-02-16 13-01-28

It takes about 5s in total.

  • Around 2s, that's when the private post jitdb paginate query ends
  • In the very middle of the picture, you see a pink spike, that's when a jitdb paginate query for public posts ends

What's funny is that the private posts query is quick but the public posts query is slow:

jitdb:query paginate(AND(meta_private(),value_content_type(post))), seq: 0, limit: 1: 84ms, total messages: 288 
...
...
jitdb:query paginate(AND(meta(),OR(value_content_type(post),value_content_type(contact)))), seq: 0, limit: 1: 1152ms, total messages: 391518

Same story on desktop too:

jitdb:query paginate(AND(meta_private(),value_content_type(post))), seq: 0, limit: 1: 95ms, total messages: 3117
...
...
jitdb:query paginate(AND(meta(),OR(value_content_type(post),value_content_type(contact)))), seq: 0, limit: 1: 375ms, total messages: 516669

@arj03 Any idea what may be causing this?

staltz commented Feb 16, 2021

@arj03 Just for your information, Chrome CPU profiler also annotates the source code with the time it spent in each line of code. This is from manyverse desktop, picturing the new ssb-friends:

Screenshot from 2021-02-16 13-20-01

arj03 commented Feb 16, 2021

I'll have to try the private query to see. That reminds me, I don't think we have a benchmark of that.

staltz commented Feb 16, 2021

> I'll have to try the private query to see

Note, it's the public one that's slow. That's even weirder to me, because I'd expect the private one to be slow, since it does a decrypt.

arj03 commented Feb 18, 2021

Can reproduce. Let the debugging begin :-)

staltz commented Feb 21, 2021

Good news: Manyverse now with db2 1.17.1 has a startup time of 4s

staltz commented Mar 5, 2021

I'm copy-pasting this table from #191 because it's useful and I didn't want it to get lost.

| Initial sync duration | maxCpu | UI responsiveness |
| --- | --- | --- |
| 305s | 86 (maxPause=300) | good |
| 299s | 90 (maxPause=300) | good |
| 288s | 92 (maxPause=300) | good |
| 222s | 92 (maxPause=180) | good |
| 158s | 92 (maxPause=90) | a bit bad |
| 281s | 95 (maxPause=300) | a bit bad |
| 280s | 99 (maxPause=300) | bad |
| 126s | Infinity | bad |

arj03 commented Mar 5, 2021

Might be good to include these numbers in some form in the README.md documentation

staltz commented Mar 5, 2021

We can make another .md file for these

staltz commented Mar 5, 2021

What I conclude from maxCpu is that it basically slows down the task by 1.7x while providing a better user experience. Factoring that into the 4x speedup that db2 provided means that in the end db2 is 2.3x faster, but the full picture is more like:

| What | Speedup | UI responsiveness and device temperature |
| --- | --- | --- |
| db1 | 1x | slow and hot |
| db2 | 4x | slow and hot |
| db2 with maxCpu | 2.3x | fast and cold |

Temperature on mobile is a bigger deal because the device is in physical contact with the user.
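For reference, the 2.3x figure is just the two measured factors combined:

```javascript
// Where the 2.3x comes from: maxCpu throttling slows indexing by ~1.7x
// (e.g. 222s vs 126s unthrottled in the table above), applied to db2's
// ~4x speedup over db1.
const db2Speedup = 4
const maxCpuSlowdown = 1.7
const effective = db2Speedup / maxCpuSlowdown // ≈ 2.35x
```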
