Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PDS: Adding new leaf to MST is O(n²) #22

Open
rudyfraser opened this issue Oct 12, 2024 · 14 comments
Open

PDS: Adding new leaf to MST is O(n²) #22

rudyfraser opened this issue Oct 12, 2024 · 14 comments

Comments

@rudyfraser
Copy link
Member

See graph for sample size of 254 leafs being added.

Screenshot 2024-10-12 at 9 43 32 AM

Discovered this issue when testing adding records in bulk (f6b1b26). Should be able to add 1000 records in ms. Canonical code for the same test runs in 0.717 s, estimated 2 s

@DavidBuchanan314
Copy link

DavidBuchanan314 commented Oct 12, 2024

So just to clarify, adding n nodes is costing O(n²) overall, meaning each individual add is O(n)?

Versus the ideal expectation of O(n log n) overall, and O(log n) for each add?

@DavidBuchanan314
Copy link

DavidBuchanan314 commented Oct 12, 2024

I think this is your issue right here:

let mut new_root = MST::create(self.storage.clone(), Some(updated), Some(key_zeros))?;

I'm not a rustacean so forgive me if I'm misunderstanding, but if you're copying the whole MST storage then that's an O(n) operation (where n is the number of nodes currently in the tree)

Edit: ah yeah I'm probably mistaken, I see storage is a SqlRepoReader so that line doesn't actually copy the data

@rudyfraser
Copy link
Member Author

So just to clarify, adding n nodes is costing O(n²) overall, meaning each individual add is O(n)?

Versus the ideal expectation of O(n log n) overall, and O(log n) for each add?

Yes, I think that is exactly right. I appreciate you taking a look at this!

@mackuba
Copy link

mackuba commented Oct 13, 2024

cc @steveklabnik

@steveklabnik
Copy link

Sorry for the late response here; I've glanced at this a few times, and nothing really sticks out to me. I don't have time this exact minute to try and dig in even further, but rather than look it over and go "nope, not immediately seeing it" and not leaving a comment like the last two or three times, figured I'd say something at least :)

@DavidBuchanan314
Copy link

I just came across this https://github.com/domodwyer/merkle-search-tree - which is a rust MST implementation with very impressive perf numbers. It doesn't look like it was written with atproto in mind though, so I'm not sure it'd be easy to drop in, but it might be useful for reference

@afbase
Copy link

afbase commented Dec 16, 2024

@rudyfraser are you still working on this?

Are you happy with the implementation of the MST - at the time of asking this, c948af2?

@afbase
Copy link

afbase commented Dec 16, 2024

@rudyfraser are you still working on this?

Are you happy with the implementation of the MST - at the time of asking this, c948af2?

I'm happy to take a look if you'd like. any guidance on what you'd like to see would be appreciated

@rudyfraser
Copy link
Member Author

@afbase still need help with this. Ultimately I would expect for unit tests in rsky-pds::repo::mst::tests to perform similarly to the canonical TS implementation. Currently they take significantly longer to run.

High level outcome would be to mirror the unit tests in https://github.com/bluesky-social/atproto/blob/main/packages/repo/tests/mst.test.ts (especially ones like "adds records") and pass the expected outcomes with similar performance.

@afbase
Copy link

afbase commented Dec 17, 2024

@rudyfraser

Testing, Benchmarking, (and Fuzzing)

I'll prioritize this first as I can work with the implementation as-is.

Canonical Implementation

from the README.md:

Implementations should follow closely to the canonical Typescript implementation

rsky's MST storage is a SqlRepoReader whereas the typescript's MST storage is a ReadableBlockStore. Is there a desire to implement a ReadableBlockStore?

Cloning, Reference Counting, or Neither

For when we need to insert a Cid in a higher layer of the MST, and reset the root, my thoughts are that we might want to have some kind of Arc<SqlRepoReader> or Arc<RwLock<SqlRepoReader>>. This would involve a great deal of refactoring across rsky.

The other merkle search tree implementation

However, when referencing the domodwyer/merkle-search-tree, I don't see any reference counting, or locking on the Page (i.e. layer). There is one caveat: a Node (NodeEntry) can dereference a Page (layer) that is "less than"/"lower". The implementation upsert (add) starts from the root and then traverses down to the corresponding Page (layer) whereas the typescript and rsky implementation - i believe (please correct me if i am wrong) - it starts from the bottom bottom and traverses upward.

I'm hesitant to look into the other MST too much further as that MST and the ATProto implementation might be too far apart in implementation to draw ideas from meaningfully.

@afbase
Copy link

afbase commented Dec 24, 2024

@rudyfraser

I made a branch on my fork that does some benchmarking of the add at sizes 100, 500, and 1000. It uses Criterion and cargo bench. I've copied the two benchmark reports i ran into their respective commit's hash name.

here is a screenshot of the report summary on second benchmark's 1000 keys sample size:

image

you can take a look at the others by opening up those index.html files in the benches folder.

also see my notes on the metric i made up. I think it is useful to give some idea of the sample's "tree depth" which depending on insert order, can make that let mut new_root = MST::create(self.storage.clone(), Some(updated), Some(key_zeros))?; line get triggered IIUC.

I'm not super familiar with Typescript and so the canonical implementations add tests to me are not exactly clear to me, but I could dig into them more. Granted, this benchmark is very rustic (i don't know what the equivalent is for "pythonic" for rust) and doesn't conform to the canonical typscript tests nor am I sure if we want to mirror that as that sounds like creating an entire test library for canonical'ness.

Some Questions

  1. It would take a little bit to make the benchmarks run in as a github action. That would also eat up a lot of compute time so I'm not sure if that is desirable. Perhaps just documentation on running benchmarks only locally on a developer's machine is best. What do you think?
  2. I'll see what else I can to with the tests from https://github.com/bluesky-social/atproto/blob/main/packages/repo/tests/mst.test.ts - are there particular add tests like the "MST Interop Edge Cases" that you're interested in seeing in the test module?

@erlend-sh
Copy link

millipds by @DavidBuchanan314 might be far enough along now that there’s prior art to draw from there as well?

@afbase
Copy link

afbase commented Jan 4, 2025

I've been playing around with tweaking the add function to see if my benchmarks improved any. They got worse 🙃

From the README.md:

Implementations should follow closely to the canonical Typescript implementation

I'd like tweak the MST struct and a couple of other things to make things more reference count friendly that could work in place of cloning. This might drift the implementation a bit from the canonical Typescript implementation. but my aim would be to have the unit tests work.

@rudyfraser
Copy link
Member Author

Documentation of the benchmark process would be fine but personally think it's lower priority at the moment.

Most of the unit tests already exist here https://github.com/blacksky-algorithms/rsky/blob/main/rsky-pds/src/repo/mst/mod.rs#L1274

For this case moving away from the canonical implementation would be fine as long as changes are directly related to the mst, existing tests pass, and there are improvements to performance.

I have more availability now and will try to revisit this as well

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants