Consider adding non-primitive API operations #376
Yes that's part of the goals, but we can't really know what these would look like in practice until we have well-defined use-cases. I can see we will most likely need specialised API endpoints for a web dashboard use-case, for example. The only "macro" query we have now is when posting a hierarchy of nodes: it will add them as a batch. I'm not sure what we can do with this particular GitHub issue to be honest, but we will definitely need to create specific ones for use-cases as we identify them. The first goals, I believe, are to ensure we're feature-complete for a stable release; then performance optimisation can happen on top of that to make the same features work "faster". In addition to this, we'll need to investigate the optimal database indexes and, generally speaking, do some stress-testing with the AKS deployment to identify any performance bottlenecks in the entire system.
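To make the "batch" point concrete, here is a minimal sketch of what posting a hierarchy in one round trip could look like at the database level. This is purely illustrative: the collection and field names are assumptions, not the actual KernelCI schema.

```python
from bson import ObjectId
from pymongo import MongoClient

nodes = MongoClient("mongodb://localhost:27017")["kernelci"]["nodes"]

# Build a small parent/child hierarchy and insert it in one round trip,
# mirroring the "post a hierarchy of nodes as a batch" behaviour.
checkout_id = ObjectId()
hierarchy = [
    {"_id": checkout_id, "name": "checkout", "parent": None},
    {"_id": ObjectId(), "name": "build", "parent": checkout_id},
    {"_id": ObjectId(), "name": "test", "parent": checkout_id},
]
nodes.insert_many(hierarchy)  # one request instead of one insert per node
```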
That's ok, I can begin to define some use cases and use them to prototype an initial proof of concept. The case of "give me the whole node tree hanging from this one" is a pretty evident one that could be useful.
No problem, I'll create more specific issues. The purpose of this one was to discuss the general idea and to have a potential feature to put somewhere on the roadmap.
I agree, but all of that should be use-case-driven. So I think we should start by defining the operations and then the necessary DB and implementation changes needed to accommodate them.
I believe we can actually already do this with the existing queries, using common attributes for the whole set of nodes (e.g. same kernel revision or job/group name). And using recursive ID look-ups is not going to play well with MongoDB; it's not designed for this kind of thing.
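As an illustration, a query on shared attributes returns the whole set in one go, with no recursive walking. The field names here are assumptions for illustration, not the real schema:

```python
from pymongo import MongoClient

nodes = MongoClient("mongodb://localhost:27017")["kernelci"]["nodes"]

# One indexed equality query fetches every node of the set at once;
# "revision.commit" and "group" are hypothetical field names.
docs = list(nodes.find({"revision.commit": "abc123", "group": "baseline-x86"}))
```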
While I think extending the API for performance with macro operations is driven by use cases (like in my first comment), stress-testing the system can already be done with the primitive operations we have now. Also, it's a big advantage to not have many API endpoints. So adding macro operations on the API side should be carefully thought through to compare the actual added value with the costs of making the API more complex.
What if we keep in all child nodes a value similar to path, but with parent node IDs? node:"aaaa" node:"bbbb" node:"cccc"
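For illustration, this is essentially MongoDB's "array of ancestors" pattern; a hypothetical sketch, with made-up field names:

```python
from pymongo import MongoClient

nodes = MongoClient("mongodb://localhost:27017")["kernelci"]["nodes"]

# Each child node carries the IDs of all its ancestors, root first.
child = {
    "_id": "cccc",
    "name": "test-case",
    "ancestor_ids": ["aaaa", "bbbb"],  # like path, but with node IDs
}
nodes.insert_one(child)

# With a multikey index on ancestor_ids, the whole subtree under
# "aaaa" comes back from a single query:
nodes.create_index("ancestor_ids")
subtree = list(nodes.find({"ancestor_ids": "aaaa"}))
```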
That's already the case. |
planned to be implemented or implemented? |
@nuclearcat Oh sorry, I see what you mean - currently the path uses the node names, but I think you're suggesting also having that with node object IDs, as they're unique? Yes, I guess that could work, but it would probably also not be as optimal as relying on indexed fields from the object e.g.
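For example, the kind of indexes this relies on, again with illustrative field names rather than the actual schema:

```python
from pymongo import ASCENDING, DESCENDING, MongoClient

nodes = MongoClient("mongodb://localhost:27017")["kernelci"]["nodes"]

# Indexes on the fields the common queries filter on; the field names
# here are assumptions for illustration.
nodes.create_index([("revision.commit", ASCENDING)])
nodes.create_index([("group", ASCENDING), ("created", DESCENDING)])
```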
What we're discussing here seems to mostly be about NoSQL design and how it's meant to be used to maximise performance, see for example: https://www.mongodb.com/nosql-explained/data-modeling |
It might serve @hardboprobot's purpose and make fetching the whole tree way easier and more efficient, but we'd probably need to simulate that on MongoDB.
Right, we can look at efficient ways of doing this with the current queries and see if it's still not good enough. |
Ah also this page is worth reading I think: https://www.mongodb.com/blog/post/6-rules-of-thumb-for-mongodb-schema-design |
Thanks for the comments. I wouldn't go as far as altering the models for this specific functionality. IMO the model structures shouldn't be dictated by the API "macro" endpoints. These should be, by design, utility operations based on the same principles as the primitive endpoints, with certain higher-level logic done by the API. I don't want to over-complicate this and I don't think it's even a priority for now, but what I had in mind was a very simple solution: to encapsulate this kind of logic into the API as a macro endpoint to avoid the overhead of having to issue dozens of queries. No noticeable side-effects, no changes needed in the models nor specific DB design considerations. I also think it wouldn't take a big effort to implement, and it makes sense to try it and see if it works as expected.
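A hedged sketch of what such a macro endpoint might look like, reusing the primitive query logic server-side. The route, schema and storage details here are hypothetical, not the actual KernelCI API, and it assumes parent IDs are stored as strings:

```python
from fastapi import FastAPI
from motor.motor_asyncio import AsyncIOMotorClient

app = FastAPI()
nodes = AsyncIOMotorClient("mongodb://localhost:27017")["kernelci"]["nodes"]

@app.get("/nodes/{node_id}/subtree")
async def get_subtree(node_id: str):
    """Return all descendants of a node in one response, instead of
    the client issuing one request per node."""
    result, queue = [], [node_id]
    while queue:
        parent_id = queue.pop()
        children = await nodes.find({"parent": parent_id}).to_list(None)
        for child in children:
            child["_id"] = str(child["_id"])  # make it JSON-serialisable
            result.append(child)
            queue.append(child["_id"])
    return result
```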
Thanks for the context. The reality is that this GitHub issue seems more like a "discussion", actually. Whatever the format, it's good to be going through these things. Discussions are better as they enable replying to particular comments as threads.
The
Can you elaborate on that? Do you mean improvements on the client side?
Most of the issues I created are meant to start this kind of discussion. From there we can move to concrete and countable actions that we can actually execute. That's why I think it's worth considering these high-level topics ASAP. Thank you both for joining the discussions and helping move them forward. I really appreciate your patience given the circumstances.
Yes I mean, even with an optimal API design some client code could still be using it in a very naive way and retrieve nodes one by one. With the current API endpoints, there are simple ways to retrieve more nodes in one go as I mentioned earlier using the kernel revision and node groups for example. Also, it's unclear why retrieving a single node takes that long across the Azure network border, it's extremely fast when run within the same subnet in the cloud or with a local docker-compose instance. I believe there's some sort of throttling in place at the IP level, maybe it's left to some default settings and this could be fine-tuned. It's one of the things stress-testing will help clarify.
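To illustrate the naive-client point, assuming a filterable /nodes endpoint; the URL and parameter names are illustrative, not the real API:

```python
import requests

API = "https://kernelci-api.example.org"  # illustrative base URL

# Naive client: one HTTP round trip per node (slow across the network).
# nodes = [requests.get(f"{API}/node/{i}").json() for i in node_ids]

# Better: one filtered request using shared attributes, as mentioned above.
resp = requests.get(f"{API}/nodes", params={
    "revision.commit": "abc123",   # hypothetical filter fields
    "group": "baseline-x86",
    "limit": 1000,
})
batch = resp.json()
```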
Thanks for starting these topics. I think we're on track, and discussing these things now is exactly what we should be doing during the early access phase. We could have started this phase earlier, but within the current timeline I think we're doing pretty well.
My understanding is that the current API contains a set of primitive operations that more or less map 1-to-1 to database queries.
Some typical use cases that could be rather frequent, such as the kci show results tool, perform a potentially large number of individual API queries to extract a whole set of data. Accessing this data directly from the database must be almost instantaneous, but doing so through primitive API calls takes an impractical amount of time.
Suggestion: implement an extended set of non-primitive API operations that can do this kind of complex data retrieval in a single query. Any internal query complexities, data post-processing and formatting can be abstracted in the API.
This has two important benefits: 1) it'll make it possible to perform certain operations that are currently impractical due to the large number of queries required; 2) while these operations will be more computationally expensive, they should greatly reduce the overhead of processing many separate query requests. The net result should favor the "single-complex-query" approach.
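For completeness, one way MongoDB can express this kind of complex retrieval in a single query is an aggregation with $graphLookup, sketched below. The collection and field names are hypothetical, and, as noted in the comments above, recursive look-ups may not perform well at scale:

```python
from bson import ObjectId
from pymongo import MongoClient

nodes = MongoClient("mongodb://localhost:27017")["kernelci"]["nodes"]

root_id = ObjectId("0123456789abcdef01234567")  # placeholder root node ID

# Start from the root node and recursively collect every node whose
# "parent" field points back into the set: the whole tree, one query.
pipeline = [
    {"$match": {"_id": root_id}},
    {"$graphLookup": {
        "from": "nodes",
        "startWith": "$_id",
        "connectFromField": "_id",
        "connectToField": "parent",
        "as": "descendants",
    }},
]
tree = list(nodes.aggregate(pipeline))
```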