Shard aware batching - add Session::shard_for_statement & Batch::enforce_target_node #738
Conversation
Resolves scylladb#468. This is a follow-up on scylladb#508 and scylladb#658:

- To minimize CPU usage related to network operations when inserting a very large number of lines, it is relevant to batch.
- To batch in the most efficient manner, these batches have to be shard-aware. Since scylladb#508, `batch` will pick the shard of the first statement to send the query to. However, it is left to the user to constitute the batches in such a way that the target shard is the same for all the elements of the batch.
- This was made *possible* by scylladb#658, but it still required a lot of boilerplate. I was waiting for scylladb#612 to be merged (amazing work btw! 😃) to implement a more direct and factored API (as it would use it).
- The new `Session::first_shard_for_statement(self, &PreparedStatement, &SerializedValues) -> Option<(Node, Option<Shard>)>` makes shard-aware batching easy for users, by providing access to the first node and shard of the query plan.
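The intended usage pattern can be pictured with a small std-only sketch. Note that everything here is a stand-in, not the driver's actual API: `shard_for_key`, `NodeId`, and the toy hash are placeholders for what `first_shard_for_statement` would compute from the real load-balancing query plan.

```rust
use std::collections::HashMap;

// Stand-in types: in the PR's API, `first_shard_for_statement` would
// return the first (node, shard) of the load-balancing query plan.
type NodeId = u32;
type Shard = u32;

// Toy routing: in Scylla the shard derives from the partition token;
// here we just hash the partition key as a placeholder.
fn shard_for_key(key: &str, nodes: u32, shards_per_node: u32) -> (NodeId, Shard) {
    let h = key
        .bytes()
        .fold(0u32, |acc, b| acc.wrapping_mul(31).wrapping_add(b as u32));
    (h % nodes, (h / nodes) % shards_per_node)
}

fn main() {
    let keys = ["a", "b", "c", "d", "e", "f", "g", "h"];
    // Group statements into per-(node, shard) batches, as the PR proposes.
    let mut batches: HashMap<(NodeId, Shard), Vec<&str>> = HashMap::new();
    for k in keys {
        batches.entry(shard_for_key(k, 3, 4)).or_default().push(k);
    }
    // Each batch can then be sent with its target node/shard enforced,
    // so every statement in it targets the same shard.
    for ((node, shard), batch) in &batches {
        println!("node {node} shard {shard}: {batch:?}");
    }
}
```

The point is only the shape of the loop: compute a (node, shard) key per statement, bucket by it, and ship each bucket as one batch.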
One more clippy lint I don't agree with -.-
Have you tested your solution? How? Could you write some unit tests?
I can see that you merged
The approach with
If the node is removed from the cluster, new lines on which one would call
I'm making the call to the non-fallback function of `self.fallback` inside the `pick` function, so that should be pretty inexpensive for each message whose shard/batching task has already been picked before the node was removed.
Correct, I missed it.
It seems that I don't quite understand your use case. As I imagine it, one would once (upon application initialisation) constitute/associate a batch with a specific shard using
No, rather I have an app that receives millions of lines per second to insert. These lines are processed, spread over multiple threads, in a manner that involves just identifying the shard, and then looking up in a

Now if a node is removed, the task will be left dangling, and a few messages will be sent to a default-policy-chosen shard by the time its channel is drained, but new messages will be assigned to an alive shard (and its corresponding channel).
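That pipeline can be sketched with std channels and threads (illustrative only: names like `Target` and `run`, the round-robin routing, and the batch size of 64 are assumptions standing in for the real shard lookup and driver calls):

```rust
use std::collections::HashMap;
use std::sync::mpsc;
use std::thread;

// (node id, shard) pair identifying where a line should go.
type Target = (u32, u32);

// Returns the number of lines still buffered at shutdown.
fn run() -> usize {
    let targets = [(0, 0), (0, 1), (1, 0)];
    let mut senders: HashMap<Target, mpsc::SyncSender<String>> = HashMap::new();
    let mut workers = Vec::new();
    for &t in &targets {
        // One bounded channel per live (node, shard), drained by a task
        // that accumulates lines into a batch before issuing one call.
        let (tx, rx) = mpsc::sync_channel::<String>(1024);
        senders.insert(t, tx);
        workers.push(thread::spawn(move || {
            let mut batch = Vec::new();
            for line in rx {
                batch.push(line);
                if batch.len() == 64 {
                    // here: send a shard-aware Batch through the driver
                    batch.clear();
                }
            }
            batch.len() // lines left unflushed when the channel closed
        }));
    }
    // Producer side: identify the target shard for each line, then route
    // it to the matching channel (round-robin here as a stand-in).
    for i in 0..10 {
        let t = targets[i % targets.len()];
        senders[&t].send(format!("line {i}")).unwrap();
    }
    drop(senders); // close channels so workers drain and exit
    workers.into_iter().map(|w| w.join().unwrap()).sum()
}

fn main() {
    println!("{} lines left at shutdown", run());
}
```

If a target disappears, new lines simply route to a different live entry of the map while the old channel drains, which is the dangling-task behaviour described above.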
Let me dig deeper. What is the lifetime of your
Basically yes. There's a bunch of different prepared statements, but those are prepared at application startup.
I think I've got it, many thanks. I'm convinced now that what you propose in this PR does work in your case. I still wonder, however, if we can make it more universal. So far, no good ideas on how to do that, though.
Maybe you could write an example called, say,
Updated the PR. It looks like it indeed does solve #974. 😊 (NB: I know this will need squashing because the oldest commits are very old and we don't want spaghetti on
After discussion with @Lorak-mmk, our impression is that we still don't understand the approach taken by this PR. Why would we want to hardwire the destination node and shard manually, instead of relying on the default load balancer? Hardwiring the target node and shard has significant drawbacks: `RetryPolicy` and `SpeculativeExecutionPolicy` are going to malfunction on single-target query plans, and it becomes the responsibility of the user to select the correct node and shard. In short, it's unclear to us why the problem of shard-aware batches is approached this way.
Thanks for the review. I initially thought it was enough as well, however
This means that if we check which node a statement would pick in order to put it in a batch, and then send the batch, the batch may randomly be sent to a different shard that is appropriate for that statement, but not necessarily for all the other statements of the batch, resulting in them having one chance out of
The solution is to batch by token, not by "first" replica.
Unless I'm mistaken, because the space of tokens is large, this would imply that we would very rarely be able to batch at all, whereas batching by replica allows grouping by "one random acceptable replica" for each statement.
AFAIK it doesn't really make sense to perform multi-partition batches. This incurs additional work for the coordinator node, and you would be better off just using single, non-batched writes.
The main issue with not batching is that going through the heavy Tokio and networking machinery for every statement puts 5× the CPU load on the server that makes the Scylla calls using the Rust driver, whereas with batching that load is shared across coordinators (and is probably less significant than managing Tokio futures, even ignoring the fact that the work is shared, because Tokio turned out to be heavy in my benchmarks). Using batches enabled me to multiply my insertion speed by 5 thanks to this.
I forge my own batches and I do see an increase in performance and lower latency. I did this a couple of months ago; I don't have the numbers with me, but it was a no-brainer to code more in order to do batching locally before sending off to the Scylla driver.
Would you be able to share some benchmark we could reproduce? If the problem is the driver's performance, then my preferred solution would be to improve the driver's performance instead of doing hacky workarounds.
I'd really like to see and reproduce the benchmark to investigate whether we can improve this.
Thanks for your answer.
I sure see your point - if there's a way to reduce the load in the driver without putting extra load on the servers and/or exposing less type-safe APIs it's definitely best to do that instead.
I don't think I still have the
Bear in mind that for each statement, the work I do otherwise is reading a buffered line, some minor parsing to get a partition key, pushing the parsed line into a batch of, say, 64 lines, and then doing the operations below only 1 out of 64 times (thanks to batching). What shows up on
NB: we should probably merge these discussions because they seem to be about the same thing, but I'll copy it here because it's also relevant to this particular sub-discussion (from #468 (comment)): for the use case of performing high-frequency append-log-style operations, as in my case and that of the author of #974, it seems clear that more Scylla servers will be required to store and perform maintenance operations than client servers (~1/10 ratio), so all things otherwise equal, being able to offload some work to the servers doesn't necessarily seem bad. To answer the question from the same thread:
Fundamentally what I need is to send statements into a processing queue, where I don't necessarily care about getting each individual result as an

What allows achieving optimal performance is that nowhere in the pipeline are there async sleeps/wakeups that occur for each statement, because:
Wakeups do otherwise cost a lot, relatively, as I tried to explain in the

Currently I achieve this no-per-statement-wakeup by:
Indeed, I don't care about the other properties of batches that Scylla may try to uphold. My understanding is that if the batch is
My understanding is that this would result in, most of the time, hitting a shard that holds one of the copies of this particular data. Am I mistaken?

I don't care about hitting a correct shard 100% of the time: if the Scylla nodes need to transfer the call to another node 1% of the time because there was a tablet split, it doesn't look like it's going to have any impact. Actually, even hitting a correct shard almost every time probably has almost no impact: my understanding is that the cluster will still perform as many extra network jumps for each statement as there are replicas, so that's just one more, and I could probably just random-batch.

My understanding is that since the Scylla servers will already perform some per-statement jumps whatever happens, letting them perform the per-statement work instead of doing it via per-statement async management in

Thanks!
General comment, not specific to this issue: I think it makes sense to always offload more to the client, if possible, rather than push more work to the servers. In most scenarios there are far more clients, and they can be scaled out more easily than server resources.
Definitely. Although it seems that, unless I'm mistaken, in this particular case (append-log-style operations) the load difference if a user chooses the API that offloads (namely, the batch API) may be insignificant on the servers but an order of magnitude on the clients. So maybe the approach of using batches for that purpose is relevant, and if that's indeed the case, then proper shard awareness for batches (to the extent possible) is just uncompromising (perf-wise) load reduction on the server.
Oh sorry, I didn't mean the results of the benchmark, but how to perform the benchmark (source code that uses the driver, Scylla version, config and machines used, etc.), so we can reproduce the problem, make changes in the driver, and measure the difference.
I understand the explanation, but the talk sounds interesting anyway. Could you share it?
Ok, I think I understand your use case now.
There must be some overhead for multi-partition unlogged batches, but I'm not sure how big it is in practice. The coordinator will have to split the batch and execute its statements separately, IIUC.
Your cluster consists of nodes. Each node consists of shards (usually the number of shards is about the same as the number of CPU cores). Why do I explain all that? You mentioned that you are batching by statement's shard. I understand that you have N nodes, each of which has K shards, and on the driver side you have K queues (not N, not N*K, just K). Then you send the batch from queue

Additionally, as I mentioned before, with new versions of Scylla the concept of "a statement's shard" will lose its meaning, because Scylla is moving from vnodes to tablets.

scylla-rust-driver/scylla/src/routing.rs, line 91 in 1386776
(A, n), (B, n), (C, n). That means you could say "n is the shard for this statement". With tablets it's possible (and it's a normal case, not some edge case) for replicas to have different shards, meaning your replicas could be

I see 3 options here:
None of the above implementations requires changes in the driver; you can do them from user code. Regarding possible improvements in the driver:
There are probably more ideas to find, but we need benchmarks for that.
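As an aside, one way batching could still work under tablets, where each statement's replicas may sit on different shards of different nodes, can be sketched in a few lines (a std-only illustration, not driver API; `batch_key` and the tuple encoding are assumptions): pick a deterministic representative from each statement's replica set, so statements sharing replicas tend to fall into the same batch.

```rust
use std::collections::HashMap;

// A replica is a (node id, shard) pair; with tablets, a statement's
// replicas may live on different shards of different nodes.
type Replica = (u32, u32);

// Deterministically pick one replica per statement to key batches on;
// statements whose replica sets share that choice batch together.
fn batch_key(replicas: &[Replica]) -> Option<Replica> {
    replicas.iter().copied().min()
}

fn main() {
    // Two statements sharing replica (0, 3), and one that does not.
    let stmts = vec![
        ("s1", vec![(2, 1), (0, 3), (1, 0)]),
        ("s2", vec![(0, 3), (2, 2), (1, 5)]),
        ("s3", vec![(1, 4), (2, 0), (3, 2)]),
    ];
    let mut batches: HashMap<Replica, Vec<&str>> = HashMap::new();
    for (name, replicas) in &stmts {
        if let Some(key) = batch_key(replicas) {
            batches.entry(key).or_default().push(name);
        }
    }
    for (key, batch) in &batches {
        println!("batch for replica {key:?}: {batch:?}");
    }
}
```

Taking the minimum replica is just one arbitrary deterministic choice; it does not guarantee that overlapping replica sets always collide, only that equal choices do.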
It's better to coordinate on the client. The overhead isn't large but it's an extra hop. |
Alright, I'll take the time to finish cleaning the auto-generated subtitles, which are a disaster 😅, then post the link. 🙂
Ah, it looks like we had a misunderstanding here: I do create N*K queues, that is, one per actual shard, not one per shard index or one per number-of-shards-per-node. (I'm using the naming discussed here, but maybe I should have said replica? I thought that was a different concept from node, associated with a particular statement.)
I already don't make such assumptions. Apologies if I said "the shard for a statement" anywhere above; I really only ever meant "a shard that would match this statement": I only ever manipulate pairs of `(Node, ShardIndex)`, hence the signature of the main new function of this PR.
I know that it's already possible; it's even written in the PR's description:
The issues that I have with implementing this outside of the driver are:
To be clear: for me, adding this function is the main point of this PR:

scylla-rust-driver/scylla/src/transport/session.rs, lines 1826 to 1830 in d5f71a9

making the code that picks a shard for a given statement easily accessible, correct, and consistent with load balancing, as efficient load balancing is effectively what I'm looking for. Now, as pointed out by @wprzytula here, any code that uses this doesn't make sense without also enforcing the target (node, shard) afterwards, and it's nice to have usage examples, which is why the PR also adds a policy to enforce the target shard afterwards, and a test that also constitutes an example of how to write a system such as the one I was describing in my previous message.
Finally finished the subtitles (I recommend enabling them): https://youtu.be/cBElzSO4A_s |
Resolves #468
Resolves #974
Closes #975 (since that is just an alternate way to fix #974 and this implementation is probably more direct)

This is a follow-up on #508 and #658:
- Since #508, `batch` will pick the shard of the first statement to send the query to. However, it is left to the user to constitute the batches in such a way that the target shard is the same for all the elements of the batch.
- The new `Session::first_shard_for_statement(self, &PreparedStatement, &SerializedValues) -> Option<(Node, Shard)>` makes shard-aware batching easy for users, by providing access to the first node and shard of the query plan.
- An `enforce_target_node` method is added on `Batch` to be able to choose the target node (it makes use of the load balancing API via an `EnforceTargetNodePolicy` struct, so no significant change).

Pre-review checklist
- [ ] I have added relevant docs in `./docs/source/`.
- [ ] I have added `Fixes:` annotations to the PR description.