Exposing Shard ID publicly #468

Open
junglie85 opened this issue Jul 8, 2022 · 10 comments · May be fixed by #738

Comments

@junglie85

junglie85 commented Jul 8, 2022

It'd be helpful to have greater control over how records are batched. One scenario my team has discussed is using the Shard ID as a means to group data for batching, irrespective of partition.

The general thinking is that we can get the relevant Token for a PreparedStatement by computing the partition key and passing it to the Partitioner. To achieve this, ideally get_partitioner_name() would be pub instead of pub(crate) on PreparedStatement.

With the Token, get the Shard ID from somewhere that makes sense - no good ideas on this yet, perhaps a method on the ClusterData.

App specific logic can then be used to group records by the shard ID and batch them for writing to Scylla.
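
A minimal sketch in Rust of the flow described above. Both closures are hypothetical stand-ins for things the driver does not expose today: `partitioner` for whatever the PreparedStatement's Partitioner computes, and `shard_of_token` for a Token-to-Shard lookup on something like ClusterData; exposing equivalents of these is exactly what this issue asks about.

```rust
/// Hypothetical sketch: compute the batching key (a Shard ID) for one record.
/// `partitioner` maps serialized partition-key bytes to a Token (e.g. Murmur3);
/// `shard_of_token` maps that Token to the owning shard.
fn shard_for_record(
    partition_key: &[u8],
    partitioner: impl Fn(&[u8]) -> i64,
    shard_of_token: impl Fn(i64) -> u32,
) -> u32 {
    let token = partitioner(partition_key); // partition key -> Token
    shard_of_token(token) // Token -> Shard ID, used to group records into batches
}
```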

  1. Is there any reason why this is a bad idea?
  2. What's the appetite for API changes that would enable this?

Rather than expose the shard information, a shard aware batching API would be more useful. I think probably related is #448.

@Ten0
Contributor

Ten0 commented Aug 14, 2022

IIUC (but I'm no expert, confirmation welcome) a Shard ID (= u32) is specific to a node.
So to know precisely which shard a statement would be directed to, it would be necessary to know both which node the statement should be directed to and which shard inside that node.

@psarna
Contributor

psarna commented Aug 16, 2022

I must have missed this issue before - our driver already computes everything it needs to be shard-aware, so it's capable of assigning tokens to the owner shards of a particular node. The code is available in this source file: https://github.com/scylladb/scylla-rust-driver/blob/fd06928929c0dd78a72c7aca494a998da1514a79/scylla/src/routing.rs , so I guess it's only a matter of exposing these details to the end user. Right now they're hidden quite deep in the load balancing policy implementation.
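
For reference, a simplified standalone version of the token-to-shard mapping implemented there (this follows Scylla's shard-aware driver algorithm; the exact code and types are in the linked routing.rs, so treat the snippet below as an illustration rather than the driver's API):

```rust
/// Map a Murmur3 token to the shard that owns it on a node with `nr_shards`
/// shards and `msb_ignore` ignored most-significant bits (both values are
/// advertised by the node when the connection is established).
fn shard_of(token: i64, nr_shards: u32, msb_ignore: u8) -> u32 {
    // Bias the signed token into the unsigned range [0, 2^64).
    let mut biased = (token as u64).wrapping_add(1u64 << 63);
    // Drop the most significant bits Scylla is configured to ignore.
    biased <<= msb_ignore;
    // Spread the remaining 64-bit space evenly across the shards.
    ((biased as u128 * nr_shards as u128) >> 64) as u32
}
```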

@wyfo
Contributor

wyfo commented Sep 12, 2022

It should be feasible with a combination of #484 and #485.
Actually, I've opened these MRs because I'm also using the shard id.

Ten0 added a commit to Ten0/scylla-rust-driver that referenced this issue Mar 11, 2023
Helps (or arguably fixes) scylladb#468

For smarter batch constitution (following up on scylladb#448), it is required
to have access to the token a particular statement would be directed to.
However, that is pretty difficult to do without access to
calculate_token.

That was initially put in scylladb#508, but was planned to be put in a separate
PR to keep it minimal
(scylladb#508 (comment))

It seems I had forgotten to open that separate PR, and I ended up getting
bitten by that now that I want to constitute my batches in a smarter
way.

So here it is.

I'm only putting "helps" above because I think we may want to also
expose query planning (
`session.plan(prepared_statement, serialized_values) -> impl Iterator<Item = (Arc<Node>, ShardID)>`
) as that may make it significantly easier - but I'd like to keep this
PR that just *enables* the ideal behavior as simple as possible.
Ten0 mentioned this issue Mar 11, 2023
@Ten0
Contributor

Ten0 commented Mar 11, 2023

It doesn't seem like this gives access to that information from the prepared statement + serialized values combination. I feel like that would be more practical to use, as that would enable abstracting the keyspace/table/partition_key part (especially partition key knowledge that may otherwise be unknown by application code) when dealing with sharding-related considerations.
#658

Ten0 added a commit to Ten0/scylla-rust-driver that referenced this issue Jun 3, 2023
Resolves scylladb#468

This is a follow-up on scylladb#508 and scylladb#658:
- To minimize CPU usage related to network operations when inserting a very large number of rows, it is relevant to batch.
- To batch in the most efficient manner, these batches have to be shard-aware. Since scylladb#508, `batch` will pick the shard of the first statement to send the query to. However it is left to the user to constitute the batches in such a way that the target shard is the same for all the elements of the batch.
- This was made *possible* by scylladb#658, but it was still very boilerplate-ish. I was waiting for scylladb#612 to be merged (amazing work btw! 😃) to implement a more direct and factored API (as that would use it).
- This new ~`Session::first_shard_for_statement(self, &PreparedStatement, &SerializedValues) -> Option<(Node, Option<Shard>)>` makes shard-aware batching easy for users by providing access to the first node and shard of the query plan.
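
As an illustration of the batching pattern this API is meant to enable, here is a sketch only: `first_shard_of` stands in for a call like the proposed `first_shard_for_statement`, and `NodeId`/`Shard` are placeholders for the driver's node and shard types.

```rust
use std::collections::HashMap;

type NodeId = u128; // placeholder for the driver's node identifier
type Shard = u32; // placeholder for the driver's shard type

/// Bucket statements by the (node, shard) pair the driver would route them to,
/// so that each bucket can then be sent as a single shard-targeted batch.
fn build_batches<S>(
    stmts: Vec<S>,
    first_shard_of: impl Fn(&S) -> Option<(NodeId, Option<Shard>)>,
) -> Vec<Vec<S>> {
    let mut buckets: HashMap<Option<(NodeId, Option<Shard>)>, Vec<S>> = HashMap::new();
    for stmt in stmts {
        buckets.entry(first_shard_of(&stmt)).or_default().push(stmt);
    }
    buckets.into_values().collect()
}
```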
Lorak-mmk self-assigned this Nov 15, 2023
@Lorak-mmk
Collaborator

It'd be helpful to have greater control over how records are batched. One scenario my team has discussed is using the Shard ID as a means to group data for batching, irrespective of partition.

The general thinking is that we can get the relevant Token for a PreparedStatement by computing the partition key and passing it to the Partitioner. To achieve this, ideally get_partitioner_name() would be pub instead of pub(crate) on PreparedStatement.

With the Token, get the Shard ID from somewhere that makes sense - no good ideas on this yet, perhaps a method on the ClusterData.

App specific logic can then be used to group records by the shard ID and batch them for writing to Scylla.

1. Is there any reason why this is a bad idea?

With Tablets (https://www.scylladb.com/presentations/tablets-rethinking-replication/) there is no longer a mapping between Token and shard (and such mapping was always an implementation detail) because the same partition may be located on different shards of different replicas.
I'm not sure what your use case for batches is, but in general you should batch by Token; otherwise you start doing multi-partition batches, which incur more work for the coordinator node and are not a good idea afaik.

2. What's the appetite for API changes that would enable this?

Rather than expose the shard information, a shard aware batching API would be more useful. I think probably related is #448.

@Ten0
Contributor

Ten0 commented Jun 20, 2024

otherwise you start doing multi-partition batches, which incur more work for the coordinator node and are not a good idea afaik

The main issue with not batching is that going through the heavy Tokio machinery for every statement puts 5x the CPU load on the server that makes the Scylla calls using the Rust driver, whereas with batching that load is shared across coordinators (and that work is probably less significant than managing Tokio futures, even ignoring the fact that it is shared, because Tokio turned out to be heavy in my benchmarks).

@wprzytula
Collaborator

otherwise you start doing multi-partition batches, which incur more work for the coordinator node and are not a good idea afaik

The main issue with not batching is that going through the heavy Tokio machinery for every statement puts 5x the CPU load on the server that makes the Scylla calls using the Rust driver, whereas with batching that load is shared across coordinators (and that work is probably less significant than managing Tokio futures, even ignoring the fact that it is shared, because Tokio turned out to be heavy in my benchmarks).

IIUC, your idea to create multi-partition batches performs faster wrt the driver, but imposes higher load on the cluster. Right? So this is a trade-off that we used to reject in the past due to our goal being to minimise load on the Scylla cluster, thus making it maximally performant.

@Ten0
Contributor

Ten0 commented Jun 25, 2024

thus making it maximally performant

It looks like at the moment the cluster may have to handle some extra networking and individual handling for every statement, which might put sub-optimal load on it as well, and which seems hard to control unless either one can send batches or this can be configured to be longer (ideally into something like chunks_timeout).

But more importantly, if we agree just that the query load is similar or greater on the client side than on the server side when statements are handled individually, then for the use-case of performing High-Frequency Append-Log Style Operations, as in my case and that of the author of #974, it seems clear that more Scylla servers will be required to store & perform maintenance operations than client servers (~1/10 ratio). So it doesn't seem practical to close off APIs that would allow offloading part of the work to the servers: that would not significantly reduce the number of required Scylla servers (they would still be needed for maintenance operations to work properly), but it would drastically increase the number of required clients.

Bear in mind that with this use-case, the current alternative is just making batches that are not targeted, and that is not better for the servers' load either.

@wprzytula
Collaborator

thus making it maximally performant

It looks like at the moment the cluster may have to handle some extra networking and individual handling for every statement, which might put sub-optimal load on it as well, and which seems hard to control unless either one can send batches or this can be configured to be longer (ideally into something like chunks_timeout).

But more importantly, if we agree just that the query load is similar or greater on the client side than on the server side when statements are handled individually, then for the use-case of performing High-Frequency Append-Log Style Operations, as in my case and that of the author of #974, it seems clear that more Scylla servers will be required to store & perform maintenance operations than client servers (~1/10 ratio). So it doesn't seem practical to close off APIs that would allow offloading part of the work to the servers: that would not significantly reduce the number of required Scylla servers (they would still be needed for maintenance operations to work properly), but it would drastically increase the number of required clients.

That's convincing, at least to me.

Bear in mind that with this use-case, the current alternative is just making batches that are not targeted, and that is not better for the servers' load either.

OK, could you reiterate what kind of API you specifically need access to in order to get such client-side performance gains?

@Ten0
Contributor

Ten0 commented Jun 25, 2024

OK, could you reiterate what kind of API you specifically need access to in order to get such client-side performance gains?

It looks like the discussion here is converging with that of #738, so I'm just going to defer to the discussion there: #738 (comment)

Lorak-mmk removed their assignment Jul 31, 2024