Exposing Shard ID publicly #468

Open
junglie85 opened this issue Jul 8, 2022 · 10 comments · May be fixed by #738

Comments

@junglie85

junglie85 commented Jul 8, 2022

It'd be helpful to have greater control over how records are batched. One scenario my team has discussed is using the Shard ID as a means to group data for batching, irrespective of partition.

The general thinking is that we can get the relevant Token for a PreparedStatement by computing the partition key and passing it to the Partitioner. To achieve this, ideally get_partitioner_name() would be pub instead of pub(crate) on PreparedStatement.

With the Token, get the Shard ID from somewhere that makes sense - no good ideas on this yet, perhaps a method on the ClusterData.

App specific logic can then be used to group records by the shard ID and batch them for writing to Scylla.
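
A minimal sketch in Rust of the flow described above. Both closures are hypothetical stand-ins for things the driver does not expose today: `partitioner` for whatever the PreparedStatement's Partitioner computes, and `shard_of_token` for a Token-to-Shard lookup on something like ClusterData; exposing equivalents of these is exactly what this issue asks about.

```rust
/// Hypothetical sketch: compute the batching key (a Shard ID) for one record.
/// `partitioner` maps serialized partition-key bytes to a Token (e.g. Murmur3);
/// `shard_of_token` maps that Token to the owning shard.
fn shard_for_record(
    partition_key: &[u8],
    partitioner: impl Fn(&[u8]) -> i64,
    shard_of_token: impl Fn(i64) -> u32,
) -> u32 {
    let token = partitioner(partition_key); // partition key -> Token
    shard_of_token(token) // Token -> Shard ID, used to group records into batches
}
```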

  1. Is there any reason why this is a bad idea?
  2. What's the appetite for API changes that would enable this?

Rather than expose the shard information, a shard aware batching API would be more useful. I think probably related is #448.

@Ten0
Contributor

Ten0 commented Aug 14, 2022

IIUC (but I'm no expert, confirmation welcome) a Shard ID (= u32) is specific to a node.
So to know precisely which shard a statement would be directed to, it would be necessary to know both which node the statement should be directed to and which shard inside that node.

@psarna
Contributor

psarna commented Aug 16, 2022

I must have missed this issue before - our driver already computes everything it needs to be shard-aware, so it's capable of assigning tokens to the owner shards of a particular node. The code is available in this source file: https://github.com/scylladb/scylla-rust-driver/blob/fd06928929c0dd78a72c7aca494a998da1514a79/scylla/src/routing.rs , so I guess it's only a matter of exposing these details to the end user. Right now they're hidden quite deep in the load balancing policy implementation.
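
For reference, a simplified standalone version of the token-to-shard mapping implemented there (this follows Scylla's shard-aware driver algorithm; the exact code and types are in the linked routing.rs, so treat the snippet below as an illustration rather than the driver's API):

```rust
/// Map a Murmur3 token to the shard that owns it on a node with `nr_shards`
/// shards and `msb_ignore` ignored most-significant bits (both values are
/// advertised by the node when the connection is established).
fn shard_of(token: i64, nr_shards: u32, msb_ignore: u8) -> u32 {
    // Bias the signed token into the unsigned range [0, 2^64).
    let mut biased = (token as u64).wrapping_add(1u64 << 63);
    // Drop the most significant bits Scylla is configured to ignore.
    biased <<= msb_ignore;
    // Spread the remaining 64-bit space evenly across the shards.
    ((biased as u128 * nr_shards as u128) >> 64) as u32
}
```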

@wyfo
Contributor

wyfo commented Sep 12, 2022

It should be feasible with a combination of #484 and #485.
Actually, I've opened these MRs because I'm also using the shard id.

Ten0 added a commit to Ten0/scylla-rust-driver that referenced this issue Mar 11, 2023
Helps (or arguably fixes) scylladb#468

For smarter batch constitution (following up on scylladb#448), it is required
to have access to the token a particular statement would be directed to.
However, that is pretty difficult to do without access to
calculate_token.

That was initially put in scylladb#508, but was planned to be put in a separate
PR to keep it minimal
(scylladb#508 (comment))

It seems I had forgotten to open that separate PR, and I ended up getting
bitten by that now that I want to constitute my batches in a smarter
way.

So here it is.

I'm only putting "helps" above because I think we may want to also
expose query planning (
`session.plan(prepared_statement, serialized_values) -> impl Iterator<Item = (Arc<Node>, ShardID)>`
) as that may make it significantly easier - but I'd like to keep this
PR that just *enables* the ideal behavior as simple as possible.
Ten0 mentioned this issue Mar 11, 2023
@Ten0
Contributor

Ten0 commented Mar 11, 2023

It doesn't seem like this gives access to that information from the prepared statement + serialized values combination. I feel like that would be more practical to use, as that would enable abstracting the keyspace/table/partition_key part (especially partition key knowledge that may otherwise be unknown by application code) when dealing with sharding-related considerations.
#658

Ten0 added a commit to Ten0/scylla-rust-driver that referenced this issue Jun 3, 2023
Resolves scylladb#468

This is a follow-up on scylladb#508 and scylladb#658:
- To minimize CPU usage related to network operations when inserting a very large number of rows, it is relevant to batch.
- To batch in the most efficient manner, these batches have to be shard-aware. Since scylladb#508, `batch` will pick the shard of the first statement to send the query to. However it is left to the user to constitute the batches in such a way that the target shard is the same for all the elements of the batch.
- This was made *possible* by scylladb#658, but it was still very boilerplate-ish. I was waiting for scylladb#612 to be merged (amazing work btw! 😃) to implement a more direct and factored API (as that would use it).
- This new ~`Session::first_shard_for_statement(self, &PreparedStatement, &SerializedValues) -> Option<(Node, Option<Shard>)>` makes shard-aware batching easy for users by providing access to the first node and shard of the query plan.
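
As an illustration of the batching pattern this API is meant to enable, here is a sketch only: `first_shard_of` stands in for a call like the proposed `first_shard_for_statement`, and `NodeId`/`Shard` are placeholders for the driver's node and shard types.

```rust
use std::collections::HashMap;

type NodeId = u128; // placeholder for the driver's node identifier
type Shard = u32; // placeholder for the driver's shard type

/// Bucket statements by the (node, shard) pair the driver would route them to,
/// so that each bucket can then be sent as a single shard-targeted batch.
fn build_batches<S>(
    stmts: Vec<S>,
    first_shard_of: impl Fn(&S) -> Option<(NodeId, Option<Shard>)>,
) -> Vec<Vec<S>> {
    let mut buckets: HashMap<Option<(NodeId, Option<Shard>)>, Vec<S>> = HashMap::new();
    for stmt in stmts {
        buckets.entry(first_shard_of(&stmt)).or_default().push(stmt);
    }
    buckets.into_values().collect()
}
```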
Lorak-mmk self-assigned this Nov 15, 2023
@Lorak-mmk
Collaborator

It'd be helpful to have greater control over how records are batched. One scenario my team has discussed is using the Shard ID as a means to group data for batching, irrespective of partition.

The general thinking is that we can get the relevant Token for a PreparedStatement by computing the partition key and passing it to the Partitioner. To achieve this, ideally get_partitioner_name() would be pub instead of pub(crate) on PreparedStatement.

With the Token, get the Shard ID from somewhere that makes sense - no good ideas on this yet, perhaps a method on the ClusterData.

App specific logic can then be used to group records by the shard ID and batch them for writing to Scylla.

1. Is there any reason why this is a bad idea?

With Tablets (https://www.scylladb.com/presentations/tablets-rethinking-replication/) there is no longer a mapping between Token and shard (and such mapping was always an implementation detail) because the same partition may be located on different shards of different replicas.
I'm not sure what your use case for batches is, but in general you should batch by Token; otherwise you start doing multi-partition batches, which incur more work for the coordinator node and are not a good idea afaik.

2. What's the appetite for API changes that would enable this?

Rather than expose the shard information, a shard aware batching API would be more useful. I think probably related is #448.

@Ten0
Contributor

Ten0 commented Jun 20, 2024

otherwise you start doing multi-partition batches, which incur more work for the coordinator node and are not a good idea afaik

The main issue with not batching is that going through the heavy Tokio machinery for every statement puts 5x the CPU load on the server that makes the Scylla calls using the Rust driver, whereas with batching that load is shared across coordinators (and that work is probably less significant than managing Tokio futures, even ignoring the fact that it is shared, because Tokio turned out to be heavy in my benchmarks).

@wprzytula
Collaborator

otherwise you start doing multi-partition batches, which incur more work for the coordinator node and are not a good idea afaik

The main issue with not batching is that going through the heavy Tokio machinery for every statement puts 5x the CPU load on the server that makes the Scylla calls using the Rust driver, whereas with batching that load is shared across coordinators (and that work is probably less significant than managing Tokio futures, even ignoring the fact that it is shared, because Tokio turned out to be heavy in my benchmarks).

IIUC, your idea to create multi-partition batches performs faster wrt the driver, but imposes higher load on the cluster. Right? So this is a trade-off that we used to reject in the past due to our goal being to minimise load on the Scylla cluster, thus making it maximally performant.

@Ten0
Contributor

Ten0 commented Jun 25, 2024

thus making it maximally performant

It looks like at the moment the cluster may have to handle some extra networking and individual handling for every statement, which might put sub-optimal load on it as well, and which seems hard to control unless either one can send batches or this can be configured to be longer (ideally into something like chunks_timeout).

But more importantly, if we agree just that the query load is similar or greater on the client side than on the server side when statements are handled individually, then for the use-case of performing High-Frequency Append-Log Style Operations, as in my case and that of the author of #974, it seems clear that more Scylla servers will be required to store & perform maintenance operations than client servers (~1/10 ratio). So it doesn't seem practical to close off APIs that would allow offloading part of the work to the servers: that would not significantly reduce the number of required Scylla servers (they would still be needed for maintenance operations to work properly), but it would drastically increase the number of required clients.

Bear in mind that with this use-case, the current alternative is just making batches that are not targeted, and that is not better for the servers' load either.

@wprzytula
Collaborator

thus making it maximally performant

It looks like at the moment the cluster may have to handle some extra networking and individual handling for every statement, which might put sub-optimal load on it as well, and which seems hard to control unless either one can send batches or this can be configured to be longer (ideally into something like chunks_timeout).

But more importantly, if we agree just that the query load is similar or greater on the client side than on the server side when statements are handled individually, then for the use-case of performing High-Frequency Append-Log Style Operations, as in my case and that of the author of #974, it seems clear that more Scylla servers will be required to store & perform maintenance operations than client servers (~1/10 ratio). So it doesn't seem practical to close off APIs that would allow offloading part of the work to the servers: that would not significantly reduce the number of required Scylla servers (they would still be needed for maintenance operations to work properly), but it would drastically increase the number of required clients.

That's convincing, at least to me.

Bear in mind that with this use-case, the current alternative is just making batches that are not targeted, and that is not better for the servers' load either.

OK, could you reiterate what kind of API you specifically need access to in order to get such client-side performance gains?

@Ten0
Contributor

Ten0 commented Jun 25, 2024

OK, could you reiterate what kind of API you specifically need access to in order to get such client-side performance gains?

It looks like the discussion here is converging with that of #738, so I'm just going to defer to the discussion there: #738 (comment)

Lorak-mmk removed their assignment Jul 31, 2024