[NEW] Cluster-wide SCAN #33

Open
madolson opened this issue Mar 26, 2024 · 23 comments
Labels
major-decision-pending Major decision pending by TSC team

Comments

@madolson
Member

The problem/use-case that the feature addresses

Implement a scan cursor that can consistently scan the entire cluster. Today this requires individually targeting each node and sending each one its own SCAN command, which can also break, since slots can be migrated off a node and failovers can happen.

Description of the feature

Implement a consistent cluster scan with semantics like:

CSCAN <cluster cursor> [MATCH pattern] [COUNT count] [TYPE type]

The cluster cursor would be marked as NOT_KEY, but would be hashed like all other keys by clients, so that the command is routed to the next node. The cursor would contain a component that includes a hashtag, representing the slot it's currently scanning.

The format of the cursor would be:

<version>-{hashtag}-<slot db id>
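
For illustration, a cluster client could compute the routing slot from such a cursor as in the sketch below. The field layout is from the proposal above, the parsing helper is hypothetical, and `binascii.crc_hqx(data, 0)` computes the CRC16-XMODEM checksum that cluster key hashing uses:

```python
import binascii

def cursor_slot(cursor: str) -> int:
    """Hypothetical parse of the proposed <version>-{hashtag}-<slot db id>
    cursor: pull out the {hashtag} component and hash it exactly the way a
    cluster client hashes any key containing a hashtag."""
    hashtag = cursor.split("-")[1].strip("{}")
    # CRC16-XMODEM modulo the 16384 slots, as for normal key routing.
    return binascii.crc_hqx(hashtag.encode(), 0) % 16384
```

The point is that the client never needs to understand the cursor: it hashes the tag like an ordinary key, and the command lands on the node that owns the slot currently being scanned.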

Alternatives you've considered

Extending the existing SCAN cursor to support scanning a specific slot, something like:

SCAN <cluster cursor> [SLOT X] [MATCH pattern] [COUNT count] [TYPE type]

This would require the end user to specify a specific slot, and the scan command would scan just that slot.

Additional information

See redis/redis#2702.

@NikolaBorisov

I think also adding

SCAN <cluster cursor> [SLOT X] [MATCH pattern] [COUNT count] [TYPE type]

is important because it lets you scan the cluster in parallel. CSCAN cannot be parallelized. My use case for this was when I have a large cluster and I want to iterate over all the keys and change them somewhat, but I want to do this in parallel. The cluster could be so large that using CSCAN could take a very long time. I think the right abstraction would be just to allow the user to specify which slot they want to scan. It is very easy to build something that scans the whole cluster reliably if you have that (see the sketch below).
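
A minimal sketch of that idea, assuming a hypothetical client whose `execute` method routes a command to the node owning the given slot (the SLOT argument itself is the proposal above, not an existing option):

```python
from concurrent.futures import ThreadPoolExecutor

def scan_slot(client, slot: int) -> list:
    """Exhaustively scan one slot using the proposed SLOT argument."""
    cursor, keys = "0", []
    while True:
        # `client.execute` is a stand-in for a cluster client's routed call.
        cursor, batch = client.execute("SCAN", cursor, "SLOT", slot, "COUNT", 100)
        keys.extend(batch)
        if cursor == "0":
            return keys

def scan_cluster(client, workers: int = 32) -> list:
    # Slots are independent, so they can be scanned concurrently; a slot
    # that migrates mid-scan just produces a redirect and gets retried.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_slot = pool.map(lambda s: scan_slot(client, s), range(16384))
    return [key for keys in per_slot for key in keys]
```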

@madolson
Member Author

Ok, so maybe something that would make me "happyish":

CSCAN <cluster cursor> [SLOT X] [MATCH pattern] [COUNT count] [TYPE type]

Which just scans the given slot if it's provided. We can still have the cursor be marked as a key, so that your client will route it for you. If you're really smart, you could reverse engineer a cursor that hits the right node, and we could make some way to define that, but I think it makes sense to just have it be 0 or the empty string.

There is a problem with my proposal. The other thread, #4, means that it's not safe to do cross-cluster scans without restarting the slot on a MOVED, since the hashes in each database aren't stable.

@nihohit
Contributor

nihohit commented Mar 27, 2024

Does CSCAN <cluster cursor> [SLOT X] [MATCH pattern] [COUNT count] [TYPE type] mean that the cursor is only good for iterations with the same slot? What would the cursor semantics be?

@madolson
Member Author

If SLOT is provided, it would only be valid for the SLOT specified. If it's omitted, it would do a scan across all slots in the cluster.

@nihohit
Contributor

nihohit commented Mar 27, 2024

If SLOT is provided, it would only be valid for the SLOT specified. If it's omitted, it would do a scan across all slots in the cluster.

I wonder whether a cursor might become "accidentally" usable between regular SCAN, CSCAN, and CSCAN+slot calls, simply because it's computed in the same way for each call.

The cursor would contain a component that includes a hashtag, representing the slot it's currently scanning.

So, this means that a CSCAN might return a MOVED error if there was a slot migration? If so, I think it solves the issue well, but this requires a lot of heavy lifting from the cursor. For example, assuming CSCAN goes in order of slots, if a node contains slots 1 & 3 but not 2, CSCAN without slots will need to return the keys from slot 1, even if they're below COUNT, and then answer the next command with a MOVED to the node with slot 2, which will in turn respond with the keys in slot 2 and a MOVED back to the first node.
This allows the user to scan across the cluster, but it's not a great experience.

IMO this can be combined with a command that quickly (unlike CLUSTER NODES/SLOTS/SHARDS, which can be very slow in large, fragmented clusters) returns the slots available on the current node. Let's call it CLUSTER SLOTSLOCAL, and it only returns the slots available on the current node - no data on other nodes in the cluster.
That way the client can pipeline CSCAN calls with CLUSTER SLOTSLOCAL without a significant perf penalty, and quickly know whether there was a slot migration. Once the CSCAN calls on this node complete, the user knows exactly which slots were covered (if there wasn't any change), or can retry with CSCAN SLOT slot_id for slots that were added during the process.
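
A sketch of what that pipelining could look like. CLUSTER SLOTSLOCAL is the hypothetical command proposed here, and `pipeline`/`command`/`run` stand in for whatever batching API a client actually offers:

```python
def scan_node(node_client):
    """Scan one node, pipelining the hypothetical CLUSTER SLOTSLOCAL with
    each CSCAN to detect slot migration cheaply. Returns the keys plus the
    slot set if it stayed stable, or None so the caller can retry the
    slots that changed."""
    cursor, keys, owned = "0", [], None
    while True:
        pipe = node_client.pipeline()           # stand-in batching API
        pipe.command("CSCAN", cursor, "COUNT", 100)
        pipe.command("CLUSTER", "SLOTSLOCAL")
        (cursor, batch), slots_now = pipe.run()
        if owned is not None and set(slots_now) != owned:
            return keys, None                   # topology changed mid-scan
        owned = set(slots_now)
        keys.extend(batch)
        if cursor == "0":
            return keys, owned                  # these slots are fully covered
```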

@madolson
Member Author

I wonder whether a cursor might become "accidentally" usable between regular SCAN, CSCAN, and CSCAN+slot calls, simply because it's computed in the same way for each call.

That's possible, and we could do that. The only concern is that with SCAN, the client doesn't expect it to need routing. We introduced the concept of "NOT A KEY" in Redis 7, which still requires routing.

This allows the user to scan across the cluster, but it's not a great experience.

Not quite. Let's simplify: there are 2 nodes (A and B) with 3 slots (A has 1 and 3, B has 2). All slots have 100 keys. The command ordering would look like:

B> CSCAN 0 COUNT 50 -> {slot:0}-0 and []
A> CSCAN {slot:0}-0 COUNT 50 -> {slot:0}-50 and [50 keys]
A> CSCAN {slot:0}-50 COUNT 50 -> {slot:1}-0 and [50 keys] // Notice how the slot was updated: we returned the remaining keys, and the new slot's offset is empty.
B> CSCAN {slot:1}-0 COUNT 50 -> {slot:1}-50 and [50 keys]
B> CSCAN {slot:1}-50 COUNT 50 -> {slot:2}-0 and [50 keys]
A> CSCAN {slot:2}-0 COUNT 50 -> {slot:2}-50 and [50 keys]
A> CSCAN {slot:2}-50 COUNT 50 -> 0 and [50 keys] // We got a zero back, we're done!

At no point do we get a MOVED, since we're routing based on the slot information, and the client knows how to do that (see the sketch below). You're right that if there are few overall keys, we might not have a very high density. We could optimize that by also including data from the next slot if the node has it, though.
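
In client terms, the loop driving this trace could be as simple as the following sketch, where `route_and_execute` is a stand-in for a cluster client that hashes the cursor's embedded hashtag and sends the command to the owning node:

```python
def full_cluster_scan(cluster_client, count: int = 50):
    """Drive the proposed CSCAN cursor exactly as in the A/B trace above:
    the client never parses the cursor, it just routes it like a key and
    loops until the server hands back a zero cursor."""
    cursor = "0"
    while True:
        cursor, keys = cluster_client.route_and_execute(
            "CSCAN", cursor, "COUNT", count)
        yield from keys
        if cursor == "0":   # zero cursor: every slot has been scanned
            return
```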

That way the client can pipeline CSCAN calls with CLUSTER SLOTSLOCAL without a significant perf penalty, and quickly know whether there was a slot migration. Once the CSCAN calls on this node complete, the user knows exactly which slots were covered (if there wasn't any change), or can retry with CSCAN SLOT slot_id for slots that were added during the process.

The ask has just been to parallelize it, which you could still do. If you have like a million keys, we're over-indexing on performance, since it'll finish fast. If you have 10 billion keys (~1 million keys per slot), then parallelization makes sense.

@nihohit
Contributor

nihohit commented Mar 28, 2024

Let's simplify: there are 2 nodes (A and B) with 3 slots (A has 1 and 3, B has 2)

This scenario works for a cluster that is stable, but what happens during slot migrations, or scale in/out?
For simplicity's sake, what happens if the client isn't aware of slot 2 moving from A to B? What would happen on a call A> SCAN {slot:1}-0 COUNT 50?
It seems like the correct response is either

  • {slot:2}-50 and [50 keys], since these are the next values on node A, or
  • MOVED B, since the next slot for the scan is in B

What would happen on an A> SCAN {slot:1}-0 COUNT 200, where the count is larger than the number of entries in the slot? Should A return {slot:3}-0 and [200 keys from slots 0 & 2], implicitly skipping slot 1? Should it return {slot:1}-0 and [100 keys from slot 0], and under-provide on the COUNT in order to correctly reflect the missing slot?

Notice that in these examples the calls are without the [SLOT 1] argument - there's nothing explicitly requiring the queried node to contain slot 2.

Let's take a scenario in which the client only calls CSCAN, and doesn't perform any other operations - how would such a client correctly scan through a cluster undergoing changes?

@madolson
Member Author

madolson commented Apr 2, 2024

if the client isn't aware of slot 2 moving from A to B

Then it'll get a MOVED message and try again. This is the normal behavior for misunderstanding the topology.

What would happen on an A> SCAN {slot:1}-0 COUNT 200, where the count is larger than the number of entries in the slot? Should A return {slot:3}-0 and [200 keys from slots 0 & 2], implicitly skipping slot 1?

The current implementation only returns data from one slot at a time, which is our atomic unit; we would need to restart at the next slot. I mentioned an optimization, but I think you would probably want to opt in to it.

@nihohit
Contributor

nihohit commented Apr 3, 2024

The current implementation only returns data from one slot at a time

Oh, excellent. I didn't notice this in the documentation. This solves my issues :)

@avifenesh

Hope I'm not repeating something; I couldn't find any mention of it.
What about clusters with a small number of keys? Let's say 200 heavy keys distributed across different slots: would I need to run 16384 scans to get all the keys in my cluster? The default COUNT is 10 in the classic implementation, so it would be around 20 calls; here it seems to be almost 1000x more calls.
The classic implementation is based on the node's hash table (dictht), whose size and internal key distribution roughly track the number of keys (worst case, 2x the number of keys), which promises some efficiency in this kind of scenario.
With the new implementation offered, if the number of keys I have is smaller than 163840 (16384 slots x COUNT 10), I still need to add my own implementation for scanning efficiently.

@ranshid
Member

ranshid commented Apr 3, 2024

What about clusters with a small number of keys? Let's say 200 heavy keys distributed across different slots: would I need to run 16384 scans to get all the keys in my cluster? The default COUNT is 10 in the classic implementation, so it would be around 20 calls; here it seems to be almost 1000x more calls.

I think this is a valid point, which will be SOMEWHAT handled in case we continue to consume contiguous slot ranges (which I support).
This would leave the issue more impactful for fragmented slot ranges, but even in the case of fragmented slot ranges we expect to have some contiguity, and there is always the ability of the client to fan out on all slots in case the application understands its workload.

I would like to ask 2 other questions:

  1. How do we plan to make these commands available via scripts/transactions? I would imagine CSCAN will probably be non-script, but SCAN with a specific slot will be available, right?
  2. One thing that I think is missing from the current SCAN is filtering by TTL. Can we consider adding such an option?

@CharlesChen888
Member

Note that [pattern] may imply a specific slot, and this is useful when SLOT is not provided, or when the provided slot is different from the slot the pattern implies.

@madolson
Member Author

How do we plan to make these commands available via scripts/transactions? I would imagine CSCAN will probably be non-script, but SCAN with a specific slot will be available, right?

Why would CSCAN be non-script? I don't see anything that would strictly break from it.

One thing that I think is missing from the current SCAN is filtering by TTL. Can we consider adding such an option?

Sounds like a separate issue? What is the use case to filter by TTL?

@madolson madolson added the major-decision-pending Major decision pending by TSC team label Apr 20, 2024
@madolson madolson added this to Roadmap Jun 10, 2024
@madolson madolson moved this to Idea in Roadmap Jun 25, 2024
@zuiderkwast zuiderkwast changed the title [NEW] Redis consistent cluster SCAN [NEW] Cluster-wide SCAN Oct 2, 2024
@zuiderkwast
Contributor

zuiderkwast commented Oct 2, 2024

I know, @madolson, that you don't want end users to be aware of slots, but we can say that the intention is for client libraries to handle the slots and provide scan functionality to the end user without exposing the slots. Then it should be fine?

I believe scanning per slot is the most robust way that also allows nodes to be scanned in parallel.

I'm not sure about the C prefix in CSCAN though. I know Redis added a command called SFLUSH to flush all keys in a slot, but the S prefix is normally for sets, like SADD, so that's pretty bad imo. C isn't used as a prefix yet, but I think CLUSTER SCAN is better. It puts the command within the cluster-specific namespace, so non-cluster users don't get confused by it.

@madolson
Member Author

madolson commented Oct 4, 2024

I know, @madolson, that you don't want end users to be aware of slots, but we can say that the intention is for client libraries to handle the slots and provide scan functionality to the end user without exposing the slots. Then it should be fine?

I'm OK with this, yes. The point of the cluster abstraction is that end users just see a flat keyspace. I have a secondary goal that clients shouldn't have to do much work, but given that Glide was willing to fully track the entire slot space (see valkey-io/valkey-glide#1623) to do a full cluster scan, I guess that point isn't really that important anymore.

I believe scanning per slot is the most robust way that also allows nodes to be scanned in parallel.

I agree? I'm not sure if this is disagreeing with my point, but I was just positing that we can re-use a lot of existing functionality if we encode the slot into the cursor. I also thought it might be possible to someday in the future incorporate other types of hashing (like consistent hashing).

C isn't used as a prefix yet, but I think CLUSTER SCAN is better.

I have a strong preference not to put something in the CLUSTER container unless it's actually necessary. There is a lot of annoyance around CLIENT having a mix of admin commands and steady-state commands, which makes ACLs hard to reason about. I think we can't use SSCAN because it already exists (set scan). I assume SFLUSH was following after SPUBLISH, which stood for sharded publish. I don't like CSCAN, but it was the best I could come up with.

@zuiderkwast
Contributor

I agree? I'm not sure if this is disagreeing with my point, but I was just positing that we can re-use a lot of existing functionality if we encode the slot into the cursor.

I didn't mean to say you disagree about this one. 😄

Btw, we already encode the slot in the cursor, though it's undocumented.

I also thought it might be possible to someday in the future incorporate other types of hashing (like consistent hashing).

What are you thinking of? The slot mapping is already a superset of consistent hashing. Or... replace CRC16 with... SHA256? I think it would be a very big breaking change to cluster clients if we change the slot concept.

CRC16 is not fair, but we could compensate for this already. We can create a pre-computed heat map that says how likely a random key is to hash to each slot, which we can use when scaling to compensate for the unfairness of CRC16.
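
A minimal sketch of how such a heat map could be precomputed, assuming uniformly random keys (a real map would sample the expected key shape; `binascii.crc_hqx(data, 0)` is the CRC16-XMODEM checksum cluster key hashing uses):

```python
import binascii, os

NUM_SLOTS = 16384

def slot_of(key: bytes) -> int:
    # CRC16-XMODEM modulo the slot count, as in cluster key routing.
    return binascii.crc_hqx(key, 0) % NUM_SLOTS

def build_heat_map(samples: int = 1_000_000) -> list:
    """Estimate how likely a random key is to land in each slot."""
    counts = [0] * NUM_SLOTS
    for _ in range(samples):
        counts[slot_of(os.urandom(16))] += 1
    return [c / samples for c in counts]
```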

I have a strong preference not to put something in the CLUSTER container

Fair point about not mixing with admin commands. (CLUSTER SLOTS is not an admin command though.)

How about the already suggested names SLOTSCAN, or SCAN with a SLOT argument? They are clear and intuitive to me.

@madolson
Member Author

madolson commented Oct 4, 2024

Btw, we already encode the slot in the cursor, though it's undocumented.

Sure, but it doesn't work with any existing cluster routing. It's also undocumented because it's fragile and might change.

What are you thinking of? The slot mapping is already a superset of consistent hashing. Or... replace CRC16 with... SHA256? I think it would be a very big breaking change to cluster clients if we change the slot concept.

Let's say we move away from 2^14 slots. Are we okay with adding a net new command, or a new argument, to add support for that new backend? The idea that I proposed, CSCAN <cluster cursor> [MATCH pattern] [COUNT count] [TYPE type], would work regardless of the underlying number or type of slots, which is why it felt more like a user command to me: it represents the abstraction we want to show end users, the flat keyspace.

SLOTSCAN

Slight preference for this, since I would like it to be its own command with its own documentation page. We could also just do CLUSTERSCAN, I guess. It doesn't have the ACL baggage.

@madolson
Member Author

madolson commented Oct 4, 2024

In addition, I'm also pretty chill with this:

CLUSTERSCAN <cluster cursor> [MATCH pattern] [COUNT count] [TYPE type] [SLOT number]

The default scans all slots. If you specify a specific slot, it scans just that slot.

@zuiderkwast
Contributor

zuiderkwast commented Oct 4, 2024

Let's say we move away from 2^14 slots.

It's such a breaking change that we'll probably never do it, and if we do, we could still use the slot concept, though we may want to accept slot ranges everywhere rather than single slots, if the total number of slots is very large.

CLUSTERSCAN <cluster cursor> [MATCH pattern] [COUNT count] [TYPE type] [SLOT number]

Yes! I accept this one. This works for the two use cases:

  1. without a slot, scan the node until you get a redirect to the next shard
  2. with a slot, it makes it possible to scan multiple nodes in parallel without worrying about slot migrations

The version without a slot is a bit risky if a slot migration happens during the scan, well, at least if you try to scan multiple shards in parallel. I don't want the client to feel the need to do 16K calls to CLUSTER SLOTS to detect that (is that what GLIDE does?). With a specified slot, you would get a redirect if the slot has been migrated. (We could even consider a slot range and return some error if the node doesn't own all slots in the range. This can be a future parameter.)

@madolson
Member Author

madolson commented Oct 6, 2024

I don't want the client to feel the need to do 16K calls to CLUSTER SLOTS to detect that (is that what GLIDE does?).

GLIDE does a normal SCAN; if during the scan it detects that the slot configuration changed, it will redo the scan. This maintains the current guarantee that items can be returned multiple times. Once it's done a full scan, it will indicate which slots it has scanned. I'm not sure it returns a serializable cursor; @avifenesh would know, though.

(We could even consider a slot range and return some error if the node doesn't own all slots in the range. This can be a future parameter.)

True.

@avifenesh

avifenesh commented Oct 6, 2024

I don't want the client to feel the need to do 16K calls to CLUSTER SLOTS to detect that (is that what GLIDE does?).

GLIDE does a normal SCAN; if during the scan it detects that the slot configuration changed, it will redo the scan. This maintains the current guarantee that items can be returned multiple times. Once it's done a full scan, it will indicate which slots it has scanned. I'm not sure it returns a serializable cursor; @avifenesh would know, though.

(We could even consider a slot range and return some error if the node doesn't own all slots in the range. This can be a future parameter.)

True.

Glide scans nodes normally and marks the slots they own as scanned when a node is completed.
If some slots were added (epoch bump), the node's pass is not valid and we don't mark its slots as scanned.
If some slots moved out (no epoch change), we mark the remaining slots as scanned.
We keep going to the owner of the next slots that haven't been scanned (see the sketch below).

Glide calls CLUSTER SLOTS on a small number of nodes after each node completion, so basically as many CLUSTER SLOTS calls as there are shards.

We return to the user an RC (reference-counted pointer) to the object with the scan progress (slots scanned, node in scan, epoch at the beginning of the node's scan).
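
A rough sketch of that bookkeeping (illustrative only, not Glide's actual types): a 16384-entry scanned-slot map plus the epoch snapshot taken when a node's scan began:

```python
NUM_SLOTS = 16384

class ScanProgress:
    """Tracks which slots a cluster-wide scan has covered so far."""
    def __init__(self, epoch: int):
        self.scanned = [False] * NUM_SLOTS
        self.epoch = epoch

    def on_node_complete(self, owned_slots: list, epoch_now: int) -> None:
        if epoch_now != self.epoch:
            # Slots were added (epoch bump): this node's pass isn't trusted,
            # so none of its slots are marked and the scan revisits them.
            self.epoch = epoch_now
            return
        for slot in owned_slots:  # slots that moved out are simply absent
            self.scanned[slot] = True

    def next_unscanned(self):
        # The next target: route to whichever node owns this slot now.
        return next((s for s, done in enumerate(self.scanned) if not done), None)
```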

@avifenesh

In addition, I'm also pretty chill with this:

CLUSTERSCAN <cluster cursor> [MATCH pattern] [COUNT count] [TYPE type] [SLOT number]

The default scans all slots. If you specify a specific slot, it scans just that slot.

I'm afraid about the cost of multiple requests for one iteration: if the slots are not in the same shard, the calls would need to be directed to many nodes, instead of one by one. In the worst case, it can be pretty expensive. Maybe also in the common case.
Clients can solve that by using a topology map and choosing slots in the same shard, but is it the intention to still need the client to implement some method for a smart scan?

@madolson
Member Author

madolson commented Nov 5, 2024

I'm afraid about the cost of multiple requests for one iteration: if the slots are not in the same shard, the calls would need to be directed to many nodes, instead of one by one. In the worst case, it can be pretty expensive. Maybe also in the common case.

This doesn't seem like that material of an issue to me. The point is to make the scan more scalable, so we are more worried about the large cluster case. I didn't really follow the last sentence; what does "Maybe also in the common case" mean?

Clients can solve that by using a topology map and choosing slots in the same shard, but is it the intention to still need the client to implement some method for a smart scan?

The goal is to balance the complexity between the server and client. The current implementation in Glide, imo, is overly complicated in trying to replicate the guarantees of the server. It seems like with a rather small change to the server-side cursor, the client work becomes much simpler.
