[NEW] Cluster-wide SCAN #33
I think also adding this is important because it lets you scan the cluster in parallel; CSCAN cannot be parallelized. My use case was a large cluster where I want to iterate over all the keys and modify them somewhat, but in parallel. The cluster could be so large that using CSCAN alone would take a very long time. I think the right abstraction is just to allow the user to specify which slot they want to scan. It is very easy to build something that scans the whole cluster reliably if you have that.
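To make the parallelism argument concrete, here's a minimal Python sketch. It's purely a simulation (no real cluster or client involved), and `scan_slot` is a stand-in for a hypothetical per-slot scan command: because each slot has its own independent cursor, every slot can be drained concurrently from a thread pool.

```python
# Sketch: per-slot cursors enable parallel full-cluster iteration.
# scan_slot is a stand-in for a hypothetical "scan one slot" command;
# here it just pages through an in-memory per-slot keyspace.
from concurrent.futures import ThreadPoolExecutor

NUM_SLOTS = 16  # real clusters use 16384; small here for illustration

# Fake keyspace: slot -> list of keys (stand-in for a real cluster).
keyspace = {s: [f"key:{s}:{i}" for i in range(5)] for s in range(NUM_SLOTS)}

def scan_slot(slot, cursor=0, count=2):
    """Return (next_cursor, keys); a next_cursor of 0 means the slot is done."""
    keys = keyspace[slot][cursor:cursor + count]
    next_cursor = cursor + count
    if next_cursor >= len(keyspace[slot]):
        next_cursor = 0
    return next_cursor, keys

def scan_whole_slot(slot):
    """Drain one slot with repeated calls, like a normal SCAN loop."""
    out, cursor = [], 0
    while True:
        cursor, keys = scan_slot(slot, cursor)
        out.extend(keys)
        if cursor == 0:
            return out

# Each slot is an independent cursor, so slots can be scanned in parallel.
with ThreadPoolExecutor(max_workers=4) as pool:
    all_keys = [k for keys in pool.map(scan_whole_slot, range(NUM_SLOTS))
                for k in keys]

print(len(all_keys))  # 16 slots * 5 keys = 80
```

A single CSCAN-style cursor, by contrast, is one serial chain of calls, which is the point being made above.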
OK, so maybe something that would make me "happyish":
Which just scans the given slot if it's provided. We can still have the cursor be marked as a key, so that your client will route it for you. If you're really smart, you could reverse engineer a cursor that hits the right node, and we could define some way to do that, but I think it makes sense to just have it be 0 or the empty string. There is a problem with my proposal, though: as noted on the other thread, #4, it's not safe to do cross-cluster scans without restarting the slot on MOVED, since the hashes in each database aren't stable.
What does it do if SLOT is omitted?
If SLOT is provided, the cursor would only be valid for the specified slot. If it's omitted, it would scan across all slots in the cluster.
I wonder whether a cursor might become "accidentally" usable between regular SCAN, CSCAN, and CSCAN+slot calls, simply because it's computed in the same way for each call.
So, this means that a CSCAN might return a MOVED error if there was a slot migration? If so, I think that solves the issue well, but it requires a lot of heavy lifting from the cursor. For example, assuming CSCAN goes in order of slots: if a node contains slots 1 and 3 but not 2, CSCAN without a slot will need to return the keys from slot 1, even if they're below COUNT, and then answer the next command with a MOVED to the node with slot 2, which will in turn respond with the keys in slot 2 and a MOVED back to the first node. IMO this can be combined with a command that quickly (unlike
That's possible, and we could do that. The only concern is that with SCAN, the client doesn't expect it to need routing. We introduced the concept of "NOT A KEY" in Redis 7, which still requires routing.
Not quite. Let's simplify: there are 2 nodes (A and B) with 3 slots (A has 1 and 3, B has 2). All slots have 100 keys. The commands would be routed slot by slot: first to A for slot 1, then to B for slot 2, then back to A for slot 3.
At no point do we get a MOVED, since we're routing based on the slot information, and the client knows that. You're right that if there are few overall keys, we might not have very high density. We could optimize that by also including data from the next slot if the node has it, though.
The ask has just been to parallelize it, which you could still do. If you have, say, a million keys, we're over-indexing on performance, since it'll finish fast. If you have 10 billion keys (~1 million keys per slot), then parallelization makes sense.
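The per-slot routing described above can be simulated in a few lines of Python. This is only an illustration of the assumed semantics, not real server code: node A owns slots 1 and 3, node B owns slot 2, 100 keys each, and the client sends each per-slot cursor to the owning node, so a stable cluster never produces a MOVED.

```python
# Simulation of the two-node walkthrough: A owns slots 1 and 3, B owns
# slot 2. The client routes each per-slot cursor to the owning node.
slot_owner = {1: "A", 2: "B", 3: "A"}
keyspace = {s: [f"k:{s}:{i}" for i in range(100)] for s in slot_owner}

def cscan(node, slot, cursor, count=100):
    # In a real cluster a wrong node would answer MOVED; in this stable
    # simulation the client always routes correctly.
    assert slot_owner[slot] == node, "would be a MOVED redirect"
    keys = keyspace[slot][cursor:cursor + count]
    nxt = cursor + count
    return (0 if nxt >= len(keyspace[slot]) else nxt), keys

calls, results = [], []
for slot in sorted(slot_owner):          # iterate slots in order
    node, cursor = slot_owner[slot], 0
    while True:
        calls.append(node)
        cursor, keys = cscan(node, slot, cursor)
        results.extend(keys)
        if cursor == 0:
            break

print(calls)         # the A -> B -> A routing order from the walkthrough
print(len(results))  # 300 keys, all three slots covered
```

Note that nothing here forbids running the three slot loops in parallel instead of in slot order, which is the parallelization point made above.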
This scenario works for a cluster that is stable, but what happens during slot migrations, or scale in/out?
What would happen on a slot migration? Notice that in these examples the calls are made without the SLOT argument. Let's take a scenario in which the client only calls CSCAN and doesn't perform any other operations: how would such a client correctly scan through a cluster undergoing changes?
Then it'll get a MOVED response and try again. This is the normal behavior for a client with a stale view of the topology.
The current implementation only returns data from one slot at a time, which is our atomic unit, so we would just need to restart at the next slot. I mentioned an optimization, but I think you would probably want to opt in to it.
Oh, excellent. I didn't notice this in the documentation. This solves my issues :)
Hope I'm not repeating something; I couldn't find any mention of it.
I think this is a valid point, which will be SOMEWHAT handled if we continue to consume contiguous slot ranges (which I support). I would like to ask 2 other questions:
Note that [pattern] may imply a specific slot, and this is useful when a slot is not provided, or when the provided slot is different from the slot the pattern implies.
Why would CSCAN be non-script? I don't see anything that would strictly break from it.
Sounds like a separate issue? What is the use case for filtering by TTL?
I know @madolson you don't want end users to be aware of slots, but we can say that the intention is for client libraries to handle the slots and provide scan functionality to the end user without exposing them. Then it should be fine? I believe scanning per slot is the most robust way, and it also allows nodes to be scanned in parallel. I'm not sure about the C prefix in CSCAN though. I know Redis added a command called SFLUSH to flush all keys in a slot, but the S prefix is normally for sets, like SADD, so that's pretty bad imo. C isn't used as a prefix yet though, but I think
I'm OK with this, yes. The point of the cluster abstraction is that end users just see a flat keyspace. I had a secondary goal of not wanting clients to have to do much work, but given that Glide was willing to fully track the entire slot space (see valkey-io/valkey-glide#1623) to do a full cluster scan, I guess that point isn't really that important anymore.
I agree? I'm not sure if this is disagreeing with my point, but I was just positing that we can re-use a lot of existing functionality if we encode the slot into the cursor. I also thought it might be possible to someday in the future incorporate other types of hashing (like consistent hashing).
I have a strong preference not to put something like this under the CLUSTER admin commands.
I didn't mean to say you disagree about this one. 😄 Btw, we already encode the slot in the cursor, though it's undocumented.
How are you thinking? The slot mapping is already a superset of consistent hashing. Or... replace CRC16 with... SHA256? I think it would be a very big breaking change for cluster clients if we changed the slot concept. CRC16 is not fair, but we could compensate for this: we can create a pre-computed heat map that says how likely a random key is to hash to each slot, which we can use when scaling to compensate for the unfairness of CRC16.
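For reference, the slot function this discussion assumes is CRC16 (the XMODEM variant) of the key, modulo 16384, hashing only the "{...}" hash tag when a non-empty one is present. A heat map like the one described could be built by running this over a sample of keys. A minimal Python version, following the published cluster key-slot rules:

```python
# The cluster slot function: CRC16/XMODEM of the key, modulo 16384.
# If the key contains a non-empty "{...}" tag, only the tag is hashed,
# which is how related keys are forced into the same slot.
def crc16(data: bytes) -> int:
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc

def key_slot(key: str) -> int:
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end > start + 1:  # only a non-empty tag counts
            key = key[start + 1:end]
    return crc16(key.encode()) % 16384

print(hex(crc16(b"123456789")))  # 0x31c3, the CRC16/XMODEM check value
print(key_slot("foo"))           # 12182, as in the cluster tutorial
print(key_slot("{user1000}.following") == key_slot("{user1000}.followers"))
```

The unfairness being discussed is exactly the non-uniformity of this mapping over real key distributions, which sampling with `key_slot` would expose.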
Fair point about not mixing with admin commands. (CLUSTER SLOTS is not an admin command though.) How about the already-suggested names SLOTSCAN, or SCAN with a SLOT argument? They seem clear and intuitive to me.
Sure, but it doesn't work with any existing cluster routing. It's also undocumented because it's fragile and might change.
Let's say we move away from 2^14 slots. Are we okay with adding a net new command, or a new argument, to add support for that new backend? The idea that I proposed,
Slight preference for this, since I would like it to be its own command with its own documentation page. We could also just do
In addition, I'm also pretty chill with this:
The default scans all slots. If you specify a specific slot, it scans just that slot.
It's such a breaking change that we'll probably never do it, and if we do, we could still use the slot concept, though we may want to accept slot ranges everywhere rather than single slots, if the total number of slots is very large.
Yes! I accept this one. This works for both use cases:
The version without a slot is a bit risky if a slot migration happens during the scan, at least if you try to scan multiple shards in parallel. I don't want the client to feel the need to do 16K calls to CLUSTER SLOTS to detect that (is that what GLIDE does?). With a specified slot, you would get a redirect if the slot has been migrated. (We could even consider accepting a slot range and returning an error if the node doesn't own all slots in the range. This can be a future parameter.)
GLIDE does a normal
True.
Glide calls CLUSTER SLOTS on a small number of nodes after each node completes. We return to the user an RC pointing to the object with the scan progress (slots scanned, node in scan, epoch at the beginning of that node's scan).
I'm worried about the cost of multiple requests for one iteration: if the slots are not in the same shard, the call would need to be directed to many nodes instead of one at a time. In the worst case, it can be pretty expensive. Maybe also in the common case.
This doesn't seem like that material of an issue to me. The point is to make the scan more scalable, so we are more worried about the large cluster case. I didn't really follow the last sentence, what does "Maybe also in the common case" mean?
The goal is to balance the complexity on the server and client. The current implementation in glide, imo, is overly complicated to try to replicate the guarantees of the server. It seems like with a rather small change to the server-side cursor the client work becomes much simpler. |
The problem/use-case that the feature addresses
Implement a scan cursor that can consistently scan the entire cluster, instead of today's approach, which requires individually targeting each node and sending it an individual SCAN command. That approach can also break, since slots can be migrated off a node and failovers can happen.
Description of the feature
Implement a consistent cluster scan with semantics like:
The cluster cursor would be marked as NOT_KEY, but would be hashed like all other keys by clients, so that it is routed to the next node. The cursor would contain a component that includes a hashtag, representing the slot it's currently scanning.
The format of the cursor would be:
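The concrete format isn't shown above, so the following is only a hypothetical Python sketch of the idea: embed a hash tag in the cursor that routes to the slot being scanned. The `"{tag}:slot:inner"` layout, the brute-force tag search, and the `make_cursor`/`parse_cursor` helpers are all invented for illustration, not part of the proposal.

```python
# Hypothetical cursor layout "{<tag>}:<slot>:<inner>" -- NOT the actual
# proposed format. The hash tag is chosen so that a client hashing the
# cursor like a key routes it to the slot currently being scanned.
def crc16(data: bytes) -> int:
    # CRC16/XMODEM, the checksum used for cluster key slots.
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc

def key_slot(key: str) -> int:
    # Hash only the "{...}" tag when a non-empty one is present.
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end > start + 1:
            key = key[start + 1:end]
    return crc16(key.encode()) % 16384

def tag_for_slot(slot: int) -> str:
    # Brute-force a short tag that hashes to the wanted slot.
    i = 0
    while key_slot(str(i)) != slot:
        i += 1
    return str(i)

def make_cursor(slot: int, inner: int) -> str:
    return "{%s}:%d:%d" % (tag_for_slot(slot), slot, inner)

def parse_cursor(cursor: str):
    _tag, slot, inner = cursor.split(":")
    return int(slot), int(inner)

cur = make_cursor(key_slot("foo"), 7)
print(key_slot(cur) == key_slot("foo"))  # cursor routes to the slot it scans
```

Any client that already applies normal key hashing to route commands would then send the cursor to the node that owns the slot being scanned, without new routing logic.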
Alternatives you've considered
Extending the existing SCAN cursor to support scanning a specific slot, something like:
This would require the end user to specify a specific slot, and the scan command would scan just that slot.
Additional information
See redis/redis#2702.