-
Notifications
You must be signed in to change notification settings - Fork 695
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Slot migration improvement #245
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## unstable #245 +/- ##
============================================
+ Coverage 68.43% 68.89% +0.45%
============================================
Files 109 109
Lines 61681 61785 +104
============================================
+ Hits 42214 42566 +352
+ Misses 19467 19219 -248
|
@valkey-io/core-team ready for your review |
@PingXie I updated the PR description. Please check if it's correct and edit again if it's not. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Partial review. (I've already reviewed it earlier and approved it, but it's long ago.)
I think my previous fix (redis/redis#13055) introduced a race condition. More specifically, I don't think it is a good idea to move the slots on L3015, Instead, I should've always routed the topology update through
Thanks @zuiderkwast! I will write the proper PR description next. |
Two commits need sign off. They're called "Update src/valkey-cli.c". |
This should now be fixed. I don't think we need to backport the change to Valkey 7.2. The original fix should still work. It is just that the new reliability improvements in this PR require all topology updates be done in Next step is to (re)introduce the server-initiate wait and get rid of the |
Signed-off-by: Ping Xie <[email protected]>
Signed-off-by: Ping Xie <[email protected]>
Signed-off-by: Ping Xie <[email protected]>
Co-authored-by: Viktor Söderqvist <[email protected]> Signed-off-by: Ping Xie <[email protected]>
Co-authored-by: Viktor Söderqvist <[email protected]> Signed-off-by: Ping Xie <[email protected]>
Signed-off-by: Ping Xie <[email protected]>
…nction Signed-off-by: Ping Xie <[email protected]>
Signed-off-by: Ping Xie <[email protected]>
`CLUSTER SETSLOT NODE` Signed-off-by: Ping Xie <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly nit-picks. Pretty difficult to wrap my head around these complex conditions around slot ownership. Will re-read the code few times :D.
- We should apply the formatting (planned to be added) on the new code changes. I see parantheses opening, spacing after
,
different from expected in few places. - We are using primary/replica in certain place(s) and master/slave terminology in few place(s) in code comments. Maybe just use
primary/replica
.
Thanks for the review, @hpatro. There is definitely inconsistency in this PR as it has gone through too many rounds of refactoring. I will clean up the naming and comments as part of the timeout enhancement but I will leave the coding style and terminology (primary/replica) to separate PRs which I plan to address across the entire codebase. |
Quick poll on the new timeout parameter. We talked about allowing the client to control the wait time (with a default of 5s) on a per command basis for "CLUSTER SETSLOT", which now pre-replicates the command to the replicas first. QThis new parameter can only be added to the end of the command to maintain backward compat. I think a cleaner change would be introducing a "timeout" token such as "timeout 1000" but looks like this is not a pattern used in existing commands like "WAIT" where the timeout parameter is expected to be at a fixed location. Thoughts? |
For newer commands e.g. https://valkey.io/commands/xread/, we went with what you're suggesting above i.e. first the keyword and then followed by the actual value. The parsing logic becomes a slightly more complex but gives more flexibility. I think it's reasonable to proceed with your suggestion. |
Signed-off-by: Ping Xie <[email protected]> Co-authored-by: Viktor Söderqvist <[email protected]>
Initial PR to add a governance doc outlining permissions for the main Valkey project as well as define responsibilities for sub-projects. --------- Signed-off-by: Madelyn Olson <[email protected]> Co-authored-by: Viktor Söderqvist <[email protected]> Co-authored-by: Ping Xie <[email protected]> Co-authored-by: zhaozhao.zz <[email protected]> Co-authored-by: hwware <[email protected]> Co-authored-by: binyan <[email protected]
Fix the mem_freed variable to be initialized with init. with this PR prevents the variable from acting unknowingly. Signed-off-by: NAM UK KIM <[email protected]>
Delete unused declaration `void *dictEntryMetadata(dictEntry *de);` in dict.h. --------- Signed-off-by: Lipeng Zhu <[email protected]>
Serverassert is a drop-in replacement of assert. We use it even in code copied from other sources. To make these files usable outside of Valkey, it should be enough to replace the `serverassert.h` include with `<assert.h>`. Therefore, this file shouldn't have any dependencies to the rest of the valkey code. --------- Signed-off-by: Viktor Söderqvist <[email protected]>
Improve the performance of crc64 for large batches by processing large number of bytes in parallel and combining the results. ## Performance * 53-73% faster on Xeon 2670 v0 @ 2.6ghz * 2-2.5x faster on Core i3 8130U @ 2.2 ghz * 1.6-2.46 bytes/cycle on i3 8130U * likely >2x faster than crcspeed on newer CPUs with more resources than a 2012-era Xeon 2670 * crc64 combine function runs in <50 nanoseconds typical with vector + cache optimizations (~8 *microseconds* without vector optimizations, ~80 *microseconds without cache, the combination is extra effective) * still single-threaded * valkey-server test crc64 --help (requires `make distclean && make SERVER_TEST=yes`) --------- Signed-off-by: Josiah Carlson <[email protected]> Signed-off-by: Madelyn Olson <[email protected]> Co-authored-by: Viktor Söderqvist <[email protected]> Co-authored-by: Madelyn Olson <[email protected]>
…of-rewrite (valkey-io#393) Renamed redis to valkey/server in aof.c serverlogs. The AOF rewrite child process title is set to "redis-aof-rewrite" if Valkey was started from a redis-server symlink, otherwise to "valkey-aof-rewrite". This is a breaking changes since logs are changed. Part of valkey-io#207. --------- Signed-off-by: Shivshankar-Reddy <[email protected]>
This is a minor change where only naming and links now points properly to valkey. Fixes valkey-io#388 --------- Signed-off-by: Rolandas Šimkus <[email protected]> Signed-off-by: simkusr <[email protected]> Signed-off-by: simkusr <[email protected]> Signed-off-by: Viktor Söderqvist <[email protected]> Co-authored-by: simkusr <[email protected]> Co-authored-by: Viktor Söderqvist <[email protected]>
These JSON files were originally not intended to be used directly, since they contain internals and some fiels like "acl_categories" that are not the final ACL categories. (Valkey will apply some implicit rules to compute the final ACL categories.) However, people see JSON files and use them directly anyway. So it's better to document them. In a later PR, we can get rid of all implicit ACL categories and instead populate them explicitly in the JSON files. Then, we'll add a validation (e.g. in generate-command-code.py) that the implied categories are set. --------- Signed-off-by: Viktor Söderqvist <[email protected]> Co-authored-by: Binbin <[email protected]>
Signed-off-by: Ikko Eltociear Ashimine <[email protected]>
Signed-off-by: Ping Xie <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just tests left for review.
src/cluster_legacy.c
Outdated
/* We intentionally avoid updating myself's configEpoch when | ||
* taking ownership of this slot. This approach is effective | ||
* in scenarios where my primary crashed during the slot | ||
* finalization process. I became the new primary without | ||
* inheriting the slot ownership, while the source shard | ||
* continued and relinquished the slot. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel very uncomfortable about just breaking core assumptions about the algorithm here (specifically omitting the configEpoch). It really feels like we should be bumping the epoch here, even if it means we steal slots unnecessarily. (This is an edge case after all) We are explicitly prioritizing availability here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is equally incorrect, if not more, to bump the config epoch without going through consensus. The impact on availability is the same, bumping the epoch or not, because this node will assume the ownership of this slot. In this case, the more likely case will be what I explained in the comments that follow (copied below too)
* By not increasing myself's configEpoch, we ensure that
* if the slot is correctly migrated to another primary, I
* will not mistakenly claim ownership. Instead, any ownership
* conflicts will be resolved accurately based on configEpoch
* values. */
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is equally incorrect, if not more, to bump the config epoch without going through consensus.
This isn't true, there is nothing wrong with bumping the epoch without consensus. Consensus is the preferred way to do it to avoid unnecessary churn, but it's not a pre-requisite.
The impact on availability is the same, bumping the epoch or not, because this node will assume the ownership of this slot. In this case, the more likely case will be what I explained in the comments that follow (copied below too)
I agree, that is why I'm saying we should bump the epoch. There was a change to the slot ownership, that is supposed to come with an epoch bump.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't true, there is nothing wrong with bumping the epoch without consensus. Consensus is the preferred way to do it to avoid unnecessary churn, but it's not a pre-requisite.
A primary case that I am trying to handle is where the node in question becomes a primary without knowing that the source shard has just given up its slot ownership to its old primary since the old primary just failed. For this node to become the new primary, it would have to go through an election and get its config epoch bumped already (with a consensus). However, it is also conceivable that the source shard might have given the slot to a different shard and it is in this case where I still think it is safer to not blindly bump the config epoch because it would undo the true indent; while in the primary failure case we don't need to bump the config epoch, because it is a side product of the failover.
Btw, in all existing cases where we bump the config epoch without consensus, we have a clear intent to take over the slot ownership and this is the key difference IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw, in all existing cases where we bump the config epoch without consensus, we have a clear intent to take over the slot ownership and this is the key difference IMO.
You're prescribing intent to an algorithm. The epoch get's bumped every time the slot ownership is changed, because we are in a new-epoch with new state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think it is safer to not blindly bump the config epoch because it would undo the true indent;
But you're okay with blindly serving data without true intent? I suppose that is my problem with this, either we are committing to the change with the epoch bump or we shouldn't be serving traffic.
} | ||
|
||
# restart a server and wait for it to come back online | ||
proc restart_server_and_wait {server_id} { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Almost universally we de-schedule processes by using the pause helpers to trigger failovers. This does two things, one is that we don't have to restart the process manually later, and second is that it's similar to when a processes disappears for a bit so tests additional modes.
Is there a specific reason we are restarting here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Almost universally we de-schedule processes by using the pause helpers to trigger failovers.
is this pause_process
? it kills the process and a restart is needed too. or this is something else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It just pauses the process, I was asking if we specifically need to kill it here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. my intent is to simulate true failures. If I don't kill the server, I will have to use "graceful failover" to switch roles.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the server is paused (SIGPAUSE), it doesn't respond to cluster messages and this leads to an automatic failover, in the same way as if the network were broken. I think simulates a true failure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, the segfault seems to be causing more valgrind issues as well. I think we should re-evaluate this decision.
assert {$duration > 2000} | ||
|
||
# Setslot should fail with not enough good replicas to write after the timeout | ||
assert_equal {NOREPLICAS Not enough good replicas to write.} $e |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it's more weird behavior, but for now it seems acceptable.
Signed-off-by: Ping Xie <[email protected]>
I am not sure if I agree with the "more weird behavior" statement. There is actually a coherent story to tell/document here. The |
Signed-off-by: Ping Xie <[email protected]>
Okay now I think this is "weird behavior" on the github side :). It apparently misplaced your comment and made it look like it was about the error message but after I clicked into the code, I saw that you were commenting about the "torn" states. Please ignore my response above. |
Signed-off-by: Ping Xie <[email protected]>
Signed-off-by: Ping Xie <[email protected]>
Signed-off-by: Ping Xie <[email protected]>
Sorry I can't fix the DCO. Shouldn't have rebased. Closing this PR and moving changes to #445. |
Overview
This PR significantly enhances the reliability and automation of the Valkey cluster re-sharding process, specifically during slot migrations in the face of primary failures. These updates address critical failure issues that previously required extensive manual intervention and could lead to data loss or inconsistent cluster states.
Enhancements
Automatic Failover Support in Empty Shards
The cluster now supports automatic failover in shards that do not own any slots, which is common during scaling operations. This improvement ensures high availability and resilience from the outset of shard expansion.
Replication of Slot Migration States
All
CLUSTER SETSLOT
commands are now initially executed on replica nodes before the primary. This ensures that the slot migration state is consistent within the shard, preventing state loss in the event of primary failure. A new timeout parameter has been introduced, allowing users to specify the duration in milliseconds to wait for replication to complete, with a default set at 2 seconds.Recovery of Logical Migration Links
The update automatically repairs the logical links between source and target nodes during failovers. This ensures that requests are correctly redirected to the new primary in the target shard after a primary failure, maintaining cluster integrity.
Enhanced Support for New Replicas
New replicas added to shards involved in slot migrations will now automatically inherit the slot's migration state as part of their initialization. This ensures that new replicas are immediately consistent with the rest of the shard.
Improved Logging for Slot Migrations
Additional logging has been implemented to provide operators with clearer insights into the slot migration processes and automatic recovery actions, aiding in monitoring and troubleshooting.
Additional Changes
cluster-allow-replica-migration
When
cluster-allow-replica-migration
is disabled, primary nodes that lose their last slot to another shard will no longer automatically become replicas of the receiving shard. Instead, they will remain in their own shards, which will now be empty, having no slots assigned to them.Fix #21