Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement round robin read from replica strategy for async cluster. #24

Merged
merged 3 commits into from
Sep 21, 2023

Conversation

nihohit
Copy link

@nihohit nihohit commented Aug 15, 2023

@nihohit
Copy link
Author

nihohit commented Aug 28, 2023

ping @barshaul

return slot.addrs.primary.as_str();
}
match read_from_replica {
ReadFromReplicaStrategy::AlwaysFromPrimary => slot.addrs.primary.as_str(),

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When can we get to a scenario where slot_addr == Replica, but read_from_replica == AlwaysFromPrimary?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understood it right - it can be if the user did a specific command routing to a replica node. I agree that if AlwaysFromPrimary is the readfrom strategy then we can't execute read commands on replicas (as we haven't ran READONLY on the existing connections), but what if the user did some management command?
for example, client.info(route=KeySlot(Replica, 1))

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe this should be documented. I agree it's not intuitive.
slot_addr is where the command can go, read_from_replica represents whether the command can be sent to a replica. So, some commands can go to replicas, but only if read_from_replica is true.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't slot_addr be the address that is passed by command routing (as I wrote in the example above)?
We should allow sending non key based commands to replica if the user routes it this way

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how would you differentiate between a command that was routed by the user to a replica, and a command that was routed by its parsing to a replica?
Maybe we need 3 cases for SlotAddr: Primary, ForceReplica, ReplicaIfReadFromReplicaIsEnabled ?

Copy link

@barshaul barshaul Sep 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that Route shouldn't be used for read from replica routing, but only for user routing / pre-fixed routing (e.g. allNodes). So get_address_from_slot would get the routing info and read_from_replica strategy, and if routing isn't none - it will use the routing. Otherwise, it will need to decide based on the read_from_replica strategy if we should route it to a primary or to a replica.
Is it feasible?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry. I don't understand your suggesion. Can you write the signature you want get_address_from_slot to have?
and if Route is only for user routing, where/how will we get the slot for keyed commands, and whether they're readonly commands?

match read_from_replica {
ReadFromReplicaStrategy::AlwaysFromPrimary => slot.addrs.primary.as_str(),
ReadFromReplicaStrategy::RoundRobin => {
let index = slot

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if all replicas failed, but still in the replicas addresses, do we have a fallback to the primary?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question.

if slot_addr == SlotAddr::Master || slot.addrs.replicas.is_empty() {
        return slot.addrs.primary.as_str();
    }

ensures that if the SlotMap doesn't know about the replica, then the command will go to a primary. But if the ConnectionMap has removed a replica, the slot map isn't updated about this. This will cause the SlotMap to return a connection that the connection map doesn't know about, which will cause the cluster connection to send the request to a random node. problem.

Sounds like we can't keep the separation between connection and slot map.

Copy link

@barshaul barshaul Sep 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another scenario is if the replicas are present in the CLUSTER SLOTS output, but they aren't responsive (e.g. due to load or still wasn't recognized as failed in the cluster bus). Or, if the replicas have been filtered out from the cluster bus, but still in our connection map/slot map since we haven't done slots refreshment yet. So in both options we will face connection errors / timeouts.

So, basically- what i'm asking is how the retries mechanism of a read from replica command is implemented?
How do we handle a timeout/connection error when querying a replica? do we need to wait and retry to connect to this replica, or do we route it to the next replica? When do we decide to send it to the primary - only if no replicas are present?

I think we should try querying the next replica immediately. We should try to reconnect on the background, and/or start a topology refresh after several of connection errors, but as we have more nodes options to send this command to, I don't think we should delay the command execution.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that those decisions are about removing a connection from the connections manager. we can do that, but its beyond the scope of this PR.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not only about removing the connection from the connection manager, I think we need to define the retry mechanism for read from replicas. If we're basing on the current retry logic, then (correct me if i'm wrong) on connection error redis-rs blocks this command on trying to reconnect to this node and then retrying to send the command to the same node. So, even if we'll remove this replica connection from the connection manager after the connection fails, only the next command will be sent to a different replica. I think that a better solution would be to trigger a reconnect task on the background and immediately retry the command on the next replica / primary. But I agree it can be done in a separate PR,

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

connection error redis-rs blocks this command on trying to reconnect to this node and then retrying to send the command to the same node

AFAIS on every retry the cluster connection requests a new connection from the container, so if there are multiple replicas, the request will be rotated. The only case where the connection is specified is when using CommandRouting::Connection, which is only created on multi-node requests. When using CommandRouting::Route the route is resolved on each retry.

@nihohit nihohit force-pushed the round-robin branch 3 times, most recently from 7092d2a to b390580 Compare September 21, 2023 10:55
redis/src/cluster_routing.rs Outdated Show resolved Hide resolved
redis/src/cluster_routing.rs Outdated Show resolved Hide resolved
@nihohit nihohit merged commit ca6b9cb into main Sep 21, 2023
10 checks passed
@nihohit nihohit deleted the round-robin branch September 21, 2023 20:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants