-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement round robin read from replica strategy for async cluster. #24
Conversation
8c9bde8
to
4307936
Compare
ping @barshaul |
return slot.addrs.primary.as_str(); | ||
} | ||
match read_from_replica { | ||
ReadFromReplicaStrategy::AlwaysFromPrimary => slot.addrs.primary.as_str(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When can we get to a scenario where slot_addr == Replica, but read_from_replica == AlwaysFromPrimary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understood it right - it can be if the user did a specific command routing to a replica node. I agree that if AlwaysFromPrimary is the readfrom strategy then we can't execute read commands on replicas (as we haven't ran READONLY on the existing connections), but what if the user did some management command?
for example, client.info(route=KeySlot(Replica, 1))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe this should be documented. I agree it's not intuitive.
slot_addr
is where the command can go, read_from_replica
represents whether the command can be sent to a replica. So, some commands can go to replicas, but only if read_from_replica
is true.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can't slot_addr be the address that is passed by command routing (as I wrote in the example above)?
We should allow sending non key based commands to replica if the user routes it this way
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how would you differentiate between a command that was routed by the user to a replica, and a command that was routed by its parsing to a replica?
Maybe we need 3 cases for SlotAddr: Primary, ForceReplica, ReplicaIfReadFromReplicaIsEnabled ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that Route shouldn't be used for read from replica routing, but only for user routing / pre-fixed routing (e.g. allNodes). So get_address_from_slot would get the routing info and read_from_replica strategy, and if routing isn't none - it will use the routing. Otherwise, it will need to decide based on the read_from_replica strategy if we should route it to a primary or to a replica.
Is it feasible?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sorry. I don't understand your suggesion. Can you write the signature you want get_address_from_slot
to have?
and if Route
is only for user routing, where/how will we get the slot for keyed commands, and whether they're readonly commands?
match read_from_replica { | ||
ReadFromReplicaStrategy::AlwaysFromPrimary => slot.addrs.primary.as_str(), | ||
ReadFromReplicaStrategy::RoundRobin => { | ||
let index = slot |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if all replicas failed, but still in the replicas addresses, do we have a fallback to the primary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question.
if slot_addr == SlotAddr::Master || slot.addrs.replicas.is_empty() {
return slot.addrs.primary.as_str();
}
ensures that if the SlotMap
doesn't know about the replica, then the command will go to a primary. But if the ConnectionMap
has removed a replica, the slot map isn't updated about this. This will cause the SlotMap to return a connection that the connection map doesn't know about, which will cause the cluster connection to send the request to a random node. problem.
Sounds like we can't keep the separation between connection and slot map.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another scenario is if the replicas are present in the CLUSTER SLOTS output, but they aren't responsive (e.g. due to load or still wasn't recognized as failed in the cluster bus). Or, if the replicas have been filtered out from the cluster bus, but still in our connection map/slot map since we haven't done slots refreshment yet. So in both options we will face connection errors / timeouts.
So, basically- what i'm asking is how the retries mechanism of a read from replica command is implemented?
How do we handle a timeout/connection error when querying a replica? do we need to wait and retry to connect to this replica, or do we route it to the next replica? When do we decide to send it to the primary - only if no replicas are present?
I think we should try querying the next replica immediately. We should try to reconnect on the background, and/or start a topology refresh after several of connection errors, but as we have more nodes options to send this command to, I don't think we should delay the command execution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that those decisions are about removing a connection from the connections manager. we can do that, but its beyond the scope of this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not only about removing the connection from the connection manager, I think we need to define the retry mechanism for read from replicas. If we're basing on the current retry logic, then (correct me if i'm wrong) on connection error redis-rs blocks this command on trying to reconnect to this node and then retrying to send the command to the same node. So, even if we'll remove this replica connection from the connection manager after the connection fails, only the next command will be sent to a different replica. I think that a better solution would be to trigger a reconnect task on the background and immediately retry the command on the next replica / primary. But I agree it can be done in a separate PR,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
connection error redis-rs blocks this command on trying to reconnect to this node and then retrying to send the command to the same node
AFAIS on every retry the cluster connection requests a new connection from the container, so if there are multiple replicas, the request will be rotated. The only case where the connection is specified is when using CommandRouting::Connection
, which is only created on multi-node requests. When using CommandRouting::Route
the route is resolved on each retry.
7092d2a
to
b390580
Compare
b390580
to
350dfc5
Compare
based over https://github.com/nihohit/redis-rs/pull/22 and https://github.com/nihohit/redis-rs/pull/23 .