
Introduce a fast reconnect process for async cluster connections. #184

Merged
merged 3 commits into main from fast_reconnect_cand on Sep 4, 2024

Conversation


@ikolomi commented on Aug 18, 2024

Introduce a fast reconnect process for async cluster connections.
The process is periodic and can be configured via ClusterParams.
This process ensures that all expected user connections exist and have not been passively closed.
The expected connections are calculated from the current slot map.
Additionally, for the Tokio runtime, an instant disconnect notification is available, allowing the reconnect process to be triggered instantly without waiting for the periodic check.
This process is especially important for pub/sub support, as passive disconnects can render a pub/sub subscriber inoperative. Three integration tests are introduced with this feature: a generic fast reconnect test, pub/sub resilience to passive disconnects, and pub/sub resilience to scale-out.

Note! This PR must be followed by a PR to glide-core implementing similar functionality for CMD.

Issue #, if available:
valkey-io/valkey-glide#2042
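To make the mechanism concrete, here is a minimal, self-contained illustration of the interval-or-notification wait described above. This is not the PR's code; the validation step is only referenced in comments, and the function and parameter names are placeholders.

use std::{sync::Arc, time::Duration};
use tokio::{sync::Notify, time::timeout};

// Periodically wake up to validate the cluster connections, but wake immediately
// when a disconnect notification arrives (the PR wires this up for the Tokio runtime).
async fn reconnect_check_loop(disconnect_notify: Arc<Notify>, interval: Duration) {
    loop {
        // Sleep for the configured interval, or return early if notified.
        let _ = timeout(interval, disconnect_notify.notified()).await;
        // Here the PR recomputes the expected connections from the slot map and
        // reconnects any connection that is missing or passively closed.
    }
}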

@ikolomi requested a review from asafpamzn on August 18, 2024 12:37
@ikolomi force-pushed the fast_reconnect_cand branch 3 times, most recently from 4124c7b to 1592f57 on August 18, 2024 15:53
Comment on lines 152 to 153
push_sender: Option<mpsc::UnboundedSender<PushInfo>>,
disconnect_notifier: Option<Box<dyn DisconnectNotifier>>,


In the context of redis-rs, this would be a breaking change since it's an exposed user API. However, since we're only using it internally within Glide, and if we're OK with breaking these APIs, I think this is the right time to change this function to accept a ConnectionOptions struct (or another appropriate name) that holds all connection handlers/options internally. That would remove the need to modify the entire chain of internal function calls (like get_multiplexed_async_connection_with_timeouts, etc.) and to fix all tests that use these APIs each time we add a new option.

What do you think about changing it to:

pub struct ConnectionOptions {
    push_sender: Option<mpsc::UnboundedSender<PushInfo>>,
    disconnect_notifier: Option<Box<dyn DisconnectNotifier>>,
}

pub async fn get_multiplexed_async_connection(
        &self,
        connection_options: ConnectionOptions
)
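For illustration, a call site under this suggested API might look like the following. This is a hedged sketch: client, push_tx, and TokioDisconnectNotifier are assumed names, not necessarily what the PR uses.

// Hypothetical usage of the suggested struct; field values are placeholders.
let options = ConnectionOptions {
    push_sender: Some(push_tx),
    disconnect_notifier: Some(Box::new(TokioDisconnectNotifier::default())),
};
let mut connection = client.get_multiplexed_async_connection(options).await?;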

@ikolomi (Author): LGTM

@ikolomi (Author): done

Comment on lines 405 to 407
disconnect_notifier: Option<Box<dyn DisconnectNotifier>>,
#[cfg(feature = "tokio-comp")]
tokio_notify: Arc<Notify>,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of passing tokio notify in another parameter, why not to expand the DisconnectNotifier trait to have a notified API?

@ikolomi (Author): Yep, that was my original try, but got:

102 |     async fn notified(&mut self);
    |     -----^^^^^^^^^^^^^^^^^^^^^^^^
    |     |
    |     `async` because of this
    |
    = note: `async` trait functions are not currently supported

There might be some crates that do it, but I don't want to go down this rabbit hole.
Do you know how to do it?

@ikolomi (Author): done

#[async_trait::async_trait]
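This one-line reply points at the async-trait crate, which works around the "async trait functions are not currently supported" error quoted above by desugaring async trait methods into boxed futures. A minimal sketch of how it could apply here (the trait shape is assumed; only the notified method is taken from the diff quoted later in this conversation):

use async_trait::async_trait;

#[async_trait]
pub trait DisconnectNotifier: Send + Sync {
    // Signal that a connection was passively closed (assumed method name).
    fn notify_disconnect(&mut self);
    // Wait for a disconnect notification; async is allowed on the trait method
    // because #[async_trait] rewrites it into a boxed-future return type.
    async fn notified(&mut self);
}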

@@ -1145,22 +1217,93 @@ where
}
}

// Validate all existing user connections and try to reconnect if necessary.
// In addition, as a safety measure, drop nodes that do not have any assigned slots.


The problem with removing connections that aren’t found in the slot map is that we might inadvertently remove newly added nodes received through a MOVED error before they are added to the slot map. This issue will persist even after fixing MOVED errors to update the slot map for specific slots, because updating the slot map based on a MOVED error is handled inside the refresh_slots task, which is spawned separately. Meanwhile, new connections can be established in the get_connection method after a MOVED error and might execute before the refresh_slots task runs.

For example, consider the following scenario:

  1. A MOVED error is received with a new node address X.
  2. The refresh_slots task is spawned to update the specific slot or perform a full slot refresh.
  3. The request that received the MOVED error calls get_connection and creates a new connection for the moved node X.
  4. validate_all_user_connections is called, finds X in the connections map but not in the slots map, and removes it from the connection map.
  5. Another request encountering the same MOVED error doesn’t find the connection and creates a new one.
  6. This cycle continues until the node X is eventually added to the slots map, which might happen quickly or could take longer if a full slots refresh is required and multiple iterations are needed for it to complete.

Given that during a full refresh_slots operation the connection map is completely replaced with a new one that contains only the nodes from the newly discovered map, the risk of connection leaks accumulating over time is minimal. Therefore, weighing the tradeoff between prematurely removing connections (which could lead to repeatedly closing new connections, causing higher latency and risking connection storms) and temporarily keeping non-relevant connections, it might be safer to leave the cleanup to be handled solely by the refresh_slots process.

However, if we skip the cleanup, we need to ensure that nodes present in the connection map but not in the slot map aren't added to addrs_to_refresh. Otherwise, we risk repeatedly trying to refresh the connection of a stale node.
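To make that last requirement concrete, the guard could look roughly like this. This is a sketch only: all_nodes_with_slots is the set built from the slot map in the diff below, while candidate_addrs is an assumed placeholder name.

// Only queue addresses that the slot map still considers part of the cluster,
// so a stale node is never re-added to addrs_to_refresh.
let addrs_to_refresh: Vec<String> = candidate_addrs
    .into_iter()
    .filter(|addr| all_nodes_with_slots.contains(addr))
    .collect();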

@ikolomi (Author): Yes, I am aware of that behavior.

First of all, periodic syncing must include both adding and removing connections as required by the source of truth (the slot map); there is no way around this symmetry requirement.

Secondly, this behavior stems from step (3) being insufficient: if we create a connection, it means we believe it is valid, and so we should also update the slot map. Creating the connection alone is not sufficient.

I considered complementing step (3), but we discussed it and agreed that you'll handle it in a dedicated piece of work. Do you want me to do it in this PR?

@ikolomi (Author): Discussed; will be addressed in a dedicated piece of work.

Comment on lines 1233 to 1239
connections_container
    .slot_map
    .addresses_for_all_nodes()
    .iter()
    .for_each(|addr| {
        all_nodes_with_slots.insert(String::from(*addr));
    });


Suggested change
connections_container
.slot_map
.addresses_for_all_nodes()
.iter()
.for_each(|addr| {
all_nodes_with_slots.insert(String::from(*addr));
});
all_nodes_with_slots = connections_container
.slot_map
.addresses_for_all_nodes()
.iter()
.map(|addr| String::from(*addr))
.collect();

@ikolomi (Author): LGTM

@ikolomi (Author): done

let mut addrs_to_refresh = Vec::new();
for (addr, con_fut) in &all_valid_conns {
    let con = con_fut.clone().await;
    if con.is_closed() {


I think the distinction isn't clear between connections that are still present in the connection map with is_closed() returning true and those that have been removed because the client failed to reestablish their connection. Could you document this difference and when we expect to see each?

@ikolomi (Author): yes, will do

@ikolomi (Author): done

Comment on lines +1274 to +1280
// dont try existing nodes since we know a. it does not exist. b. exist but its connection is closed
Self::refresh_connections(
    inner.clone(),
    addrs_to_refresh,
    RefreshConnectionType::AllConnections,
    false,
)


Suggested change
// dont try existing nodes since we know a. it does not exist. b. exist but its connection is closed
Self::refresh_connections(
inner.clone(),
addrs_to_refresh,
RefreshConnectionType::AllConnections,
false,
)
Self::refresh_connections(
inner.clone(),
addrs_to_refresh,
RefreshConnectionType::AllConnections,
// dont check the existing connections since we know a. it does not exist, or b. exist but its connection is closed
false,
)

or

Suggested change
// dont try existing nodes since we know a. it does not exist. b. exist but its connection is closed
Self::refresh_connections(
inner.clone(),
addrs_to_refresh,
RefreshConnectionType::AllConnections,
false,
)
// dont check the existing connections since we know a. it does not exist, or b. exist but its connection is closed
let check_existing_conns = false;
Self::refresh_connections(
inner.clone(),
addrs_to_refresh,
RefreshConnectionType::AllConnections,
check_existing_conns,
)

@ikolomi (Author): the first option is better

@ikolomi (Author): done

async fn refresh_connections(
    inner: Arc<InnerCore<C>>,
    addresses: Vec<String>,
    conn_type: RefreshConnectionType,
    try_existing_node: bool,


This option isn't clear. Maybe rename it to check_existing_conn and document what it does.
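A sketch of the rename and documentation being asked for (the wording of the semantics is inferred from the surrounding discussion, not copied from the PR):

async fn refresh_connections(
    inner: Arc<InnerCore<C>>,
    addresses: Vec<String>,
    conn_type: RefreshConnectionType,
    // check_existing_conn: when true, take the node's existing connection (if any)
    // and verify it before reconnecting; when false, skip the check and always
    // establish a fresh connection.
    check_existing_conn: bool,
)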

@ikolomi (Author): yes

@ikolomi (Author): done

let node_option = if try_existing_node {
    connections_container.remove_node(&address)
} else {
    Option::None


Suggested change
Option::None
None

@ikolomi (Author): yep

@ikolomi (Author): done

Comment on lines 1543 to 1549
#[cfg(feature = "tokio-comp")]
let _ = timeout(interval_duration, async {
    inner.tokio_notify.notified().await;
})
.await;
#[cfg(not(feature = "tokio-comp"))]
let _ = boxed_sleep(interval_duration).await;


Suggested change
#[cfg(feature = "tokio-comp")]
let _ = timeout(interval_duration, async {
inner.tokio_notify.notified().await;
})
.await;
#[cfg(not(feature = "tokio-comp"))]
let _ = boxed_sleep(interval_duration).await;
#[cfg(all(not(feature = "tokio-comp"), feature = "async-std-comp"))]
use async_std::future::timeout;
#[cfg(feature = "tokio-comp")]
use tokio::time::timeout;
if let Some(notifier) = inner.disconnect_notifier {
    let _ = timeout(interval_duration, notifier.notified()).await;
} else {
let _ = boxed_sleep(interval_duration).await;
}
...
}

@ikolomi (Author): nice, will try

@ikolomi (Author): done

@ikolomi merged commit 426bb99 into main on Sep 4, 2024
10 checks passed