Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added management connections to be used with the periodic checks #56

Closed
wants to merge 5 commits into from

Conversation

barshaul
Copy link

@barshaul barshaul commented Oct 3, 2023

This PR adds another connection (called 'management connection') for each cluster node so they could be used in the periodic checks, if the periodic checks is enabled.

Rebased over #64 , #90

Benchmark results:

<style> </style>
test num_of_tasks data_size Average of tps get non existing average latency get existing average latency set average latency TPS diff
python_baseline 1 100 939 0.001 0.001 0.001  
python_baseline   4000 921 0.001 0.001 0.001  
python_baseline 10 100 6556 0.002 0.002 0.002  
python_baseline   4000 6321 0.002 0.002 0.002  
python_baseline 100 100 26655 0.004 0.004 0.004  
python_baseline   4000 22485 0.004 0.004 0.004  
python_baseline 1000 100 28057 0.036 0.035 0.036  
python_baseline   4000 24573 0.041 0.042 0.041  
python_mange_conn_120_sec 1 100 937 0.001 0.001 0.001 99.83%
python_mange_conn_120_sec   4000 937 0.001 0.001 0.001 101.78%
python_mange_conn_120_sec 10 100 6491 0.002 0.002 0.002 99.01%
python_mange_conn_120_sec   4000 6400 0.002 0.002 0.002 101.26%
python_mange_conn_120_sec 100 100 25537 0.004 0.004 0.004 95.81%
python_mange_conn_120_sec   4000 22479 0.004 0.004 0.004 99.97%
python_mange_conn_120_sec 1000 100 26967 0.037 0.038 0.037 96.12%
python_mange_conn_120_sec   4000 25546 0.038 0.040 0.037 103.96%
python_mang_conn_60_sec 1 100 951 0.001 0.001 0.001 101.30%
python_mang_conn_60_sec   4000 927 0.001 0.001 0.001 100.60%
python_mang_conn_60_sec 10 100 6374 0.002 0.002 0.002 97.22%
python_mang_conn_60_sec   4000 6437 0.002 0.002 0.002 101.84%
python_mang_conn_60_sec 100 100 26341 0.004 0.004 0.004 98.82%
python_mang_conn_60_sec   4000 22788 0.004 0.004 0.004 101.35%
python_mang_conn_60_sec 1000 100 26090 0.038 0.036 0.039 92.99%
python_mang_conn_60_sec   4000 25010 0.040 0.040 0.040 101.78%
Average TPS diff             99.60%

@barshaul barshaul force-pushed the management_conn branch 3 times, most recently from f8c92a1 to 2aa84ce Compare October 4, 2023 13:59
@barshaul barshaul marked this pull request as ready for review October 4, 2023 13:59
@barshaul barshaul requested a review from nihohit October 4, 2023 14:09
@barshaul barshaul changed the title WIP: Added management connections to be used with the periodic checks Added management connections to be used with the periodic checks Oct 4, 2023
Copy link

@nihohit nihohit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

initial comments.

This PR is too large to review effectively. Please remove unnecessary changes, and if possible split to PRs, if not possible split to commits.

redis/src/cluster_async/connections_container.rs Outdated Show resolved Hide resolved
redis/src/cluster_async/connections_container.rs Outdated Show resolved Hide resolved
redis/src/cluster_async/connections_container.rs Outdated Show resolved Hide resolved
redis/src/cluster_async/connections_container.rs Outdated Show resolved Hide resolved
@@ -11,18 +11,40 @@ type IdentifierType = ArcStr;

#[derive(Clone, Eq, PartialEq, Debug)]
pub(crate) struct ClusterNode<Connection> {
pub connection: Connection,
pub ip: Option<IpAddr>,
user_connection: Connection,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did you decide to hide all of the fields?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just so if we'll have a connection pool in the future for user/management connections we'll have less code change, so I kept the same line of get functions for IP too. do you prefer we'll directly access the variables?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see how this reduces the code change in such a situation.

do you prefer we'll directly access the variables?

Yes, empty getters are usually avoided, unless there's an intention for encapsulation.

Copy link
Author

@barshaul barshaul Oct 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see how this reduces the code change in such a situation.

now:

fn get_connection {
    self.connection
}

later

fn get_connection {
if pool.empty() {}
else {
pool.get-connection()
}

or something like that.
but I can remove it for now

redis/src/cluster_async/mod.rs Outdated Show resolved Hide resolved
redis/src/cluster_async/mod.rs Show resolved Hide resolved
redis/src/cluster_async/mod.rs Outdated Show resolved Hide resolved
let requested_nodes = read_guard.random_connections(num_of_nodes_to_query);
let requested_nodes = read_guard.random_connections(
num_of_nodes_to_query,
inner.cluster_params.management_connections_enabled(),
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this means that refresh_slots ALWAYS uses management connections, if they are enabled? even if it wasn't triggered by periodic check?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, do you see any downside with it?
It solves the issue of blocking/pubsub connections

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, but since you added no documentation or explanations to the PR, I'm trying to understand all of the changes and the reasons behind them.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll need to think about this, and understand what's the ramifications.

let (mut conn, ip) = C::connect(info, socket_addr)
.timeout(connection_timeout)
.await??;
let connection_timeout: futures_time::time::Duration = params.connection_timeout.into();
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is the type definition needed?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

without it - "..type inside async fn body must be known in this context"

@nihohit
Copy link

nihohit commented Oct 26, 2023

why is management_connections_enabled configurable? what will be the default that we offer the users, and when would the users prefer to change the configuration?

@nihohit
Copy link

nihohit commented Oct 26, 2023

and re: the benchmarks - the 100 bytes results are pretty bad. Is that the average of multiple runs?
Also, does the benchmark result change if you use this branch, but change management_connections_enabled?

@barshaul
Copy link
Author

@nihohit Updated benchmark results

@nihohit
Copy link

nihohit commented Oct 29, 2023

Those are averages of multiple runs? I'm trying to understand how robust is the 7% difference in one scenario.

It's generally preferable to benchmark with Node, since the higher TPS mean that there's less overhead to mask/compensate for small effects, and the benchmarks conclude faster, which allows first more iterations.

@nihohit
Copy link

nihohit commented Nov 2, 2023

what's the 60_sec / 120_sec difference in the benchmarks?

redis/tests/test_cluster_async.rs Show resolved Hide resolved
redis/tests/test_cluster_async.rs Show resolved Hide resolved
redis/src/cluster_async/mod.rs Show resolved Hide resolved
redis/src/cluster_async/mod.rs Outdated Show resolved Hide resolved
redis/src/cluster_async/mod.rs Outdated Show resolved Hide resolved
redis/src/cluster_async/mod.rs Outdated Show resolved Hide resolved
@barshaul
Copy link
Author

barshaul commented Nov 6, 2023

@nihohit Node benchmark results:

<style> </style>
language num_of_tasks data_size Average of tps Average of get_non_existing_average_latency Average of get_existing_average_latency Average of set_average_latency TPS diff
node_periodic_check_enabled_60_sec_managed_conn 1 100 1009 0.989 0.990 0.996 101.77%
    4000 971 1.005 1.031 1.044 99.79%
  10 100 6168 1.621 1.621 1.625 101.48%
    4000 6083 1.628 1.652 1.639 103.02%
  100 100 55365 1.805 1.806 1.807 98.72%
    4000 48900 2.042 2.057 2.011 98.91%
  1000 100 119312 8.379 8.381 8.382 101.27%
    4000 69673 14.352 14.366 14.330 100.95%
node_periodic_check_enabled_60_secs_non_managed_conn 1 100 1001 0.995 0.997 1.008 100.18%
    4000 975 1.002 1.027 1.041 98.43%
  10 100 5983 1.669 1.670 1.675 102.44%
    4000 6049 1.638 1.660 1.646 99.75%
  100 100 55941 1.787 1.788 1.787 98.17%
    4000 48530 2.057 2.074 2.026 100.64%
  1000 100 118565 8.434 8.434 8.436 99.49%
    4000 69323 14.430 14.444 14.410 101.11%
node_periodic_check_disabled_management_branch 1 100 1003 0.995 0.995 1.006 100.33%
    4000 976 1.001 1.027 1.036 99.28%
  10 100 6034 1.657 1.658 1.660 103.69%
    4000 6123 1.617 1.642 1.625 99.12%
  100 100 55591 1.800 1.802 1.802 99.80%
    4000 49339 2.025 2.039 1.992 99.61%
  1000 100 117357 8.520 8.522 8.522 100.01%
    4000 69690 14.349 14.359 14.323  
node_periodic_check_disabled_main_branch (baseline) 1 100 992 1.005 1.007 1.015  
    4000 973 1.004 1.028 1.043  
  10 100 6078 1.643 1.645 1.649  
    4000 5905 1.676 1.699 1.688  
  100 100 56082 1.782 1.783 1.785  
    4000 49437 2.017 2.035 1.990  
  1000 100 117814 8.487 8.488 8.490  
    4000 69680 14.349 14.362 14.326  

@barshaul
Copy link
Author

barshaul commented Nov 6, 2023

why is management_connections_enabled configurable? what will be the default that we offer the users, and when would the users prefer to change the configuration?

@nihohit
We should separate between the use of managed connections to the periodic checks (i'll refactor it).
For periodic checks, ATM when we have issues with how we compare CLUSTER SLOTS results, we should keep it disabled by default, and enable it once the fix is done.
For management connections, we should use them by default and make it non-configurable.
WDYT?

@barshaul barshaul force-pushed the management_conn branch 6 times, most recently from 951617c to 5a3e7b0 Compare November 12, 2023 14:59
@barshaul
Copy link
Author

@nihohit round

@barshaul barshaul force-pushed the management_conn branch 2 times, most recently from c4d680e to 0fc47c5 Compare December 18, 2023 10:40
@barshaul
Copy link
Author

@shachlanAmazon round

@barshaul barshaul force-pushed the management_conn branch 2 times, most recently from fe1237c to 42dffe6 Compare December 21, 2023 16:57
@barshaul
Copy link
Author

@shachlanAmazon Lets have a code walkthrough on Sunday and i'll go through the latest changes I did. Benchmarks are in progress

Copy link

@nihohit nihohit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main issue I have is that you've added a lot of logic & branching, and the vast majority of it isn't tested. I'm not sure that it's possible to test this using the existing cluster testing infra structure - maybe the connection management logic should be spun off to a separate file / module, and tested separately?

redis/src/cluster_async/mod.rs Outdated Show resolved Hide resolved
redis/src/cluster_async/mod.rs Outdated Show resolved Hide resolved
redis/src/cluster_async/mod.rs Outdated Show resolved Hide resolved
redis/src/cluster_async/mod.rs Outdated Show resolved Hide resolved
redis/src/cluster_async/mod.rs Outdated Show resolved Hide resolved
redis/src/cluster_async/mod.rs Outdated Show resolved Hide resolved
@barshaul
Copy link
Author

@shachlanAmazon Benchmark results in rust show decrease in up to ~5%. I'm not sure why these changes affect the happy path, I also tested mainline with periodic checks without the addition of management connections and I haven't seen this degradation. I'll add benchmark results of this branch without periodic checks and mainline with periodic checks, and node benchmark results.
Anyway, we decided we won't put too much effort at the moment when the degradation is up to 5%, but if you see something that pops out or if you want to help with quick perf to see if something pops out, let me know.

<style> </style>
language client num_of_tasks data_size Average of tps Average of get_non_existing_average_latency Average of get_existing_average_latency Average of set_average_latency TPS diff
rust mainline_no_periodic 1 100 5038 0.198 0.199 0.200  
rust     4000 4789 0.185 0.214 0.211  
rust   10 100 33667 0.299 0.299 0.300  
rust     4000 30456 0.317 0.331 0.329  
rust   100 100 210654 0.474 0.475 0.475  
rust     4000 104847 0.949 0.957 0.946  
rust   1000 100 277776 3.598 3.599 3.599  
rust     4000 145832 6.854 6.855 6.859  
rust management_conn_60_sec 1 100 5057 0.196 0.199 0.199 100.37%
rust     4000 4758 0.189 0.213 0.218 99.35%
rust   10 100 31663 0.318 0.319 0.319 94.05%
rust     4000 28841 0.335 0.350 0.347 94.70%
rust   100 100 205164 0.486 0.487 0.488 97.39%
rust     4000 107063 0.931 0.937 0.928 102.11%
rust   1000 100 265918 3.759 3.759 3.760 95.73%
rust     4000 141053 7.086 7.087 7.096 96.72%

@barshaul barshaul force-pushed the management_conn branch 5 times, most recently from 7f80d5b to d31c834 Compare December 28, 2023 14:15
@barshaul barshaul force-pushed the management_conn branch 6 times, most recently from 65f67ce to edec2e8 Compare January 2, 2024 10:42
Copy link

@shachlanAmazon shachlanAmazon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

round.
absolutely great work on the tests. well done!

redis/src/cluster_async/connections_logic.rs Outdated Show resolved Hide resolved
redis/src/cluster_async/connections_logic.rs Outdated Show resolved Hide resolved
redis/src/cluster_async/connections_logic.rs Outdated Show resolved Hide resolved
redis/src/cluster_async/connections_logic.rs Outdated Show resolved Hide resolved
AsyncClusterNode::new(user_conn, management_conn, ip)
}

pub(crate) async fn connect_and_check_all_connections<C>(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all_connections -> both_connections
Initially I thought this is about running all connections in a cluster.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed to connect_and_check_all_node_connections

redis/tests/test_async_cluster_connections_logic.rs Outdated Show resolved Hide resolved
conn_utils.returned_ip_type = ConnectionIPReturnType::Specified(ip);
drop(write_lock);
runtime.block_on(async {
let node: RedisResult<AsyncClusterNode<MockConnection>> = connect_and_check(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: use type alias instead of repeating this long declaration throughout.

None,
)
.await;
assert!(node.is_ok() && node.unwrap().management_connection.is_none());

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

throughout the tests - can we use MockConnection::id (or some other tool) to know which connection was chosen?

.get_mut(name)
.unwrap_or_else(|| panic!("MockConnectionUtils for `{name}` were not installed"));
conn_utils.return_connection_err = ShouldReturnConnectionError::Yes;
let conn_type: RefreshConnectionType = RefreshConnectionType::AllConnections;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: why do you need the type declaration?

.await;
assert!(node.is_err());
let err = node.unwrap_err();
assert!(err

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: please also assert on err.kind(), throughout

@shachlanAmazon
Copy link

@barshaul do you want to keep this open?

@barshaul barshaul closed this Feb 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants