Availability modes #1095
# If we're using active-passive availability, attempt to start persisted sessions
if self.availability_mode == "active-passive":
    self.kernel_session_manager.start_sessions()
@kevin-bates: I went over the previous comments on this decision, but I still don't completely understand why we wouldn't want to load all the sessions from persistence at server start, irrespective of the availability_mode.
Yeah, I've had similar discussions with myself. 😄 (It's good to have someone else to talk with about this!)
I think the primary issue here is affinity. If we load the sessions in active-active, then all EG nodes will have a KernelManager thinking they are managing the kernel. In active-active, a "second" node will only manage a "previously managed" kernel when the "previously-managing node" has gone down, so there's still only one node managing the kernel (because we always require "node affinity").
Perhaps the terms active-active and active-passive are not quite the right way to describe these modes. As @dnwe pointed out in the issue, they use active-passive as more of a form of resiliency than HA (I suppose more for DR). Perhaps we could spin active-active to HA and active-passive to DR?
Thoughts?
My thoughts on naming the availability modes:
- "active-passive" -> "single_instance"
- "active-active" -> "multi_instance"
I like these names as they essentially describe the expected configuration of each and don't attempt to overload or conflate the meanings of the classic HA/DR terms.
I would like to continue using hyphens as the separators in the string values. (I view underscores more for variable names and constants.) So let's go with "single-instance" (where "active-passive" is used) and "multi-instance" (where "active-active" is used). Does that sound okay?
I've updated the values to use the instance references. Note that I also added code to auto-enable kernel session persistence if not set when availability mode is set. It felt a little overbearing to require the persistence setting when it's required to use "availability". So, rather than throw an exception, we'll log an informational message.
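The auto-enable behavior described in this comment could look roughly like the following. This is a minimal sketch and not the code from this PR: the helper name `ensure_persistence`, its arguments, and the logger name are all hypothetical.

```python
import logging

log = logging.getLogger("availability-sketch")  # hypothetical logger name


def ensure_persistence(availability_mode, enable_persistence):
    """Return the effective persistence setting for a given availability mode.

    Hypothetical helper: if an availability mode is set but persistence is
    not, enable persistence and log an informational message instead of
    raising an exception.
    """
    if availability_mode is not None and not enable_persistence:
        log.info(
            "Kernel session persistence is required for availability mode '%s'; enabling it.",
            availability_mode,
        )
        return True
    return enable_persistence
```

With this shape, a user who sets only the availability mode gets a working configuration plus a log entry explaining what was changed on their behalf.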
@kevin-bates: I was going over some other service documentation, where I found two other terms used to describe similar availability scenarios:
- active-passive -> `standalone`
- active-active -> `replication`
Hmm - I think I like these names over "single-instance" and "multi-instance", especially if there's precedent.
@lresende - you just approved this PR. Are you okay with going with the names "Standalone" and "Replication"?
Known issues include:
1. Culling configurations do not account for different nodes and therefore could result in the premature culling of kernels.
2. Each "node switch" requires a manual reconnect to the kernel.
Are the above issues only with "active-active" mode and not with "active-passive" mode of EG?
I think reconnecting is necessary for both forms.
Even with "active-active", because we still expect/advise affinity with the managed kernel, you shouldn't run into an issue where the kernel is culled prematurely because it should always stay on the originating node. Only if the affinity is not configured (or not working) could the kernel be culled prematurely from the previous node.
I'll look into some better wording for this, but we should probably better understand where things are with this before merging. Thanks for this comment.
# If we're using single-instance availability, attempt to start persisted sessions
if self.availability_mode == "single-instance":
Can we define constants or static variables for these availability_mode values so that they can be used across modules and files?
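One way to act on this suggestion is to keep the mode strings in a single module and import them wherever they are compared. The sketch below is only an illustration: the constant names and the `is_valid_availability_mode` helper are hypothetical, and the string values are the ones used at this point in the review.

```python
# Hypothetical module-level constants; the names are illustrative only.
AVAILABILITY_SINGLE_INSTANCE = "single-instance"
AVAILABILITY_MULTI_INSTANCE = "multi-instance"
AVAILABILITY_MODES = (AVAILABILITY_SINGLE_INSTANCE, AVAILABILITY_MULTI_INSTANCE)


def is_valid_availability_mode(mode):
    """Accept None (availability not configured) or one of the known mode strings."""
    return mode is None or mode in AVAILABILITY_MODES
```

Comparisons such as `self.availability_mode == AVAILABILITY_SINGLE_INSTANCE` then survive a later rename of the string values with a one-line change.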
@@ -162,7 +162,9 @@ def check_kernel_id(self, kernel_id):
             self.parent.kernel_session_manager.delete_session(kernel_id)
             raise web.HTTPError(404, "Kernel does not exist: %s" % kernel_id)

-    def _refresh_kernel(self, kernel_id):
+    def _refresh_kernel(self, kernel_id) -> bool:
+        if not self.parent.availability_mode or self.parent.availability_mode == "single-instance":
The thought here is that, in `single-instance` mode, the kernels are already hydrated when the EG server starts, so there is no need to check persistence for the kernel?
Correct. The multi-kernel manager should be aware of all active kernels in this case.
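The decision confirmed in this exchange can be sketched as below. This is an assumption-laden illustration rather than the PR's implementation: plain dictionaries stand in for the in-memory kernel manager and the persistent session store, and both function names are hypothetical.

```python
def should_check_persistence(availability_mode):
    """Decide whether an unknown kernel id warrants a lookup in the persistent store.

    With no availability mode, or in "single-instance" mode (where persisted
    sessions are loaded at startup), the in-memory view is treated as
    authoritative and no lookup is made.
    """
    return bool(availability_mode) and availability_mode != "single-instance"


def refresh_kernel(kernel_id, availability_mode, known_kernels, persisted_sessions):
    """Hypothetical stand-in for a refresh step; returns True if the kernel was loaded."""
    if not should_check_persistence(availability_mode):
        return False
    session = persisted_sessions.get(kernel_id)
    if session is None:
        return False
    # Adopt the persisted session so this node now knows about the kernel.
    known_kernels[kernel_id] = session
    return True
```

Only the multi-instance case reaches the persistent store, which matches the expectation that another node may have created the session.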
LGTM. Approving it.
I still need to apply the final name changes with "Standalone" and "Replication" so let's not merge yet.
Need to rework the docs now that #1101 has been merged.
464f59d to 44d8b2c
44d8b2c to cad1d85
#1086 references an issue whereby the loading of persisted kernel sessions at EG's startup was commented out when the changes for #737 were merged. PR #737 essentially made it possible to run multiple instances of EG simultaneously, emulating active-active availability. The previous code, on the other hand, emulated more of an active-passive behavior in which only a single EG instance is running, but with a higher degree of resiliency, as pointed out in #1086. Some users have found that functionality helpful and we should try to accommodate that use case as well.
This pull request introduces a configurable option named `availability_mode` that can hold one of three values: `None` (default), `active-active`, and `active-passive`. Both non-none values require that kernel session persistence also be enabled. Since 'active-active' was essentially the default behavior (when kernel session persistence was enabled), we will automatically set the `availability_mode` to `active-active` whenever kernel session persistence is enabled and availability mode is not - thereby providing a form of backward compatibility.
Users desiring a single-instanced EG that is capable of restarting following an unexpected failure can now use the availability mode of 'active-passive'.
These modes (including kernel session persistence) can be enabled via a configuration file, command line, or environment variables as noted in the documentation or when running `jupyter enterprisegateway --help-all`.
As noted in the companion documentation, this functionality should be considered experimental!
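The backward-compatibility rule in the description can be illustrated with a small function. This is a sketch under stated assumptions, not the PR's code: the name `resolve_availability_mode` and its parameters are hypothetical, and the mode strings are the ones used in the description.

```python
def resolve_availability_mode(availability_mode, persistence_enabled):
    """Apply the backward-compatible default for the availability mode.

    Hypothetical helper: when persistence is enabled but no mode was chosen,
    fall back to "active-active", which was the effective behavior before
    the option existed. Explicit choices are returned unchanged.
    """
    if availability_mode is None and persistence_enabled:
        return "active-active"
    return availability_mode
```

Existing deployments that only enabled persistence therefore keep their prior behavior, while a deployment with neither setting stays at `None`.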
Resolves: #1086