Rename availability modes per review
kevin-bates committed Jun 13, 2022
1 parent 496b359 commit 44d8b2c
Showing 5 changed files with 39 additions and 34 deletions.
42 changes: 21 additions & 21 deletions docs/source/operators/config-availability.md
@@ -1,6 +1,6 @@
# Availability modes

Enterprise Gateway can be optionally configured in one of two "availability modes": _single-instance_ or _multi-instance_. When configured, Enterprise Gateway can recover from failures and reconnect to any active remote kernels that were previously managed by the terminated EG instance. As such, both modes require that kernel session persistence also be enabled via `KernelSessionManager.enable_persistence=True`.
Enterprise Gateway can be optionally configured in one of two "availability modes": _standalone_ or _replication_. When configured, Enterprise Gateway can recover from failures and reconnect to any active remote kernels that were previously managed by the terminated EG instance. As such, both modes require that kernel session persistence also be enabled via `KernelSessionManager.enable_persistence=True`.

```{note}
Kernel session persistence will be automatically enabled whenever availability mode is configured.
@@ -16,13 +16,13 @@ Known issues include:
We hope to address these in future releases (depending on demand).
```

## Single-instance availability
## Standalone availability

_Single-instance availability_ assumes that, upon failure of the original EG instance, another EG instance will be started. Upon startup of the second instance (following the termination of the first), EG will attempt to load and reconnect to all kernels that were deemed active when the previous instance terminated. This mode is somewhat analogous to the classic HA/DR mode of _active-passive_ and is typically used when node resources are at a premium or the number of replicas (in the Kubernetes sense) must remain at 1.
_Standalone availability_ assumes that, upon failure of the original EG instance, another EG instance will be started. Upon startup of the second instance (following the termination of the first), EG will attempt to load and reconnect to all kernels that were deemed active when the previous instance terminated. This mode is somewhat analogous to the classic HA/DR mode of _active-passive_ and is typically used when node resources are at a premium or the number of replicas (in the Kubernetes sense) must remain at 1.

To enable Enterprise Gateway for 'single-instance' availability, configure `EnterpriseGatewayApp.availability_mode=single-instance` or set env `EG_AVAILABILITY_MODE=single-instance`.
To enable Enterprise Gateway for 'standalone' availability, configure `EnterpriseGatewayApp.availability_mode=standalone` or set env `EG_AVAILABILITY_MODE=standalone`.

Here's an example for starting Enterprise Gateway with single-instance availability:
Here's an example for starting Enterprise Gateway with standalone availability:

```bash
#!/bin/bash
@@ -31,7 +31,7 @@ LOG=/var/log/enterprise_gateway.log
PIDFILE=/var/run/enterprise_gateway.pid

jupyter enterprisegateway --ip=0.0.0.0 --port_retries=0 --log-level=DEBUG \
--EnterpriseGatewayApp.availability_mode=single-instance > $LOG 2>&1 &
--EnterpriseGatewayApp.availability_mode=standalone > $LOG 2>&1 &

if [ "$?" -eq 0 ]; then
  echo $! > $PIDFILE
@@ -40,23 +40,23 @@ else
fi
```

## Multi-instance availability
## Replication availability

With _multi-instance availability_, multiple EG instances are operating at the same time, and fronted with some kind of reverse proxy or load balancer. Because state still resides within each `KernelManager` instance executing within a given EG instance, we strongly suggest configuring some form of _client affinity_ (a.k.a. "sticky session") to avoid node switches wherever possible since each node switch requires manual reconnection of the front-end (today).
With _replication availability_, multiple EG instances (or replicas) are operating at the same time, and fronted with some kind of reverse proxy or load balancer. Because state still resides within each `KernelManager` instance executing within a given EG instance, we strongly suggest configuring some form of _client affinity_ (a.k.a. "sticky session") to avoid node switches wherever possible since each node switch requires manual reconnection of the front-end (today).

```{tip}
Configuring client affinity is **strongly recommended**, otherwise functionality that relies on state within the servicing node (e.g., culling) can be affected upon node switches, resulting in incorrect behavior.
```

In this mode, when one node goes down, the subsequent request will be routed to a different node that doesn't know about the kernel. Prior to returning a `404` (not found) status code, EG will check its persisted store to determine if the kernel was managed and, if so, attempt to "hydrate" a `KernelManager` instance associated with the remote kernel. (Of course, if the kernel was running local to the downed server, chances are it cannot be _revived_.) Upon successful "hydration" the request continues as if on the originating node. Because _client affinity_ is in place, subsequent requests should continue to be routed to the "servicing node".
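
The lookup described in the paragraph above can be sketched as a small, standard-library-only model. This is an illustration under stated assumptions, not Enterprise Gateway's implementation: `GatewayNode`, `KernelNotFound`, and the in-memory `store` dictionary are hypothetical stand-ins for an EG instance, the `404` response, and the persisted session store.

```python
from typing import Dict, Optional

AVAILABILITY_STANDALONE = "standalone"
AVAILABILITY_REPLICATION = "replication"


class KernelNotFound(Exception):
    """Hypothetical stand-in for the 404 (not found) response."""


class GatewayNode:
    """Toy model of one EG instance; not the real KernelManager API."""

    def __init__(self, availability_mode: Optional[str], store: Dict[str, dict]):
        self.availability_mode = availability_mode
        self.store = store  # persisted sessions, shared by every node
        self.kernels: Dict[str, dict] = {}  # kernels this node currently manages

    def _refresh_kernel(self, kernel_id: str) -> bool:
        # Only 'replication' hydrates on demand; any other mode falls through to 404.
        if self.availability_mode != AVAILABILITY_REPLICATION:
            return False
        session = self.store.get(kernel_id)
        if session is None:
            return False
        self.kernels[kernel_id] = session  # "hydrate" the manager from the store
        return True

    def check_kernel_id(self, kernel_id: str) -> None:
        # Consult the persisted store before giving up on an unknown kernel.
        if kernel_id not in self.kernels and not self._refresh_kernel(kernel_id):
            raise KernelNotFound(f"Kernel does not exist: {kernel_id}")
```

A node created with `GatewayNode(AVAILABILITY_REPLICATION, store)` adopts a kernel it has never seen as long as the shared store knows about it, while a node in any other mode raises `KernelNotFound` for the same request.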

To enable Enterprise Gateway for 'multi-instance' availability, configure `EnterpriseGatewayApp.availability_mode=multi-instance` or set env `EG_AVAILABILITY_MODE=multi-instance`.
To enable Enterprise Gateway for 'replication' availability, configure `EnterpriseGatewayApp.availability_mode=replication` or set env `EG_AVAILABILITY_MODE=replication`.

```{attention}
To preserve backwards compatibility, if only kernel session persistence is enabled via `KernelSessionManager.enable_persistence=True`, the availability mode will be automatically configured to 'multi-instance' if `EnterpriseGatewayApp.availability_mode` is not configured.
To preserve backwards compatibility, if only kernel session persistence is enabled via `KernelSessionManager.enable_persistence=True`, the availability mode will be automatically configured to 'replication' if `EnterpriseGatewayApp.availability_mode` is not configured.
```
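
The interplay between the two settings (a configured mode switches persistence on, and persistence alone defaults the mode to 'replication') can be summarized in a few lines. The sketch below is a simplified model using the renamed values; `resolve_settings` and `reload_sessions_at_startup` are hypothetical helper names, not functions in Enterprise Gateway.

```python
from typing import Optional, Tuple

AVAILABILITY_STANDALONE = "standalone"
AVAILABILITY_REPLICATION = "replication"


def resolve_settings(
    enable_persistence: bool, availability_mode: Optional[str]
) -> Tuple[bool, Optional[str]]:
    """Return the effective (enable_persistence, availability_mode) pair."""
    if availability_mode is not None:
        # Configuring an availability mode enables persistence automatically.
        return True, availability_mode
    if enable_persistence:
        # Backwards compatibility: persistence without a mode means 'replication'.
        return True, AVAILABILITY_REPLICATION
    return False, None


def reload_sessions_at_startup(availability_mode: Optional[str]) -> bool:
    """Only 'standalone' reloads every persisted session when the gateway starts."""
    return availability_mode == AVAILABILITY_STANDALONE
```

For example, `resolve_settings(True, None)` yields `(True, "replication")`, which is the backwards-compatible default described in the note.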

Here's an example for starting Enterprise Gateway with multi-instance availability:
Here's an example for starting Enterprise Gateway with replication availability:

```bash
#!/bin/bash
@@ -65,7 +65,7 @@ LOG=/var/log/enterprise_gateway.log
PIDFILE=/var/run/enterprise_gateway.pid

jupyter enterprisegateway --ip=0.0.0.0 --port_retries=0 --log-level=DEBUG \
--EnterpriseGatewayApp.availability_mode=multi-instance > $LOG 2>&1 &
--EnterpriseGatewayApp.availability_mode=replication > $LOG 2>&1 &

if [ "$?" -eq 0 ]; then
  echo $! > $PIDFILE
@@ -74,25 +74,25 @@ else
fi
```

## Kernel Session Persistence
# Kernel Session Persistence

Enabling kernel session persistence allows Jupyter Notebooks to reconnect to kernels when Enterprise Gateway is restarted and forms the basis for the _availability modes_ described above. Enterprise Gateway provides two ways of persisting kernel sessions: _File Kernel Session Persistence_ and _Webhook Kernel Session Persistence_, although others can be provided by subclassing `KernelSessionManager` (see below).

```{attention}
Due to its experimental nature, kernel session persistence is disabled by default. To enable this functionality, you must configure `KernelSessionManager.enable_persistence=True` or configure `EnterpriseGatewayApp.availability_mode` to either `single-instance` or `multi-instance`.
Due to its experimental nature, kernel session persistence is disabled by default. To enable this functionality, you must configure `KernelSessionManager.enable_persistence=True` or configure `EnterpriseGatewayApp.availability_mode` to either `standalone` or `replication`.
```

As noted above, the availability modes rely on the persisted information relative to the kernel. This information consists of the arguments and options used to launch the kernel, along with its connection information. In essence, it consists of any information necessary to re-establish communication with the kernel.

### File Kernel Session Persistence
## File Kernel Session Persistence

File Kernel Session Persistence stores kernel sessions as files in a specified directory. To enable this form of persistence, set the environment variable `EG_KERNEL_SESSION_PERSISTENCE=True` or configure `FileKernelSessionManager.enable_persistence=True`. To change the directory in which the kernel session file is being saved, either set the environment variable `EG_PERSISTENCE_ROOT` or configure `FileKernelSessionManager.persistence_root` to the directory. By default, the directory used to store a given kernel's session information is the `JUPYTER_DATA_DIR`.
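
As a rough illustration of file-based persistence, the sketch below writes one JSON file per kernel beneath a root directory and reads it back. The `sessions/<kernel_id>.json` layout, the helper names, and the session fields are assumptions made for this example; the layout that `FileKernelSessionManager` actually uses may differ.

```python
import json
from pathlib import Path
from typing import Optional


def save_session(root: Path, kernel_id: str, session: dict) -> Path:
    """Persist one kernel session as JSON under <root>/sessions/ (assumed layout)."""
    sessions_dir = root / "sessions"
    sessions_dir.mkdir(parents=True, exist_ok=True)
    path = sessions_dir / f"{kernel_id}.json"
    path.write_text(json.dumps(session))
    return path


def load_session(root: Path, kernel_id: str) -> Optional[dict]:
    """Return the persisted session, or None if this kernel was never saved."""
    path = root / "sessions" / f"{kernel_id}.json"
    if not path.exists():
        return None
    return json.loads(path.read_text())
```

The saved dictionary would hold whatever is needed to re-establish communication with the kernel, such as its launch arguments and connection information.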

```{note}
Because `FileKernelSessionManager` is the default class for kernel session persistence, configuring `EnterpriseGatewayApp.kernel_session_manager_class` to `enterprise_gateway.services.sessions.kernelsessionmanager.FileKernelSessionManager` is not necessary.
```

### Webhook Kernel Session Persistence
## Webhook Kernel Session Persistence

Webhook Kernel Session Persistence stores all kernel sessions to any database. In order for this to work, an API must be created. The API must include four endpoints:

@@ -112,15 +112,15 @@ To enable the webhook kernel session persistence, set the environment variable `

Because `WebhookKernelSessionManager` is not the default kernel session persistence class, an additional configuration step must be taken to instruct EG to use this class: `EnterpriseGatewayApp.kernel_session_manager_class = enterprise_gateway.services.sessions.kernelsessionmanager.WebhookKernelSessionManager`.

#### Enabling Authentication
### Enabling Authentication

Enabling authentication is an option if the API requires it for requests. Set the environment variable `EG_AUTH_TYPE` or configure `WebhookKernelSessionManager.auth_type` to be either `Basic` or `Digest`. If it is set to an empty string, authentication won't be enabled.

Then set the environment variables `EG_WEBHOOK_USERNAME` and `EG_WEBHOOK_PASSWORD` or configure `WebhookKernelSessionManager.webhook_username` and `WebhookKernelSessionManager.webhook_password` to provide the username and password for authentication.

### Bring Your Own Kernel Session Persistence
## Bring Your Own Kernel Session Persistence

To introduce a different implementation, you must configure the kernel session manager class. Here's an example for starting Enterprise Gateway using a custom `KernelSessionManager` and 'single-instance' availability. Note that setting `--MyCustomKernelSessionManager.enable_persistence=True` is not necessary because an availability mode is specified, but displayed here for completeness:
To introduce a different implementation, you must configure the kernel session manager class. Here's an example for starting Enterprise Gateway using a custom `KernelSessionManager` and 'standalone' availability. Note that setting `--MyCustomKernelSessionManager.enable_persistence=True` is not necessary because an availability mode is specified, but displayed here for completeness:

```bash
#!/bin/bash
@@ -131,7 +131,7 @@ PIDFILE=/var/run/enterprise_gateway.pid
jupyter enterprisegateway --ip=0.0.0.0 --port_retries=0 --log-level=DEBUG \
--EnterpriseGatewayApp.kernel_session_manager_class=custom.package.MyCustomKernelSessionManager \
--MyCustomKernelSessionManager.enable_persistence=True \
--EnterpriseGatewayApp.availability_mode=single-instance > $LOG 2>&1 &
--EnterpriseGatewayApp.availability_mode=standalone > $LOG 2>&1 &

if [ "$?" -eq 0 ]; then
  echo $! > $PIDFILE
@@ -142,7 +142,7 @@ fi

Alternative persistence implementations using SQL and NoSQL databases would be ideal and, as always, contributions are welcome!

### Testing Kernel Session Persistence
## Testing Kernel Session Persistence

Once kernel session persistence has been enabled and configured, create a kernel by opening up a Jupyter Notebook. Save some variable in that notebook and shut down Enterprise Gateway using `kill -9 PID`, where `PID` is the PID of the gateway process. Restart Enterprise Gateway and refresh your notebook tab. If all worked correctly, the variable should be loaded without the need to rerun the cell.

8 changes: 4 additions & 4 deletions enterprise_gateway/enterprisegatewayapp.py
@@ -145,7 +145,7 @@ def init_configurables(self):
        # mode is not enabled, go ahead and default availability mode to 'multi-instance'.
        if self.kernel_session_manager.enable_persistence:
            if self.availability_mode is None:
                self.availability_mode = "multi-instance"
                self.availability_mode = EnterpriseGatewayConfigMixin.AVAILABILITY_REPLICATION
                self.log.info(
                    f"Kernel session persistence is enabled but availability mode is not. "
                    f"Setting EnterpriseGatewayApp.availability_mode to '{self.availability_mode}'."
@@ -161,7 +161,7 @@ def init_configurables(self):
)

        # If we're using single-instance availability, attempt to start persisted sessions
        if self.availability_mode == "single-instance":
        if self.availability_mode == EnterpriseGatewayConfigMixin.AVAILABILITY_STANDALONE:
            self.kernel_session_manager.start_sessions()

        self.contents_manager = None  # Gateways don't use contents manager
@@ -272,11 +272,11 @@ def _build_ssl_options(self) -> Optional[ssl.SSLContext]:
        return ssl_context

    def init_http_server(self):
        """Initializes a HTTP server for the Tornado web application on the
        """Initializes an HTTP server for the Tornado web application on the
        configured interface and port.
        Tries to find an open port if the one configured is not available using
        the same logic as the Jupyer Notebook server.
        the same logic as the Jupyter Notebook server.
        """
        ssl_options = self._build_ssl_options()
        self.http_server = httpserver.HTTPServer(
7 changes: 4 additions & 3 deletions enterprise_gateway/mixins.py
@@ -682,14 +682,15 @@ def dynamic_config_interval_changed(self, event):
    dynamic_config_poller = None

    # Availability Mode
    AVAILABILITY_STANDALONE = "standalone"
    AVAILABILITY_REPLICATION = "replication"
    availability_mode_env = "EG_AVAILABILITY_MODE"
    availability_mode_default_value = None
    availability_mode = CaselessStrEnum(
        allow_none=True,
        values=["multi-instance", "single-instance"],
        values=[AVAILABILITY_REPLICATION, AVAILABILITY_STANDALONE],
        config=True,
        help="""Specifies the type of availability. Values must be one of "single-instance" or "multi-instance".
        Configuration of this this option requires that KernelSessionManager.enable_persistence is True.
        help="""Specifies the type of availability. Values must be one of "standalone" or "replication".
        (EG_AVAILABILITY_MODE env var)""",
    )
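
To see what the case-insensitive enumeration implies for operators after the rename, here is a standard-library approximation of the validation. It is a sketch, not the traitlets implementation: `availability_mode_from_env` is a hypothetical helper, and reading `EG_AVAILABILITY_MODE` directly from the environment mapping is an assumption about how the default is supplied.

```python
import os
from typing import Mapping, Optional

AVAILABILITY_MODES = ("replication", "standalone")


def availability_mode_from_env(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the canonical mode named by EG_AVAILABILITY_MODE, or None if unset."""
    raw = env.get("EG_AVAILABILITY_MODE")
    if raw is None:
        return None  # mirrors allow_none=True with a default of None
    for mode in AVAILABILITY_MODES:
        if mode.lower() == raw.lower():
            return mode  # case-insensitive match, canonical spelling returned
    raise ValueError(
        f"availability_mode must be one of {AVAILABILITY_MODES}, not {raw!r}"
    )
```

Under this model, `Standalone` and `STANDALONE` are both accepted and normalized, whereas the pre-rename values `single-instance` and `multi-instance` are rejected, so deployments that set the old names would need updating.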

9 changes: 5 additions & 4 deletions enterprise_gateway/services/kernels/remotemanager.py
@@ -163,10 +163,11 @@ def check_kernel_id(self, kernel_id):
raise web.HTTPError(404, "Kernel does not exist: %s" % kernel_id)

    def _refresh_kernel(self, kernel_id) -> bool:
        if not self.parent.availability_mode or self.parent.availability_mode == "single-instance":
            return False
        self.parent.kernel_session_manager.load_session(kernel_id)
        return self.parent.kernel_session_manager.start_session(kernel_id)
        if self.parent.availability_mode == EnterpriseGatewayConfigMixin.AVAILABILITY_REPLICATION:
            self.parent.kernel_session_manager.load_session(kernel_id)
            return self.parent.kernel_session_manager.start_session(kernel_id)
        # else we should throw 404 when not using an availability mode of 'replication'
        return False

    async def start_kernel(self, *args, **kwargs):
        """
7 changes: 5 additions & 2 deletions enterprise_gateway/tests/test_gatewayapp.py
@@ -9,6 +9,7 @@
from tornado.testing import AsyncHTTPTestCase, ExpectLog

from enterprise_gateway.enterprisegatewayapp import EnterpriseGatewayApp
from enterprise_gateway.mixins import EnterpriseGatewayConfigMixin

RESOURCES = os.path.join(os.path.dirname(__file__), "resources")

@@ -49,7 +50,9 @@ def _assert_envs_to_traitlets(self, env_prefix: str):
        self.assertEqual(app.ssl_version, 3)
        if env_prefix == "EG_":  # These options did not exist in JKG
            self.assertEqual(app.kernel_session_manager.enable_persistence, True)
            self.assertEqual(app.availability_mode, "multi-instance")
            self.assertEqual(
                app.availability_mode, EnterpriseGatewayConfigMixin.AVAILABILITY_REPLICATION
            )

    def test_config_env_vars_bc(self):
        """B/C env vars should be honored for traitlets."""
@@ -96,7 +99,7 @@ def test_config_env_vars(self):
        os.environ["EG_SSL_VERSION"] = "3"
        os.environ[
            "EG_KERNEL_SESSION_PERSISTENCE"
        ] = "True"  # availability mode will be defaulted to multi-instance
        ] = "True"  # availability mode will be defaulted to replication

        self._assert_envs_to_traitlets("EG_")
