added results for HA scenarios
- scale_deployment_0_1-k3s3nodes-zeroload-emptydb
- s3wl-putobj-100ms-clusterip
- s3wl-putobj-100ms-ingress

(work in progress, to be amended)

Signed-off-by: Giuseppe Baccini <[email protected]>
Giuseppe Baccini committed Sep 22, 2023
1 parent 6a01bcc commit 7dbb407
Showing 19 changed files with 12,164 additions and 22 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -6,3 +6,5 @@ venv

*.swp
__pycache__/

tmp
123 changes: 101 additions & 22 deletions docs/research/ha/RATIONALE.md
@@ -17,14 +17,18 @@
- [Notes on testing s3gw within K8s](#notes-on-testing-s3gw-within-k8s)
- [EXIT-1, 10 measures](#exit-1-10-measures)
- [EXIT-0, 10 measures](#exit-0-10-measures)
- [Tested Scenarios - radosgw focused](#tested-scenarios---radosgw-focused)
- [Tested Scenarios - radosgw-restart](#tested-scenarios---radosgw-restart)
- [regular\_localhost\_zeroload\_emptydb](#regular_localhost_zeroload_emptydb)
- [segfault\_localhost\_zeroload\_emptydb](#segfault_localhost_zeroload_emptydb)
- [regular\_localhost\_load\_fio\_64\_write](#regular_localhost_load_fio_64_write)
- [regular\_localhost\_zeroload\_400\_800Kdb](#regular_localhost_zeroload_400_800kdb)
- [400K objects - measures done with the WAL file zeroed](#400k-objects---measures-done-with-the-wal-file-zeroed)
- [800K objects - measures done with the WAL file still to be processed (size 32G)](#800k-objects---measures-done-with-the-wal-file-still-to-be-processed-size-32g)
- [regular-localhost-incremental-fill-5k](#regular-localhost-incremental-fill-5k)
- [scale\_deployment\_0\_1-k3s3nodes\_zeroload\_emptydb](#scale_deployment_0_1-k3s3nodes_zeroload_emptydb)
- [Tested Scenarios - S3-workload during s3gw Pod outage](#tested-scenarios---s3-workload-during-s3gw-pod-outage)
- [PutObj-100ms-ClusterIp](#putobj-100ms-clusterip)
- [PutObj-100ms-Ingress](#putobj-100ms-ingress)

We want to investigate what *High Availability* - HA - means for a project like
the s3gw.
@@ -523,7 +527,7 @@ at least for the cases when the process exits with zero.
Anyway, this behavior limits the number of measures we can collect, and thus
prevents us from computing decent statistics on restart timings using Deployments.

## Tested Scenarios - radosgw focused
## Tested Scenarios - radosgw-restart

When we test a scenario we are interested in collecting `radosgw`'s restart
events; for each of those we measure the following metrics:
@@ -539,63 +543,85 @@ events; for each of those we measure the following metrics:
From these two metrics we also produce a derived metric, `frontend_up_main_delta`:
the arithmetic difference between `to_frontend_up` and `to_main`.
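
A minimal sketch of that per-event subtraction, with made-up sample values;
the field names are assumptions modeled on the metric names above:

```python
# Hedged sketch: derive frontend_up_main_delta from the two measured metrics.
# Field names and values are assumptions modeled on the metrics above.
measures = [
    {"to_main": 0.8, "to_frontend_up": 2.1},  # made-up sample values, seconds
    {"to_main": 0.9, "to_frontend_up": 2.4},
]

for m in measures:
    m["frontend_up_main_delta"] = m["to_frontend_up"] - m["to_main"]

print(measures)
```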

For each scenario tested we collect a set of 100 measures.
For each scenario tested we produce 5 artifacts:
For each tested scenario we collect a set of measures
and produce a set of artifacts (see the plotting sketch after this list):

- deathtype-environment-description_stats_TS.json
- `*_stats.json`
- The `json` file containing all the measures taken for a scenario,
along with some key statistics.

- deathtype-environment-description_raw_TS.svg
- `*_raw.svg`
- It is the plot containing all the charts for the measures:
- `to_main`
- `to_frontend_up`
- `frontend_up_main_delta`

The ordinate axis is the `ID` of the restart event.
This is the natural order in which the restart events occurred.
On the X axis are the restart events' `ID`s;
they follow the temporal order of the restart events.

- deathtype-environment-description_percentiles_to_main_TS.svg
- `*_percentiles_to_main.svg`
- It is the plot containing the percentile graph for the `to_main`
metric.

- deathtype-environment-description_percentiles_to_fup_TS.svg
- `*_percentiles_to_fup.svg`
- It is the plot containing the percentile graph for the `to_frontend_up`
metric.

- deathtype-environment-description_percentiles_fup_main_delta_TS.svg
- `*_percentiles_fup_main_delta.svg`
- It is the plot containing the percentile graph for the `frontend_up_main_delta`
metric.
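
As an illustration, a percentile plot like the ones above could be produced
from a stats artifact roughly as follows; the JSON layout (a top-level
`measures` list) and the file name are assumptions:

```python
# Hedged sketch: plot a percentile graph for to_main from a *_stats.json
# artifact. The "measures" key and the metric field names are assumptions.
import json

import matplotlib.pyplot as plt
import numpy as np

# Hypothetical artifact name following the pattern described below.
with open("regular-localhost-zeroload-emptydb_stats_1694425886.json") as f:
    measures = json.load(f)["measures"]

to_main = sorted(m["to_main"] for m in measures)
percentiles = np.linspace(0, 100, len(to_main))

plt.plot(percentiles, to_main)
plt.xlabel("percentile")
plt.ylabel("to_main (s)")
plt.savefig("percentiles_to_main.svg")
```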

Each file has a pattern name, where:
The file name normally encodes information such as (see the parsing sketch after this list):

- deathtype: is the way the `radosgw` process is asked to die:
- `exit0`
- `exit1`
- `segfault`
- `regular`
- deathtype: the way the `radosgw` process is asked to die:

- environment: is the environment where the scenario is tested:
- `localhost`
- `k8s`
- `exit0` - the process is asked to immediately exit with `exit(0)`
- `exit1` - the process is asked to immediately exit with `exit(1)`
- `segfault` - the process is asked to trigger a `segmentation fault`
- `regular` - the process is asked to exit via the orderly shutdown procedure

- environment: the environment where the scenario is tested:

- `localhost/host-path-volume`
- `k8s/k3d/k3s ... /host-path-volume`
- `k8s/k3d/k3s ... /LH-volume`

- description: a key description of the scenario
- TS: a timestamp of when the artifacts were produced
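
A small sketch of how these fields can be recovered from an artifact name;
the regex assumes the pattern just described, with dash-separated deathtype
and environment and a description that may itself contain dashes:

```python
# Hedged sketch: split an artifact file name into deathtype, environment,
# description, kind, and timestamp, per the naming pattern described above.
import re

name = "regular-localhost-zeroload-emptydb_raw_1694425886.svg"

pattern = (
    r"(?P<deathtype>[^-]+)-(?P<environment>[^-]+)-(?P<description>[^_]+)"
    r"_(?P<kind>.+)_(?P<ts>\d+)\.(?P<ext>\w+)$"
)
print(re.match(pattern, name).groupdict())
# {'deathtype': 'regular', 'environment': 'localhost',
#  'description': 'zeroload-emptydb', 'kind': 'raw',
#  'ts': '1694425886', 'ext': 'svg'}
```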

### regular_localhost_zeroload_emptydb

- restart-type: `regular`
- env: `localhost/host-path-volume`
- load: `zero-empty-db`
- #measures: `100`

<!-- markdownlint-disable MD013 -->
|<img src="measurements/regular_localhost_zeroload_emptydb/regular-localhost-zeroload-emptydb_raw_1694425886.svg">|<img src="measurements/regular_localhost_zeroload_emptydb/regular-localhost-zeroload-emptydb_percentiles_to_main_1694425886.svg">|
|---|---|
|<img src="measurements/regular_localhost_zeroload_emptydb/regular-localhost-zeroload-emptydb_percentiles_to_fup_1694425886.svg">| <img src="measurements/regular_localhost_zeroload_emptydb/regular-localhost-zeroload-emptydb_percentiles_fup_main_delta_1694425886.svg">|
<!-- markdownlint-enable MD013 -->

### segfault_localhost_zeroload_emptydb

- restart-type: `segfault`
- env: `localhost/host-path-volume`
- load: `zero-empty-db`
- #measures: `100`

<!-- markdownlint-disable MD013 -->
|<img src="measurements/segfault_localhost_zeroload_emptydb/segfault-localhost-zeroload-emptydb_raw_1694428197.svg">|<img src="measurements/segfault_localhost_zeroload_emptydb/segfault-localhost-zeroload-emptydb_percentiles_to_main_1694428197.svg">|
|---|---|
|<img src="measurements/segfault_localhost_zeroload_emptydb/segfault-localhost-zeroload-emptydb_percentiles_to_fup_1694428197.svg">| <img src="measurements/segfault_localhost_zeroload_emptydb/segfault-localhost-zeroload-emptydb_percentiles_fup_main_delta_1694428197.svg">|
<!-- markdownlint-enable MD013 -->

### regular_localhost_load_fio_64_write

- restart-type: `regular`
- env: `localhost/host-path-volume`
- load: `fio`
- #measures: `100`

`fio` configuration:

```ini
Expand All @@ -612,33 +638,86 @@ http_host=localhost:7480
filename=/workload-1/obj1
numjobs=8
rw=write
size=128m
size=64m
bs=1m
```
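
The job can be kept running while the restart loop executes; a minimal
launcher sketch, assuming the jobfile above is saved as `fio-s3-write.ini`
(hypothetical path):

```python
# Hedged sketch: launch the fio jobfile above and capture its output.
# "fio-s3-write.ini" is a hypothetical path for the configuration shown.
import subprocess

result = subprocess.run(
    ["fio", "fio-s3-write.ini"],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout)
```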

<!-- markdownlint-disable MD013 -->
|<img src="measurements/regular_localhost_load_fio_64_write/regular-localhost-writeload_raw_1694440297.svg">|<img src="measurements/regular_localhost_load_fio_64_write/regular-localhost-writeload_percentiles_to_main_1694440297.svg">|
|---|---|
|<img src="measurements/regular_localhost_load_fio_64_write/regular-localhost-writeload_percentiles_to_fup_1694440297.svg">| <img src="measurements/regular_localhost_load_fio_64_write/regular-localhost-writeload_percentiles_fup_main_delta_1694440297.svg">|
<!-- markdownlint-enable MD013 -->

### regular_localhost_zeroload_400_800Kdb

#### 400K objects - measures done with the WAL file zeroed

- restart-type: `regular`
- env: `localhost/host-path-volume`
- load: `zero-400K-db`
- #measures: `100`

<!-- markdownlint-disable MD013 -->
|<img src="measurements/regular_localhost_zeroload_400_800Kdb/regular-localhost-zeroload-400Kdb_raw_1694522179.svg">|<img src="measurements/regular_localhost_zeroload_400_800Kdb/regular-localhost-zeroload-400Kdb_percentiles_to_main_1694522179.svg">|
|---|---|
|<img src="measurements/regular_localhost_zeroload_400_800Kdb/regular-localhost-zeroload-400Kdb_percentiles_to_fup_1694522179.svg">| <img src="measurements/regular_localhost_zeroload_400_800Kdb/regular-localhost-zeroload-400Kdb_percentiles_fup_main_delta_1694522179.svg">|
<!-- markdownlint-enable MD013 -->

#### 800K objects - measures done with the WAL file still to be processed (size 32G)

- restart-type: `regular`
- env: `localhost/host-path-volume`
- load: `zero-800K-db`
- #measures: `100`

<!-- markdownlint-disable MD013 -->
|<img src="measurements/regular_localhost_zeroload_400_800Kdb/regular-localhost-zeroload-800Kdb_raw_1694524508.svg">|<img src="measurements/regular_localhost_zeroload_400_800Kdb/regular-localhost-zeroload-800Kdb_percentiles_to_main_1694524508.svg">|
|---|---|
|<img src="measurements/regular_localhost_zeroload_400_800Kdb/regular-localhost-zeroload-800Kdb_percentiles_to_fup_1694524508.svg">| <img src="measurements/regular_localhost_zeroload_400_800Kdb/regular-localhost-zeroload-800Kdb_percentiles_fup_main_delta_1694524508.svg">|
<!-- markdownlint-enable MD013 -->

### regular-localhost-incremental-fill-5k

Between every restart there is an interposed `PUT` of 5K objects,
- restart-type: `regular`
- env: `localhost/host-path-volume`
- load: `5K-incremental-800K-db`
- #measures: `100`

Between every restart there is an interposed `PUT-Object` sequence of 5K objects;
the sqlite db initially contained 800K objects.
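
A minimal sketch of the interposed fill step; endpoint, credentials, and
bucket name are hypothetical (the endpoint mirrors the `http_host` used in
the fio job above):

```python
# Hedged sketch: PUT 5K small objects between two consecutive restarts.
# Endpoint, credentials, and bucket name are hypothetical.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:7480",
    aws_access_key_id="test",
    aws_secret_access_key="test",
)

def fill(start: int, count: int = 5000) -> None:
    for i in range(start, start + count):
        s3.put_object(Bucket="bucket-1", Key=f"obj-{i}", Body=b"x")

fill(800_000)  # the db already held 800K objects before the first restart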

<!-- markdownlint-disable MD013 -->
|<img src="measurements/regular-localhost-incremental-fill-5k/regular-localhost-incremental-fill-5k_raw_1694534032.svg">|<img src="measurements/regular-localhost-incremental-fill-5k/regular-localhost-incremental-fill-5k_percentiles_to_main_1694534032.svg">|
|---|---|
|<img src="measurements/regular-localhost-incremental-fill-5k/regular-localhost-incremental-fill-5k_percentiles_to_fup_1694534032.svg">| <img src="measurements/regular-localhost-incremental-fill-5k/regular-localhost-incremental-fill-5k_percentiles_fup_main_delta_1694534032.svg">|
<!-- markdownlint-enable MD013 -->

### scale_deployment_0_1-k3s3nodes_zeroload_emptydb

- restart-type: `scale_deployment_0_1`
- env: `virtual-machine/k3s-3-nodes/LH-volume`
- load: `zero-empty-db`
- #measures: `300`

The test was conducted in 3 blocks of 100 restarts each.
Every restart in a block is constrained to occur on a specific node.
The schema is the following (see the scripted sketch after the list):

1. taint all nodes but `node-1`
2. trigger 100 pod restarts
3. taint all nodes but `node-2`
4. trigger 100 pod restarts
5. taint all nodes but `node-3`
6. trigger 100 pod restarts
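
A scripted sketch of those blocks, assuming hypothetical node names and a
Deployment called `s3gw`; per the restart-type above, each restart is forced
by scaling replicas 0 -> 1:

```python
# Hedged sketch: run the three 100-restart blocks, pinning each block to one
# node by tainting the others. Node and Deployment names are hypothetical.
import subprocess

def sh(*args: str) -> None:
    subprocess.run(args, check=True)

nodes = ["node-1", "node-2", "node-3"]

for target in nodes:
    # Taint all nodes but the target so the Pod can only land on `target`.
    for other in nodes:
        if other != target:
            sh("kubectl", "taint", "nodes", other,
               "measure=block:NoSchedule", "--overwrite")
    # Trigger 100 Pod restarts by scaling the Deployment 0 -> 1.
    for _ in range(100):
        sh("kubectl", "scale", "deployment/s3gw", "--replicas=0")
        sh("kubectl", "scale", "deployment/s3gw", "--replicas=1")
        sh("kubectl", "rollout", "status", "deployment/s3gw", "--timeout=120s")
    # Remove the taints before moving to the next block.
    for other in nodes:
        if other != target:
            sh("kubectl", "taint", "nodes", other, "measure=block:NoSchedule-")
```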

<!-- markdownlint-disable MD013 -->
|<img src="measurements/scale_deployment_0_1-k3s3nodes-zeroload-emptydb/scale_deployment_0_1-k3s3nodes-zeroload-emptydb_raw_1695046129.svg">|<img src="measurements/scale_deployment_0_1-k3s3nodes-zeroload-emptydb/scale_deployment_0_1-k3s3nodes-zeroload-emptydb_percentiles_to_main_1695046129.svg">|
|---|---|
|<img src="measurements/scale_deployment_0_1-k3s3nodes-zeroload-emptydb/scale_deployment_0_1-k3s3nodes-zeroload-emptydb_percentiles_to_fup_1695046129.svg">| <img src="measurements/scale_deployment_0_1-k3s3nodes-zeroload-emptydb/scale_deployment_0_1-k3s3nodes-zeroload-emptydb_percentiles_fup_main_delta_1695046129.svg">|
<!-- markdownlint-enable MD013 -->

## Tested Scenarios - S3-workload during s3gw Pod outage
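
Both scenarios below drive the same client loop, one `PutObject` roughly every
100 ms while the s3gw Pod is taken down, and differ only in how the client
reaches the gateway (ClusterIP Service vs. Ingress). A minimal sketch of such
a loop, with hypothetical endpoint, credentials, and bucket:

```python
# Hedged sketch: steady PutObject workload (one request every ~100 ms),
# counting the requests that fail while the s3gw Pod is down.
# Endpoint, credentials, and bucket name are hypothetical.
import time

import boto3
from botocore.exceptions import ClientError, EndpointConnectionError

s3 = boto3.client(
    "s3",
    endpoint_url="http://s3gw.example.local",  # ClusterIP Service or Ingress URL
    aws_access_key_id="test",
    aws_secret_access_key="test",
)

failures = 0
for i in range(10_000):
    try:
        s3.put_object(Bucket="bucket-1", Key=f"obj-{i}", Body=b"x")
    except (ClientError, EndpointConnectionError):
        failures += 1  # request landed during the Pod outage
    time.sleep(0.1)

print(f"failed PUTs: {failures}")
```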

### PutObj-100ms-ClusterIp

### PutObj-100ms-Ingress