feat(regionserver): add graceful shutdown configuration #570

Open
wants to merge 49 commits into
base: main
Changes from 29 commits
Commits
4ad793f
feat(regionserver): add graceful shutdown configuration
razvan Oct 2, 2024
cb232df
Make UnifiedRoleConfiguration a sub-trait of Send
razvan Oct 2, 2024
dea179d
Replace trait with enum.
razvan Oct 2, 2024
eecaf23
implement region mover command
razvan Oct 2, 2024
0b14f92
fix: crd field names
razvan Oct 14, 2024
71793ea
unit tests and shell escaping
razvan Oct 14, 2024
1644aff
update docs
razvan Oct 14, 2024
1903f36
spelling
razvan Oct 14, 2024
5e8201f
cargo update
razvan Oct 14, 2024
e76166a
added shutdown test & hbase-entrypoint.sh
razvan Oct 16, 2024
3c63da1
cleanup and set region mover opts env var
razvan Oct 17, 2024
69a6f49
main merge
razvan Oct 17, 2024
8dbde9b
first successful integration test
razvan Oct 17, 2024
68756ab
main merge
razvan Oct 17, 2024
43abf6d
fix image pull policy for the kerberos tests
razvan Oct 17, 2024
2b6e89b
add RUN_REGION_MOVER env var
razvan Oct 17, 2024
c53497a
Update docs/modules/hbase/pages/usage-guide/operations/graceful-shutd…
razvan Oct 17, 2024
4e31a3c
remove trailing whitespace in docs
razvan Oct 17, 2024
a10caa0
rust : remove unused dep
razvan Oct 17, 2024
f42ab05
fix shellcheck lint
razvan Oct 17, 2024
0e9e37e
update shutdown test and run it successfuly
razvan Oct 18, 2024
c2c92c5
update docs
razvan Oct 18, 2024
8d7265e
Update rust/crd/src/lib.rs
razvan Oct 18, 2024
28a1395
fix const arithmetic
razvan Oct 18, 2024
f059e7f
switch to LazyLock
razvan Oct 18, 2024
67f3f1b
configure gracefulShutdownTimeout in (almost) all tests
razvan Oct 18, 2024
7e118ab
region mover args
razvan Oct 21, 2024
34a5ddb
Merge branch 'main' into feat/region-mover
razvan Oct 23, 2024
f9a769b
Update CHANGELOG.md
razvan Oct 23, 2024
420ba36
Update rust/crd/src/lib.rs
razvan Oct 24, 2024
2b0d63b
Update rust/crd/src/lib.rs
razvan Oct 24, 2024
5d5d5e9
Update rust/crd/src/lib.rs
razvan Oct 24, 2024
228ad4f
Update docs/modules/hbase/pages/usage-guide/operations/graceful-shutd…
razvan Oct 24, 2024
039c22a
Update docs/modules/hbase/pages/usage-guide/operations/graceful-shutd…
razvan Oct 24, 2024
60b9dc8
Update docs/modules/hbase/pages/usage-guide/operations/graceful-shutd…
razvan Oct 24, 2024
fd8331e
Update docs/modules/hbase/pages/usage-guide/operations/graceful-shutd…
razvan Oct 24, 2024
5378f11
Update rust/crd/src/lib.rs
razvan Oct 24, 2024
7b08a26
main merge
razvan Oct 25, 2024
6f087db
note on constant paths and the entrypoint script
razvan Oct 25, 2024
0f32e59
remove unnecessary configOverrides
razvan Oct 25, 2024
109e877
wip: use Fragment for the RegionMover
razvan Oct 25, 2024
05f4303
fix crd generation
razvan Oct 25, 2024
19fed55
test: fail if the regionmover fails (only with 2.6)
razvan Oct 28, 2024
8a8d26a
refactor to reduce (some) duplication
razvan Oct 28, 2024
e0aaa27
tests: use dev images
razvan Oct 28, 2024
eb52267
feat: remove hard-coded cluster.local from the domain name
razvan Oct 29, 2024
c051fb5
main merge
razvan Oct 29, 2024
40ae497
Merge branch 'main' into feat/region-mover
razvan Oct 29, 2024
d6d5fe4
fix: RegionMover fields should not be Optional
razvan Oct 30, 2024
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -7,6 +7,7 @@
- Reduce CRD size from `1.4MB` to `96KB` by accepting arbitrary YAML input instead of the underlying schema for the following fields ([#548]):
- `podOverrides`
- `affinity`
- Support moving regions to other Pods during graceful shutdown of region servers ([#570]).

### Fixed

@@ -22,6 +23,7 @@
[#550]: https://github.com/stackabletech/hbase-operator/pull/550
[#556]: https://github.com/stackabletech/hbase-operator/pull/556
[#558]: https://github.com/stackabletech/hbase-operator/pull/558
[#570]: https://github.com/stackabletech/hbase-operator/pull/570

## [24.7.0] - 2024-07-24

26 changes: 16 additions & 10 deletions Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

44 changes: 30 additions & 14 deletions Cargo.nix

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

1 change: 1 addition & 0 deletions Cargo.toml
@@ -20,6 +20,7 @@ rstest = "0.22"
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
serde_yaml = "0.9"
shell-escape = "0.1"
snafu = "0.8"
stackable-operator = { git = "https://github.com/stackabletech/operator-rs.git", tag = "stackable-operator-0.76.0" }
product-config = { git = "https://github.com/stackabletech/product-config.git", tag = "0.7.0" }
52 changes: 52 additions & 0 deletions deploy/helm/hbase-operator/crds/crds.yaml
@@ -724,6 +724,32 @@ spec:
      nullable: true
      type: boolean
  type: object
regionMover:
  description: Before terminating a region server pod, the RegionMover tool can be invoked to transfer local regions to other servers. This may cause a lot of network traffic in the Kubernetes cluster if the entire HBase stacklet is being restarted. The operator will compute a timeout period for the region move that will not exceed the graceful shutdown timeout.
  nullable: true
  properties:
    ack:
      description: If enabled (default), the region mover will confirm that regions are available on the source as well as the target pods before and after the move.
      type: boolean
    extraOpts:
      default: []
      description: Additional options to pass to the region mover.
      items:
        type: string
      type: array
    maxThreads:
      description: Maximum number of threads to use for moving regions.
      format: uint16
      minimum: 0.0
      type: integer
    runBeforeShutdown:
      description: Move local regions to other servers before terminating a region server's pod.
      type: boolean
  required:
  - ack
  - maxThreads
  - runBeforeShutdown
  type: object
resources:
  default:
    cpu:
@@ -947,6 +973,32 @@ spec:
      nullable: true
      type: boolean
  type: object
regionMover:
  description: Before terminating a region server pod, the RegionMover tool can be invoked to transfer local regions to other servers. This may cause a lot of network traffic in the Kubernetes cluster if the entire HBase stacklet is being restarted. The operator will compute a timeout period for the region move that will not exceed the graceful shutdown timeout.
  nullable: true
  properties:
    ack:
      description: If enabled (default), the region mover will confirm that regions are available on the source as well as the target pods before and after the move.
      type: boolean
    extraOpts:
      default: []
      description: Additional options to pass to the region mover.
      items:
        type: string
      type: array
    maxThreads:
      description: Maximum number of threads to use for moving regions.
      format: uint16
      minimum: 0.0
      type: integer
    runBeforeShutdown:
      description: Move local regions to other servers before terminating a region server's pod.
      type: boolean
  required:
  - ack
  - maxThreads
  - runBeforeShutdown
  type: object
resources:
  default:
    cpu:
docs/modules/hbase/pages/usage-guide/operations/graceful-shutdown.adoc
@@ -1,6 +1,6 @@
= Graceful shutdown

-You can configure the graceful shutdown as described in xref:concepts:operations/graceful_shutdown.adoc[].
+You can configure the graceful shutdown grace period as described in xref:concepts:operations/graceful_shutdown.adoc[].

== Masters

@@ -15,7 +15,7 @@

== RegionServers

-As a default, RegionServers have `60 minutes` to shut down gracefully.
+By default, RegionServers have `60 minutes` to shut down gracefully.

They use the same mechanism described above.
In contrast to the Master servers, they will, however, acknowledge the graceful shutdown with a message in the logs:
@@ -26,6 +26,61 @@
2023-10-11 12:38:05,060 INFO [shutdown-hook-0] regionserver.HRegionServer: ***** STOPPING region server 'test-hbase-regionserver-default-0.test-hbase-regionserver-default.kuttl-test-topical-parakeet.svc.cluster.local,16020,1697027870348' *****
----

The operator allows for finer control over the shutdown process of region servers.
For each region server pod, the region mover tool may be invoked before terminating the region server's pod.
The affected regions are transferred to other pods, thus ensuring that the data is not lost.

Here is an example:

[source,yaml]
----
spec:
  regionServers:
    config:
      regionMover:
        runBeforeShutdown: true # <1>
        maxThreads: 5 # <2>
        ack: false # <3>
        extraOpts: ["--designatedFile", "/path/to/designatedFile"] # <4>
----
<1>: Run the region mover tool before shutting down the region server. Default is `false`.
<2>: Maximum number of threads to use for moving regions. Default is 1.
Member:
Is there a hint how this value should be set? E.g. the number of servers?

Member Author:
I suppose it depends on the number of regions, the servers to distribute the regions to, the total row count and whatnot.

As usual: it's complicated, and if you don't have performance problems, don't touch it!

"1" is the default the region mover itself uses.

Member:
In any case, there probably should be some guidance rather than leaving it ambiguous.

Member Author:
I'll have to defer to the experts then.

<3>: Enable or disable region confirmation on the present and target servers. Default is `true`.
<4>: Extra options to pass to the region mover tool.
Member:
Maybe add a link to valid options (or maybe they're covered already?)

Member Author:
I couldn't find any documentation on this tool. I used the source.

Member:
Maybe we can give a hint where to find the extra options in the source code?

Member Author:
Done: 7e118ab


For a list of additional options accepted by the region mover, run it with the `--help` option:

[source,bash]
----
$ /stackable/hbase/bin/hbase org.apache.hadoop.hbase.util.RegionMover --help
usage: hbase org.apache.hadoop.hbase.util.RegionMover <options>
Options:
-r,--regionserverhost <arg> region server <hostname>|<hostname:port>
-o,--operation <arg> Expected: load/unload/unload_from_rack/isolate_regions
-m,--maxthreads <arg> Define the maximum number of threads to use to unload and reload the regions
-i,--isolateRegionIds <arg> Comma separated list of Region IDs hash to isolate on a RegionServer and put region
server in draining mode. This option should only be used with '-o isolate_regions'. By
putting region server in decommission/draining mode, master can't assign any new region
on this server. If one or more regions are not found OR failed to isolate successfully,
utility will exist without putting RS in draining/decommission mode. Ex.
--isolateRegionIds id1,id2,id3 OR -i id1,id2,id3
-x,--excludefile <arg> File with <hostname:port> per line to exclude as unload targets; default excludes only
target host; useful for rack decommisioning.
-d,--designatedfile <arg> File with <hostname:port> per line as unload targets;default is all online hosts
-f,--filename <arg> File to save regions list into unloading, or read from loading; default
/tmp/<usernamehostname:port>
-n,--noack Turn on No-Ack mode(default: false) which won't check if region is online on target
RegionServer, hence best effort. This is more performant in unloading and loading but
might lead to region being unavailable for some time till master reassigns it in case the
move failed
-t,--timeout <arg> timeout in seconds after which the tool will exit irrespective of whether it finished or
not;default Integer.MAX_VALUE
----
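
To make the relationship between the `regionMover` fields and the flags in the help output concrete, here is a minimal sketch in Rust. It is not the operator's implementation: the struct `RegionMoverConfig` and the functions `shell_quote` and `region_mover_args` are hypothetical names, and the quoting is a simplified std-only stand-in for the `shell-escape` crate this PR adds. Only the `--maxthreads` and `--noack` flags are taken from the help output above.

```rust
/// Hypothetical mirror of the `regionMover` fields from the CRD in this PR.
struct RegionMoverConfig {
    run_before_shutdown: bool,
    max_threads: u16,
    ack: bool,
    extra_opts: Vec<String>,
}

/// Quote one argument for a POSIX shell (simplified, assumption: single-quote style).
fn shell_quote(arg: &str) -> String {
    // Arguments made only of these characters need no quoting.
    let safe = !arg.is_empty()
        && arg
            .chars()
            .all(|c| c.is_ascii_alphanumeric() || "-_=/.,:".contains(c));
    if safe {
        arg.to_string()
    } else {
        // Close the quote, emit an escaped quote, reopen the quote.
        format!("'{}'", arg.replace('\'', "'\\''"))
    }
}

/// Build the RegionMover argument string, or `None` if it should not run.
fn region_mover_args(cfg: &RegionMoverConfig) -> Option<String> {
    if !cfg.run_before_shutdown {
        return None;
    }
    let mut args = vec!["--maxthreads".to_string(), cfg.max_threads.to_string()];
    if !cfg.ack {
        // `ack: false` corresponds to the tool's no-ack mode.
        args.push("--noack".to_string());
    }
    // User-supplied options are appended after the generated ones.
    args.extend(cfg.extra_opts.iter().cloned());
    let quoted: Vec<String> = args.iter().map(|a| shell_quote(a)).collect();
    Some(quoted.join(" "))
}
```

With the values from the YAML example (`maxThreads: 5`, `ack: false` and the two `extraOpts` entries), this sketch yields `--maxthreads 5 --noack --designatedFile /path/to/designatedFile`. Quoting each argument separately keeps a value containing spaces or quotes from being split by the shell that launches the tool.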

NOTE: There is no need to explicitly specify a timeout for the region movement. The operator will compute an appropriate timeout that cannot exceed the `gracefulShutdownTimeout` for region servers.
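
One way such a bounded timeout could be derived is sketched below. This is an illustration under stated assumptions, not the operator's actual calculation: the function names and the 60-second `SHUTDOWN_MARGIN` are hypothetical. The only facts taken from the text are that the result must not exceed `gracefulShutdownTimeout` and that the tool's `--timeout` option takes seconds.

```rust
use std::time::Duration;

/// Hypothetical margin reserved for the region server to stop after the move.
const SHUTDOWN_MARGIN: Duration = Duration::from_secs(60);

/// Derive a region mover timeout that never exceeds the graceful shutdown timeout.
fn region_mover_timeout(graceful_shutdown_timeout: Duration) -> Duration {
    // saturating_sub clamps at zero instead of panicking on underflow.
    graceful_shutdown_timeout.saturating_sub(SHUTDOWN_MARGIN)
}

/// Render the value as the `--timeout` option, which expects seconds.
fn timeout_arg(graceful_shutdown_timeout: Duration) -> String {
    format!(
        "--timeout {}",
        region_mover_timeout(graceful_shutdown_timeout).as_secs()
    )
}
```

Under these assumptions, the default of 60 minutes would give the region mover 59 minutes, while a graceful shutdown timeout shorter than the margin would leave it no time at all.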

IMPORTANT: The ZooKeeper connection must be available during the time the region mover is running for the graceful shutdown process to succeed.

== RestServers

As a default, RestServers have `5 minutes` to shut down gracefully.
1 change: 1 addition & 0 deletions rust/crd/Cargo.toml
@@ -12,6 +12,7 @@ publish = false
product-config.workspace = true
serde.workspace = true
serde_json.workspace = true
shell-escape.workspace = true
snafu.workspace = true
stackable-operator.workspace = true
strum.workspace = true