
Add backup benchmarking under read stress #9307

Merged: 1 commit into scylladb:master on Nov 27, 2024

Conversation

@kreuzerkrieg (Contributor) commented on Nov 20, 2024:

Introduce the test_backup_benchmark test, which measures backup time under read stress conditions. This test performs multiple actions to consolidate all necessary data into a single table. Initially, it runs and measures the backup process, followed by the read stress test. Finally, it executes both processes asynchronously to observe how the performance of reading and backing up degrades.

More metrics could be added after scylladb/scylla-manager#4123 is completed.
The test should be re-run after scylladb/scylla-manager#4125 is completed.
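
For orientation, here is a minimal sketch of the flow the test follows (the callables `run_backup`, `run_read_stress`, and `report` are placeholders for illustration, not the actual SCT API):

```python
# Illustrative sketch only - helper and callback names are hypothetical, not the real SCT API.
import time
from concurrent.futures import ThreadPoolExecutor


def measure(fn):
    """Run fn and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def benchmark_backup_under_read_stress(run_backup, run_read_stress, report):
    # 1. Baseline: backup on its own.
    report("Backup", measure(run_backup))

    # 2. Baseline: read stress on its own.
    report("Read stress", measure(run_read_stress))

    # 3. Both at once, to observe how each degrades.
    with ThreadPoolExecutor(max_workers=2) as pool:
        backup_future = pool.submit(measure, run_backup)
        read_future = pool.submit(measure, run_read_stress)
        report("Backup during read stress", backup_future.result())
        report("Read stress during backup", read_future.result())
```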

Argus results for the 100GB run:

|                           | backup time [s] |
|---------------------------|-----------------|
| Backup during read stress | 00:18:48        |
| Backup                    | 00:10:18        |

|                           | read time [s] |
|---------------------------|---------------|
| Read stress               | 00:11:29      |
| Read stress during backup | 00:11:31      |

fixes: #8752

@kreuzerkrieg (Contributor, Author) commented:

Argus results: For [100GB run](https://argus.scylladb.com/tests/scylla-cluster-tests/6a18ccd0-1d6a-4f00-a473-

I have to admit it looks super suspicious that the read did not degrade even slightly

@mikliapko (Contributor) commented:

@kreuzerkrieg There was some refactoring done for the Manager tests.
Please consider rebasing onto master and putting your new test in the right place, or possibly into a separate class for the backup benchmark.

@kreuzerkrieg (Contributor, Author) replied:

> @kreuzerkrieg There was some refactoring done for the Manager tests. Please consider rebasing onto master and putting your new test in the right place, or possibly into a separate class for the backup benchmark.

Rebased and moved the test to another class.

mgmt_cli_test.py (outdated), comment on lines 1573 to 1585:
backup_task_status = task.wait_for_uploading_stage(timeout=200000)
assert backup_task_status, "Backup has failed!"
Contributor:

I'd recommend using task.wait_for_status(...).
In that case there is no need for an assertion - the waiter fails by itself if the timeout passes.

Contributor (Author):

What do you mean? wait_for_uploading_stage calls task.wait_for_status, which returns a bool, so I'd still have to assert on is_status_reached or something.

Contributor:

I'm expecting task.wait_for_status to raise a WaitForTimeoutError exception if the timeout is reached.
But whatever - the comment can be ignored, it's not significant at all.

Btw, your code could be shortened a bit:

assert task.wait_for_uploading_stage(timeout=200000), "Backup has failed!"

From my point of view, it improves readability but it's up to you.
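
For context, the two waiter styles being contrasted here look roughly like this (a sketch with assumed names and behaviour, not the actual wait_for_status / wait_for_uploading_stage implementations):

```python
# Sketch of the two waiter styles discussed above; names and behaviour are assumptions.
import time


class WaitForTimeoutError(Exception):
    """Raised by the 'raising' waiter style when the condition is not reached in time."""


def wait_returning_bool(check, timeout, poll=5):
    # Style 1: return a bool, so the caller has to assert on the result,
    # e.g. assert wait_returning_bool(...), "Backup has failed!"
    deadline = time.time() + timeout
    while time.time() < deadline:
        if check():
            return True
        time.sleep(poll)
    return False


def wait_raising(check, timeout, poll=5):
    # Style 2: raise on timeout, so the call site needs no assertion at all.
    deadline = time.time() + timeout
    while time.time() < deadline:
        if check():
            return
        time.sleep(poll)
    raise WaitForTimeoutError(f"condition not reached within {timeout}s")
```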

@regevran commented:

The units in the issue should be [s] --> [h:m:s]

@kreuzerkrieg force-pushed the backup-baseline branch 2 times, most recently from 8c6061d to 886e8cd on November 24, 2024 12:51
@kreuzerkrieg force-pushed the backup-baseline branch 2 times, most recently from 660fa90 to b16c2d4 on November 24, 2024 15:13
@kreuzerkrieg (Contributor, Author) replied:

> The units in the issue should be [s] --> [h:m:s]

That's just how it is presented, I think - I send seconds to it and it renders them as a time.
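
For reference, turning a duration given in seconds into the h:m:s form shown above is trivial (an illustrative snippet; how Argus itself renders the value is not shown here):

```python
import datetime

def format_duration(seconds: int) -> str:
    """Render a number of seconds as H:MM:SS, e.g. 618 -> '0:10:18'."""
    return str(datetime.timedelta(seconds=seconds))

print(format_duration(618))  # 0:10:18 - compare the 'Backup' row above (00:10:18)
```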

@Michal-Leszczynski commented:

|                           | backup time [s] | upload time [s] | total [s] |
|---------------------------|-----------------|-----------------|-----------|
| Backup times              | 00:00:12        | 00:09:45        | 00:09:11  |
| Backup during read stress | 00:00:12        | 00:15:07        | 00:14:43  |

From the code, I see that backup time is the total backup time without the upload, right? It would be nice to indicate that with a more descriptive column name.

Also, why is the total smaller than the upload time?

@kreuzerkrieg (Contributor, Author) replied:

> |                           | backup time [s] | upload time [s] | total [s] |
> |---------------------------|-----------------|-----------------|-----------|
> | Backup times              | 00:00:12        | 00:09:45        | 00:09:11  |
> | Backup during read stress | 00:00:12        | 00:15:07        | 00:14:43  |
>
> From the code, I see that backup time is the total backup time without the upload, right? It would be nice to indicate that with a more descriptive column name.

Should I name it "snapshot time"?

> Also, why is the total smaller than the upload time?

Good question - the total time is taken from the backup task.duration, while the rest is measured in place, and it doesn't add up. Ideas?
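
To make the mismatch concrete, the two kinds of numbers come from different measurement paths, roughly like this sketch (other than wait_for_uploading_stage and task.duration, which are mentioned in the thread, the helper names are assumptions for illustration):

```python
# Illustrative sketch of the two measurement paths; exact helper names are assumptions.
import time


def measure_backup(task):
    # "backup time" and "upload time": wall-clock intervals measured in the test,
    # bracketing the waits for the corresponding task stages.
    start = time.perf_counter()
    assert task.wait_for_uploading_stage(timeout=200000), "Backup has failed!"
    time_to_upload_stage = time.perf_counter() - start

    start = time.perf_counter()
    task.wait_for_completion(timeout=200000)  # hypothetical wait for the task to finish
    upload_time = time.perf_counter() - start

    # "total": the duration the Manager itself reports for the task, so it
    # does not have to equal time_to_upload_stage + upload_time.
    return time_to_upload_stage, upload_time, task.duration
```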

@Michal-Leszczynski replied:

> Should I name it "snapshot time"?

It's not only about the snapshot (e.g. we also fetch the schema, create manifests, etc.).
In general I'm not sure why we need to report the time up to the upload stage - is it useful?
But I see that the upload time is also not only about the upload (e.g. purging the backup location of unnecessary files).
I will leave some comments in the changed files in a bit.

I created an issue about displaying backup upload bandwidth/duration - in the future it could make things easier for such benchmarks.

@kreuzerkrieg (Contributor, Author) replied:

> > Should I name it "snapshot time"?
>
> It's not only about the snapshot (e.g. we also fetch the schema, create manifests, etc.). In general I'm not sure why we need to report the time up to the upload stage - is it useful? But I see that the upload time is also not only about the upload (e.g. purging the backup location of unnecessary files). I will leave some comments in the changed files in a bit.
>
> I created an issue about displaying backup upload bandwidth/duration - in the future it could make things easier for such benchmarks.

Now that I see it is a negligible slice of the whole process, I guess it is worth just dropping it.

Commit: Add `test_backup_benchmark` test which measures backup time under read stress test
@mikliapko (Contributor) left a review comment:

LGTM

@kreuzerkrieg (Contributor, Author) commented:

@scylladb/qa-maintainers could you push it forward please?

@soyacz (Contributor) commented on Nov 27, 2024:

I briefly looked at the 100GB dataset - to me, ingestion of this size should take a bit more than 7m - did you verify the correctness of this element?
Also, there's no jenkinsfile related to this test - how is it going to be used?

@kreuzerkrieg (Contributor, Author) commented on Nov 27, 2024:

> I briefly looked at the 100GB dataset - to me, ingestion of this size should take a bit more than 7m - did you verify the correctness of this element?

This test uses the existing ingestion machinery, so I assume that ingestion works.
In any case, it looks like it takes more or less 7 or 8 minutes:

[screenshot]

> Also, there's no jenkinsfile related to this test - how is it going to be used?

Like any other SM test - just drop the right test name into the existing Jenkins pipeline:

[screenshot]

@soyacz (Contributor) commented on Nov 27, 2024:

OK, I did the calculations a bit wrong the first time - it looks like 7m should be enough with this throughput.
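
For the record, the back-of-the-envelope arithmetic behind that estimate looks like this (my own illustrative numbers; the assumed throughput is not taken from the PR):

```python
# Rough ingestion-time estimate for a 100 GB dataset at an assumed aggregate write throughput.
dataset_mb = 100 * 1000                # ~100 GB expressed in MB
assumed_throughput_mb_s = 250          # assumed aggregate cluster write throughput (MB/s)
minutes = dataset_mb / assumed_throughput_mb_s / 60
print(f"~{minutes:.1f} minutes")       # ~6.7 minutes, in line with the 7-8 minutes observed
```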

@soyacz merged commit 1546416 into scylladb:master on Nov 27, 2024; 7 checks passed.
Labels: backport/none (Backport is not required), promoted-to-master
Merging this pull request may close: SCT for backup (#8752)