
Add backup benchmarking under read stress #9307

Merged: 1 commit into scylladb:master on Nov 27, 2024

Conversation

@kreuzerkrieg (Contributor) commented on Nov 20, 2024:

Introduce the test_backup_benchmark test, which measures backup time under read stress conditions. This test performs multiple actions to consolidate all necessary data into a single table. Initially, it runs and measures the backup process, followed by the read stress test. Finally, it executes both processes asynchronously to observe how the performance of reading and backing up degrades.

More metrics could be added after scylladb/scylla-manager#4123 is completed.
The test should be re-run after scylladb/scylla-manager#4125 is completed.
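
For orientation, here is a minimal sketch of the flow the test follows (the callables `run_backup`, `run_read_stress`, and `report` are placeholders for illustration, not the actual SCT API):

```python
# Illustrative sketch only - helper and callback names are hypothetical, not the real SCT API.
import time
from concurrent.futures import ThreadPoolExecutor


def measure(fn):
    """Run fn and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def benchmark_backup_under_read_stress(run_backup, run_read_stress, report):
    # 1. Baseline: backup on its own.
    report("Backup", measure(run_backup))

    # 2. Baseline: read stress on its own.
    report("Read stress", measure(run_read_stress))

    # 3. Both at once, to observe how each degrades.
    with ThreadPoolExecutor(max_workers=2) as pool:
        backup_future = pool.submit(measure, run_backup)
        read_future = pool.submit(measure, run_read_stress)
        report("Backup during read stress", backup_future.result())
        report("Read stress during backup", read_future.result())
```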

Argus results for the 100GB run:

|                           | backup time [s] |
|---------------------------|-----------------|
| Backup during read stress | 00:18:48        |
| Backup                    | 00:10:18        |

|                           | read time [s] |
|---------------------------|---------------|
| Read stress               | 00:11:29      |
| Read stress during backup | 00:11:31      |

fixes: #8752

@kreuzerkrieg (Contributor, Author) commented:

Argus results: For [100GB run](https://argus.scylladb.com/tests/scylla-cluster-tests/6a18ccd0-1d6a-4f00-a473-

I have to admit it looks super suspicious that the read did not degrade even slightly

@mikliapko (Contributor) commented:

@kreuzerkrieg There was some refactoring done for the Manager tests.
Please consider rebasing onto master and putting your new test in the right place, or possibly into a separate class for the backup benchmark.

@kreuzerkrieg (Contributor, Author) replied:

> @kreuzerkrieg There was some refactoring done for the Manager tests. Please consider rebasing onto master and putting your new test in the right place, or possibly into a separate class for the backup benchmark.

Rebased and moved the test to another class.

mgmt_cli_test.py (outdated), comment on lines 1573 to 1585:
backup_task_status = task.wait_for_uploading_stage(timeout=200000)
assert backup_task_status, "Backup has failed!"
Contributor:

I'd recommend using task.wait_for_status(...).
In that case there is no need for an assertion - the waiter fails by itself if the timeout passes.

Contributor (Author):

What do you mean? wait_for_uploading_stage calls task.wait_for_status, which returns a bool, so I'd still have to assert on is_status_reached or something.

Contributor:

I'm expecting task.wait_for_status to raise a WaitForTimeoutError exception if the timeout is reached.
But whatever - the comment can be ignored, it's not significant at all.

Btw, your code could be shortened a bit:

assert task.wait_for_uploading_stage(timeout=200000), "Backup has failed!"

From my point of view, it improves readability but it's up to you.
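
For context, the two waiter styles being contrasted here look roughly like this (a sketch with assumed names and behaviour, not the actual wait_for_status / wait_for_uploading_stage implementations):

```python
# Sketch of the two waiter styles discussed above; names and behaviour are assumptions.
import time


class WaitForTimeoutError(Exception):
    """Raised by the 'raising' waiter style when the condition is not reached in time."""


def wait_returning_bool(check, timeout, poll=5):
    # Style 1: return a bool, so the caller has to assert on the result,
    # e.g. assert wait_returning_bool(...), "Backup has failed!"
    deadline = time.time() + timeout
    while time.time() < deadline:
        if check():
            return True
        time.sleep(poll)
    return False


def wait_raising(check, timeout, poll=5):
    # Style 2: raise on timeout, so the call site needs no assertion at all.
    deadline = time.time() + timeout
    while time.time() < deadline:
        if check():
            return
        time.sleep(poll)
    raise WaitForTimeoutError(f"condition not reached within {timeout}s")
```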

@regevran commented:

The units in the issue should be [s] --> [h:m:s]

@kreuzerkrieg force-pushed the backup-baseline branch 2 times, most recently from 8c6061d to 886e8cd on November 24, 2024 12:51
@kreuzerkrieg force-pushed the backup-baseline branch 2 times, most recently from 660fa90 to b16c2d4 on November 24, 2024 15:13
@kreuzerkrieg (Contributor, Author) replied:

> The units in the issue should be [s] --> [h:m:s]

That's just how it is presented, I think - I send seconds to it and it renders them as a time.
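
For reference, turning a duration given in seconds into the h:m:s form shown above is trivial (an illustrative snippet; how Argus itself renders the value is not shown here):

```python
import datetime

def format_duration(seconds: int) -> str:
    """Render a number of seconds as H:MM:SS, e.g. 618 -> '0:10:18'."""
    return str(datetime.timedelta(seconds=seconds))

print(format_duration(618))  # 0:10:18 - compare the 'Backup' row above (00:10:18)
```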

@Michal-Leszczynski commented:

|                           | backup time [s] | upload time [s] | total [s] |
|---------------------------|-----------------|-----------------|-----------|
| Backup times              | 00:00:12        | 00:09:45        | 00:09:11  |
| Backup during read stress | 00:00:12        | 00:15:07        | 00:14:43  |

From the code, I see that backup time is the total backup time without the upload, right? It would be nice to indicate that with a more descriptive column name.

Also, why is the total smaller than the upload time?

@kreuzerkrieg (Contributor, Author) replied:

> |                           | backup time [s] | upload time [s] | total [s] |
> |---------------------------|-----------------|-----------------|-----------|
> | Backup times              | 00:00:12        | 00:09:45        | 00:09:11  |
> | Backup during read stress | 00:00:12        | 00:15:07        | 00:14:43  |
>
> From the code, I see that backup time is the total backup time without the upload, right? It would be nice to indicate that with a more descriptive column name.

Should I name it "snapshot time"?

> Also, why is the total smaller than the upload time?

Good question - the total time is taken from the backup task.duration, while the rest is measured in place, and it doesn't add up. Ideas?
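
To make the mismatch concrete, the two kinds of numbers come from different measurement paths, roughly like this sketch (other than wait_for_uploading_stage and task.duration, which are mentioned in the thread, the helper names are assumptions for illustration):

```python
# Illustrative sketch of the two measurement paths; exact helper names are assumptions.
import time


def measure_backup(task):
    # "backup time" and "upload time": wall-clock intervals measured in the test,
    # bracketing the waits for the corresponding task stages.
    start = time.perf_counter()
    assert task.wait_for_uploading_stage(timeout=200000), "Backup has failed!"
    time_to_upload_stage = time.perf_counter() - start

    start = time.perf_counter()
    task.wait_for_completion(timeout=200000)  # hypothetical wait for the task to finish
    upload_time = time.perf_counter() - start

    # "total": the duration the Manager itself reports for the task, so it
    # does not have to equal time_to_upload_stage + upload_time.
    return time_to_upload_stage, upload_time, task.duration
```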

@Michal-Leszczynski replied:

> Should I name it "snapshot time"?

It's not only about the snapshot (e.g. we also fetch the schema, create manifests, etc.).
In general I'm not sure why we need to report the time up to the upload stage - is it useful?
But I see that the upload time is also not only about the upload (e.g. purging the backup location of unnecessary files).
I will leave some comments in the changed files in a bit.

I created an issue about displaying backup upload bandwidth/duration - in the future it could make things easier for such benchmarks.

@kreuzerkrieg (Contributor, Author) replied:

> > Should I name it "snapshot time"?
>
> It's not only about the snapshot (e.g. we also fetch the schema, create manifests, etc.). In general I'm not sure why we need to report the time up to the upload stage - is it useful? But I see that the upload time is also not only about the upload (e.g. purging the backup location of unnecessary files). I will leave some comments in the changed files in a bit.
>
> I created an issue about displaying backup upload bandwidth/duration - in the future it could make things easier for such benchmarks.

Now that I see it is a negligible slice of the whole process, I guess it is worth just dropping it.

Commit: Add `test_backup_benchmark` test which measures backup time under read stress test
@mikliapko (Contributor) left a review comment:

LGTM

@kreuzerkrieg (Contributor, Author) commented:

@scylladb/qa-maintainers could you push it forward please?

@soyacz (Contributor) commented on Nov 27, 2024:

I briefly looked at the 100GB dataset - to me, ingestion of this size should take a bit more than 7m - did you verify the correctness of this element?
Also, there's no jenkinsfile related to this test - how is it going to be used?

@kreuzerkrieg (Contributor, Author) commented on Nov 27, 2024:

> I briefly looked at the 100GB dataset - to me, ingestion of this size should take a bit more than 7m - did you verify the correctness of this element?

This test uses the existing ingestion machinery, so I assume that ingestion works.
In any case, it looks like it takes more or less 7 or 8 minutes:

[screenshot]

> Also, there's no jenkinsfile related to this test - how is it going to be used?

Like any other SM test - just drop the right test name into the existing Jenkins pipeline:

[screenshot]

@soyacz (Contributor) commented on Nov 27, 2024:

OK, I did the calculations a bit wrong the first time - it looks like 7m should be enough with this throughput.
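
For the record, the back-of-the-envelope arithmetic behind that estimate looks like this (my own illustrative numbers; the assumed throughput is not taken from the PR):

```python
# Rough ingestion-time estimate for a 100 GB dataset at an assumed aggregate write throughput.
dataset_mb = 100 * 1000                # ~100 GB expressed in MB
assumed_throughput_mb_s = 250          # assumed aggregate cluster write throughput (MB/s)
minutes = dataset_mb / assumed_throughput_mb_s / 60
print(f"~{minutes:.1f} minutes")       # ~6.7 minutes, in line with the 7-8 minutes observed
```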

@soyacz merged commit 1546416 into scylladb:master on Nov 27, 2024; 7 checks passed.
Labels: backport/none (Backport is not required), promoted-to-master
Merging this pull request may close: SCT for backup (#8752)