
fix(eviction): Tune eviction threshold in cache mode #4142

Open · wants to merge 9 commits into base: main

Conversation

@BagritsevichStepan (Contributor) commented Nov 18, 2024

fixes #4139

@BagritsevichStepan changed the title from "cache(cache_mode): Tune eviction threshold in cache mode" to "fix(cache_mode): Tune eviction threshold in cache mode" on Nov 18, 2024
@BagritsevichStepan force-pushed the memory/tune-eviction-threshold-in-cache-mode branch from cdd7fc0 to b5e1611 on November 18, 2024
@BagritsevichStepan changed the title from "fix(cache_mode): Tune eviction threshold in cache mode" to "fix(eviction): Tune eviction threshold in cache mode" on Nov 18, 2024
@BagritsevichStepan force-pushed the memory/tune-eviction-threshold-in-cache-mode branch 5 times, most recently from 5e6c873 to 2d7dd27 on November 25, 2024
@BagritsevichStepan self-assigned this on Nov 25, 2024
@BagritsevichStepan (Contributor, Author) commented:

@adiholden I believe I've identified the issue: I'm calculating rss_threshold for each shard individually, but the current rss_memory value is a global metric for all shards.
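For illustration only, a minimal sketch of the per-shard adjustment this implies. rss_mem_current and shard_set are assumed here to be the process-wide RSS counter and the shard registry, and splitting the global RSS evenly across shards is an assumption of this sketch, not necessarily the PR's final approach.

// Sketch (assumed names): the kernel reports RSS for the whole process,
// so a per-shard estimate has to be derived from the global value.
const size_t global_rss = rss_mem_current.load(std::memory_order_relaxed);
const size_t rss_per_shard = global_rss / shard_set->size();  // even-split assumption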

@BagritsevichStepan force-pushed the memory/tune-eviction-threshold-in-cache-mode branch 3 times, most recently from 9bd58cc to fe22e03 on November 26, 2024
@BagritsevichStepan force-pushed the memory/tune-eviction-threshold-in-cache-mode branch 3 times, most recently from 692a3fa to 051412e on December 2, 2024
@BorysTheDev previously approved these changes on Dec 2, 2024
@BagritsevichStepan (Contributor, Author) commented Dec 3, 2024

Update:

  1. Found a bug where using ssize_t could still cause an overflow. I refactored the code and completely removed the use of ssize_t during the heartbeat.
  2. Identified why DflyEngineTest.Bug207 was failing. The issue was that we were only evicting items during AddOrFindInternal. During each heartbeat we expected to evict a significant amount of data but successfully evicted 0 bytes, because the strings were small and inlined, so HasAllocated returned false for them (this issue also exists in the main branch). To fix this, I increased the length of the values in the test.
  3. Added the eviction_memory_budget_threshold flag with a default value of 0.1. This value can be increased to start eviction earlier (a rough sketch of how the threshold feeds the eviction goal follows this list).
  4. Added additional logs for better debugging.
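For context, a minimal sketch of how such a threshold could translate into a per-heartbeat eviction goal. The flag name comes from this PR, but ComputeEvictionGoal and its parameters are hypothetical names used only for illustration, not the actual implementation.

// Hypothetical sketch: eviction starts once free memory drops below
// eviction_memory_budget_threshold * max_memory (default threshold 0.1).
size_t ComputeEvictionGoal(size_t max_memory, size_t used_memory, double threshold) {
  const size_t budget = static_cast<size_t>(threshold * max_memory);
  const size_t free_memory = max_memory > used_memory ? max_memory - used_memory : 0;
  // Unsigned arithmetic throughout, in line with removing ssize_t from the heartbeat.
  return free_memory < budget ? budget - free_memory : 0;
}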


assert (
    memory_info["used_memory_rss"] > rss_before_connections
), "RSS memory should have increased."
Collaborator:
This is a check to do after adding the connections, I believe.
At this stage of the test I think you should check that RSS usage went below the threshold (rss_max_memory * 0.9).

const double rss_oom_deny_ratio = ServerState::tlocal()->rss_oom_deny_ratio;

const bool should_evict =
    eviction_memory_budget_threshold > 0.0 && GetFlag(FLAGS_enable_heartbeat_eviction);
Collaborator:
Why do you check eviction_memory_budget_threshold > 0.0?
Nit: maybe rename this var should_evict -> eviction_enabled.


if (should_evict_depending_on_rss_memory) {
  // Calculate how much rss memory is used by all shards
  io::Result<io::StatusData> sdata_res = io::ReadStatusInfo();
Collaborator:
This RetireExpiredAndEvict function runs in the heartbeat flow, which runs every 10ms.
Today UpdateMemoryGlobalStats runs every 100ms, and I believe there is a reason for that: to avoid reading the RSS data so many times. With this logic you now perform this read several times within the heartbeat flow (once per DB), and I don't think we want this.
Furthermore, above you use used_mem_current, which is updated in UpdateMemoryGlobalStats and will not get updated in this for loop.

Contributor Author:
But used_mem_current can be updated by other threads

Collaborator:
I understand. The flow is indeed very confusing!
I suggest making this simpler by calculating the bytes goal to evict before the for loop over the DBs; once you enter the for loop you already have the goal, and you update it with the evicted-bytes value returned from FreeMemWithEvictionStep (a rough sketch follows).
At a later stage (not in this PR) we can improve this so the RSS fetch in this flow is refreshed after an eviction step if needed.
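For illustration, a rough sketch of the suggested structure. FreeMemWithEvictionStep and its evicted-bytes return value come from the review above, but the simplified signature, db_count, and the EvictionGoalBytes helper are assumptions of this sketch.

// Compute the goal once before the loop, then shrink it by whatever each DB evicts.
size_t goal_bytes = EvictionGoalBytes();  // hypothetical helper computed from memory stats
for (unsigned dbid = 0; dbid < db_count && goal_bytes > 0; ++dbid) {
  // Assumed to return the number of bytes actually evicted for this DB.
  const size_t evicted = FreeMemWithEvictionStep(dbid, goal_bytes);
  goal_bytes = evicted >= goal_bytes ? 0 : goal_bytes - evicted;
}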

// If rss_oom_deny_ratio is set, we should evict depending on rss memory too
const bool should_evict_depending_on_rss_memory = should_evict && rss_oom_deny_ratio > 0.0;

size_t shard_memory_budget_threshold;
Collaborator:
In addition, I believe it would be more readable if you wrote a function that returns the byte goal for eviction in this heartbeat invocation. It would contain all of the new logic you introduced, and you would call it only if eviction is enabled. One possible shape is sketched below.
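For illustration, one possible shape of such a helper. Combining the heap budget with the RSS limit mirrors the discussion above, but the function name, parameters, and exact formula are assumptions of this sketch rather than the PR's code.

// Hypothetical helper: how many bytes this heartbeat should try to evict.
size_t HeartbeatEvictionGoalBytes(size_t max_memory, size_t used_memory, size_t rss_memory,
                                  double memory_threshold, double rss_oom_deny_ratio) {
  size_t goal = 0;
  // Heap part: keep at least memory_threshold * max_memory free.
  const size_t budget = static_cast<size_t>(memory_threshold * max_memory);
  const size_t free_memory = max_memory > used_memory ? max_memory - used_memory : 0;
  if (free_memory < budget)
    goal = budget - free_memory;
  // RSS part: only considered when rss_oom_deny_ratio is configured.
  if (rss_oom_deny_ratio > 0.0) {
    const size_t rss_limit = static_cast<size_t>(rss_oom_deny_ratio * max_memory);
    if (rss_memory > rss_limit && rss_memory - rss_limit > goal)
      goal = rss_memory - rss_limit;
  }
  return goal;
}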


Successfully merging this pull request may close these issues: remove oom_deny_ratio flag
3 participants