Move output IO throttler to IO queue level #2332

xemul · 2024-07-05T15:16:35Z

IO queue is responsible for two things:

cross-class shares-based fair balancing
throttling the flow of dispatching requests to obey disk model

The former is done with the help of fair_queue that implements rather simple "virtual capacity" model. The latter is done with the help of shared token bucket, but historically the throttling code was implemented as a part of fair_queue. That's not correct, fair queue has nothing to do with output flow control. This PR moves the throttling code out of fair_queue and makes IO-queue own it and use.

Current tests on fair queue try to make the queue submit requests in extremely controllable way -- one-by-one. However, the fair queue nowadays is driven by rated token bucket and is very sensitive to time and durations. It's better to teach the test accept the fact that it cannot control fair-queue requests submissions on per-request granularity and tunes its accounting instead. The change affects two places. Main loop. Before the change it called fair_queue::dispatch_requests() as many times are the number of requests test case wants to pass, then performed the necessary checks. Now, the method is called infinitely, and the handling only processes the requested amount of requests. The rest is ignored. Drain. Before the change it called dispatch_requests() in a loop until it returned anything. Now it's called in a loop until fair queue explicitly reports that it's empty. Signed-off-by: Pavel Emelyanov <[email protected]>

The class in question only controls the output flow of capacities, it's not about fair queueing at all. There is an effort to make cross-shard fairness, that needs fair_group however, but we're not yet there. refs: scylladb#2294 Signed-off-by: Pavel Emelyanov <[email protected]>

The io_group carries throttle instances for each of the streams, but is still named as _f[air]g[roup]s. Effectively, this is the continuation of the previous patch. Signed-off-by: Pavel Emelyanov <[email protected]>

The move is very smoth, no changes in code needed but the definition of throttle::capacity_t. It used to be the same as fair_queue_entry::capacity_t, since we use the same notion of capacity for both -- fair queuing and capacity throttling. Signed-off-by: Pavel Emelyanov <[email protected]>

The capacity grabbing method now accepts fq entry referece, but this code is going to be moved to throttle.(cc|hh) which doesn't work with fq entries. Signed-off-by: Pavel Emelyanov <[email protected]>

Shard-local fair_queue class makes some shard local bookkeeping of throttled capacities. Shared bookkeeping was done by fair_group which was renamed and moved away, now it's time to collect shard-local throttle facilities. Signed-off-by: Pavel Emelyanov <[email protected]>

No function changes, just move the code. Signed-off-by: Pavel Emelyanov <[email protected]>

IO queue uses the concept of "stream" to handle duplex devices. The stream is now a fair_queue, but for future patching it's needed to put more onto stream, so this patch wraps the on-io-queue fair_queues into helper struct stream. Signed-off-by: Pavel Emelyanov <[email protected]>

The fair queue code doesn't "know" which entries it schedules. For that there's an "abstract" fair_queue_entry that are scheduled and to dispatch those the caller must provide std::function<> callback to the fair_queue::dispatch_requests() method. Next patches will need to make fair_queue call more code on those entries. Not to introduce another std::function<> callback, change the dispatch indirection from callback to pure virtual .dispatch() method of the fair_queue_entry. Signed-off-by: Pavel Emelyanov <[email protected]>

When fair_queue tries to dispatch a request it tries to grab the capacity from throttler on its own. That's wrong place, fair group is just about providing cross-classes fairness of requests. Throttling output stream of requests should be done by io_queue itself. Signed-off-by: Pavel Emelyanov <[email protected]>

Those notifications were used to update output throttler with the result of requests dispatching/cancellation/etc. Now when the throttler is on the io_queue level, fair queue can be freed from this duty. Signed-off-by: Pavel Emelyanov <[email protected]>

The method is empty now. Signed-off-by: Pavel Emelyanov <[email protected]>

pwrobelse · 2024-07-06T21:30:56Z

src/core/io_queue.cc

@@ -297,6 +298,10 @@ class queued_io_request final : private internal::io_request, public fair_queue_
    queued_io_request(queued_io_request&&) = delete;
    ~queued_io_request() = default;

+    virtual throttle::grab_result can_dispatch() const noexcept {


It seems that the compiler on CI complains that it does not use override keyword.

io_queue.cc:301:35: error: 'can_dispatch' overrides a member function but is not marked 'override' [-Werror,-Winconsistent-missing-override] 301 | virtual throttle::grab_result can_dispatch() const noexcept {

pwrobelse · 2024-07-06T21:35:17Z

src/core/fair_queue.cc

 void fair_queue::dispatch_requests() {
    capacity_t dispatched = 0;
    boost::container::small_vector<priority_class_ptr, 2> preempt;

-    while (!_handles.empty() && (dispatched < _throttle.per_tick_grab_threshold())) {


Minor: dispatched variable becomes unused after the removal. The compiler on CI complains about it:

fair_queue.cc:238:16: error: variable 'dispatched' set but not used [-Werror,-Wunused-but-set-variable] 238 | capacity_t dispatched = 0;

I am not entirely sure why the compiler raises error -- below there is dispatched += req_cap;

Because this variable becomes write-only, which effectively means unused. This place is obsoleted with #2294 anyway :(

pwrobelse · 2024-07-08T08:07:38Z

src/core/io_queue.cc

@@ -602,17 +602,17 @@ shared_throttle::config io_group::make_throttle_config(const io_queue::config& q
 }

 std::chrono::duration<double> io_group::io_latency_goal() const noexcept {
-    return _fgs.front().rate_limit_duration();
+    return _throttle.front().rate_limit_duration();
 }

 io_group::io_group(io_queue::config io_cfg, unsigned nr_queues)
    : _config(std::move(io_cfg))
    , _allocated_on(this_shard_id())
 {
    auto fg_cfg = make_throttle_config(_config);


Minor: maybe we could also adjust the variable name to throttle_cfg?

tests/unit/fair_queue_test.cc

pwrobelse · 2024-07-08T08:17:42Z

src/core/fair_queue.cc

@@ -161,7 +161,7 @@ bool fair_queue::class_compare::operator() (const priority_class_ptr& lhs, const
    return lhs->_accumulated > rhs->_accumulated;
 }

-fair_queue::fair_queue(fair_group& group, config cfg)
+fair_queue::fair_queue(shared_throttle& group, config cfg)


Minor: in fair_queue.hh the names of parameters of the constructor do not match the ones that are used here. In the header there is fair_queue(shared_throttle& shared, config cfg).

Do we want to keep both places consistent?

Yes, I will need to update header name too

tests/perf/fair_queue_perf.cc

pwrobelse · 2024-07-08T08:24:06Z

include/seastar/core/internal/throttle.hh

+/// This is a fair group. It's attached by one or mode fair queues. On machines having the
+/// big* amount of shards, queues use the group to borrow/lend the needed capacity for
+/// requests dispatching.


Minor: just a short question about the comment with description. In previous patches we renamed fair_group --> shared_throttle. Are any changes needed in the case of the description? It uses the old nomenclature.

Ah, yes. I fixed this comment in one of the rebases and lost it along the way

pwrobelse · 2024-07-08T08:26:34Z

src/core/throttle.cc

+        , _per_tick_threshold(_token_bucket.limit() / nr_queues)
+{
+    if (tokens_capacity(cfg.min_tokens) > _token_bucket.threshold()) {
+        throw std::runtime_error("Fair-group replenisher limit is lower than threshold");


Minor: we renamed fair_group to shared_throttle. Do we want to update also this error message?

Yup, good catch

pwrobelse · 2024-07-08T08:35:32Z

src/core/throttle.cc

+#ifdef SEASTAR_MODULE
+module;
+#endif
+
+#include <seastar/core/internal/throttle.hh>


A general question related to the build with modules -- as far as I understand ./src/seastar.cc includes all headers to provide the declarations and definitions when we build the code with modules.

The source files (e.g. fair_queue.cc, io_queue.cc and reactor.cc) use ifndef e.g.:

#ifdef SEASTAR_MODULE module seastar; #else // include the headers related to seastar #endif

The comment from seastar.cc:

// include all declaration and definitions to be exported in the the module // purview

My question is as follows -- do we want to follow the same approach in throttle.cc? Also, is there an explicit rule in seastar project related to inclusion of files and SEASTAR_MODULE macro?

I'm not sure if there's any set of rules on how to manipulate SEASTAR_MODULE and related. @tchaikov works on modules support, he can shed more light on this

@pwrobelse hi Patryk, thank you for bringing this up.

seastar.cc includes both the global module fragment and the purview of the seastar module. we put the implementation into the module implementation units which starts with module seastar, for instance, see fair_queue.cc as you pointed out.

when it comes to throttle.cc, it needs to access the seastar module, so i think we would need to start it with module seastar; instead of #include <seastar/core/internal/throttle.hh> when C++20 modules is enabled, which is checked using #ifdef SEASTAR_MODULE in general.

I'm not sure if there's any set of rules on how to manipulate SEASTAR_MODULE and related.

@xemul good point. i will note this down. could you suggest which document i should put this in? as its target audience will be library developer, instead of library user, the closest one i can find is coding-style.md. but this document focuses on coding style, it's more about taste, instead of the guideline on the general design or a perspective of the code structure. shall i create another document on C++20 modules?

pwrobelse · 2024-07-08T09:39:38Z

tests/perf/fair_queue_perf.cc

@@ -103,10 +92,9 @@ future<> perf_fair_queue::test(bool loc) {

    auto invokers = local_fq.invoke_on_all([loc] (local_fq_and_class& local) {
        return parallel_for_each(boost::irange(0u, requests_to_dispatch), [&local, loc] (unsigned dummy) {
-            auto cap = local.queue(loc).tokens_capacity(double(1) / std::numeric_limits<int>::max() + double(1) / std::numeric_limits<int>::max());
-            auto req = std::make_unique<local_fq_entry>(cap, [&local, loc, cap] {
+            auto req = std::make_unique<local_fq_entry>(100, [&local, loc] {


Just to double-check -- this patch replaces the calculation with hard-coded value of 100. Was the value the same before? Does it matter for the test?

Do I understand correctly that due to returning throttle::grab_result::grabbed from can_dispatch() the capacity does not matter to the test? Because we always can dispatch (because that's how the test is now) ?

Yes, this capacity can be any, since can_dispatch() always says "yes you can"

pwrobelse · 2024-07-08T09:59:55Z

tests/unit/fair_queue_test.cc

@@ -101,7 +90,6 @@ class test_env {
    unsigned tick(unsigned n = 0) {
        unsigned processed = 0;
        while (true) {
-            _fg.replenish_capacity(_fg.replenished_ts() + std::chrono::microseconds(1));
            _fq.dispatch_requests();


Do I understand correctly that after defining:

virtual throttle::grab_result can_dispatch() const noexcept override { return throttle::grab_result::grabbed; }

in request class, then this line will dispatch all request that were queued before? When reading the code of dispatch_requests() I had the impression, that it is the case.

If so, are we leaving the code as it was (mainly while(true) part) just in case if anybody wanted to define another request type that do not allow to be dispatched immediately? Is this while(true) needed (of course if I did not misunderstand the code) ?

I suspect, that this is the root cause of the failure of the test.

Let's take test_fair_queue_longer_run_different_shares as an example.

We firstly queue a lot of ops:

for (int i = 0; i < 20000; ++i) { env.do_op(a, 1); env.do_op(b, 0); }

And then we want to let them in over time:

for (int i = 0; i < 1000; ++i) { sleep(1ms).get(); env.tick(3); }

However, because all requests can be dispatched (new implementation of can_dispatch()) and the first call to tick(3) invokes dispatch_requests() that submits all requests at once, then the second call to tick(3) is stuck in infinite loop.

When I added a debug print in the loop that calls tick(3), then I saw:

23: Call for i = 0 23: Call for i = 1 <reactor stall errors>

This test fails because there's no (easy) way to tell dispatch loop to stop dispatching. So I marked this PR as draft to fix it later

pwrobelse · 2024-07-08T10:51:14Z

src/core/io_queue.cc

@@ -551,7 +555,6 @@ void
 io_queue::complete_request(io_desc_read_write& desc) noexcept {
    _requests_executing--;
    _requests_completed++;
-    _streams[desc.stream()].notify_request_finished(desc.capacity());


It seems, that the function does not need to take a parameter -- it is just responsible for accounting.

pwrobelse · 2024-07-08T10:53:31Z

src/core/io_queue.cc

+        : fq(std::move(cfg))
+        , thr(st)
+{ }
+
 class io_desc_read_write final : public io_completion {


Because of the removal of the need to notify about completion, this class does not need to keep _stream and fq_capacity members.

pwrobelse · 2024-07-08T14:26:02Z

src/core/fair_queue.cc

    capacity_t dispatched = 0;
    boost::container::small_vector<priority_class_ptr, 2> preempt;

-    while (!_handles.empty() && (dispatched < _group.per_tick_grab_threshold())) {


We removed here the check related to per_tick_grab_threshold() -- when I search for that function call in this PR then we did not introduce its usage anywhere.

Is it expected?

Yes. This per-tick threshold is completely removed with #2294

pwrobelse · 2024-07-08T15:06:02Z

tests/unit/fair_queue_test.cc

I pulled the change to my local repository and fair_queue_test.cc fails.

ACK, fails for me too

There's one interesting even that is generated by fair-queue -- when the queue starts waiting for shared capacity. In fact, the output throttler code should exist on IO-queue level (see scylladb#2332), but currently it's in the fair-queue, so it will need to generate this event. This patch adds tracer reference on fair-queue for future patching. Signed-off-by: Pavel Emelyanov <[email protected]>

xemul added 12 commits July 5, 2024 17:42

io_queue: Rename _fgs to _throttle

f2229aa

The io_group carries throttle instances for each of the streams, but is still named as _f[air]g[roup]s. Effectively, this is the continuation of the previous patch. Signed-off-by: Pavel Emelyanov <[email protected]>

fair_queue: Grab request capacity by capacity value

c618cdf

The capacity grabbing method now accepts fq entry referece, but this code is going to be moved to throttle.(cc|hh) which doesn't work with fq entries. Signed-off-by: Pavel Emelyanov <[email protected]>

fair_queue: Move shard-local throttle closer to shared_throttle

8f1e3f9

No function changes, just move the code. Signed-off-by: Pavel Emelyanov <[email protected]>

io_queue: Remove empty notifications

f3a3afd

The method is empty now. Signed-off-by: Pavel Emelyanov <[email protected]>

xemul requested a review from pwrobelse July 5, 2024 15:16

pwrobelse reviewed Jul 6, 2024

View reviewed changes