-
Notifications
You must be signed in to change notification settings - Fork 694
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use io_uring to batch handle clients pending writes to reduce SYSCALL count. #112
base: unstable
Are you sure you want to change the base?
Conversation
If you merge latest unstable, the spellcheck is fixed. Can you add a check for
|
I have not taken a closer look at the PR but at the minimum I think we need a config to opt in io_uring. I am also suspecting that the gain is likely coming from tapping into the additional cores that couldn't be utilized efficiently by the current io-threads. If so this would lead to two questions
Lastly,I haven't checked recently but in the past, io-uring has seen quite amount of vulnerabilities. So in addition to the design and PR review, we should also take a hard look at the security implications. |
The core numbers will not affect the perf gain, you can refer the server config I post in top comment, io-threads is disabled, the maximum CPU utilized is 1, the benchmark clients will make sure server CPU is fully utilized. Just as I described in top comment, the gain benefit from the reduce of SYSCALL .
More context about the multi threading improvements from community?
I am not security experts, can you give more details about your concern, will this a blocker for community to adopt io_uring? |
io-uring comes with busy polling outside of the Valkey (io/main) threads. Does this CPU usage include that or just the CPU cycles accumulated by the Valkey threads? Going back to your original post, it seems to indicate this is just the Valkey CPU usage? I think a more deterministic setup would be using a 2/4 core machine.
https://security.googleblog.com/2023/06/learnings-from-kctf-vrps-42-linux.html
I would say this would be a serious concern for me. There are two possible outcomes
|
Actually, I didn't use the busy polling model of io_uring in this patch, no background threads started of io_uring, all the cycles generated by io_uring are classified into Valkey server. Just as I post in top comment, the perf gain benefit from the feature of https://github.com/axboe/liburing/wiki/io_uring-and-networking-in-2023#batching
I can get a similar result with 2-4 CPUs allocated.
Just glanced the vulns list: https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=io_uring, seems most of the fixes happened in the kernel side, there isn't application developers can do. I think we can keep supporting the io_uring, user can disable the io_uring if they want. |
I think this is really good stuff. Performance is one of the areas we should prioritize IMO. |
Thanks @zuiderkwast, the question is how can we push this patch forward? |
@lipzhu We are a new team, new project, first release and we are still busy with rebranding from redis to valkey and getting a website up. I think you just need some patience to let other team members have some time to look and think about it. I'll add it to the backlog for Valkey 8. It will not forgotten. |
Yeah this kind of changes requires the reviewers to block off some decent amount of time and to really think through it holistically. This week is really busy for the team as many of us are having in person engagement with the OSS community. We really appreciate your patience. |
As far as I know, IO_Uring is a high efficient IO engine. |
Sure, but at the beginning when we decide to introduce io_uring, we want to search the scenarios that io_uring really helps on perf gain, this patch is straightforward. |
Thanks @lipzhu! I am generally aligned with the high level idea (and good to know that you don't use polling). I do have some high level feedback around the code structure and I will list them here too
BTW, I don't have all the details on #22 at the moment so there is a chance that we might have to revisit/rethink this PR, depending on the relative pace of the two. That said, let's continue collaborating on this PR, assuming we would like to incorporate io_uring in Valkey. |
@lipzhu, looking at your results above, the amount of the read calls jumps out too. It will be great if you could apply io-uring to the query path as well. |
@PingXie Thanks for your comments :)
The counter data is based on the time duration(10s), each query is pair to
@PingXie I have done this before, but some issues I found are:
|
Ok, I will introduce a new config like
I will refactor the code to explore a cleaner separation.
The reason I didn't do for scatter/gather IOs because I remember I didn't observe the perf gain with io_uring, I will double confirm this later, for the replication stream, thanks to point it out, I will check it later.
Sure, thanks for your guidance and patience. |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## unstable #112 +/- ##
============================================
- Coverage 70.54% 70.53% -0.02%
============================================
Files 114 115 +1
Lines 61644 61718 +74
============================================
+ Hits 43488 43530 +42
- Misses 18156 18188 +32
|
Update for this patch:
I suggest to only use io_uring for the clients static buffer write in first phase, most of the static response buffer is small which make total count of syscall is high and cycles ratio of SYCALL write is high correspondingly and the perf gain is significant indeed. @PingXie @zuiderkwast What do you think? |
958ff60
to
0e9afa9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this looks mostly good now. It has some refactorings that will conflict with the Async IO threading feature, so I think we should merge the async IO theading first.
The error messages and log messages can probably be improved, but I will review those later.
b3169ab
to
f6c6dd7
Compare
Signed-off-by: Lipeng Zhu <[email protected]>
@lipzhu - are the performance numbers in the PR description updated? If not, can you help re-benchmark the improvements? Let's make sure there is still meaningful improvement with async IO changes merged before diving into the code review? |
@PingXie Thanks, I just updated the performance boost info in top comments, we can still have 6% performance boost based on the SET/GET benchmark. |
Thanks, @lipzhu! Sorry I didn't make it clear earlier. I don't think the current test setup (controlling CPU allocation via |
@PingXie I setup an environment which separate server and client and double confirm the perf boost. Below are the brief summary of my local test env. Start server.
Start client.
|
Signed-off-by: Lipeng Zhu <[email protected]>
Just a quick confirmation - there were 8 CPUs in total and 8 io-threads in these tests? |
Both server and client have 8 CPUs. Doesn't enable |
Signed-off-by: Lipeng Zhu <[email protected]>
Io-threading is important because both io-threading and io-uring are targeted at the same problem, which is how to better utilize the CPUs on the system. It is not a fair comparison when one test can use only one CPU (when io-uring is off) while the other can use other CPUs via io-uring. In a broader sense, io-uring is essentially a more generic form of io-threading done in the kernel. Do you mind trying out the tests one more time but with 8 io-threads? |
Actually, io_uring will not steal CPU resource in this scenario. Maybe you are talking about the feature of async_io_threads which can be set by IOSQE_ASYNC, but we didn't use this feature. The perf also shows that only 1 CPU resource was used during the test and the IPC also increase 6%, not from more CPU resources. perf stat -p `pidof valkey-server` sleep 10
# w/o io_uring
Performance counter stats for process id '2267781':
9,993.95 msec task-clock # 0.999 CPUs utilized
625 context-switches # 62.538 /sec
0 cpu-migrations # 0.000 /sec
94,933 page-faults # 9.499 K/sec
33,894,880,825 cycles # 3.392 GHz
39,284,579,699 instructions # 1.16 insn per cycle
7,750,350,988 branches # 775.504 M/sec
73,791,242 branch-misses # 0.95% of all branches
169,474,584,465 slots # 16.958 G/sec
39,212,071,735 topdown-retiring # 23.1% retiring
11,962,902,869 topdown-bad-spec # 7.1% bad speculation
43,199,367,984 topdown-fe-bound # 25.5% frontend bound
75,159,711,305 topdown-be-bound # 44.3% backend bound
10.001262795 seconds time elapsed
# w/ io_uring
Performance counter stats for process id '2273716':
9,970.38 msec task-clock # 0.997 CPUs utilized
1,077 context-switches # 108.020 /sec
1 cpu-migrations # 0.100 /sec
124,080 page-faults # 12.445 K/sec
33,813,062,268 cycles # 3.391 GHz
41,455,816,158 instructions # 1.23 insn per cycle
8,063,017,730 branches # 808.697 M/sec
68,008,453 branch-misses # 0.84% of all branches
169,066,451,360 slots # 16.957 G/sec
38,077,547,648 topdown-retiring # 22.0% retiring
28,509,121,765 topdown-bad-spec # 16.5% bad speculation
41,083,738,441 topdown-fe-bound # 23.8% frontend bound
65,062,545,805 topdown-be-bound # 37.7% backend bound
10.001785198 seconds time elapsed
|
I, again, forgot this point. I am convinced by your test results. Will find time next to resume the code review :). Thanks a lot for your patience, @lipzhu! |
Whenever you get a chance, can you incorporate these performance numbers along with your test setup to the PR description so they are more discoverable? |
Thanks @PingXie.
Really appreciate your effort on this :).
Done. |
Kindly ping @PingXie @zuiderkwast . |
Signed-off-by: Lipeng Zhu <[email protected]>
Just reading through this discussion again. It seems there are no blockers for this one. @PingXie Can we continue to get it merged? Recap: It doesn't use more CPUs, just less syscalls. I remember we did a lot of reviewing and discussion on #599 but in the end we didn't merge it because it affects the read-committed guarantees. But whatever we decided about io-uring configs and abstraction should apply to this PR as well. @lipzhu maybe you want to refresh the memory. |
Yeah that is the tl;dr.
I am aligned directionally. That said, I think a more seamless integration could be achieved at a lower level, specifically within the connection layer, in @lipzhu what do you think? BTW, we've decided to phase out the use of the |
@zuiderkwast @PingXie Thanks for revisiting this thread.
+1, I will recover the context :) and try to refactor it based on the suggestion. Booked by the other things recently, sorry for the late response. |
@PingXie after reconsidering the suggestion, I realized that integrating io_uring into the connection layer would be somewhat complex. Or maybe I'm overcomplicating it? :) The existing connection implementations (socket, unix, tls, rdma) all use synchronous methods, and the high-level call stack (e.g. This patch attempts to submit write requests to io_uring, which will handle them asynchronously, and later retrieve all the clients’ written bytes in a batch mode. It is somewhat similar to what io-threads did. How about adding a new method like |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this approach in general. I'd be fine with merging this soon. It's a first introduction of io_uring.
I don't know if there is a better integration point for this, such as in the connection abstraction or in the event loop logic, but the abstractions can be improved in the future, as well as using uring for more things, like reads and fsync.
@pizhenwei you have created much of the connection abstraction. Do you have any ideas about how we can integrate io_uring?
} | ||
|
||
void freeIOUring(void) { | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These dummy stubs are never called, right? They're defined just to make it compile for when we don't have liburing?
Should we mark them as dead code in some way as? Assert that they're never called?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assert sounds like a great idea.
I also like @pizhenwei's suggestion of failing the dummy initIOUring
, which should be sufficient to cover the user errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, they are unused, just to make sure it can compile.
Should we mark them as dead code in some way as? Assert that they're never called?
How about adding the __attribute__((unused))
?
if (server.io_uring_enabled && server.io_threads_num == 1) { | ||
/* Currently, we only use io_uring to handle the static buffer write requests. | ||
* If io-threads or tls is enabled, skip the io_uring. */ | ||
return connIsTLS(c->conn) == 0 && getClientType(c) != CLIENT_TYPE_REPLICA && listLength(c->reply) == 0 && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These conditions don't cover RDMA. Does it work or should we exclude that too? What about other fake clients, like the fake client used from Lua?
Rather then defining a negated condition for skipping it, like "not TLS", it's usually better to a have a positive condition for when it's known to work. In the future, we may add more connection types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi,
In fact, RDMA supports async API only, it uses send queue
, receive queue
and completion queue
, but they are different from IO uring queue, so IO uring can't support RDMA.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah I think we can safely say RDMA and io-uring are "mutually exclusive" or "non-overlapping".
@lipzhu If we want to use IO uring from within the connection abstractions, maybe we can try to change the connection abstraction to an async API? I'm just speculating.
Yes, this sounds good too. There seem to be some similarities to IO threads. Is there any chance we can use IO uring for TLS too in the future? That would be a good thing. I like an incremental approach. We don't need a perfect solution immediately IMO. |
The current connection framework(of sync APIs) and event handling framework(of both POLLIN and POLLOUT events driven) are designed for classic TCP programming API. As far as I can see, a new async IO framework would change the networking a lot. For example:
What about trying to reuse |
* point to the shared memory containing the io_uring queues. | ||
* On failure -errno is returned. */ | ||
if (io_uring_queue_init_params(IO_URING_DEPTH, _io_uring, ¶ms) < 0) return IO_URING_ERR; | ||
return IO_URING_OK; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deal IO_URING_OK
in io_uring.c
only, outside should not handle IO uring related code any more. Please convert IO_URING_OK
to C_OK
, so does IO_URING_ERR
.
#define UNUSED(V) ((void)V) | ||
#endif | ||
|
||
int initIOUring(void) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is called if server.io_uring_enabled
is true, this means a user specifies io-uring-enabled yes
. So I think error log should be printed here and return error instead of a silent error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got your point, maybe I should simply return C_ERR in the dummy initIOUring, error will be printed and exit.
void InitServerLast(void) {
bioInit();
initIOThreads();
if (server.io_uring_enabled) {
if (initIOUring() == IO_URING_ERR) {
serverLog(LL_WARNING, "Failed to initialize io_uring.");
exit(1);
}
}
set_jemalloc_bg_thread(server.jemalloc_bg_thread);
server.initial_memory_usage = zmalloc_used_memory();
}
@zuiderkwast |
I’m aligned with introducing the proper async IO framework "asynchronously" :) Since we’re adding quite a bit of complexity here, I wonder if we should enable io-uring on the IO threads as well? io-threads is likely to become more common going forward, and reducing the syscall overhead would be beneficial for IO threads too.
Yeah, but I believe io-uring can complement IO threads effectively.
That integration would need to happen in the OpenSSL library, not in Valkey. |
I think it is not easy to enable io_uring for io-threads now. The |
Hi @PingXie @zuiderkwast @pizhenwei, Thanks for your review and comments. To push this patch forward, I have summarized the conclusions based on our discussions. Please correct me if I misunderstand anything. We all agree to introduce io_uring as the proper Async IO framework. Currently, implementing it in the connection layer is complex, so let’s consider it a long-term topic. We can continue with this approach in general, addressing the review comments and resolving conflicts. |
Description
This patch try to benefit the io_uring batching feature to reduce the SYSCALL count for valkey when
handleClientsWithPendingWrites
.With this patch, we can observe more than 6% perf gain for SET/GET.
This patch was implemented based on below discussion during the review:
io_uring.h
to handle the io_uing related API to split it from server logic.io_uring.h
independent ofserver.h
.io_uring
to gain performance when write client static buffer.Benchmark Result
Test Env
Test Steps
Test Result
QPS of SET and GET can increase 6.5%, 6.6% correspondingly.
Perf Stat
The perf stat info shows that only 1 CPU resource was used during the test and the IPC also increase 6%, not from more CPU resources.
NOTE