rgw/sfs: abstract how we obtain connections, potentially change between threading models #727

jecluis · 2023-09-27T09:37:08Z

Currently, DBConn keeps an instance of Storage, which is created by sqlite_orm::make_storage(). That first instance of Storage is long-lived (the DBConn c'tor calls storage->open_forever()) so the sqlite database is open for the entire life of the program, but this first Storage object and its associated sqlite database connection pointer are largely not used for anything much after initialization. The exception is the SFS status page, which also uses this connection to report some sqlite statistics.

All the other threads (the workers, the garbage collector, ...) call DBConn::get_storage(), which returns a copy of the first Storage object. These copies don't have open_forever() called on them, which means every time they're used for queries we get a pair of calls to sqlite3_open() and sqlite3_close() at the start end end of the query. These calls don't open the main database file again (it's already open) but they do open and close the WAL. An easy way to demonstrate this is:

Run s3gw with --debug-rgw 20 --rgw-sfs-sqlite-profile, and tail its log, grepping for "SQLITE PROFILE".
Meanwhile, run strace -p $(pgrep radosgw) -f 2>&1 | grep 'openat(\|close('
Do something interesting, like make a bucket: s3cmd mb -P s3://foo
Observe that the "SQLITE PROFILE" log lines show several queries, and that the strace shows matching pairs of openat() and close() for s3gw.db-wal, one per query.

(Note: I have no idea if there's any noticeable performance impact from all those opens and closes - they might be fine, or they might not.)

This all mostly works just fine, but I (Tim) thought it was weird that we had all these connections opening and closing all the time. I later discovered there are two problems with the current implementation:

Some of the information on the SFS SQLite status page and in the prometheus metrics we expose is based on calls to sqlite3_db_status(), to which we pass the pointer to the first database connection. This means we don't see any of the activity on any of the worker threads when looking at stats like cache hits and misses and malloc counts. Maybe we don't actually care about these stats (I don't know) but in any case they're wrong/misleading given they don't pick up any of the worker activity.
The fact that we have multiple threads with multiple sqlite database connections is what allows the WAL growth problem to happen (https://github.com/aquarist-labs/s3gw/issues/681). As best as I can figure, with multiple connections and multiple threads, when one thread finishies writing and begins a checkpoint, if another thread sneaks in before the checkpoint is complete and starts another write, the WAL can't ever be truncated (some other thread is always appending). This issue was mitigated by rgw/sfs: dbconn: add WAL checkpoints, truncate if over 16MB aquarist-labs/ceph#213

One way to fix the above two problems is to switch to using one database connection which is shared by all the threads (see ramblings in aquarist-labs/ceph#209). This makes the stats accurate, and causes the WAL growth problem to not happen at all. Unfortunately @irq0 found that a single connection can be about 20x slower than multiple connections when he did some microbenchmarks:

Setup: 100000 distinct objects. 1M queries spread over $nproc threads (nproc=32). Measuring query start to finish time (execution + busy timeout)

multiple connections + sqlite multithreaded: avg rate: 11273.957159/s - all CPUs 100%
single connection + sqlite serialized: avg rate: 534.600408/s - roughly 200 % combined CPU
multiple connections + sqlite serialized: avg rate: 10387.883573/s - all CPUs 100%

So we really do want multiple connections.

SQLite has two threading modes that we care about, multithreaded and serialized (see https://www.sqlite.org/threadsafe.html). The default is serialized, but from the above it looks like multithreaded might be about 8% faster, so is probably worth experimenting with.

We've already effectively dealt with the WAL growth, so it's no problem to continue to use multiple conections from that perspective. That leaves stats gathering (assuming we care about this). If we switch to using a connection pool, e.g. one Storage object per thread, and call open_forever() on each of those (and keep copies of their connection pointers) we could later interrogate each one to get reliable stats.

Note that multithreaded mode requires that "no single database connection nor any object derived from database connection, such as a prepared statement, is used in two or more threads at the same time". A connection pool as suggested above should ensure this is met (although I suspect that our current implementation probably doesn't run afoul of this restriction given each thread has its own separate copy of Storage).

By default, radosgw has 512 worker threads, which means that in the connection pool scenario outlined above we pretty quickly end up with 512 Storage objects, even under light but repeated load (e.g. a single client doing a bunch of reads). This means we easily go over 1024 open file descriptors as each thread has the DB file and the WAL open. I doubt this is a problem in an actual k8s environment (testing on k3s shows an FD limit of 1048576) but you will trip over it in manual testing on baremetal where ulimit -n is probably still 1024.

We have a couple of ways forward here:

Get rid of the per-db-connection stats reporting (because it's currently lies), and do nothing else. Don't bother with a connection pool, don't change threading models. Everything else already works fine, or is fine enough.
Rejig DBConn::get_storage() to enable some sort of (possibly configurable) connection pool along the lines of the above, which would allow accurate stats, and possibly safer experimentation with sqlite's multithreaded mode. We've already done a bunch of work in this direction, and I (Tim) am happy to continue if it seems useful, but don't mind dropping it if it's ultimately not going to be worthwhile.

The text was updated successfully, but these errors were encountered:

tserong · 2023-10-10T09:18:25Z

(Note: I have no idea if there's any noticeable performance impact from all those opens and closes - they might be fine, or they might not.)

In the interest of testing that, I applied Marcel's microbenchmark code (irq0/ceph@54ce922#diff-521007125cbd4106a8fc8043757557c75c013b766dee299df2df34824eb98b2a) to the current s3gw branch, and ran this on my (relatively old) 8 core desktop:

> bin/unittest_rgw_sfs_concurrency --gtest_filter='Common/TestSFSConcurrency.performance_multi_thread/create_new_object
[ RUN      ] Common/TestSFSConcurrency.performance_multi_thread/create_new_object
2023-10-10T18:31:57.087+1100 7fe30ba77f40 -1 <100µs		0
2023-10-10T18:31:57.087+1100 7fe30ba77f40 -1 <190µs		0
2023-10-10T18:31:57.087+1100 7fe30ba77f40 -1 <370µs		0
2023-10-10T18:31:57.087+1100 7fe30ba77f40 -1 <730µs		466384
2023-10-10T18:31:57.087+1100 7fe30ba77f40 -1 <1450µs		386960
2023-10-10T18:31:57.087+1100 7fe30ba77f40 -1 <2890µs		52675
2023-10-10T18:31:57.087+1100 7fe30ba77f40 -1 <5770µs		4633
2023-10-10T18:31:57.087+1100 7fe30ba77f40 -1 <11530µs		1589
2023-10-10T18:31:57.087+1100 7fe30ba77f40 -1 <23050µs		43901
2023-10-10T18:31:57.087+1100 7fe30ba77f40 -1 <46090µs		22074
2023-10-10T18:31:57.087+1100 7fe30ba77f40 -1 <92170µs		10974
2023-10-10T18:31:57.087+1100 7fe30ba77f40 -1 <184330µs		8135
2023-10-10T18:31:57.087+1100 7fe30ba77f40 -1 <368650µs		2305
2023-10-10T18:31:57.087+1100 7fe30ba77f40 -1 <737290µs		370
2023-10-10T18:31:57.087+1100 7fe30ba77f40 -1 <1474570µs		0
2023-10-10T18:31:57.087+1100 7fe30ba77f40 -1 <2949130µs		0
2023-10-10T18:31:57.087+1100 7fe30ba77f40 -1 <∞		10
2023-10-10T18:31:57.087+1100 7fe30ba77f40 -1 count: 1000000, time total: 4478404ms, avg time: 4.478404ms avg rate: 223.293834/s
2023-10-10T18:31:57.087+1100 7fe30ba77f40 -1 total: 1200002 failed:341 retried: 516217
[       OK ] Common/TestSFSConcurrency.performance_multi_thread/create_new_object (762037 ms)

I also did the same with the single connection:

2023-10-10T18:45:29.133+1100 7f3f4da0bf40 -1 <100µs		0
2023-10-10T18:45:29.133+1100 7f3f4da0bf40 -1 <190µs		255897
2023-10-10T18:45:29.133+1100 7f3f4da0bf40 -1 <370µs		130672
2023-10-10T18:45:29.133+1100 7f3f4da0bf40 -1 <730µs		4590
2023-10-10T18:45:29.133+1100 7f3f4da0bf40 -1 <1450µs		129944
2023-10-10T18:45:29.133+1100 7f3f4da0bf40 -1 <2890µs		291132
2023-10-10T18:45:29.133+1100 7f3f4da0bf40 -1 <5770µs		135347
2023-10-10T18:45:29.133+1100 7f3f4da0bf40 -1 <11530µs		45081
2023-10-10T18:45:29.133+1100 7f3f4da0bf40 -1 <23050µs		7025
2023-10-10T18:45:29.133+1100 7f3f4da0bf40 -1 <46090µs		294
2023-10-10T18:45:29.133+1100 7f3f4da0bf40 -1 <92170µs		17
2023-10-10T18:45:29.133+1100 7f3f4da0bf40 -1 <184330µs		1
2023-10-10T18:45:29.133+1100 7f3f4da0bf40 -1 <368650µs		0
2023-10-10T18:45:29.133+1100 7f3f4da0bf40 -1 <737290µs		0
2023-10-10T18:45:29.133+1100 7f3f4da0bf40 -1 <1474570µs		0
2023-10-10T18:45:29.133+1100 7f3f4da0bf40 -1 <2949130µs		0
2023-10-10T18:45:29.133+1100 7f3f4da0bf40 -1 <∞		10
2023-10-10T18:45:29.133+1100 7f3f4da0bf40 -1 count: 1000000, time total: 1776194ms, avg time: 1.776194ms avg rate: 563.001564/s
2023-10-10T18:45:29.133+1100 7f3f4da0bf40 -1 total: 1200002 failed:0 retried: 0

And the pool:

2023-10-10T19:02:12.944+1100 7fbc22817f40 -1 <100µs		0
2023-10-10T19:02:12.944+1100 7fbc22817f40 -1 <190µs		582770
2023-10-10T19:02:12.944+1100 7fbc22817f40 -1 <370µs		383543
2023-10-10T19:02:12.944+1100 7fbc22817f40 -1 <730µs		15147
2023-10-10T19:02:12.944+1100 7fbc22817f40 -1 <1450µs		1269
2023-10-10T19:02:12.944+1100 7fbc22817f40 -1 <2890µs		10916
2023-10-10T19:02:12.944+1100 7fbc22817f40 -1 <5770µs		1075
2023-10-10T19:02:12.944+1100 7fbc22817f40 -1 <11530µs		378
2023-10-10T19:02:12.944+1100 7fbc22817f40 -1 <23050µs		91
2023-10-10T19:02:12.944+1100 7fbc22817f40 -1 <46090µs		322
2023-10-10T19:02:12.944+1100 7fbc22817f40 -1 <92170µs		578
2023-10-10T19:02:12.944+1100 7fbc22817f40 -1 <184330µs		708
2023-10-10T19:02:12.944+1100 7fbc22817f40 -1 <368650µs		806
2023-10-10T19:02:12.944+1100 7fbc22817f40 -1 <737290µs		2397
2023-10-10T19:02:12.944+1100 7fbc22817f40 -1 <1474570µs		0
2023-10-10T19:02:12.944+1100 7fbc22817f40 -1 <2949130µs		0
2023-10-10T19:02:12.944+1100 7fbc22817f40 -1 <∞		10
2023-10-10T19:02:12.944+1100 7fbc22817f40 -1 count: 1000000, time total: 1674415ms, avg time: 1.674415ms avg rate: 597.223508/s
2023-10-10T19:02:12.944+1100 7fbc22817f40 -1 total: 1200002 failed:2377 retried: 38791

Observations:

The current implementation (s3gw v0.21, multiple connections that open/close as necessary, no pool) has a average rate of 223.293834/s.
The single connection is easily > twice as fast (563.001564/s)
The pool is a bit faster again (597.223508/s)

I assume I'm not getting as good a result with the pool as Marcel, given I've got 8 cores and he's got 32.

Also interesting to note that of the above three, the single connection test is the only one that doesn't see any retries or failures.

Then I scaled the microbenchmarks down from one million to one hundred thousand queries, to cause my system to suffer a bit less. Here's what I got:

s3gw v0.21: avg rate: 255.704773/s (total: 120002 failed:17 retried: 45034)
single connection: avg rate: 583.348889/s (total: 120002 failed:0 retried: 0)
pool: avg rate: 782.778865/s (total: 120002 failed:183 retried: 2877)

Note the fail and retry counts are lower here, and I'm starting to see the advantage of the pool over the single connection.

TL;DR: the current implementation with it's gadzillion open/close calls really is noticeably slower than both the single connection and the pool. So I think this is definitely worth pursuing further.

@irq0, if you get a chance, would you mind running those microbenchmarks against an otherwise unmodified s3gw v0.21 to see how it behaves with your 32 cores? I'm curious to see if you have a similar experience.

(I should also mention, when I ran all the above, the sqlite database was on a tmpfs, to avoid getting disk bound on my crappy old SSD)

irq0 · 2023-10-10T09:23:35Z

SQLite has two threading modes that we care about, multithreaded and serialized (see sqlite.org/threadsafe.html). The default is serialized, but from the above it looks like multithreaded might be about 8% faster, so is probably worth experimenting with.

Both modes differ in a per connection mutex. So if we go pooled we have a mostly (? - maybe async request handling comes in here) uncontended mutex. Not sure that accounts for the 8% or just margin of error. I think serialized is a good start that we can iterate on.

irq0 · 2023-10-10T09:32:03Z

In the interest of testing that, I applied Marcel's microbenchmark code (irq0/ceph@54ce922#diff-521007125cbd4106a8fc8043757557c75c013b766dee299df2df34824eb98b2a) to the current s3gw branch, and ran this on my (relatively old) 8 core desktop:

The results are all for the create_new_object test? Can you check the get_random_object one?

Database writes are serialized and the single threaded result data is dominated by queue time. In the copy case the copies likely add up.

The get_random_object is reads only and exercises all the read concurrency we can get from SQLite. Curious about the data from your system.

tserong · 2023-10-10T10:09:23Z

Here we go. get_random_object with 100K queries:

s3gw v0.21:

[ RUN      ] Common/TestSFSConcurrency.performance_multi_thread/get_random_object
2023-10-10T20:37:54.867+1100 7fbdf686cf40 -1 <100µs		0
2023-10-10T20:37:54.867+1100 7fbdf686cf40 -1 <190µs		0
2023-10-10T20:37:54.867+1100 7fbdf686cf40 -1 <370µs		0
2023-10-10T20:37:54.867+1100 7fbdf686cf40 -1 <730µs		108
2023-10-10T20:37:54.867+1100 7fbdf686cf40 -1 <1450µs		600
2023-10-10T20:37:54.867+1100 7fbdf686cf40 -1 <2890µs		17481
2023-10-10T20:37:54.867+1100 7fbdf686cf40 -1 <5770µs		76120
2023-10-10T20:37:54.867+1100 7fbdf686cf40 -1 <11530µs		5428
2023-10-10T20:37:54.867+1100 7fbdf686cf40 -1 <23050µs		243
2023-10-10T20:37:54.867+1100 7fbdf686cf40 -1 <46090µs		19
2023-10-10T20:37:54.867+1100 7fbdf686cf40 -1 <92170µs		1
2023-10-10T20:37:54.867+1100 7fbdf686cf40 -1 <184330µs		0
2023-10-10T20:37:54.867+1100 7fbdf686cf40 -1 <368650µs		0
2023-10-10T20:37:54.867+1100 7fbdf686cf40 -1 <737290µs		0
2023-10-10T20:37:54.867+1100 7fbdf686cf40 -1 <1474570µs		0
2023-10-10T20:37:54.867+1100 7fbdf686cf40 -1 <2949130µs		0
2023-10-10T20:37:54.867+1100 7fbdf686cf40 -1 <∞		10
2023-10-10T20:37:54.867+1100 7fbdf686cf40 -1 count: 100000, time total: 375107ms, avg time: 3.751070ms avg rate: 266.590599/s
2023-10-10T20:37:54.867+1100 7fbdf686cf40 -1 total: 20002 failed:0 retried: 0
[       OK ] Common/TestSFSConcurrency.performance_multi_thread/get_random_object (65302 ms)

single connection (5.6x faster than above):

[ RUN      ] Common/TestSFSConcurrency.performance_multi_thread/get_random_object
2023-10-10T20:39:15.227+1100 7f54b0a50f40 -1 <100µs		1694
2023-10-10T20:39:15.227+1100 7f54b0a50f40 -1 <190µs		4764
2023-10-10T20:39:15.227+1100 7f54b0a50f40 -1 <370µs		12178
2023-10-10T20:39:15.227+1100 7f54b0a50f40 -1 <730µs		47376
2023-10-10T20:39:15.227+1100 7f54b0a50f40 -1 <1450µs		29913
2023-10-10T20:39:15.227+1100 7f54b0a50f40 -1 <2890µs		3476
2023-10-10T20:39:15.227+1100 7f54b0a50f40 -1 <5770µs		565
2023-10-10T20:39:15.227+1100 7f54b0a50f40 -1 <11530µs		34
2023-10-10T20:39:15.227+1100 7f54b0a50f40 -1 <23050µs		0
2023-10-10T20:39:15.227+1100 7f54b0a50f40 -1 <46090µs		0
2023-10-10T20:39:15.227+1100 7f54b0a50f40 -1 <92170µs		0
2023-10-10T20:39:15.227+1100 7f54b0a50f40 -1 <184330µs		0
2023-10-10T20:39:15.227+1100 7f54b0a50f40 -1 <368650µs		0
2023-10-10T20:39:15.227+1100 7f54b0a50f40 -1 <737290µs		0
2023-10-10T20:39:15.227+1100 7f54b0a50f40 -1 <1474570µs		0
2023-10-10T20:39:15.227+1100 7f54b0a50f40 -1 <2949130µs		0
2023-10-10T20:39:15.227+1100 7f54b0a50f40 -1 <∞		10
2023-10-10T20:39:15.227+1100 7f54b0a50f40 -1 count: 100000, time total: 67601ms, avg time: 0.676010ms avg rate: 1479.268058/s
2023-10-10T20:39:15.227+1100 7f54b0a50f40 -1 total: 20002 failed:0 retried: 0
[       OK ] Common/TestSFSConcurrency.performance_multi_thread/get_random_object (12752 ms)

pool (4.3x faster than single connection, 24x faster than original):

[ RUN      ] Common/TestSFSConcurrency.performance_multi_thread/get_random_object
2023-10-10T20:40:28.799+1100 7f08f784ff40 -1 <100µs		6783
2023-10-10T20:40:28.799+1100 7f08f784ff40 -1 <190µs		85775
2023-10-10T20:40:28.799+1100 7f08f784ff40 -1 <370µs		3603
2023-10-10T20:40:28.799+1100 7f08f784ff40 -1 <730µs		2078
2023-10-10T20:40:28.799+1100 7f08f784ff40 -1 <1450µs		1102
2023-10-10T20:40:28.799+1100 7f08f784ff40 -1 <2890µs		407
2023-10-10T20:40:28.799+1100 7f08f784ff40 -1 <5770µs		210
2023-10-10T20:40:28.799+1100 7f08f784ff40 -1 <11530µs		37
2023-10-10T20:40:28.799+1100 7f08f784ff40 -1 <23050µs		5
2023-10-10T20:40:28.799+1100 7f08f784ff40 -1 <46090µs		0
2023-10-10T20:40:28.799+1100 7f08f784ff40 -1 <92170µs		0
2023-10-10T20:40:28.799+1100 7f08f784ff40 -1 <184330µs		0
2023-10-10T20:40:28.799+1100 7f08f784ff40 -1 <368650µs		0
2023-10-10T20:40:28.799+1100 7f08f784ff40 -1 <737290µs		0
2023-10-10T20:40:28.799+1100 7f08f784ff40 -1 <1474570µs		0
2023-10-10T20:40:28.799+1100 7f08f784ff40 -1 <2949130µs		0
2023-10-10T20:40:28.799+1100 7f08f784ff40 -1 <∞		10
2023-10-10T20:40:28.799+1100 7f08f784ff40 -1 count: 100000, time total: 15673ms, avg time: 0.156730ms avg rate: 6380.399413/s
2023-10-10T20:40:28.799+1100 7f08f784ff40 -1 total: 20002 failed:0 retried: 0
[       OK ] Common/TestSFSConcurrency.performance_multi_thread/get_random_object (6273 ms)

I also tried with 1 million queries, and the rates remained it the same ballpark, it just all took 10x longer (s3gw v0.21: 260.238837/s, single connection: 1470.839143/s, pool: 6457.070169/s).

jecluis · 2023-10-16T06:53:55Z

Related to aquarist-labs/ceph#209

jecluis · 2023-10-16T06:54:41Z

Tentatively pushing to v0.23.0

Currently, `DBConn` keeps an instance of `Storage`, which is created by `sqlite_orm::make_storage()`. That first instance of `Storage` is long-lived (the `DBConn` c'tor calls `storage->open_forever()`) so the sqlite database is open for the entire life of the program, but this first `Storage` object and its associated sqlite database connection pointer are largely not used for anything much after initialization. The exception is the SFS status page, which also uses this connection to report some sqlite statistics. All the other threads (the workers, the garbage collector, ...) call `DBConn::get_storage()`, which returns a copy of the first `Storage` object. These copies don't have `open_forever()` called on them, which means every time they're used for queries we get a pair of calls to `sqlite3_open()` and `sqlite3_close()` at the start end end of the query. These calls don't open the main database file again (it's already open) but they do open and close the WAL. There's a couple of problems with this. One is that the SFS status page only sees the main thread (which is largely boring), and can't track any of the worker threads. The other problem is that something about not keeping the connection open on the worker threads is relatively expensive. If we keep connections open rather than opening and closing with every query, we can get something like a 20x speed increase on read queries, and at least 2x on writes. This new implementation gives one `Storage` object per thread, created on demand as a copy of the first `Storage` object created in the `DBConn` constructor. Fixes: https://github.com/aquarist-labs/s3gw/issues/727 Signed-off-by: Tim Serong <[email protected]>

jecluis added kind/enhancement Change that positively impacts existing code area/rgw-sfs RGW & SFS related labels Sep 27, 2023

jecluis added this to S3GW Sep 27, 2023

github-project-automation bot moved this to Backlog in S3GW Sep 27, 2023

github-actions bot added the triage/waiting Waiting for triage label Sep 27, 2023

tserong self-assigned this Oct 4, 2023

tserong moved this from Backlog to In Progress 🏗️ in S3GW Oct 12, 2023

jecluis added this to the v0.23.0 milestone Oct 16, 2023

jecluis added priority/1 Should be fixed for next release and removed triage/waiting Waiting for triage labels Oct 16, 2023

tserong moved this from In Progress 🏗️ to In Review 👀 in S3GW Oct 19, 2023

tserong closed this as completed in aquarist-labs/ceph@fc9d9f2 Nov 6, 2023

github-project-automation bot moved this from In Review 👀 to Done in S3GW Nov 6, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

rgw/sfs: abstract how we obtain connections, potentially change between threading models #727

rgw/sfs: abstract how we obtain connections, potentially change between threading models #727

jecluis commented Sep 27, 2023 •

edited by tserong

Loading

tserong commented Oct 10, 2023 •

edited

Loading

irq0 commented Oct 10, 2023

irq0 commented Oct 10, 2023

tserong commented Oct 10, 2023 •

edited

Loading

jecluis commented Oct 16, 2023

jecluis commented Oct 16, 2023

rgw/sfs: abstract how we obtain connections, potentially change between threading models #727

rgw/sfs: abstract how we obtain connections, potentially change between threading models #727

Comments

jecluis commented Sep 27, 2023 • edited by tserong Loading

tserong commented Oct 10, 2023 • edited Loading

irq0 commented Oct 10, 2023

irq0 commented Oct 10, 2023

tserong commented Oct 10, 2023 • edited Loading

jecluis commented Oct 16, 2023

jecluis commented Oct 16, 2023

jecluis commented Sep 27, 2023 •

edited by tserong

Loading

tserong commented Oct 10, 2023 •

edited

Loading

tserong commented Oct 10, 2023 •

edited

Loading