
rgw/sfs: check number of file descriptors on start #752

Closed
jecluis opened this issue Oct 11, 2023 · 7 comments · Fixed by aquarist-labs/ceph#229
Labels: area/rgw-sfs (RGW & SFS related), kind/enhancement (Change that positively impacts existing code), triage/waiting (Waiting for triage)

Comments


jecluis commented Oct 11, 2023

We should ensure the process is able to allocate more than just the default 1024 file descriptors, because otherwise we can run into failures once the available descriptors are exhausted.

The proposal is to allocate a batch of file descriptors on startup and ensure that we can. If we can't, die with a message to the user; otherwise, continue. The expectation is that this prevents problems down the line, at the cost of a few hundred milliseconds at startup.
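Roughly along these lines; this is just a sketch, with the descriptor count, the /dev/null probe, and the failure handling as placeholders:

#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>
#include <vector>

// Sketch only: open `needed` descriptors up front to prove they are
// actually available, then release them again before continuing.
static void check_fd_availability(int needed) {
  std::vector<int> fds;
  fds.reserve(needed);
  for (int i = 0; i < needed; ++i) {
    int fd = open("/dev/null", O_RDONLY);
    if (fd < 0) {
      std::fprintf(stderr,
                   "fatal: only %d of %d file descriptors available; "
                   "raise the limit (e.g. ulimit -n) and restart\n",
                   i, needed);
      std::exit(1);
    }
    fds.push_back(fd);
  }
  for (int fd : fds) close(fd);  // probe only; hand them back
}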

jecluis added the kind/enhancement (Change that positively impacts existing code) and area/rgw-sfs (RGW & SFS related) labels Oct 11, 2023
jecluis added this to S3GW Oct 11, 2023
github-project-automation bot moved this to Backlog in S3GW Oct 11, 2023
github-actions bot added the triage/waiting (Waiting for triage) label Oct 11, 2023

tserong commented Oct 12, 2023

JFTR, if FDs are exhausted when making requests, you see things like this in the logs:

2023-10-12T17:58:40.756+1100 7f52a1e526c0 10 req 0 0.003333300s s3:put_obj > multipart_writer_v2::prepare upload_id: 20231012T065837.652623976Z, part: 1
2023-10-12T17:58:40.759+1100 7f52a1e526c0 10 bucket::get_multipart_upload: oid: test-single-C-1, upload id:  20231012T065837.652623976Z
2023-10-12T17:58:40.759+1100 7f52a1e526c0 -1 [SQLITE] (14) cannot open file at line 43451 of [831d0fb283]
2023-10-12T17:58:40.759+1100 7f52a1e526c0 -1 [SQLITE] (14) os_unix.c:43451: (24) open(/scratch/s3gw/qa/s3gw.db-wal) - 
2023-10-12T17:58:40.759+1100 7f52a1e526c0 -1 [SQLITE] (14) unable to open database file in "PRAGMA journal_mode=WAL;PRAGMA synchronous=normal;PRAGMA temp_store = memory;PRAGMA case_sensitive_like=ON;PRAGMA mmap_size = 30000000000;PRAGMA journal_size_limit = -1;"
2023-10-12T17:58:40.759+1100 7f52a1e526c0 -1 [SQLITE] (14) cannot open file at line 43451 of [831d0fb283]
2023-10-12T17:58:40.759+1100 7f52a1e526c0 -1 [SQLITE] (14) os_unix.c:43451: (24) open(/scratch/s3gw/qa/s3gw.db-wal) - 
2023-10-12T17:58:40.759+1100 7f52a1e526c0 -1 [SQLITE] (14) unable to open database file in "SELECT "multiparts"."id", "multiparts"."bucket_id", "multiparts"."upload_id", "multiparts"."state", "multiparts"."state_change_time", "multiparts"."object_name", "multiparts"."
2023-10-12T17:58:40.759+1100 7f52a1e526c0  0 req 0 0.006666603s s3:put_obj !!! BUG Unhandled exception while executing operation put_obj: unable to open database file: unable to open database file. replying internal error
2023-10-12T17:58:40.766+1100 7f52a1e526c0  0 req 0 0.013333205s s3:put_obj START BACKTRACE (exception St12system_error)

Or maybe this:

2023-10-12T17:42:27.482+1100 7f85c3eb86c0 -1 [SQLITE] (14) cannot open file at line 43451 of [831d0fb283]
2023-10-12T17:42:27.482+1100 7f85c3eb86c0 -1 [SQLITE] (14) os_unix.c:43451: (24) open(/scratch/s3gw/qa/s3gw.db) - 
2023-10-12T17:42:27.482+1100 7f85c3eb86c0  5 req 0 0.003333302s s3:list_buckets auth engine throwed unexpected err: unable to open database file: unable to open database file
2023-10-12T17:42:27.482+1100 7f85c3eb86c0 10 failed to authorize request
2023-10-12T17:42:27.482+1100 7f85c3eb86c0  1 req 0 0.003333302s op->ERRORHANDLER: err_no=-1 new_err_no=-1
2023-10-12T17:42:27.482+1100 7f85c3eb86c0  2 req 0 0.003333302s s3:list_buckets op status=0
2023-10-12T17:42:27.482+1100 7f85c3eb86c0  2 req 0 0.003333302s s3:list_buckets http status=403
2023-10-12T17:42:27.482+1100 7f85c3eb86c0  1 ====== req done req=0x7f8667bf16e0 op status=0 http_status=403 latency=0.003333302s ======
2023-10-12T17:42:27.482+1100 7f85c3eb86c0  1 beast: 0x7f8667bf16e0: 127.0.0.1 - - [12/Oct/2023:17:42:27.478 +1100] "GET / HTTP/1.1" 403 95 - - - latency=0.003333302s

After that, subsequent requests will tend to just hang (or, if you're lucky, maybe fail with "access denied")

tserong moved this from Backlog to In Progress 🏗️ in S3GW Oct 12, 2023

tserong commented Oct 12, 2023

The proposal is to allocate a bunch of file descriptors on start, and ensuring that we can do it.

I can't remember whether we discussed this detail, but is there any reason not to use getrlimit(RLIMIT_NOFILE) to check the current limit, rather than trying to allocate FDs? Because I now have some code that works using getrlimit() :-)
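A sketch of the idea (the helper name and the caller-supplied threshold here are illustrative, not the actual patch):

#include <sys/resource.h>

// Sketch: report whether the current RLIMIT_NOFILE soft limit covers
// what we think we need. The soft limit is what open() etc. enforce.
static bool fd_limit_is_sufficient(rlim_t needed) {
  struct rlimit rl;
  if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
    return false;  // couldn't even query the limit
  }
  return rl.rlim_cur >= needed;
}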

tserong referenced this issue in tserong/ceph Oct 12, 2023
This is somewhat arbitrary, but the idea is that we potentially need
at least 4 FDs per worker thread (two for the sqlite db and its WAL,
and another two to accommodate files that may be being read or written),
plus about 40 for various pipes and sockets and things that appear
in /proc/$(pgrep radosgw)/fd before anything interesting happens, so
let's round that 40 up to 64 just in case.

Fixes: https://github.com/aquarist-labs/s3gw/issues/752
Signed-off-by: Tim Serong <[email protected]>

jecluis commented Oct 12, 2023

I think we just didn't know about that, or it didn't occur to anyone. If that works for the intended purpose, all the better. @irq0 thoughts?


tserong commented Oct 12, 2023

I think we just didn't know about that, or it didn't occur to anyone.

I only found out about it today when I did some further digging :-)


irq0 commented Oct 12, 2023

The proposal is to allocate a bunch of file descriptors on start, and ensuring that we can do it.

I can't remember whether we discussed this detail, but is there any reason not to use getrlimit(RLIMIT_NOFILE) to check the current limit, rather than trying to allocate FDs? Because I now have some code that works using getrlimit() :-)

A high limit (or no limit at all) doesn't mean the descriptors are actually free, or that there isn't some other mechanism limiting them 🙃

Some more interesting info in https://0pointer.net/blog/file-descriptor-limits.html. I think we should follow the advice at the end about soft/hard limits.


tserong commented Oct 13, 2023

Some more interesting info in https://0pointer.net/blog/file-descriptor-limits.html. I think we should follow the advice at the end about soft/hard limits.

Fascinating. Thanks for the link. The most straightforward approach, then, is to do what Lennart says and bump the soft limit to the hard limit (which I can confirm is 524288 on my Tumbleweed desktop), but also maybe double-check that the hard limit is nice and high, just out of paranoia.
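Something like this, I think (sketch only; the minimum-hard-limit floor is illustrative):

#include <sys/resource.h>
#include <cstdio>
#include <cstdlib>

// Illustrative floor; the real value would come from the
// per-worker-thread arithmetic above.
static const rlim_t kMinHardLimit = 2048;

static void bump_fd_limit() {
  struct rlimit rl;
  if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
    std::perror("getrlimit(RLIMIT_NOFILE)");
    std::exit(1);
  }
  // Paranoia: refuse to start if even the hard limit looks too low.
  if (rl.rlim_max != RLIM_INFINITY && rl.rlim_max < kMinHardLimit) {
    std::fprintf(stderr, "fatal: RLIMIT_NOFILE hard limit (%llu) too low\n",
                 (unsigned long long)rl.rlim_max);
    std::exit(1);
  }
  // Safe as long as nothing uses select(), per the blog post above.
  rl.rlim_cur = rl.rlim_max;
  if (setrlimit(RLIMIT_NOFILE, &rl) != 0) {
    std::perror("setrlimit(RLIMIT_NOFILE)");
    std::exit(1);
  }
}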

I've attempted to confirm that there's no use of select() anywhere in the Ceph codebase. All I could find is https://github.com/aquarist-labs/ceph/blob/533b54b55692534fab0a681fb3712f2742b8fa1a/src/msg/async/EventSelect.cc#L79, which isn't used on Linux anyway (it's a fallback for when epoll isn't available), and https://github.com/aquarist-labs/ceph/blob/533b54b55692534fab0a681fb3712f2742b8fa1a/src/tools/rbd/action/Perf.cc#L561, which is in the rbd command-line tool, so it isn't relevant for us.

tserong referenced this issue in tserong/ceph Oct 13, 2023
We potentially need at least 4 FDs per worker thread (two for the
sqlite db and its WAL, and another two to accommodate files that
may be being read or written), plus about 40 for various pipes and
sockets and things that appear in /proc/$(pgrep radosgw)/fd
before anything interesting happens.  That's more than two thousand
FDs, but the default soft FD limit is only 1024.

The most straightforward and probably safest thing to do is just
bump the RLIMIT_NOFILE soft limit (1024) to the hard limit (which
these days should be 524288) on startup.  In case the hard limit
is somehow low, this commit also includes a check to see if it's
at least as high as what we imagine we need.

See https://0pointer.net/blog/file-descriptor-limits.html for
discussion on bumping RLIMIT_NOFILE.

Fixes: https://github.com/aquarist-labs/s3gw/issues/752
Signed-off-by: Tim Serong <[email protected]>

tserong commented Oct 13, 2023

OK, I've updated aquarist-labs/ceph#229 to bump the soft limit (1024) to the hard limit (which should be 524288 on any reasonably modern system). Given that the limit is that huge, I don't think we need to actually allocate the couple of thousand FDs we suspect we need at most.

github-project-automation bot moved this from In Progress 🏗️ to Done in S3GW Oct 13, 2023
0xavi0 referenced this issue in aquarist-labs/ceph Oct 18, 2023