
rgw/sfs: check number of file descriptors on start #752

Closed
jecluis opened this issue Oct 11, 2023 · 7 comments · Fixed by aquarist-labs/ceph#229
Labels: area/rgw-sfs (RGW & SFS related), kind/enhancement (Change that positively impacts existing code), triage/waiting (Waiting for triage)

Comments


jecluis commented Oct 11, 2023

We should ensure the process is able to allocate more than just the default 1024 file descriptors, because otherwise we can run into failures once the available descriptors are exhausted.

The proposal is to allocate a batch of file descriptors on startup and ensure that we can. If we can't, die with a message to the user; otherwise, continue. The expectation is that this prevents problems down the line, at the cost of a few hundred milliseconds at startup.
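Roughly along these lines; this is just a sketch, with the descriptor count, the /dev/null probe, and the failure handling as placeholders:

#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>
#include <vector>

// Sketch only: open `needed` descriptors up front to prove they are
// actually available, then release them again before continuing.
static void check_fd_availability(int needed) {
  std::vector<int> fds;
  fds.reserve(needed);
  for (int i = 0; i < needed; ++i) {
    int fd = open("/dev/null", O_RDONLY);
    if (fd < 0) {
      std::fprintf(stderr,
                   "fatal: only %d of %d file descriptors available; "
                   "raise the limit (e.g. ulimit -n) and restart\n",
                   i, needed);
      std::exit(1);
    }
    fds.push_back(fd);
  }
  for (int fd : fds) close(fd);  // probe only; hand them back
}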

jecluis added the kind/enhancement (Change that positively impacts existing code) and area/rgw-sfs (RGW & SFS related) labels Oct 11, 2023
jecluis added this to S3GW Oct 11, 2023
github-project-automation bot moved this to Backlog in S3GW Oct 11, 2023
github-actions bot added the triage/waiting (Waiting for triage) label Oct 11, 2023

tserong commented Oct 12, 2023

JFTR, if FDs are exhausted when making requests, you see things like this in the logs:

2023-10-12T17:58:40.756+1100 7f52a1e526c0 10 req 0 0.003333300s s3:put_obj > multipart_writer_v2::prepare upload_id: 20231012T065837.652623976Z, part: 1
2023-10-12T17:58:40.759+1100 7f52a1e526c0 10 bucket::get_multipart_upload: oid: test-single-C-1, upload id:  20231012T065837.652623976Z
2023-10-12T17:58:40.759+1100 7f52a1e526c0 -1 [SQLITE] (14) cannot open file at line 43451 of [831d0fb283]
2023-10-12T17:58:40.759+1100 7f52a1e526c0 -1 [SQLITE] (14) os_unix.c:43451: (24) open(/scratch/s3gw/qa/s3gw.db-wal) - 
2023-10-12T17:58:40.759+1100 7f52a1e526c0 -1 [SQLITE] (14) unable to open database file in "PRAGMA journal_mode=WAL;PRAGMA synchronous=normal;PRAGMA temp_store = memory;PRAGMA case_sensitive_like=ON;PRAGMA mmap_size = 30000000000;PRAGMA journal_size_limit = -1;"
2023-10-12T17:58:40.759+1100 7f52a1e526c0 -1 [SQLITE] (14) cannot open file at line 43451 of [831d0fb283]
2023-10-12T17:58:40.759+1100 7f52a1e526c0 -1 [SQLITE] (14) os_unix.c:43451: (24) open(/scratch/s3gw/qa/s3gw.db-wal) - 
2023-10-12T17:58:40.759+1100 7f52a1e526c0 -1 [SQLITE] (14) unable to open database file in "SELECT "multiparts"."id", "multiparts"."bucket_id", "multiparts"."upload_id", "multiparts"."state", "multiparts"."state_change_time", "multiparts"."object_name", "multiparts"."
2023-10-12T17:58:40.759+1100 7f52a1e526c0  0 req 0 0.006666603s s3:put_obj !!! BUG Unhandled exception while executing operation put_obj: unable to open database file: unable to open database file. replying internal error
2023-10-12T17:58:40.766+1100 7f52a1e526c0  0 req 0 0.013333205s s3:put_obj START BACKTRACE (exception St12system_error)

Or maybe this:

2023-10-12T17:42:27.482+1100 7f85c3eb86c0 -1 [SQLITE] (14) cannot open file at line 43451 of [831d0fb283]
2023-10-12T17:42:27.482+1100 7f85c3eb86c0 -1 [SQLITE] (14) os_unix.c:43451: (24) open(/scratch/s3gw/qa/s3gw.db) - 
2023-10-12T17:42:27.482+1100 7f85c3eb86c0  5 req 0 0.003333302s s3:list_buckets auth engine throwed unexpected err: unable to open database file: unable to open database file
2023-10-12T17:42:27.482+1100 7f85c3eb86c0 10 failed to authorize request
2023-10-12T17:42:27.482+1100 7f85c3eb86c0  1 req 0 0.003333302s op->ERRORHANDLER: err_no=-1 new_err_no=-1
2023-10-12T17:42:27.482+1100 7f85c3eb86c0  2 req 0 0.003333302s s3:list_buckets op status=0
2023-10-12T17:42:27.482+1100 7f85c3eb86c0  2 req 0 0.003333302s s3:list_buckets http status=403
2023-10-12T17:42:27.482+1100 7f85c3eb86c0  1 ====== req done req=0x7f8667bf16e0 op status=0 http_status=403 latency=0.003333302s ======
2023-10-12T17:42:27.482+1100 7f85c3eb86c0  1 beast: 0x7f8667bf16e0: 127.0.0.1 - - [12/Oct/2023:17:42:27.478 +1100] "GET / HTTP/1.1" 403 95 - - - latency=0.003333302s

After that, subsequent requests will tend to just hang (or, if you're lucky, maybe fail with "access denied")

tserong moved this from Backlog to In Progress 🏗️ in S3GW Oct 12, 2023

tserong commented Oct 12, 2023

The proposal is to allocate a bunch of file descriptors on start, and ensuring that we can do it.

I can't remember whether we discussed this detail, but is there any reason not to use getrlimit(RLIMIT_NOFILE) to check the current limit, rather than trying to allocate FDs? Because I now have some code that works using getrlimit() :-)
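A sketch of the idea (the helper name and the caller-supplied threshold here are illustrative, not the actual patch):

#include <sys/resource.h>

// Sketch: report whether the current RLIMIT_NOFILE soft limit covers
// what we think we need. The soft limit is what open() etc. enforce.
static bool fd_limit_is_sufficient(rlim_t needed) {
  struct rlimit rl;
  if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
    return false;  // couldn't even query the limit
  }
  return rl.rlim_cur >= needed;
}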

tserong referenced this issue in tserong/ceph Oct 12, 2023
This is somewhat arbitrary, but the idea is that we potentially need
at least 4 FDs per worker thread (two for the sqlite db and its WAL,
and another two to accommodate files that may be being read or written),
plus about 40 for various pipes and sockets and things that appear
in /proc/$(pgrep radosgw)/fd before anything interesting happens, so
let's round that 40 up to 64 just in case.

Fixes: https://github.com/aquarist-labs/s3gw/issues/752
Signed-off-by: Tim Serong <[email protected]>

jecluis commented Oct 12, 2023

I think we just didn't know about that, or it didn't occur to anyone. If that works for the intended purpose, all the better. @irq0 thoughts?


tserong commented Oct 12, 2023

I think we just didn't know about that, or it didn't occur to anyone.

I only found out about it today when I did some further digging :-)


irq0 commented Oct 12, 2023

The proposal is to allocate a bunch of file descriptors on start, and ensuring that we can do it.

I can't remember whether we discussed this detail, but is there any reason not to use getrlimit(RLIMIT_NOFILE) to check the current limit, rather than trying to allocate FDs? Because I now have some code that works using getrlimit() :-)

A high limit (or no limit at all) doesn't mean the descriptors are actually free, or that there isn't some other mechanism limiting them 🙃

Some more interesting info in https://0pointer.net/blog/file-descriptor-limits.html. I think we should follow the advice at the end about soft/hard limits.


tserong commented Oct 13, 2023

Some more interesting info in https://0pointer.net/blog/file-descriptor-limits.html. I think we should follow the advice at the end about soft/hard limits.

Fascinating. Thanks for the link. The most straightforward approach, then, is to do what Lennart says and bump the soft limit to the hard limit (which I can confirm is 524288 on my Tumbleweed desktop), but also maybe double-check that the hard limit is nice and high, just out of paranoia.
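Something like this, I think (sketch only; the minimum-hard-limit floor is illustrative):

#include <sys/resource.h>
#include <cstdio>
#include <cstdlib>

// Illustrative floor; the real value would come from the
// per-worker-thread arithmetic above.
static const rlim_t kMinHardLimit = 2048;

static void bump_fd_limit() {
  struct rlimit rl;
  if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
    std::perror("getrlimit(RLIMIT_NOFILE)");
    std::exit(1);
  }
  // Paranoia: refuse to start if even the hard limit looks too low.
  if (rl.rlim_max != RLIM_INFINITY && rl.rlim_max < kMinHardLimit) {
    std::fprintf(stderr, "fatal: RLIMIT_NOFILE hard limit (%llu) too low\n",
                 (unsigned long long)rl.rlim_max);
    std::exit(1);
  }
  // Safe as long as nothing uses select(), per the blog post above.
  rl.rlim_cur = rl.rlim_max;
  if (setrlimit(RLIMIT_NOFILE, &rl) != 0) {
    std::perror("setrlimit(RLIMIT_NOFILE)");
    std::exit(1);
  }
}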

I've attempted to confirm that there's no use of select() anywhere in the Ceph codebase. All I could find is https://github.com/aquarist-labs/ceph/blob/533b54b55692534fab0a681fb3712f2742b8fa1a/src/msg/async/EventSelect.cc#L79, which isn't used on Linux anyway (it's a fallback for when epoll isn't available), and https://github.com/aquarist-labs/ceph/blob/533b54b55692534fab0a681fb3712f2742b8fa1a/src/tools/rbd/action/Perf.cc#L561, which is in the rbd command-line tool, so it isn't relevant for us.

tserong referenced this issue in tserong/ceph Oct 13, 2023
We potentially need at least 4 FDs per worker thread (two for the
sqlite db and its WAL, and another two to accommodate files that
may be being read or written), plus about 40 for various pipes and
sockets and things that appear in /proc/$(pgrep radosgw)/fd
before anything interesting happens.  That's more than two thousand
FDs, but the default soft FD limit is only 1024.

The most straightforward and probably safest thing to do is just
bump the RLIMIT_NOFILE soft limit (1024) to the hard limit (which
these days should be 524288) on startup.  In case the hard limit
is somehow low, this commit also includes a check to see if it's
at least as high as what we imagine we need.

See https://0pointer.net/blog/file-descriptor-limits.html for
discussion on bumping RLIMIT_NOFILE.

Fixes: https://github.com/aquarist-labs/s3gw/issues/752
Signed-off-by: Tim Serong <[email protected]>

tserong commented Oct 13, 2023

OK, I've updated aquarist-labs/ceph#229 to bump the soft limit (1024) to the hard limit (which should be 524288 on any reasonably modern system). Given that the limit is that huge, I don't think we need to actually allocate the couple of thousand FDs we suspect we need at most.

github-project-automation bot moved this from In Progress 🏗️ to Done in S3GW Oct 13, 2023
0xavi0 referenced this issue in aquarist-labs/ceph Oct 18, 2023