
[BUG] concurrent multipart uploads occasionally raises unhandled exception #710

Closed
tserong opened this issue Sep 15, 2023 · 4 comments · Fixed by aquarist-labs/ceph#225
Labels: kind/bug (Something isn't working), priority/0 (Needs to go into the next release or force a patch)
@tserong (Contributor) commented Sep 15, 2023

Describe the bug

Running concurrent multipart puts results in occasional failures, e.g.:

> s3cmd -c .s3cfg-test put - s3://foo/test-single-B-132
[...]
WARNING: Upload failed: /test-single-B-132?partNumber=1&uploadId=20230915T094808.800730568Z (500 (InternalError))
WARNING: Waiting 3 sec...

The s3gw log will show something like this:

2023-09-15T19:47:39.752+1000 7f4d4e9b16c0  0 req 0 0.003333303s s3:put_obj !!! BUG Unhandled exception while executing operation put_obj: database is locked: database is locked. replying internal error
2023-09-15T19:47:39.752+1000 7f4d541bc6c0  1 ====== req done req=0x7f4ce20d76e0 op status=0 http_status=200 latency=0.003333303s ======
2023-09-15T19:47:39.752+1000 7f4d541bc6c0  1 beast: 0x7f4ce20d76e0: 127.0.0.1 - testid [15/Sep/2023:19:47:39.749 +1000] "POST /foo/test-single-B-34?uploads HTTP/1.1" 200 243 - - - latency=0.003333303s
2023-09-15T19:47:39.755+1000 7f4d4e9b16c0  0 req 0 0.006666606s s3:put_obj START BACKTRACE (exception St12system_error)
 ceph version s3gw-v0.20.0-rc2-34-g2cf997b980c (2cf997b980ca03ee54b87ec46241ebbdac036ed9) reef (dev)
 1: (rgw_process_authenticated(RGWHandler_REST*, RGWOp*&, RGWRequest*, req_state*, optional_yield, rgw::sal::Driver*, bool)+0x106d) [0x55856a0b9756]
 2: (process_request(RGWProcessEnv const&, RGWRequest*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, RGWRestfulIO*, optional_yield, rgw::dmclock::Scheduler*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*, int*)+0x13ab) [0x55856a0bb0ba]
 3: bin/radosgw(+0x208ba0e) [0x55856a016a0e]
 4: bin/radosgw(+0x208c546) [0x55856a017546]
 5: bin/radosgw(+0x208c703) [0x55856a017703]
 6: bin/radosgw(+0x208c874) [0x55856a017874]
 7: bin/radosgw(+0x208c943) [0x55856a017943]
 8: make_fcontext()

END BACKTRACE

s3cmd retries the put, so everything ultimately works, but I don't like the look of that !!! BUG Unhandled exception thing. Three exclamation marks must be bad, right? ;-)
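For context, "database is locked" is SQLite's SQLITE_BUSY error surfacing as an exception, and s3cmd's client-side retry is the same mitigation applied one layer up. A minimal sketch of that retry pattern (a hypothetical helper for illustration only, not s3gw code — the real fix belongs in the SFS backend):

```python
import time
import sqlite3

def with_busy_retry(fn, attempts=5, backoff=0.1):
    """Retry fn() when SQLite reports 'database is locked' (SQLITE_BUSY).

    Hypothetical helper for illustration; relying on retries like this
    papers over the contention rather than fixing it server-side.
    """
    for i in range(attempts):
        try:
            return fn()
        except sqlite3.OperationalError as e:
            # Re-raise anything that isn't a busy/locked error, and give
            # up after the final attempt.
            if "locked" not in str(e) or i == attempts - 1:
                raise
            time.sleep(backoff * (2 ** i))  # exponential backoff

# Simulate an operation that hits SQLITE_BUSY twice before succeeding.
calls = []
def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise sqlite3.OperationalError("database is locked")
    return "ok"

result = with_busy_retry(flaky)
print(result)  # ok (after two retries)
```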

To Reproduce:

  1. Run the radosgw binary. In my source checkout I'm using:
     bin/radosgw -i bar -d --no-mon-config --debug-rgw 1 --rgw-backend-store sfs --rgw-data $(pwd)/dev/rgw.bar --run-dir $(pwd)/dev/rgw.bar --rgw-sfs-data-path $(pwd)/dev/rgw.bar --rgw-gc-processor-period 3600 --rgw-frontends 'beast port=7480,status port=9090' > $(pwd)/dev/rgw.bar/s3gw.log
  2. Run several instances of something like echo test | s3cmd put - s3://foo/test in a loop, in a few different terminal windows, e.g.:
     for i in $(seq 1 100) ; do echo test | s3cmd put - s3://foo/test-single-A-$i ; done
  3. Watch the terminal windows and s3gw.log and wait for errors.

Expected behavior

No errors.

Additional context

Note: this is running from the head of our s3gw branch of ceph on 2023-09-15, i.e. aquarist-labs/ceph@2cf997b

[Project activity, Sep 15 – Oct 4, 2023: triaged and labeled kind/bug + priority/0; milestone set to v0.22.0, changed to v0.21.0, then back to v0.22.0; moved Backlog → Scheduled → To Do → Scheduled on the S3GW project board.]
@0xavi0 (Contributor) commented Oct 9, 2023

Hmm, this looks related to (or a duplicate of) https://github.com/aquarist-labs/s3gw/issues/626.
I proposed aquarist-labs/ceph#224... I think it also fixes this one.

@tserong (Contributor, Author) commented Oct 11, 2023

Unfortunately I can still hit this with aquarist-labs/ceph#224 in place:

2023-10-11T18:18:12.170+1100 7f25d198b6c0 10 req 0 0.006666600s s3:complete_multipart > multipart_upload_v2::complete upload_id: 20231011T071812.102209581Z, accounted_size: 0, tag: , owner: M. Tester, olh_epoch: 0, target_obj: test-single-C-87
2023-10-11T18:18:12.170+1100 7f25d198b6c0 10 req 0 0.006666600s s3:complete_multipart > multipart_upload_v2::complete part_etags: {1=d8e8fca2dc0f896fd7cb4cb0031ba249}
2023-10-11T18:18:12.170+1100 7f25d198b6c0  0 req 0 0.006666600s s3:complete_multipart !!! BUG Unhandled exception while executing operation complete_multipart: database is locked: database is locked. replying internal error
2023-10-11T18:18:12.174+1100 7f25d198b6c0  0 req 0 0.009999900s s3:complete_multipart START BACKTRACE (exception St12system_error)
 ceph version s3gw-v0.21.0-12-g26316b8dc28 (26316b8dc28500f4486c5aba674ee82701af8f3b) reef (dev)
 1: (rgw_process_authenticated(RGWHandler_REST*, RGWOp*&, RGWRequest*, req_state*, optional_yield, rgw::sal::Driver*, bool)+0x106d) [0x55fb8c89fda2]
 2: (process_request(RGWProcessEnv const&, RGWRequest*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, RGWRestfulIO*, optional_yield, rgw::dmclock::Scheduler*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*, int*)+0x13ab) [0x55fb8c8a1706]
 3: bin/radosgw(+0x2097c32) [0x55fb8c7fec32]
 4: bin/radosgw(+0x209876a) [0x55fb8c7ff76a]
 5: bin/radosgw(+0x2098927) [0x55fb8c7ff927]
 6: bin/radosgw(+0x2098a98) [0x55fb8c7ffa98]
 7: bin/radosgw(+0x2098b67) [0x55fb8c7ffb67]
 8: make_fcontext()

END BACKTRACE

@irq0 (Contributor) commented Oct 11, 2023

This might be the SQLiteMultipart::mark_complete transaction. Will PR a fix shortly.
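Not having seen the fix, here is a minimal standalone sketch (using Python's stdlib sqlite3, not s3gw code) of how a held write transaction like mark_complete's can make a concurrent writer fail immediately with "database is locked" when no busy timeout is set:

```python
import os
import sqlite3
import tempfile

# Two connections to one database file stand in for two concurrent
# radosgw requests hitting the SFS SQLite store.
path = os.path.join(tempfile.mkdtemp(), "sfs.db")

writer = sqlite3.connect(path, timeout=0)  # timeout=0: fail instead of waiting
other = sqlite3.connect(path, timeout=0)
writer.execute("CREATE TABLE parts (id INTEGER PRIMARY KEY, etag TEXT)")
writer.commit()

# The first request opens a write transaction (as mark_complete
# presumably does) and holds it while doing work...
writer.execute("BEGIN IMMEDIATE")
writer.execute("INSERT INTO parts (etag) VALUES ('d8e8fca2dc0f896fd7cb4cb0031ba249')")

# ...so a second writer gets SQLITE_BUSY right away, which is what
# bubbles up as the unhandled 'database is locked' system_error above.
try:
    other.execute("BEGIN IMMEDIATE")
    locked = False
except sqlite3.OperationalError as exc:
    locked = "locked" in str(exc)

writer.commit()
print(locked)  # True
```

With a nonzero busy timeout (sqlite3.connect(path, timeout=5.0), or PRAGMA busy_timeout in the C API) the second writer waits for the lock instead of failing, which is one common mitigation; the actual PR may well take a different approach.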

@tserong (Contributor, Author) commented Oct 11, 2023

Seems to work fine with aquarist-labs/ceph#225 applied. I just did a quick test run; I didn't actually look at the code ;-)
