Sharing temporary_buffer across shards? #2450
Make refs atomic to enable distinct deleter objects on different shards to safely share an underlying deleter::impl. Maybe fixes scylladb#2450.
`temporary_buffer` is too low-level to directly support cross-shard traffic: a request is likely to require many such objects, and making all of them pay the price is too much. I think it's better to re-architect so sharing happens at the request level rather than the buffer level. That is, have some object that owns the buffers, and share it with `make_foreign`. If you do want to share a temporary buffer, you can make a new `temporary_buffer` pointing at the same storage, but with a deleter that `foreign_ptr`-points to the original. On the original shard: the original `temporary_buffer` …
We considered something like this, but I had thought it was unsafe, since a foreign pointer only helps for deletion, but some other tb operations (e.g., …)

Effectively, each shard has a unique copy of a "child" reference count, and when a shard's local reference count goes to zero it uses the `foreign_ptr` mechanism to update the "parent" reference count on the original shard. It doesn't make sharing transparent (you still need to do the dance at the shard boundary) and it has some other downsides, but maybe it's better than making all tb transparent.
@avikivity wrote:
This seems a bit hard to pull off because it's not only about destruction: even …
You mean:

```cpp
const seastar::temporary_buffer<char>& buf_orig = /* ... */;
auto del = buf_orig.share().release();
temporary_buffer<char> buf_on_other_shard(buf_orig.get_write(), buf_orig.size(),
        make_object_deleter(make_foreign(std::make_unique<deleter>(std::move(del)))));
smp::submit_to(other_shard, [buf = std::move(buf_on_other_shard)] {
    // ...
});
```
My suggestion works if the caller halts work on the data while the callee is processing. If there is concurrent work on shared temporary_buffers, it's unsafe. You could do something like this:
If the caller cannot hold on to the deleter for some reason, it could chain it to the custom deleter in step 2. I discourage such games; they are likely to end with bad performance on large machines if you call many shards with the same t_b.
@tgrabiec wrote:
Yes.
Yes, that suggestion I understand, but this deleter trick was Avi's "if you really need to... but probably don't" suggestion. In my comment I was referring to the preferred approach of sharing at a higher "request" level, as Avi had suggested. In that case, it was implied there is no need to mess with the deleters.
Yes, but in the scenario we are considering this is almost impossible: the temporary buffers originate in the seastar network stack / `input_stream` machinery, so we don't know the extent of the sharing, and these buffers are manipulated concurrently in the reactor loop due to things like more data arriving on the network.

So once we get a buffer in userspace, even if we send it to another shard and take no further action until that is done, it seems unsafe because seastar internals may also manipulate it. Even if there is some kind of strong guarantee about exactly when the reference counts will be manipulated by seastar and we can somehow rule it out, a common pattern would be to read, say, one buffer off the wire, send it to a shard for processing, and then read a second buffer without waiting. At the application layer we may always wait for the other shard to finish processing a request/buffer before manipulating it, but here even manipulating different buffers for different requests which originated on the same socket is unsafe, because the buffers are already shared when they pop out of the `input_stream`.

This is not a complaint. I don't see an obvious solution: it is an obvious win for …
The native one? Or posix? In either case, my second suggestion will work (and indeed is safer: you keep the refcounts always local). In fact, to be completely correct, you'd pass the submit_to gap using a `std::span`, and create the new `temporary_buffer` on the other side. Or even keep using `std::span`.
I can give a guarantee that it's not safe to manipulate shard-crossing temporary_buffers.
I think the recipe in #2450 (comment) should work (with or without constructing a `temporary_buffer` on the callee side), as long as lifetimes are nested. An alternative solution: make the `buffer_allocator` interface public, make sure it can be passed down to your data source, and customize it so it generates cross-shard-safe `temporary_buffer`s. I don't recommend it, though, and it probably doesn't work: the buffer chains will still be unsafe.
posix
Yes, agreed. I just wanted to make sure I wasn't missing anything about the first suggestion.
Agreed.
I'd be pessimistic about the chances of upstreaming this.
Currently, it is not safe to use distinct `temporary_buffer` objects which share any underlying buffers across shards (aka "hidden sharing"). In practice, this generally means it is even unsafe to move any temporary buffers received from seastar to another shard, since these will often share underlying buffers with other temporary buffers (e.g., if they are derived from the same input stream). The only safe approach is to copy the underlying bytes in order to pass them to another shard.

This makes it very difficult to do "zero copy" manipulation of such buffers. A typical use case is that a connection is processing requests bound for various shards (depending on the contents of the request, e.g., what partition is involved), and we want to pass the entire payload of the request to the owning shard. This should be easy to implement in a zero-copy, thread-safe way: there is no concurrent use of the buffer at all; it is examined on one shard, then the entire buffer is moved to another shard (ownership could stay with the original shard or not; the key is that there is no concurrent access) where it is accessed. However, the hidden sharing inside `temporary_buffer` makes this impossible as far as I can tell.

The only definite reason we can find that `temporary_buffer` has unsafe hidden sharing is that `deleter::impl::refs` is accessed non-atomically.

Two questions:

1. Am I missing something about how to use distinct `temporary_buffer` objects across threads "zero-copy" without hitting this problem? For example, I feel like ScyllaDB would run into the same unnecessary copy as described above unless it somehow manages to guarantee that the "connection" shard is the same shard where the operation will need to be processed.
2. Is there any willingness to make `deleter::impl::refs` a `std::atomic<unsigned>`? It seems to us this would solve the cross-shard hidden sharing problem, with the possible exception of deleters of type `lambda_deleter_impl` which have non-trivial lambdas (or lambda destructors) that expect to run on the same shard they were created on. Evidently, this would have a performance impact for all `temporary_buffer` use. That change would look something like "deleter: make impl::refs atomic" #2451.

Maybe there's a better solution out there than (2).