Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve lifetime management of command encoders and buffers #6544

Merged
merged 4 commits into from
Dec 2, 2024

Conversation

teoxoy
Copy link
Member

@teoxoy teoxoy commented Nov 14, 2024

This PR improves the lifetime management of command encoders (it's now automatic on drop, no more manual destruction), it also removes duplicate invalidity state from command encodes and with the last commit, we now invalidate the command encoder if there have been any errors while recording (per spec).

@teoxoy
Copy link
Member Author

teoxoy commented Nov 14, 2024

I couldn't repro the issue in CI, thanks for pairing @ErichDonGubler and helping to debug it!

@teoxoy
Copy link
Member Author

teoxoy commented Nov 15, 2024

@ErichDonGubler I added the comment on the Queue.device field.

Copy link
Member

@ErichDonGubler ErichDonGubler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

praise: New code makes sense, and actually helped me understand the old code after many "wat"s ("why did we represent things this way before, am confoozed"). Nice!

Just got some questions I'd like resolved before merging, but I'm sure they'll be straightforward.

wgpu-core/src/device/queue.rs Show resolved Hide resolved
wgpu-core/src/device/queue.rs Show resolved Hide resolved
wgpu-core/src/device/queue.rs Show resolved Hide resolved
wgpu-core/src/device/resource.rs Show resolved Hide resolved
wgpu-hal/src/dx12/command.rs Outdated Show resolved Hide resolved
wgpu-core/src/command/mod.rs Outdated Show resolved Hide resolved
wgpu-core/src/command/mod.rs Outdated Show resolved Hide resolved
wgpu-core/src/command/mod.rs Show resolved Hide resolved
wgpu-core/src/command/mod.rs Outdated Show resolved Hide resolved
wgpu-core/src/command/clear.rs Outdated Show resolved Hide resolved
@ErichDonGubler ErichDonGubler added type: bug Something isn't working kind: refactor Making existing function faster or nicer labels Nov 15, 2024
wgpu-hal/src/metal/device.rs Outdated Show resolved Hide resolved
wgpu-hal/src/metal/command.rs Outdated Show resolved Hide resolved
wgpu-core/src/command/mod.rs Outdated Show resolved Hide resolved
wgpu-core/src/command/mod.rs Outdated Show resolved Hide resolved
wgpu-core/src/command/mod.rs Outdated Show resolved Hide resolved
wgpu-core/src/command/mod.rs Outdated Show resolved Hide resolved
wgpu-core/src/command/mod.rs Outdated Show resolved Hide resolved
wgpu-core/src/command/mod.rs Show resolved Hide resolved
wgpu-core/src/command/mod.rs Outdated Show resolved Hide resolved
@jimblandy
Copy link
Member

If it's acceptable for unexpected operations in the Finished state to transition the command buffer to the Error state, then this commit shows that, even aside from deleting the macro itself, the code can actually be shorter without the macro.

@teoxoy
Copy link
Member Author

teoxoy commented Nov 18, 2024

If it's acceptable for unexpected operations in the Finished state to transition the command buffer to the Error state, then this commit shows that, even aside from deleting the macro itself, the code can actually be shorter without the macro.

That's how the code used to look in a previous iteration of the patch, the issue was that we can't replace Finished with Error. We have tests that check for this, which is how I discovered it, I then looked at the spec and it also doesn't say that we can invalidate the encoder, only that a validation error is raised.

I don't like the macro either but it seemed appropriate to hide the ugliness. Something else we could try is to match twice in the methods, will put up a commit for that to see if we like it more.

@teoxoy
Copy link
Member Author

teoxoy commented Nov 18, 2024

I guess we could also just reassign the finished state back.

@teoxoy
Copy link
Member Author

teoxoy commented Nov 18, 2024

I think I addressed all the comments.

@jimblandy
Copy link
Member

One interesting wrinkle in this is that, since struct fields are dropped in order, dropping a wgpu_core::Device will drop Device::raw before Device::command_allocator. This means that wgpu_hal requires wgpu_hal::CommandEncoder implementations to be droppable even after the wgpu_hal::Device is gone.

@jimblandy
Copy link
Member

Sorry, I'm very close to done with this review. Will finish tomorrow morning.

@teoxoy
Copy link
Member Author

teoxoy commented Nov 19, 2024

One interesting wrinkle in this is that, since struct fields are dropped in order, dropping a wgpu_core::Device will drop Device::raw before Device::command_allocator. This means that wgpu_hal requires wgpu_hal::CommandEncoder implementations to be droppable even after the wgpu_hal::Device is gone.

This is the case right now but is a bit problematic that we now have Drop implementations for some objects in HAL but also the destroy_* methods that require callers to pass a &HalDevice (implicitly proving that they still have a HAL device around).

I'm not sure what the best solution is but I'm leaning towards adding Drop implementations for all HAL objects and moving the burden of keeping device backrefs to HAL backends. Interestingly (I think) only Vulkan would need device backrefs since Metal and D3D12 objects already have backrefs. Also, the current implementations of the destroy_* methods in Metal and D3D12 are almost always empty.

@jimblandy
Copy link
Member

I'm not sure what the best solution is but I'm leaning towards adding Drop implementations for all HAL objects and moving the burden of keeping device backrefs to HAL backends. Interestingly (I think) only Vulkan would need device backrefs since Metal and D3D12 objects already have backrefs. Also, the current implementations of the destroy_* methods in Metal and D3D12 are almost always empty.

Right now, the wgpu_hal documentation says, "A CommandEncoder must not outlive its Device." If wgpu_hal relaxes this rule, and allows CommandEncoders to outlive Devices, then we need to decide whether:

  • it is safe to call methods on a CommandEncoder after its Device has been dropped, or

  • once the Device has been dropped, the only thing you can do with a CommandEncoder is drop it.

@jimblandy
Copy link
Member

In any case, this PR as it stands doesn't respect wgpu_hal's rule that "A CommandEncoder must not outlive its Device." So either the code or the wgpu_hal documentation needs to be updated. Suggestion:

modified   wgpu-hal/src/lib.rs
@@ -1108,7 +1108,9 @@ pub trait Queue: WasmNotSendSync {
 /// - A `CommandBuffer` must not outlive the `CommandEncoder` that
 ///   built it.
 ///
-/// - A `CommandEncoder` must not outlive its `Device`.
+/// - A `CommandEncoder` may outlive the `Device` that created it. But once the
+///   `Device` has been dropped, the only operation permitted on the
+///   `CommandEncoder` is to drop it.
 ///
 /// It is the user's responsibility to meet these requirements. This
 /// allows `CommandEncoder` implementations to keep their state

@jimblandy
Copy link
Member

Wow, so dropping a CommandBufferMutable never recycled the wgpu_hal encoder in Device::command_allocator.

@jimblandy
Copy link
Member

jimblandy commented Nov 20, 2024

Right now, the wgpu_hal documentation says, "A CommandEncoder must not outlive its Device." If wgpu_hal relaxes this rule, and allows CommandEncoders to outlive Devices, then we need to decide whether:

  • it is safe to call methods on a CommandEncoder after its Device has been dropped, or

  • once the Device has been dropped, the only thing you can do with a CommandEncoder is drop it.

There's a third alternative:

  • Keep the wgpu_hal requirement that CommandEncoders must not outlive their Device, and change the order of fields in wgpu_core::device::Device so that the device is dropped last.

This wouldn't be a wgpu_hal property that any of the backends would actually exploit. But having a logical but unnecessary rule that's not hard to follow seems better than having strange rules. So I think this would be my first choice.

@teoxoy
Copy link
Member Author

teoxoy commented Nov 20, 2024

That sounds good to me too. I can make the change tomorrow but let me know if there is anything else I should address.

@jimblandy
Copy link
Member

jimblandy commented Nov 27, 2024

Having thought about how wgpu_core handles wgpu_hal objects a little more:

Broadly speaking, I think wgpu_core is not using Rust's unsafe concept in a way that gets us much leverage. Perhaps I'm missing some important pattern, but it looks to me like wgpu_core has a lot of functions that can cause UB if used incorrectly, usually through their use of wgpu_hal, and yet are not marked unsafe - that is, they're mislabeled. But of course, plastering unsafe on everything isn't helpful either. This is a pervasive issue, but for the moment let's assume that we want to repair this, and consider the sorts of changes this PR makes.

As long as their types don't have lifetime parameters, safe code can drop two values in any order it wants. This means that wgpu_hal types cannot have safety requirements about the order in which they are dropped. So this rule is no good:

A CommandEncoder must not outlive its Device.

Given ownership of a CommandEncoder and Device, entirely safe code could violate this rule and cause UB.

Now, since wgpu_hal::Device::create_command_encoder is unsafe, we could salvage this by adding a safety condition to it that says that the caller must ensure the drop order rules are respected. That is, the safety comment on the unsafe block that calls create_command_encoder would need to explain how it's going to ensure that the command encoder it created will not outlive the device used to do it. Then all types that operate on wgpu_hal command encoders produced by that block become responsible for upholding those promises.

But that would be pretty delicate, and it seems like wgpu_hal implementations actually don't need this property, since all the command encoders are, in practice, holding strong references to everything they need to be dropped safely. So it may be better for wgpu_hal to actually remove the requirement that CommandEncoder not outlive its Device. This would be the middle option of the three I suggested above.

We would want to consider whether wgpu_hal should forbid using CommandEncoder methods after the Device is closed.

Copy link
Member

@jimblandy jimblandy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two minor omissions, but otherwise this looks fantastic.

wgpu-core/src/command/ray_tracing.rs Show resolved Hide resolved
wgpu-core/src/command/ray_tracing.rs Show resolved Hide resolved
Copy link
Member

@jimblandy jimblandy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, and also, the wgpu_hal safety docs need to be updated to reflect whatever policy we settle on.

@teoxoy
Copy link
Member Author

teoxoy commented Nov 27, 2024

So it may be better for wgpu_hal to actually remove the requirement that CommandEncoder not outlive its Device. This would be the middle option of the three I suggested above.

This sounds good to me as well. I will remove the comment that says it must not outlive its Device.

We would want to consider whether wgpu_hal should forbid using CommandEncoder methods after the Device is closed.

I think it's fine if we allow using CommandEncoder methods after dropping a hal Device as it's not a requirement that hal needs and it would have the same non-obvious ramifications.

Copy link
Member

@jimblandy jimblandy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great - thank you!

teoxoy and others added 4 commits December 2, 2024 15:41
…iants to hold `CommandBufferMutable`

This makes the code more straightforward, we were previously holding invalidity state in 2 places: `CommandBuffer::data` could hold `None` and in `CommandEncoderStatus::Error`.

This commit also implements `Drop` for `CommandEncoder` which makes the destruction/reclamation code automatic. We were previously not reclaiming all command encoders (`CommandBufferMutable::destroy` didn't call `release_encoder`) even though all encoders are coming from a pool.
The spec requires us to invalidate the encoder if there was any error during the body of these operations.
@jimblandy jimblandy enabled auto-merge (rebase) December 2, 2024 23:47
@jimblandy jimblandy merged commit 02b28e2 into gfx-rs:trunk Dec 2, 2024
27 checks passed
@teoxoy teoxoy deleted the next branch December 3, 2024 10:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind: refactor Making existing function faster or nicer type: bug Something isn't working
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

4 participants