-
-
Notifications
You must be signed in to change notification settings - Fork 21.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[3.x] Add 3D occlusion queries #76869
base: 3.x
Are you sure you want to change the base?
Conversation
Please amend the commit message to be more explicit (the detailed PR description is great, but commit messages also matter - see CONTRIBUTING.md or browse through |
Should be fun to try! 😃 |
Note that the Windows task manager's GPU usage readout is wildly known to be inaccurate. I suggest using external tools such as RTSS instead, or use |
Yeah, I know. I only used the task manager to get a quick screenshot for the PR description. During development I used a number of different tools to measure performance, but when I wrote the PR description it was quite late and I was too tired to set up everything and then take good screenshots. Thanks for the docs feedback, I will change that and the commit message once I am back at my PC. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, this looks pretty good. I have a few specific concerns that I note in specific comments.
One other concern is the fact that only opaque objects are checked. Transparent objects are never culled. What is the reasoning behind excluding transparent objects? To me it seems like they would benefit just as much.
@@ -1719,6 +1719,13 @@ | |||
<member name="rendering/quality/lightmapping/use_bicubic_sampling.mobile" type="bool" setter="" getter="" default="false"> | |||
Lower-end override for [member rendering/quality/lightmapping/use_bicubic_sampling] on mobile devices, in order to reduce bandwidth usage. | |||
</member> | |||
<member name="rendering/quality/occlusion_queries/disable_for_vendors" type="String" setter="" getter="" default=""PowerVR,Adreno""> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In your testing, for which vendors did this actually improve performance, just ARM Mali?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It depended a lot on the platform & hardware. I managed to get improvements (desktop & mobile) on most of them, but Adreno was a burning dumpster fire.
But that might have been just my Adreno device, I have had trouble with it in the past.
if (use_oq) { | ||
render_list.sort_by_depth(false); | ||
} else { | ||
render_list.sort_by_key(false); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm worried about the performance impact of this on complex scenes. Doing this will result in doing a lot of state changes that are not necessary.
I am also skeptical that running occlusion queries while building the depth buffer will result in a net performance gain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, I need to do this since AFAIK I don't have a depth prepass on mobile.
On desktop I managed to get a nice performance boost during the early depth pass since the bounding boxes only need 8 vertices. In badly optimized scenes that can make a suprisingly big difference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, you may need to have a more fine-grained check here. If the setting is only a benefit when not using a depth prepass, then it should only be enabled when not using a depth prepass.
Sorting is less beneficial while rendering the depth prepass, so I guess it may make sense to sort by depth in the prepass when occlusion queries are enabled.
What is the performance overhead like for rendering an object with occlusion queries enabled on mobile? My understanding is that everyone always uses AABBs of the objects after the full depth buffer is built. That way you are always testing against a complete depth buffer
if (use_oq) { | ||
render_list.sort_by_depth(false); | ||
} else { | ||
render_list.sort_by_key(false); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same concern as above
uint64_t prev_frame = 0; | ||
GLuint query; | ||
}; | ||
//TODO: remove stuff from this map when an instance is removed from the rendering server. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't notice where you do this cleanup, but indeed, you will need to ensure that you aren't leaking queries
@@ -3640,6 +3640,46 @@ void RasterizerStorageGLES3::mesh_add_surface(RID p_mesh, uint32_t p_format, VS: | |||
surface->attribs[i] = attribs[i]; | |||
} | |||
|
|||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest storing the AABB on CPU side, and then create the vertex buffer on demand the first time it is needed. Otherwise this runs for every object created even if occlusion culling is never used
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to avoid doing stuff on demand but maybe you are right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that creating buffers on demand sucks. But we can't add a fixed cost to every single project just to benefit users of a particular feature
@@ -8395,6 +8439,23 @@ void RasterizerStorageGLES3::initialize() { | |||
|
|||
config.use_physical_light_attenuation = GLOBAL_GET("rendering/quality/shading/use_physical_light_attenuation"); | |||
|
|||
config.use_occlusion_queries = bool(GLOBAL_GET("rendering/quality/occlusion_queries/enable")); | |||
// I <3 different gpu drivers! *screams in agony* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lol, funny, but should be removed before merging
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That shouldn't be there. Sorry. At least I managed to find & remove all the more explicit notes I left while working.
use_oq = use_oq && VSG::viewport->viewport_get_update_mode(VSG::viewport->current_viewport_id) != VisualServer::ViewportUpdateMode::VIEWPORT_UPDATE_ONCE; | ||
use_oq = use_oq && VSG::viewport->viewport_get_update_mode(VSG::viewport->current_viewport_id) != VisualServer::ViewportUpdateMode::VIEWPORT_UPDATE_DISABLED; | ||
use_oq = use_oq && VSG::viewport->viewport_get_allow_occlusion_queries(VSG::viewport->current_viewport_id); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pulling the data from the VSG feels a bit icky to me. I think its not actually a problem as these methods aren't exposed through the CommandQueue interface so this shouldn't stall the rendering thread, but it seems like an messy dependency.
I have a feeling that use_oq
should be passed down from the VisualServer into render_scene. That way all this logic can stay with the viewport
glGetQueryObjectuiv(data.query, GL_QUERY_RESULT_AVAILABLE, &query_result_available); | ||
if (query_result_available == GL_FALSE) { | ||
query_result = 1; //let's just assume this is visible | ||
} else { | ||
glGetQueryObjectuiv(data.query, GL_QUERY_RESULT, &query_result); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if you check if cur_frame - prev_frame >= 2
here to allow for up to 2 frames in flight.
That being said, the description of GL_QUERY_RESULT_AVAILABLE makes it sound like it doesn't have to wait for the GPU to catch up before returning (as the function would be pointless if it did)
This makes it possible cull away invisible meshes by relying on their AABBs.
This PR makes it possible to use occlusion queries to cull away parts of the scene that are not visible. While not being quite as high quality as the manual occluders we already have this has the advantage that it can be enabled by clicking a simple check box in the settings.
The visibility of culled objects is checked with their bounding box every frame, if the bounding box would be visible the renderer checks the visibility of the original asset in the next frame. Especially in scenes with badly optimized 3d assets that have too many vertices, this can result in nice performance gains (see screenshot below).
Note that the performance of occlusion query can vary a LOT on different hardware & drivers, so I included a settings option to disable this feature depending on the GPU vendor. I also hard-limited this feature to meshes that have more than 256 vertices, due to the overhead of OQ.
The current implementation is as general-purpose as I could make it, with OQ enabled for pretty much all meshes. I have two branches that are more complicated and try to be smarter about this (e.g. exclude animated meshes, etc.) but after a LOT of
professional testingtrial and error, I think that this version is better for most use cases.Something that I will probably need to change is the way I generate the query objects, I left a comment at the relevant line of code. Would greatly appreciate feedback on that in particular.
Screenshot
Sorry for the lackluster image, This is the best I could do on short notice. If I remember I will replace this with a better screenshot tomorrow when I take some time to work through my old PRs.
Anyways, this shows the GPU load in a scene that has a bunch of unoptimized meshes. The valleys are when I had a wall between me and the hero assets, the peaks are when I could see them.
Note that this image was taken with the depth pre-pass disabled. If I were to enable it the difference would be even bigger, since the pre-pass needs to evaluate all the vertices a second time.
This PR was sponsored by Ramatak with 💚