Frustum-cull small draws #17808
This isn't a huge performance boost for the games that use BBOX (like Tekken), but it'll be more valuable if we start using soft culling more widely automatically, see #17808
Hm, this seems like a good idea for us to skip loading textures - could have interesting impacts on performance in some scenes. I think it'd be good (like software skinning, but hopefully not for as long...) to start as an option. That way people could experiment with the pros and cons and give feedback on heuristics/etc. to remove the option. Potentially we could use less-accurate culling checks that are faster and split the func. I'd like to keep the bbox jumps accurate but we could deviate for our own, since it doesn't have to be exact - it just has to skip enough to be profitable. Through mode is probably easy to cull but I wonder if it even happens often. -[Unknown]
After some testing, on PC it's hard to beat just performing the draws vs doing this extra work to cull, for complex scenes like in God of War. In GTA though, it's pretty much even or a slight boost. Indeed, an option to experiment with is probably best. (And also, there's a lot of room for optimization - running the whole NormalizeVertex is total overkill). As for less accurate options, one would be to compute a bounding box during vertex loading. Also, there might be an interesting tradeoff to do this at Flush-time instead of at Submit-time. Not sure.
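The "bounding box during vertex loading" idea could look roughly like the sketch below. This is only an illustration, not the emulator's decoder: `Bounds` and `DecodePositionsWithBounds` are hypothetical names, and it assumes tightly packed signed 16-bit positions scaled by 1/32768.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

// Hypothetical axis-aligned bounds, accumulated while decoding.
struct Bounds {
    float min[3];
    float max[3];
};

// Hypothetical decode loop: converts packed s16 xyz positions to float
// and tracks the min/max per axis in the same pass.
Bounds DecodePositionsWithBounds(const int16_t *src, float *dst, size_t vertexCount) {
    Bounds b;
    for (int axis = 0; axis < 3; axis++) {
        b.min[axis] = std::numeric_limits<float>::infinity();
        b.max[axis] = -std::numeric_limits<float>::infinity();
    }
    const float scale = 1.0f / 32768.0f;
    for (size_t i = 0; i < vertexCount; i++) {
        for (int axis = 0; axis < 3; axis++) {
            float v = static_cast<float>(src[i * 3 + axis]) * scale;
            dst[i * 3 + axis] = v;
            b.min[axis] = std::min(b.min[axis], v);
            b.max[axis] = std::max(b.max[axis], v);
        }
    }
    return b;
}
```

The resulting box is cheaper to test than every vertex, but it is more conservative: a box can overlap the frustum even when none of the actual triangles do.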
Okay, I made it a bit more conservative - many games submit a lot of tiny drawcalls which we end up joining together, and I now assume that if one drawcall passes culling in the "inner fastloop", all of them will, which cuts down a lot on checking in some games. Additionally, I made a fast-path for non-skinned non-morph geometry, avoiding NormalizeVertices. Now I can't really find any slowdowns even on PC, except my extreme GoW testcase which goes from 560 to 550 fps. Though, we still need to solve or avoid the interaction with vertex cache before merging - it'll reduce its efficiency a bit. Actually maybe this is the time to delete the vertex cache ... I'll do some Android testing.
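A minimal sketch of the "if one passes, all of them will" heuristic, under the assumption that a run of small draws is processed in one loop. `SmallDraw`, `SubmitRun` and the `testVisible` callback are hypothetical stand-ins, not names from the actual code.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical stand-in for one small draw call in a run that gets merged.
struct SmallDraw {
    int id;
};

// Returns the ids of the draws that get submitted. testVisible stands in for
// the (comparatively expensive) software culling test. Once any draw in the
// run passes, the remaining draws are submitted without being tested.
std::vector<int> SubmitRun(const std::vector<SmallDraw> &run,
                           const std::function<bool(const SmallDraw &)> &testVisible,
                           int *testsPerformed) {
    std::vector<int> submitted;
    bool onePassed = false;
    *testsPerformed = 0;
    for (const SmallDraw &draw : run) {
        bool passCulling = onePassed;
        if (!passCulling) {
            ++*testsPerformed;
            passCulling = testVisible(draw);
        }
        if (passCulling) {
            onePassed = true;
            submitted.push_back(draw.id);
        }
    }
    return submitted;
}
```

The tradeoff is that draws after the first visible one are never culled, so the worst-case cost per run is bounded at the price of some missed culling.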
Rebased it. With the new draw call merging, my pathological GoW case doesn't slow down much anymore (since we skip the culling machinery entirely in that case, once one draw has been proven visible). So apart from the icky interactions with the vertex cache, which need solving, might consider actually merging this.
Current status: This does work quite well, but is blocked on #18339, and I also want to make sure that games don't end up in the "NormalizeVertices" path here, since it'll likely be expensive enough to eat up any wins.
Together with previous optimizations to drawing, this is already fast enough now. But it will of course have varying benefit in different games; some, like Wipeout, end up net zero since they already cull very efficiently. The previous vertex cache concern is now also gone, since it's, well, gone. So starting to think about just merging it without even adding an option, just to solve #17797... One concern might be that without SIMD'd matrix multiplies in the update function, maybe it'll incur some slowdown? Not sure. There are also some more possible optimizations to implement.
Optimized it a bit. I'm still a little bit afraid of performance regressions from the large number of plane updates that are caused in some games like Burnout Dominator. We could make far fewer of those happen by not including the world matrix in the planes (since the world matrix is by far the most commonly updated one), but then the plane checks will be a bit more expensive. Tricky tradeoffs. Though, in practice, I don't see much performance regression anywhere, but also where there are improvements they are not big. So still a bit in doubt here about the overall value, except for that one game in #17797 which will improve a lot :/
I moved it into view space (to avoid updating the planes on every world matrix change, at the cost of transforming each vertex instead) and SSE-optimized it. Not seeing any perf regressions anymore, only wins. So I'll just do NEON as well tomorrow and get it in.
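As a rough scalar sketch of culling in view space: the planes stay fixed while the view and projection are unchanged, and each vertex is transformed by the combined world-view matrix before the plane tests. Everything here is an assumption for illustration (the `Vec3`, `Plane` and `Mat3x4` types, a symmetric 90-degree frustum looking down -Z, row-major matrices); the real code is SIMD and follows the emulated hardware's conventions.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

struct Vec3 { float x, y, z; };
// A point is inside the plane when a*x + b*y + c*z + d >= 0.
struct Plane { float a, b, c, d; };
// Row-major 3x4 affine transform, translation in the last column.
struct Mat3x4 { float m[3][4]; };

Vec3 Transform(const Mat3x4 &mat, const Vec3 &v) {
    return {
        mat.m[0][0] * v.x + mat.m[0][1] * v.y + mat.m[0][2] * v.z + mat.m[0][3],
        mat.m[1][0] * v.x + mat.m[1][1] * v.y + mat.m[1][2] * v.z + mat.m[1][3],
        mat.m[2][0] * v.x + mat.m[2][1] * v.y + mat.m[2][2] * v.z + mat.m[2][3],
    };
}

// View-space planes for a symmetric 90-degree frustum looking down -Z.
// These only change with the projection, never with the world matrix.
std::array<Plane, 6> MakeViewFrustum(float nearZ, float farZ) {
    return {{
        { 1.0f,  0.0f, -1.0f, 0.0f},   // left
        {-1.0f,  0.0f, -1.0f, 0.0f},   // right
        { 0.0f,  1.0f, -1.0f, 0.0f},   // bottom
        { 0.0f, -1.0f, -1.0f, 0.0f},   // top
        { 0.0f,  0.0f, -1.0f, -nearZ}, // near
        { 0.0f,  0.0f,  1.0f, farZ},   // far
    }};
}

// Conservative test: cull only if every vertex is outside the same plane.
bool IsCulled(const Vec3 *verts, size_t count, const Mat3x4 &worldView,
              const std::array<Plane, 6> &planes) {
    unsigned allOutside = 0x3F;
    for (size_t i = 0; i < count; i++) {
        Vec3 p = Transform(worldView, verts[i]);
        unsigned outside = 0;
        for (size_t k = 0; k < planes.size(); k++) {
            const Plane &pl = planes[k];
            if (pl.a * p.x + pl.b * p.y + pl.c * p.z + pl.d < 0.0f)
                outside |= 1u << k;
        }
        allOutside &= outside;
        if (allOutside == 0)
            return false;  // No single plane rejects all vertices so far.
    }
    return count > 0;
}
```

A world matrix change here only means a different `worldView` argument; the six planes are untouched, which is the point of the tradeoff described above.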
Some games do a poor job of culling stuff, and some transparent sprites can be very expensive if they cause a copy. Skipping them if outside the viewport makes sense in that case. One example is the flame sprites in #17797. Additionally, we should be able to cull through-mode draws easily, this one doesn't even try.
There, I think this is finally done. It's actually a noticeable boost now, instead of a loss, even in God of War. The amount of culling we get from this varies hugely between games. In LCS we cull 500 (tiny) draw calls per frame, in Wipeout around 10-20, in Tekken a bit more, in Virtua Tennis a lot.
EDIT: Fixed in 904ce4f
Instead of

    // Sign extension. Ugly without SSE4.
    bits = _mm_srai_epi32(_mm_unpacklo_epi16(bits, bits), 16);
    __m128 pos = _mm_mul_ps(_mm_cvtepi32_ps(bits), scaleFactor);

maybe

    bits = _mm_unpacklo_epi16(_mm_set1_epi32(0), bits);
    __m128 pos = _mm_mul_ps(_mm_cvtepi32_ps(bits), scaleFactor2); // scaleFactor2 = 2^(-(15+16))

Zero probably would be computed outside of the loop.
Actually never mind, I misread. Your thing will probably work yes, since in that we incorporate the right shift in the scale factor. Clever!
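A scalar illustration of why the suggestion works, assuming signed 16-bit positions scaled by 2^-15 as in the snippet above. The function names are hypothetical; the SSE version does the same thing per 32-bit lane.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Reference path: sign-extend the 16-bit value, then scale by 2^-15.
float ConvertSignExtend(uint16_t raw) {
    int16_t s;
    std::memcpy(&s, &raw, sizeof(s));
    return static_cast<float>(static_cast<int32_t>(s)) * (1.0f / 32768.0f);
}

// Suggested path: put the 16 bits in the upper half of a 32-bit lane with a
// zero lower half (what unpacking with zero produces), reinterpret as signed,
// and fold the missing ">> 16" into the scale factor: 2^-(15+16).
float ConvertHighHalf(uint16_t raw) {
    uint32_t shifted = static_cast<uint32_t>(raw) << 16;
    int32_t v;
    std::memcpy(&v, &shifted, sizeof(v));
    return static_cast<float>(v) * (1.0f / 2147483648.0f);
}
```

Both paths are exact in single precision here: the shifted value is a 16-bit integer times a power of two, so the int-to-float conversion and the power-of-two scale lose nothing.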
    bool passCulling = onePassed || PASSES_CULLING;
    if (!passCulling) {
        // Do software culling.
        if (drawEngineCommon_->TestBoundingBox(verts, inds, count, vertexType)) {
I think you meant for this one to be TestBoundingBoxFast() too? -[Unknown]
Oops, fixing
This is an experiment, some games do a poor job of culling stuff, and some transparent sprites can be very expensive if they cause a framebuffer copy. Skipping them if outside the viewport makes sense in that case. We simply re-use the bbox code for now, though this could be optimized.
We are simply re-using the BBOX culling code, which is not very efficient. One example is the flame sprites in #17797.
!!!! Not for merging! It seems our culling isn't completely accurate and needs work. In GTA:LCS for example, some triangles intersecting the screen edges get culled by mistake. However it also manages to cull 600 draw calls per frame in that game (that inefficient water). Actually, my logic was just busted. It works! Additionally, we should be able to cull through-mode draws easily, this one doesn't even try.
Anyway, there are a few ways we can go with this:
In many games, the number of draws culled is very small since games do a good job themselves. In other games, the draws culled can be pretty high. Avoiding draws entirely can save a lot of work doing things like texture binding etc, but this will not be beneficial for all games due to the extra work (plus, the culling code needs a lot of optimization if we're gonna apply it widely).
Also just realized that if applied widely, this is going to seriously mess with the vertex cache when we try to cache merged sequences of draw calls - if parts of one of those gets culled, there'll be a lot of combinations... The vertex cache is now gone, so not a concern. Ended up doing culling in view space so we don't need to update the planes for every new world matrix, and additionally, SIMD-optimized the thing carefully. Now it's very fast and generally a win.