Headless rendering much slower on AMD GPUs? (30x slower) #1208
Comments
I think it's important to differentiate between rendering on the GPU and copying data to and from the GPU over the PCI Express bus. I presume this thread is actually about copying rather than rendering, so the title of this thread is most likely misleading. Is this so, or am I just reading things wrong? There isn't any information above about the amount of data being transferred or what mechanism is being used. With unexpected differences in performance between hardware/drivers, sometimes Vulkan errors have occurred that one hardware/driver combination copes with just fine but that slow others down. Running the application with the Vulkan validation layer on would be a useful test to make sure there are no issues that need fixing. As a general comment, when writing in English it's best to stick with English language conventions on numbers, so a . is a decimal point, not a delimiter between thousands. A German convention of 13.000 in an English language text will be read as 13, not 13 thousand. Having to second-guess what folks might mean by what they write just takes away from the bandwidth required to understand the actual problem in hand. |
I apologize if I was unable to communicate the issue at hand clearly enough.
That is the curious thing here. We were seeing worse performance on AMD GPUs than we would have expected (by taking a rough guess based on the hardware specs). We have now ended up with the minimal example code that I mentioned above, and it is still showing the same low performance & unexpected COPY in Task-Manager that we were seeing in our production app:
Translating this code to OpenGL for example, would be similar to a render-loop that is just doing So the framerate should be very high, and there should not be any GPU memory-copies happening ... because this is doing headless rendering, there should also be no VK Swapchain involved in any way. That is why the COPY workload in the Task-Manager & the low FPS on AMD are so surprising & unexpected. @robertosfield Thanks |
I had a quick look at the example and nothing jumps out as a possible cause of slower rendering. I'm really busy with other VSG work right now so I'm not able to test the example as is; perhaps others can test it to get a feel for how things perform on different hardware/OS/driver combinations. Do any of the standard VSG examples exhibit the same performance issue? As a general comment, I've been developing mostly on Linux when writing the VSG, using either an AMD 5700G integrated GPU or GeForce 1650 and 2080 cards. I've also got an Intel laptop and desktop and use the integrated GPU on these. Mostly I'm seeing really consistent performance across the board. The integrated GPUs show a lower cost of copying data from GPU associated memory into CPU associated memory than the dedicated GPUs. The NVidia cards list more queue options, but that's down to their drivers; this can provide extra options for lowering the cost of copying. Generally I've found the AMD side to have a lower copy cost, but it's on an integrated GPU so it's comparing apples to oranges. As I don't have a dedicated AMD card I can't say how a dedicated AMD card would perform. Vulkan and VSG support GPU timing stats, with the vsg::Profiler supporting both GPU and CPU stats collection, so perhaps this is something to try out when profiling how the application is running. The vsg::Profiler can output its results to console/file after the collection phase, so I've used it a few times to figure out the cost of different parts of the work. I would also recommend trying the same tests across different OSes and hardware/driver combinations. |
@drywolf Similar behaviour for me |
@drywolf Same hardware but on fedora 40 performs much better |
Thanks @Mikalai for testing ❤️ |
@Mikalai the last time I worked with AMD GPUs on Linux there were two different kinds of drivers, the open-source driver and the proprietary "ROCm" driver. |
@drywolf
|
@Mikalai
The official (proprietary) AMD driver would be showing something like:
|
I am in the process of setting up a more complete repro-case, and now I also recreated the issue with a windowed vsg example code. With that I am getting: on AMD RX 5700 XT
on NV RTX 2080 TI
The VSG code is basically the vsghelloworld.cpp example, but without rendering any 3D scene. |
I now created a self-contained Github repo that contains the same code for headless/offscreen VSG rendering that I already posted above. https://github.com/drywolf/vsg_amd_perf Additionally I also added another minimal VSG app that is rendering to a vsg::Window / Swapchain.
PS: |
Another thing you could look at is whether the windowing system is doing compositing, in which case the application is rendering to a buffer that is then used by the compositor as input. Fullscreen without window decoration should bypass the compositor, but it is down to the OS/drivers to implement this properly. I'll have to defer to Windows devs to give guidance on how to control the Windows desktop composition and driver settings, as I'm only an occasional Windows user with no expertise on the platform. |
I disabled all Windows 11 advanced compositing options (following this guide) and ran the app in fullscreen mode, by setting |
Another variable you could experiment with is different formats for the colour and depth buffers, perhaps the defaults chosen by the VSG are tripping up the driver into a slower path on this particular hardware/driver combination. |
Yeah that's a good idea 👍 PS: the |
I now tried a couple more VkFormats, and none of them showed any significant difference in performance. |
Here the VK_COMMAND_POOL_RESET_RELEASE_RESOURCES_BIT flag is set, so resources are freed and reallocated in every frame. With the flag set to 0, I got the same performance as NV. My guess is that Nvidia optimizes it and skips the flag so there is no resource reallocation on NV. |
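To make the two settings compared in the comment above concrete, here is a minimal sketch of the flags value that ends up as the third argument of `vkResetCommandPool`. It is an illustration only, not VSG's code: the constant is redefined locally (with the value it has in the Vulkan headers) so the snippet compiles without the Vulkan SDK, and the names `kResetReleaseResourcesBit` and `commandPoolResetFlags` are hypothetical.

```cpp
#include <cstdint>

// Same value as VK_COMMAND_POOL_RESET_RELEASE_RESOURCES_BIT in the Vulkan headers.
// Redefined here under a hypothetical name so the sketch has no SDK dependency.
constexpr std::uint32_t kResetReleaseResourcesBit = 0x00000001;

// Hypothetical helper: pick the flags argument for a command pool reset.
// true  -> the pool's resources are recycled back to the system on reset
// false -> flags are 0, so the pool may keep its memory for reuse next frame
constexpr std::uint32_t commandPoolResetFlags(bool releaseResources)
{
    return releaseResources ? kResetReleaseResourcesBit : 0u;
}

// With the real Vulkan API the value would be used like this:
//   vkResetCommandPool(device, commandPool, commandPoolResetFlags(false));
```

Whether passing 0 is safe for every VSG application is a separate question that this sketch does not answer; it only shows which value changes between the two measurements.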
Interesting finding.
Intel and AMD Mesa drivers work fine as well, so it's not an NVidia-specific optimization.
…On Thu, 18 Jul 2024, 12:15 Slaw6820, ***@***.***> wrote:
https://github.com/vsg-dev/VulkanSceneGraph/blob/8a229b30637eea6fcfd9ace3d0745415dd563d7a/include/vsg/vk/CommandPool.h#L35
|
@robertosfield Thanks |
I don't know whether we'd see adverse side-effects; one would have to get it tested on all OS, driver and hardware combinations with the full range of VSG applications to know for sure. Such a "fix" for a driver oddity raises my hackles, as a new driver update might fix the performance issue. |
Describe the bug
I am using VSG to perform some headless rendering (i.e. no Swapchain and no vsg::Window)
The code that I am using is very similar to vsgheadless.cpp from the vsgExamples.
NV RTX 2080 TI
I am getting ca. 13.000 FPS
AMD RX 5700 XT
I am getting just ca. 390 FPS
Also when looking at the Windows Task-Manager GPU performance metrics, there is an interesting difference between the two GPUs:
AMD RX 5700 XT
NV RTX 2080 TI
I already tried to do some profiling on the AMD to find out what is happening, but all of the AMD profiling tools are failing to function.
This is the minimal code to reproduce what I showed above:
https://gist.github.com/drywolf/690c775bb181c946b30ed67ebcdee3de
PS: the minimal code does not render anything, it only contains a single RenderPass that would implicitly clear the color & depth-stencil images, but that is all the code is doing. So it is quite surprising to see the low FPS / high Copy load on the AMD card, for such a trivial minimal workload.
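For scale, the two frame rates in this report can be converted to time per frame. The sketch below assumes the reported "ca. 13.000 FPS" means thirteen thousand; the helper names are made up for illustration. Under that assumption the NV card spends roughly 0.077 ms per frame and the AMD card roughly 2.56 ms, a ratio of about 33, which matches the "30x slower" in the title.

```cpp
// Average time per frame in milliseconds for a given frame rate.
constexpr double frameTimeMs(double fps)
{
    return 1000.0 / fps;
}

// How many times slower the second frame rate is than the first.
constexpr double slowdown(double fastFps, double slowFps)
{
    return fastFps / slowFps;
}

// Approximate figures from the report ("ca." in the original).
constexpr double nvFps = 13000.0;  // NV RTX 2080 TI, assuming 13.000 means 13 thousand
constexpr double amdFps = 390.0;   // AMD RX 5700 XT

constexpr double nvFrameMs = frameTimeMs(nvFps);    // about 0.077 ms
constexpr double amdFrameMs = frameTimeMs(amdFps);  // about 2.56 ms
constexpr double ratio = slowdown(nvFps, amdFps);   // about 33
```

Put differently, the AMD run carries about 2.5 ms of extra cost per frame for a render pass that only clears its attachments.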