primus is PCIe bottlenecked #176

Open
karolherbst opened this issue Sep 20, 2015 · 10 comments

Comments

@karolherbst
Contributor

After doing some tests with nouveau and DRI_PRIME, I noticed that compared to primus I got around 4x the fps in PCIe-limited scenarios (like glxspheres).

Is there a possibility to reduce the pcie load under primus?

@tpruzina

Not really. The whole point here is that you have to download the contents of the framebuffer into system memory (this costs bandwidth) and then upload it to the display/integrated GPU.
This means copying 4*X*Y bytes every frame (about 9 MB at 1920x1200), and this is going to cap your max fps in simple OpenGL applications.
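
As a rough back-of-the-envelope check of the per-frame cost described above, here is a small Python sketch. The function names and the bandwidth figure are assumptions for illustration (a theoretical 4 GB/s peak, which corresponds to a PCIe 1.x x16 link; real throughput is lower), not anything taken from primus.

```python
def frame_bytes(width, height, bytes_per_pixel=4):
    """Size of one uncompressed frame (4 bytes per pixel assumed, e.g. BGRA8)."""
    return width * height * bytes_per_pixel


def max_fps(width, height, bandwidth_bytes_per_s):
    """Upper bound on fps if the framebuffer readback alone saturates the link."""
    return bandwidth_bytes_per_s / frame_bytes(width, height)


if __name__ == "__main__":
    size = frame_bytes(1920, 1200)
    print(f"{size} bytes per frame ({size / 1e6:.1f} MB)")
    # Assumed: 4 GB/s theoretical peak (PCIe 1.x x16); measured rates are lower.
    print(f"fps cap at 4 GB/s: {max_fps(1920, 1200, 4e9):.0f}")
```

This only bounds the readback itself; it ignores the upload to the integrated GPU and any synchronization overhead, so real caps should be lower.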

It's theoretically possible to use one of the OpenGL texture compression formats to reduce the bandwidth overhead, but then the compression work would bottleneck the GPU itself.

DRI_PRIME has the advantage of GEM and DMA_BUF sharing, so it can skip the extra copy.

Or at least this is my understanding of it; I haven't seen that code in months.

@karolherbst
Contributor Author

Yeah, compression should make a difference. But while I was doing some PCIe speed work for nouveau, I noticed speedups of around 5% even in 20 fps full HD scenarios, just by going from 2.5 to 8.0 GT/s PCIe link speed. That speedup grows with the number of pixels transferred. And then there are games like The Talos Principle, which just got a 25% perf boost here.
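
To see why a faster link can matter even at 20 fps, here is a hedged Python sketch of a very simple model in which the framebuffer copy is fully serialized with rendering. The x16 lane count, the theoretical encoding efficiencies, and all function names are assumptions for illustration; this is not how nouveau or primus measure anything.

```python
def link_bandwidth(gt_per_s, lanes, encoding_efficiency):
    """Theoretical payload bandwidth in bytes/s for a PCIe link."""
    return gt_per_s * 1e9 * encoding_efficiency / 8 * lanes


def speedup_from_faster_link(fps_before, frame_size, bw_slow, bw_fast):
    """Relative fps gain, assuming the copy is fully serialized with rendering."""
    frame_time = 1.0 / fps_before
    copy_slow = frame_size / bw_slow
    copy_fast = frame_size / bw_fast
    new_frame_time = frame_time - copy_slow + copy_fast
    return frame_time / new_frame_time - 1.0


if __name__ == "__main__":
    full_hd = 1920 * 1080 * 4  # bytes per frame, 4 bytes per pixel assumed
    # Assumed x16 links: 2.5 GT/s uses 8b/10b encoding, 8.0 GT/s uses 128b/130b.
    slow = link_bandwidth(2.5, 16, 8 / 10)
    fast = link_bandwidth(8.0, 16, 128 / 130)
    gain = speedup_from_faster_link(20, full_hd, slow, fast)
    print(f"estimated gain at 20 fps: {gain * 100:.1f}%")
```

With theoretical peak bandwidth the model gives a gain of a few percent at 20 fps. Real transfer rates are lower than the theoretical peak, which would make the copy a larger share of the frame time.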

@tpruzina

Anyway, I did some micro-optimizations on my branch, but the core code is fairly simple and straightforward, and I doubt there is much we can do to speed things up without sacrificing performance elsewhere or introducing heavy input lag.

Newer NVIDIA cards (Fermi through Kepler) have asynchronous copy engines that can make the whole buffer copy asynchronous for both GPU and CPU, but I haven't tested whether that works properly with primus (it might).

Kepler cards actually have two copy engines, which allows two copies to run concurrently (so you can load textures in your favourite game and copy the primus framebuffer at the same time).
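
The potential benefit of an asynchronous copy can be illustrated with a toy throughput model in Python. The function names and the timings are hypothetical, and whether primus can actually overlap the copy this way is untested, as noted above.

```python
def fps_serialized(render_s, copy_s):
    """Copy blocks rendering: each frame costs render time plus copy time."""
    return 1.0 / (render_s + copy_s)


def fps_overlapped(render_s, copy_s):
    """Copy of frame N overlaps rendering of frame N+1: the slower stage sets the pace."""
    return 1.0 / max(render_s, copy_s)


if __name__ == "__main__":
    render_s, copy_s = 0.002, 0.003  # hypothetical: 2 ms render, 3 ms copy
    print(f"serialized: {fps_serialized(render_s, copy_s):.0f} fps")
    print(f"overlapped: {fps_overlapped(render_s, copy_s):.0f} fps")
```

In this model overlapping raises throughput only. Each individual frame still takes roughly render time plus copy time to reach the screen, so input lag does not improve.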

There is some interesting reading about it here: http://www.nvidia.com/docs/IO/40049/Dual_copy_engines.pdf

@karolherbst
Contributor Author

yeah, it would be nice to make the copy less demanding on the application.

@tpruzina

@karolherbst yep, but I doubt that this is the real problem. It's sad that we are limited to 300 fps in glxgears, but that's hardly an issue in games, where this really isn't the bottleneck.
As a CS:GO player I tried to remove quite a bit of cruft from primus (things like XEvent queries on every frame looking for changes in window size, etc.) to minimize latency (somewhat successfully, even though it's hard to measure without proper equipment).

The code itself caused tons of cache misses, but in the grand scheme of things this is negligible.

Quite frankly, the optimal solution would be to ignore dma-buf's GPL-only (GPLv2) licensing and NVIDIA's license and code proper buffer sharing into the open-source part of NVIDIA's driver (the source of their wrapper that deals with the kernel is available, no idea about licensing though).

@karolherbst
Contributor Author

No, I mean there should be performance improvements even if the PCIe bus isn't at full load, or when the game is running at around 20 fps. I know that +5% isn't that much, but maybe there is a cheap way to reduce the overhead a bit.

@sjnewbury

Wouldn't it be possible to use the Intel userptr API to bind the PBO directly to the drawable? That way only synchronisation and resizing would need to be handled during an active context, as the drawable would have the PBO content zero-copy. I've been looking at how to do this, but maybe I'm missing something that would make this unworkable?

@amonakov
Owner

amonakov commented Dec 1, 2016

A good way to optimize iGPU upload would be to extend the PBO texture upload path (which uses userptr, if I recall correctly) to PBO DrawPixels, as outlined in comments 5-7 in this bug: https://bugs.freedesktop.org/show_bug.cgi?id=77412. But normally the bottleneck is on the dGPU download side, so this wouldn't help frame rates (but should help with power consumption).

@tpruzina

tpruzina commented Dec 11, 2016

This could be mitigated with the NVIDIA Capture SDK (NVFBC) on newer NVIDIA cards, but unfortunately I have no idea how licensing would work.

@karolherbst
Contributor Author

@tpruzina isn't this for recording the screen when it is driven by the NVIDIA GPU? I don't see how this can help here.


4 participants