primus is PCIe bottlenecked #176

karolherbst · 2015-09-20T16:33:06Z

After doing some tests with nouveau and DRI_PRIME, I noticed that, compared to primus, I got around 4x the fps in PCIe-limited scenarios (like glxspheres).

Is there any way to reduce the PCIe load under primus?

tpruzina · 2015-09-22T05:58:35Z

Not really. The whole point here is that you have to download the contents of the framebuffer into system memory (this costs bandwidth) and then upload it to the display/integrated GPU.
This means copying 4 * X * Y bytes every frame (about 9 MB at 1920x1200), and this is going to cap your max fps in simple OpenGL applications.
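To put numbers on the per-frame copy, here is a minimal back-of-the-envelope sketch. It assumes 4 bytes per pixel, as in the comment above. The function names and the 4 GB/s bandwidth figure are illustrative assumptions (roughly the theoretical rate of a 2.5 GT/s x16 link), not measurements from primus.

```python
def frame_bytes(width, height, bytes_per_pixel=4):
    """Size of one uncompressed frame in bytes."""
    return width * height * bytes_per_pixel


def max_fps(bandwidth_bytes_per_s, width, height, bytes_per_pixel=4):
    """Upper bound on fps if the bus did nothing but move frames one way."""
    return bandwidth_bytes_per_s / frame_bytes(width, height, bytes_per_pixel)


if __name__ == "__main__":
    size = frame_bytes(1920, 1200)
    print(f"{size} bytes per frame ({size / 2**20:.2f} MiB)")
    # Assumed bandwidth: 4e9 bytes/s; real-world throughput is lower.
    print(f"fps ceiling at 4 GB/s: {max_fps(4e9, 1920, 1200):.0f}")
```

This only bounds the download side; the upload to the integrated GPU and any synchronization stalls would lower the real ceiling further.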

It's theoretically possible to use one of OpenGL's compression formats to reduce the bandwidth overhead, but that would bottleneck the GPU itself.

DRI_PRIME has the advantage of GEM and DMA_BUF sharing, so it can skip the extra copy.

Or at least this is my understanding of it, haven't seen that code in months.

karolherbst · 2015-09-22T10:31:37Z

Yeah, compression should make a difference, but while I was doing some PCIe speed work for nouveau, I noticed speed-ups of even 5% in 20 fps full HD scenarios, just by going from 2.5 to 8.0 GT/s PCIe link speed. And this speed-up increases the more pixels are transferred. And then there are games like The Talos Principle, which just got a 25% perf boost here.
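A rough model of why link speed can matter even at 20 fps: if the frame copy is synchronous, its transfer time is added to every frame. The sketch below assumes a x16 link, the standard line encodings (8b/10b at 2.5 and 5.0 GT/s, 128b/130b at 8.0 GT/s), a 1920x1080 frame at 4 bytes per pixel, and no protocol overhead. The helper names are hypothetical, and the lane count in particular may differ on a given laptop.

```python
# Fraction of raw bits that carry payload for each link speed's line encoding.
ENCODING_EFFICIENCY = {2.5: 8 / 10, 5.0: 8 / 10, 8.0: 128 / 130}


def link_bandwidth(gt_per_s, lanes=16):
    """Theoretical one-way bandwidth in bytes/s, ignoring packet overhead."""
    bits_per_s = gt_per_s * 1e9 * ENCODING_EFFICIENCY[gt_per_s] * lanes
    return bits_per_s / 8


def copy_time_ms(gt_per_s, width=1920, height=1080, bytes_per_pixel=4, lanes=16):
    """Time to move one frame across the link, in milliseconds."""
    frame = width * height * bytes_per_pixel
    return frame / link_bandwidth(gt_per_s, lanes) * 1e3


if __name__ == "__main__":
    frame_time_ms = 1000 / 20  # frame budget at 20 fps
    for speed in (2.5, 8.0):
        t = copy_time_ms(speed)
        print(f"{speed} GT/s: {t:.2f} ms per frame, "
              f"{100 * t / frame_time_ms:.1f}% of a 20 fps frame")
```

Under these assumptions the copy shrinks from roughly 2 ms to roughly 0.5 ms per frame, a few percent of a 50 ms frame, which is in the same ballpark as the speed-up reported above.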

tpruzina · 2015-09-22T12:45:58Z

Anyway, I did some micro-optimizations on my branch, but the core code is fairly simple and straightforward, and I doubt there is much we can do to speed things up without sacrificing performance or introducing heavy input lag.

Newer NVIDIA cards (Fermi, Kepler) have asynchronous copy engines that can make the whole buffer copy asynchronous for both the GPU and the CPU, but I haven't tested whether that works properly with primus (it might).

Kepler cards actually have two copy engines, which allows concurrent copies (so you can load textures in your favourite game and copy the primus framebuffer at the same time).

There is some interesting reading about it here: http://www.nvidia.com/docs/IO/40049/Dual_copy_engines.pdf
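As a toy model of what an asynchronous copy could buy (not a description of how primus or the NVIDIA driver actually schedules work): with a synchronous readback each frame costs render time plus copy time, while with a copy that fully overlaps rendering the steady-state frame time is the larger of the two. The timings and function names below are made up for illustration.

```python
def fps_synchronous(render_ms, copy_ms):
    """Render, then wait for the copy before starting the next frame."""
    return 1000 / (render_ms + copy_ms)


def fps_overlapped(render_ms, copy_ms):
    """Copy frame N while rendering frame N+1 (idealized, perfect overlap)."""
    return 1000 / max(render_ms, copy_ms)


if __name__ == "__main__":
    # (render time, copy time) pairs in milliseconds, chosen arbitrarily.
    for render_ms, copy_ms in [(1.0, 2.0), (16.0, 2.0), (50.0, 2.0)]:
        sync = fps_synchronous(render_ms, copy_ms)
        over = fps_overlapped(render_ms, copy_ms)
        print(f"render {render_ms} ms, copy {copy_ms} ms: "
              f"{sync:.0f} fps sync, {over:.0f} fps overlapped")
```

In this model the relative gain is largest when render and copy times are comparable, and it shrinks as rendering dominates the frame.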

karolherbst · 2015-09-22T12:51:46Z

yeah, it would be nice to make the copy less demanding on the application.

tpruzina · 2015-09-22T12:58:32Z

@karolherbst yep, but I doubt that this is the real problem. It's sad that we are limited to 300 fps in glxgears, but it's hardly an issue in games, where this really isn't the bottleneck.
As a CS:GO player I tried to remove quite a bit of cruft from primus (things like XEvent queries on every frame looking for changes in window size, etc.) to minimize latency (somewhat successfully, even though it's hard to measure without proper equipment).

The code itself caused tons of cache misses, but in the grand scheme of things this is negligible.

Quite frankly, the optimal solution would be to ignore dma-buf's GPL-only licensing and NVIDIA's license and code proper buffer sharing into the open-source part of NVIDIA's driver (the wrapper that deals with the kernel has its source available; no idea about the licensing though).

karolherbst · 2015-09-22T13:07:20Z

No, I mean there should be performance improvements even if the PCIe bus isn't at full load, or when the game is running at around 20 fps. I know that +5% isn't that much, but maybe there is a cheap way to reduce the overhead a bit.

sjnewbury · 2016-12-01T13:10:24Z

Wouldn't it be possible to use the Intel userptr API to bind the PBO directly to the drawable? That way only synchronisation and resizing needs to be handled during an active context as the drawable would have the PBO content zero-copy. I've been looking at how to do this, but maybe I'm missing something that would make this unworkable?

amonakov · 2016-12-01T19:18:11Z

A good way to optimize iGPU upload would be to extend the PBO texture upload path (which uses userptr, if I recall correctly) to PBO DrawPixels, as outlined in comments 5-7 in this bug: https://bugs.freedesktop.org/show_bug.cgi?id=77412. But normally the bottleneck is on the dGPU download side, so this wouldn't help frame rates (but should help with power consumption).

tpruzina · 2016-12-11T22:04:01Z

This could be mitigated with the NVIDIA Capture SDK (NvFBC) on newer NVIDIA cards, but unfortunately I have no idea how licensing would work.

karolherbst · 2016-12-12T23:12:36Z

@tpruzina isn't this for recording the screen when it is driven by the NVIDIA GPU? I don't see how this can help here.

karolherbst mentioned this issue Dec 19, 2015

Slow performance on 2880x1800 display Bumblebee-Project/Bumblebee#714

Closed

ArchangeGabriel mentioned this issue May 6, 2016

Low perfomance on nvidia video card but good on intel Bumblebee-Project/Bumblebee#741

Closed
