Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CUDA support for registration #744

Open
fengjim opened this issue Oct 24, 2016 · 19 comments
Open

CUDA support for registration #744

fengjim opened this issue Oct 24, 2016 · 19 comments

Comments

@fengjim
Copy link

fengjim commented Oct 24, 2016

Overview Description:

I'm using libfreenect2 to collect RGB and Depth Data from Kinect2 devices on Linux (Ubuntu 14.04) and generate PCL point cloud based on that.

The steps are generally: 1) libfreenect2::SyncMultiFrameListener::waitForNewFrame() to get the RGB and Depth frames, 2) libfreenect2::Registration::apply() to align them, 3) loop through 512x424 matrix and call libfreenect2::Registration::getPointXYZRGB() to fill all the matrix elements.

According to the performance testing result, step 3# was the one taking most of time in the whole pipeline. I was thinking to use parallel programming (either CUDA with GPU or multiple thread on CPU) in Step 3# to improve the efficiency. However, considering libfreenect2 has already providing CUDA/OpenGL options of pipelines etc. , it might looks helpful if libfreenect2 could provide one more function along aside with getPointXYZRGB() using CUAD etc. to generate all the points, i.e. adding libfreenect2::Registration::getPointXYZRGB(const Frame* undistorted, const Frame* registered, float** depth, uint_8** color), where 'depth' point to a 3x512x424 array of float representing point (X, Y, Z) matrix and 'color' point to a 3x512x424 array of uinit_8 representing color for related point.

Would you please kindly share your comments/thoughts about this?

Thanks in advance!

@xlz
Copy link
Member

xlz commented Oct 24, 2016

It's worth trying.

You probably want to create a class CudaRegistration in cuda_registration.cpp and registration.h.

To interoperate with PCL, you'll need to look at the byte format of PCL. It looks like a pointcloud in PCL stores data in vector<pcl::PointXYZRGB> or vector<pcl::PointXYZ>. pcl::PointXYZ is 4 bytes. pcl::PointXYZRGB is 8 bytes.

@fengjim
Copy link
Author

fengjim commented Oct 26, 2016

@xlz, thanks a lot for your reply and suggestions.

I will have a try then:)

@hanshammel1337
Copy link

@fengjim Did you make any progress on this topic? I am also interested in it.

@fengjim
Copy link
Author

fengjim commented Dec 1, 2016

Hi @hanshammel1337 , umm, I haven't done any real progress other than checking GPU related materials.

You may go ahead to start it over:) would be appreciated that you can share your branch later after you start working on it.

@xlz xlz changed the title Is it worth adding CUDA support for generating point cloud from RGB&Depth data CUDA support registration Dec 5, 2016
@xlz xlz changed the title CUDA support registration CUDA support for registration Dec 5, 2016
@xlz xlz added the enhancement label Dec 5, 2016
@blackzafiro
Copy link

Hi, are there any news about this? I will need this feature, so I will be working on something of the sort for the next weeks. I don't have much experience with CUDA, so I'll be a bit slow, but if I can help just let me know.

@aosewski
Copy link

I'm trying to use libfreenect library on Jetson Tk1, however I've got terrible performance when receiving both rgb and depth data with protonect, it's simply unusable. So when I encountered this thread I thought I could give it a try to port some part on GPU. My first step to do this was an attempt to obtain an application profile with callgrind. However that ended with failure. Every time I run callgrind I receive no information because application hangs down. Here is an output I get:

[ubuntu@tegra-ubuntu bin]$ valgrind --tool=callgrind ./bin/Protonect -noviewer cuda
==5048== Callgrind, a call-graph generating cache profiler
==5048== Copyright (C) 2002-2013, and GNU GPL'd, by Josef Weidendorfer et al.
==5048== Using Valgrind-3.10.1 and LibVEX; rerun with -h for copyright info
==5048== Command: ./bin/Protonect -noviewer cuda
==5048== 
==5048== For interactive control, run 'callgrind_control -h'.
^C==5048== 
==5048== Events    : Ir
==5048== Collected : 581231764
==5048== 
==5048== I   refs:      581,231,764

[ubuntu@tegra-ubuntu build]$ valgrind --tool=callgrind ./bin/Protonect -noviewer     
==5077== Callgrind, a call-graph generating cache profiler
==5077== Copyright (C) 2002-2013, and GNU GPL'd, by Josef Weidendorfer et al.
==5077== Using Valgrind-3.10.1 and LibVEX; rerun with -h for copyright info
==5077== Command: ./bin/Protonect -noviewer
==5077== 
==5077== For interactive control, run 'callgrind_control -h'.
Version: 0.2.0
Environment variables: LOGFILE=<protonect.log>
Usage: ./bin/Protonect [-gpu=<id>] [gl | cl | clkde | cuda | cudakde | cpu] [<device serial>]
        [-noviewer] [-norgb | -nodepth] [-help] [-version]
        [-frames <number of frames to process>]
To pause and unpause: pkill -USR1 Protonect
^C==5077== 
==5077== Events    : Ir
==5077== Collected : 367982872
==5077== 
==5077== I   refs:      367,982,872
Killed
[ubuntu@tegra-ubuntu build]$ valgrind --tool=callgrind ./bin/Protonect -noviewer cuda -nodepth
==5079== Callgrind, a call-graph generating cache profiler
==5079== Copyright (C) 2002-2013, and GNU GPL'd, by Josef Weidendorfer et al.
==5079== Using Valgrind-3.10.1 and LibVEX; rerun with -h for copyright info
==5079== Command: ./bin/Protonect -noviewer cuda -nodepth
==5079== 
==5079== For interactive control, run 'callgrind_control -h'.
Version: 0.2.0
Environment variables: LOGFILE=<protonect.log>
Usage: ./bin/Protonect [-gpu=<id>] [gl | cl | clkde | cuda | cudakde | cpu] [<device serial>]
        [-noviewer] [-norgb | -nodepth] [-help] [-version]
        [-frames <number of frames to process>]
To pause and unpause: pkill -USR1 Protonect
^C==5079== 
==5079== Events    : Ir
==5079== Collected : 355734782
==5079== 
==5079== I   refs:      355,734,782
Killed
[ubuntu@tegra-ubuntu build]$ valgrind --tool=callgrind ./bin/Protonect -noviewer cuda -norgb  
==5082== Callgrind, a call-graph generating cache profiler
==5082== Copyright (C) 2002-2013, and GNU GPL'd, by Josef Weidendorfer et al.
==5082== Using Valgrind-3.10.1 and LibVEX; rerun with -h for copyright info
==5082== Command: ./bin/Protonect -noviewer cuda -norgb
==5082== 
==5082== For interactive control, run 'callgrind_control -h'.
Version: 0.2.0
Environment variables: LOGFILE=<protonect.log>
Usage: ./bin/Protonect [-gpu=<id>] [gl | cl | clkde | cuda | cudakde | cpu] [<device serial>]
        [-noviewer] [-norgb | -nodepth] [-help] [-version]
        [-frames <number of frames to process>]
To pause and unpause: pkill -USR1 Protonect
^C==5082== 
==5082== Events    : Ir
==5082== Collected : 244213827
==5082== 
==5082== I   refs:      244,213,827

When I ran Protonect with ./Protonect -noviewer cuda -norgb or ./Protonect -noviewer cuda -nodepth then program works as expected, yielding quite good performance about 80-90 fps. Whereas when running ./Protonect -noviewer cuda I get information only about skipped packets:

...
[Debug] [DepthPacketStreamParser] skipping depth packet
[Debug] [RgbPacketStreamParser] skipping rgb packet!
...

I compile library with following command:

cmake .. -DENABLE_CXX11=ON -DCMAKE_INSTALL_PREFIX=/usr/local/lib/freenect2 && make -j2 && sudo make install

I've also tried to use gprof by passing -DCMAKE_CXX_FLAGS=-pg, also with no luck.

I'd be very grateful for any information that could help me to obtain application profile.

@xlz
Copy link
Member

xlz commented Mar 17, 2017

Jetson TK1's CPU is slow. Protonect -noviewer cuda still does registration which is slow on CPU.

If you really want you can use perf tool but you have to build it from source and there isn't much useful information. The most useful indicator would be CPU usage per thread and I expect the main thread has the highest because it does on CPU registration.

So there isn't much you can do except commenting out registration in Protonect.cpp. Jetson TK1 is barely capable enough of handling Kinect and it takes careful optimization. If Cuda registration is done this might get better.

@blackzafiro
Copy link

I cloned this repository and finished a first implementation of registration apply method, depth looks good but I haven't had the chance to check the color registered frame. I'll let you know as soon as it looks presentable.

@blackzafiro
Copy link

It works!

cudaregistration_1

@xlz
Copy link
Member

xlz commented Mar 30, 2017

Make it a PR.

@blackzafiro
Copy link

Ok, I only implemented the function apply. Should I finish the other ones that work in parallel or PR now?

@xlz
Copy link
Member

xlz commented Apr 1, 2017

You can create a PR for us to see and amend it with new commits later.

@mlaz
Copy link

mlaz commented Apr 1, 2017

Hi, I am also interested in this, I am currently running a cuda kernel which does the registration and I am looking to pass the rgb buffer (rgb->data) to this kernel using zerocopy. This memory region apparently is allocated by the gst-jpeg library for tegra provided by nvidia (in my case using Jetson TK1) on NVMM. Is there any way to do this without having to copy the whole memory region to a pinned memory region?

@xlz
Copy link
Member

xlz commented Apr 1, 2017

Is there any way to do this

Okay, this is fairly complicated. In terms of TK1 the ideal way is zero-copy, i.e. not even cudaMemcpy(). There is something some unified virtual addressing supported by TK1 but I haven't figured out how to make this paradigm portable on platforms without such support without making a mess of code.

One-copy is also possible. The memory jpegTegraMgr->buff[0] allocated by Nvidia internal libraries is probably page-locked but I'm not sure. cudaHostRegister() can turn a host memory pointer into "pinned memory", which is what you're asking for, but I'm not sure if it works or what will happen if you page lock the same address twice.

In short, try cudaHostRegister() first. You want to cudaHostRegister() just once and see what happens ~~~https://github.com/OpenKinect/libfreenect2/blob/master/src/tegra_jpeg_rgb_packet_processor.cpp#L147~~~

Sorry, you need another way to enforce cudaHostRegister() being called on the same address just once.

@mlaz
Copy link

mlaz commented Apr 1, 2017

I already tried to cudaHostRegister() the memory region but it looks like it is not supported on ARM platforms, according to this thread:
https://devtalk.nvidia.com/default/topic/998962/cuda-zero-copy-on-tx1/

@xlz
Copy link
Member

xlz commented Apr 1, 2017

We don't have any control on how it is allocated. The part is not open source.

Have you tried AastaLLL's example? Just start with cudaSetDeviceFlags(cudaDeviceMapHost); and cudaHostGetDevicePointer() on jpegTegraMgr->buff[0] without the malloc part.

@mlaz
Copy link

mlaz commented Apr 1, 2017

I am already doing cudaSetDeviceFlags(cudaDeviceMapHost) and also or'ed cudaHostAllocMapped on the CudaAllocator's cudaHostAlloc() flags, which (unexpectedly) allowed me to use the allocated data regions (i.e. depth->c_map) without needing to cudaHostGetDevicePointer(). I will do some more testing, just to make sure but I recall not being able to do this with rgb->data, which means this might be page-locked memory but not mapped on the device.

To clarify: I only tried cudaHostGetDevicePointer() after cudaHostRegister() on rgb->data, which returned an error.

@xlz
Copy link
Member

xlz commented Apr 1, 2017

unexpectedly

It's exactly the unified virtual addressing on TK1. But I can't make this portable yet.

page-locked memory but not mapped on the device

I guess the secret sauce is how to map it to the device.

But if it's already page locked then the "the caching attribute of an existing allocation can't be changed on the fly" issue of cudaHostRegister() not being supported is no longer relevant because it only needs to do the mapping part.

@blackzafiro
Copy link

PR #822

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants