CUDA support for registration #744

fengjim · 2016-10-24T13:14:26Z

Overview Description:

I'm using libfreenect2 to collect RGB and Depth Data from Kinect2 devices on Linux (Ubuntu 14.04) and generate PCL point cloud based on that.

The steps are generally: 1) libfreenect2::SyncMultiFrameListener::waitForNewFrame() to get the RGB and Depth frames, 2) libfreenect2::Registration::apply() to align them, 3) loop through 512x424 matrix and call libfreenect2::Registration::getPointXYZRGB() to fill all the matrix elements.

According to the performance testing result, step 3# was the one taking most of time in the whole pipeline. I was thinking to use parallel programming (either CUDA with GPU or multiple thread on CPU) in Step 3# to improve the efficiency. However, considering libfreenect2 has already providing CUDA/OpenGL options of pipelines etc. , it might looks helpful if libfreenect2 could provide one more function along aside with getPointXYZRGB() using CUAD etc. to generate all the points, i.e. adding libfreenect2::Registration::getPointXYZRGB(const Frame* undistorted, const Frame* registered, float** depth, uint_8** color), where 'depth' point to a 3x512x424 array of float representing point (X, Y, Z) matrix and 'color' point to a 3x512x424 array of uinit_8 representing color for related point.

Would you please kindly share your comments/thoughts about this?

Thanks in advance!

xlz · 2016-10-24T19:36:25Z

It's worth trying.

You probably want to create a class CudaRegistration in cuda_registration.cpp and registration.h.

To interoperate with PCL, you'll need to look at the byte format of PCL. It looks like a pointcloud in PCL stores data in vector<pcl::PointXYZRGB> or vector<pcl::PointXYZ>. pcl::PointXYZ is 4 bytes. pcl::PointXYZRGB is 8 bytes.

fengjim · 2016-10-26T08:59:49Z

@xlz, thanks a lot for your reply and suggestions.

I will have a try then:)

hanshammel1337 · 2016-12-01T08:51:12Z

@fengjim Did you make any progress on this topic? I am also interested in it.

fengjim · 2016-12-01T10:06:50Z

Hi @hanshammel1337 , umm, I haven't done any real progress other than checking GPU related materials.

You may go ahead to start it over:) would be appreciated that you can share your branch later after you start working on it.

blackzafiro · 2017-02-23T18:44:29Z

Hi, are there any news about this? I will need this feature, so I will be working on something of the sort for the next weeks. I don't have much experience with CUDA, so I'll be a bit slow, but if I can help just let me know.

aosewski · 2017-03-17T14:10:48Z

I'm trying to use libfreenect library on Jetson Tk1, however I've got terrible performance when receiving both rgb and depth data with protonect, it's simply unusable. So when I encountered this thread I thought I could give it a try to port some part on GPU. My first step to do this was an attempt to obtain an application profile with callgrind. However that ended with failure. Every time I run callgrind I receive no information because application hangs down. Here is an output I get:

[ubuntu@tegra-ubuntu bin]$ valgrind --tool=callgrind ./bin/Protonect -noviewer cuda
==5048== Callgrind, a call-graph generating cache profiler
==5048== Copyright (C) 2002-2013, and GNU GPL'd, by Josef Weidendorfer et al.
==5048== Using Valgrind-3.10.1 and LibVEX; rerun with -h for copyright info
==5048== Command: ./bin/Protonect -noviewer cuda
==5048== 
==5048== For interactive control, run 'callgrind_control -h'.
^C==5048== 
==5048== Events    : Ir
==5048== Collected : 581231764
==5048== 
==5048== I   refs:      581,231,764

[ubuntu@tegra-ubuntu build]$ valgrind --tool=callgrind ./bin/Protonect -noviewer     
==5077== Callgrind, a call-graph generating cache profiler
==5077== Copyright (C) 2002-2013, and GNU GPL'd, by Josef Weidendorfer et al.
==5077== Using Valgrind-3.10.1 and LibVEX; rerun with -h for copyright info
==5077== Command: ./bin/Protonect -noviewer
==5077== 
==5077== For interactive control, run 'callgrind_control -h'.
Version: 0.2.0
Environment variables: LOGFILE=<protonect.log>
Usage: ./bin/Protonect [-gpu=<id>] [gl | cl | clkde | cuda | cudakde | cpu] [<device serial>]
        [-noviewer] [-norgb | -nodepth] [-help] [-version]
        [-frames <number of frames to process>]
To pause and unpause: pkill -USR1 Protonect
^C==5077== 
==5077== Events    : Ir
==5077== Collected : 367982872
==5077== 
==5077== I   refs:      367,982,872
Killed
[ubuntu@tegra-ubuntu build]$ valgrind --tool=callgrind ./bin/Protonect -noviewer cuda -nodepth
==5079== Callgrind, a call-graph generating cache profiler
==5079== Copyright (C) 2002-2013, and GNU GPL'd, by Josef Weidendorfer et al.
==5079== Using Valgrind-3.10.1 and LibVEX; rerun with -h for copyright info
==5079== Command: ./bin/Protonect -noviewer cuda -nodepth
==5079== 
==5079== For interactive control, run 'callgrind_control -h'.
Version: 0.2.0
Environment variables: LOGFILE=<protonect.log>
Usage: ./bin/Protonect [-gpu=<id>] [gl | cl | clkde | cuda | cudakde | cpu] [<device serial>]
        [-noviewer] [-norgb | -nodepth] [-help] [-version]
        [-frames <number of frames to process>]
To pause and unpause: pkill -USR1 Protonect
^C==5079== 
==5079== Events    : Ir
==5079== Collected : 355734782
==5079== 
==5079== I   refs:      355,734,782
Killed
[ubuntu@tegra-ubuntu build]$ valgrind --tool=callgrind ./bin/Protonect -noviewer cuda -norgb  
==5082== Callgrind, a call-graph generating cache profiler
==5082== Copyright (C) 2002-2013, and GNU GPL'd, by Josef Weidendorfer et al.
==5082== Using Valgrind-3.10.1 and LibVEX; rerun with -h for copyright info
==5082== Command: ./bin/Protonect -noviewer cuda -norgb
==5082== 
==5082== For interactive control, run 'callgrind_control -h'.
Version: 0.2.0
Environment variables: LOGFILE=<protonect.log>
Usage: ./bin/Protonect [-gpu=<id>] [gl | cl | clkde | cuda | cudakde | cpu] [<device serial>]
        [-noviewer] [-norgb | -nodepth] [-help] [-version]
        [-frames <number of frames to process>]
To pause and unpause: pkill -USR1 Protonect
^C==5082== 
==5082== Events    : Ir
==5082== Collected : 244213827
==5082== 
==5082== I   refs:      244,213,827

When I ran Protonect with ./Protonect -noviewer cuda -norgb or ./Protonect -noviewer cuda -nodepth then program works as expected, yielding quite good performance about 80-90 fps. Whereas when running ./Protonect -noviewer cuda I get information only about skipped packets:

...
[Debug] [DepthPacketStreamParser] skipping depth packet
[Debug] [RgbPacketStreamParser] skipping rgb packet!
...

I compile library with following command:

cmake .. -DENABLE_CXX11=ON -DCMAKE_INSTALL_PREFIX=/usr/local/lib/freenect2 && make -j2 && sudo make install

I've also tried to use gprof by passing -DCMAKE_CXX_FLAGS=-pg, also with no luck.

I'd be very grateful for any information that could help me to obtain application profile.

xlz · 2017-03-17T14:19:55Z

Jetson TK1's CPU is slow. Protonect -noviewer cuda still does registration which is slow on CPU.

If you really want you can use perf tool but you have to build it from source and there isn't much useful information. The most useful indicator would be CPU usage per thread and I expect the main thread has the highest because it does on CPU registration.

So there isn't much you can do except commenting out registration in Protonect.cpp. Jetson TK1 is barely capable enough of handling Kinect and it takes careful optimization. If Cuda registration is done this might get better.

blackzafiro · 2017-03-30T00:07:58Z

I cloned this repository and finished a first implementation of registration apply method, depth looks good but I haven't had the chance to check the color registered frame. I'll let you know as soon as it looks presentable.

blackzafiro · 2017-03-30T17:09:00Z

It works!

xlz · 2017-03-30T17:24:47Z

Make it a PR.

blackzafiro · 2017-03-30T18:04:05Z

Ok, I only implemented the function apply. Should I finish the other ones that work in parallel or PR now?

xlz · 2017-04-01T02:16:27Z

You can create a PR for us to see and amend it with new commits later.

mlaz · 2017-04-01T03:04:16Z

Hi, I am also interested in this, I am currently running a cuda kernel which does the registration and I am looking to pass the rgb buffer (rgb->data) to this kernel using zerocopy. This memory region apparently is allocated by the gst-jpeg library for tegra provided by nvidia (in my case using Jetson TK1) on NVMM. Is there any way to do this without having to copy the whole memory region to a pinned memory region?

xlz · 2017-04-01T04:23:29Z

Is there any way to do this

Okay, this is fairly complicated. In terms of TK1 the ideal way is zero-copy, i.e. not even cudaMemcpy(). There is something some unified virtual addressing supported by TK1 but I haven't figured out how to make this paradigm portable on platforms without such support without making a mess of code.

One-copy is also possible. The memory jpegTegraMgr->buff[0] allocated by Nvidia internal libraries is probably page-locked but I'm not sure. cudaHostRegister() can turn a host memory pointer into "pinned memory", which is what you're asking for, but I'm not sure if it works or what will happen if you page lock the same address twice.

In short, try cudaHostRegister() first. You want to cudaHostRegister() just once and see what happens ~~~https://github.com/OpenKinect/libfreenect2/blob/master/src/tegra_jpeg_rgb_packet_processor.cpp#L147~~~

Sorry, you need another way to enforce cudaHostRegister() being called on the same address just once.

mlaz · 2017-04-01T04:45:09Z

I already tried to cudaHostRegister() the memory region but it looks like it is not supported on ARM platforms, according to this thread:
https://devtalk.nvidia.com/default/topic/998962/cuda-zero-copy-on-tx1/

xlz · 2017-04-01T04:56:30Z

We don't have any control on how it is allocated. The part is not open source.

Have you tried AastaLLL's example? Just start with cudaSetDeviceFlags(cudaDeviceMapHost); and cudaHostGetDevicePointer() on jpegTegraMgr->buff[0] without the malloc part.

mlaz · 2017-04-01T05:09:30Z

I am already doing cudaSetDeviceFlags(cudaDeviceMapHost) and also or'ed cudaHostAllocMapped on the CudaAllocator's cudaHostAlloc() flags, which (unexpectedly) allowed me to use the allocated data regions (i.e. depth->c_map) without needing to cudaHostGetDevicePointer(). I will do some more testing, just to make sure but I recall not being able to do this with rgb->data, which means this might be page-locked memory but not mapped on the device.

To clarify: I only tried cudaHostGetDevicePointer() after cudaHostRegister() on rgb->data, which returned an error.

xlz · 2017-04-01T05:34:33Z

unexpectedly

It's exactly the unified virtual addressing on TK1. But I can't make this portable yet.

page-locked memory but not mapped on the device

I guess the secret sauce is how to map it to the device.

But if it's already page locked then the "the caching attribute of an existing allocation can't be changed on the fly" issue of cudaHostRegister() not being supported is no longer relevant because it only needs to do the mapping part.

blackzafiro · 2017-04-04T22:16:50Z

PR #822

xlz changed the title ~~Is it worth adding CUDA support for generating point cloud from RGB&Depth data~~ CUDA support registration Dec 5, 2016

xlz changed the title ~~CUDA support registration~~ CUDA support for registration Dec 5, 2016

xlz mentioned this issue Dec 5, 2016

Dropping packets on Jetson TX1 #737

Closed

xlz added the enhancement label Dec 5, 2016

xlz mentioned this issue Apr 4, 2017

Cuda registration #822

Open

xlz mentioned this issue May 19, 2017

Slow memcpy when using TegraJPEG. #852

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CUDA support for registration #744

CUDA support for registration #744

fengjim commented Oct 24, 2016

xlz commented Oct 24, 2016

fengjim commented Oct 26, 2016

hanshammel1337 commented Dec 1, 2016

fengjim commented Dec 1, 2016

blackzafiro commented Feb 23, 2017

aosewski commented Mar 17, 2017

xlz commented Mar 17, 2017

blackzafiro commented Mar 30, 2017

blackzafiro commented Mar 30, 2017

xlz commented Mar 30, 2017

blackzafiro commented Mar 30, 2017

xlz commented Apr 1, 2017

mlaz commented Apr 1, 2017

xlz commented Apr 1, 2017 •

edited

Loading

mlaz commented Apr 1, 2017

xlz commented Apr 1, 2017

mlaz commented Apr 1, 2017 •

edited

Loading

xlz commented Apr 1, 2017

blackzafiro commented Apr 4, 2017

CUDA support for registration #744

CUDA support for registration #744

Comments

fengjim commented Oct 24, 2016

xlz commented Oct 24, 2016

fengjim commented Oct 26, 2016

hanshammel1337 commented Dec 1, 2016

fengjim commented Dec 1, 2016

blackzafiro commented Feb 23, 2017

aosewski commented Mar 17, 2017

xlz commented Mar 17, 2017

blackzafiro commented Mar 30, 2017

blackzafiro commented Mar 30, 2017

xlz commented Mar 30, 2017

blackzafiro commented Mar 30, 2017

xlz commented Apr 1, 2017

mlaz commented Apr 1, 2017

xlz commented Apr 1, 2017 • edited Loading

mlaz commented Apr 1, 2017

xlz commented Apr 1, 2017

mlaz commented Apr 1, 2017 • edited Loading

xlz commented Apr 1, 2017

blackzafiro commented Apr 4, 2017

xlz commented Apr 1, 2017 •

edited

Loading

mlaz commented Apr 1, 2017 •

edited

Loading