CUDA support for registration #744
Comments
It's worth trying. You probably want to create a To interoperate with PCL, you'll need to look at the byte format of PCL. It looks like a pointcloud in PCL stores data in |
@xlz, thanks a lot for your reply and suggestions. I will have a try then:) |
@fengjim Did you make any progress on this topic? I am also interested in it. |
Hi @hanshammel1337, I haven't made any real progress other than checking GPU-related materials. You may go ahead and start on it:) It would be appreciated if you could share your branch once you start working on it. |
Hi, is there any news about this? I will need this feature, so I will be working on something of the sort for the next few weeks. I don't have much experience with CUDA, so I'll be a bit slow, but if I can help just let me know. |
I'm trying to use the libfreenect2 library on a Jetson TK1, but I get terrible performance when receiving both RGB and depth data with Protonect; it's simply unusable. So when I encountered this thread I thought I could try porting some part of it to the GPU. My first step was an attempt to obtain an application profile with callgrind. However, that ended in failure. Every time I run callgrind I receive no information because the application hangs. Here is the output I get:
When I ran Protonect with
I compile the library with the following command: cmake .. -DENABLE_CXX11=ON -DCMAKE_INSTALL_PREFIX=/usr/local/lib/freenect2 && make -j2 && sudo make install I've also tried to use I'd be very grateful for any information that could help me obtain an application profile. |
Jetson TK1's CPU is slow. If you really want to, you can use the perf tool, but you have to build it from source and it doesn't give much useful information. The most useful indicator would be CPU usage per thread, and I expect the main thread has the highest because it does registration on the CPU. So there isn't much you can do except commenting out registration in Protonect.cpp. Jetson TK1 is barely capable of handling Kinect, and it takes careful optimization. If CUDA registration is done this might get better. |
I cloned this repository and finished a first implementation of the registration apply method. Depth looks good, but I haven't had the chance to check the registered color frame. I'll let you know as soon as it looks presentable. |
Make it a PR. |
Ok, I only implemented the function apply. Should I finish the other ones that work in parallel or PR now? |
You can create a PR for us to see and amend it with new commits later. |
Hi, I am also interested in this. I am currently running a CUDA kernel which does the registration, and I am looking to pass the RGB buffer (rgb->data) to this kernel using zero-copy. This memory region is apparently allocated on NVMM by the gst-jpeg library for Tegra provided by NVIDIA (in my case on a Jetson TK1). Is there any way to do this without having to copy the whole memory region to a pinned memory region? |
Okay, this is fairly complicated. In terms of TK1 the ideal way is zero-copy, i.e. not even cudaMemcpy(). There is some unified virtual addressing supported by TK1, but I haven't figured out how to make this paradigm portable to platforms without such support without making a mess of the code. One-copy is also possible. The memory In short, try cudaHostRegister() first. You want to call cudaHostRegister() just once and see what happens: https://github.com/OpenKinect/libfreenect2/blob/master/src/tegra_jpeg_rgb_packet_processor.cpp#L147
|
I already tried to cudaHostRegister() the memory region but it looks like it is not supported on ARM platforms, according to this thread: |
We don't have any control on how it is allocated. The part is not open source. Have you tried AastaLLL's example? Just start with |
I am already doing cudaSetDeviceFlags(cudaDeviceMapHost) and also or'ed cudaHostAllocMapped on the CudaAllocator's cudaHostAlloc() flags, which (unexpectedly) allowed me to use the allocated data regions (i.e. depth->c_map) without needing to cudaHostGetDevicePointer(). I will do some more testing, just to make sure but I recall not being able to do this with rgb->data, which means this might be page-locked memory but not mapped on the device. To clarify: I only tried cudaHostGetDevicePointer() after cudaHostRegister() on rgb->data, which returned an error. |
It's exactly the unified virtual addressing on TK1. But I can't make this portable yet.
I guess the secret sauce is how to map it to the device. But if it's already page-locked, then the reason cudaHostRegister() is not supported ("the caching attribute of an existing allocation can't be changed on the fly") is no longer relevant, because only the mapping part needs to be done. |
PR #822 |
Overview Description:
I'm using libfreenect2 to collect RGB and depth data from Kinect2 devices on Linux (Ubuntu 14.04) and generate a PCL point cloud from it.
The steps are generally: 1) libfreenect2::SyncMultiFrameListener::waitForNewFrame() to get the RGB and depth frames, 2) libfreenect2::Registration::apply() to align them, 3) loop through the 512x424 matrix and call libfreenect2::Registration::getPointXYZRGB() to fill all the matrix elements.
According to performance testing, step 3 took most of the time in the whole pipeline. I was thinking of using parallel programming (either CUDA on the GPU or multiple threads on the CPU) in step 3 to improve efficiency. However, since libfreenect2 already provides CUDA/OpenGL pipeline options, it might be helpful if libfreenect2 provided one more function alongside getPointXYZRGB() that uses CUDA etc. to generate all the points, i.e. adding libfreenect2::Registration::getPointXYZRGB(const Frame* undistorted, const Frame* registered, float** depth, uint8_t** color), where 'depth' points to a 3x512x424 array of float representing the point (X, Y, Z) matrix and 'color' points to a 3x512x424 array of uint8_t representing the color of each point.
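To make the proposed bulk call concrete, here is a minimal CPU sketch of what such a function might compute. It is not libfreenect2 code. It assumes a plain pinhole back-projection, a hypothetical Intrinsics struct, a hypothetical fillPointCloud() name, depth given in millimetres, and a registered color buffer of 4 bytes per pixel in B, G, R, X order. Each pixel is computed independently, so the loop body could be moved into a CUDA kernel or split across CPU threads.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical holder for depth camera intrinsics (not a libfreenect2 type).
struct Intrinsics { float fx, fy, cx, cy; };

constexpr int kWidth = 512;
constexpr int kHeight = 424;
constexpr std::size_t kPixels = static_cast<std::size_t>(kWidth) * kHeight;

// Fills planar outputs for the whole frame in one call.
//   depth_mm: undistorted depth, row-major, millimetres (assumed unit)
//   bgrx:     registered color, 4 bytes per pixel, assumed B,G,R,X order
//   xyz:      3 planes (X, Y, Z) of kPixels floats each, in metres
//   rgb:      3 planes (R, G, B) of kPixels bytes each
void fillPointCloud(const float* depth_mm, const std::uint8_t* bgrx,
                    const Intrinsics& k, float* xyz, std::uint8_t* rgb)
{
    assert(depth_mm && bgrx && xyz && rgb);
    const float nan = std::numeric_limits<float>::quiet_NaN();
    for (int r = 0; r < kHeight; ++r) {
        for (int c = 0; c < kWidth; ++c) {
            const std::size_t i = static_cast<std::size_t>(r) * kWidth + c;
            const float z = depth_mm[i] / 1000.0f;  // millimetres to metres
            const bool valid = std::isfinite(z) && z > 0.0f;
            // Pinhole back-projection through the pixel centre; invalid
            // depth yields NaN coordinates.
            xyz[i]               = valid ? (c + 0.5f - k.cx) * z / k.fx : nan;
            xyz[kPixels + i]     = valid ? (r + 0.5f - k.cy) * z / k.fy : nan;
            xyz[2 * kPixels + i] = valid ? z : nan;
            // Interleaved B,G,R,X to planar R,G,B.
            rgb[i]               = bgrx[4 * i + 2];
            rgb[kPixels + i]     = bgrx[4 * i + 1];
            rgb[2 * kPixels + i] = bgrx[4 * i + 0];
        }
    }
}
```

A caller would allocate the two output arrays once (3 * 512 * 424 elements each) and reuse them for every frame, so no per-point function call is made.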
Would you please kindly share your comments/thoughts about this?
Thanks in advance!