Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Kernels read keys and values out of bounds #4

Open
natevm opened this issue Oct 17, 2023 · 4 comments
Open

Kernels read keys and values out of bounds #4

natevm opened this issue Oct 17, 2023 · 4 comments

Comments

@natevm
Copy link

natevm commented Oct 17, 2023

Hello,

I've recently discovered that if the key/value buffers used in the sort are not a multiple of PARALLELSORT_THREADGROUP_SIZE * 4, then the buffers are read out of bounds and undefined behavior can occur.

See these lines below:

uint srcKeys[FFX_PARALLELSORT_ELEMENTS_PER_THREAD];
srcKeys[0] = SrcBuffer[DataIndex];
srcKeys[1] = SrcBuffer[DataIndex + FFX_PARALLELSORT_THREADGROUP_SIZE];
srcKeys[2] = SrcBuffer[DataIndex + (FFX_PARALLELSORT_THREADGROUP_SIZE * 2)];
srcKeys[3] = SrcBuffer[DataIndex + (FFX_PARALLELSORT_THREADGROUP_SIZE * 3)];

No bounds checks are done to SrcBuffer here, causing GPU instability when these are read out of bounds.

Also an issue here:

// Pre-load the key values in order to hide some of the read latency
uint srcKeys[FFX_PARALLELSORT_ELEMENTS_PER_THREAD];
srcKeys[0] = SrcBuffer[DataIndex];
srcKeys[1] = SrcBuffer[DataIndex + FFX_PARALLELSORT_THREADGROUP_SIZE];
srcKeys[2] = SrcBuffer[DataIndex + (FFX_PARALLELSORT_THREADGROUP_SIZE * 2)];
srcKeys[3] = SrcBuffer[DataIndex + (FFX_PARALLELSORT_THREADGROUP_SIZE * 3)];
#ifdef kRS_ValueCopy
uint srcValues[FFX_PARALLELSORT_ELEMENTS_PER_THREAD];
srcValues[0] = SrcPayload[DataIndex];
srcValues[1] = SrcPayload[DataIndex + FFX_PARALLELSORT_THREADGROUP_SIZE];
srcValues[2] = SrcPayload[DataIndex + (FFX_PARALLELSORT_THREADGROUP_SIZE * 2)];
srcValues[3] = SrcPayload[DataIndex + (FFX_PARALLELSORT_THREADGROUP_SIZE * 3)];

Later on, the number of keys is checked, but by that point it's too late:

uint localKey = (DataIndex < CBuffer.NumKeys ? srcKeys[i] : 0xffffffff);
#ifdef kRS_ValueCopy
uint localValue = (DataIndex < CBuffer.NumKeys ? srcValues[i] : 0);

I suspect the fix would be to just check the number of keys before pre-loading the key/value pairs.

Reproducing is simple enough, just run the sort on data that is less than PARALLELSORT_THREADGROUP_SIZE * 4 with GPU-assisted validation that checks out of bounds descriptor reads.

@jlacroixAMD
Copy link

Thank you for reporting this. I'll file a ticket internally and we'll get it fixed in the next release.

@jlacroixAMD
Copy link

As I just realized this was reported on the old Parallel Sort sample, a fix to address this will be pushed with the next version of the FidelityFX SDK (which is how we are pushing out most updates to our older features now - https://github.com/GPUOpen-LibrariesAndSDKs/FidelityFX-SDK).

Also, in order to keep the GPU code as fast as possible, the fix will likely be done as a check on the NumKeys value at CPU time with an error code returned in the data setup stage.

@natevm
Copy link
Author

natevm commented Oct 19, 2023

Fwiw this sort implementation has been very helpful.

Small feature request, it would be great to also have a separate dedicated prefix sum/scan and a parallel device selection / compaction. Both of these are used internally by the radix sorter, but would also be helpful standalone.

@jlacroixAMD
Copy link

I'll add it to the list of planned improvements to existing samples. Cheers!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants