33 changes: 1 addition & 32 deletions src/aztec/gpu/notes/problems.md
@@ -220,35 +220,4 @@
Need to figure out the proper way to time as well. There's a huge disparity between using the std::chrono timer vs. CUDA events.
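
One probable reason for the gap (worth confirming): kernel launches are asynchronous, so a host-side chrono timer only measures the launch call unless the device is synchronized first, while CUDA events are timestamped in the GPU's stream. A minimal sketch of both approaches, where `msm_kernel` and the launch configuration are placeholders, not the real kernels from this repo:

```cpp
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void msm_kernel() { /* placeholder */ }

int main() {
    // std::chrono measures host wall-clock time. Launches are asynchronous,
    // so without the synchronize this only times the launch call itself,
    // which would explain a huge disparity vs. events.
    auto t0 = std::chrono::high_resolution_clock::now();
    msm_kernel<<<256, 256>>>();
    cudaDeviceSynchronize();
    auto t1 = std::chrono::high_resolution_clock::now();
    double host_ms = std::chrono::duration<double, std::milli>(t1 - t0).count();

    // CUDA events bracket the kernel on the device itself, regardless of
    // host-side behavior.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    msm_kernel<<<256, 256>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float device_ms = 0.f;
    cudaEventElapsedTime(&device_ms, start, stop);

    printf("chrono: %.3f ms, events: %.3f ms\n", host_ms, device_ms);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```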

```Profiling discussion```
For 2^15 constraints, profiling the kernel execution with `nsys profile --stats=true ./bin/arithmetic_cuda` and `sudo /usr/local/cuda/bin/ncu ./bin/arithmetic_cuda` highlights a low achieved occupancy on the device. In terms of execution time, bucket_running_sum_kernel accounts for 38% of the runtime, final_accumulation_kernel for 31%, and accumulate_buckets_kernel for 28.1%; the radix sorting algorithm is negligible in comparison. Bumping up to 2^20 constraints, the accumulate_buckets_kernel call accounts for 91% of the runtime, so as the constraint size grows, optimizing the bucket accumulation is the natural next step. Both the achieved occupancy and the SOL (compute %) metrics are low, indicating the kernel launch parameters and the kernels' actual computational work aren't stressing the available device capabilities. Register pressure does not seem to be the cause.
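
Since the low occupancy points at the launch configuration, one quick sanity check (first TODO below) is to ask the occupancy API what block size it would pick. A sketch, where the empty accumulate_buckets_kernel is just a stand-in for the real kernel:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void accumulate_buckets_kernel() { /* placeholder body */ }

int main() {
    int min_grid_size = 0, block_size = 0;
    // Block size that maximizes theoretical occupancy for this kernel.
    cudaOccupancyMaxPotentialBlockSize(&min_grid_size, &block_size,
                                       accumulate_buckets_kernel, 0, 0);

    int blocks_per_sm = 0;
    // How many blocks of that size can be resident on one SM.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocks_per_sm, accumulate_buckets_kernel, block_size, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    float occupancy = float(blocks_per_sm * block_size) /
                      float(prop.maxThreadsPerMultiProcessor);
    printf("block size %d, %d blocks/SM, theoretical occupancy %.2f\n",
           block_size, blocks_per_sm, occupancy);
    return 0;
}
```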

TODOs:
/**
* TODO: choose block sizes based on occupancy in terms of active blocks
* TODO: look into shared memory optimizations instead of global memory accesses (roughly 100x lower latency than global memory)
* TODO: remove extraneous loops
* TODO: adjust kernel parameters to reduce overhead
* TODO: look into loop unrolling with pragma
* TODO: more efficient sum reduction kernel
* TODO: change size of windows (choose b and c parameters dynamically based on number of points and scalars)
* TODO: Address depaul notes on SOL, achieved occupancy, etc.
* TODO: look into reducing registers and pipelining loads (e.g. __launch_bounds__)
* TODO: change the indexing for the other sum reduction kernel
* TODO: change indexing of threads from tid to threadrank; need to look into whether it's actually better
* TODO: clean up comments in kernels
* TODO: switch jacobian to projective coordinates to eliminate infinity and zero checks
* TODO: are the conditional checks degrading performance?
* TODO: Look into 'Staged concurrent copy and execute' over 'Sequential copy and execute'
* TODO: add threads for the for-loops in main and pippenger.cu
* TODO: run a sample roofline model to check performance as a function of Compute (SM) Throughput + occupancy
* TODO: update tasks.md file

* TODO: debug why pippenger is segfaulting all of a sudden for larger constraints
* TODO: run valgrind to check for leaking memory
* TODO: Add back wrappers to CUDA memory allocations and frees (see the sketch after this list)
* TODO: change allocations back to cudaMallocHost
* TODO: Figure out the correct malloc parameter size
* TODO: free the rest of the points correctly
* TODO: fix non-virtual destructor error that breaks compilation
* TODO: figure out correct profiling methodology (host and device profilers, events, etc.)
*/
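
A sketch of what the allocation wrappers plus the cudaMallocHost change could look like; the CUDA_CHECK macro name and the buffer size are illustrative, not what the codebase actually uses:

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Surface failures at the call site instead of as a segfault later on.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",             \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

int main() {
    size_t bytes = size_t(1) << 20;  // placeholder size

    // cudaMallocHost gives pinned (page-locked) host memory: faster copies,
    // and required for genuinely asynchronous cudaMemcpyAsync.
    void* host_buf = nullptr;
    CUDA_CHECK(cudaMallocHost(&host_buf, bytes));

    void* dev_buf = nullptr;
    CUDA_CHECK(cudaMalloc(&dev_buf, bytes));
    CUDA_CHECK(cudaMemcpy(dev_buf, host_buf, bytes, cudaMemcpyHostToDevice));

    // Free with the matching APIs; mixing free()/cudaFreeHost is a classic
    // source of the kind of corruption valgrind would flag.
    CUDA_CHECK(cudaFree(dev_buf));
    CUDA_CHECK(cudaFreeHost(host_buf));
    return 0;
}
```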