diff --git a/ese532_handouts/hw3/walk_through.md b/ese532_handouts/hw3/walk_through.md
index 4c07904..b28ce53 100644
--- a/ese532_handouts/hw3/walk_through.md
+++ b/ese532_handouts/hw3/walk_through.md
@@ -591,3 +591,86 @@ name: top
 ---
 `top` on Ultra96 showing core usage
 ```
+
+### Performance Counter Statistics using Perf
+In hw2, we suggested two methods for measuring latency: instrumentation-based profiling and gprof. In hw3, we introduce a third method that uses `perf`. To follow the steps below, `cd` into your hw2 folder; you may also need to install `perf` on the board.
+
+ARM has a dedicated Performance Monitor Unit (PMU) that can give you the number of cycles
+your program takes to run (read more about the PMU [here](https://easyperf.net/blog/2018/06/01/PMU-counters-and-profiling-basics)).
+We can use `perf` to get the performance counter statistics of your program (read these [slides](https://static.linaro.org/connect/yvr18/presentations/yvr18-416.pdf) to learn more about perf).
+
+Run perf as follows (`make perf` in the supplied `Makefile`):
+```
+sudo perf stat ./rendering
+```
+You should see output like the following:
+```
+[stahmed@macarena hw2_profiling_tutorial]$ make perf
+g++ -DWITH_TIMER -Wall -g -O2 -Wno-unused-label -I/src/sw/ -I/src/host/ -o rendering ./src/host/3d_rendering_host.cpp ./src/host/utils.cpp ./src/host/check_result.cpp ./src/sw/rendering_sw.cpp
+Executable rendering compiled!
+Running perf stat...
+3D Rendering Application
+Total latency of projection is: 125316 ns.
+Total latency of rasterization1 is: 136611 ns.
+Total latency of rasterization2 is: 2.29206e+06 ns.
+Total latency of zculling is: 244155 ns.
+Total latency of coloringFB is: 146167 ns.
+Total time taken by the loop is: 3.48606e+06 ns.
+---------------------------------------------------------------
+Average latency of projection per loop iteration is: 39.2594 ns.
+Average latency of rasterization1 per loop iteration is: 42.7979 ns.
+Average latency of rasterization2 per loop iteration is: 718.065 ns.
+Average latency of zculling per loop iteration is: 76.4897 ns.
+Average latency of coloringFB per loop iteration is: 45.7917 ns.
+Average latency of each loop iteration is: 1092.12 ns.
+Writing output...
+Check output.txt for a bunny!
+
+ Performance counter stats for './rendering':
+
+          6.36 msec task-clock                #    0.953 CPUs utilized
+             0      context-switches          #    0.000 K/sec
+             0      cpu-migrations            #    0.000 K/sec
+           163      page-faults               #    0.026 M/sec
+    17,322,039      cycles                    #    2.722 GHz
+    39,090,159      instructions              #    2.26  insn per cycle
+     5,358,366      branches                  #  842.109 M/sec
+        56,019      branch-misses             #    1.05% of all branches
+
+   0.006679970 seconds time elapsed
+
+   0.006685000 seconds user
+   0.000000000 seconds sys
+```
+From the above output, we can see that our program took $17,322,039$ cycles at $2.722$ GHz. We can use these numbers to find the
+run time of our program: $17322039 / (2.722 \times 10^9) \approx 6.36$ milliseconds, which agrees with the
+$6.36$ msec task-clock time reported by perf. Note that perf used the task-clock (system timer)
+to report the latency in seconds, and used the PMU counter to report
+the latency in cycles. The PMU counter runs at the same frequency as
+the CPU, which is $2.722$ GHz, whereas the system timer runs at a
+much lower frequency (in the MHz range).
+
+---
+Now that we have shown you three approaches for measuring latency, a natural question is: when should you use each of them?
+- Use {ref}`profiling/instrumentation` or {ref}`profiling/gprof` when you want to find the individual
+latencies of your functions.
+- Use {ref}`profiling/perf` when you just want to know the total latency (either
+in seconds or cycles) of your program. If `perf` is not available on your system,
+you can also run `time executable-name` to get the total time
+your program takes to run.
+
+However, this answer is oversimplified. The application we showed you
+happens to be slow enough for `std::chrono` to measure accurately.
+When the resolution of your system timer is not fine-grained
+enough, or your function is too fast, you can either measure the function over a longer
+period by calling it repeatedly in a spin loop (see the spin loop section
+[here](https://www.cs.fsu.edu/~engelen/courses/HPC/Performance.pdf)),
+or turn to the PMU, which offers more accuracy. Since the PMU runs at the same
+frequency as the CPU, it can measure any function. However, you will
+have to isolate your functions and create separate programs to use
+the PMU through `perf`; there is no stopwatch-like user API for the PMU
+counter.
+
+For our application above, the total run times reported by the task-clock and the PMU counter agree,
+so in this case it doesn't matter which approach you use. If you want the latencies
+of individual functions in ***cycles*** instead, you can combine your
+measured times with the clock frequency to compute the cycle counts.
+Alternatively, you can take the fraction of time spent in each function
+and multiply it by the total cycle count from `perf stat`.