diff --git a/ese532_handouts/hw3/walk_through.md b/ese532_handouts/hw3/walk_through.md
index 4c07904..b28ce53 100644
--- a/ese532_handouts/hw3/walk_through.md
+++ b/ese532_handouts/hw3/walk_through.md
@@ -591,3 +591,86 @@ name: top
 ---
 `top` on Ultra96 showing core usage
 ```
+
+### Performance Counter Statistics using Perf
+In hw2, we suggested two methods for measuring latency: instrumentation-based profiling and gprof. In hw3, we introduce a third method that uses `perf`. To follow the steps below, `cd` into your hw2 folder; you may also need to install `perf` on the board.
+
+ARM has a dedicated Performance Monitor Unit (PMU) that can give you the number of cycles
+your program takes to run (read more about the PMU [here](https://easyperf.net/blog/2018/06/01/PMU-counters-and-profiling-basics)).
+We can use `perf` to get the performance counter statistics of your program (read these [slides](https://static.linaro.org/connect/yvr18/presentations/yvr18-416.pdf) to learn more about perf).
+
+Run perf as follows (`make perf` in the supplied `Makefile`):
+```
+sudo perf stat ./rendering
+```
+You should see output like the following:
+```
+[stahmed@macarena hw2_profiling_tutorial]$ make perf
+g++ -DWITH_TIMER -Wall -g -O2 -Wno-unused-label -I/src/sw/ -I/src/host/ -o rendering ./src/host/3d_rendering_host.cpp ./src/host/utils.cpp ./src/host/check_result.cpp ./src/sw/rendering_sw.cpp
+Executable rendering compiled!
+Running perf stat...
+3D Rendering Application
+Total latency of projection is: 125316 ns.
+Total latency of rasterization1 is: 136611 ns.
+Total latency of rasterization2 is: 2.29206e+06 ns.
+Total latency of zculling is: 244155 ns.
+Total latency of coloringFB is: 146167 ns.
+Total time taken by the loop is: 3.48606e+06 ns.
+---------------------------------------------------------------
+Average latency of projection per loop iteration is: 39.2594 ns.
+Average latency of rasterization1 per loop iteration is: 42.7979 ns.
+Average latency of rasterization2 per loop iteration is: 718.065 ns.
+Average latency of zculling per loop iteration is: 76.4897 ns.
+Average latency of coloringFB per loop iteration is: 45.7917 ns.
+Average latency of each loop iteration is: 1092.12 ns.
+Writing output...
+Check output.txt for a bunny!
+
+ Performance counter stats for './rendering':
+
+          6.36 msec task-clock                #    0.953 CPUs utilized
+             0      context-switches          #    0.000 K/sec
+             0      cpu-migrations            #    0.000 K/sec
+           163      page-faults               #    0.026 M/sec
+    17,322,039      cycles                    #    2.722 GHz
+    39,090,159      instructions              #    2.26  insn per cycle
+     5,358,366      branches                  #  842.109 M/sec
+        56,019      branch-misses             #    1.05% of all branches
+
+   0.006679970 seconds time elapsed
+
+   0.006685000 seconds user
+   0.000000000 seconds sys
+```
+From the above output, we can see that our program took $17,322,039$ cycles at $2.722$ GHz. We can use these numbers to find the
+run time of our program: $17322039 / (2.722 \times 10^9) \approx 6.36$ milliseconds, which agrees with the
+$6.36$ msec task-clock time reported by perf. Note that perf used the task-clock (system timer)
+to report the latency in seconds, and used the PMU counter to report
+the latency in cycles. The PMU counter runs at the same frequency as
+the CPU, which is $2.722$ GHz, whereas the system timer runs at a
+much lower frequency (in the MHz range).
+
+---
+Now that we have shown you three approaches for measuring latency, a natural question is: when should you use each of them?
+- Use {ref}`profiling/instrumentation` or {ref}`profiling/gprof` when you want to find the individual
+latencies of your functions.
+- Use {ref}`profiling/perf` when you just want to know the total latency (either
+in seconds or cycles) of your program. If `perf` is not available on your system,
+you can also run `time executable-name` to get the total time
+your program takes to run.
+
+However, this answer is oversimplified. The application we showed you
+happens to be slow enough for `std::chrono` to measure accurately.
+When the resolution of your system timer is not fine-grained
+enough, or your function is too fast, you can either measure the function over a longer
+period by calling it repeatedly in a spin loop (see the spin loop section
+[here](https://www.cs.fsu.edu/~engelen/courses/HPC/Performance.pdf)),
+or turn to the PMU, which offers more accuracy. Since the PMU runs at the same
+frequency as the CPU, it can measure any function. However, you will
+have to isolate your functions and create separate programs to use
+the PMU through `perf`; there is no stopwatch-like user API for the PMU
+counter.
+
+For our application above, the total run times reported by the task-clock and the PMU counter agree,
+so in this case it doesn't matter which approach you use. If you want the latencies
+of individual functions in ***cycles*** instead, you can combine your
+measured times with the clock frequency to compute the cycle counts.
+Alternatively, you can take the fraction of time spent in each function
+and multiply it by the total cycle count from `perf stat`.