Currently, axi_stream_driver.py for the Alveo boards uses PYNQ calls inside .predict(), which executes the kernel. If profile is set to True, it also benchmarks the execution. For example, profiling prints the following when I predict on 100 samples of shape (216, 680, 3), where the output shape is (648, 2040, 3):
Classified 100 samples in 29.929913 seconds (3.341139013668366 inferences / s)
Or 299299.12999999995 us / inferences
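For context, the call being profiled looks roughly like this. The class name, constructor arguments, and predict() signature below are assumptions modelled on the generated PYNQ drivers and may not match the Alveo axi_stream_driver.py exactly:

```python
# Hypothetical usage sketch; driver class name, constructor arguments, and
# predict() signature are assumptions, not the exact generated API.
import numpy as np
from axi_stream_driver import NeuralNetworkOverlay

X = np.random.rand(100, 216, 680, 3).astype(np.float32)   # 100 input samples
nn = NeuralNetworkOverlay('myproject.xclbin', X.shape, (100, 648, 2040, 3))

# With profile=True the driver times the run and prints the numbers quoted above.
y, latency, throughput = nn.predict(X, profile=True)
```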
The driver records the time before the input buffer is synced to the device and records the completion time after the output buffer is synced back to the host PC; the reported throughput is computed from this elapsed time. However, the input and output buffer transfers over PCIe (plus the other Python calls made after starting the timer) can take a noticeable amount of time, which affects the final reported throughput. Maybe it would be better to record the time directly before and after calling the kernel, to isolate kernel execution time and therefore throughput? I implemented the change here. After this change, the reported throughput was 3.602655 inferences / s, or 277573.07 us / inference.

Maybe I'm missing something, but it seems this transfer latency, together with several other function calls made between timea and timeb, inflates the results, so reporting the transfer time, the throughput, and an average per-sample latency as separate metrics might be better. The change was even more drastic with a different model I tested, where the input was 65536 samples of shape (120, 120) and the reported throughput increased from 7190.451059 inferences / s to 10234.377489 inferences / s.
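In code, the change amounts to moving the timers inside the buffer syncs so that only the kernel call is timed, with the transfer time reported separately. This is a minimal sketch, not the actual driver code; kernel, input_buffer, output_buffer, and the kernel argument list are placeholders standing in for the driver's real handles:

```python
from datetime import datetime

def profiled_predict(kernel, input_buffer, output_buffer, X):
    """Time the kernel call separately from the PCIe buffer syncs.

    Sketch only: kernel is assumed to be a PYNQ kernel handle and the two
    buffers pynq.allocate() buffers; the call arguments are placeholders.
    """
    input_buffer[:] = X

    t0 = datetime.now()
    input_buffer.sync_to_device()       # host -> device transfer over PCIe
    timea = datetime.now()              # start timer: kernel execution only
    kernel.call(input_buffer, output_buffer, X.shape[0])
    timeb = datetime.now()              # stop timer: kernel execution only
    output_buffer.sync_from_device()    # device -> host transfer over PCIe
    t1 = datetime.now()

    kernel_time = (timeb - timea).total_seconds()
    total_time = (t1 - t0).total_seconds()
    transfer_time = total_time - kernel_time      # PCIe transfer + sync overhead
    throughput = X.shape[0] / kernel_time         # inferences / s, transfers excluded
    return output_buffer, kernel_time, total_time, transfer_time, throughput
```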
I also added a function to calculate the average start-to-finish latency per sample (only if profiling is enabled) by predicting on each sample individually and averaging. The motivation is that running the kernel on all samples at once and dividing the total time by the number of samples reflects the interval between samples rather than the start-to-finish latency of a single inference. There may be a more efficient way to implement this part, though.
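Roughly, the per-sample measurement does something like the following. Again a sketch rather than the actual code: it reuses the placeholder names from above and assumes single-sample buffers can be reused for each call:

```python
import numpy as np
from datetime import datetime

def average_latency_us(kernel, input_buffer, output_buffer, X):
    """Average start-to-finish latency per sample, measured one sample at a time.

    Sketch only: assumes single-sample buffers and the same placeholder kernel
    argument list as the batched case above.
    """
    latencies = []
    for sample in X:
        input_buffer[:] = sample
        start = datetime.now()
        input_buffer.sync_to_device()
        kernel.call(input_buffer, output_buffer, 1)
        output_buffer.sync_from_device()
        latencies.append((datetime.now() - start).total_seconds())
    return np.mean(latencies) * 1e6   # microseconds per inference
```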
The output with these changes looks like this:
Classified 100 samples in 27.757307 seconds (27.968777 seconds including buffer transfers)
Throughput: 3.602655 inferences / s (or 277573.07 us / inferences)
PCIe Buffer Transfer Time: 0.2114699999999985 s
Average per-sample latency: 286237.330000 us / inference
So the user sees the total inference time both including and excluding transfer time, the throughput, and the latency averaged across all samples. There may be a better way to format this, though.