Currently, axi_stream_driver.py for the Alveo boards uses PYNQ calls inside .predict(), which executes the kernel. If profile is set to True, it also benchmarks the execution. For example, profiling prints the following when I predict on 100 samples of shape (216, 680, 3), where the output shape is (648, 2040, 3):
Classified 100 samples in 29.929913 seconds (3.341139013668366 inferences / s)
Or 299299.12999999995 us / inferences
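For context, the call being profiled looks roughly like this. The class name, constructor arguments, and predict() signature below are assumptions modelled on the generated PYNQ drivers and may not match the Alveo axi_stream_driver.py exactly:

```python
# Hypothetical usage sketch; driver class name, constructor arguments, and
# predict() signature are assumptions, not the exact generated API.
import numpy as np
from axi_stream_driver import NeuralNetworkOverlay

X = np.random.rand(100, 216, 680, 3).astype(np.float32)   # 100 input samples
nn = NeuralNetworkOverlay('myproject.xclbin', X.shape, (100, 648, 2040, 3))

# With profile=True the driver times the run and prints the numbers quoted above.
y, latency, throughput = nn.predict(X, profile=True)
```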
The driver records the time before the input buffer is synced to the device and records the completion time after the output buffer is synced back to the host PC; the reported throughput is computed from this elapsed time. However, the input and output buffer transfers over PCIe (plus the other Python calls made after starting the timer) can take a noticeable amount of time, which affects the final reported throughput. Maybe it would be better to record the time directly before and after calling the kernel, to isolate kernel execution time and therefore throughput? I implemented the change here. After this change, the reported throughput was 3.602655 inferences / s, or 277573.07 us / inference.

Maybe I'm missing something, but it seems this transfer latency, together with several other function calls made between timea and timeb, inflates the results, so reporting the transfer time, the throughput, and an average per-sample latency as separate metrics might be better. The change was even more drastic with a different model I tested, where the input was 65536 samples of shape (120, 120) and the reported throughput increased from 7190.451059 inferences / s to 10234.377489 inferences / s.
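In code, the change amounts to moving the timers inside the buffer syncs so that only the kernel call is timed, with the transfer time reported separately. This is a minimal sketch, not the actual driver code; kernel, input_buffer, output_buffer, and the kernel argument list are placeholders standing in for the driver's real handles:

```python
from datetime import datetime

def profiled_predict(kernel, input_buffer, output_buffer, X):
    """Time the kernel call separately from the PCIe buffer syncs.

    Sketch only: kernel is assumed to be a PYNQ kernel handle and the two
    buffers pynq.allocate() buffers; the call arguments are placeholders.
    """
    input_buffer[:] = X

    t0 = datetime.now()
    input_buffer.sync_to_device()       # host -> device transfer over PCIe
    timea = datetime.now()              # start timer: kernel execution only
    kernel.call(input_buffer, output_buffer, X.shape[0])
    timeb = datetime.now()              # stop timer: kernel execution only
    output_buffer.sync_from_device()    # device -> host transfer over PCIe
    t1 = datetime.now()

    kernel_time = (timeb - timea).total_seconds()
    total_time = (t1 - t0).total_seconds()
    transfer_time = total_time - kernel_time      # PCIe transfer + sync overhead
    throughput = X.shape[0] / kernel_time         # inferences / s, transfers excluded
    return output_buffer, kernel_time, total_time, transfer_time, throughput
```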
I also added a function to calculate the average start-to-finish latency per sample (only if profiling is enabled) by predicting on each sample individually and averaging. The motivation is that running the kernel on all samples at once and dividing the total time by the number of samples reflects the interval between samples rather than the start-to-finish latency of a single inference. There may be a more efficient way to implement this part, though.
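Roughly, the per-sample measurement does something like the following. Again a sketch rather than the actual code: it reuses the placeholder names from above and assumes single-sample buffers can be reused for each call:

```python
import numpy as np
from datetime import datetime

def average_latency_us(kernel, input_buffer, output_buffer, X):
    """Average start-to-finish latency per sample, measured one sample at a time.

    Sketch only: assumes single-sample buffers and the same placeholder kernel
    argument list as the batched case above.
    """
    latencies = []
    for sample in X:
        input_buffer[:] = sample
        start = datetime.now()
        input_buffer.sync_to_device()
        kernel.call(input_buffer, output_buffer, 1)
        output_buffer.sync_from_device()
        latencies.append((datetime.now() - start).total_seconds())
    return np.mean(latencies) * 1e6   # microseconds per inference
```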
The output with these changes looks like this:
Classified 100 samples in 27.757307 seconds (27.968777 seconds including buffer transfers)
Throughput: 3.602655 inferences / s (or 277573.07 us / inferences)
PCIe Buffer Transfer Time: 0.2114699999999985 s
Average per-sample latency: 286237.330000 us / inference
So the user sees the total inference time both including and excluding transfer time, the throughput, and the latency averaged across all samples. There may be a better way to format this, though.