Comparison of vector element sum using various data types.
For all experiments, each approach is attempted on a number of vector sizes, running each approach 5 times per size to obtain a reliable time measurement. The experiments are done with guidance from Prof. Dip Sankar Banerjee and Prof. Kishore Kothapalli.
In this experiment (float-vs-bfloat16, main), we compare the performance of computing the sum of a vector of numbers, with the numbers stored either as float or as bfloat16. While it seemed to me that the bfloat16 approach would be a clear winner because of its reduced memory bandwidth requirement, for some reason it is only slightly faster. This is possibly because memory loads are always 32-bit anyway. The only reason bfloat16 is slightly faster may be that its smaller size allows more of the data to be retained in cache. Note that neither approach makes use of the SIMD instructions available on all modern hardware.
- A Study of BFLOAT16 for Deep Learning Training
- Convert FP32 to Bfloat16 in C++
- Why is there no 2-byte float and does an implementation already exist?
- Is it safe to reinterpret_cast an integer to float?