Vectorized implementations of hash join algorithms on Intel Xeon Phi (KNL)
Advanced processor architectures have recently been driving new designs, implementations, and optimizations of main-memory hash join algorithms. The newly released Intel Xeon Phi many-core processor of the Knights Landing architecture (KNL) offers interesting hardware features such as many low-frequency out-of-order cores connected on a 2D mesh, and high-bandwidth multi-channel memory (MCDRAM).
In this repository, we experimentally revisit the state-of-the-art main-memory hash join algorithms to study how the new hardware features of KNL affect algorithmic design and tuning, and to identify opportunities for further performance improvement on KNL. Specifically, we implement the state-of-the-art no-partitioning hash join (denoted as NPJ) and the partitioned hash join, with and without NUMA-aware optimizations (denoted as PHJ and CPRA, respectively). Our experiments show that, although many existing optimizations remain valid on KNL with proper tuning, even the state-of-the-art algorithms severely underutilize the memory bandwidth and other hardware resources.
Our implementations are based on the code from Polychroniou et al., which was designed for the previous generation of the Intel Xeon Phi. Their publication is: Orestis Polychroniou, Arun Raghavan, and Kenneth A. Ross. 2015. Rethinking SIMD Vectorization for In-Memory Databases. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD '15). ACM, New York, NY, USA, 1493-1508. DOI: https://doi.org/10.1145/2723372.2747645
The Intel Xeon Phi processor of the Knights Landing architecture (KNL) is illustrated in the above figure. The KNL model we use consists of 32 tiles, 16 GB of MCDRAM, and other hardware components, all connected by a 2D mesh. Each tile tightly couples two low-frequency out-of-order x86-based cores and two 512-bit Vector Processing Units (VPUs). Each core has its own private L1 caches and shares a 1 MB L2 cache with its peer core in the same tile. Through its memory channels, KNL connects to up to 384 GB of DDR4 main memory. For more details, please refer to our publication listed below.

The above figure compares the execution time of the three hash join algorithms when executed in the flat mode (where only the DDR is used), the hybrid mode, and the cache mode. Compared with the flat mode, the hybrid mode reduces the execution time by 29%, 53%, and 58% for NPJ, PHJ, and CPRA, respectively. The cache mode further reduces the execution time by 65%, 12%, and 7% for these three algorithms, respectively. Overall, PHJ performs the best. NPJ benefits greatly from the high-bandwidth MCDRAM configured as a third-level cache, while the other two algorithms gain only minor improvements. Comparing the hybrid mode and the cache mode, we find that the larger the MCDRAM cache, the better the performance. For more results, please refer to our publication listed below.

Requirements:
- Intel Xeon Phi many-core processor of the Knights Landing architecture
- x86-based Intel CPUs with AVX-512 support (not verified yet)
- Linux
- Intel C/C++ Compiler 17.0.2 (build 20170213)
- The memkind library
Compile the hash join implementations and the data generator:
make npj
make phj
make cpra
make write
Generate the input relations:
./write [#threads] [size of the outer relation] [size of the inner relation]
After generating the relations, update the paths to the input relations in the source files of the hash joins accordingly.
Run the hash joins:
./npj [#threads] [size of the outer relation] [size of the inner relation]
./phj [#threads] [size of the outer relation] [size of the inner relation]
./cpra [#threads] [size of the outer relation] [size of the inner relation]
- Xuntao Cheng, Bingsheng He, Xiaoli Du, and Chiew Tong Lau. 2017. A Study of Main-Memory Hash Joins on Many-core Processor: A Case with Intel Knights Landing Architecture. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management (CIKM '17). ACM, New York, NY, USA, 657-666. DOI: https://doi.org/10.1145/3132847.3132916
Based on our existing study, we are exploring several directions to further exploit the many-core architecture and die-stacked High Bandwidth Memory for main-memory databases. We list three major directions in the following. So far, we have explored the first two directions and submitted the resulting research manuscripts to related conferences.
- Optimizing hash join algorithms by taking advantage of the opportunities identified in existing studies
- Deploying hash tables on many-core architectures with die-stacked High Bandwidth Memory
- Optimizing query processing on such hardware platforms
For questions, please email Cheng Xuntao.