Hardware Tile Group Shared Memory #370

Open

bornaehsani wants to merge 39 commits into master from rebased_hardware_shared_mem

Conversation

@bornaehsani (Contributor) commented Jul 29, 2020

  • A new shared EVA for the manycore hardware
  • A hardware hash function for translating a shared EVA to an NPA (a software model of the decomposition is sketched after this list)
  • Two new CSRs for the tile group dimensions, used in the shared-EVA-to-NPA translation
  • A software library for declaring and accessing hardware tile group shared memory, with a built-in reduction method
  • Updated profiling tools that recognize the new remote tile group shared accesses
  • Four new CUDA-Lite regression tests: a hardware tile group shared memory load/store test, a reduction test, and two versions of blocked matrix multiplication using hardware tile group shared memory

CI is in progress; the CUDA-Lite regression passes with 4x4_fast_n_fake, 4x4_blocking_vcache_f1_model, and timing_v0_8_4.
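For context on the shared-EVA-to-NPA translation above, here is a minimal software model of the decomposition (the field layout follows the hash_function_shared.v snippet quoted later in this thread; the struct and function names are illustrative, not part of the PR):

#include <cstdint>
#include <cstdio>

// Software model of the shared-EVA field layout: the low `hash` bits select the
// word within a stripe, the next tg_dim_x_width and tg_dim_y_width bits select the
// tile's X/Y offset inside the tile group, and the remaining bits form the local
// DMEM word offset. Field widths here are assumptions for illustration.
struct shared_npa_s {
  uint32_t x;          // tile X offset within the tile group
  uint32_t y;          // tile Y offset within the tile group
  uint32_t local_addr; // word offset within the tile's shared DMEM region
};

static shared_npa_s decode_shared_eva(uint32_t shared_eva,
                                      unsigned hash,            // log2(stripe size in words)
                                      unsigned tg_dim_x_width,  // log2(tile group X dim)
                                      unsigned tg_dim_y_width)  // log2(tile group Y dim)
{
  uint32_t stripe_off = shared_eva & ((1u << hash) - 1);
  uint32_t x          = (shared_eva >> hash) & ((1u << tg_dim_x_width) - 1);
  uint32_t y          = (shared_eva >> (hash + tg_dim_x_width)) & ((1u << tg_dim_y_width) - 1);
  uint32_t local_hi   = shared_eva >> (hash + tg_dim_x_width + tg_dim_y_width);
  return { x, y, (local_hi << hash) | stripe_off };
}

int main() {
  // Example: 4x4 tile group (2-bit X/Y fields), stripe size of one word (hash = 0),
  // so consecutive shared words land on consecutive tiles of the group.
  for (uint32_t eva = 0; eva < 8; ++eva) {
    shared_npa_s npa = decode_shared_eva(eva, 0, 2, 2);
    std::printf("shared eva %u -> tile (%u,%u), local word %u\n", eva, npa.x, npa.y, npa.local_addr);
  }
  return 0;
}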

@bornaehsani added the enhancement (New feature or request) and CUDA-Lite labels on Jul 29, 2020
@bornaehsani self-assigned this on Jul 29, 2020
// | /
// |/
// |8|1|2|1|4|1|2|1|
void reduce(bsg_barrier<TG_DIM_X, TG_DIM_Y> &barrier) {
Contributor

Why is reduce part of shared memory?

static constexpr std::size_t STRIPES_PER_TILE = (STRIPES/TILES) + (STRIPES%TILES == 0 ? 0 : 1);
static constexpr std::size_t ELEMENTS_PER_TILE = STRIPES_PER_TILE * STRIPE_SIZE;

static constexpr uint32_t DMEM_START_ADDR = 0x1000; // Beginning of DMEM
Contributor

I think this should be read from the linker script

Comment on lines +16 to +23
#define TEMPLATE_TG_DIM_X 4
#define TEMPLATE_TG_DIM_Y 4
#define TEMPLATE_BLOCK_SIZE_X 64
#define TEMPLATE_BLOCK_SIZE_Y 64
#define TEMPLATE_SUBBLOCK_SIZE 32
#define TEMPLATE_STRIPE_SIZE 1
#define bsg_tiles_X TEMPLATE_TG_DIM_X
#define bsg_tiles_Y TEMPLATE_TG_DIM_Y
Contributor

The CUDA makefiles in replicant specify a 2x2, but this is a 4x4.

Contributor Author

Yes, I'm aware. I'm using fixed 4x4 dimensions for this specific kernel because for 2x2 tile groups the shared memory would be too small, and I didn't want to change the bsg_replicant infrastructure, where the tile group dimensions are 2x2. The replicant Makefile specifies a 2x2, but the 4x4 inside the kernel overrides it. Good catch, though!

Contributor

Is this resolved?

Contributor

I don't know; best to check whether the source in replicant and the source here agree.

v/bsg_manycore_eva_to_npa.v (outdated, resolved)
v/bsg_manycore_eva_to_npa.v (outdated, resolved)
v/vanilla_bean/hash_function_shared.v (outdated, resolved)
Comment on lines 25 to 26
,input [width_p-1:0] shared_eva_i
,input [hash_width_p-1:0] hash_i
Contributor

Instead of plugging in the separate fields, just plug in the whole EVA and cast it with the struct here.

Contributor Author

Great suggestion! Done, thanks!

@@ -190,6 +192,18 @@ package bsg_manycore_pkg;
logic [1:0] low_bits; \
} bsg_manycore_tile_group_addr_s;

// shared
`define declare_bsg_manycore_shared_addr_s \
Contributor

`define is not needed here, since the macros don't have any parameters. Same with the ones above.

Comment on lines 25 to 22
,input [width_p-1:0] shared_eva_i
,input [hash_width_p-1:0] hash_i
,input [x_cord_width_p-1:0] tg_dim_x_width_i
,input [y_cord_width_p-1:0] tg_dim_y_width_i
Contributor

As discussed before, this width doesn't need to be x_cord_width_p. Instead, it should be the minimum number of bits needed to encode the tile-group dimension (which is always a power of two in this case).
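For illustration only (the maximum and helper below are assumptions, not part of the PR): if tile-group dimensions are powers of two with an assumed maximum of 16, the port only needs to carry log2(tg_dim), which fits in 3 bits rather than a full x_cord_width_p-bit coordinate:

#include <cstdio>

// Illustrative sketch: when the tile-group dimension is always a power of two,
// the port only needs to carry log2(tg_dim), so its width is
// clog2(clog2(MAX_TG_DIM) + 1) bits instead of a full coordinate width.
constexpr unsigned clog2(unsigned n) {   // smallest w such that (1 << w) >= n
  unsigned w = 0;
  while ((1u << w) < n) ++w;
  return w;
}

int main() {
  constexpr unsigned MAX_TG_DIM  = 16;                      // assumed maximum dimension
  constexpr unsigned ENCODED_MAX = clog2(MAX_TG_DIM);       // log2(16) = 4
  constexpr unsigned PORT_WIDTH  = clog2(ENCODED_MAX + 1);  // values 0..4 -> 3 bits
  std::printf("encoded values 0..%u need %u bits\n", ENCODED_MAX, PORT_WIDTH);
  return 0;
}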

Comment on lines 45 to 60
// X coordinate
for (integer i = 0; i < tg_dim_x_width_i; i = i + 1) begin
x_o[i] = shared_eva_i[i+hash_i];
end

// Y coordinate
for (integer i = 0; i < tg_dim_y_width_i; i = i + 1) begin
y_o[i] = shared_eva_i[i+tg_dim_x_width_i+hash_i];
end

// The LSB bits of address are stripe
for (integer i = 0; i < hash_i; i = i + 1) begin
addr_o[i] = shared_eva_i[i];
end

// The MSB bits of address are the local offset
for (integer i = hash_i; i < epa_word_addr_width_gp; i = i + 1) begin
addr_o[i] = shared_eva_i[i+tg_dim_y_width_i+tg_dim_x_width_i];
end
Contributor

After changing tg_dim_x_width_i to a compact form, we can use it as the select input of a bsg_mux. That way we know for sure this will synthesize into the most efficient hardware.

Contributor

Please talk with me if you have questions about this approach.

@@ -134,11 +134,13 @@ module vanilla_core_profiler
wire remote_ld_dram_inc = remote_ld_inc & remote_req_o.addr[data_width_p-1];
wire remote_ld_global_inc = remote_ld_inc & (remote_req_o.addr[data_width_p-1-:2] == 2'b01);
wire remote_ld_group_inc = remote_ld_inc & (remote_req_o.addr[data_width_p-1-:3] == 3'b001);
wire remote_ld_shared_inc = remote_ld_inc & (remote_req_o.addr[data_width_p-1-:5] == 5'b00001);
Contributor

I think there is a lot of repeated code here (e.g., remote_req_o.addr[data_width_p-1-:5] == 5'b00001).

Contributor Author

I don't quite understand the problem here. I've tried to use a style consistent with the rest of this code; I'm basically copying lines that were already in these files prior to this PR. I don't see how the issue can apply only to the new line and not to the old ones.

Comment on lines +234 to +258
// int_sb[4]: idiv
// int_sb[3]: remote dram load
// int_sb[2]: remote global load
// int_sb[1]: remote group load
// int_sb[0]: remote shared load
Contributor

Why was it necessary to put shared in [0]?

Contributor Author

No technical reason, but it would be nice to have a logical ordering of the different memory types. My ordering is by access latency: DRAM is the slowest remote memory, so it gets [3], and shared is the fastest remote memory, so it gets [0].

y_cord_o = y_cord_width_p'(shared_y_lo + tgo_y_i);
x_cord_o = x_cord_width_p'(shared_x_lo + tgo_x_i);
epa_o = { {(addr_width_p-epa_word_addr_width_gp){1'b0}},
{{shared_epa_lo + dmem_start_addr_lp}} };
Contributor

use |
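A sketch of why | can replace + here, assuming dmem_start_addr_lp is aligned and the shared EPA offset fits entirely in the bits below it (an assumption, not something verified in this diff):

#include <cassert>
#include <cstdint>

// Illustrative sketch: if the DMEM base is aligned and the offset never reaches
// the base's set bit, OR-ing the offset into the base equals adding it, with no
// carry chain. The 0x1000 base below is an assumed value.
int main() {
  const uint32_t dmem_start_addr = 0x1000;   // assumed 4 KiB-aligned DMEM base
  for (uint32_t shared_epa = 0; shared_epa < 0x1000; ++shared_epa)
    assert((dmem_start_addr | shared_epa) == (dmem_start_addr + shared_epa));
  return 0;
}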

Contributor Author

@tommydcjung
Contributor

There should be SPMD tests that cover a wide range of configurations.

… and local address of a shared eva

2. Updated the arch_filelist.mk accordingly
3. Updated the eva_to_npa unit to use this hash function
4. Updated the bsg_shared_mem.hpp to incorporate the stripe size (hash) inside the shared EVA
…ash part of shared eva for determining the stripe_width
…Rs tg_dim_x/y_width = (ceil(log2(tg_dim_x/y))

2. Updated the shared eva hash function to use the tile group dimension width instead of localparams
… a hardware tile group shared memory cuda lite regression test
@mrutt92 force-pushed the rebased_hardware_shared_mem branch from a7a3ae4 to 6fcbbab on October 31, 2020 at 21:39
Labels: CUDA-Lite, enhancement (New feature or request)
4 participants