Hardware Tile Group Shared Memory #370
Conversation
// |  /  |
// | /   |
// |8|1|2|1|4|1|2|1|
void reduce(bsg_barrier<TG_DIM_X, TG_DIM_Y> &barrier) {
Why is reduce part of shared memory?
static constexpr std::size_t STRIPES_PER_TILE = (STRIPES/TILES) + (STRIPES%TILES == 0 ? 0 : 1);
static constexpr std::size_t ELEMENTS_PER_TILE = STRIPES_PER_TILE * STRIPE_SIZE;
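(For illustration, with made-up numbers: if STRIPES = 70 and TILES = 16, then 70/16 = 4 with remainder 6, so STRIPES_PER_TILE = 4 + 1 = 5, i.e. a ceiling division, and ELEMENTS_PER_TILE = 5 * STRIPE_SIZE. That allocates 16 * 5 = 80 stripe slots for 70 stripes, so the last 10 slots presumably go unused.)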
static constexpr uint32_t DMEM_START_ADDR = 0x1000; // Beginning of DMEM
I think this should be read from the linker script
Resolved review threads:
- ...g_cuda_lite_runtime/hardware_shared_mem_load_store/kernel_hardware_shared_mem_load_store.hpp
- software/spmd/bsg_cuda_lite_runtime/hardware_shared_mem_load_store/main.c
- .../spmd/bsg_cuda_lite_runtime/hardware_shared_mem_reduce/kernel_hardware_shared_mem_reduce.cpp
#define TEMPLATE_TG_DIM_X 4
#define TEMPLATE_TG_DIM_Y 4
#define TEMPLATE_BLOCK_SIZE_X 64
#define TEMPLATE_BLOCK_SIZE_Y 64
#define TEMPLATE_SUBBLOCK_SIZE 32
#define TEMPLATE_STRIPE_SIZE 1
#define bsg_tiles_X TEMPLATE_TG_DIM_X
#define bsg_tiles_Y TEMPLATE_TG_DIM_Y
The CUDA makefiles in replicant specify a 2x2, but this is a 4x4.
Yes, I'm aware. I'm using fixed 4x4 dimensions for this specific kernel because, for 2x2 tile groups, the shared memory would be too small, and I didn't want to change the bsg_replicant infrastructure, where the tile group dimensions are 2x2. The replicant Makefile specifies a 2x2, but the 4x4 inside the kernel overrides it. Good catch, though!
Is this resolved?
I don't know - best to check if the source in replicant and here agree.
,input [width_p-1:0] shared_eva_i
,input [hash_width_p-1:0] hash_i
Instead of plugging in separate fields, just plug in the whole EVA and cast it with the struct in here.
Great suggestion! Done, thanks!
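For illustration, a minimal sketch of what the cast-based version might look like; the field names and widths below are placeholders, not the PR's actual bsg_manycore_shared_addr_s layout:

// Hypothetical 32-bit shared-EVA layout; the real fields live in bsg_manycore_pkg.
typedef struct packed {
  logic [1:0]  tag;           // placeholder: bits marking a shared-memory EVA
  logic [17:0] local_offset;  // placeholder widths summing to 32 bits
  logic [5:0]  y_cord;
  logic [5:0]  x_cord;
} shared_eva_s;

shared_eva_s eva_cast;
assign eva_cast = shared_eva_i;        // one whole-EVA connection replaces the split ports
wire [5:0] x_field = eva_cast.x_cord;  // fields are then accessed by name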
v/bsg_manycore_pkg.v
@@ -190,6 +192,18 @@ package bsg_manycore_pkg;
  logic [1:0] low_bits; \
} bsg_manycore_tile_group_addr_s;

// shared
`define declare_bsg_manycore_shared_addr_s \
`define is not needed here, since the macros don't have any parameters. Same with the ones above.
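A sketch of the point being made (the field layout below is a placeholder): a parameterless declare macro can be replaced by a plain typedef inside the package.

package bsg_manycore_pkg;
  // Placeholder fields; only the typedef-vs-`define mechanics matter here.
  typedef struct packed {
    logic [17:0] local_offset;
    logic [5:0]  y_cord;
    logic [5:0]  x_cord;
    logic [1:0]  low_bits;
  } bsg_manycore_shared_addr_s;
endpackage

A `define-style declare macro is only necessary when the struct's field widths depend on a module parameter, which a fixed-layout address struct does not need.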
,input [width_p-1:0] shared_eva_i
,input [hash_width_p-1:0] hash_i
,input [x_cord_width_p-1:0] tg_dim_x_width_i
,input [y_cord_width_p-1:0] tg_dim_y_width_i
As discussed before, this width doesn't need to be x_cord_width_p. Instead, it should be the minimum number of bits needed to encode the size of the tile-group dimension (which is only a power of 2, in this case).
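A sketch of the compact form; the module and parameter names here are made up for illustration. Since dimensions are powers of two, the port only carries log2 of the dimension, which needs $clog2($clog2(max_dim)+1) bits rather than x_cord_width_p:

// Hypothetical fragment: for max_tg_dim_p = 16, the encoded values are 0..4,
// so the port is 3 bits wide.
module shared_dim_port_sketch
  #(parameter max_tg_dim_p = 16
    , localparam dim_bits_lp = $clog2($clog2(max_tg_dim_p)+1)
  )
  (input  [dim_bits_lp-1:0] tg_dim_x_width_i  // carries log2(tg_dim_x), e.g. 2 for a 4-wide group
   ,output [dim_bits_lp-1:0] tg_dim_x_width_o
  );
  assign tg_dim_x_width_o = tg_dim_x_width_i; // pass-through; only the port sizing matters here
endmodule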
// X coordinate
for (integer i = 0; i < tg_dim_x_width_i; i = i + 1) begin
  x_o[i] = shared_eva_i[i+hash_i];
end

// Y coordinate
for (integer i = 0; i < tg_dim_y_width_i; i = i + 1) begin
  y_o[i] = shared_eva_i[i+tg_dim_x_width_i+hash_i];
end

// The LSB bits of the address are the stripe
for (integer i = 0; i < hash_i; i = i + 1) begin
  addr_o[i] = shared_eva_i[i];
end

// The MSB bits of the address are the local offset
for (integer i = hash_i; i < epa_word_addr_width_gp; i = i + 1) begin
  addr_o[i] = shared_eva_i[i+tg_dim_y_width_i+tg_dim_x_width_i];
end
After changing tg_dim_x_width_i to a compact form, we can use it as the select for a bsg_mux. This way we know for sure this will synthesize into the most efficient hardware.
Please talk with me if you have questions about this approach.
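A sketch of the suggested structure (signal names and the max_dim_bits_lp bound are assumed; bsg_mux is the basejump_stl mux): extract the x field once per possible dimension width, then let the compact width encoding pick the right candidate.

// Hypothetical fragment inside the hash unit; max_dim_bits_lp bounds
// log2 of the tile-group x dimension.
localparam max_dim_bits_lp = 4;

logic [max_dim_bits_lp:0][x_cord_width_p-1:0] x_candidates;

assign x_candidates[0] = '0;  // width 0: a 1-wide group always has x = 0
for (genvar w = 1; w <= max_dim_bits_lp; w++) begin : xcand
  // constant-width indexed part-select; hash_i supplies the variable base
  assign x_candidates[w] = x_cord_width_p'(shared_eva_i[hash_i +: w]);
end

bsg_mux #(
  .width_p(x_cord_width_p)
  ,.els_p(max_dim_bits_lp+1)
) x_mux (
  .data_i(x_candidates)
  ,.sel_i(tg_dim_x_width_i)  // the compact log2-width encoding acts as the select
  ,.data_o(x_o)
);

The same structure would apply to y_o and the address reassembly, replacing the variable-bound for loops with fixed muxes that synthesis can map directly.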
@@ -134,11 +134,13 @@ module vanilla_core_profiler
wire remote_ld_dram_inc = remote_ld_inc & remote_req_o.addr[data_width_p-1];
wire remote_ld_global_inc = remote_ld_inc & (remote_req_o.addr[data_width_p-1-:2] == 2'b01);
wire remote_ld_group_inc = remote_ld_inc & (remote_req_o.addr[data_width_p-1-:3] == 3'b001);
wire remote_ld_shared_inc = remote_ld_inc & (remote_req_o.addr[data_width_p-1-:5] == 5'b00001);
I think there is a lot of repeated code here (e.g. remote_req_o.addr[data_width_p-1-:5] == 5'b00001).
I don't quite understand the problem here. I've tried to use a style consistent with the rest of this code; I'm basically copy-pasting lines that were already in these files prior to this PR. I don't see how the issue can apply only to the new line and not the old ones.
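For what it's worth, one way the repetition could be factored (a sketch, not something the PR adopts): name each EVA-space compare once and reuse it wherever a counter needs it.

// Decode each EVA space once...
wire is_dram_addr   = remote_req_o.addr[data_width_p-1];
wire is_global_addr = (remote_req_o.addr[data_width_p-1-:2] == 2'b01);
wire is_group_addr  = (remote_req_o.addr[data_width_p-1-:3] == 3'b001);
wire is_shared_addr = (remote_req_o.addr[data_width_p-1-:5] == 5'b00001);

// ...then reuse the decode for every load/store counter.
wire remote_ld_dram_inc   = remote_ld_inc & is_dram_addr;
wire remote_ld_global_inc = remote_ld_inc & is_global_addr;
wire remote_ld_group_inc  = remote_ld_inc & is_group_addr;
wire remote_ld_shared_inc = remote_ld_inc & is_shared_addr;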
// int_sb[4]: idiv
// int_sb[3]: remote dram load
// int_sb[2]: remote global load
// int_sb[1]: remote group load
// int_sb[0]: remote shared load
why was it necessary to put shared in [0]...?
No technical reason, but it is nice to have a logical ordering of the different memory types. My ordering is by access latency: DRAM is the slowest remote memory, so it gets [3], and shared is the fastest, so it gets [0].
// Shared-memory NPA: the hashed tile coordinates are offset by the
// tile-group origin, and the local EPA is offset by the start of DMEM.
y_cord_o = y_cord_width_p'(shared_y_lo + tgo_y_i);
x_cord_o = x_cord_width_p'(shared_x_lo + tgo_x_i);
epa_o = { {(addr_width_p-epa_word_addr_width_gp){1'b0}},
          {{shared_epa_lo + dmem_start_addr_lp}} };
use
There should be SPMD tests that cover a wide range of configurations.
Force-pushed from c241572 to 56bc3ad.
Force-pushed from 56bc3ad to f724b74.
…anaged tile group shared memory. Still incomplete. 2. Added a hard_shared bsg_cuda_lite_runtime test
… and local address of a shared eva 2. Updated the arch_filelist.mk accordingly 3. Updated the eva_to_npa unit to use this hash function 4. Updated the bsg_shared_mem.hpp to incorporate the stripe size (hash) inside the shared EVA
…ash part of shared eva for determining the stripe_width
…Rs tg_dim_x/y_width = ceil(log2(tg_dim_x/y)) 2. Updated the shared eva hash function to use the tile group dimension width instead of localparams
…memory regression test
… library for llvm
…filing tools, and the blood graph
…e tile group dimensions are power of two
… a hardware tile group shared memory cuda lite regression test
Force-pushed from a7a3ae4 to 6fcbbab.
CI is in progress; the CUDA Lite regression passes with 4x4_fast_n_fake, 4x4_blocking_vcache_f1_model, and timing_v0_8_4.