Hardware Tile Group Shared Memory #370

Open

bornaehsani wants to merge 39 commits into master from rebased_hardware_shared_mem

Conversation

@bornaehsani (Contributor) commented Jul 29, 2020

  • A new shared EVA for the manycore hardware
  • A hardware hash function for translating a shared EVA to an NPA (a software model of the decomposition is sketched after this list)
  • Two new CSRs for the tile group dimensions, used in the shared-EVA-to-NPA translation
  • A software library for declaring and accessing hardware tile group shared memory, with a built-in reduction method
  • Updated profiling tools that recognize the new remote tile group shared accesses
  • Four new CUDA-Lite regression tests: a hardware tile group shared memory load/store test, a reduction test, and two versions of blocked matrix multiplication using hardware tile group shared memory

CI is in progress; the CUDA-Lite regression passes with 4x4_fast_n_fake, 4x4_blocking_vcache_f1_model, and timing_v0_8_4.
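For context on the shared-EVA-to-NPA translation above, here is a minimal software model of the decomposition (the field layout follows the hash_function_shared.v snippet quoted later in this thread; the struct and function names are illustrative, not part of the PR):

#include <cstdint>
#include <cstdio>

// Software model of the shared-EVA field layout: the low `hash` bits select the
// word within a stripe, the next tg_dim_x_width and tg_dim_y_width bits select the
// tile's X/Y offset inside the tile group, and the remaining bits form the local
// DMEM word offset. Field widths here are assumptions for illustration.
struct shared_npa_s {
  uint32_t x;          // tile X offset within the tile group
  uint32_t y;          // tile Y offset within the tile group
  uint32_t local_addr; // word offset within the tile's shared DMEM region
};

static shared_npa_s decode_shared_eva(uint32_t shared_eva,
                                      unsigned hash,            // log2(stripe size in words)
                                      unsigned tg_dim_x_width,  // log2(tile group X dim)
                                      unsigned tg_dim_y_width)  // log2(tile group Y dim)
{
  uint32_t stripe_off = shared_eva & ((1u << hash) - 1);
  uint32_t x          = (shared_eva >> hash) & ((1u << tg_dim_x_width) - 1);
  uint32_t y          = (shared_eva >> (hash + tg_dim_x_width)) & ((1u << tg_dim_y_width) - 1);
  uint32_t local_hi   = shared_eva >> (hash + tg_dim_x_width + tg_dim_y_width);
  return { x, y, (local_hi << hash) | stripe_off };
}

int main() {
  // Example: 4x4 tile group (2-bit X/Y fields), stripe size of one word (hash = 0),
  // so consecutive shared words land on consecutive tiles of the group.
  for (uint32_t eva = 0; eva < 8; ++eva) {
    shared_npa_s npa = decode_shared_eva(eva, 0, 2, 2);
    std::printf("shared eva %u -> tile (%u,%u), local word %u\n", eva, npa.x, npa.y, npa.local_addr);
  }
  return 0;
}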

@bornaehsani added the enhancement (New feature or request) and CUDA-Lite labels on Jul 29, 2020
@bornaehsani self-assigned this on Jul 29, 2020
// | /
// |/
// |8|1|2|1|4|1|2|1|
void reduce(bsg_barrier<TG_DIM_X, TG_DIM_Y> &barrier) {
Contributor

Why is reduce part of shared memory?

static constexpr std::size_t STRIPES_PER_TILE = (STRIPES/TILES) + (STRIPES%TILES == 0 ? 0 : 1);
static constexpr std::size_t ELEMENTS_PER_TILE = STRIPES_PER_TILE * STRIPE_SIZE;

static constexpr uint32_t DMEM_START_ADDR = 0x1000; // Beginning of DMEM
Contributor

I think this should be read from the linker script

Comment on lines +16 to +23
#define TEMPLATE_TG_DIM_X 4
#define TEMPLATE_TG_DIM_Y 4
#define TEMPLATE_BLOCK_SIZE_X 64
#define TEMPLATE_BLOCK_SIZE_Y 64
#define TEMPLATE_SUBBLOCK_SIZE 32
#define TEMPLATE_STRIPE_SIZE 1
#define bsg_tiles_X TEMPLATE_TG_DIM_X
#define bsg_tiles_Y TEMPLATE_TG_DIM_Y
Contributor

The CUDA makefiles in replicant specify a 2x2, but this is a 4x4.

Contributor Author

Yes, I'm aware. I'm using fixed 4x4 dimensions for this specific kernel because for 2x2 tile groups the shared memory would be too small, and I didn't want to change the bsg_replicant infrastructure, where the tile group dimensions are 2x2. The replicant Makefile specifies a 2x2, but the 4x4 inside the kernel overrides it. Good catch, though!

Contributor

Is this resolved?

Contributor

I don't know; best to check whether the source in replicant and the source here agree.

v/bsg_manycore_eva_to_npa.v (outdated, resolved)
v/bsg_manycore_eva_to_npa.v (outdated, resolved)
v/vanilla_bean/hash_function_shared.v (outdated, resolved)
Comment on lines 25 to 26
,input [width_p-1:0] shared_eva_i
,input [hash_width_p-1:0] hash_i
Contributor

Instead of plugging in the separate fields, just plug in the whole EVA and cast it with the struct here.

Contributor Author

Great suggestion! Done, thanks!

@@ -190,6 +192,18 @@ package bsg_manycore_pkg;
logic [1:0] low_bits; \
} bsg_manycore_tile_group_addr_s;

// shared
`define declare_bsg_manycore_shared_addr_s \
Contributor

`define is not needed here, since the macros don't have any parameters. Same with the ones above.

Comment on lines 25 to 22
,input [width_p-1:0] shared_eva_i
,input [hash_width_p-1:0] hash_i
,input [x_cord_width_p-1:0] tg_dim_x_width_i
,input [y_cord_width_p-1:0] tg_dim_y_width_i
Contributor

As discussed before, this width doesn't need to be x_cord_width_p. Instead, it should be the minimum number of bits needed to encode the tile-group dimension (which is always a power of two in this case).
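For illustration only (the maximum and helper below are assumptions, not part of the PR): if tile-group dimensions are powers of two with an assumed maximum of 16, the port only needs to carry log2(tg_dim), which fits in 3 bits rather than a full x_cord_width_p-bit coordinate:

#include <cstdio>

// Illustrative sketch: when the tile-group dimension is always a power of two,
// the port only needs to carry log2(tg_dim), so its width is
// clog2(clog2(MAX_TG_DIM) + 1) bits instead of a full coordinate width.
constexpr unsigned clog2(unsigned n) {   // smallest w such that (1 << w) >= n
  unsigned w = 0;
  while ((1u << w) < n) ++w;
  return w;
}

int main() {
  constexpr unsigned MAX_TG_DIM  = 16;                      // assumed maximum dimension
  constexpr unsigned ENCODED_MAX = clog2(MAX_TG_DIM);       // log2(16) = 4
  constexpr unsigned PORT_WIDTH  = clog2(ENCODED_MAX + 1);  // values 0..4 -> 3 bits
  std::printf("encoded values 0..%u need %u bits\n", ENCODED_MAX, PORT_WIDTH);
  return 0;
}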

Comment on lines 45 to 60
// X coordinate
for (integer i = 0; i < tg_dim_x_width_i; i = i + 1) begin
x_o[i] = shared_eva_i[i+hash_i];
end

// Y coordinate
for (integer i = 0; i < tg_dim_y_width_i; i = i + 1) begin
y_o[i] = shared_eva_i[i+tg_dim_x_width_i+hash_i];
end

// The LSB bits of address are stripe
for (integer i = 0; i < hash_i; i = i + 1) begin
addr_o[i] = shared_eva_i[i];
end

// The MSB bits of address are the local offset
for (integer i = hash_i; i < epa_word_addr_width_gp; i = i + 1) begin
addr_o[i] = shared_eva_i[i+tg_dim_y_width_i+tg_dim_x_width_i];
end
Contributor

After changing tg_dim_x_width_i to a compact form, we can use it as the select input of a bsg_mux. That way we know for sure this will synthesize into the most efficient hardware.

Contributor

Please talk with me if you have questions about this approach.

@@ -134,11 +134,13 @@ module vanilla_core_profiler
wire remote_ld_dram_inc = remote_ld_inc & remote_req_o.addr[data_width_p-1];
wire remote_ld_global_inc = remote_ld_inc & (remote_req_o.addr[data_width_p-1-:2] == 2'b01);
wire remote_ld_group_inc = remote_ld_inc & (remote_req_o.addr[data_width_p-1-:3] == 3'b001);
wire remote_ld_shared_inc = remote_ld_inc & (remote_req_o.addr[data_width_p-1-:5] == 5'b00001);
Contributor

I think there is a lot of repeated code here (e.g., remote_req_o.addr[data_width_p-1-:5] == 5'b00001).

Contributor Author

I don't quite understand the problem here. I've tried to use a style consistent with the rest of this code; I'm basically copying lines that were already in these files prior to this PR. I don't see how the issue can apply only to the new line and not to the old ones.

Comment on lines +234 to +258
// int_sb[4]: idiv
// int_sb[3]: remote dram load
// int_sb[2]: remote global load
// int_sb[1]: remote group load
// int_sb[0]: remote shared load
Contributor

Why was it necessary to put shared in [0]?

Contributor Author

No technical reason, but it would be nice to have a logical ordering of the different memory types. My ordering is by access latency: DRAM is the slowest remote memory, so it gets [3], and shared is the fastest remote memory, so it gets [0].

y_cord_o = y_cord_width_p'(shared_y_lo + tgo_y_i);
x_cord_o = x_cord_width_p'(shared_x_lo + tgo_x_i);
epa_o = { {(addr_width_p-epa_word_addr_width_gp){1'b0}},
{{shared_epa_lo + dmem_start_addr_lp}} };
Contributor

use |
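A sketch of why | can replace + here, assuming dmem_start_addr_lp is aligned and the shared EPA offset fits entirely in the bits below it (an assumption, not something verified in this diff):

#include <cassert>
#include <cstdint>

// Illustrative sketch: if the DMEM base is aligned and the offset never reaches
// the base's set bit, OR-ing the offset into the base equals adding it, with no
// carry chain. The 0x1000 base below is an assumed value.
int main() {
  const uint32_t dmem_start_addr = 0x1000;   // assumed 4 KiB-aligned DMEM base
  for (uint32_t shared_epa = 0; shared_epa < 0x1000; ++shared_epa)
    assert((dmem_start_addr | shared_epa) == (dmem_start_addr + shared_epa));
  return 0;
}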

Contributor Author

@tommydcjung
Contributor

There should be SPMD tests that cover a wide range of configurations.

… and local address of a shared eva

2. Updated the arch_filelist.mk accordingly
3. Updated the eva_to_npa unit to use this hash function
4. Updated the bsg_shared_mem.hpp to incorporate the stripe size (hash) inside the shared EVA
…ash part of shared eva for determining the stripe_width
…Rs tg_dim_x/y_width = (ceil(log2(tg_dim_x/y))

2. Updated the shared eva hash function to use the tile group dimension width instead of localparams
… a hardware tile group shared memory cuda lite regression test
@mrutt92 force-pushed the rebased_hardware_shared_mem branch from a7a3ae4 to 6fcbbab on October 31, 2020 at 21:39
Labels: CUDA-Lite, enhancement (New feature or request)
4 participants