Hw3 #416

Open · wants to merge 5 commits into base: HW3
87 changes: 87 additions & 0 deletions HW3/P3/P3.txt
@@ -0,0 +1,87 @@
#0: Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz on Apple
#1: Intel(R) Iris(TM) Graphics 6100 on Apple
coalesced reads, workgroups: 8, num_workers: 4, 0.1539116 seconds
coalesced reads, workgroups: 8, num_workers: 8, 0.08159168 seconds
coalesced reads, workgroups: 8, num_workers: 16, 0.05025496 seconds
coalesced reads, workgroups: 8, num_workers: 32, 0.02677488 seconds
coalesced reads, workgroups: 8, num_workers: 64, 0.01586712 seconds
coalesced reads, workgroups: 8, num_workers: 128, 0.00901392 seconds
coalesced reads, workgroups: 16, num_workers: 4, 0.07981168 seconds
coalesced reads, workgroups: 16, num_workers: 8, 0.04508424 seconds
coalesced reads, workgroups: 16, num_workers: 16, 0.0276228 seconds
coalesced reads, workgroups: 16, num_workers: 32, 0.01530312 seconds
coalesced reads, workgroups: 16, num_workers: 64, 0.00905 seconds
coalesced reads, workgroups: 16, num_workers: 128, 0.00379776 seconds
coalesced reads, workgroups: 32, num_workers: 4, 0.04075392 seconds
coalesced reads, workgroups: 32, num_workers: 8, 0.0240396 seconds
coalesced reads, workgroups: 32, num_workers: 16, 0.01509976 seconds
coalesced reads, workgroups: 32, num_workers: 32, 0.00901072 seconds
coalesced reads, workgroups: 32, num_workers: 64, 0.0048736 seconds
coalesced reads, workgroups: 32, num_workers: 128, 0.004536 seconds
coalesced reads, workgroups: 64, num_workers: 4, 0.02377664 seconds
coalesced reads, workgroups: 64, num_workers: 8, 0.01393464 seconds
coalesced reads, workgroups: 64, num_workers: 16, 0.00937744 seconds
coalesced reads, workgroups: 64, num_workers: 32, 0.00493296 seconds
coalesced reads, workgroups: 64, num_workers: 64, 0.00444424 seconds
coalesced reads, workgroups: 64, num_workers: 128, 0.00383528 seconds
coalesced reads, workgroups: 128, num_workers: 4, 0.02512672 seconds
coalesced reads, workgroups: 128, num_workers: 8, 0.01553176 seconds
coalesced reads, workgroups: 128, num_workers: 16, 0.009058 seconds
coalesced reads, workgroups: 128, num_workers: 32, 0.00472944 seconds
coalesced reads, workgroups: 128, num_workers: 64, 0.00383848 seconds
coalesced reads, workgroups: 128, num_workers: 128, 0.00367832 seconds
coalesced reads, workgroups: 256, num_workers: 4, 0.02256088 seconds
coalesced reads, workgroups: 256, num_workers: 8, 0.01199136 seconds
coalesced reads, workgroups: 256, num_workers: 16, 0.00709632 seconds
coalesced reads, workgroups: 256, num_workers: 32, 0.00401824 seconds
coalesced reads, workgroups: 256, num_workers: 64, 0.00396376 seconds
coalesced reads, workgroups: 256, num_workers: 128, 0.0034156 seconds
coalesced reads, workgroups: 512, num_workers: 4, 0.022924 seconds
coalesced reads, workgroups: 512, num_workers: 8, 0.011864 seconds
coalesced reads, workgroups: 512, num_workers: 16, 0.00672344 seconds
coalesced reads, workgroups: 512, num_workers: 32, 0.00368856 seconds
coalesced reads, workgroups: 512, num_workers: 64, 0.00337808 seconds
coalesced reads, workgroups: 512, num_workers: 128, 0.0035404 seconds
blocked reads, workgroups: 8, num_workers: 4, 0.14802352 seconds
blocked reads, workgroups: 8, num_workers: 8, 0.08757504 seconds
blocked reads, workgroups: 8, num_workers: 16, 0.05723184 seconds
blocked reads, workgroups: 8, num_workers: 32, 0.03349552 seconds
blocked reads, workgroups: 8, num_workers: 64, 0.0155752 seconds
blocked reads, workgroups: 8, num_workers: 128, 0.0105428 seconds
blocked reads, workgroups: 16, num_workers: 4, 0.07797992 seconds
blocked reads, workgroups: 16, num_workers: 8, 0.04836992 seconds
blocked reads, workgroups: 16, num_workers: 16, 0.03260912 seconds
blocked reads, workgroups: 16, num_workers: 32, 0.0158576 seconds
blocked reads, workgroups: 16, num_workers: 64, 0.01049312 seconds
blocked reads, workgroups: 16, num_workers: 128, 0.00736264 seconds
blocked reads, workgroups: 32, num_workers: 4, 0.04226272 seconds
blocked reads, workgroups: 32, num_workers: 8, 0.02656104 seconds
blocked reads, workgroups: 32, num_workers: 16, 0.01613416 seconds
blocked reads, workgroups: 32, num_workers: 32, 0.0100472 seconds
blocked reads, workgroups: 32, num_workers: 64, 0.00740984 seconds
blocked reads, workgroups: 32, num_workers: 128, 0.00793808 seconds
blocked reads, workgroups: 64, num_workers: 4, 0.02388056 seconds
blocked reads, workgroups: 64, num_workers: 8, 0.01382888 seconds
blocked reads, workgroups: 64, num_workers: 16, 0.01021512 seconds
blocked reads, workgroups: 64, num_workers: 32, 0.00744776 seconds
blocked reads, workgroups: 64, num_workers: 64, 0.00763472 seconds
blocked reads, workgroups: 64, num_workers: 128, 0.00713896 seconds
blocked reads, workgroups: 128, num_workers: 4, 0.02438576 seconds
blocked reads, workgroups: 128, num_workers: 8, 0.015862 seconds
blocked reads, workgroups: 128, num_workers: 16, 0.01180312 seconds
blocked reads, workgroups: 128, num_workers: 32, 0.00880544 seconds
blocked reads, workgroups: 128, num_workers: 64, 0.00774904 seconds
blocked reads, workgroups: 128, num_workers: 128, 0.00706464 seconds
blocked reads, workgroups: 256, num_workers: 4, 0.02058416 seconds
blocked reads, workgroups: 256, num_workers: 8, 0.0128224 seconds
blocked reads, workgroups: 256, num_workers: 16, 0.00820904 seconds
blocked reads, workgroups: 256, num_workers: 32, 0.00728816 seconds
blocked reads, workgroups: 256, num_workers: 64, 0.00783352 seconds
blocked reads, workgroups: 256, num_workers: 128, 0.00655888 seconds
blocked reads, workgroups: 512, num_workers: 4, 0.01911864 seconds
blocked reads, workgroups: 512, num_workers: 8, 0.01279408 seconds
blocked reads, workgroups: 512, num_workers: 16, 0.00918208 seconds
blocked reads, workgroups: 512, num_workers: 32, 0.00653176 seconds
blocked reads, workgroups: 512, num_workers: 64, 0.00648824 seconds
blocked reads, workgroups: 512, num_workers: 128, 0.00676712 seconds
configuration ('coalesced', 512, 64): 0.00337808 seconds
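The fastest configuration is a coalesced read with 512 workgroups of 64 workers. The difference between the two access patterns timed above can be sketched in plain Python (a hypothetical illustration with made-up helper names, not part of the assignment code):

```python
def coalesced_indices(worker_id, global_size, N):
    # Worker i reads x[i], x[i + global_size], ...: on each pass, adjacent
    # workers touch adjacent addresses, which the GPU merges into one read.
    return list(range(worker_id, N, global_size))

def blocked_indices(worker_id, global_size, N):
    # Worker i reads a contiguous block of k = ceil(N / global_size) elements,
    # so on each pass adjacent workers touch addresses k elements apart.
    k = -(-N // global_size)  # ceil division
    return [i for i in range(worker_id * k, (worker_id + 1) * k) if i < N]

# Both schemes cover every element exactly once; only the per-pass
# address pattern differs.
N, global_size = 10, 4
cov_c = sorted(i for w in range(global_size) for i in coalesced_indices(w, global_size, N))
cov_b = sorted(i for w in range(global_size) for i in blocked_indices(w, global_size, N))
assert cov_c == cov_b == list(range(N))
```

Both patterns do the same arithmetic; the coalesced one wins in the timings above because neighboring workers issue neighboring addresses on every pass.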
44 changes: 25 additions & 19 deletions HW3/P3/sum.cl
@@ -5,16 +5,19 @@ __kernel void sum_coalesced(__global float* x,
{
float sum = 0;
size_t local_id = get_local_id(0);

size_t global_id = get_global_id(0);
int global_size = get_global_size(0);

// thread i (i.e., with i = get_global_id()) should add x[i],
// x[i + get_global_size()], ... up to N-1, and store in sum.
for (;;) { // YOUR CODE HERE
; // YOUR CODE HERE

for (int i=global_id; i<N; i+=global_size) {
sum += x[i];
}

fast[local_id] = sum;
barrier(CLK_LOCAL_MEM_FENCE);

// binary reduction
//
// thread i should sum fast[i] and fast[i + offset] and store back
@@ -24,10 +27,13 @@
// You can assume get_local_size(0) is a power of 2.
//
// See http://www.nehalemlabs.net/prototype/blog/2014/06/16/parallel-programming-with-opencl-and-python-parallel-reduce/
for (;;) { // YOUR CODE HERE
; // YOUR CODE HERE
int local_size = get_local_size(0);

for (uint offset = local_size / 2; offset > 0; offset >>= 1) {
if (local_id < offset) fast[local_id] += fast[local_id + offset];
barrier(CLK_LOCAL_MEM_FENCE);
}

if (local_id == 0) partial[get_group_id(0)] = fast[0];
}
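The binary reduction used by sum_coalesced above can be mirrored step for step in Python. This is only a sketch of the kernel's loop (the function name is ours), assuming the input length is a power of two, as the kernel assumes for get_local_size(0):

```python
def tree_reduce(fast):
    # fast holds one partial sum per work-item; len(fast) must be a power of 2.
    n = len(fast)
    offset = n // 2
    while offset > 0:
        # Work-items with local_id < offset fold the upper half into the lower.
        for local_id in range(offset):
            fast[local_id] += fast[local_id + offset]
        # (In the kernel, barrier(CLK_LOCAL_MEM_FENCE) sits here so every
        #  work-item finishes the round before offset is halved.)
        offset //= 2
    return fast[0]
```

After log2(n) rounds the total ends up in fast[0], which thread 0 writes to partial[get_group_id(0)].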

@@ -38,23 +44,22 @@ __kernel void sum_blocked(__global float* x,
{
float sum = 0;
size_t local_id = get_local_id(0);
int k = ceil(float(N) / get_global_size(0));

int k = ceil((float)N / get_global_size(0));
// thread with global_id 0 should add 0..k-1
// thread with global_id 1 should add k..2k-1
// thread with global_id 2 should add 2k..3k-1
// ...
// with k = ceil(N / get_global_size()).
//
//
// Be careful that each thread stays in bounds, both relative to
// size of x (i.e., N), and the range it's assigned to sum.
for (;;) { // YOUR CODE HERE
; // YOUR CODE HERE
for (int ii = 0; ii < k && get_global_id(0) * k + ii < N; ii++) {
sum += x[get_global_id(0) * k + ii];
}

fast[local_id] = sum;
barrier(CLK_LOCAL_MEM_FENCE);

// binary reduction
//
// thread i should sum fast[i] and fast[i + offset] and store back
@@ -64,9 +69,10 @@
// You can assume get_local_size(0) is a power of 2.
//
// See http://www.nehalemlabs.net/prototype/blog/2014/06/16/parallel-programming-with-opencl-and-python-parallel-reduce/
for (;;) { // YOUR CODE HERE
; // YOUR CODE HERE
for (uint offset = get_local_size(0) / 2; offset > 0; offset >>= 1) {
if (local_id < offset) fast[local_id] += fast[local_id + offset];
barrier(CLK_LOCAL_MEM_FENCE);
}

if (local_id == 0) partial[get_group_id(0)] = fast[0];
}
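The bound check in the blocked loop above matters whenever N is not a multiple of the global size. A hypothetical one-thread Python stand-in (names are ours):

```python
def blocked_partial_sum(x, global_id, global_size):
    # Partial sum for one thread under the blocked scheme, with both guards
    # from the kernel: ii < k and global_id * k + ii < N.
    N = len(x)
    k = -(-N // global_size)  # ceil(N / global_size)
    s = 0.0
    for ii in range(k):
        idx = global_id * k + ii
        if idx >= N:  # the last thread's block may run past the end of x
            break
        s += x[idx]
    return s
```

Summing the per-thread results over all global ids recovers sum(x) even for awkward N.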
4 changes: 2 additions & 2 deletions HW3/P3/tune.py
@@ -23,7 +23,7 @@ def create_data(N):
times = {}

for num_workgroups in 2 ** np.arange(3, 10):
partial_sums = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, 4 * num_workgroups + 4)
partial_sums = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, 4 * num_workgroups)
host_partial = np.empty(num_workgroups).astype(np.float32)
for num_workers in 2 ** np.arange(2, 8):
local = cl.LocalMemory(num_workers * 4)
@@ -40,7 +40,7 @@ def create_data(N):
format(num_workgroups, num_workers, seconds))

for num_workgroups in 2 ** np.arange(3, 10):
partial_sums = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, 4 * num_workgroups + 4)
partial_sums = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, 4 * num_workgroups)
host_partial = np.empty(num_workgroups).astype(np.float32)
for num_workers in 2 ** np.arange(2, 8):
local = cl.LocalMemory(num_workers * 4)
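tune.py's driver is a two-stage reduction: each workgroup writes one partial sum into partial_sums, and the host adds the partials. A hypothetical pure-Python stand-in for that structure (the chunking detail differs between the two kernels):

```python
def two_stage_sum(x, num_workgroups):
    # Stage 1: one partial sum per workgroup (the kernel's job on the device).
    chunk = -(-len(x) // num_workgroups)  # ceil division
    partials = [sum(x[g * chunk:(g + 1) * chunk]) for g in range(num_workgroups)]
    # Stage 2: the host sums the num_workgroups partials.
    return sum(partials)
```

This is why host_partial has exactly num_workgroups float32 entries: one slot per workgroup.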
52 changes: 48 additions & 4 deletions HW3/P4/median_filter.cl
@@ -12,7 +12,24 @@ median_3x3(__global __read_only float *in_values,
// Note: It may be easier for you to implement median filtering
// without using the local buffer, first, then adjust your code to
// use such a buffer after you have that working.

//get global location
int x = get_global_id(0);
int y = get_global_id(1);

//get local location
int lx = get_local_id(0);
int ly = get_local_id(1);

//get corner location of buffer
int buf_corner_x = x - lx - halo;
int buf_corner_y = y - ly - halo;

//get location of buffer
int buf_x = lx + halo;
int buf_y = ly + halo;

int idx_1D = ly * get_local_size(0) + lx;

// Load into buffer (with 1-pixel halo).
//
@@ -21,14 +38,41 @@ //
//
// Note that globally out-of-bounds pixels should be replaced
// with the nearest valid pixel's value.

if(idx_1D < buf_w){
// replace x_id to stay in-bounds
int x_id = buf_corner_x + idx_1D;
if(x_id < 0) x_id = 0;
if(x_id >= w) x_id = w - 1;


// replace y_id to stay in-bounds
for(int r = 0; r < buf_h; r++){
int y_id = buf_corner_y + r;
if(y_id < 0) y_id = 0;
if(y_id >= h) y_id = h-1;
buffer[r*buf_w+idx_1D] = in_values[y_id*w+x_id];
}
}
barrier(CLK_LOCAL_MEM_FENCE);

// Compute 3x3 median for each pixel in core (non-halo) pixels
// We've given you median9.h, and included it above, so you can
// use the median9() function.


float median = median9(buffer[(buf_y-1)*buf_w+buf_x-1],
buffer[(buf_y-1)*buf_w+buf_x],
buffer[(buf_y-1)*buf_w+buf_x+1],
buffer[(buf_y)*buf_w+buf_x-1],
buffer[(buf_y)*buf_w+buf_x],
buffer[(buf_y)*buf_w+buf_x+1],
buffer[(buf_y+1)*buf_w+buf_x-1],
buffer[(buf_y+1)*buf_w+buf_x],
buffer[(buf_y+1)*buf_w+buf_x+1]);

// Each thread in the valid region (x < w, y < h) should write
// back its 3x3 neighborhood median.
if(x < w && y < h) out_values[x + y * w] = median;
}
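The kernel's clamped halo load plus median9 amounts to the following per-pixel logic. A hypothetical Python stand-in (names are ours), with median9 modeled by sorting the nine neighborhood values and taking the middle one:

```python
def clamp(v, lo, hi):
    return max(lo, min(v, hi))

def median_3x3(img, w, h, x, y):
    # Gather the 3x3 neighborhood of (x, y); out-of-bounds coordinates are
    # clamped to the nearest valid pixel, as the halo load above does.
    vals = sorted(img[clamp(y + dy, 0, h - 1) * w + clamp(x + dx, 0, w - 1)]
                  for dy in (-1, 0, 1) for dx in (-1, 0, 1))
    return vals[4]  # middle of 9 sorted values, i.e. the median
```

A single bright pixel is replaced by its neighborhood median (the background value), which is exactly the salt-and-pepper removal this filter is for.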
5 changes: 3 additions & 2 deletions HW3/P4/median_filter.py
@@ -1,8 +1,8 @@
from __future__ import division
import pyopencl as cl
import numpy as np
import imread
import pylab
import os.path

def round_up(global_size, group_size):
r = global_size % group_size
@@ -51,7 +51,8 @@ def numpy_median(image, iterations=10):
properties=cl.command_queue_properties.PROFILING_ENABLE)
print 'The queue is using the device:', queue.device.name

program = cl.Program(context, open('median_filter.cl').read()).build(options='')
curdir = os.path.dirname(os.path.realpath(__file__))
program = cl.Program(context, open('median_filter.cl').read()).build(options=['-I', curdir])

host_image = np.load('image.npz')['image'].astype(np.float32)[::2, ::2].copy()
host_image_filtered = np.zeros_like(host_image)
43 changes: 43 additions & 0 deletions HW3/P5/P5.txt
@@ -0,0 +1,43 @@

——————————————— Part 1 ———————————————————-

i) Maze 1
- Finished after 914 iterations, 222.05546 ms total, 0.242949080963 ms per iteration.
- Found 2 regions.
ii) Maze 2
- Finished after 532 iterations, 132.33624 ms total, 0.248752348352 ms per iteration.
- Found 35 regions.

——————————————— Part 2 ———————————————————-

i) Maze 1
- Finished after 529 iterations, 128.96085 ms total, 0.243782340345 ms per iteration.
- Found 2 regions.
ii) Maze 2
- Finished after 273 iterations, 65.91784 ms total, 0.241457293493 ms per iteration.
- Found 35 regions.

——————————————— Part 3 ———————————————————-

i) Maze 1
- Finished after 10 iterations, 2.48348 ms total, 0.248348 ms per iteration.
- Found 2 regions.
ii) Maze 2
- Finished after 9 iterations, 2.18921 ms total, 0.243246534875 ms per iteration.
- Found 35 regions.

——————————————— Part 4 ———————————————————-
i) Maze 1
- Finished after 10 iterations, 4.87712 ms total, 0.487712 ms per iteration.
- Found 2 regions.
ii) Maze 2
- Finished after 9 iterations, 4.34102 ms total, 0.482336 ms per iteration.
- Found 35 regions.

In this part, we used a single thread so that redundant global memory reads could be partly reduced. However, the results show that a single thread is actually not a good choice: the per-iteration time roughly doubles compared with Part 3 (about 0.48 ms versus 0.24 ms). I think that is because, at the beginning, most of the labels still differ from their neighbors, so nearly every lookup requires a global memory read anyway; the time saved by avoiding redundant reads is therefore smaller than the parallelism lost by serializing the work onto one thread.

——————————————— Part 5 ———————————————————-
From the results of this part, I found that min() and atomic_min() make little difference to the correctness of the algorithm: both versions converge to the same regions. However, atomic_min() requires fewer iterations. I think the reason is that with min(), the read-modify-write is not indivisible, so the value in labels can be changed by another thread between the read and the write, and some updates are lost until a later iteration redoes them. atomic_min(), on the other hand, guarantees that only one thread updates the label at a time. Thus, I think atomic_min() is the better choice.
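The lost-update effect described above can be replayed deterministically in plain Python. This is a hand-scheduled illustration of one bad interleaving, not real concurrency:

```python
def racy_schedule(label):
    # Two "threads" A and B both try to lower label[0] with plain min():
    a_read = label[0]           # A reads the old value
    b_read = label[0]           # B reads it too, before A writes back
    label[0] = min(a_read, 3)   # A writes 3
    label[0] = min(b_read, 5)   # B overwrites with 5: A's update is lost
    return label[0]

def atomic_schedule(label):
    # atomic_min makes each read-modify-write indivisible, so B sees A's write:
    label[0] = min(label[0], 3)  # A's whole operation completes first
    label[0] = min(label[0], 5)  # then B's, which keeps the smaller 3
    return label[0]
```

With min() the final label can end up 5 instead of 3, and a later iteration has to repair it, which matches the extra iterations observed without atomic_min().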


