Types of parallelism:
- Task based: do different operations in parallel
- e.g. multiply/add
- e.g. GNU Make (running subtasks in parallel)
- Suitable for multi-core CPU or networks of computers
- Data based: same operations in lock step on different data in parallel
- e.g. image ops on all pixels in an image simultaneously
- Suitable for GPUs
Latency vs Throughput:
- Latency oriented
- Get each result with minimum delay
- Get first result ASAP, even if it slows getting all results
- Throughput oriented
- Get all results back with minimum delay
- Doesn't matter how long each individual result takes, so long as the total time is as short as possible
- Latency-oriented (CPU) design features:
- Large caches to speed up memory access
- Exploit temporal/spatial locality and working sets (likely to use the same value from memory again, likely to use a value close to a recently used value)
- Ensure each operation has the smallest probability of fetching from slow memory
- May be a good idea to pre-fetch data
- Complex control units
- Short pipelines, branch prediction, data forwarding
- Make each instruction finish ASAP with minimal pipeline stalls
- Complex, energy-expensive ALUs
- Large, complex transistor arrangements
- Minimise the number of clock cycles per operation, get each result ASAP
- Throughput-oriented (GPU) design features:
- Small caches
- For staging data: get blocks of data for groups of threads to work on, avoiding separate fetches from each thread
- Not for exploiting temporally/spatially located data
- Simple control units
- No branch prediction or data forwarding
- Control shared by multiple threads working on different data
- Simple, energy-efficient ALUs
- Long pipelines
- Large number of cycles per operation, heavily pipelined
- Long wait for the first result (filling the pipeline), following results come quickly
- Requires a large number of threads to keep the processor occupied
- Conceptually, the host CPU and the GPU are separate devices, connected by a communication path
- Need to generate separate code for each device. The Nvidia compiler for CUDA is "nvcc"
- nvcc takes C/C++ with Nvidia extensions, separates out and compiles the GPU code itself, and passes the host code to the host compiler
- The resulting binary contains both host and device code; the device code is downloaded to the GPU from the host when run
- Have to identify whether a function runs on the host, the device, or both, and where it is callable from:
- Host: __host__ void f(...)
- Default, can be omitted
- Callable from host only
- Global: __global__ void f(...)
- Special functions called kernel functions
- Callable from host only; how the host gets code to run on the GPU
- Device: __device__ void f(...)
- Callable from device only; helper functions for kernel functions and other GPU functions
- Both: __host__ __device__ void f(...)
- Generates a host and a device version of the function, so the same code can run on both. Host code cannot call the device version and vice versa (see the sketch below)
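A minimal sketch of how the qualifiers fit together (the function names and the 256-thread launch are illustrative, not from the lecture):

// __device__ helper: callable only from GPU code
__device__ float square(float x) { return x * x; }

// __host__ __device__: compiled for both sides; each side calls its own version
__host__ __device__ float twice(float x) { return 2.0f * x; }

// __global__ kernel: runs on the GPU, launched from the host
__global__ void scale_kernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = twice(square(data[i]));
}

// __host__ (default, qualifier omitted) function: runs on the CPU
void run(float *d_data, int n) {
    scale_kernel<<<(n + 255) / 256, 256>>>(d_data, n);
}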
- Host: when calling a kernel function, the host specifies how threads are organised to execute it (see the launch sketch after the hierarchy below)
- Every thread executes the same kernel function
- Different GPU devices can support different numbers of threads
- Making the thread structure uniform would sacrifice potential computing power
- Don't want a fixed number of threads to dictate the size of the largest vector we can add
- Don't want to change code to run on a different GPU with a different thread count
- Want to be able to organise threads to co-locate groups of threads on sets of processing units, taking advantage of shared caches and synchronisation facilities
- Nvidia GPUs organise threads into a hierarchical structure:
- A Grid is a collection of Blocks
- A Block is a collection of Threads
- A Thread is execution of a kernel on a single processing unit
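As a sketch of how the host specifies this organisation (the vector-add kernel and the 256-thread block size are illustrative), the launch configuration goes between triple angle brackets:

__global__ void vec_add(const float *a, const float *b, float *c, int n) {
    // Each thread computes one element, identified by its block and thread index
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Host side: pick a block size, derive the grid size from the problem size,
// so the same code works for any vector length and any GPU
int threads_per_block = 256;
int blocks_per_grid = (n + threads_per_block - 1) / threads_per_block;
vec_add<<<blocks_per_grid, threads_per_block>>>(d_a, d_b, d_c, n);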
There is the concept of a warp:
- A set of tightly related threads, must execute fully in lock step with each other
- Not part of the CUDA spec, but a feature of all Nvidia GPUs in low level hardware design
- The number of threads in a warp is a property of the GPU, but on current GPUs it is mostly 32
- Low level basis of thread scheduling on a GPU. If a thread is scheduled to execute, so must all other threads in warp
- Executing same instructions in lock step, all threads in warp have same instruction exec timing
- A block can have between 1 and the GPU's maximum block size number of threads. Blocks are the high-level basis for thread scheduling
- Because of warps, the block size should be a multiple of the warp size, otherwise the last warp of the block is padded out to a full warp and the padding threads are wasted
- Grids have a large no. of blocks, more than can be executed at once
- Grid = whole problem divided into bitesize blocks
- (Missing notes)
(Missing notes)
- Latency: time from start to end of task
- Work: measure of work to do, e.g. number of floating point operations, number of images processed
- Throughput: work done per time unit
- Speedup: improvement in speed from using 1 computational unit to p computational units: S(p) = T1 / Tp, where T1 is the time taken with one computational unit and Tp is the time taken with p computational units
- Perfect linear speedup: S(p) = p
- Means that every computational unit we add contributes as much additional throughput as the first one did
- Efficiency: ratio of S(p) to p: E(p) = T1 / (Tp * p) = S(p) / p
- Measures how well the computational units are contributing to latency/throughput
- Perfect linear efficiency: E(p) = 1
T1 = Tser + Tpar, where T1 is the time spent using 1 computational unit, Tser is the time spent doing non-parallelizable work, and Tpar is the time spent doing parallelizable work
- Therefore, using p computational units (where S(p) is the speedup achieved on the parallelizable part) gives us: Tp = Tser + (Tpar / S(p))
- Therefore, the overall speedup is: S'(p) = (Tser + Tpar) / (Tser + (Tpar / S(p)))
- We can write Tser and Tpar using a total time T and the fraction of it that is parallelizable, f:
S'(p) = ((1 - f)T + fT) / ((1 - f)T + (fT / S(p)))
S'(p) = 1 / (1 - f + (f / S(p)))
- If S(p) tends to infinity as p does, the execution time tends to the limit Tser = (1 - f)T, so the overall speedup is bounded by 1 / (1 - f)
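- For example (an assumed illustrative figure): if f = 0.95 of the time is parallelizable, then even as S(p) → ∞ the overall speedup is bounded by S'(p) → 1 / (1 - 0.95) = 20; this bound is commonly known as Amdahl's law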
- Focuses on workload instead of time
- W = (1 - f)W + fW, where W is the total workload executed and f is the fraction of the workload that is parallelizable
- Therefore, with s as some speedup factor for the parallel parts, the workload that can be executed in the same time becomes: Ws = (1 - f)W + sfW
- We can measure the ratio between Ws and W, giving us the speedup:
S = Ws / W
S = ((1 - f)W + sfW) / ((1 - f)W + fW)
S = 1 - f + fs
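- For example (an assumed illustrative figure): with f = 0.95 and s = 100, S = 1 - 0.95 + 0.95 * 100 = 95.05, so the achievable speedup keeps growing with s because the problem size grows with it; this scaled-speedup view is commonly known as Gustafson's law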
- A grid contains a collection of blocks, which contain a collection of threads
- Threads are executed in warps
- e.g. 1024 threads in a block, 32 threads in a warp, 1024/32=32 warps needed to execute a block
- TODO: How do warps relate to streaming multiprocessors?
- Warps execute in lock step, so implicit synchronisation
- For synchronisation within a block, we use __syncthreads()
- All threads in the block will hit this point and wait
- Once all threads reach it, execution will continue
- Cannot call __syncthreads() in a conditional (unless every thread in the block takes the same branch)
- Cannot synchronise across a grid
- In a 1D block
- Warp 0 has threads 0-31
- Warp 1 has threads 32-63
- In a 2D or 3D block, thread indices are first linearised: index = x + y·Dx + z·Dx·Dy, with x varying fastest
- Say the block size is (4, 4, 4), writing each thread as (z, y, x)
- Thread (0, 0, 0) is given index 0
- Thread (0, 0, 1) is given index 1
- Thread (0, 1, 0) is given index 4
- Thread (0, 1, 2) is given index 6
- Thread (1, 0, 0) is given index 16
- etc.
- Then we allocate the warps as we did for a 1D block, using these linearised indices (see the snippet below)
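A sketch of the linearisation inside a kernel, using only standard CUDA built-ins:

// Linear index of this thread within its block: x varies fastest, then y, then z
int tid = threadIdx.x
        + threadIdx.y * blockDim.x
        + threadIdx.z * blockDim.x * blockDim.y;
int warp_id = tid / 32;   // which warp of the block this thread belongs to
int lane    = tid % 32;   // position within that warp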
- In a warp, say threads 0-15 take branch A, and threads 16-31 take branch B. This is called divergence.
- They all have to execute the same commands, which means that the complete warp has to process branch A and branch B
- If they all take branch A, branch B does not have to be executed
- We want to minimise divergence, and will design our algorithms so that threads next to each other in terms of index will execute the same instructions
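An illustrative sketch (branch_a() and branch_b() are hypothetical device functions):

// Divergent: within each warp, threads 0-15 and 16-31 take different branches,
// so the whole warp has to execute both branches one after the other
if ((threadIdx.x % 32) < 16) branch_a(); else branch_b();

// Not divergent: the condition is uniform across each warp (it depends only on
// the block index), so every warp executes just one of the branches
if (blockIdx.x % 2 == 0) branch_a(); else branch_b();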
- Apply some operator to a list of data and store the cumulative output
- e.g.
- List is [1, 2, 1, 3]
- Operator is addition
- Output is [1, 3, 4, 7]
- Only works within a single block
- Work through the list in steps, where thread n adds together:
- Items n and n - 1
- Then items n and n - 2
- Then items n and n - 4
- etc., doubling the offset each step (see the kernel sketch below)
- FLOPs: (N - 1) + (N - 2) + (N - 4) + ...
- For N = 1024, a serial scan has 1023 FLOPs
- For N = 1024, the HSH scan has 9217 FLOPs
- BAD!
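A minimal single-block sketch of this scan as an inclusive sum scan in shared memory (kernel and array names are illustrative; assumes n is no larger than the block size):

__global__ void hsh_scan(const float *in, float *out, int n) {
    extern __shared__ float temp[];      // sized to n floats at launch
    int tid = threadIdx.x;
    if (tid < n) temp[tid] = in[tid];
    __syncthreads();
    for (int offset = 1; offset < n; offset *= 2) {
        float addend = 0.0f;
        if (tid >= offset && tid < n) addend = temp[tid - offset];
        __syncthreads();                  // all reads finish before any writes
        if (tid >= offset && tid < n) temp[tid] += addend;
        __syncthreads();
    }
    if (tid < n) out[tid] = temp[tid];
}

// Launched as e.g. hsh_scan<<<1, n, n * sizeof(float)>>>(d_in, d_out, n);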
- Two phases, reduction and distribution
- Can optimise by copying the block into shared memory, performing the operations there, and then copying back (this works for HSH too)
- Hopefully we don't get examined on this because it was in the assignment... imma skip.
- FLOPs:
- Reduction phase: (N/2) + (N/4) + (N/8) + ...
- Distribution phase: ... + (N/8 - 1) + (N/4 - 1) + (N/2 - 1)
- For N = 1024, Blelloch has 2037 FLOPs
- Perform block scan on all blocks in the list
- Extract the last value of each block into new list
- Perform block scan on this list
- Add this list back into the original block scanned list
- Global memory read/writes in transactions of size 32, 64, or 128 bytes
- Each memory transaction takes approximately the same amount of time, so transferring 8 useful bytes costs about as much as transferring 128 bytes
- If consecutive threads access consecutive bytes, then all reads will be done in one memory transaction
- If consecutive threads access non-consecutive bytes, then all reads will be done in separate memory transactions, which is a lot slower
- In CUDA programming, favour SoA (structure of arrays) over AoS (array of structures), as it helps coalesce memory accesses
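An illustrative sketch of the difference (the particle fields and size are made up):

#define N 1024   // illustrative size

// AoS: consecutive threads reading p[i].x touch addresses 12 bytes apart,
// so a warp's reads are spread over many transactions
struct ParticleAoS { float x, y, z; };
struct ParticleAoS p[N];

// SoA: consecutive threads reading xs[i] touch consecutive 4-byte words,
// so a warp's reads coalesce into few transactions
struct ParticlesSoA { float xs[N]; float ys[N]; float zs[N]; };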
- 2 orders of magnitude faster than global memory accesses
- There are 32 banks (since compute 2.0)
- 32 consecutive 4-byte words of shared memory map to the 32 different banks
- n reads to n different banks can be done in one read operation
- n reads to 1 bank have to be done in n read operations - they are serialised
- With an exception: n reads to the same address in the same bank only take one operation (the value is broadcast)
- TODO: What do these other exceptions mean?
- Sometimes we need to perform an atomic operation, e.g. array[i] += 1;
- This line contains a read, a modification, and a write
- We have some options:
- Use __syncthreads() to make sure all threads can perform the operation uninterrupted
- Restructure so that one thread collects all the updates and performs them
- Use an atomic operation: atomicAdd(&(array[i]), 1);
- TODO: Is this fast? How does it work? Not mentioned in the slides
- Each SM has a maximum number of blocks and warps that it can run concurrently
- The maximum number of blocks is much smaller than the maximum number of warps
- If you have fewer warps running than the maximum number of warps, you fail to reach 100% occupancy
- If you have lots of small blocks, you can hit the maximum number of blocks before reaching the maximum number of warps, reducing occupancy
- Map
- One-to-one, each thread responsible for converting one element
- Gather
- Read many, write one
- Still no read/write races
- Scatter
- Read one, write many
- Need to be careful with writes
- Stencil
- Special case of gather
- Pattern of reads is constant across threads
- Transpose
- AoS → SoA
- Reduce
- All values into one single value
- Scan/sort
- All-to-all (or at least many-to-all)
- Each thread calculates value for some element, and increments histogram bin
- Need to do atomic add
- This doesn't scale well if you have many elements writing to one histogram bin
- Improvement
- Each thread calculates a private histogram for a subset of the data, with no atomic adds
- Then atomically adds its private histogram into the final result (see the sketch after the improvements below)
- Improvement 2
- Instead of adding final results, perform reduce operation
- Improvement 3
- Sort the keys, and then reduce by key
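A sketch of the privatisation idea from the first improvement, here privatised per block in shared memory rather than per thread (bin count, binning rule, and names are illustrative):

#define NUM_BINS 64

__global__ void histogram(const unsigned char *data, int n, unsigned int *bins) {
    __shared__ unsigned int local[NUM_BINS];
    int tid = threadIdx.x;
    for (int b = tid; b < NUM_BINS; b += blockDim.x) local[b] = 0u;
    __syncthreads();

    // Each thread handles a strided subset of the data; atomics here only
    // contend within this block's shared memory
    for (int i = blockIdx.x * blockDim.x + tid; i < n; i += gridDim.x * blockDim.x)
        atomicAdd(&local[data[i] % NUM_BINS], 1u);
    __syncthreads();

    // One atomic add per bin per block into the global result
    for (int b = tid; b < NUM_BINS; b += blockDim.x)
        atomicAdd(&bins[b], local[b]);
}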
- Gather
- One-to-one mapping, but which "ones" are included is not uniform
- Exclusive scan
- The scan result at index i does not include the value at index i of the original array (it is the cumulative result of everything before it)
- Segmented scan
- Only accumulate results within specific ranges (segments)
- We want to apply f() to elements that satisfy a predicate p()
- While all threads execute p(), only a few will execute f(), meaning there will be a lot of divergence
- So we can compact all elements that satisfy p() into one array, and then iterate over that - call this compact()
- First, map xs through p():
xs = [1, 3, 2, 3, 2, 4], p(x) = iseven(x)
bs = [f, f, t, f, t, t]
- Cast the booleans from p() into 0/1s:
[0, 0, 1, 0, 1, 1]
- Perform an exclusive sum scan:
scan = [0, 0, 0, 1, 1, 2]
- Perform a scatter of the original xs to the indices in scan, for positions where bs is satisfied:
2 → 0, 2 → 1, 4 → 2
[2, 2, 4]
- Perform the arbitrary f() on the final, compacted array (see the sketch below)
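A minimal sequential sketch of these steps in C (names and the even-number predicate mirror the example above; on the GPU the scan and scatter would be the parallel versions already described):

int compact(const int *xs, int n, int *out) {
    int flags[n], scan[n];               // C99 variable-length arrays, for brevity
    // 1. map through the predicate, casting booleans to 0/1
    for (int i = 0; i < n; i++) flags[i] = (xs[i] % 2 == 0);
    // 2. exclusive sum scan of the flags gives each kept element its output index
    int sum = 0;
    for (int i = 0; i < n; i++) { scan[i] = sum; sum += flags[i]; }
    // 3. scatter the kept elements to their scanned indices
    for (int i = 0; i < n; i++)
        if (flags[i]) out[scan[i]] = xs[i];
    return sum;                          // number of elements kept
}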
- Can use this for clipping 2D triangles to a viewport
- If a triangle is outside the viewport, it maps to 0 triangles
- If a triangle is inside the viewport, it maps to 1 triangle
- If a triangle is partially inside the viewport, it maps to n different triangles
- We can use compact() to allocate indices for these triangles
Matrix representation of X ⋅ Y:
[0 b c][x]   [by + cz]
[d e f][y] = [dx + ey + fz]
[0 0 i][z]   [iz]
We can represent the sparse matrix X as:
V = [b, c, d, e, f, i]: the values in X, excluding zeros
C = [1, 2, 0, 1, 2, 2]: the columns of the values in V
R = [0, 2, 5]: the indices of the values in V that start each row
Algorithm:
- Gather from Y using C:
G = gather([x, y, z], [1, 2, 0, 1, 2, 2])
G = [y, z, x, y, z, z]
- Map with multiplication over V and G:
M = map(_*_, zip(V, G))
M = [by, cz, dx, ey, fz, iz]
- Run a segmented inclusive sum scan on M, with segments given by the row boundaries in R:
[by + cz, dx + ey + fz, iz]
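An alternative but equivalent sketch that computes the same result with one thread per row rather than via gather/map/segmented scan (names are illustrative; R here is assumed to carry one extra terminating entry equal to the number of non-zeros, i.e. [0, 2, 5, 6] for this matrix):

__global__ void spmv_csr(const float *V, const int *C, const int *R,
                         const float *y, float *out, int num_rows) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < num_rows) {
        float sum = 0.0f;
        // accumulate this row's non-zeros times the gathered entries of y
        for (int j = R[row]; j < R[row + 1]; j++)
            sum += V[j] * y[C[j]];
        out[row] = sum;
    }
}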
- In every even step, thread i compares elements 2i and 2i + 1, and swaps them if they are out of order
- In every odd step, thread i compares elements 2i + 1 and 2i + 2
- O(n) steps, O(n²) work
- Start with trivially sorted segments of size 1
- Merge pairs of segments at each step
- End when there is only one segment
- Small segments:
- Perform merge on one thread
- Medium segments:
- Use multiple threads to merge segments
- Large segments (bigger than block size)
- Use multiple blocks to handle each merge
- Merging A[] and B[]:
- Thread i looks at element A[i]
- Binary search for A[i] in B[], giving index j
- We now know to insert it into the new array at index i + j
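A sketch of this per-element placement for the elements of A (names are illustrative; B's elements would be placed by a symmetric kernel that searches A with an upper-bound search, so that equal keys do not collide):

__device__ int lower_bound(const int *b, int nb, int key) {
    int lo = 0, hi = nb;
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (b[mid] < key) lo = mid + 1; else hi = mid;
    }
    return lo;   // number of elements of b strictly less than key
}

__global__ void place_a(const int *a, int na, const int *b, int nb, int *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < na) {
        int j = lower_bound(b, nb, a[i]);   // j elements of B come before A[i]
        out[i + j] = a[i];
    }
}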
- Split A[] and B[] into segments of size k
- Take each of the splitting values and merge them into a single list using the previous algorithm
- Find each of these splitting values in the other array
- e.g.
- Have splitting values a₁, a₂, a₃ and b₁, b₂, b₃
- Merged, they become a₁, b₁, b₂, a₂, b₃, a₃
- We take the range 0 - a₁ from both A[] and B[] (where a₁ is binary searched for in B[])
- We merge these two ranges into the final array
- We do the same for a₁ - b₁, b₁ - b₂, b₂ - a₂, etc. from the previously merged splitters array
- A monotonic sequence is one where each value is greater than the last, or each is less than the last, e.g. [1, 2, 3, 4]
- A bitonic sequence is one where the direction of the values changes at most once, e.g. [1, 2, 3, 4, 2, 1]
- A bitonic sequence can be shifted circularly, e.g. for the previous example [3, 4, 2, 1, 1, 2], and still remain a bitonic sequence
- You can split a bitonic sequence into two bitonic sequences
- This is done by splitting it in half, and compare-swapping elements between the two halves so that the first half contains all the lower elements
[1, 2, 3, 4, 2, 1] => [1, 2, 3] and [4, 2, 1] => [1, 2, 1] and [4, 2, 3]
- We can split the bitonic list repeatedly, until there is only one element in each bitonic list
- Since they only have one element, they are trivially sorted
- Due to the nature of bitonic splits, all the lists concatenated will be sorted
- But first, we need a bitonic list...
- Change vector into lists of two elements
- Each of these is trivially bitonic
- We then turn each into a monotonic (sorted) list using the split method above, alternating ascending and descending order
- We then group these lists together in pairs, e.g. [a₁, a₂, a₃, a₄] → [a₁++a₂, a₃++a₄]
- These lists are then bitonic, as we've concatenated an ascending monotonic list with a descending one, but they are now of size 4
- We do this until all elements are merged into a single bitonic list, and then turn the list into a monotonic (i.e. sorted) list
- The algorithm is easily parallelizable (see the compare-and-swap sketch below)
- O(log²n) steps, O(n log²n) work
- Fastest for small lists
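A compact sketch of one compare-and-swap step of the bitonic network (the XOR-partner formulation is one common implementation, not from the lecture; assumes n is a power of two and the grid covers all n elements):

__global__ void bitonic_step(float *data, int j, int k) {
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int partner = i ^ j;            // element this thread compares against
    if (partner > i) {
        bool ascending = ((i & k) == 0);     // direction alternates to build bitonic runs
        if ((ascending && data[i] > data[partner]) ||
            (!ascending && data[i] < data[partner])) {
            float tmp = data[i]; data[i] = data[partner]; data[partner] = tmp;
        }
    }
}

// Host side:
// for (int k = 2; k <= n; k <<= 1)
//     for (int j = k >> 1; j > 0; j >>= 1)
//         bitonic_step<<<n / 256, 256>>>(d_data, j, k);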
- Can sort integers
- Look at each integer's binary representation, and do a stable split based on
whether the least significant bit is 0/1
- Stable means that when you sort, if two elements are equal then they do not change order
- Then do this for the 2nd least significant bit, then 3rd, etc.
- Complexity is O(kn), where k is the number of bits and n is the size of the list
- Deciding where to put each element can be done with the compact operation previously discussed
- Fastest for medium to large lists
- An IEEE-754 (single precision) floating point number has:
- 1 bit sign (S)
- 8 bit exponent field (E)
- Bias of 127, i.e. exponent = E - 127
- E = 0 and E = 255 are reserved for special numbers
- 23 bit mantissa field (M)
- For normal numbers, an implicit leading 1 bit in the mantissa is assumed
- i.e. stored mantissa bits 010101... are interpreted as 1.010101...
- Special bit patterns in S, E, M encode:
- Positive/negative zero
- Positive/negative infinity
- NaN
- Sub-normal numbers
- For normal numbers, the value is:
(-1)^S * (1 + (2^-23) * M) * 2^(E - 127)
- For sub-normal numbers:
- Without them, the gap between 0 and succ(0) (the smallest positive normal number) would be far greater than the gap between succ(0) and succ(succ(0))
- Sub-normals are non-zero numbers smaller in magnitude than the smallest normal IEEE-754 floating point number
- If E = 0, the implicit 1 bit in the mantissa is not used
- ULP (unit in the last place): given some number x, the distance between the pair of adjacent representable values a, b where a ≤ x ≤ b
- IEEE-754 requires operations to round to the nearest representable floating point number
- i.e. the computed result must be within 0.5 ULPs of the mathematically correct result
- If we try to sum the list [1000.0, 0.1, 0.1, 0.1, ...], the result stays at (or very close to) 1000.0, because in each operation 1000.0 + 0.1 most or all of the small addend is lost to rounding
- Worst error is O(n)
- Sort the list ascending first, then sum
- However, the accumulated small values will eventually become significantly larger than later values in the vector
- Accumulate the sum, but calculate a correction term
#include <stddef.h>

float kahan_add(float *xs, size_t len) {
    float sum = 0;
    float correction = 0;   // running compensation for lost low-order bits
    for (size_t i = 0; i < len; i++) {
        float corrected_next_term = xs[i] - correction;
        float new_sum = sum + corrected_next_term;            // low-order bits of the term are lost here
        correction = (new_sum - sum) - corrected_next_term;   // recover what was lost
        sum = new_sum;
    }
    return sum;
}
- Worst error is O(1)
- Not easy to parallelise
- Sort the list, then thread i performs a[i*s] + a[i*s + (s / 2)]
- where s is the stride that starts at 2 and doubles on each iteration
- Starts by adding similar-sized pairs, but the difference in magnitude increases as you go on
- Worst error is O(log(n))
- Poor thread blocking and memory access patterns
- Could instead have thread i perform a[i] + a[i + s]
- where s is the stride that starts at len(a) / 2 and halves on each iteration
- This needs careful original ordering of a[] (see the reduction sketch below)
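A sketch of the second indexing scheme as a block-level sum reduction in shared memory (names are illustrative; assumes the block size is a power of two and the launch passes blockDim.x * sizeof(float) of shared memory):

__global__ void block_sum(const float *in, float *out, int n) {
    extern __shared__ float sdata[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // stride s starts at half the block size and halves each iteration:
    // thread tid adds sdata[tid] + sdata[tid + s]
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = sdata[0];   // one partial sum per block
}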
- Absolute error: |v - v'|, where v is the true value and v' is the approximation
- Relative error: |v - v'| / |v|, when v ≠ 0
- Used for image analysis, matrix multiplication, spatial simulations
- Threads are still linearised into 1D, for example with a (2, 2, 2) dimensional block:
threadIdx = (0, 0, 0) → thread 0
threadIdx = (1, 0, 0) → thread 1
threadIdx = (0, 1, 0) → thread 2
threadIdx = (1, 1, 0) → thread 3
- etc.
- Accessing D[threadIdx.x][threadIdx.y], where D is a 2D array in global memory and the block size is 2x2:
- Thread (0, 0) → 0 accesses D + 0
- Thread (1, 0) → 1 accesses D + 2
- Thread (0, 1) → 2 accesses D + 1
- So we're not coalesced
- Accessing D[threadIdx.y][threadIdx.x] instead (we switch x and y):
- Thread (0, 0) → 0 accesses D + 0
- Thread (1, 0) → 1 accesses D + 1
- Thread (0, 1) → 2 accesses D + 2
- So we're coalesced!
- Accessing D[x][y] (x, y being threadIdx.x, threadIdx.y), where D is a 2D array in shared memory, the block size is K x K, and K is a multiple of 32:
- Thread (0, 0) → 0 accesses offset 0, bank 0
- Thread (1, 0) → 1 accesses offset K, bank 0
- Thread (2, 0) → 2 accesses offset 2K, bank 0
- Conflicts!
- Accessing D[y][x] instead (we switch x and y):
- Thread (0, 0) → 0 accesses offset 0, bank 0
- Thread (1, 0) → 1 accesses offset 1, bank 1
- Thread (2, 0) → 2 accesses offset 2, bank 2
- No conflicts!
- When transposing, we need to access both D[y][x] and D[x][y], meaning one of the accesses will have uncoalesced global memory accesses or shared memory bank conflicts
- Trick:
- Copy an element of D into shared memory with good (coalesced) indexing
- __syncthreads()
- Copy out a different element, also with good indexing
- Naive: each thread handles an output element, and performs the dot product of
the correct row and column.
- For the "compute to global memory access" ratio, we have two global reads, one multiplication, one addition. This gives us a 1:1 ratio - not good
- Clever:
- Do dot product for tiles of the outputs
- e.g. if we have 64x64 matrices, and 32x32 tiles, we have 2x2 tiles
- To calculate output tile O[0, 0] (see the kernel sketch below):
- First do the partial dot products of tiles A[0, 0] and B[0, 0]
- Then do the partial dot products of tiles A[1, 0] and B[0, 1]
- Sum the results to get the final O[0, 0] values
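A sketch of the tiled multiply for square N x N matrices (names are illustrative; assumes N is a multiple of TILE and a (TILE, TILE) block per output tile):

#define TILE 32

__global__ void matmul_tiled(const float *A, const float *B, float *O, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; t++) {
        // Each thread stages one element of the current A and B tiles, so each
        // global value is read once per tile instead of once per output element
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();

        // Partial dot product using only shared memory
        for (int k = 0; k < TILE; k++)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    O[row * N + col] = acc;
}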