Skip to content

About convolution

irenenikk edited this page Nov 23, 2019 · 5 revisions

About convolutional neural networks

Note that this is techincal documentation used to aid implementation, testing and make the behaviour of my layer implementations logical.

Sources:

This Standford course material

This blog post

Pytorch documentation

Convolution

In mathematics, convolution means the result of a signal affecting another one. It is often associated with blurring or smoothing input.

Convolutional layer

Compute local correlations from input using a fixed amount of filters, who typically learn to detect different things from the input. The process includes learnable weights and biases. The amount of filters defines the depth of the layer, where all the neurons in the same depth column look at the same area of the input.

Some notation

  • F = receptive field = the amount of neurons a neuron receives input from = the size of one side of the kernel matrix (FxF)
  • W = input size = the size of the previous layer
  • S = the size of the stride
  • P = the size of the zero padding on one side. Set to P=(F−S)/2 in order to keep the same layer size.
  • The size of the output = (W−F+2P)/S+1. If this is not integer, the hyperparameters are invalid.

Parameter sharing

In order to reduce the amount of parameters used, the same weights of the filter are used on the same depth column, or depth slice, of the image. In practice this means that the same filter pierces through all the channels of the input image.

With a filter of size 5x5, and a stride of 2, the feature map of the first weight and bias would be calculated with:

V[0,0,0] = np.sum(X[:5,:5,:] * W0) + b0
V[1,0,0] = np.sum(X[2:7,:5,:] * W0) + b0
V[2,0,0] = np.sum(X[4:9,:5,:] * W0) + b0
V[3,0,0] = np.sum(X[6:11,:5,:] * W0) + b0
.
.
.

Each filter has different weights, which construct the third dimension of the output volume.

V[0,0,1] = np.sum(X[:5,:5,:] * W1) + b1
V[1,0,1] = np.sum(X[2:7,:5,:] * W1) + b1
V[2,0,1] = np.sum(X[4:9,:5,:] * W1) + b1
V[3,0,1] = np.sum(X[6:11,:5,:] * W1) + b1
.
.
.
V[0,1,1] = np.sum(X[:5,2:7,:] * W1) + b1 (example of going along y)
V[2,3,1] = np.sum(X[4:9,6:11,:] * W1) + b1 (or along both)

The weights of the filter are multiplied with a block of the input in an elementwise manner, and summed to one single number. The resulting numbers of the first two dimensions of the input, which are calculated using the same weight tensor (= parameter sharing), are summed together, and the bias is added. Thus the third dimension of the filter weight must be the same size as the input's third dimension.

One usually defines K filters to use on an input image, each of them providing D weight matrices, when D is the depth of the input image. Each weight matrix is only used on its respective dimension. Also K biases are used.

The stride

The stride defines how a filter is moved along the input image. Most popular values are 1 and 2. The value can't be arbitrary, as the filters have to actually fit the whole image. If (input size - amount of neurons on conv layer + 2*zero padding)/stride + 1 is not an integer, the stride is impossible.

Zero padding

Adds zero valued cells to both sides of the input. Allows controlling the size of the output of the layer. When stride is 1, in order to keep the size of the output the same as the input, set padding to be (amount of neurons in conv layer - 1)/2.

The whole thing

can be implemented as one grand matrix multiplication, which is kind of cool.

Let's assume the input image shape is AxBxC.

All the possible blocks singled out by the filter are stretched to columns of size (ABC), creating a two dimensional matrix where each column is its own receptive field singled out by a filter both through width and length. Note that the process duplicates many of the values in the input.

The same process is repeated to the three-dimensional weights of the filters.

Finally, one can simply conduct np.dot(stretched_conv, stretched_input), which can in turn be reshaped to acquire the appropriate output.

Pooling layer

Reduce the amount of information passed on to the next layer.