Training a neural network
Neural networks are a powerful model for learning non-linear problems. One of their greatest assets is that the creator of the model doesn't have to hand-pick features, as in most statistical methods; the model learns them itself.
Neural networks are trained in a supervised way: a network is given an input, which it passes through its layers to produce an output. It is then shown the desired output and adjusts its own parameters, based on the produced and desired outputs, in order to produce a more accurate result on the next pass. Neural networks vary in depth and width, but a rather simple one can get a long way.
Each layer in the network consists of three components: weights, biases and activation functions. Weights and biases are learned from the data, while activation functions are predefined. In short, after enough iterations, the weights and biases start extracting meaning from the data they're given. The input of a layer is multiplied by the weights of the layer, and the bias of each neuron is added to the sum. The result is then passed to an activation function. A non-linear activation function allows solving more complicated problems.
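As a sketch of the layer computation described above (the function name `dense_layer` and the choice of sigmoid as the activation are my own, not necessarily what this project uses):

```python
import math

def dense_layer(x, weights, biases):
    """One fully connected layer: multiply the input by the weights,
    add each neuron's bias, then apply a non-linear activation
    (sigmoid here, as an example)."""
    out = []
    for w_row, b in zip(weights, biases):
        # weighted sum of the inputs plus the neuron's bias
        z = sum(w * xi for w, xi in zip(w_row, x)) + b
        # non-linear activation: sigmoid squashes the sum into (0, 1)
        out.append(1.0 / (1.0 + math.exp(-z)))
    return out

# 2 inputs -> 3 neurons: one weight row and one bias per neuron
weights = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.6]]
biases = [0.0, 0.1, -0.1]
out = dense_layer([1.0, 2.0], weights, biases)
print(out)  # three activations, one per neuron
```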
The input of the whole network is the input of the first layer. Each layer produces an output vector, which is then given to the next layer. This is why the layer sizes have to match.
In practice, the network goes through data in batches as it is computationally efficient. Each batch is a matrix, with rows consisting of data points. For example with a batch of size 5 with a data point being a vector of dimension 2, the network would take a 5x2 matrix as an input.
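The batching above can be sketched by applying a layer to a whole matrix at once; this toy version (no real linear algebra library, just nested lists) only shows how the shapes work out:

```python
def forward_batch(batch, weights, biases):
    """Apply one layer to a whole batch: each row of the batch is one
    data point, so the output has one row per data point as well."""
    return [[sum(w * x for w, x in zip(w_row, row)) + b
             for w_row, b in zip(weights, biases)]
            for row in batch]

# batch of size 5, each data point a vector of dimension 2 -> 5x2 matrix
batch = [[1.0, 2.0], [0.5, -1.0], [0.0, 0.0], [2.0, 2.0], [-1.0, 1.0]]
weights = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.6]]  # a layer with 3 neurons
biases = [0.0, 0.1, -0.1]
out = forward_batch(batch, weights, biases)
print(len(out), len(out[0]))  # 5 rows in, 5 rows out, 3 outputs per row
```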
I'm implementing a fully connected network, which means that every neuron of a layer is connected to every neuron of the previous layer. The sizes of the layers can (and often do) differ.
The size of the output of the network is the size of the output of the last layer.
Gradient descent is used to minimize the loss of a network. Basic functionality:
Given an input and its true label
- `output = feed_forward(input)`
  - Calculate the current prediction of the network for the input.
  - In one forward pass, the input is run through the network: it is multiplied by the weights, the biases are added, and the result is passed through the activation functions.
- `loss = label - output`
  - Since we want to train the network, we have to know how wrong its predictions were. This can be done simply by taking the difference of the desired and actual outputs (as done here), or more commonly with a loss function such as mean squared error or cross entropy, the latter being a common choice in classification problems.
- `gradient = loss.backpropagate()`
  - Backpropagation determines how much each weight affects the output of the network.
  - Each weight in the network is assigned a gradient, obtained from the derivative of the loss function with respect to that weight.
- `gradient.step()`
  - Update the weights and biases according to the gradients computed in backpropagation. In practice this means subtracting each gradient, multiplied by a learning rate, from the corresponding weight.
  - Thinking of the curve of the loss function, we want to compute the gradient, i.e. the derivative of the loss function, and take a step in the opposite direction.
Basically, what I'm implementing in this project are all the functions one needs for gradient descent: feed forward, backpropagation, the parameter update, and a number of loss and activation functions.