Recall from the logistic regression computations in the basics module that we computed ŷ = σ(z), where z = wᵀx + b, and then used ŷ to compute the loss L(ŷ, y). A visual representation of what we were computing with logistic regression is shown below.
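To make this concrete, here is a minimal NumPy sketch of that forward calculation, assuming the cross-entropy loss used for logistic regression; the input x, parameters w and b, and label y are made-up values for illustration.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# made-up example: 3 input features, illustrative parameter values
x = np.array([0.5, -1.2, 0.3])
w = np.array([0.4, 0.1, -0.7])
b = 0.2
y = 1  # true label

z = np.dot(w, x) + b   # z = wᵀx + b
y_hat = sigmoid(z)     # ŷ = σ(z)
loss = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))  # cross-entropy loss L(ŷ, y)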
When computing a neural network, this logistic regression computation is repeated at each layer: instead of ŷ = σ(z), each layer computes a = σ(z). The figure below represents a small neural network.

In the logistic regression figure above, the parameters used to compute z are w and b. In the first layer of the network, the parameters are W^[1] and b^[1], which give the equation z^[1] = W^[1]x + b^[1]. Here W^[1] is a matrix whose rows are the wᵀ vectors of the nodes in the layer, while z^[1] and b^[1] are column vectors stacking the z and b values of those nodes. For example, the top node of layer 1 in the figure below is described by the equation z₁^[1] = w₁^[1]ᵀx + b₁^[1]; the superscript [1] denotes the first layer, and the subscript 1 denotes the first node in the layer. Similarly, the second node of layer 1 has the equation z₂^[1] = w₂^[1]ᵀx + b₂^[1].

The next step is to compute the activation for this layer, a^[1] = σ(z^[1]), which takes the place of x as the input to the following layer. Since z^[1] is a vector, a^[1] is a vector of the 'a' values for each node in the first layer: a₁^[1] = σ(z₁^[1]) for the first node and a₂^[1] = σ(z₂^[1]) for the second node. To compute the z value of the second layer, the equation z^[2] = W^[2]a^[1] + b^[2] is used, and, just like in layer one, the activation for layer 2 is calculated as a^[2] = σ(z^[2]). Since layer 2 is the final layer of the network in the figure below and has a single node, z^[2] and a^[2] are real numbers instead of vectors, and a^[2] becomes ŷ, so the loss can be computed as L(a^[2], y). For a network with more than 2 layers, repeat this process, computing a^[i] for each layer until reaching the end.
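The following is a minimal sketch of this forward propagation for a 2-layer network, assuming a sigmoid activation in both layers (as in the text) and made-up layer sizes: 3 input features, 4 hidden nodes, and 1 output node. The parameter values are random and purely illustrative.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

np.random.seed(0)
n_x, n_1, n_2 = 3, 4, 1         # input, hidden, and output sizes (illustrative)

x  = np.random.randn(n_x, 1)    # input column vector
W1 = np.random.randn(n_1, n_x)  # W^[1]: one row of weights per hidden node
b1 = np.zeros((n_1, 1))         # b^[1]
W2 = np.random.randn(n_2, n_1)  # W^[2]
b2 = np.zeros((n_2, 1))         # b^[2]

z1 = W1 @ x + b1                # z^[1] = W^[1]x + b^[1]
a1 = sigmoid(z1)                # a^[1] = σ(z^[1])
z2 = W2 @ a1 + b2               # z^[2] = W^[2]a^[1] + b^[2]
a2 = sigmoid(z2)                # a^[2] = σ(z^[2]) = ŷ

y = 1                           # made-up true label
loss = -(y * np.log(a2) + (1 - y) * np.log(1 - a2))  # L(a^[2], y)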
Computing the loss and implementing gradient descent will be done in a way similar to the right-to-left (or “backwards”) calculation discussed in the basics module. With logistic regression, we found dw and db (the derivatives of the loss with respect to the parameters w and b) by computing da, then dz, which was then used to arrive at dw and db. In a neural network, you repeat this for each layer, working from the last layer back to the first. For example, start by computing da^[2], then dz^[2], which can then be used to find dW^[2] and db^[2]. This will be discussed further in the backpropagation tutorial.
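As a preview, here is a minimal sketch of the output-layer gradient steps mentioned above (da^[2] → dz^[2] → dW^[2], db^[2]), assuming the sigmoid activation and cross-entropy loss from the forward-propagation sketch; the variables (y, a1, a2) continue from that sketch.

da2 = -(y / a2) + (1 - y) / (1 - a2)  # da^[2] = dL/da^[2]
dz2 = da2 * a2 * (1 - a2)             # dz^[2] = da^[2] · σ'(z^[2]), which simplifies to a^[2] - y
dW2 = dz2 @ a1.T                      # dW^[2] = dz^[2] a^[1]ᵀ
db2 = dz2                             # db^[2] = dz^[2]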
Sources: https://www.coursera.org/