Partial differentiation and the chain rule
Before we get into the details of calculating each weight, let's review a little calculus and differentiation. As you may recall from calculus class, you can determine the rate of change at any point on a function by differentiating it. A calculus refresher is shown in the following diagram:
In the diagram, we have a nonlinear function, f, that describes the equation of the blue line. We can determine the slope (rate of change) at any point by differentiating f to f' and solving. Recall that we can also locate the function's local and global minima or maxima using this derivative, as shown in the diagram. Simple differentiation lets us solve for one variable, but we need to solve for multiple weights, so we will use partial derivatives, that is, differentiation with respect to one variable at a time.
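To make this concrete, both the analytic derivative and its numerical check can be sketched in a few lines. The function `f` and the point `(3.0, 5.0)` below are made-up for illustration, not from the text; this is a minimal sketch that treats one variable as a constant while differentiating with respect to the other:

```python
# Toy "cost" with two weights; a made-up example for illustration.
def f(w1, w2):
    return w1 ** 2 * w2 + w2 ** 3

# Analytic partial derivative with respect to w1, treating w2 as a constant:
# d/dw1 (w1^2 * w2 + w2^3) = 2 * w1 * w2
def df_dw1(w1, w2):
    return 2 * w1 * w2

# Numerical check using a central finite difference with a small step h.
h = 1e-6
w1, w2 = 3.0, 5.0
numeric = (f(w1 + h, w2) - f(w1 - h, w2)) / (2 * h)
print(numeric, df_dw1(w1, w2))  # both are close to 30.0
```

Note that the `w2 ** 3` term vanishes from the partial derivative, exactly because `w2` is held constant.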
As you may recall, partial differentiation allows us to differentiate with respect to a single variable while treating the other variables as constants. Let's go back to our Cost function and see how to differentiate it with respect to a single weight:
- $C$ is our cost function, described by the following quadratic (mean squared error) form: $C = \frac{1}{2n}\sum_{x}\left\|y(x) - a(x)\right\|^{2}$, where $y(x)$ is the expected output and $a(x)$ is the network's output for input $x$
- We can differentiate this function with respect to a single weight, $w$, as follows: $\frac{\partial C}{\partial w}$
- If we collect all of these partial derivatives together, we get the gradient vector for our Cost function, $\nabla C$, denoted by the following: $\nabla C = \left(\frac{\partial C}{\partial w_{1}}, \frac{\partial C}{\partial w_{2}}, \ldots, \frac{\partial C}{\partial w_{n}}\right)$
- This gradient defines a vector direction that we want to negate and use to minimize the Cost function. In the case of our previous example, there are over 13,000 components to this vector. These correspond to over 13,000 weights in the network that we need to optimize. That is a lot of partial derivatives we need to combine in order to calculate the gradient. Fortunately, the chain rule in calculus can come to our rescue and greatly simplify the math. Recall that the chain rule is defined by the following: $\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}$
- This now allows us to define the gradient for a single weight, $w^{l}_{jk}$, using the chain rule, as such: $\frac{\partial C}{\partial w^{l}_{jk}} = \frac{\partial C}{\partial a^{l}_{j}} \cdot \frac{\partial a^{l}_{j}}{\partial w^{l}_{jk}}$
- Here, $k$ represents the input number and $j$ the neuron position. Note how we now need to take the partial derivative of the activation function, $a$, for the given neuron, which is again summarized by the following: $a^{l}_{j} = f\left(\sum_{k} w^{l}_{jk}\, a^{l-1}_{k}\right)$
The superscript notation $l$ denotes the current layer and $l-1$ denotes the previous layer. $a^{l-1}_{k}$ denotes either the input or the output from the previous layer. $f$ denotes the activation function; recall that we previously used the Step and ReLU functions for this role.
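The activation is a composition: $f$ applied to a weighted sum that itself depends on the weight. That is exactly where the chain rule applies, and it is easy to sanity-check numerically. The composition below is a made-up example (one weight, a fixed input of 0.7, and tanh as a smooth stand-in activation), not from the text:

```python
import math

def z(w):
    x = 0.7               # a fixed input, chosen arbitrarily
    return w * x          # inner function: the weighted sum

def f(v):
    return math.tanh(v)   # outer function: a smooth activation stand-in

def f_prime(v):
    return 1.0 - math.tanh(v) ** 2   # derivative of tanh

# Chain rule: da/dw = f'(z(w)) * dz/dw = f'(z(w)) * x
w0 = 1.3
analytic = f_prime(z(w0)) * 0.7

# Central finite difference of the full composition a(w) = f(z(w))
h = 1e-6
numeric = (f(z(w0 + h)) - f(z(w0 - h))) / (2 * h)
print(analytic, numeric)  # the two values agree to several decimal places
```

The point of the chain rule is visible here: we never differentiate the composition directly, only each simple piece, and multiply the results.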
- Then, we take the partial derivative of this function with respect to the weight, like so: $\frac{\partial a^{l}_{j}}{\partial w^{l}_{jk}} = f'\left(\sum_{k} w^{l}_{jk}\, a^{l-1}_{k}\right) a^{l-1}_{k}$
For convenience, we define the following: $z^{l}_{j} = \sum_{k} w^{l}_{jk}\, a^{l-1}_{k}$
- At this point, things may look a lot more complicated than they are. Try to understand all the subtleties of the notation, and remember that all we are looking at is essentially the partial derivative of the Cost function with respect to an individual weight. All that the extra notation does is allow us to index the individual weight, neuron, and layer. We can then express this as follows: $\frac{\partial C}{\partial w^{l}_{jk}} = \frac{\partial C}{\partial a^{l}_{j}}\, f'(z^{l}_{j})\, a^{l-1}_{k}$
- Again, all we are doing is defining the gradient ($\frac{\partial C}{\partial w^{l}_{jk}}$) for the weight at the $k$th input, $j$th neuron, and layer $l$. Along with gradient descent, we need to backpropagate the adjustment to the weights using the preceding base formula. For the output layer (the last layer, $L$), this can now be summarized with an error term, $\delta^{L}_{j}$, as follows: $\delta^{L}_{j} = \left(a^{L}_{j} - y_{j}\right) f'(z^{L}_{j})$
- For an internal or a hidden layer, the equation comes out to this: $\delta^{l}_{j} = f'(z^{l}_{j}) \sum_{i} w^{l+1}_{ij}\, \delta^{l+1}_{i}$
- And with a few more substitutions and manipulations of the general equation, we end up with this: $\frac{\partial C}{\partial w^{l}_{jk}} = a^{l-1}_{k}\, \delta^{l}_{j}$
Here, f' denotes the derivative of the activation function.
The preceding equations allow us to run the network forward and then propagate the errors back through it, using the following procedure:
- First, calculate the weighted inputs $z^{l}$ and activations $a^{l}$ for each layer, starting with the input layer, and propagate forward.
- Then, evaluate the error term at the output layer using $\delta^{L}_{j} = \left(a^{L}_{j} - y_{j}\right) f'(z^{L}_{j})$.
- Next, evaluate the remaining error terms for each layer using $\delta^{l}_{j} = f'(z^{l}_{j}) \sum_{i} w^{l+1}_{ij}\, \delta^{l+1}_{i}$, starting with the last hidden layer and propagating backward.
- Finally, use $\frac{\partial C}{\partial w^{l}_{jk}} = a^{l-1}_{k}\, \delta^{l}_{j}$ to obtain the required partial derivatives for the weights in each layer.
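The whole procedure can be sketched end to end in plain Python. Everything here is a hedged illustration rather than the text's implementation: the 2-2-1 network shape, the sigmoid activation (a differentiable stand-in for the Step and ReLU functions mentioned earlier), the bias terms, the OR-gate toy data, and the learning rate are all assumptions made for this example:

```python
import math
import random

def f(z):          # sigmoid activation: a smooth, differentiable stand-in
    return 1.0 / (1.0 + math.exp(-z))

def f_prime(z):    # its derivative: f'(z) = f(z) * (1 - f(z))
    s = f(z)
    return s * (1.0 - s)

random.seed(0)
sizes = [2, 2, 1]  # assumed 2-input, 2-hidden, 1-output network
# weights[l][j][k] plays the role of w^l_jk: layer l, neuron j, input k.
weights = [[[random.uniform(-1, 1) for _ in range(sizes[l])]
            for _ in range(sizes[l + 1])] for l in range(len(sizes) - 1)]
# Bias terms are an extra assumption; the equations in the text omit them.
biases = [[random.uniform(-1, 1) for _ in range(sizes[l + 1])]
          for l in range(len(sizes) - 1)]

def forward(x):
    """Step 1: compute z^l and a^l for every layer, propagating forward."""
    zs, acts = [], [x]
    for w_l, b_l in zip(weights, biases):
        z = [sum(w_jk * a_k for w_jk, a_k in zip(w_j, acts[-1])) + b_j
             for w_j, b_j in zip(w_l, b_l)]
        zs.append(z)
        acts.append([f(z_j) for z_j in z])
    return zs, acts

def backward(x, y, lr=0.5):
    zs, acts = forward(x)
    # Step 2: output-layer error, delta^L_j = (a^L_j - y_j) * f'(z^L_j).
    delta = [(a_j - y_j) * f_prime(z_j)
             for a_j, y_j, z_j in zip(acts[-1], y, zs[-1])]
    for l in range(len(weights) - 1, -1, -1):
        # Step 3: hidden-layer error, computed BEFORE weights[l] changes:
        # delta^{l-1}_k = f'(z^{l-1}_k) * sum_j w^l_jk * delta^l_j
        prev_delta = None
        if l > 0:
            prev_delta = [f_prime(zs[l - 1][k]) *
                          sum(weights[l][j][k] * delta[j]
                              for j in range(len(delta)))
                          for k in range(len(zs[l - 1]))]
        # Step 4: gradient descent using dC/dw^l_jk = a^{l-1}_k * delta^l_j.
        for j, d_j in enumerate(delta):
            biases[l][j] -= lr * d_j
            for k, a_k in enumerate(acts[l]):
                weights[l][j][k] -= lr * d_j * a_k
        delta = prev_delta

# Tiny demonstration on OR-gate data (an arbitrary choice of toy task).
data = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [1])]

def cost():
    return sum((forward(x)[1][-1][0] - y[0]) ** 2
               for x, y in data) / (2 * len(data))

cost_before = cost()
for _ in range(2000):
    for x, y in data:
        backward(x, y)
cost_after = cost()
print(cost_before, "->", cost_after)  # the cost drops as weights adjust
```

Note the ordering inside `backward`: the error for the previous layer must be computed with the current weights before those weights are updated, which is why `prev_delta` is calculated first.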
It may take you a few reads through this section to grasp all the concepts. It can also be useful to run the previous examples and watch the training, imagining how each of the weights is being updated. We are by no means completely done here; there are a couple more steps, with automatic differentiation being one of them. Unless you are developing your own low-level networks, though, a basic understanding of this math should give you a better sense of what is involved in training a neural network. In the next section, we get back to some more hands-on basics and put our new knowledge to use by building a neural network agent.