Hands-On Deep Learning for Games

Optimizers

An optimizer is really nothing more than another strategy for driving the backpropagation of error through a network. As we learned back in Chapter 1, Deep Learning for Games, the base algorithm we use for backpropagation is gradient descent, along with its more advanced variant, stochastic gradient descent (SGD).
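To make that concrete, the following is a minimal mini-batch SGD sketch in NumPy. The toy data, the linear model, and the grad_loss helper are hypothetical stand-ins for whatever your network and loss happen to be, not code from this book's projects:

```python
import numpy as np

# Minimal mini-batch SGD sketch (illustrative only). The arrays X and y,
# the parameter vector theta, and grad_loss() are hypothetical placeholders.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 4))                       # toy inputs
true_w = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ true_w + rng.normal(scale=0.1, size=256)    # toy targets

theta = np.zeros(4)          # parameters to learn
lr, batch_size = 0.1, 32     # learning rate and batch size

def grad_loss(theta, Xb, yb):
    # Gradient of the mean squared error for a linear model.
    return 2.0 / len(yb) * Xb.T @ (Xb @ theta - yb)

for epoch in range(20):
    order = rng.permutation(len(X))        # shuffle the batch order each epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        theta -= lr * grad_loss(theta, X[idx], y[idx])  # theta <- theta - lr * grad
```

The random permutation on each epoch is the "stochastic" part: every pass sees the gradients in a different batch order.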

SGD varies the evaluation of the gradient by randomly shuffling the batch order during each training iteration. While SGD works well in most cases, it does not perform well in a GAN, due to a problem known as vanishing/exploding gradients, which arises when training multiple, but combined, networks. Remember, we are feeding the results of our generator directly into the discriminator. Instead, we look to more advanced optimizers. A graph showing the performance of the typical best optimizers is shown in the following diagram:



Performance comparison of various optimizers

All of the methods in the graph have their origin in SGD, but you can clearly see that the winner in this instance is Adam. There are cases where this does not hold, but Adam is the current favorite optimizer. It is something we have used extensively before, as you may have noticed, and you will likely continue using it in the future. However, let's take a look at each of the optimizers in a little more detail (a short Keras sketch after the list shows how to select each one), as follows:

  • SGD: This is one of the first optimizers we looked at, and it will often be our baseline to train against.
  • SGD with Nesterov: A problem SGD often faces is the wobble effect we saw in the network loss in one of the earlier training examples. Remember how, during training, our network loss fluctuated between two values, almost as if it were a ball rolling up and down a hill? In essence, that is exactly what is happening, but we can correct it by introducing a term we call momentum. An example of the effect momentum has on training is shown in the following diagram:


SGD with and without momentum

So now, instead of just letting the ball roll around blindly, we control its speed. We give it a push to carry it over some of those annoying bumps or wobbles, so it reaches the lowest point more efficiently.

As you may recall from studying the math of backpropagation, we follow the gradient in SGD to train the network to minimize error or loss. By introducing momentum, we make that descent more efficient: a velocity term accumulates past gradients, smoothing out oscillations and speeding progress along consistent directions. The Nesterov technique (sometimes just referred to as Nesterov momentum) refines this further by evaluating the gradient at the look-ahead point the velocity is about to carry us to, correcting the step before it is taken.
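Here is a compact sketch of the two updates on a toy one-dimensional loss. The loss function, learning rate, and momentum coefficient are illustrative choices, not values from this book's projects:

```python
# Sketch of classical momentum vs. Nesterov momentum on the toy loss
# f(x) = x**2, whose gradient is 2*x (all names here are illustrative).
grad = lambda x: 2.0 * x

lr, gamma = 0.1, 0.9   # learning rate and momentum coefficient
x, v = 5.0, 0.0        # parameter and velocity

# Classical momentum: accumulate a velocity, then step with it.
for _ in range(50):
    v = gamma * v + lr * grad(x)
    x -= v

x, v = 5.0, 0.0
# Nesterov momentum: evaluate the gradient at the look-ahead point
# x - gamma * v, which corrects the step before it is taken.
for _ in range(50):
    v = gamma * v + lr * grad(x - gamma * v)
    x -= v
```

The only difference between the two loops is where the gradient is evaluated; that one change is what gives Nesterov its "accelerated" behavior.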

  • AdaGrad: This method adapts the learning rate of each parameter individually, based on how frequently that parameter is updated, which makes it well suited to sparse or smaller datasets. The other main benefit is that you do not have to tune the learning rate by hand. However, a big weakness of this method is that the accumulated squared gradients keep growing, eventually shrinking the learning rate until the network stops learning.
  • AdaDelta: This method is an extension of AdaGrad that deals with the accumulating squared gradients and the vanishing learning rate. It does this by restricting the window of accumulated past gradients to a fixed size, rather than letting them grow indefinitely.
  • RMSProp: Developed by Geoff Hinton, often called the godfather of deep learning, this technique also manages AdaGrad's vanishing learning rate problem, by keeping a decaying average of recent squared gradients instead of accumulating all of them. As you can see in the graph, it is on par with AdaDelta for the sample shown.
  • Adaptive Moment Estimation (Adam): This is another technique that attempts to control the gradient using a more adaptive form of momentum. It is often described as momentum plus RMSProp, since it combines the best of both techniques: a momentum-like running mean of the gradient, and RMSProp-style per-parameter scaling by a running mean of squared gradients.
  • AdaMax: This method is not shown on the performance graph, but is worth mentioning. It is an extension of Adam that generalizes the norm used to scale each update from the L2 norm to the infinity norm, which can make the step sizes more stable.
  • Nadam: This is another method not on the graph; it is a combination of Nesterov-accelerated momentum and Adam. Vanilla Adam uses a momentum term that is not Nesterov-accelerated.
  • AMSGrad: This is a variation of Adam that works best in cases where Adam fails to converge or wobbles. The failure comes from the exponentially decaying average of squared gradients forgetting large, informative gradients; AMSGrad fixes it by taking the maximum, rather than the average, of the previously squared gradients. The difference is subtle and tends to matter more on smaller datasets, so keep this option in the back of your mind as a possible future tool.
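As promised, here is a sketch of how each of these optimizers can be selected in Keras. It assumes a TensorFlow 2-style Keras install, and the learning rates shown are just common defaults, not tuned values:

```python
from tensorflow.keras.optimizers import (SGD, Adagrad, Adadelta,
                                         RMSprop, Adam, Adamax, Nadam)

# Each optimizer described above maps to a Keras class; the hyperparameter
# values here are illustrative defaults, not tuned settings.
optimizers = {
    "sgd":      SGD(learning_rate=0.01),
    "nesterov": SGD(learning_rate=0.01, momentum=0.9, nesterov=True),
    "adagrad":  Adagrad(learning_rate=0.01),
    "adadelta": Adadelta(learning_rate=1.0),
    "rmsprop":  RMSprop(learning_rate=0.001),
    "adam":     Adam(learning_rate=0.001),
    "adamax":   Adamax(learning_rate=0.002),
    "nadam":    Nadam(learning_rate=0.002),
    "amsgrad":  Adam(learning_rate=0.001, amsgrad=True),
}

# Swapping optimizers is then a one-line change at compile time, for example:
# model.compile(optimizer=optimizers["adam"], loss="binary_crossentropy")
```

Because the optimizer is just an argument to compile, experimenting with the alternatives above costs almost nothing, which is worth doing when a GAN refuses to converge.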

That completes our short overview of optimizers; be sure to refer to the exercises at the end of the chapter for ways you can explore them further. In the next section, we build our own GAN that can generate textures we can use in games.