2.5.3 Training Feedforward and Radial Basis Function Networks
Suppose you have chosen an FF or RBF network and you have already decided on the exact structure, the number of layers, and the number of neurons in the different layers. Denote this network by $\hat{y} = g(\theta, x)$, where $\theta$ is a parameter vector containing all the parametric weights of the network and $x$ is the input. Then it is time to train the network. This means that $\theta$ will be tuned so that the network approximates the unknown function producing your data. The training is done with the command NeuralFit, described in Chapter 7, Training Feedforward and Radial Basis Function Networks. Here follows a tutorial on the available training algorithms.
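Given a fully specified network, it can now be trained using a set of data containing N input-output pairs, $\{x_i, y_i\}_{i=1}^{N}$. With this data the mean square error (MSE) is defined by

$$V_N(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - g(\theta, x_i) \right)^2$$

Then, a good estimate for the parameter $\theta$ is one that minimizes the MSE; that is,

$$\hat{\theta} = \arg\min_{\theta} V_N(\theta)$$

Often it is more convenient to use the root-mean-square error (RMSE)

$$\mathrm{RMSE}(\theta) = \sqrt{V_N(\theta)}$$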
when evaluating the quality of a model during and after training, since it can be compared with the output signal directly. It is the RMSE value that is logged and written out during the training, and plotted when the training terminates.
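As a minimal sketch of these two error measures, the following Python snippet evaluates the MSE and RMSE for a hypothetical one-neuron model. The model g, the data, and the function names are assumptions made for illustration only; they are not part of the Neural Networks package.

```python
import math

def g(theta, x):
    """Hypothetical stand-in for a network: one tanh neuron with an output weight."""
    return theta[0] * math.tanh(theta[1] * x)

def mse(theta, data):
    """Mean square error over the N input-output pairs in data."""
    return sum((y - g(theta, x)) ** 2 for x, y in data) / len(data)

def rmse(theta, data):
    """Root-mean-square error, expressed in the units of the output signal."""
    return math.sqrt(mse(theta, data))

# Illustrative data generated from theta = (2.0, 0.5); not taken from the package.
data = [(x, 2.0 * math.tanh(0.5 * x)) for x in (-2.0, -1.0, 0.0, 1.0, 2.0)]

print(mse((2.0, 0.5), data))   # error at the generating parameters
print(rmse((1.0, 1.0), data))  # error at a different parameter vector
```

Because the RMSE is the square root of the MSE, both are minimized by the same parameter vector; only the scale of the reported number changes.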
The various training algorithms that apply to FF and RBF networks have one thing in common: they are iterative. Each starts with an initial parameter vector $\theta^0$, which you set with the command InitializeFeedForwardNet or InitializeRBFNet. Starting at $\theta^0$, the training algorithm iteratively decreases the MSE, written $V_N(\theta)$, by incrementally updating $\theta$ along the negative gradient of the MSE, as follows

$$\theta^{i+1} = \theta^{i} - \mu R \, \nabla_{\theta} V_N(\theta^{i})$$
Here, the matrix R may change the search direction from the negative gradient direction to a more favorable one. The purpose of the parameter $\mu$ is to control the size of the update increment in $\theta$ with each iteration i, while decreasing the value of the MSE. It is in the choice of R and $\mu$ that the various training algorithms in the Neural Networks package differ.
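The generic update can be sketched in a few lines of Python. The snippet below assumes the update has the form "new parameters = old parameters minus step length times R times gradient", and uses a hypothetical linear model with made-up data; the function names update and mse_gradient are illustrative, not package commands.

```python
def update(theta, grad, R, mu):
    """One step theta - mu * R * grad, with R a square matrix given as a list of rows."""
    direction = [sum(r_jk * g_k for r_jk, g_k in zip(row, grad)) for row in R]
    return [t - mu * d for t, d in zip(theta, direction)]

def mse_gradient(theta, data):
    """Gradient of the MSE for the hypothetical model g(theta, x) = theta[0] + theta[1] * x."""
    n = len(data)
    residuals = [y - (theta[0] + theta[1] * x) for x, y in data]
    return [-2.0 / n * sum(residuals),
            -2.0 / n * sum(r * x for r, (x, _) in zip(residuals, data))]

data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]   # illustrative points on the line y = 1 + 2 x
identity = [[1.0, 0.0], [0.0, 1.0]]           # R = I keeps the plain gradient direction

theta = [0.0, 0.0]
theta = update(theta, mse_gradient(theta, data), identity, mu=0.1)
print(theta)
```

Passing a different matrix R or a different step length mu changes the direction and size of the step without touching the rest of the loop, which is the sense in which the algorithms below differ only in these two choices.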
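If R is chosen to be the inverse of the Hessian of the MSE function $V_N(\theta)$, that is, the inverse of

$$\frac{d^2 V_N(\theta)}{d\theta^2} = \nabla^2 V_N(\theta) = \frac{2}{N} \sum_{i=1}^{N} \nabla_{\theta} g(\theta, x_i) \, \nabla_{\theta} g(\theta, x_i)^T - \frac{2}{N} \sum_{i=1}^{N} \left( y_i - g(\theta, x_i) \right) \nabla_{\theta}^2 g(\theta, x_i)$$

then the parameter update takes the form of the Newton algorithm. This search scheme can be motivated by a second-order Taylor expansion of the MSE function at the current parameter estimate $\theta^i$. There are several drawbacks to using Newton's algorithm. For example, if the Hessian is not positive definite, the update may point in an ascent direction, which will increase the MSE value. This possibility may be avoided with a commonly used alternative that builds R from the first part of the Hessian:

$$H = \frac{2}{N} \sum_{i=1}^{N} \nabla_{\theta} g(\theta, x_i) \, \nabla_{\theta} g(\theta, x_i)^T$$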
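To illustrate why the first part of the Hessian is a safer choice, the sketch below builds it as a scaled sum of outer products of the model gradient, H = (2/N) Σ ∇g ∇gᵀ, for a hypothetical one-neuron model. A sum of outer products cannot have negative eigenvalues, which the printed determinant and diagonal reflect. The model, inputs, and names are assumptions for illustration.

```python
import math

def grad_g(theta, x):
    """Gradient with respect to theta of the hypothetical model theta[0] * tanh(theta[1] * x)."""
    t = math.tanh(theta[1] * x)
    return [t, theta[0] * x * (1.0 - t * t)]

def gauss_newton_matrix(theta, xs):
    """H = (2/N) * sum_i grad_g(theta, x_i) grad_g(theta, x_i)^T as a 2x2 list of rows."""
    scale = 2.0 / len(xs)
    H = [[0.0, 0.0], [0.0, 0.0]]
    for x in xs:
        d = grad_g(theta, x)
        for j in range(2):
            for k in range(2):
                H[j][k] += scale * (d[j] * d[k])
    return H

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]        # illustrative inputs
H = gauss_newton_matrix([1.0, 1.0], xs)
determinant = H[0][0] * H[1][1] - H[0][1] * H[1][0]
print(H, determinant)
```

Note that this matrix needs only first derivatives of the model and does not depend on the output data, unlike the full Hessian.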
With H defined, the option Method may be used to choose from the following algorithms: Levenberg-Marquardt, Gauss-Newton, steepest descent, backpropagation, and FindMinimum.
Neural network minimization problems are often very ill-conditioned; that is, the Hessian of the MSE is often ill-conditioned. This makes the minimization problem harder to solve, and for such problems the Levenberg-Marquardt algorithm is often a good choice. For this reason, the Levenberg-Marquardt method is the default training algorithm of the package.
Instead of adapting the step length $\mu$ to guarantee a downhill step in each iteration of the parameter update, a diagonal matrix is added to H; in other words, R is chosen to be

$$R = \left( H + e^{\lambda} I \right)^{-1}$$

and $\mu = 1$.
The value of $\lambda$ is chosen automatically so that a downhill step is produced. At each iteration, the algorithm tries to decrease the value of $\lambda$ by some increment $\Delta\lambda$. If the current value of $\lambda$ does not decrease the MSE, then $\lambda$ is increased in steps of $\Delta\lambda$ until it does produce a decrease.
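The adaptation scheme just described can be sketched as follows. This is not the package's implementation: it assumes the damped matrix has the form R = (H + e^λ I)⁻¹ with unit step length, uses a hypothetical one-neuron model, and introduces illustrative names and settings (dlam for the increment, lam_max as a safety bound on the damping).

```python
import math

def g(theta, x):
    return theta[0] * math.tanh(theta[1] * x)       # hypothetical model

def grad_g(theta, x):
    t = math.tanh(theta[1] * x)
    return [t, theta[0] * x * (1.0 - t * t)]

def mse(theta, data):
    return sum((y - g(theta, x)) * (y - g(theta, x)) for x, y in data) / len(data)

def gradient_and_H(theta, data):
    """MSE gradient and H = (2/N) * sum of outer products of the model gradient."""
    scale = 2.0 / len(data)
    grad, H = [0.0, 0.0], [[0.0, 0.0], [0.0, 0.0]]
    for x, y in data:
        d, r = grad_g(theta, x), y - g(theta, x)
        for j in range(2):
            grad[j] -= scale * r * d[j]
            for k in range(2):
                H[j][k] += scale * (d[j] * d[k])
    return grad, H

def solve2(A, b):
    """Solve the 2x2 system A z = b by Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]

def levenberg_marquardt(theta, data, iterations=50, lam=0.0, dlam=1.0, lam_max=50.0):
    history = [mse(theta, data)]
    for _ in range(iterations):
        grad, H = gradient_and_H(theta, data)
        lam -= dlam                                  # first try a smaller lambda
        while lam <= lam_max:
            damp = math.exp(lam)                     # assumed damping term e^lambda
            A = [[H[0][0] + damp, H[0][1]], [H[1][0], H[1][1] + damp]]
            step = solve2(A, grad)
            candidate = [theta[0] - step[0], theta[1] - step[1]]
            if mse(candidate, data) < history[-1]:
                break                                # downhill step found
            lam += dlam                              # not downhill: raise lambda, retry
        else:
            return theta, history                    # no downhill step within lam_max
        theta = candidate
        history.append(mse(theta, data))
    return theta, history

# Illustrative data generated from theta = (2.0, 0.5).
data = [(0.5 * k, 2.0 * math.tanh(0.25 * k)) for k in range(-4, 5)]
theta_hat, history = levenberg_marquardt([1.0, 1.0], data)
print(theta_hat, history[0], history[-1])
```

Every accepted step lowers the MSE by construction, so the recorded history is strictly decreasing regardless of how well the damping happens to be tuned.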
The training is terminated prior to the specified number of iterations if any of the following conditions are satisfied:
Here PrecisionGoal is an option of NeuralFit and s is the largest eigenvalue of H.
Large values of $\lambda$ produce parameter update increments primarily along the negative gradient direction, while small values result in updates governed by the Gauss-Newton method. Accordingly, the Levenberg-Marquardt algorithm is a hybrid of the two relaxation methods, which are explained next.
The Gauss-Newton method is a fast and reliable algorithm that may be used for a large variety of minimization problems. However, this algorithm may not be a good choice for neural network problems if the Hessian is ill-conditioned; that is, if its eigenvalues span a large numerical range. If so, the algorithm will converge poorly, slowing down the training process.
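One way to quantify how far the eigenvalues are spread is the ratio of the largest to the smallest eigenvalue. The sketch below computes this ratio for symmetric 2x2 matrices in closed form; the two matrices are made-up examples, not values produced by the package.

```python
import math

def symmetric_2x2_eigenvalues(a, b, d):
    """Eigenvalues (smallest, largest) of the symmetric matrix [[a, b], [b, d]]."""
    mean = 0.5 * (a + d)
    radius = 0.5 * math.sqrt((a - d) ** 2 + 4.0 * b * b)
    return mean - radius, mean + radius

def eigenvalue_ratio(a, b, d):
    """Largest eigenvalue divided by the smallest; large values indicate ill-conditioning."""
    smallest, largest = symmetric_2x2_eigenvalues(a, b, d)
    return largest / smallest

ratio_well = eigenvalue_ratio(2.0, 0.0, 1.0)     # eigenvalues 1 and 2
ratio_ill = eigenvalue_ratio(1.0, 0.999, 1.0)    # eigenvalues near 0.001 and 1.999
print(ratio_well, ratio_ill)
```

In the second matrix the two parameters are almost interchangeable in their effect on the error, which is the typical source of ill-conditioning in over-parameterized networks.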
The training algorithm uses the Gauss-Newton method when the matrix R is chosen to be the inverse of H in the parameter update; that is,

$$R = H^{-1}$$
At each iteration, the step length parameter is set to unity, $\mu = 1$. This allows the full Gauss-Newton step, which is accepted only if the MSE decreases in value. Otherwise $\mu$ is halved repeatedly until a downhill step is obtained. Then, the algorithm continues with a new iteration.
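The step-halving rule can be sketched as below for a hypothetical two-parameter model. The code assumes R is the inverse of H = (2/N) Σ ∇g ∇gᵀ; the lower bound mu_min on the step length and all names are illustrative assumptions rather than package settings.

```python
import math

def g(theta, x):
    return theta[0] * math.tanh(theta[1] * x)       # hypothetical model

def grad_g(theta, x):
    t = math.tanh(theta[1] * x)
    return [t, theta[0] * x * (1.0 - t * t)]

def mse(theta, data):
    return sum((y - g(theta, x)) * (y - g(theta, x)) for x, y in data) / len(data)

def gradient_and_H(theta, data):
    """MSE gradient and H = (2/N) * sum of outer products of the model gradient."""
    scale = 2.0 / len(data)
    grad, H = [0.0, 0.0], [[0.0, 0.0], [0.0, 0.0]]
    for x, y in data:
        d, r = grad_g(theta, x), y - g(theta, x)
        for j in range(2):
            grad[j] -= scale * r * d[j]
            for k in range(2):
                H[j][k] += scale * (d[j] * d[k])
    return grad, H

def gauss_newton(theta, data, iterations=20, mu_min=1e-15):
    history = [mse(theta, data)]
    for _ in range(iterations):
        grad, H = gradient_and_H(theta, data)
        det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
        if det == 0.0:
            break                                    # H is singular: no step available
        # Direction H^{-1} grad, computed with Cramer's rule for the 2x2 case.
        step = [(grad[0] * H[1][1] - H[0][1] * grad[1]) / det,
                (H[0][0] * grad[1] - H[1][0] * grad[0]) / det]
        mu = 1.0                                     # always try the full step first
        while mu >= mu_min:
            candidate = [theta[0] - mu * step[0], theta[1] - mu * step[1]]
            if mse(candidate, data) < history[-1]:
                break                                # downhill step found
            mu /= 2.0                                # otherwise halve and retry
        else:
            break                                    # step length collapsed: stop
        theta = candidate
        history.append(mse(theta, data))
    return theta, history

# Illustrative data generated from theta = (2.0, 0.5).
data = [(0.5 * k, 2.0 * math.tanh(0.25 * k)) for k in range(-4, 5)]
theta_hat, history = gauss_newton([1.0, 1.0], data)
print(theta_hat, history[0], history[-1])
```

Restarting from a unit step in every iteration lets the method take full Gauss-Newton steps near the minimum, where they tend to work best, while the halving loop protects the early iterations.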
The training terminates prior to the specified number of iterations if any of the following conditions are satisfied:
Here PrecisionGoal is an option of NeuralFit.
The training algorithm reduces to the steepest descent form when

$$R = I$$

This means that the parameter vector $\theta$ is updated along the negative gradient direction of the MSE with respect to $\theta$.
The step length parameter $\mu$ in the parameter update is adaptable. At each iteration the value of $\mu$ is doubled. This gives a preliminary parameter update. If the MSE is not decreased by the preliminary parameter update, $\mu$ is halved until a decrease is obtained. The default initial value of the step length is $\mu = 20$, but you can choose another value with the StepLength option.
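The doubling-then-halving rule can be sketched as follows for a hypothetical linear model, with R taken as the identity. The initial step length of 20 follows the default quoted above; the lower bound mu_min, the data, and the function names are assumptions for illustration.

```python
def mse(theta, data):
    """MSE for the hypothetical model g(theta, x) = theta[0] + theta[1] * x."""
    return sum((y - theta[0] - theta[1] * x) ** 2 for x, y in data) / len(data)

def mse_gradient(theta, data):
    n = len(data)
    residuals = [y - theta[0] - theta[1] * x for x, y in data]
    return [-2.0 / n * sum(residuals),
            -2.0 / n * sum(r * x for r, (x, _) in zip(residuals, data))]

def steepest_descent(theta, data, iterations=200, mu=20.0, mu_min=1e-15):
    history = [mse(theta, data)]
    for _ in range(iterations):
        grad = mse_gradient(theta, data)
        mu *= 2.0                                    # optimistic: double the step length
        while mu >= mu_min:
            candidate = [t - mu * d for t, d in zip(theta, grad)]
            if mse(candidate, data) < history[-1]:
                break                                # preliminary update accepted
            mu /= 2.0                                # otherwise halve and retry
        else:
            break                                    # no downhill step found: stop
        theta = candidate
        history.append(mse(theta, data))
    return theta, history

data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]          # illustrative points on y = 1 + 2 x
theta_hat, history = steepest_descent([0.0, 0.0], data)
print(theta_hat, history[0], history[-1])
```

Because the step length carries over between iterations, a large initial value is quickly cut down to a workable size, and later iterations can grow it again when the error surface allows.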
The training with the steepest descent method will stop prior to the given number of iterations under the same conditions as the Gauss-Newton method.
Compared to the Levenberg-Marquardt and the Gauss-Newton algorithms, the steepest descent algorithm needs fewer computations in each iteration, since there is no matrix to be inverted. However, the steepest descent method is typically much less efficient than the other two methods, so that it is often worth the extra computational load to use the Levenberg-Marquardt or the Gauss-Newton algorithm.
The backpropagation algorithm is similar to the steepest descent algorithm, with the difference that the step length $\mu$ is kept fixed during the training. Hence the backpropagation algorithm is obtained by choosing R=I in the parameter update. The step length $\mu$ is set with the option StepLength, which has the default value $\mu = 0.1$.
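The training algorithm may be augmented by using a momentum parameter $\alpha$, which may be set with the Momentum option. The new algorithm is

$$\Delta\theta^{i+1} = -\mu \, \nabla_{\theta} V_N(\theta^{i}) + \alpha \, \Delta\theta^{i}$$

$$\theta^{i+1} = \theta^{i} + \Delta\theta^{i+1}$$

where $V_N(\theta)$ denotes the MSE. Note that the default value of $\alpha$ is 0.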
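A minimal sketch of a fixed-step update with momentum is given below. It assumes the common form in which each increment is the negative gradient times the step length plus the momentum parameter times the previous increment. The linear model, the data, and the function names are hypothetical; the step length 0.1 and momentum 0 mirror the defaults quoted in the text.

```python
def mse_gradient(theta, data):
    """Gradient of the MSE for the hypothetical model g(theta, x) = theta[0] + theta[1] * x."""
    n = len(data)
    residuals = [y - theta[0] - theta[1] * x for x, y in data]
    return [-2.0 / n * sum(residuals),
            -2.0 / n * sum(r * x for r, (x, _) in zip(residuals, data))]

def backprop_momentum(theta, data, iterations, mu=0.1, alpha=0.0):
    """Fixed step length mu; alpha = 0 gives plain backpropagation."""
    delta = [0.0] * len(theta)                       # previous increment, initially zero
    for _ in range(iterations):
        grad = mse_gradient(theta, data)
        delta = [-mu * g_k + alpha * d_k for g_k, d_k in zip(grad, delta)]
        theta = [t + d for t, d in zip(theta, delta)]
    return theta

data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]          # illustrative points on y = 1 + 2 x
plain = backprop_momentum([0.0, 0.0], data, iterations=500)
with_momentum = backprop_momentum([0.0, 0.0], data, iterations=500, alpha=0.5)
print(plain, with_momentum)
```

With a fixed step length there is no safeguard against uphill steps, so the step length must be small enough for the problem at hand; the value used here happens to be stable for this small data set.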
The idea of using momentum is motivated by the need to escape from local minima, which may be effective in certain problems. In general, however, the recommendation is to use one of the other, better training algorithms and to repeat the training a few times from different initial parameter values.
If you prefer, you can use the built-in Mathematica minimization command FindMinimum to train FF and RBF networks. This is done by setting the option Method -> FindMinimum in NeuralFit. All other choices for Method are algorithms specially written for neural network minimization, which should be superior to FindMinimum in most neural network problems. See the documentation on FindMinimum for further details.
Examples comparing the performance of the various algorithms discussed here may be found in Chapter 7, Training Feedforward and Radial Basis Function Networks.