2.5.3 Training Feedforward and Radial Basis Function Networks

Suppose you have chosen an FF or RBF network and have already decided on its exact structure: the number of layers and the number of neurons in each layer. Denote this network by $\hat{y} = g(\theta, x)$, where $\theta$ is a parameter vector containing all the parametric weights of the network and $x$ is the input. Then it is time to train the network. This means that $\theta$ is tuned so that the network approximates the unknown function producing your data. The training is done with the command NeuralFit, described in Chapter 7, Training Feedforward and Radial Basis Function Networks. A tutorial on the available training algorithms follows.

Given a fully specified network, it can now be trained using a set of data containing $N$ input-output pairs, $\{x_i, y_i\}_{i=1}^{N}$. With this data the mean square error (MSE) is defined by

$$V_N(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - g(\theta, x_i) \right)^2$$

Then, a good estimate for the parameter $\theta$ is one that minimizes the MSE; that is,

$$\hat{\theta} = \arg\min_{\theta} V_N(\theta)$$

Often it is more convenient to use the root-mean-square error (RMSE)

$$\mathrm{RMSE}(\theta) = \sqrt{V_N(\theta)}$$

when evaluating the quality of a model during and after training, since it has the same units as the output signal and can be compared with it directly. It is the RMSE value that is logged and written out during the training, and plotted when the training terminates.
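As a minimal sketch of these definitions, the following Python snippet evaluates the MSE and the RMSE for a hypothetical one-weight network $g(\theta, x) = \tanh(\theta x)$. The model, the data, and all function names are illustrative assumptions and are not part of the Neural Networks package.

```python
import math

def g(theta, x):
    """Toy one-weight network (an assumption made for illustration)."""
    return math.tanh(theta * x)

def mse(theta, xs, ys):
    """Mean square error of g(theta, .) over the data pairs (xs, ys)."""
    return sum((y - g(theta, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def rmse(theta, xs, ys):
    """Root-mean-square error; same units as the output signal."""
    return math.sqrt(mse(theta, xs, ys))

# Noise-free data generated by the toy model itself with theta = 1.5.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [g(1.5, x) for x in xs]

print("MSE  at theta = 0.5:", mse(0.5, xs, ys))
print("RMSE at theta = 0.5:", rmse(0.5, xs, ys))
print("MSE  at theta = 1.5:", mse(1.5, xs, ys))
```

Because the data are generated without noise, the MSE vanishes at the generating parameter value and is positive elsewhere.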

The various training algorithms that apply to FF and RBF networks have one thing in common: they are iterative. They all start with an initial parameter vector $\theta^0$, which you set with the command InitializeFeedForwardNet or InitializeRBFNet. Starting at $\theta^0$, the training algorithm iteratively decreases the MSE, written $V_N(\theta)$, by incrementally updating $\theta$ along the negative gradient of the MSE, as follows

$$\theta^{i+1} = \theta^{i} - \mu \, R \, \nabla_\theta V_N(\theta^{i})$$

Here, the matrix $R$ may change the search direction from the negative gradient direction to a more favorable one. The purpose of the parameter $\mu$ is to control the size of the update increment in $\theta$ at each iteration $i$, while decreasing the value of the MSE. It is in the choice of $R$ and $\mu$ that the various training algorithms in the Neural Networks package differ.
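The generic update can be sketched in a few lines of Python. The function name `update`, the two-parameter vectors, and the example matrices below are hypothetical and serve only to show how the matrix R and the step length enter a single iteration.

```python
def update(theta, grad, R, mu):
    """One iteration: theta_next = theta - mu * R * grad (plain Python lists)."""
    direction = [
        sum(R[j][k] * grad[k] for k in range(len(grad)))
        for j in range(len(theta))
    ]
    return [t - mu * d for t, d in zip(theta, direction)]

theta = [1.0, -2.0]          # current parameter vector (made-up values)
grad = [0.5, -1.0]           # gradient of the MSE at theta (made-up values)

identity = [[1.0, 0.0], [0.0, 1.0]]   # R = I: step along the negative gradient
rescaled = [[2.0, 0.0], [0.0, 0.5]]   # another R: changes the search direction

print(update(theta, grad, identity, mu=0.1))
print(update(theta, grad, rescaled, mu=0.1))
```

With the identity matrix the step is a plain gradient step; any other choice of R rescales or rotates the search direction before the step length is applied.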

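If $R$ is chosen to be the inverse of the Hessian of the MSE function $V_N(\theta)$, that is, the inverse of

$$\frac{d^2 V_N(\theta)}{d\theta^2} = \nabla_\theta^2 V_N(\theta) = \frac{2}{N} \sum_{i=1}^{N} \nabla_\theta g(\theta, x_i) \, \nabla_\theta g(\theta, x_i)^T - \frac{2}{N} \sum_{i=1}^{N} \left( y_i - g(\theta, x_i) \right) \nabla_\theta^2 g(\theta, x_i)$$

then the parameter update assumes the form of the Newton algorithm. This search scheme can be motivated by a second-order Taylor expansion of the MSE function at the current parameter estimate $\theta^i$. There are several drawbacks to using Newton's algorithm. For example, if the Hessian is not positive definite, the $\theta$ update may point uphill, which increases the MSE value. This possibility may be avoided with a commonly used alternative for $R$ that is based on the first part of the Hessian:

$$H = \frac{2}{N} \sum_{i=1}^{N} \nabla_\theta g(\theta, x_i) \, \nabla_\theta g(\theta, x_i)^T$$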
With $H$ defined, the option Method may be used to choose from the following algorithms:

- Levenberg-Marquardt
- Gauss-Newton
- Steepest descent
- Backpropagation
- FindMinimum

Levenberg-Marquardt

Neural network minimization problems are often very ill-conditioned; that is, the Hessian of the MSE is often ill-conditioned. This makes the minimization problem harder to solve, and for such problems the Levenberg-Marquardt algorithm is often a good choice. For this reason, the Levenberg-Marquardt algorithm is the default training algorithm of the package.

Instead of adapting the step length $\mu$ to guarantee a downhill step in each iteration of the parameter update, a diagonal matrix is added to $H$; in other words, $R$ is chosen to be

$$R = \left( H + e^{\lambda} I \right)^{-1}$$

and $\mu = 1$.

The value of $\lambda$ is chosen automatically so that a downhill step is produced. At each iteration, the algorithm tries to decrease the value of $\lambda$ by some increment $\Delta\lambda$. If the current value of $\lambda$ does not decrease the MSE, then $\lambda$ is increased in steps of $\Delta\lambda$ until it does produce a decrease.
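The adaptation of the damping value can be sketched as follows. This is a toy illustration under stated assumptions, not the package's implementation: it assumes the damped form R = (H + e^λ I)^(-1) with a fixed step length of 1, uses a hypothetical one-weight model g(θ, x) = tanh(θx) so that H is a scalar, and stops with a made-up `max_lam` give-up rule rather than the package's own termination conditions.

```python
import math

def g(theta, x):
    return math.tanh(theta * x)            # toy one-weight network (assumption)

def dg(theta, x):
    return x * (1.0 - math.tanh(theta * x) ** 2)   # derivative of g w.r.t. theta

def mse(theta, xs, ys):
    return sum((y - g(theta, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradient_and_h(theta, xs, ys):
    """Gradient of the MSE and the (scalar) first part of its Hessian."""
    n = len(xs)
    grad = -2.0 / n * sum((y - g(theta, x)) * dg(theta, x) for x, y in zip(xs, ys))
    h = 2.0 / n * sum(dg(theta, x) ** 2 for x in xs)
    return grad, h

def levenberg_marquardt(theta, xs, ys, iterations=30, lam=0.0,
                        delta_lam=1.0, max_lam=30.0):
    for _ in range(iterations):
        current = mse(theta, xs, ys)
        grad, h = gradient_and_h(theta, xs, ys)
        lam -= delta_lam                   # first try a smaller lambda
        while True:
            candidate = theta - grad / (h + math.exp(lam))   # step length 1
            if mse(candidate, xs, ys) < current:
                break                      # downhill step found
            lam += delta_lam               # otherwise increase lambda
            if lam > max_lam:
                return theta               # hypothetical give-up rule
        theta = candidate
    return theta

xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [g(1.5, x) for x in xs]               # noise-free data, true theta = 1.5
print(levenberg_marquardt(0.5, xs, ys))
```

Working with λ in the exponent means that each additive change of Δλ rescales the damping term by a constant factor, which lets the damping vary over many orders of magnitude in a few steps.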

The training is terminated prior to the specified number of iterations if any of the following conditions are satisfied:


Here PrecisionGoal is an option of NeuralFit and s is the largest eigenvalue of H.

Large values of $\lambda$ produce parameter update increments primarily along the negative gradient direction, while small values result in updates governed by the Gauss-Newton method. Accordingly, the Levenberg-Marquardt algorithm is a hybrid of the steepest descent and Gauss-Newton methods, which are explained next.


Gauss-Newton

The Gauss-Newton method is a fast and reliable algorithm that may be used for a large variety of minimization problems. However, this algorithm may not be a good choice for neural network problems if the Hessian is ill-conditioned; that is, if its eigenvalues span a large numerical range. If so, the algorithm will converge poorly, slowing down the training process.

The training algorithm uses the Gauss-Newton method when the matrix $R$ is chosen to be the inverse of $H$; that is,

$$R = H^{-1}$$

At each iteration, the step length parameter is first set to unity, $\mu = 1$. This allows the full Gauss-Newton step, which is accepted only if the MSE decreases in value. Otherwise $\mu$ is halved repeatedly until a downhill step is obtained. Then the algorithm continues with a new iteration.
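The step-halving logic can be sketched in Python as follows. The one-weight model g(θ, x) = tanh(θx), the data, and the `min_mu` threshold are assumptions made for illustration; with a single parameter, H is a scalar and its inverse is a plain division.

```python
import math

def g(theta, x):
    return math.tanh(theta * x)            # toy one-weight network (assumption)

def dg(theta, x):
    return x * (1.0 - math.tanh(theta * x) ** 2)   # derivative of g w.r.t. theta

def mse(theta, xs, ys):
    return sum((y - g(theta, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradient_and_h(theta, xs, ys):
    n = len(xs)
    grad = -2.0 / n * sum((y - g(theta, x)) * dg(theta, x) for x, y in zip(xs, ys))
    h = 2.0 / n * sum(dg(theta, x) ** 2 for x in xs)
    return grad, h

def gauss_newton(theta, xs, ys, iterations=20, min_mu=1e-15):
    for _ in range(iterations):
        current = mse(theta, xs, ys)
        grad, h = gradient_and_h(theta, xs, ys)
        if h == 0.0:
            break                          # H is singular; no step can be formed
        mu = 1.0                           # start with the full Gauss-Newton step
        candidate = theta - mu * grad / h
        while mse(candidate, xs, ys) >= current:
            mu /= 2.0                      # halve until the step goes downhill
            if mu < min_mu:
                return theta               # hypothetical stop: step too small
            candidate = theta - mu * grad / h
        theta = candidate
    return theta

xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [g(1.5, x) for x in xs]               # noise-free data, true theta = 1.5
print(gauss_newton(0.5, xs, ys))
```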

The training terminates prior to the specified number of iterations if any of the following conditions are satisfied:

Here PrecisionGoal is an option of NeuralFit.

Steepest descent

The training algorithm reduces to the steepest descent form when

$$R = I$$

This means that the parameter vector $\theta$ is updated along the negative gradient direction of the MSE with respect to $\theta$.

The step length parameter $\mu$ in the parameter update is adaptive. At each iteration the value of $\mu$ is doubled, which gives a preliminary parameter update. If the MSE is not decreased by the preliminary update, $\mu$ is halved until a decrease is obtained. The default initial value of the step length is $\mu = 20$, but you can choose another value with the StepLength option.
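A minimal Python sketch of this doubling and halving scheme is given below. The one-weight model g(θ, x) = tanh(θx), the data, the initial step length of 1.0 chosen for this tiny problem, and the `min_mu` threshold are all assumptions for illustration and do not reproduce the package's implementation.

```python
import math

def g(theta, x):
    return math.tanh(theta * x)            # toy one-weight network (assumption)

def mse(theta, xs, ys):
    return sum((y - g(theta, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradient(theta, xs, ys):
    """Gradient of the MSE with respect to theta."""
    n = len(xs)
    return -2.0 / n * sum(
        (y - g(theta, x)) * x * (1.0 - math.tanh(theta * x) ** 2)
        for x, y in zip(xs, ys)
    )

def steepest_descent(theta, xs, ys, mu, iterations=200, min_mu=1e-15):
    for _ in range(iterations):
        current = mse(theta, xs, ys)
        grad = gradient(theta, xs, ys)
        mu *= 2.0                          # doubled step: preliminary update
        candidate = theta - mu * grad
        while mse(candidate, xs, ys) >= current:
            mu /= 2.0                      # halve until the MSE decreases
            if mu < min_mu:
                return theta               # hypothetical stop: step too small
            candidate = theta - mu * grad
        theta = candidate
    return theta

xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [g(1.5, x) for x in xs]               # noise-free data, true theta = 1.5
print(steepest_descent(0.5, xs, ys, mu=1.0))
```

The step length carries over from one iteration to the next, so it grows while steps keep succeeding and shrinks as soon as a step overshoots.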

The training with the steepest descent method will stop prior to the given number of iterations under the same conditions as the Gauss-Newton method.

Compared to the Levenberg-Marquardt and the Gauss-Newton algorithms, the steepest descent algorithm needs fewer computations in each iteration, since there is no matrix to be inverted. However, the steepest descent method is typically much less efficient than the other two methods, so that it is often worth the extra computational load to use the Levenberg-Marquardt or the Gauss-Newton algorithm.


Backpropagation

The backpropagation algorithm is similar to the steepest descent algorithm, with the difference that the step length $\mu$ is kept fixed during the training. Hence the backpropagation algorithm is obtained by choosing $R = I$ in the parameter update and holding $\mu$ constant. The step length $\mu$ is set with the option StepLength, which has the default value $\mu = 0.1$.

The training algorithm may be augmented by using a momentum parameter $\alpha$, which may be set with the Momentum option. With $\nabla_\theta V_N$ denoting the gradient of the MSE, the new algorithm is

$$\Delta\theta^{i+1} = -\mu \, \nabla_\theta V_N(\theta^{i}) + \alpha \, \Delta\theta^{i}$$

$$\theta^{i+1} = \theta^{i} + \Delta\theta^{i+1}$$

Note that the default value of $\alpha$ is 0.
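The fixed-step update with momentum can be sketched as follows, assuming the common form in which the previous increment, scaled by the momentum parameter, is added to the current gradient increment. The one-weight model g(θ, x) = tanh(θx), the data, the iteration count, and the momentum value 0.5 are illustrative assumptions.

```python
import math

def g(theta, x):
    return math.tanh(theta * x)            # toy one-weight network (assumption)

def mse(theta, xs, ys):
    return sum((y - g(theta, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradient(theta, xs, ys):
    """Gradient of the MSE with respect to theta."""
    n = len(xs)
    return -2.0 / n * sum(
        (y - g(theta, x)) * x * (1.0 - math.tanh(theta * x) ** 2)
        for x, y in zip(xs, ys)
    )

def backpropagation(theta, xs, ys, iterations, mu=0.1, alpha=0.0):
    delta = 0.0                            # previous increment, initially zero
    for _ in range(iterations):
        # Fixed step along the negative gradient plus a fraction of the last step.
        delta = -mu * gradient(theta, xs, ys) + alpha * delta
        theta += delta
    return theta

xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [g(1.5, x) for x in xs]               # noise-free data, true theta = 1.5

plain = backpropagation(0.5, xs, ys, iterations=300)
with_momentum = backpropagation(0.5, xs, ys, iterations=300, alpha=0.5)
print("MSE without momentum:", mse(plain, xs, ys))
print("MSE with momentum:   ", mse(with_momentum, xs, ys))
```

Unlike the adaptive schemes, nothing here checks that the MSE decreases, so a step length that is too large for the problem can make the iteration diverge.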

The idea of using momentum is motivated by the need to escape from local minima, which may be effective in certain problems. In general, however, the recommendation is to use one of the other, better training algorithms and to repeat the training a few times from different initial parameter values.


FindMinimum

If you prefer, you can use the built-in Mathematica minimization command FindMinimum to train FF and RBF networks. This is done by setting the option Method -> FindMinimum in NeuralFit. All other choices for Method are algorithms specially written for neural network minimization, which should be superior to FindMinimum in most neural network problems. See the documentation on FindMinimum for further details.

Examples comparing the performance of the various algorithms discussed here may be found in Chapter 7, Training Feedforward and Radial Basis Function Networks.
