Newton's Method

One significant advantage the Wolfram Language provides is that it can symbolically compute derivatives. This means that when you specify Method->"Newton" and the function is explicitly differentiable, the symbolic derivative will be computed automatically. On the other hand, if the function is not in a form that can be explicitly differentiated, the Wolfram Language will use finite difference approximations to compute the Hessian, using structural information to minimize the number of evaluations required. Alternatively, you can specify a Wolfram Language expression, which will give the Hessian with numerical values of the variables.

This loads a package that contains some utility functions:
Click for copyable input
In this example, FindMinimum computes the Hessian symbolically and substitutes numerical values for x and y when needed:
Click for copyable input
This defines a function that is only intended to evaluate for numerical values of the variables:
Click for copyable input

The derivative of this function cannot be found symbolically since the function has been defined only to evaluate with numerical values of the variables.

This shows the steps taken by FindMinimum when it has to use finite differences to compute the gradient and Hessian:
Click for copyable input

When the gradient and Hessian are both computed using finite differences, the error in the Hessian may be quite large and it may be better to use a different method. In this case, FindMinimum does find the minimum quite accurately, but cannot be sure because of inadequate derivative information. Also, the number of function and gradient evaluations is much greater than in the example with the symbolic derivatives computed automatically because extra evaluations are required to approximate the gradient and Hessian, respectively.

If it is possible to supply the gradient (or the function is such that it can be computed automatically), the method will typically work much better. You can give the gradient using the Gradient option, which has several ways you can specify derivatives.

This defines a function that returns the gradient for numerical values of x and y:
Click for copyable input
This tells FindMinimum to use the supplied gradient. The Hessian is computed using finite differences of the gradient:
Click for copyable input

If you can provide a program that gives the Hessian, you can provide this also. Because the Hessian is only used by Newton's method, it is given as a method option of "Newton".

This defines a function that returns the Hessian for numerical values of x and y:
Click for copyable input
This tells FindMinimum to use the supplied gradient and Hessian:
Click for copyable input

In principle, Newton's method uses the Hessian computed either by evaluating the symbolic derivative or by using finite differences. However, the convergence for the method computed this way depends on the function being convex, in which case the Hessian is always positive definite. It is common that a search will start at a location where this condition is violated, so the algorithm needs to take this possibility into account.

Here is an example where the search starts near a local maximum:
Click for copyable input

When sufficiently near a local maximum, the Hessian is actually negative definite.

This computes the eigenvalues of the Hessian near the local maximum:
Click for copyable input

If you were to only apply the Newton step formula in cases where the Hessian is not positive definite, it is possible to get a step direction that does not lead to a decrease in the function value.

This computes the directional derivative for the direction found by solving . Since it is positive, moving in this direction will locally increase the function value:
Click for copyable input

It is crucial for the convergence of line search methods that the direction be computed using a positive definite quadratic model since the search process and convergence results derived from it depend on a direction with sufficient descent. See "Line Search Methods". The Wolfram Language modifies the Hessian by a diagonal matrix with entries large enough so that is positive definite. Such methods are sometimes referred to as modified Newton methods. The modification to is done during the process of computing a Cholesky decomposition somewhat along the lines described in [GMW81], both for dense and sparse Hessians. The modification is only done if is not positive definite. This decomposition method is accessible through LinearSolve if you want to use it independently.

This computes the step using , where is determined as the Cholesky factors of the Hessian are being computed:
Click for copyable input
The computed step is in a descent direction:
Click for copyable input

Besides the robustness of the (modified) Newton method, another key aspect is its convergence rate. Once a search is close enough to a local minimum, the convergence is said to be -quadratic, which means that if is the local minimum point, then

for some constant .

At machine precision, this does not always make a substantial difference, since it is typical that most of the steps are spent getting near to the local minimum. However, if you want a root to extremely high precision, Newton's method is usually the best choice because of the rapid convergence.

This computes a very high-precision solution using Newton's method. The precision is adaptively increased from machine precision (the precision of the starting point) to the maximal working precision of 100000 digits. Reap is used with Sow to save the steps taken. Counters are used to track and print the number of function evaluations and steps used:
Click for copyable input

When the option WorkingPrecision->prec is used, the default for the AccuracyGoal and PrecisionGoal is prec/2. Thus, this example should find the minimum to at least 50000 digits.

This computes a symbolic solution for the position of the minimum which the search approaches:
Click for copyable input
This computes the norm of the distance from the search points at the end of each step to the exact minimum:
Click for copyable input

The reason that more function evaluations were required than the number of steps is that the Wolfram Language adaptively increases the precision from the precision of the initial value to the requested maximum WorkingPrecision. The sequence of precisions used is chosen so that as few computations are done at the most expensive final precision as possible, under the assumption that the points are converging to the minimum. Sometimes when the Wolfram Language changes precision, it is necessary to reevaluate the function at the higher precision.

This shows a table with the precision of each of the points with the norm of their errors:
Click for copyable input

Note that typically the precision is roughly double the scale of the error. For Newton's method this is appropriate since when the step is computed, the scale of the error will effectively double according to the quadratic convergence.

FindMinimum always starts with the precision of the starting values you gave it. Thus, if you do not want it to use adaptive precision control, you can start with values, which are exact or have at least the maximum WorkingPrecision.

This computes the solution using only precision 100000 throughout the computation. (Warning: this takes a very long time to complete.)
Click for copyable input

Even though this may use fewer function evaluations, they are all done at the highest precision, so typically adaptive precision saves a lot of time. For example, the previous command without adaptive precision takes more than 50 times as long as when starting from machine precision.

With Newton's method, both line search and trust region step control are implemented. The default, which is used in the preceding examples, is the line search. However, any of them may be done with the trust region approach. The approach typically requires more numerical linear algebra computations per step, but because steps are better controlled, may converge in fewer iterations.

This uses the unconstrained problems package to set up the classic Rosenbrock function, which has a narrow curved valley:
Click for copyable input
This shows the steps taken by FindMinimum with a trust region Newton method for a Rosenbrock function:
Click for copyable input
This shows the steps taken by FindMinimum with a line search Newton method for the same function:
Click for copyable input

You can see from the comparison of the two plots that the trust region method has kept the steps within better control as the search follows the valley and consequently converges with fewer function evaluations.

The following table summarizes the options you can use with Newton's method.

option name
default value
"Hessian"Automatican expression to use for computing the Hessian matrix
"StepControl""LineSearch"how to control steps; options include "LineSearch", "TrustRegion", or None

Method options for Method->"Newton".