One significant advantage the Wolfram Language provides is that it can symbolically compute derivatives. This means that when you specify Method->"Newton" and the function is explicitly differentiable, the symbolic derivative will be computed automatically. On the other hand, if the function is not in a form that can be explicitly differentiated, the Wolfram Language will use finite difference approximations to compute the Hessian, using structural information to minimize the number of evaluations required. Alternatively, you can specify a Wolfram Language expression, which will give the Hessian with numerical values of the variables.
When the gradient and Hessian are both computed using finite differences, the error in the Hessian may be quite large and it may be better to use a different method. In this case, FindMinimum does find the minimum quite accurately, but cannot be sure because of inadequate derivative information. Also, the number of function and gradient evaluations is much greater than in the example with the symbolic derivatives computed automatically because extra evaluations are required to approximate the gradient and Hessian, respectively.
If it is possible to supply the gradient (or the function is such that it can be computed automatically), the method will typically work much better. You can give the gradient using the Gradient option, which has several ways you can specify derivatives.
In principle, Newton's method uses the Hessian computed either by evaluating the symbolic derivative or by using finite differences. However, the convergence for the method computed this way depends on the function being convex, in which case the Hessian is always positive definite. It is common that a search will start at a location where this condition is violated, so the algorithm needs to take this possibility into account.
It is crucial for the convergence of line search methods that the direction be computed using a positive definite quadratic model since the search process and convergence results derived from it depend on a direction with sufficient descent. See "Line Search Methods". The Wolfram Language modifies the Hessian by a diagonal matrix with entries large enough so that is positive definite. Such methods are sometimes referred to as modified Newton methods. The modification to is done during the process of computing a Cholesky decomposition somewhat along the lines described in [GMW81], both for dense and sparse Hessians. The modification is only done if is not positive definite. This decomposition method is accessible through LinearSolve if you want to use it independently.
Besides the robustness of the (modified) Newton method, another key aspect is its convergence rate. Once a search is close enough to a local minimum, the convergence is said to be -quadratic, which means that if is the local minimum point, then
At machine precision, this does not always make a substantial difference, since it is typical that most of the steps are spent getting near to the local minimum. However, if you want a root to extremely high precision, Newton's method is usually the best choice because of the rapid convergence.
The reason that more function evaluations were required than the number of steps is that the Wolfram Language adaptively increases the precision from the precision of the initial value to the requested maximum WorkingPrecision. The sequence of precisions used is chosen so that as few computations are done at the most expensive final precision as possible, under the assumption that the points are converging to the minimum. Sometimes when the Wolfram Language changes precision, it is necessary to reevaluate the function at the higher precision.
Note that typically the precision is roughly double the scale of the error. For Newton's method this is appropriate since when the step is computed, the scale of the error will effectively double according to the quadratic convergence.
FindMinimum always starts with the precision of the starting values you gave it. Thus, if you do not want it to use adaptive precision control, you can start with values, which are exact or have at least the maximum WorkingPrecision.
Even though this may use fewer function evaluations, they are all done at the highest precision, so typically adaptive precision saves a lot of time. For example, the previous command without adaptive precision takes more than 50 times as long as when starting from machine precision.
With Newton's method, both line search and trust region step control are implemented. The default, which is used in the preceding examples, is the line search. However, any of them may be done with the trust region approach. The approach typically requires more numerical linear algebra computations per step, but because steps are better controlled, may converge in fewer iterations.
You can see from the comparison of the two plots that the trust region method has kept the steps within better control as the search follows the valley and consequently converges with fewer function evaluations.
|"Hessian"||Automatic||an expression to use for computing the Hessian matrix|
|"StepControl"||"LineSearch"||how to control steps; options include "LineSearch", "TrustRegion", or None|
Method options for Method->"Newton".