The first thing to notice about the path taken in this example is that it starts in the wrong direction. This direction is chosen because at the first step all the method has to go by is the gradient, and so it takes the direction of steepest descent. However, in subsequent steps, it incorporates information from the values of the function and gradient at the steps taken to build up an approximate model of the Hessian.
The methods used by the Wolfram Language are the Broyden–Fletcher–Goldfarb–Shanno (BFGS) updates and, for large systems, the limited-memory BFGS (L-BFGS) methods, in which the model is not stored explicitly, but rather is calculated by gradients and step directions stored from past steps.
The BFGS method is implemented such that instead of forming the model Hessian at each step, Cholesky factors such that are computed so that only operations are needed to solve the system [DS96] for a problem with variables.
For large-scale sparse problems, the BFGS method can be problematic because, in general, the Cholesky factors (or the Hessian approximation or its inverse) are dense, so the memory and operations requirements become prohibitive compared to algorithms that take advantage of sparseness. The L-BFGS algorithm [NW99] forms an approximation to the inverse Hessian based on the last past steps, which are stored. The Hessian approximation may not be as complete, but the memory and order of operations are limited to for a problem with variables. In the Wolfram Language, for problems over 250 variables, the algorithm is switched automatically to L-BFGS. You can control this with the method option "StepMemory"->m. With , the full BFGS method will always be used. Choosing an appropriate value of is a tradeoff between speed of convergence and the work done per step. With , you are most likely better off using a "conjugate gradient" algorithm.
Quasi-Newton methods are chosen as the default in the Wolfram Language because they are typically quite fast and do not require computation of the Hessian matrix, which can be quite expensive both in terms of the symbolic computation and numerical evaluation. With an adequate line search, they can be shown to converge superlinearly [NW99] to a local minimum where the Hessian is positive definite. This means that
or, in other words, the steps keep getting smaller. However, for very high precision, this does not compare to the -quadratic convergence rate of Newton's method.
Newton's method is able to find ten times as many digits with far fewer steps because of its quadratic convergence rate. However, the convergence with the quasi-Newton method is still superlinear since the ratio of the errors is clearly going to zero.
|"StepMemory"||Automatic||the effective number of steps to "remember" in the Hessian approximation; can be a positive integer or Automatic|
|"StepControl"||"LineSearch"||how to control steps; can be or None|
Method options for Method->"QuasiNewton".