Newton's Method
One significant advantage Mathematica provides is that it can symbolically compute derivatives. This means that when you specify Method→"Newton" and the function is explicitly differentiable, the symbolic derivative will be computed automatically. On the other hand, if the function is not in a form that can be explicitly differentiated, Mathematica will use finite difference approximations to compute the Hessian, using structural information to minimize the number of evaluations required. Alternatively, you can specify a Mathematica expression that gives the Hessian when the variables have numerical values.
This loads a package that contains some utility functions. 
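The input cell is not reproduced here; a minimal sketch, assuming the Optimization`UnconstrainedProblems` package that supplies FindMinimumPlot and the test problems used later in this section:

    <<Optimization`UnconstrainedProblems`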
In this example, FindMinimum computes the Hessian symbolically and substitutes numerical values for x and y when needed.
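An input along these lines is meant; the objective function and starting point are assumptions consistent with the rest of this section:

    FindMinimumPlot[Cos[x^2 - 3 y] + Sin[x^2 + y^2], {{x, 1}, {y, 1}},
      Method -> "Newton"]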
This defines a function that is only intended to evaluate for numerical values of the variables. 
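A definition consistent with this description; the _?NumberQ pattern tests restrict the rule to numerical arguments:

    f[x_?NumberQ, y_?NumberQ] := Cos[x^2 - 3 y] + Sin[x^2 + y^2]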
The derivative of this function cannot be found symbolically since the function has been defined only to evaluate with numerical values of the variables.
This shows the steps taken by FindMinimum when it has to use finite differences to compute the gradient and Hessian.
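A plausible form of the input, using the numerical-only f defined above:

    FindMinimumPlot[f[x, y], {{x, 1}, {y, 1}}, Method -> "Newton"]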
When the gradient and Hessian are both computed using finite differences, the error in the Hessian may be quite large, and it may be better to use a different method. In this case, FindMinimum does find the minimum quite accurately, but cannot be sure because of inadequate derivative information. Also, the number of function evaluations is much greater than in the example with the symbolic derivatives computed automatically, because extra evaluations are required to approximate the gradient and Hessian.
If it is possible to supply the gradient (or the function is such that it can be computed automatically), the method will typically work much better. You can give the gradient using the Gradient option, which has several ways you can "specify derivatives".
This defines a function that returns the gradient for numerical values of x and y.
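A sketch of such a definition; using Set (=) rather than SetDelayed (:=) makes the symbolic differentiation happen once, at definition time, with the pattern variables then standing in for x and y:

    grad[x_?NumberQ, y_?NumberQ] = D[Cos[x^2 - 3 y] + Sin[x^2 + y^2], {{x, y}}]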
This tells FindMinimum to use the supplied gradient. The Hessian is computed using finite differences of the gradient.
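One plausible input, assuming the grad function defined above:

    FindMinimumPlot[f[x, y], {{x, 1}, {y, 1}}, Gradient -> grad[x, y],
      Method -> "Newton"]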
If you can supply a program that computes the Hessian, you can provide that as well. Because the Hessian is only used by Newton's method, it is given as a method option of "Newton".
This defines a function which returns the Hessian for numerical values of x and y.
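A sketch of such a definition; D[expr, {{x, y}, 2}] computes the matrix of second partial derivatives:

    hess[x_?NumberQ, y_?NumberQ] = D[Cos[x^2 - 3 y] + Sin[x^2 + y^2], {{x, y}, 2}]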
This tells FindMinimum to use the supplied gradient and Hessian.
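A sketch of the input; the Hessian is given as the "Hessian" method option of "Newton":

    FindMinimumPlot[f[x, y], {{x, 1}, {y, 1}}, Gradient -> grad[x, y],
      Method -> {"Newton", "Hessian" -> hess[x, y]}]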
In principle, Newton's method uses the Hessian computed either by evaluating the symbolic derivative or by using finite differences. However, the convergence of the method with the Hessian computed this way depends on the function being convex, in which case the Hessian is always positive definite. It is common that a search will start at a location where this condition is violated, so the algorithm needs to take this possibility into account.
Here is an example where the search starts near a local maximum.
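A hypothetical version of this input. The function above has a local maximum near {x, y} ≈ {1.17, 0.45} (where x^2 - 3 y == 0 and x^2 + y^2 == Pi/2); the starting point below is an assumed choice near it, since the exact point used originally is not shown:

    FindMinimumPlot[Cos[x^2 - 3 y] + Sin[x^2 + y^2], {{x, 1.2}, {y, 0.5}},
      Method -> "Newton"]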
When sufficiently near a local maximum, the Hessian is actually negative definite.
This computes the eigenvalues of the Hessian near the local maximum.
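A sketch of the computation, reusing the hypothetical near-maximum point from above; both eigenvalues come out negative:

    hpt = D[Cos[x^2 - 3 y] + Sin[x^2 + y^2], {{x, y}, 2}] /. {x -> 1.2, y -> 0.5};
    Eigenvalues[hpt]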
If you simply apply the Newton step formula when the Hessian is not positive definite, it is possible to get a step direction that does not lead to a decrease in the function value.
This computes the directional derivative for the direction found by solving ∇^{2}f(x_{k}) s_{0} = -∇f(x_{k}). Since it is positive, moving in this direction will locally increase the function value.
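A sketch of the computation, with hpt as above and gpt the gradient at the same hypothetical point:

    gpt = D[Cos[x^2 - 3 y] + Sin[x^2 + y^2], {{x, y}}] /. {x -> 1.2, y -> 0.5};
    s0 = LinearSolve[hpt, -gpt];
    gpt . s0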
It is crucial for the convergence of line search methods that the direction be computed using a positive definite quadratic model B_{k}, since the search process and the convergence results derived from it depend on a direction with sufficient descent.
Mathematica modifies the Hessian by a diagonal matrix E_{k} with entries large enough so that B_{k} = ∇^{2}f(x_{k}) + E_{k} is positive definite. Such methods are sometimes referred to as modified Newton methods. The modification to B_{k} is done during the process of computing a Cholesky decomposition, somewhat along the lines described in [GMW81], both for dense and sparse Hessians. The modification is only done if ∇^{2}f(x_{k}) is not positive definite. This decomposition method is accessible through LinearSolve if you want to use it independently.
This computes the step using B_{0}s_{0} = -∇f(x_{0}), where B_{0} is determined as the Cholesky factors of the Hessian are being computed.
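A sketch of the computation, assuming the Method -> "Cholesky" option of LinearSolve invokes the modified Cholesky decomposition described above:

    s0 = LinearSolve[hpt, -gpt, Method -> "Cholesky"]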
The computed step is in a descent direction.
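Checking as before, the directional derivative along the modified step is negative:

    gpt . s0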
Besides the robustness of the (modified) Newton method, another key aspect is its convergence rate. Once a search is close enough to a local minimum, the convergence is said to be q-quadratic, which means that if x^{*} is the local minimum point, then

∥x_{k+1} - x^{*}∥ ≤ c ∥x_{k} - x^{*}∥^{2}

for some constant c > 0.
At machine precision, this does not always make a substantial difference since it is typical that most of the steps are spent getting near to the local minimum. However, if you want a minimum to extremely high precision, Newton's method is usually the best choice because of the rapid convergence.
This computes a very high-precision solution using Newton's method. The precision is adaptively increased from machine precision (the precision of the starting point) to the maximal working precision of 100000 digits. Reap is used with Sow to save the steps taken. Counters are used to track and print the number of function evaluations and steps used.
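A sketch of such an input; the objective, starting point, and counter names are assumptions consistent with the earlier examples:

    n = 0; m = 0;
    {min, {pts}} = Reap[
       FindMinimum[Cos[x^2 - 3 y] + Sin[x^2 + y^2], {{x, 1.}, {y, 1.}},
        Method -> "Newton", WorkingPrecision -> 100000,
        StepMonitor :> (m++; Sow[{x, y}]),
        EvaluationMonitor :> n++]];
    Print["function evaluations: ", n, ", steps: ", m]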
When the option WorkingPrecision→prec is used, the default for the AccuracyGoal and PrecisionGoal is prec/2. Thus, this example should find the minimum to at least 50000 digits.
This computes a symbolic solution for the position of the minimum which the search approaches.
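One way to do this is sketched below. The branch chosen (where x^2 - 3 y == -Pi and x^2 + y^2 == 3 Pi/2, so that the function value is -2) is an assumption about which minimum the search approaches:

    mpt = {x, y} /.
      First[Solve[{x^2 - 3 y == -Pi, x^2 + y^2 == 3 Pi/2, x > 0, y > 0}, {x, y}]]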
This computes the norm of the distance from the search points at the end of each step to the exact minimum.
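A sketch of the computation, assuming the saved points pts and exact minimum mpt from above:

    Map[Norm[N[# - mpt, 20]] &, pts]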
The reason that more function evaluations were required than the number of steps is that Mathematica adaptively increases the precision from the precision of the initial value to the requested maximum WorkingPrecision. The sequence of precisions used is chosen so that as few computations as possible are done at the most expensive final precision, under the assumption that the points are converging to the minimum. Sometimes when Mathematica changes precision, it is necessary to reevaluate the function at the higher precision.
This shows a table with the precision of each of the points with the norm of their errors.
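A corresponding input might look like:

    TableForm[Map[{Precision[#], Norm[N[# - mpt, 20]]} &, pts]]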
Note that typically the precision is roughly double the scale (log_{10}) of the error. For Newton's method this is appropriate since, when the step is computed, the scale of the error will effectively double according to the quadratic convergence.
FindMinimum always starts with the precision of the starting values you gave it. Thus, if you do not want it to use adaptive precision control, you can start with values that are exact or have at least the maximum WorkingPrecision.
This computes the solution using only precision 100000 throughout the computation. (Warning: this takes a very long time to complete.)
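A sketch of the input; starting from exact values means the entire computation is done at the requested WorkingPrecision:

    FindMinimum[Cos[x^2 - 3 y] + Sin[x^2 + y^2], {{x, 1}, {y, 1}},
      Method -> "Newton", WorkingPrecision -> 100000]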
Even though this may use fewer function evaluations, they are all done at the highest precision, so typically adaptive precision saves a lot of time. For example, the previous command without adaptive precision takes more than 50 times as long as when starting from machine precision.
With Newton's method, both "line search" and "trust region" step control are implemented. The default, which is used in the preceding examples, is the line search. However, any of these examples may also be run with the trust region approach. The trust region approach typically requires more numerical linear algebra computations per step, but because steps are better controlled, it may converge in fewer iterations.
This uses the unconstrained problems package to set up the classic Rosenbrock function, which has a narrow curved valley.
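A sketch of the input, assuming the GetFindMinimumProblem utility from the package loaded at the start of this section:

    p = GetFindMinimumProblem[Rosenbrock]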
This shows the steps taken by FindMinimum with a trust region Newton method for a Rosenbrock function.
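The trust region variant is selected with the "StepControl" method option:

    FindMinimumPlot[p, Method -> {"Newton", "StepControl" -> "TrustRegion"}]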
This shows the steps taken by FindMinimum with a line search Newton method for the same function.
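Line search is the default step control, so no method option is needed:

    FindMinimumPlot[p, Method -> "Newton"]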
You can see from the comparison of the two plots that the trust region method has kept the steps under better control as the search follows the valley, and consequently it converges with fewer function evaluations.
The following table summarizes the options you can use with Newton's method.

option name  default value  description
"Hessian"  Automatic  an expression to use for computing the Hessian matrix
"StepControl"  "LineSearch"  how to control steps; can be "LineSearch", "TrustRegion", or None

Method options for Method→"Newton".