# Unconstrained Optimization: Setting Up Optimization Problems in the Wolfram Language

## Specifying Derivatives

The function FindRoot has a Jacobian option; the functions FindMinimum, FindMaximum, and FindFit have a Gradient option; and the Newton method has a method option Hessian. All these derivatives are specified with the same basic structure. Here is a summary of ways to specify derivative computation methods.

Automatic | find a symbolic derivative for the function and use finite difference approximations if a symbolic derivative cannot be found |

Symbolic | same as Automatic, but gives a warning message if finite differences are to be used |

FiniteDifference | use finite differences to approximate the derivative |

expression | use the given expression with local numerical values of the variables to evaluate the derivative |

Methods for computing gradient, Jacobian, and Hessian derivatives.

The basic specification for a derivative is just the method for computing it. However, all of the derivatives take options as well. These can be specified by using a list {method,opts}. Here is a summary of the options for the derivatives.

option name | default value | |

"EvaluationMonitor" | None | expression to evaluate with local values of the variables every time the derivative is evaluated, usually specified with :> instead of -> to prevent symbolic evaluation |

"Sparse" | Automatic | sparse structure for the derivative; can be Automatic, True, False, or a pattern SparseArray giving the nonzero structure |

"DifferenceOrder" | 1 | difference order to use when finite differences are used to compute the derivative |

Options for computing gradient, Jacobian, and Hessian derivatives.

A few examples will help illustrate how these fit together.

With just Method->"Newton", FindMinimum issues an lstol message because it was not able to resolve the minimum well enough due to lack of good derivative information.

The following describes how you can use the gradient option to specify the derivative.

Symbolic derivatives are not always available. If you need extra accuracy from finite differences, you can increase the difference order from the default of 1 at the cost of extra function evaluations.

Note that the number of function evaluations is much higher because function evaluations are used to compute the gradient, which is used to approximate the Hessian in turn. (The Hessian is computed with finite differences since no symbolic expression for it can be computed from the information given.)

The information given from FindMinimumPlot about the number of function, gradient, and Hessian evaluations is quite useful. The EvaluationMonitor options are what make this possible. Here is an example that simply counts the number of each type of evaluation. (The plot is made using Reap and Sow to collect the values at which the evaluations are done.)

Using such diagnostics can be quite useful for determining what methods and/or method parameters may be most successful for a class of problems with similar characteristics.

When the Wolfram Language can access the symbolic structure of the function, it automatically does a structural analysis of the function and its derivatives and uses SparseArray objects to represent the derivatives when appropriate. Since subsequent numerical linear algebra can then use the sparse structures, this can have a profound effect on the overall efficiency of the search. When the Wolfram Language cannot do a structural analysis, it has to assume, in general, that the structure is dense. However, if you know what the sparse structure of the derivative is, you can specify this with the "Sparse" method option and gain huge efficiency advantages, both in computing derivatives (with finite differences, the number of evaluations can be reduced significantly) and in subsequent linear algebra. This issue is particularly important when working with vector-valued variables. A good example for illustrating this aspect is the extended Rosenbrock problem, which has a very simple sparse structure.

For a function with simple form like this, it is easy to write a vector form of the function, which can be evaluated much more quickly than the symbolic form can, even with automatic compilation.

The solution with the function, which is faster to evaluate, winds up being slower overall because the Jacobian has to be computed with finite differences since the x_List pattern makes it opaque to symbolic analysis. It is not so much the finite differences that are slow as the fact that it needs to do 100 function evaluations to get all the columns of the Jacobian. With knowledge of the structure, this can be reduced to two evaluations to get the Jacobian. For this function, the structure of the Jacobian is quite simple.

When a sparse structure is given, it is also possible to have the value computed by a symbolic expression that evaluates to the values corresponding to the positions given in the sparse structure template. Note that the values must correspond directly to the positions as ordered in the SparseArray (the ordering can be seen using ArrayRules). One way to get a consistent ordering of indices is to transpose the matrix twice, which results in a SparseArray with indices in lexicographic order.

In this case, using the sparse Jacobian is not significantly faster because the Jacobian is so sparse that a finite difference approximation can be found for it in only two function evaluations and because the problem is well enough defined near the minimum that the extra accuracy in the Jacobian does not make any significant difference.

## Variables and Starting Conditions

All the functions FindMinimum, FindMaximum, and FindRoot take variable specifications of the same form. The function FindFit uses the same form for its parameter specifications.

FindMinimum[f,vars] | find a local minimum of f with respect to the variables given in vars |

FindMaximum[f,vars] | find a local maximum of f with respect to the variables given in vars |

FindRoot[f,vars] | find a root with respect to the variables given in vars |

FindRoot[eqns,vars] | find a root of the equations eqns with respect to the variables given in vars |

FindFit[data,expr,pars,vars] | find values of the parameters pars that make expr give a best fit to data as a function of vars |

Variables and parameters in the "Find" functions.

The list vars (pars for FindFit) should be a list of individual variable specifications. Each variable specification should be of the following form.

{var,st} | variable var has starting value st |

{var,st_{1},st_{2}} | variable var has two starting values st_{1} and st_{2}; the second starting condition is only used with the principal axis and secant methods |

{var,st,rl,ru} | variable var has starting value st; the search will be terminated when the value of var goes outside of the interval [rl,ru] |

{var,st_{1},st_{2},rl,ru} | variable var has two starting values st_{1} and st_{2}; the search will be terminated when the value of var goes outside of the interval [rl,ru] |

Individual variable specifications in the "Find" functions.

The specifications in vars all need to have the same number of starting values. When region bounds are not specified, they are taken to be unbounded, that is, rl=-∞, ru=∞.

### Vector- and Matrix-Valued Variables

The most common use of variables is to represent numbers. However, the variable input syntax supports variables that are treated as vectors, matrices, or higher-rank tensors. In general, the "Find" commands, with the exception of FindFit, which currently only works with scalar variables, will consider a variable to take on values with the same rectangular structure as the starting conditions given for it.

Of course, this is not the best way to compute the eigenvalue, but it does show how the variable dimensions are picked up from the starting values. Since has a starting value of 1, it is taken to be a scalar. On the other hand, is given a starting value, which is a vector of length 3, so it is always taken to be a vector of length 3.

If you use multiple starting values for variables, it is necessary that the values have consistent dimensions and that each component of the starting values is distinct.

One advantage of variables that can take on vector and matrix values is that they allow you to write functions, which can be very efficient for larger problems and/or handle problems of different sizes automatically.

Note that since the value of the function would be meaningless unless had the correct structure, the definition is restricted to arguments with that structure. For example, if you defined the function for any pattern x_, then evaluating with an undefined symbol x (which is what FindMinimum does) gives meaningless unintended results. It is often the case that when working with functions for vector-valued variables, you will have to restrict the definitions. Note that the definition above does not rule out symbolic values with the right structure. For example, ExtendedRosenbrockObjective[{{x_{11},x_{12}},{x_{21},x_{22}}}] gives a symbolic representation of the function for scalar x_{11}, ….

The solution did not achieve the default tolerances due to the fact that the Wolfram Language was not able to get symbolic derivatives for the function, so it had to fall back on finite differences that are not as accurate.

A disadvantage of using vector- and matrix-valued variables is that the Wolfram Language cannot currently compute symbolic derivatives for them. Sometimes it is not difficult to develop a function that gives the correct derivative. (Failing that, if you really need greater accuracy, you can use higher-order finite differences.)

Jacobian and Hessian derivatives are often sparse. You can also specify the structural sparsity of these derivatives when appropriate, which can reduce overall solution complexity by quite a bit.

## Termination Conditions

Mathematically, sufficient conditions for a local minimum of a smooth function are quite straightforward: is a local minimum if and the Hessian is positive definite. (It is a necessary condition that the Hessian be positive semidefinite.) The conditions for a root are even simpler. However, when the function is being evaluated on a computer where its value is only known, at best, to a certain precision, and practically only a limited number of function evaluations are possible, it is necessary to use error estimates to decide when a search has become close enough to a minimum or a root, and to compute the solution only to a finite tolerance. For the most part, these estimates suffice quite well, but in some cases, they can be in error, usually due to unresolved fine scale behavior of the function.

Tolerances affect how close a search will try to get to a root or local minimum before terminating the search. Assuming that the function itself has some error (as is typical when it is computed with numerical values), it is not typically possible to locate the position of a minimum much better than to half of the precision of the numbers being worked with. This is because of the quadratic nature of local minima. Near the bottom of a parabola, the height varies quite slowly as you move across from the minimum. Thus, if there is any error noise in the function, it will typically mask the actual rise of the parabola over a width roughly equal to the square root of the noise. This is best seen with an example.

From the sequence of plots, it is clear that for changes of order , which is about half of machine precision and smaller, errors in the function are masking the actual shape of the curve near the minimum. With just sampling of the function at that precision, there is no way to be sure if a given point gives the smallest local value of the function or not to any closer tolerance.

The value of the derivative, if it is computed symbolically, is much more reliable, but for the general case, it is not sufficient to rely only on the value of the derivative; the search needs to find a local minimal value of the function where the derivative is small to satisfy the tolerances in general. Note also that if symbolic derivatives of your function cannot be computed and finite differences or a derivative-free method is used, the accuracy of the solution may degrade further.

Root finding can suffer from the same inaccuracies in the function. While it is typically not as severe, some of the error estimates are based on a merit function, which does have a quadratic shape.

For the reason of this limitation, the default tolerances for the Find functions are all set to be half of the final working precision. Depending on how much error the function has, this may or may not be achievable, but in most cases it is a reasonable goal. You can adjust the tolerances using the AccuracyGoal and PrecisionGoal options. When AccuracyGoal->ag and PrecisionGoal->pg, this defines tolerances and .

Given and FindMinimum tries to find a value such that . Of course, since the exact position of the minimum, , is not known, the quantity is estimated. This is usually done based on past steps and derivative values. To match the derivative condition at a minimum, the additional requirement is imposed. For FindRoot, the corresponding condition is that just the residual be small at the root: .

In multiple dimensions, the situation is even more complicated since there can be more error in some directions than others, such as when a minimum is found along a relatively narrow valley, as in the Freudenstein–Roth problem. For searches such as this, often the search parameters are scaled, which in turn affects the error estimates. Nonetheless, it is still typical that the quadratic shape of the minimum affects the realistically achievable tolerances.

When you need to find a root or minimum beyond the default tolerances, it may be necessary to increase the final working precision. You can do this with the WorkingPrecision option. When you use WorkingPrecision->prec, the search starts at the precision of the starting values and is adaptively increased up to prec as the search converges. By default, WorkingPrecision->MachinePrecision, so machine numbers are used, which are usually much faster. Going to higher precision can take significantly more time, but can get you much more accurate results if your function is defined in an appropriate way. For very high-precision solutions, Newton's method is recommended because its quadratic convergence rate significantly reduces the number of steps ultimately required.

It is important to note that increasing the setting of the WorkingPrecision option does no good if the function is defined with lower-precision numbers. In general, for WorkingPrecision->prec to be effective, the numbers used to define the function should be exact or at least of precision prec. When possible, the precision of numbers in the function is artificially raised to prec using SetPrecision so that convergence still works, but this is not always possible. In any case, when the functions and derivatives are evaluated numerically, the precision of the results is raised to prec if necessary so that the internal arithmetic can be done with prec digit precision. Even so, the actual precision or accuracy of the root or minimum and its position is limited by the accuracy in the function. This is especially important to keep in mind when using FindFit, where data is usually only known up to a certain precision.

If you specify WorkingPrecision->prec, but do not explicitly specify the AccuracyGoal and PrecisionGoal options, then their default settings of Automatic will be taken to be AccuracyGoal->prec/2 and PrecisionGoal->prec/2. This leads to the smallest tolerances that can realistically be expected in general, as discussed earlier.

The following table shows a summary of the options affecting precision and tolerance.

option name | default value | |

WorkingPrecision | MachinePrecision | the final working precision, prec, to use; precision is adaptively increased from the smaller of prec and the precision of the starting conditions to prec |

AccuracyGoal | Automatic | setting ag determines an absolute tolerance by tol_{a}=10^{-ag}; when Automatic, ag=prec/2 |

PrecisionGoal | Automatic | setting pg determines an absolute tolerance by tol_{r}=10^{-pg}; when Automatic, pg=prec/2 |

Precision and tolerance options in the "Find" functions.

A search will sometimes converge slowly. To prevent slow searches from going on indefinitely, the Find commands all have a maximum number of iterations (steps) that will be allowed before terminating. This can be controlled with the option MaxIterations that has the default value MaxIterations->100. When a search terminates with this condition, the command will issue the cvmit message.

The Levenberg–Marquardt method is converging slowly on this problem because the residual is nonzero near the minimum and the second-order part of the Hessian is needed. While the method eventually does converge in just under 400 steps, perhaps a better option is to use a method which may converge faster.

In a larger calculation, one possibility when hitting the iteration limit is to use the final search point, which is returned, as a starting condition for continuing the search, ideally with another method.

## Symbolic Evaluation

The functions FindMinimum, FindMaximum, and FindRoot have the HoldAll attribute and so have special semantics for evaluation of their arguments. First, the variables are determined from the second argument, then they are localized. Next, the function is evaluated symbolically, then processed into an efficient form for numerical evaluation. Finally, during the execution of the command, the function is repeatedly evaluated with different numerical values. Here is a list showing these steps with additional description.

Determine variables | process the second argument; if the second argument is not of the correct form (a list of variables and starting values), it will be evaluated to get the correct form |

Localize variables | in a manner similar to Block and Table, add rules to the variables so that any assignments given to them will not affect your Wolfram System session beyond the scope of the "Find" command and so that previous assignments do not affect the value (the variable will evaluate to itself at this stage) |

Evaluate the function | with the locally undefined (symbolic) values of the variables, evaluate the first argument (function or equations). Note: this is a change which was instituted in Mathematica 5, so some adjustments may be necessary for code that ran in previous versions. If your function is such that symbolic evaluation will not keep the function as intended or will be prohibitively slow, you should define your function so that it only evaluates for numerical values of the variables. The simplest way to do this is by defining your function using PatternTest (?), as in f[x_?NumberQ]:=definition. |

Preprocess the function | analyze the function to help determine the algorithm to use (e.g., sum of squares -> Levenberg–Marquardt); optimize and compile the function for faster numerical evaluation if possible: for FindRoot this first involves going from equations to a function |

Compute derivatives | compute any needed symbolic derivatives if possible; otherwise, do preprocessing needed to compute derivatives using finite differences |

Evaluate numerically | repeatedly evaluate the function (and derivatives when required) with different numerical values |

Steps in processing the function for the "Find" commands.

FindFit does not have the HoldAll attribute, so its arguments are all evaluated before the commands begin. However, it uses all of the stages described above, except instead of evaluating the function, it constructs a function to minimize from the model function, variables, and provided data.

You will sometimes want to prevent symbolic evaluation, most often when your function is not an explicit formula, but a value derived through running through a program. An example of what happens and how to prevent the symbolic evaluation is shown.

The command fails because of the symbolic evaluation of the function. You can see what happens when you evaluate it inside of Block.

Of course, this is not at all what was intended for the function; it does not even depend on xp. What happened is that without a numerical value for xp, NDSolve fails, so ReplaceAll (/.) fails because there are no rules. First just returns its first argument, which is x[1]. Since the function is meaningless unless xp has numerical values, it should be properly defined.

An advantage of having a simple function definition outside of FindRoot is that it can independently be tested to make sure that it is what you really intended.

From the plot, you can deduce two bracketing values for the root, so it is possible to take advantage of Brent's method to quickly and accurately solve the problem.

It may seem that symbolic evaluation just creates a bother since you have to define the function specifically to prevent it. However, without symbolic evaluation, it is hard for the Wolfram Language to take advantage of its unique combination of numerical and symbolic power. Symbolic evaluation means that the commands can consistently take advantage of benefits that come from symbolic analysis, such as algorithm determination, automatic computation of derivatives, automatic optimization and compilation, and structural analysis.