Unconstrained Optimization: Setting Up Optimization Problems in the Wolfram Language

Specifying Derivatives	Termination Conditions
Variables and Starting Conditions	Symbolic Evaluation

Specifying Derivatives

The function FindRoot has a Jacobian option; the functions FindMinimum, FindMaximum, and FindFit have a Gradient option; and the Newton method has a method option Hessian. All these derivatives are specified with the same basic structure. Here is a summary of ways to specify derivative computation methods.

Automatic	find a symbolic derivative for the function and use finite difference approximations if a symbolic derivative cannot be found
Symbolic	same as Automatic, but gives a warning message if finite differences are to be used
FiniteDifference	use finite differences to approximate the derivative
expression	use the given expression with local numerical values of the variables to evaluate the derivative

Methods for computing gradient, Jacobian, and Hessian derivatives.

The basic specification for a derivative is just the method for computing it. However, all of the derivatives take options as well. These can be specified by using a list {method,opts}. Here is a summary of the options for the derivatives.

option name	default value
"EvaluationMonitor"	None	expression to evaluate with local values of the variables every time the derivative is evaluated, usually specified with :> instead of -> to prevent symbolic evaluation
"Sparse"	Automatic	sparse structure for the derivative; can be Automatic, True, False, or a pattern SparseArray giving the nonzero structure
"DifferenceOrder"	1	difference order to use when finite differences are used to compute the derivative

Options for computing gradient, Jacobian, and Hessian derivatives.

A few examples will help illustrate how these fit together.

This loads a package that contains some utility functions:

This defines a function that is only intended to evaluate for numerical values of the variables:

With just Method->"Newton", FindMinimum issues an lstol message because it was not able to resolve the minimum well enough due to lack of good derivative information.

This shows the steps taken by FindMinimum when it has to use finite differences to compute the gradient and Hessian:

The following describes how you can use the gradient option to specify the derivative.

This computes the minimum of f[x,y] using a symbolic expression for its gradient:

Symbolic derivatives are not always available. If you need extra accuracy from finite differences, you can increase the difference order from the default of 1 at the cost of extra function evaluations.

This computes the minimum of f[x,y] using a second-order finite difference to compute the gradient:

Note that the number of function evaluations is much higher because function evaluations are used to compute the gradient, which is used to approximate the Hessian in turn. (The Hessian is computed with finite differences since no symbolic expression for it can be computed from the information given.)

The information given from FindMinimumPlot about the number of function, gradient, and Hessian evaluations is quite useful. The EvaluationMonitor options are what make this possible. Here is an example that simply counts the number of each type of evaluation. (The plot is made using Reap and Sow to collect the values at which the evaluations are done.)

This computes the minimum with counters to keep track of the number of steps and the number of function, gradient, and Hessian evaluations:

Using such diagnostics can be quite useful for determining what methods and/or method parameters may be most successful for a class of problems with similar characteristics.

When the Wolfram Language can access the symbolic structure of the function, it automatically does a structural analysis of the function and its derivatives and uses SparseArray objects to represent the derivatives when appropriate. Since subsequent numerical linear algebra can then use the sparse structures, this can have a profound effect on the overall efficiency of the search. When the Wolfram Language cannot do a structural analysis, it has to assume, in general, that the structure is dense. However, if you know what the sparse structure of the derivative is, you can specify this with the "Sparse" method option and gain huge efficiency advantages, both in computing derivatives (with finite differences, the number of evaluations can be reduced significantly) and in subsequent linear algebra. This issue is particularly important when working with vector-valued variables. A good example for illustrating this aspect is the extended Rosenbrock problem, which has a very simple sparse structure.

This gets the extended Rosenbrock function with 1000 variables in symbolic form ready to be solved with FindRoot using the UnconstrainedProblems` package:

This solves the problem using the symbolic form of the function:

For a function with simple form like this, it is easy to write a vector form of the function, which can be evaluated much more quickly than the symbolic form can, even with automatic compilation.

This defines a vector form of the extended Rosenbrock function, which evaluates very efficiently:

This extracts the starting point as a vector from the problem structure:

This solves the problem using a vector variable and the vector function for evaluation:

The solution with the function, which is faster to evaluate, winds up being slower overall because the Jacobian has to be computed with finite differences since the x_List pattern makes it opaque to symbolic analysis. It is not so much the finite differences that are slow as the fact that it needs to do 100 function evaluations to get all the columns of the Jacobian. With knowledge of the structure, this can be reduced to two evaluations to get the Jacobian. For this function, the structure of the Jacobian is quite simple.

This defines a pattern SparseArray, which has the structure of non-zeros for the Jacobian of the extended Rosenbrock function. (By specifying _ for the values in the rules, the SparseArray is taken to be a template of the Pattern type as indicated in the output form.)

This solves the problem with the knowledge of the actual Jacobian structure, showing a significant cost savings:

When a sparse structure is given, it is also possible to have the value computed by a symbolic expression that evaluates to the values corresponding to the positions given in the sparse structure template. Note that the values must correspond directly to the positions as ordered in the SparseArray (the ordering can be seen using ArrayRules). One way to get a consistent ordering of indices is to transpose the matrix twice, which results in a SparseArray with indices in lexicographic order.

This transposes the nonzero structure matrix twice to get the indices sorted:

This defines a function that will return the nonzero values in the Jacobian corresponding to the index positions in the nonzero structure matrix:

This solves the problem with the resulting sparse symbolic Jacobian:

In this case, using the sparse Jacobian is not significantly faster because the Jacobian is so sparse that a finite difference approximation can be found for it in only two function evaluations and because the problem is well enough defined near the minimum that the extra accuracy in the Jacobian does not make any significant difference.

Variables and Starting Conditions

All the functions FindMinimum, FindMaximum, and FindRoot take variable specifications of the same form. The function FindFit uses the same form for its parameter specifications.

FindMinimum[f,vars]	find a local minimum of f with respect to the variables given in vars
FindMaximum[f,vars]	find a local maximum of f with respect to the variables given in vars
FindRoot[f,vars]	find a root with respect to the variables given in vars
FindRoot[eqns,vars]	find a root of the equations eqns with respect to the variables given in vars
FindFit[data,expr,pars,vars]	find values of the parameters pars that make expr give a best fit to data as a function of vars

Variables and parameters in the "Find" functions.

The list vars (pars for FindFit) should be a list of individual variable specifications. Each variable specification should be of the following form.

{var,st}	variable var has starting value st
{var,st₁,st₂}	variable var has two starting values st₁ and st₂; the second starting condition is only used with the principal axis and secant methods
{var,st,rl,ru}	variable var has starting value st; the search will be terminated when the value of var goes outside of the interval [rl,ru]
{var,st₁,st₂,rl,ru}	variable var has two starting values st₁ and st₂; the search will be terminated when the value of var goes outside of the interval [rl,ru]

Individual variable specifications in the "Find" functions.

The specifications in vars all need to have the same number of starting values. When region bounds are not specified, they are taken to be unbounded, that is, rl=-∞, ru=∞.

Vector- and Matrix-Valued Variables

The most common use of variables is to represent numbers. However, the variable input syntax supports variables that are treated as vectors, matrices, or higher-rank tensors. In general, the "Find" commands, with the exception of FindFit, which currently only works with scalar variables, will consider a variable to take on values with the same rectangular structure as the starting conditions given for it.

Here is a matrix:

This uses FindRoot to find an eigenvalue and corresponding normalized eigenvector for A:

Of course, this is not the best way to compute the eigenvalue, but it does show how the variable dimensions are picked up from the starting values. Since has a starting value of 1, it is taken to be a scalar. On the other hand, is given a starting value, which is a vector of length 3, so it is always taken to be a vector of length 3.

If you use multiple starting values for variables, it is necessary that the values have consistent dimensions and that each component of the starting values is distinct.

This finds a different eigenvalue using two starting conditions for each variable:

One advantage of variables that can take on vector and matrix values is that they allow you to write functions, which can be very efficient for larger problems and/or handle problems of different sizes automatically.

This defines a function that gives an objective function equivalent to the ExtendedRosenbrock problem in the UnconstrainedProblems package. The function expects a value of

that is a matrix with two rows:

Note that since the value of the function would be meaningless unless had the correct structure, the definition is restricted to arguments with that structure. For example, if you defined the function for any pattern x_, then evaluating with an undefined symbol x (which is what FindMinimum does) gives meaningless unintended results. It is often the case that when working with functions for vector-valued variables, you will have to restrict the definitions. Note that the definition above does not rule out symbolic values with the right structure. For example, ExtendedRosenbrockObjective[{{x₁₁,x₁₂},{x₂₁,x₂₂}}] gives a symbolic representation of the function for scalar x₁₁, ….

This uses FindMinimum to solve the problem given a generic value for the problem size. You can change the value of

without changing anything else to solve problems of different size:

The solution did not achieve the default tolerances due to the fact that the Wolfram Language was not able to get symbolic derivatives for the function, so it had to fall back on finite differences that are not as accurate.

A disadvantage of using vector- and matrix-valued variables is that the Wolfram Language cannot currently compute symbolic derivatives for them. Sometimes it is not difficult to develop a function that gives the correct derivative. (Failing that, if you really need greater accuracy, you can use higher-order finite differences.)

This defines a function that returns the gradient for the ExtendedRosenbrockObjective function. Note that the gradient is a vector obtained by flattening the matrix corresponding to the variable positions:

This solves the problem using the symbolic value of the gradient:

Jacobian and Hessian derivatives are often sparse. You can also specify the structural sparsity of these derivatives when appropriate, which can reduce overall solution complexity by quite a bit.

Termination Conditions

Mathematically, sufficient conditions for a local minimum of a smooth function are quite straightforward: is a local minimum if and the Hessian is positive definite. (It is a necessary condition that the Hessian be positive semidefinite.) The conditions for a root are even simpler. However, when the function is being evaluated on a computer where its value is only known, at best, to a certain precision, and practically only a limited number of function evaluations are possible, it is necessary to use error estimates to decide when a search has become close enough to a minimum or a root, and to compute the solution only to a finite tolerance. For the most part, these estimates suffice quite well, but in some cases, they can be in error, usually due to unresolved fine scale behavior of the function.

Tolerances affect how close a search will try to get to a root or local minimum before terminating the search. Assuming that the function itself has some error (as is typical when it is computed with numerical values), it is not typically possible to locate the position of a minimum much better than to half of the precision of the numbers being worked with. This is because of the quadratic nature of local minima. Near the bottom of a parabola, the height varies quite slowly as you move across from the minimum. Thus, if there is any error noise in the function, it will typically mask the actual rise of the parabola over a width roughly equal to the square root of the noise. This is best seen with an example.

This loads a package that contains some utility functions:

The following command displays a sequence of plots showing the minimum of the function

over successively smaller ranges. The curve computed with machine numbers is shown in black; the actual curve (computed with 100 digits of precision) is shown in blue:

From the sequence of plots, it is clear that for changes of order , which is about half of machine precision and smaller, errors in the function are masking the actual shape of the curve near the minimum. With just sampling of the function at that precision, there is no way to be sure if a given point gives the smallest local value of the function or not to any closer tolerance.

The value of the derivative, if it is computed symbolically, is much more reliable, but for the general case, it is not sufficient to rely only on the value of the derivative; the search needs to find a local minimal value of the function where the derivative is small to satisfy the tolerances in general. Note also that if symbolic derivatives of your function cannot be computed and finite differences or a derivative-free method is used, the accuracy of the solution may degrade further.

Root finding can suffer from the same inaccuracies in the function. While it is typically not as severe, some of the error estimates are based on a merit function, which does have a quadratic shape.

For the reason of this limitation, the default tolerances for the Find functions are all set to be half of the final working precision. Depending on how much error the function has, this may or may not be achievable, but in most cases it is a reasonable goal. You can adjust the tolerances using the AccuracyGoal and PrecisionGoal options. When AccuracyGoal->ag and PrecisionGoal->pg, this defines tolerances and .

Given and FindMinimum tries to find a value such that . Of course, since the exact position of the minimum, , is not known, the quantity is estimated. This is usually done based on past steps and derivative values. To match the derivative condition at a minimum, the additional requirement is imposed. For FindRoot, the corresponding condition is that just the residual be small at the root: .

This finds the

to at least 12 digits of accuracy, or within a tolerance of

. The precision goal of

means that

, so it does not have any effect in the formula. (Note: you cannot similarly set the accuracy goal to

since that is always used for the size of the residual.)

This shows that the result satisfied the requested error tolerances:

This tries to find the minimum of the function

to 8 digits of accuracy. FindMinimum gives a warning message because of the error in the function as seen in the plots:

This shows that though the value at the minimum was found to be basically machine epsilon, the position was only found to the order of

or so:

In multiple dimensions, the situation is even more complicated since there can be more error in some directions than others, such as when a minimum is found along a relatively narrow valley, as in the Freudenstein–Roth problem. For searches such as this, often the search parameters are scaled, which in turn affects the error estimates. Nonetheless, it is still typical that the quadratic shape of the minimum affects the realistically achievable tolerances.

When you need to find a root or minimum beyond the default tolerances, it may be necessary to increase the final working precision. You can do this with the WorkingPrecision option. When you use WorkingPrecision->prec, the search starts at the precision of the starting values and is adaptively increased up to prec as the search converges. By default, WorkingPrecision->MachinePrecision, so machine numbers are used, which are usually much faster. Going to higher precision can take significantly more time, but can get you much more accurate results if your function is defined in an appropriate way. For very high-precision solutions, Newton's method is recommended because its quadratic convergence rate significantly reduces the number of steps ultimately required.

It is important to note that increasing the setting of the WorkingPrecision option does no good if the function is defined with lower-precision numbers. In general, for WorkingPrecision->prec to be effective, the numbers used to define the function should be exact or at least of precision prec. When possible, the precision of numbers in the function is artificially raised to prec using SetPrecision so that convergence still works, but this is not always possible. In any case, when the functions and derivatives are evaluated numerically, the precision of the results is raised to prec if necessary so that the internal arithmetic can be done with prec digit precision. Even so, the actual precision or accuracy of the root or minimum and its position is limited by the accuracy in the function. This is especially important to keep in mind when using FindFit, where data is usually only known up to a certain precision.

Here is a function defined using machine numbers:

Even with higher working precision, the minimum cannot be resolved better because the actual function still has the same errors as shown in the plots. The derivatives were specified to keep other things consistent with the computation at machine precision shown previously:

Here is the computation done with 20-digit precision when the function does not have machine numbers:

If you specify WorkingPrecision->prec, but do not explicitly specify the AccuracyGoal and PrecisionGoal options, then their default settings of Automatic will be taken to be AccuracyGoal->prec/2 and PrecisionGoal->prec/2. This leads to the smallest tolerances that can realistically be expected in general, as discussed earlier.

Here is the computation done with 50-digit precision without an explicitly specified setting for the AccuracyGoal or PrecisionGoal options:

This shows that though the value at the minimum was actually found to be even better than the default 25-digit tolerances:

The following table shows a summary of the options affecting precision and tolerance.

option name	default value
WorkingPrecision	MachinePrecision	the final working precision, prec, to use; precision is adaptively increased from the smaller of prec and the precision of the starting conditions to prec
AccuracyGoal	Automatic	setting ag determines an absolute tolerance by tol_a=10^-ag; when Automatic, ag=prec/2
PrecisionGoal	Automatic	setting pg determines an absolute tolerance by tol_r=10^-pg; when Automatic, pg=prec/2

Precision and tolerance options in the "Find" functions.

A search will sometimes converge slowly. To prevent slow searches from going on indefinitely, the Find commands all have a maximum number of iterations (steps) that will be allowed before terminating. This can be controlled with the option MaxIterations that has the default value MaxIterations->100. When a search terminates with this condition, the command will issue the cvmit message.

This gets the Brown–Dennis problem from the Optimization`UnconstrainedProblems` package:

This attempts to solve the problem with the default method, which is the Levenberg–Marquardt method, since the function is a sum of squares:

The Levenberg–Marquardt method is converging slowly on this problem because the residual is nonzero near the minimum and the second-order part of the Hessian is needed. While the method eventually does converge in just under 400 steps, perhaps a better option is to use a method which may converge faster.

In a larger calculation, one possibility when hitting the iteration limit is to use the final search point, which is returned, as a starting condition for continuing the search, ideally with another method.

Symbolic Evaluation

The functions FindMinimum, FindMaximum, and FindRoot have the HoldAll attribute and so have special semantics for evaluation of their arguments. First, the variables are determined from the second argument, then they are localized. Next, the function is evaluated symbolically, then processed into an efficient form for numerical evaluation. Finally, during the execution of the command, the function is repeatedly evaluated with different numerical values. Here is a list showing these steps with additional description.

Determine variables	process the second argument; if the second argument is not of the correct form (a list of variables and starting values), it will be evaluated to get the correct form
Localize variables	in a manner similar to Block and Table, add rules to the variables so that any assignments given to them will not affect your Wolfram System session beyond the scope of the "Find" command and so that previous assignments do not affect the value (the variable will evaluate to itself at this stage)
Evaluate the function	with the locally undefined (symbolic) values of the variables, evaluate the first argument (function or equations). Note: this is a change which was instituted in version 5, so some adjustments may be necessary for code that ran in previous versions. If your function is such that symbolic evaluation will not keep the function as intended or will be prohibitively slow, you should define your function so that it only evaluates for numerical values of the variables. The simplest way to do this is by defining your function using PatternTest (?), as in f[x_?NumberQ]:=definition.
Preprocess the function	analyze the function to help determine the algorithm to use (e.g., sum of squares -> Levenberg–Marquardt); optimize and compile the function for faster numerical evaluation if possible: for FindRoot this first involves going from equations to a function
Compute derivatives	compute any needed symbolic derivatives if possible; otherwise, do preprocessing needed to compute derivatives using finite differences
Evaluate numerically	repeatedly evaluate the function (and derivatives when required) with different numerical values

Steps in processing the function for the "Find" commands.

FindFit does not have the HoldAll attribute, so its arguments are all evaluated before the commands begin. However, it uses all of the stages described above, except instead of evaluating the function, it constructs a function to minimize from the model function, variables, and provided data.

You will sometimes want to prevent symbolic evaluation, most often when your function is not an explicit formula, but a value derived through running through a program. An example of what happens and how to prevent the symbolic evaluation is shown.

This attempts to solve a simple boundary value problem numerically using shooting:

The command fails because of the symbolic evaluation of the function. You can see what happens when you evaluate it inside of Block.

This evaluates the function given to FindRoot with a local (undefined) value of xp:

Of course, this is not at all what was intended for the function; it does not even depend on xp. What happened is that without a numerical value for xp, NDSolve fails, so ReplaceAll (/.) fails because there are no rules. First just returns its first argument, which is x[1]. Since the function is meaningless unless xp has numerical values, it should be properly defined.

This defines a function that returns the value x[1] as a function of a numerical value for x'[t] at t=-1:

An advantage of having a simple function definition outside of FindRoot is that it can independently be tested to make sure that it is what you really intended.

This makes a plot of fx1:

From the plot, you can deduce two bracketing values for the root, so it is possible to take advantage of Brent's method to quickly and accurately solve the problem.

This solves the shooting problem:

It may seem that symbolic evaluation just creates a bother since you have to define the function specifically to prevent it. However, without symbolic evaluation, it is hard for the Wolfram Language to take advantage of its unique combination of numerical and symbolic power. Symbolic evaluation means that the commands can consistently take advantage of benefits that come from symbolic analysis, such as algorithm determination, automatic computation of derivatives, automatic optimization and compilation, and structural analysis.