Minimization of a function.

 

In many cases we are interested in finding the minimum of a function $f(\mathbf{x})$, where $\mathbf{x}$ is a vector of coordinates. We shall motivate the study of this problem by considering the linear problem:

 

$$A\mathbf{x} = \mathbf{b}$$

 

where $A$ is a matrix and $\mathbf{b}$ is a vector. This is a widely known and used task. A formal solution to this problem can be written as $\mathbf{x} = A^{-1}\mathbf{b}$, which requires the inversion of the matrix $A$. Inversion of a matrix can be an expensive operation if the size of the matrix is large; its cost is (in general) proportional to $N^{3}$, where $N$ is the dimensionality of the matrix.

Clearly, for large matrices the computations become formidable.
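For a sense of scale, the formal solution can be computed with a dense direct solver, whose cost grows roughly as $N^{3}$. A minimal sketch in Python/NumPy follows (the small random test system here is chosen purely for illustration and is not part of the original text):

```python
import numpy as np

# A small symmetric positive-definite example system A x = b
# (the particular numbers are arbitrary, chosen only for illustration).
N = 4
rng = np.random.default_rng(0)
M = rng.standard_normal((N, N))
A = M @ M.T + N * np.eye(N)        # symmetric positive definite
b = rng.standard_normal(N)

# "Formal" solution x = A^{-1} b via a dense direct solve (O(N^3) work).
x_direct = np.linalg.solve(A, b)
print(np.allclose(A @ x_direct, b))   # expect True
```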

 

It is possible to rewrite the above linear problem as a minimization problem in which we seek the minimum of the function

$$f(\mathbf{x}) = \tfrac{1}{2}\,\mathbf{x}^{T} A\,\mathbf{x} - \mathbf{b}^{T}\mathbf{x} .$$

A minimum of $f$ occurs where the gradient vanishes, $\nabla f(\mathbf{x}) = A\mathbf{x} - \mathbf{b} = 0$ (taking $A$ symmetric), which is exactly the linear system we are looking to solve. The gradient of $f$, $\nabla f$, is the vector of derivatives of $f$ with respect to all the components of $\mathbf{x}$, $\nabla f = (\partial f/\partial x_1, \ldots, \partial f/\partial x_N)$; it defines the direction of maximum change of $f$. In the simplest approach we perform a search for a minimum along the single dimension defined by the direction of $\nabla f$.
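As a quick numerical sanity check (a sketch with an illustrative random $A$ and $\mathbf{b}$, not part of the original text), one can verify that the gradient of $f(\mathbf{x}) = \tfrac{1}{2}\mathbf{x}^{T}A\mathbf{x} - \mathbf{b}^{T}\mathbf{x}$ is indeed $A\mathbf{x} - \mathbf{b}$ by comparing it with a finite-difference derivative:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4
M = rng.standard_normal((N, N))
A = M @ M.T + N * np.eye(N)        # symmetric positive definite
b = rng.standard_normal(N)

def f(x):
    """Quadratic form f(x) = 1/2 x^T A x - b^T x."""
    return 0.5 * x @ A @ x - b @ x

def grad_f(x):
    """Analytic gradient of f: A x - b."""
    return A @ x - b

# Compare the analytic gradient with central finite differences.
x0 = rng.standard_normal(N)
eps = 1e-6
fd = np.array([(f(x0 + eps * e) - f(x0 - eps * e)) / (2 * eps)
               for e in np.eye(N)])
print(np.allclose(fd, grad_f(x0), atol=1e-5))   # expect True
```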

 

If we come back to the specific example above, we have

$$\nabla f(\mathbf{x}_0) \equiv \mathbf{g}_0 = A\mathbf{x}_0 - \mathbf{b} .$$

The new, partially optimized coordinate is searched for along the $\mathbf{g}_0$ direction,

$$\mathbf{x}_1 = \mathbf{x}_0 - \lambda_0\,\mathbf{g}_0 .$$

We need to determine the single unknown $\lambda_0$ such that the function $f(\mathbf{x}_1)$ will be at a minimum along the line defined by $\mathbf{x}_0$ and $\mathbf{g}_0$.

At the minimum along that line the scalar product of the gradient of the function and the search direction $\mathbf{g}_0$ is zero. Using $\nabla f(\mathbf{x}_1) = A\mathbf{x}_1 - \mathbf{b} = \mathbf{g}_0 - \lambda_0 A\,\mathbf{g}_0$, we therefore have

$$\nabla f(\mathbf{x}_1)\cdot\mathbf{g}_0 = (\mathbf{g}_0 - \lambda_0 A\,\mathbf{g}_0)\cdot\mathbf{g}_0 = 0
\qquad\Longrightarrow\qquad
\lambda_0 = \frac{\mathbf{g}_0\cdot\mathbf{g}_0}{\mathbf{g}_0\cdot A\,\mathbf{g}_0} .$$

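This orthogonality can be checked numerically. The short sketch below (using an illustrative random symmetric positive definite system that is not part of the original text) performs one exact line minimization and verifies that the new gradient is perpendicular to the old search direction:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 4
M = rng.standard_normal((N, N))
A = M @ M.T + N * np.eye(N)        # symmetric positive definite
b = rng.standard_normal(N)

x0 = np.zeros(N)
g0 = A @ x0 - b                    # gradient at x0
lam0 = (g0 @ g0) / (g0 @ A @ g0)   # exact line-minimization step length
x1 = x0 - lam0 * g0                # move to the minimum along -g0

g1 = A @ x1 - b                    # gradient at the new point
print(abs(g1 @ g0) < 1e-8)         # new gradient orthogonal to old direction
```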
Exactly the same procedure can be repeated at the new point $\mathbf{x}_1$. We compute $\mathbf{g}_1 = \nabla f(\mathbf{x}_1)$ and search for $\lambda_1$ such that $f(\mathbf{x}_1 - \lambda_1\mathbf{g}_1)$ is a minimum along the line defined by $\mathbf{x}_1$ and $\mathbf{g}_1$. Such a search, defined by the local gradient, is referred to as a steepest descent search. This search is not efficient, since the minimization along the $\mathbf{g}_1$ direction may bring us to a new point where the gradient again has a component along $\mathbf{g}_0$. This means that at some point we will have to go back and minimize along $\mathbf{g}_0$ once more.

It would be nice if we could set the search directions in such a way that once we have minimized along a direction, we are never required to minimize along that direction again. With such a wonderful algorithm, a system of $N$ dimensions would reach the minimum after $N$ line minimizations. The computational cost is then that of operating with the matrix on a coordinate vector $N$ times. If the matrix is sparse, this can be far cheaper than matrix inversion.
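A minimal steepest descent loop for the quadratic problem might look as follows (a sketch only; the function name `steepest_descent`, the tolerance, and the random test system are illustrative assumptions, and $A$ is assumed symmetric positive definite):

```python
import numpy as np

def steepest_descent(A, b, x0, tol=1e-10, max_iter=1000):
    """Minimize f(x) = 1/2 x^T A x - b^T x (A symmetric positive definite)
    by repeated exact line minimizations along the local gradient."""
    x = x0.copy()
    for _ in range(max_iter):
        g = A @ x - b                    # local gradient
        if np.linalg.norm(g) < tol:
            break
        lam = (g @ g) / (g @ A @ g)      # exact step length along -g
        x = x - lam * g
    return x

# Illustrative random test system.
rng = np.random.default_rng(2)
N = 20
M = rng.standard_normal((N, N))
A = M @ M.T + N * np.eye(N)
b = rng.standard_normal(N)

x = steepest_descent(A, b, np.zeros(N))
print(np.allclose(A @ x, b, atol=1e-6))   # expect True
```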

 

The Conjugate Gradient (CG) algorithm does exactly that for a quadratic system, and the function we constructed above is indeed quadratic. There is no such guarantee for functions that are not quadratic in the variables; however, close to a minimum any smooth function is approximately quadratic, and CG tends to work quite well in general.

 

How does CG do it?

 

Rather than minimizing along $\mathbf{g}_1$, we minimize along a new direction $\mathbf{h}_1$. The direction $\mathbf{h}_1$ is set in such a way that at the minimum along the line defined by $\mathbf{x}_1$ and $\mathbf{h}_1$ the function is minimized with respect to the previous direction $\mathbf{h}_0 = \mathbf{g}_0$ as well. $\mathbf{h}_1$ is constructed as a mix of the previous direction and the current gradient, $\mathbf{h}_1 = \mathbf{g}_1 + \gamma_0\,\mathbf{h}_0$. The unknown parameter $\gamma_0$ is determined from the requirement that the gradient of the function at the new point $\mathbf{x}_2 = \mathbf{x}_1 - \lambda_1\,\mathbf{h}_1$ be orthogonal to $\mathbf{h}_0$ as well as to $\mathbf{h}_1$. Hence we have two unknowns, $\gamma_0$ and $\lambda_1$, and two conditions to satisfy, $\nabla f(\mathbf{x}_2)\cdot\mathbf{h}_1 = 0$ and $\nabla f(\mathbf{x}_2)\cdot\mathbf{h}_0 = 0$. Solving these conditions we obtain

$$\gamma_0 = \frac{\mathbf{g}_1\cdot\mathbf{g}_1}{\mathbf{g}_0\cdot\mathbf{g}_0}
\qquad\text{and}\qquad
\lambda_1 = \frac{\mathbf{g}_1\cdot\mathbf{h}_1}{\mathbf{h}_1\cdot A\,\mathbf{h}_1} .$$

The next point is

$$\mathbf{x}_2 = \mathbf{x}_1 - \lambda_1\,\mathbf{h}_1 ,$$

and so on…
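Below is a compact sketch of the resulting CG iteration, written with the conventions used here ($\mathbf{g}$ for the gradient, $\mathbf{h}$ for the search direction) and the Fletcher–Reeves form of the mixing coefficient $\gamma$; the function name and the random test system are illustrative, and $A$ is assumed symmetric positive definite:

```python
import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-10):
    """Solve A x = b (A symmetric positive definite) by minimizing
    f(x) = 1/2 x^T A x - b^T x along successive conjugate directions."""
    x = x0.copy()
    g = A @ x - b                          # gradient at the current point
    h = g.copy()                           # first direction: steepest descent
    for _ in range(len(b)):                # at most N line minimizations
        if np.linalg.norm(g) < tol:
            break
        Ah = A @ h
        lam = (g @ h) / (h @ Ah)           # exact line minimization along -h
        x = x - lam * h
        g_new = A @ x - b                  # gradient at the new point
        gamma = (g_new @ g_new) / (g @ g)  # mixing coefficient (Fletcher-Reeves)
        h = g_new + gamma * h              # next conjugate direction
        g = g_new
    return x

# Illustrative random test system.
rng = np.random.default_rng(3)
N = 50
M = rng.standard_normal((N, N))
A = M @ M.T + N * np.eye(N)
b = rng.standard_normal(N)

x = conjugate_gradient(A, b, np.zeros(N))
print(np.allclose(A @ x, b, atol=1e-6))   # converges within N steps
```

Note that each iteration costs essentially one matrix–vector product, which is why CG is attractive for large, sparse systems.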
