On iterative Krylov-dogleg trust-region 
steps for solving neural networks 
nonlinear least squares problems 
Eiji Mizutani 
Department of Computer Science 
National Tsing Hua University
Hsinchu, 30043 TAIWAN R.O.C. 
eiji@wayne.cs.nthu.edu.tw
James W. Demmel 
Mathematics and Computer Science 
University of California at Berkeley, 
Berkeley, CA 94720 USA 
demmel@cs.berkeley.edu
Abstract 
This paper describes a method of dogleg trust-region steps, or restricted
Levenberg-Marquardt steps, based on a projection process onto the Krylov
subspaces for neural networks nonlinear least squares problems. In
particular, the linear conjugate gradient (CG) method works as the inner
iterative algorithm for solving the linearized Gauss-Newton normal equation,
whereas the outer nonlinear algorithm repeatedly takes so-called
"Krylov-dogleg" steps, relying only on matrix-vector multiplication without
explicitly forming the Jacobian matrix or the Gauss-Newton model Hessian.
That is, our iterative dogleg algorithm can reduce both operation counts and
memory space by a factor of O(n), where n is the number of parameters, in
comparison with a direct linear-equation solver. This memory-less property
is useful for large-scale problems.
1 Introduction
We consider the so-called neural networks nonlinear least squares problem¹,
wherein the objective is to optimize the n weight parameters of neural
networks (NN) [e.g., multilayer perceptrons (MLP)], denoted by an
n-dimensional vector $\theta$, by minimizing the following:

$$E(\theta) = \frac{1}{2} \sum_{p=1}^{m} \left( a_p(\theta) - t_p \right)^2 = \frac{1}{2}\, r^T(\theta)\, r(\theta), \tag{1}$$

where $a_p(\theta)$ is the MLP output for the pth training data pattern and
$t_p$ is the desired output. (Of course, these become vectors for a
multiple-output MLP.) Here $r(\theta)$ denotes the m-dimensional residual
vector composed of $r_i(\theta)$, i = 1, ..., m, for all m training data.
¹ The posed problem can be viewed as an implicitly constrained optimization
problem as long as hidden-node outputs are produced by sigmoidal "squashing"
functions [1]. Our algorithm exploits the special structure of the
sum-of-squared-error measure in Equation (1); hence, other objective
functions are outside the scope of this paper.
The gradient vector and Hessian matrix are given by $g = g(\theta) = J^T r$
and $H = H(\theta) = J^T J + S$, where J is the m × n Jacobian matrix of r,
and S denotes the matrix of second-derivative terms. If S is simply omitted
based on the "small residual" assumption, then the Hessian matrix reduces to
the Gauss-Newton model Hessian: i.e., $J^T J$. Furthermore, a family of
quasi-Newton methods can be applied to approximate term S alone, leading to
the augmented Gauss-Newton model Hessian (see, for example, Mizutani [2] and
references therein).
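The relation between the objective, the residual vector, the gradient
$J^T r$, and the Gauss-Newton model Hessian $J^T J$ can be checked numerically
on a toy model. The following is a minimal sketch, not the authors' code: it
assumes a single tanh output node with no hidden layer, and the data `X`, `t`
and the helper names `residual`, `jacobian`, and `objective` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 8, 3                                   # m data, n parameters (toy sizes)
X = rng.standard_normal((m, n))               # hypothetical input patterns
t = rng.choice([-1.0, 1.0], size=m)           # hypothetical desired outputs
theta = rng.uniform(-0.3, 0.3, size=n)        # initial parameters

def residual(theta):
    """r_p(theta) = a_p(theta) - t_p for a single tanh node (assumed model)."""
    return np.tanh(X @ theta) - t

def jacobian(theta):
    """Row p is d r_p / d theta = (1 - a_p^2) * x_p."""
    a = np.tanh(X @ theta)
    return (1.0 - a**2)[:, None] * X

def objective(theta):
    """E(theta) = 0.5 * r^T r, as in Equation (1)."""
    r = residual(theta)
    return 0.5 * (r @ r)

r = residual(theta)
J = jacobian(theta)
g = J.T @ r          # gradient of E
H_gn = J.T @ J       # Gauss-Newton model Hessian (term S omitted)

# Central finite differences as an independent check of g = J^T r.
eps = 1e-6
g_fd = np.array([
    (objective(theta + eps * e) - objective(theta - eps * e)) / (2 * eps)
    for e in np.eye(n)
])
```

Here `g` and `g_fd` should agree to within finite-difference accuracy, and
`H_gn` is symmetric by construction.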
With any form of the aforementioned Hessian matrices, we can collectively
write the following Newton formula to determine the next step $\delta$ in the
course of the Newton iteration for $\theta_{next} = \theta_{now} + \delta$:

$$H \delta = -g. \tag{2}$$
This linear system can be solved by a direct solver in conjunction with a
suitable matrix factorization. However, typical criticisms towards the direct
algorithm are:
• It is expensive to form and solve the linear equation (2), which requires
  $O(mn^2)$ operations when m > n;
• It is expensive to store the (symmetric) Hessian matrix H, which requires
  $n(n+1)/2$ memory storage.
These issues may become much more serious for a large-scale problem.
In light of the vast literature on nonlinear optimization, this paper
describes how to alleviate these concerns by solving the Newton formula (2)
approximately with iterative methods, which form a family of inexact (or
truncated) Newton methods (see Dembo & Steihaug [3], for instance). An
important subclass of the inexact Newton methods are Newton-Krylov methods.
In particular, this paper focuses on a Newton-CG-type algorithm, wherein the
linear Gauss-Newton normal equation,

$$(J^T J)\, \delta = -J^T r, \tag{3}$$

is solved iteratively by the linear conjugate gradient method (known as CGNR)
for a dogleg trust-region implementation of the well-known
Levenberg-Marquardt algorithm; hence, the name "dogleg trust-region
Gauss-Newton-CGNR" algorithm, or "iterative Krylov-dogleg" method (similar to
Steihaug [4]; Toint [5]).
2 Direct Dogleg Trust-Region Algorithms 
In the NN literature, several variants of the Levenberg-Marquardt algorithm
equipped with a direct linear-equation solver, particularly Marquardt's
original method, have been recognized as instrumental and promising
techniques; see, for example, Demuth & Beale [6]; Masters [7]; Shepherd [8].
They are based on a simple direct control of the Levenberg-Marquardt
parameter $\mu$ in $(H + \mu I)\,\delta = -g$, although such a simple
$\mu$-control can cause a number of problems, because of a complicated
relation between parameter $\mu$ and its associated step length (see
Mizutani [9]).
Alternatively, a more efficient dogleg algorithm [10] can be employed that
takes, depending on the size of trust region R, the Newton step
$\delta_{Newton}$ [i.e., the solution of Eq. (2)], the (restricted) Cauchy
step $\delta_{Cauchy}$, or an intermediate dogleg step:

$$\delta_{dogleg} \stackrel{\text{def}}{=} \delta_{Cauchy} + h\,(\delta_{Newton} - \delta_{Cauchy}), \tag{4}$$

which achieves a piecewise linear approximation to a trust-region step, or a
restricted Levenberg-Marquardt step. Note that $\delta_{Cauchy}$ is the step
that minimizes the local quadratic model in the steepest descent direction
(i.e., Eq. (8) with k = 1). For details on Equation (4), refer to Powell [10];
Mizutani [9, 2].
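The three cases of the direct dogleg step can be sketched as follows. This is
a minimal illustration of the textbook form of Powell's dogleg, assuming a
symmetric positive definite model Hessian; the function name `dogleg_step` and
the choice of h as the root of $\|\delta_{Cauchy} + h(\delta_{Newton} -
\delta_{Cauchy})\| = R$ in (0, 1) are assumptions of this sketch rather than
details stated in the paper.

```python
import numpy as np

def dogleg_step(g, H, R):
    """Return a dogleg step for gradient g, SPD model Hessian H, and radius R."""
    # Newton step: solution of H delta = -g, Equation (2).
    d_newton = np.linalg.solve(H, -g)
    if np.linalg.norm(d_newton) <= R:
        return d_newton

    # Cauchy step: minimizer of the quadratic model along -g.
    d_cauchy = -((g @ g) / (g @ (H @ g))) * g
    norm_cauchy = np.linalg.norm(d_cauchy)
    if norm_cauchy >= R:
        # Restricted Cauchy step: scaled back onto the trust-region boundary.
        return (R / norm_cauchy) * d_cauchy

    # Intermediate step, Equation (4): solve a*h^2 + b*h + c = 0 for h in (0, 1).
    v = d_newton - d_cauchy
    a = v @ v
    b = 2.0 * (d_cauchy @ v)
    c = d_cauchy @ d_cauchy - R**2          # negative because ||d_cauchy|| < R
    h = (-b + np.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)
    return d_cauchy + h * v

# Toy example (hypothetical numbers).
g_demo = np.array([1.0, 1.0])
H_demo = np.diag([1.0, 10.0])
step_newton = dogleg_step(g_demo, H_demo, R=2.0)    # Newton step fits inside
step_dogleg = dogleg_step(g_demo, H_demo, R=0.5)    # intermediate dogleg step
step_cauchy = dogleg_step(g_demo, H_demo, R=0.1)    # restricted Cauchy step
```

In the intermediate case the returned step lies exactly on the trust-region
boundary, which is what makes the path a piecewise linear approximation.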
When we consider the Gauss-Newton step for $\delta_{Newton}$ in Equation (4),
we must solve the overdetermined linear least squares problem:
$\min_{\delta} \|r + J\delta\|_2$, for which three principal direct
linear-equation solvers are:
(1) Normal equation approach (typically with Cholesky decomposition);
(2) QR decomposition approach to $J\delta = -r$;
(3) Singular value decomposition (SVD) approach to $J\delta = -r$ (only
    recommended when J is nearly rank-deficient).
Among those three direct solvers, approach (1) to Equation (3) is fastest.
(For more details, refer to Demmel [11], Chapters 2 and 3.) In a highly
overdetermined case (with a large data set; i.e., m >> n), the dominant cost
in approach (1) is the $mn^2$ operations to form the Gauss-Newton model
Hessian by:

$$J^T J = \sum_{i=1}^{m} u_i u_i^T, \tag{5}$$

where $u_i^T$ is the ith row vector of J. This cost might be prohibitive even
with enough storage for $J^T J$. Therefore, to overcome this limitation of
direct solvers for Equation (3), we consider an iterative scheme in the next
section.
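For comparison, the three direct approaches can be tried on a small random
problem. This is only an illustrative sketch with hypothetical data `J` and
`r`; it uses NumPy's dense routines (`numpy.linalg.cholesky`,
`numpy.linalg.qr`, and the SVD-based `numpy.linalg.lstsq`) and is not meant to
reflect the implementation used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 5                         # overdetermined toy problem, m >> n
J = rng.standard_normal((m, n))       # hypothetical Jacobian
r = rng.standard_normal(m)            # hypothetical residual vector

# (1) Normal equations: form J^T J row by row as in Equation (5),
#     which is the O(m n^2) part, then solve with a Cholesky factor.
H = np.zeros((n, n))
for u in J:                           # u is the ith row of J
    H += np.outer(u, u)
L = np.linalg.cholesky(H)             # H = L L^T
y = np.linalg.solve(L, -(J.T @ r))
delta_chol = np.linalg.solve(L.T, y)

# (2) QR decomposition applied to J delta = -r.
Q, R_factor = np.linalg.qr(J)         # reduced QR: Q is m x n, R_factor is n x n
delta_qr = np.linalg.solve(R_factor, Q.T @ (-r))

# (3) SVD-based least squares solver.
delta_svd, *_ = np.linalg.lstsq(J, -r, rcond=None)
```

On a well-conditioned problem such as this one, all three solutions agree to
rounding error; they differ in cost and in robustness when J is nearly
rank-deficient.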
3 Iterative Krylov-Dogleg Algorithm 
The iterative Krylov-dogleg step approximates a trust-region step by
iteratively approximating the Levenberg-Marquardt trajectory in the Krylov
subspace via linear conjugate gradient iterates until the approximate
trajectory hits the trust-region boundary; i.e., a CG iterate falls outside
the trust-region boundary. In this context, the linear CGNR method is not
intended to approximate the full Gauss-Newton step [i.e., the solution of
Eq. (3)]. Therefore, the required number of CGNR iterations might be kept
small [see Section 4].
The iterative process for the linear-equation solution sequence
$\{\delta_k\}$ is called the inner² iteration, whereas the solution sequence
$\{\theta_k\}$ from the Krylov-dogleg algorithm is generated by the outer
iteration (or epoch), as shown in Figure 1. We now describe the inner
iteration algorithm, which is identical to the standard linear CG algorithm
(see Demmel [11], pages 311-312) except steps 2, 4, and 5:
Algorithm 3.1: The inner iteration of the Krylov-dogleg algorithm (see Figure 1).
1. Initialization:
   $$\delta_0 = 0; \quad d_1 = r_0 = -g_{now}, \quad \text{and} \quad k = 1. \tag{6}$$
2. Matrix-vector product (compare Eq. (5) and see Algorithm 3.2):
   $$z = H_{now} d_k = J_{now}^T (J_{now} d_k) = \sum_{i=1}^{m} (u_i^T d_k)\, u_i. \tag{7}$$
3. Analytical step size:
   $$\eta_k = \frac{r_{k-1}^T r_{k-1}}{d_k^T z}. \tag{8}$$
4. Approximate solution: $\delta_k = \delta_{k-1} + \eta_k d_k$.
   If $\|\delta_k\| < R_{now}$, then go on to the next step 5; otherwise compute
   $$\delta_k = R_{now} \frac{\delta_k}{\|\delta_k\|}, \tag{9}$$
   and terminate.
5. Linear-system residual: $r_k = r_{k-1} - \eta_k z$.
   If $\|r_k\|_2$ is small enough, then set $R_{now} \leftarrow \|\delta_k\|$
   and terminate. Otherwise, continue with step 6.
6. Improvement: $\beta_{k+1} = \dfrac{r_k^T r_k}{r_{k-1}^T r_{k-1}}$.
7. Search direction: $d_{k+1} = r_k + \beta_{k+1} d_k$. Then, set k = k + 1
   and go back to step 2.

[Figure 1 flowchart; recoverable box labels: "Inner Iteration (Algorithm 3.1)",
"Evaluate $E(\theta_{next})$", "Update $R_{now}$", "Decrease $R_{now}$", and
"Algorithm for local-model check", with YES/NO branches.]
Figure 1: The algorithmic flow of an iterative Krylov-dogleg algorithm. For
detailed procedures in the three dotted rectangular boxes, refer to Mizutani
and Demmel [12] and Algorithm 3.1 in text.

² Nonlinear conjugate gradient methods, such as Polak-Ribière's CG (see
Mizutani and Jang [13]) and Møller's scaled CG [14], are also widely employed
for training MLPs, but those nonlinear versions attempt to approximate the
entire Hessian matrix by generating the solution sequence $\{\theta_k\}$
directly as the outer nonlinear algorithm. Thus, they ignore the special
structure of the nonlinear least squares problem; so does Pearlmutter's
method [15] to the Newton formula, although its modification may be possible.
The first step given by Equation (8) is always the Cauchy step
$\delta_{Cauchy}$, moving $\theta_{now}$ to the Cauchy point $\theta_{Cauchy}$
when $R_{now} > \|\delta_{Cauchy}\|$. Then, departing from $\theta_{Cauchy}$,
the linear CG constructs a Krylov-dogleg trajectory (by adding a CG point one
by one) towards the Gauss-Newton point $\theta_{Newton}$ until the constructed
trajectory hits the trust-region boundary (i.e., $\|\delta_k\| \ge R_{now}$ is
satisfied in step 4), or until the linear-system residual becomes small in
step 5 (unlikely to occur for small forcing terms; e.g., 0.01). In this way,
the algorithm computes a vector between the steepest descent direction and
the Gauss-Newton direction, resulting in an approximate Levenberg-Marquardt
step in the Krylov subspace.
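The inner iteration can be sketched in a few lines. The following is a
minimal illustration under stated assumptions, not the authors' code: an
iterate that leaves the trust region is scaled back to the boundary, "small
enough" in step 5 is read as a relative residual below a forcing term, and the
names `krylov_dogleg_inner`, `matvec`, `forcing`, and `max_iter` are
hypothetical.

```python
import numpy as np

def krylov_dogleg_inner(matvec, g_now, R_now, forcing=0.01, max_iter=50):
    """Truncated linear CG on H delta = -g; returns (delta, R, iterations)."""
    delta = np.zeros_like(g_now)            # step 1: delta_0 = 0
    r = -g_now                              #         r_0 = -g_now
    d = r.copy()                            #         d_1 = r_0
    r0_norm = np.linalg.norm(r)
    if r0_norm == 0.0:                      # already stationary
        return delta, R_now, 0
    for k in range(1, max_iter + 1):
        z = matvec(d)                       # step 2: z = J^T (J d)
        rr_old = r @ r
        eta = rr_old / (d @ z)              # step 3: analytical step size
        delta = delta + eta * d             # step 4: approximate solution
        norm_delta = np.linalg.norm(delta)
        if norm_delta >= R_now:             # trajectory hit the boundary
            return R_now * delta / norm_delta, R_now, k
        r = r - eta * z                     # step 5: linear-system residual
        if np.linalg.norm(r) <= forcing * r0_norm:
            return delta, norm_delta, k     # radius set to the step length
        beta = (r @ r) / rr_old             # step 6: improvement
        d = r + beta * d                    # step 7: next search direction
    return delta, R_now, max_iter

# Toy usage with an explicit Jacobian (hypothetical data); in the matrix-free
# setting, matvec would sweep the training data instead.
rng = np.random.default_rng(0)
J = rng.standard_normal((30, 4))
res = rng.standard_normal(30)
g = J.T @ res

def matvec(d):
    return J.T @ (J @ d)

step_small, R_small, k_small = krylov_dogleg_inner(matvec, g, R_now=1e-3)
step_full, R_full, k_full = krylov_dogleg_inner(matvec, g, R_now=1e6, forcing=1e-10)
```

With a tiny radius the routine stops after one iteration with a restricted
Cauchy step along $-g$; with a very large radius and a tight forcing term it
approaches the Gauss-Newton step.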
In step 2, the matrix-vector multiplication of Hd in Equation (7) can be performed 
with neither the Jacobian nor Hessian matrices explicitly required, keeping only 
several n-dimensional vectors in memory at the same time, as shown next: 
Algorithm 3.2: Matrix-vector multiplication step.
for i = 1 to m; i.e., one sweep of all training data:
  (a) do forward propagation to compute the MLP output $a_i(\theta)$ for datum i;
  (b) do backpropagation³ to obtain the ith row vector $u_i^T$ of matrix J;
  (c) compute $(u_i^T d)\, u_i$ and add it to z;
end for.
For one sweep of all m data, each of steps (a) and (b) costs at least 2mn
(plus additional costs that depend on the MLP architectures) and step (c)
[i.e., Eq. (7)] costs 4mn. Hence, the overall cost of the inner iteration
(Algorithm 3.1) can be kept as O(mn), especially when the number of inner
iterations is small owing to our strategy of upper-bounded trust-region radii
(e.g., $R_{upper} = 1$ for the parity problem). Note for "Algorithm for
local-model check" in Figure 1 that evaluating $\rho_{now}$ (a ratio between
the actual error reduction and the reduction predicted by the current local
quadratic model) needs a procedure similar to Algorithm 3.2. For more details
on the algorithm in Figure 1, refer to Mizutani and Demmel [12].
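The sweep of Algorithm 3.2 can be sketched as follows. This is an
illustration only: it assumes a single tanh node in place of a full MLP, so
that "forward propagation" and "backpropagation" collapse into one line each,
and the names `gauss_newton_matvec` and `jacobian_row` are hypothetical.

```python
import numpy as np

def gauss_newton_matvec(jacobian_row, m, d):
    """Accumulate z = sum_i (u_i^T d) u_i, keeping only n-dimensional vectors."""
    z = np.zeros_like(d, dtype=float)
    for i in range(m):                 # one sweep of all training data
        u = jacobian_row(i)            # steps (a) and (b): ith row of J
        z += (u @ d) * u               # step (c): rank-one contribution
    return z

# Toy model (assumed): a_i = tanh(x_i . theta), so u_i = (1 - a_i^2) x_i.
rng = np.random.default_rng(1)
m, n = 50, 6
X = rng.standard_normal((m, n))        # hypothetical input patterns
theta = rng.uniform(-0.3, 0.3, size=n)

def jacobian_row(i):
    a_i = np.tanh(X[i] @ theta)        # forward pass for datum i
    return (1.0 - a_i**2) * X[i]       # backward pass for datum i

d = rng.standard_normal(n)
z = gauss_newton_matvec(jacobian_row, m, d)

# Reference value using an explicitly formed Jacobian (for checking only).
J_full = np.array([jacobian_row(i) for i in range(m)])
z_ref = J_full.T @ (J_full @ d)
```

Only `z`, `d`, and one row `u` are alive at any time inside the sweep, which is
the O(n) memory property; the explicit `J_full` is built here solely to verify
the result.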
4 Experiments and Discussions 
In the NN literature, there are numerous algorithmic comparisons available (see, for 
example, Møller [14]; Demuth & Beale [6]; Shepherd [8]; Mizutani [2, 9, 16]). Due to
the space limitation, this section compares typical behaviors of our Krylov-dogleg 
Gauss-Newton CGNR (or iterative dogleg) algorithm and Powell's dogleg-based 
algorithm with a direct linear-equation solver (or direct dogleg) for solving highly 
overdetermined parity problems. In our numerical tests, we used a criterion in
which the MLP output for the pth pattern, $a_p$, can be regarded as either "on"
(1.0) if $a_p \ge 0.8$, or "off" (-1.0) if $a_p \le -0.8$; otherwise, it is "undecided." The
initial parameter set was randomly generated in the range [-0.3, 0.3], and the two 
algorithms started exactly at the same point in the parameter space. 
³ The batch-mode MLP backpropagation can be viewed as an efficient
matrix-vector multiplication (2mn operations) for computing the gradient
$J^T r$ without explicitly forming the m × n Jacobian matrix or the
m-dimensional residual vector (with some extra costs).

Figure 2: MLP-learning curves of RMSE (root mean squared error) obtained by
the "iterative dogleg" (solid line) and the "direct dogleg" (broken line):
(a) "epoch" and (b) "normalized execution time" for the 20-bit parity problem
with a standard 20 × 19 × 1 MLP with hyperbolic tangent node functions
($m = 2^{20}$, n = 419), and (c) "normalized execution time" for the 14-bit
parity problem with a 14 × 13 × 1 MLP ($m = 2^{14}$, n = 209). In (a),(b), the
iterative dogleg reduced the number of incorrect patterns down to 21 (nearly
RMSE = 0.009) at epoch [illegible]35, whereas the direct dogleg reached the
same error level at epoch 355. In (c), the iterative dogleg solved it
perfectly at epoch 1,034 and the direct dogleg did so at epoch [illegible].

Figure 2 presents MLP-learning curves in RMSE (root mean squared error) for
the 20-bit and 14-bit parity problems. In (b) and (c), the total execution
time [roughly (b) 32 days (500 epochs); (c) two hours (450 epochs), both on a
299-MHz UltraSparc] of the direct dogleg algorithm was normalized for
comparison purposes. Notably, the iterative dogleg converged faster to a small
RMSE⁴ than the direct dogleg at an early stage of learning even with respect
to epoch. Moreover, the average number of inner CG iterations per epoch in the
iterative dogleg algorithm was quite small, 5.53 for (b) and 4.61 for (c).
Thus, the iterative dogleg worked nearly (b) nine times and (c) four times
faster than the direct dogleg in terms of the average execution time per
epoch. Those speed-up ratios became smaller than n mainly due to the
aforementioned cost of Algorithm 3.2. Yet, as n increases, the speed-up ratio
can be larger especially when the number of inner iterations is reasonably
small.
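A rough back-of-the-envelope estimate illustrates why the observed speed-up
stays well below n. The sketch below is an assumption-laden simplification,
not a result from the paper: it compares only the dominant $mn^2$ cost of
forming $J^T J$ with the lower bound of 2mn + 2mn + 4mn = 8mn per inner
iteration from Algorithm 3.2, and it ignores the factorization, the gradient
evaluation, and the local-model check. The function name
`speedup_upper_estimate` is hypothetical.

```python
def speedup_upper_estimate(n, inner_iters):
    """Ratio (m n^2) / (inner_iters * 8 m n); m cancels out."""
    return n / (8.0 * inner_iters)

# Reported settings: n = 419 with 5.53 inner iterations per epoch for (b),
# and n = 209 with 4.61 inner iterations per epoch for (c).
est_b = speedup_upper_estimate(419, 5.53)   # roughly 9.5
est_c = speedup_upper_estimate(209, 4.61)   # roughly 5.7
```

These optimistic estimates are of the same order as the reported speed-ups of
about nine and four, which suggests that the per-sweep cost of Algorithm 3.2
multiplied by the number of inner iterations accounts for most of the gap
between the measured ratio and n.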
5 Conclusion and Future Directions 
We have compared two batch-mode MLP-learning algorithms: iterative and direct 
dogleg trust-region algorithms. Although such a high-dimensional parity problem is 
very special in the sense that it involves a large data set but the size of MLP can be 
kept relatively small, the algorithmic features of the two dogleg methods can be well 
understood from the obtained experimental results. That is, the iterative dogleg 
has the great advantage of reducing the cost of an epoch from $O(mn^2)$ to O(mn),
and the memory requirements from $O(n^2)$ to O(n), a factor of O(n) in both cases.
When n is large, this is a very large improvement. It also has the advantage of faster 
convergence in the early epochs, achieving a lower RMSE after fewer epochs than 
the direct dogleg. Its disadvantage is that it may need more epochs to converge to a 
very small RMSE than the direct dogleg (although it might work faster in execution 
time). Thus, the iterative dogleg is most attractive when attempting to achieve a 
reasonably small RMSE on very large problems in a short period of time. 
The iterative dogleg is a matrix-free algorithm that extracts information
about the Hessian matrix via matrix-vector multiplication; this algorithm
might be characterized as iterative batch-mode learning, an intermediate
between direct batch-mode learning and online pattern-by-pattern learning.
Furthermore, the algorithm might be implemented in a block-by-block updating
mode if a large data set can be split into multiple proper-size data blocks;
so, it would be of great interest to us to compare the performance with
online-mode learning algorithms for solving large-scale real-world problems
with a large-scale NN model.

⁴ A standard steepest descent-type online pattern-by-pattern learning (or
incremental gradient) algorithm (with or without a momentum term) failed to
converge to a small RMSE in those parity problems due to hidden-node
saturation [1].
Acknowledgments
We would like to thank Stuart Dreyfus (IEOR, UC Berkeley) and Rich Vuduc (CS, 
UC Berkeley) for their valuable advice. The work was supported in part by SONY 
US Research Labs., and in part by "Program for Promoting Academic Excellence 
of Universities," grant 89-E-FA04-1-4, Ministry of Education, Taiwan. 
References 
[1] E. Mizutani, S. E. Dreyfus, and J.-S. R. Jang. On dynamic programming-like recursive
gradient formula for alleviating hidden-node saturation in the parity problem. In
Proceedings of the International Workshop on Intelligent Systems Resolutions: the
8th Bellman Continuum, pages 100-104, Hsinchu, Taiwan, 2000.
[2] Eiji Mizutani. Powell's dogleg trust-region steps with the quasi-Newton augmented
Hessian for neural nonlinear least-squares learning. In Proceedings of the IEEE Int'l
Conf. on Neural Networks (vol. 2), pages 1239-1244, Washington, D.C., July 1999.
[3] R. S. Dembo and T. Steihaug. Truncated-Newton algorithms for large-scale
unconstrained optimization. Math. Prog., 26:190-212, 1983.
[4] Trond Steihaug. The conjugate gradient method and trust regions in large scale
optimization. SIAM J. Numer. Anal., 20(3):626-637, 1983.
[5] P. L. Toint. On large scale nonlinear least squares calculations. SIAM J. Sci. Statist.
Comput., 8(3):416-435, 1987.
[6] H. Demuth and M. Beale. Neural Network Toolbox for Use with MATLAB. The
MathWorks, Inc., Natick, Massachusetts, 1998. User's Guide (version 3.0).
[7] Timothy Masters. Advanced algorithms for neural networks: a C++ sourcebook. John
Wiley & Sons, New York, 1995.
[8] Adrian J. Shepherd. Second-Order Methods for Neural Networks: Fast and Reliable
Training Methods for Multi-Layer Perceptrons. Springer-Verlag, 1997.
[9] Eiji Mizutani. Computing Powell's dogleg steps for solving adaptive networks
nonlinear least-squares problems. In Proc. of the 8th Int'l Fuzzy Systems Association World
Congress (IFSA '99), vol. 2, pages 959-963, Hsinchu, Taiwan, August 1999.
[10] M. J. D. Powell. A new algorithm for unconstrained optimization. In Nonlinear
Programming, pages 31-65. Edited by J. B. Rosen et al., Academic Press, 1970.
[11] James W. Demmel. Applied Numerical Linear Algebra. SIAM, 1997.
[12] Eiji Mizutani and James W. Demmel. On generalized dogleg trust-region steps using
the Krylov subspace for solving neural networks nonlinear least squares problems.
Technical report, Computer Science Dept., UC Berkeley, 2001. (In preparation).
[13] E. Mizutani and J.-S. R. Jang. Chapter 6: Derivative-based Optimization. In
Neuro-Fuzzy and Soft Computing, pages 129-172. J.-S. R. Jang, C.-T. Sun and E. Mizutani.
Prentice Hall, 1997.
[14] Martin Fodslette Møller. A scaled conjugate gradient algorithm for fast supervised
learning. Neural Networks, 6:525-533, 1993.
[15] B. A. Pearlmutter. Fast exact multiplication by the Hessian. Neural Computation,
6(1):147-160, 1994.
[16] E. Mizutani, K. Nishio, N. Katoh, and M. Blasgen. Color device characterization of
electronic cameras by solving adaptive networks nonlinear least squares problems. In
Proc. of the 8th IEEE Int'l Conf. on Fuzzy Systems, vol. 2, pages 858-862, 1999.
