BIB-VERSION:: CS-TR-v2.0
ID:: CORNELLCS//TR94-1414
ENTRY:: 1994-04-19
ORGANIZATION:: Cornell University, Computer Science Department
LANGUAGE:: English
TITLE:: Faster SVD for Matrices with Small $m/n$
AUTHOR:: Bau, David
DATE:: March 1994
PAGES:: 10
ABSTRACT::
The singular values of a matrix are conventionally computed using either the 
bidiagonalization algorithm by Golub and Reinsch (1970) when $m/n < 5/3$, or 
the algorithm by Lawson and Hanson (1974) and Chan (1982) when $m/n > 5/3.$ 
However, there is an algorithm that is faster and that does not involve a 
discontinuous choice, as follows: in all cases, perform a QR factorization as 
in Lawson-Hanson-Chan, but rather than do this right at the beginning, do it 
after zeros have already been introduced in the first $j = 2n - m$ rows and 
columns.

The same technique applies when computing singular vectors, with one small 
modification. If left singular vectors are needed, the new algorithm becomes 
advantageous only when $m > 1.2661n$, and the best $j$ in this case is 
$3n - m$.

The benefits of the new algorithm appear in terms of classical scalar 
floating-point operation counts; the effects of locality and parallelization 
are not considered in the analysis.
END:: CORNELLCS//TR94-1414
BODY::
Faster SVD for Matrices
With Small min
David Bau*
TR 94-1414
March 1994
Department of Computer Science
Cornell University
Ithaca, NY 14853-7501
* Thanks to Nick Trefethen for first suggesting to me that the flop count for
bidiagonalization shouldn't have a corner at m/n = 5/3. This work was supported by
an NSF Graduate Fellowship and by NSF Grant DMS-91 16110 and DOE Grant DE-
FG02-94ER251 99.
Faster SVD for Matrices with Small m/n
David Bau, Cornell University*
The singular values of a matrix are conventionally computed using either the bidiagonalization
algorithm by Golub and Reinsch (1970) when min < 5/3, or the algorithm by Lawson and
Hanson (1974) and alan (1982) when mln > 5/3. However, there is an algorithm that is
faster and that does not involve a disoontinuous choioe, as follows: in all cases, perform a QR
factoriltation as in Lawson-Hanson-alan, but rather than do this right at the beginning, do it after
zeros have already been introduced in the first j = 2n --H m rows and columns.
The same technique applies when computing singular vectors, with one small modification.
If left singular vectors are needed, the new algorithm becomes advantageous only when m>
l.266ln, and the best j in this case is 3n --H
The benefits of the new algorithm appear in terms of classical scalar floating-point operation
counts; the effects of locality and parallelization are not considered in the analysis.
1 Bidiagonalization of Skinny Matrices
We seeka bidiagonalization ofA, that is, a matrix
B = UTAv
(1)
where U and V &eorthogonalandbq = Oforj # i,i + 1.
Golub and Reinsch [CiR 70] suggest that to bidiagonalize an m x n rnatrix, House-
holder reflections be applied alternately on the left and right so that, at the ith step, zeros
areintroducedintothei+l,...,rnentriesintheithcolumnandinthei+2,...,fl
entries in the ith row. The number offlops required (including additions) is 4rnn2 --H 43m3.
?Ph?u C?I?ih?Rpinc??h Ri?i			aIiw?h?
unt .? --??---H?o+- --H o+?.?goll?????3ll
A			UTAv
Thanks to Nick Trefethen for first suggesting to me that the flop count for bidiagonalization shouidn't
have a comer at in/n = 5/3. Ths work was supported by an NSF Graduate Fellowship and by NSF Grant
DMS-91 16110 and DOE Grant DE-FGl-94ER25199.
Lawson and Hanson [LH 74, p.1l9] and also Chan [Chan 82] notice that if the ratio
mln is sufficiently large, itis cheaper to bidiagonalize in twophase& They firstcompute
the QR factorization ofthe matrix, and then they apply the GolubReinsch procedure to
bidiagonalize the resulting square Rmatrix. The QR factorization requires an additional
i2 --H 2n3 fin			bidi'			`l17siinn ofth?			matrix R uses onlv
2m1			.			?p5, out tnL .?Jgor?-? rciiiaiiiiiig Th ?
?n3 flops, for a total of 2rnn2 + 2n3, cheaper than GolubReinsch if m/n> ?.
`I..
Two-Phase Lawson-Hanson-Chan lii(Iia?ona..zaUoll
A			UTQTA?
2 Bidiagonalization of Medium-Shaped Matrices
???1
If
QTA
Extending the idea of Lawson and Hanson and Chan, we make the observation that,
in the process of applying Golub-Reinsch bidiagonalization to any m )c n matrix with
m > n, the problem of bidiagonalizing skinny submatrices resurfaces. After the ith
step ofGolubReinsch, the remaining problem is to bidiagonalize the (m --H x (n --H i)
lower-right submatrix, and this submatrix can become arbitrarily skinny. ff the aspect
ratio (m --H --H i) is sufficiently large, a QR factorization ofthe remaining submatrix
is warranted to reduce it to a square matnx.
A
Three-Phase Bidiagoinalization
Th.
Ul2?-
?TU1WAvj QTQTUrAvivi
The only remaining problem is to determine the best point at which the QR factor-
ization should be applied. A brief calculation follows.
If the GolubReinsch algorithm is interrupted after j steps in order to apply a QR
factorization before proceeding with the rest of the bicliagonalizaflon, the total numoer
of flops used is asymptotically
4mn2 --H--H43n3--H4m--Hj)(n--Hj)2			10
2
(2)
We wish to find 0 < j < n that minimizes this formula. Local minimization yields the
solution when n < m < 2n; for these matrices, the j we seek is
j=2n--Hm.			(3)
In other words, it is hest to apply QR factorization at the point at which the remaining
--H			x			--H j) submaLrix has aspect ratio
m--Hj --H 2m--H2n =2.			(4)
n --H j			m --H n
Using this choice of j, we find that, for n < m < 2n, the numher of flops required to
- - -?			II,
compute a bidiagonalizaflon is asymptoflca?q
4mn2 --H			?4?3			--H 2
?			--H(m --H ?)3			(5)
3
For m = n our algorithm reduces to GolubReinsch bidiagonalization. For m > 2n,
we get j = 0, and it reduces to Lawson-Hanson-Chan bidiagonalization. In betw?ii,
new algorithin is slightly cheaper than either of the other algorithms for matrices with
n < m < 2n, at least in terms of classical, scalar floatin&point operation counts. The
new algorithin uses 3.7% fewer flops when min --H 3.
flops
n3
7
6
5
4
3
Bidiagonalization Flops for m x n Matrices
-- I
Thr?e-pna5e bidsgaonalizaflon
2
573			2
aspect ratio mln
3
3 Computing Singular Values
4
Once a bidiagonalization is obtained, singularvalues are obtained by an iterative process
related to the QR algorithin that applies Givens rotations to both sides of the matrix.
The algoritlirn is described in [GVL 8.3.2]. After a 0(n2) Givens rotations, the off-
diagonal elements converge sufficiently to zero, and the singular values are left on the
diagonal. For filture reference, let C be the constant that gives sufficient convergence
when Cn2 rotations are applied on each side (typically C 2).
The flop requirerren, t42thi?s Process is only 0(n) so the work ofthe SVD algorithin
as a natural extension ofthe sequence of U?; each reflection still requires 4n(m --H i + 1)
flops to apply.
(8)
11 Uj
i--H--Hj+l
is dominated bv the U(?7? flops of the bidiagonatzatlon.
4 Computing Singular Vectors
If the singular vectors as well as the singular values need to be computed, the three-
phase algorithmcontinues to be a cheaperaltemative to the GolubReinschand Lawson-
Hanson-Chan algorithms in certain cases. The analysis is presented here.
4.1 Roster of Orthogonal Operations
The left and right singularvectors are computed by accumulating many small orthogonal
operations. To analyze the cost of the accumulation, we begin by giving names to each
elementary operation.
The first bidiagonalization phase operates by applying Householder reflections on
the left and the right of A to obtain U1TAVi. Explicitly,
Ul =			(6)
jj?u?
i=I
jiVi
V1			(7)
where the applicaion of Uj = to the left of a dense m x n matrix requires 4n(m --H
i + 1) flops, and the applicalon of Vi --H v%T to the right requires 4m(n --H i) flops.
The QR phase applies Householder reflections only on the left of the resulting
matrix; let us write the reflections
The secondary bidiagonalization phase again applies Householder reflections on
the left and the right of the matrix; we write out the reflections as follows.
U2 =			(9)
H?Wi
i=j+1
H?Vi
i=j+1
v2=			(10)
In comparison to the previous left reflections Uj, the reflections applied on the left now
are relatively cheaper because they operate on fewer rows. Each reflection Wi =
applied to the left of a dense m x n matrix requires only 4n(n --H i + 1) flops. The
reflections Vi applied on the right cost as much as before: 4n(m --H i) flops.
Fmally, reduction ofthe bidiagonal matrix B to a diagonal matrix ? is accomplished
by a sequence of Givens rotations on altemating sides of B. (Notice that, although the
Givens operations must be realized as true Givens rotations when reducing B to ?, they
can be accumulated using fast Givens when constructing singular vectors, as long as
attention is given to controlling the scaling of the fast Givens multipliers.)
If Gt is the accumulation of rotations on the left and GR is the accumulation of
rotations on the right, we have
GL =
&LBGR
fI???i
DR.
(11)
(12)
GR =			(13)
The cost of applying a single fast Givens transformation on the left of a dense m x n
matrix is 4Th flops; the cost of applying a fast Givens transformation on the right is 4m
flops. Observe that for m > n, applying Givens on the left can be significantly cheaper
than applying Givens on the right. This fact will impact the cost of accumulating left
singular vectors.
Expanding B in full we have
= UTAV
- GtU?TQTUlTAviV2GR
U = llUi llUi __
i=1 i=j+1
n 3
V = fiVi llVi
i=j+1 i--Hi
5
f[Wi
DR.
Cn2
fi Mi DL
i=1
(14)
(15)
(16)
(17)
4.2 First Method for Computing Left Singular Vectors
3. The Householder reflections of of fl?i=i Uj are accumulated on to the left side
of the result from the second step by multiplying the rightmost terms first. This
step requires 2mn2 --H 2(m --H j)(n --H j)2 --H (2/3)n3 + (2/3)(n --H j)3 flops.
The left singular vectors are the first n columns of U, in other words, U = UImfl,
where 1mn is the m x n principal submatrix of the identity. Perhaps the most natural
way to compute this matrix is to multiply terms into Imn starting from right to lefL
However, there are two other ways to compute this matrix that do better at exploting
sparsity. The first method can be written by inserting parentheses into (16) as follows.
ll?w			Cn2
Mj
i=j+1			i--H--Hi
tHliUi			jf[?jU?
I?n.			(18)
Let's begin with a breakdown of this approach. It consists of several steps.
1. The first n columns of llt?--H--Hj+i Uj are accumulated by multiplying them into
Imn, beginning with the riglitmost terms. This step requires 2(m --H j)(n --H j)2 --H
(2/3)(n --H j)3 flops.
2. The Householder reflections from the QR factorization Ht?=j+i W?? are accumu-
lated on to the right side of result from the first step by multiplying the leftmost
terms first. This step requires 2(m --H j)(n --H j)2 flops.
4. The fast Givens operations are accumulated on to the result from the last step
beginning with the leftmost terms. At the end, the diagonal scaling factor is
multiplied in. The accumulation requires 4Cmn2 flops.
(4C + 6)mn2 --H 2n3 + (lO/3)(n --H j)3.
The total number of flops is
(4C + 2)mn2 + 2(m --H j)(n --H j)2 --H (2/3)n3.
(19)
When j = n, this reduces to the (4C + 2)mn2 --H (2/3)n3, which is the cost of
accumulating left singular vectors using the algorithm of Golub and Reinsch. In fact,
this is the right way to use this method, because, as will be shown now, j = n is the
optimal choice ofj.
The total cost of computing left singular vectors, including the bidiagonalization
cost discussed earlier, is
(20)
Differentiating and optirnizing (the derivative is --H lO(n --H j)2), we find that j = n is
the optimal choice of j. In other words, the best strategy when using this method for
accumulating left singular vectors is to stick with the GolubReinsch algorithm and
never execute a QR factorization.
6
The best cost in flops with this method is therefore, as mentioned before,
(4C + 6)mn2 --H 2n3
(21)
This is the performance of the GolubReinsch algorithm. In a moment, however, we
shall see that for matrices with min slightly larger than 1, there is a more efficient
algorithin for computing left singular vectors, based on a different multiplication order.
4.3 Second Method for Computing Left Singular Vectors
The second method for computing the left singular vectors is merely based on accumu-
lating the product (16) in another order. The method can be written as follows:
?= ll?'?u?. j=fij+jU?
j=f?j+?W? iij??M?
Following is an algorithmic breakdown of the method.
1.
(22)
The first n columns of fl?1 Uj are accumulated by multiplying the Householder
reflections into 1mn, begining with the rightmost terms. This requires a total of
2mn2 --H (2/3)n3 flops.
2. The (n --H j) nontrivial columns and rows of flflj??j+j Wt are accumulated by
multiplying the rigutmost terms first into Inn. This requires (4/3)(n --H j)3 flops.
3. The fast Givens operations are accumulated on to the result from step two,
begnning with the leftmost terms (and, at the end, the diagonal scaling factor
is multiplied in). On a dense starting matrix, the accumulation would require
4Cn3 flops. However, the matrix from step two is not dense; the first j rows
have many zeros that are largely preserved by the particular Givens operations
we are multiplying. Each sequence of Givens operations combines a column i
with i + 1, and they are applied a sequences of increasing i. After n Givens
operations, at most one column and one subdiagonal element of the other sparse
columns are filled in. The upshot is that the flop count is slightly lower than the
dense case: the total accumulation requires 4Cn3 --H (2/3)j3 --H 2j2(n --H j) flops.
4. The m x n matrix from the first step is multiplied by the m )< n matrix from
the third step, using an elementwise matrix multiplication. The number of flops
required is 2mn2.
The number offlops totals 4mn2+ (4C --H2/3)n3--H (2/3)j3--H2j2(n--Hj) +(4/3)(n--Hj)3.
When j = 0 this reduces to (4C + 2/3)n3 + 4mn2, which is the cost of accumulating
left singular vectors using the Lawson-Hanson-Chan scheme.
Notice that, when m is large relative to n, the second method promises to be more
efficient than the first because it avoids multiplying the Givens operations by too many
7
rows. On the other hand, when m is very close to n, the first method will he more
efficientbecause it avoids the extra step ofan explicitmetrix multiplication. The correct
crossoverpoint between the two methods will be calculated in a momenL; first we finish
the analysis of the second method.
The number of flops required to compute left singular vectors ifthe second method
is use (includin? the cost of bidiagonalization) is
8mn2+(4C--H2)n3--H2(m--Hj)(n--Hj)2+(l4/3)(n--Hj)3--H2j2(n--Hj)--H(2/3)j3. (23)
Setting the derivative (which is 4(n --H j)(m --H 3n + j)) to zero, we obtain the
following solution to j:
?=?03m--Hm if mm><33nn			(24)
For m > 3n the method reduces to the algorithm of Lawson, Hanson, and Chan.
The cost in flops is 6mn2 + (4C + S/3)n3.
Form < 3n wehavej = 3n --H m, so the cost in flops is
--H-2m3 +4m2n+(4C+-38)n3			(25)
3
The formula for this flop count involves m3 and m2n terms which are not terribly
enlightening; a plot of the flop count for manices of various aspect ratios is more so.
Flops to Conipute the Left Singular Vectors of m x nMatrices
35-
flops
n3
30-
25-
20-
15-
0'
,,,, <&???-
Using three-pbase bidigaonalization
10'
1			1.2661
7
2			3
aspect ratio min
8
The graph is drawn assuniing that C = 2, The original GolubReinsch SVD is
faster than the newly proposed method for m close to n; the crossover point where the
flop counts (25) and (21) are equal is at m/n = 1.2661. At m/n = ?, the new method
requires 10.5% fewer flops than either the Lawson-Hanson-Chan or the GolubReinsch
methods.
4.4 RightSingularVectors
TherightsingularvectorsarethecolumnsofV;wecomputethembyevaluatingthe
product(17)inthefollowingorder:
DR			(26)
i);[D' ll??vi			H?jA?
The multiplication goes in two phases:
1. The Householder reflections llt?i Vi are accumulated onto the left ofI?? begin-
ning with the rightmost terms. `This requires (4/3)n3 flops.
2. The Givens rotations are accumulated onto the right of the resulting matrix,
beginning with the leftinost terms. This requires 4Cn3 flops.
The total number of flops is
(4C+4/3)n3.			(27)
Notice that the acummulation of right singular vectors has no dependence on the
choice ofj. Therefore, itdoesn'tmake any difference whether GolubReinsch,Lawson-
Hanson-Chan, or the three-phase method is used: the cost of computing right singular
vectors is the same in all three cases.
The choice between the various algorithms and choices ofj should be made based
on the cost of computing the bidiagonalization and the left singular vectors if they are
needed. The best algorithm for the case where right singular vectors are not computed
will remain the best algorithm for the case where right singular vectors are computed.
5 Summary of Results
The new algorithin for computing singular values provides a pleasing, unified frame-
work in which both the Golub-Reinsch algorithm and the Lawson-Hanson-Chan algo
rithni can be described as special cases.
The new algorithin does not reduce smoothly to the Golub-Reinsch algorithm when
left singular vectors are needed because Golub-Reinsch uses a method for accumulating
left singular vectors thatdoes not generalize well for different choices ofj. On the other
hand, when using the new algorithm to compute left singular vectors, a 10% reduction
in floating point operations can be achieved for certain mln, making the algorithin
interesting from the point of view of performance.
9
The reduction in floating point operations of the new algorithin is not dramatic;
with modern computer architectures, this degree of savings in reduced floating-point
opereration counts may be dwarfed by the effects of data locality and parallelization.
However, an adjustment of the parameter j that is chosen to minimize clock cycles
rather than floating point operations will likely allow the new algorithin to achieve a
performance advantage in practice.
SVD Flops for m x n Matrices
Result			GiR SVD			3-Phase SVD			LIHIC SVD
4mn2 --H 4-3n3 4mn2 --H 4-3fl3 --H			-23(m--Hn)3			2mn2+2n3
?v			4mn2+8n3			4mn2+8n3--H?(m--Hn)3			2mn2+--H?n3
?U			14mn2--H2n3			--H23m3+4m2n+%2n3			6mn2+--H332n3
?U,V			14mn2+ --H232n3			--H-23m3 +4m2n+20n3			6mn2 +20n?
Aspect Ratios for the Three Algorithms
Result			GIR SVD			3-Phase SVD			L?I?/C SVD
? or			Tn =			n < m < 2n			2n <m
?Uor ??,V m<1.2661n 1.266ln<m<3n			3n<m
References
[Chan 82] Chan, T.F. 1982. "An improved algorithm for computing the singular value
decomposition," ACM Trans. Math, Soft. 8, 72-83.
[GR 70] Golub, G.H. and Reinsch, C. 1970. "Singular value decomposition and least
squares solutions," Numer. Math. 14,403-420.
[GVL 89] Golub, G.H. and Van Loan, C. 1989. Matrix Computations, 2nd ed., Johns
Hopkins University Press, Baltimore, MD. (See pp. 236-239,427-435.)
[LH 74] Lawson, C.L and Hanson, R.J. 1974. Solving Least Squares Problems.
Prentice-Hall, Bnglewood Cliffs, NJ. (See pp. 110-119.)
10
