BIB-VERSION:: CS-TR-v2.0
ID:: CORNELLCS//TR93-1331
ENTRY:: 1993-10-14
ORGANIZATION:: Cornell University, Computer Science Department
LANGUAGE:: English
TITLE:: Solving $L_{p}$-Norm Problems and Applications
AUTHOR:: Li, Yuying 
DATE:: March 1993
PAGES:: 23
ABSTRACT::
The $l_{p}$ norm discrete estimation problem min$_{x\in\Re^{n}} \Vert b-A^{T} 
x\Vert^{p}_{p}$ has been solved in many data analysis applications, e.g. 
geophysical modeling. Recently, a new globally convergent Newton method 
(called GNCS) has been proposed for solving $l{p}$ problems with 1 $\leq p 
\leq$ 2 [5]. This method is much faster than the widely used IRLS method when 
1 $\leq p \leq$ 1.5 and comparable to it when $p >$ 1.5. In this paper, 
modification is made to the line search prodedure so that the GNCS method is 
applicable for $l_{p}$ problems with 1 $\leq p < \infty$. The global 
convergence results for $l_{1}$ problems are obtained under weaker 
assumptions than required in [2]. In addition, the usefulness of 
$l_{p}$ norm solution with 1 $\leq p \leq$ 2 is demonstrated by applying 
the GNCS algorithm to a synthetic geophysical tomographic inversion problem. 
Additional numerical results are included to support the efficiency of GNCS.

Key Words: linear regression, discrete estimation, tomographic inversion, 
IRLS, GNCS, linear programming, Newton method.

Subject Classification: AMS/MOS: 65H10, 65K05, 65K10.
END:: CORNELLCS//TR93-1331
BODY::
Solving Lp?Norm Problems and Applications
Yuying Li*
TR 93-1331
March 1993
Department of Computer Science
Cornell University
Ithaca, NY 14853-7501
*Computer Science Department, Cornell University, Ithaca, New York, 14853.
Research partially supported by the Applied Mathematical Sciences Research
Program (KC-04-02) of the Office of Energy Research of the U.S. Department of
Energy under grant DE-FG02-86ER2501 3.AOOO, and in part by NSF, AFOSR and
ONR through grant DMS-8920550, and by the Cornell Theory Center, which receives
major funding from the National Science Foundation and IBM Corporation, with
additional support from New York State and members of its Corporate Research
Institute.
SOLVING Lp-NORM PROBLEMS AND APPLICATIONS
YUYING LI*
Abstract. The 4 norm discrete estimation problem min?E? IIb?ATzIIPp has been solved in many data analysis
applications, e.g. geophysical modeling. Recently, a new globally convergent Newton method (called GNCS) has
been proposed for solving 4 problems with 1 < p < 2 [5]. This method is much faster than the widely used IRLS
method when 1 < p < 1.5 and comparable to it when p > 1.5. In this paper, modification is made to the line
search procedure so that the GNCS method is applicable for 4 problems with 1 < p < 00. The global convergence
resalts for ti problems are obtained under weaker assumptions than required in [2]. In addition, the usefulness
of 4 norm solution with 1 < p < 2 is demonstrated by applying the GNCS algorithm to a synthetic geophysical
tomographic inversion problem. Additional numerical results are included to support the efficiency of GNCS.
Key Words. linear regression, discrete estimation, tomographic inversion, IRLS, GNCS, linear programming,
Newton method
Subject Classification. AMS/MOS: 65H10, 65Ko5, 65K10.
1. Introduction. In empirical science, the following discrete approximation problem is of-
ten solved
(1.1)			mill IIATx --H bII?p
The least squares solution (p = 2) has been mostly used because it is an application of the
principle of maximum likelihood under the normal distribution assumption. The ti solution has
shown to be an increasingly attractive alternative to 12 since the li solution has the maximum
resistance property: a small number of isolated large errors do not usually change the solution
(e.g.,[1j, [1o],[13]). However, the li solution may not be unique while there exists exactly one
solution for any 1? problem with p> 1. Thus it may be desirable to obtaln the !? solution with
1 <p < 2.
As an example, consider a synthetic geophysical tomographic inversion problem described in
(lOJ (See 5 for detalls). The exact solution 1p solution with 1 < p < co when no error is present
is given in Figure 1. Now assuming both large spike errors and small Gaussian noise in the data,
the least squares and li solutions are shown in Figure 2 and Figure 3 respectively. Compared
to Figure 1, it is clear that the li solution is less affected by the large spike error and thus is
preferable.
* Computer Sdence Department, Cornell University, Ithaca, New York, 14853. Research partially supported
by the Applied Mathematical Sciences Research Program (KC-o4-o2) of the Office of Energy Research of the U.S.
Department of Energy under grant DEFGO2-86ER25013.A000, and in part by NSF, AFOSR, and ONR through
grant DMS-8920550, and by the Cornell Theory Center which receives major funding from the National Science
Foundation and IBM Corporation, with additional support from New York State and members of its Corporate
Research Institute.
0.014
0fl12
0.01
0.008
0.006
0.004
0.002
0
0			20			40			60			80			100 120 140
FIG. 1. EzactSointionfor a Synthetic Geophysical Tomographic Inversion Problem
The l? norm solution with 1 < p < 2 has been used less often in practice. Nontheless,
the (p solutions, with 1 < p < 2, can be useful when one seeks a solution with some degree of
resistance property and the character of least squares solution simultaneously(e.g, [10], [14]). For
the synthetic seismic tomographic inversion problem considered, the l? solutions, with p close to
unity, have similar error resistance properties (see Figure 4 and 5).
The IRLS method [9] has been widely used to solve (1.1), particularly in geophysical data
processing. Since the choice of the parameter p may need to be determined experimentally [11],
the fact that the IRLS method can solve l? minimization problems for 1 < p < 2 is certainly
an advantage. Even for li problems for which many LP-related methods exist, IRLS can still be
more desirable because it works on the original problem directly. This is particularly true when
a crude approximate solution is satisfactory. Moreover, the (? norm, 1 S p < 2, is decreased at
each iteration and a readily available good starting point can be immediately exploited.
Mthough the IRLS method has been widely used, it can be extremely slow when p is close
to unity (e.g., [9] and [5]). The slow convergence of IRLS, when p is close to unity, is not due
to the lack of line search. Indeed, an IRLS method with a line search (IRLSL) implemented in
[5] has similar slow convergence behavior, though small improvements over IRLS are observed.
Moreover, numerical examples illustrate that the methods are slow globally ([5]. see also 4).
Based on the complementary slackness conditions, a new globally convergent Newton aigo-
rithm called GNCS is proposed in [5]. It uses the same line search procedure of IRLSL. Compared
to IRLSL, the method has better convergence property and better practical performance. The
method is computationally very similar to the IRLS method and retains all the advantages of the
IRLS method over other approaches.
In this paper, we first briefly review the GNCS method (2). A modification is made to the
2
0.1?
0.1
0.05
<?.05
?2j
0
0			20
40			60			80			100 120 140
FIG. 2. 12 Solution Least Squares Solution with Spike and Gaussian Erros
.3
18x 10
16
14
12
10
8
6
4
A
20			40			60			80			100			120			140
FIG. 3. L1 Solution with Spike and Gaussian Errors
3
x 10?
is
10
5
0			20			40			60			80			100			120			140
FIG. 4. l? Solution, p = 1.2, with Spike and Gaussian Errors
0.02.?
0.021
0.015
0.011
A
0.0051
?.005l
I--H
0			20			40
80			80			100			120			140
FIG. 5. l? Solution, p = 1.3, with Spike and Gaussian Errors
4
simple line search procedure adopted ill [5J to extend the GNCS method for ip problems with
1 < p < 00. In 3, a global convergence result of the GNCS method, in the case of p = 1, is
obtained under a weaker condition than the one assumed in [2J. In 3, we report some of our
numerical experience with the method for large sparse problems and demonstrate its advantages
over the IRLS approach. Finally, in 5, the GNCS method is applied to a synthetic geophysical
tomographic inversion problem to investigate the ip norm solution in the presence of a few large
erratic data and Gaussian noise and thus demonstrate the usefulness of the 1p norm solution when
1 <p<?2.
2. The GNCS method. The motivation of the GNCS method has been described in [5J
and [2]. In this section, we briefly summarize the basic components of the method. Modifications
are made to the line search procedure so that GNCS solves (? problems with 1 < p < 00.
Assume that A has full rank. Let the rows of the matrix Z form a basis for the null space of
A, i.e., AzT = 0. The problem (1.1) is equivalent to the following
(2.1)			subject to
rw?nm?(k(r) =
Zr = Zb.
This formulation is used because of the following reason: the points of nondifferentiability (either
first or second order) are simply expressed as r? = 0 for some `. Notice that Z is not needed
computationally.
The objective of GNCS is to derive a method which is efficient for all !? problems with
p ranging from 1 to 00. However, the problems when p = 1 and p > 1 are very different
mathematical problems. The objective function is piecewise differentiable when p = 1 while
continuously differentiable for p> 1. This is reflected by the different optimality conditions in
the two cases. For p > 1, the optimality conditions for (2.1) formulation are simply that there
exists w E ?m-n such that
(2.2)			V4)(r) ZTw = 0 and Zr = Zb.
When p = 1, the optimality conditions are
(2.3)			diag(r)(??(r) --H zTw) = 0, Zr = Zb
and IA = ZTwI < e. We use e to denote the vector with each component equal to unity and
diag(r) represents a diagonal matrix with the components of r on the diagonal. The condition is
def
referred to as the complementarity condition. Let g = V?(r). If, for every r? = 0, Aj $ g?, strict
complementarity is said to be satisfied. It is easy to see that the multipliers w (and thus A) are
uniquely defined at points satisfying the strict complementarity condition.
In order to develop a method which solves the t? norm problems for all p efficiently, one
system of equations is chosen as a template to determine suitable descent directions for all
The IRLS method chooses the system (2.2) to determine such directions [5j. But (2.2) does not
5
include the optimality condition for p = 1. This explains the inefficiency of the IRLS method for
p close or equal to unity.
The key observation which motivates the development of GNCS is to use the system (2.3).
The advantage of using (2.3) is the fact that, in addition to being part of the optimality condition
for p = 1, (2.3) is the optimality condition for p> 1 if there are no zero residuals at the solution.
hi this case, a method based on the Newton directions defined by (2.3) will be quickly convergent
if suitable stepsizes are taken. When there are zero residuals at the solution for 1 < p < 00,
convergence is achieved through a globalization technique described below. When 1 <p < 2 and
there exist zero residuals at the solution, it is not clear how one can achieve fast convergence
since the function ?(r) is nonlinear and the Hessian matrix is not defined at the solution. In
this case, it is reasonable to achieve linear convergence. When p> 2, the globalization technique
ensures fast convergence if the Hessian matrix of `t)(x) def IIATx --H bIIPp is nonsingular.
In this paper, we define the multiplication and division of two vectors are to be interpreted
as componentwise operations. Let Dr = diag(IrI) A --H ZTw and DA = diag(sgn(r)(pg--H A)). The
Newton step for (2.3) is defined by
(2.4)			[DA, DrZT]			d = --HDr(g --H zTw) and Zr = Zb.
dw -
Alternatively, let d --H ATd_ and we have
(2.5)
Hence
(2.6)
dx
[DAAT,?DrZT]			=?Dr(y?ZTw).
d = AT(AD?D??lAT)?lAg
It is not difficult to show that the Newton direction defined by (2.6) is a descent direction for the
objective function ?(r) when sufficiently close to the solution ([2],[5]). However, it may not be a
descent direction far away from the solution. Thus, a globalization technique is required.
Since the IRLSL method is globally convergent [9], one way to achieve global convergence is
to combine the Newton direction (2.6) with the IRLS direction.
It can be easily verified [5] that the IRLS direction is d --H ATd_ where
(2.7)			[diag(?g?)AT, DrZTh
dx
dw.
--H Dr(g --H zTw).
By observing the similarity between (2.7) and (2.6), a globalization technique is formed by intro-
ducing a diagonal matrix D0 whose diagonal entries are componentwise convex combinations of
the components of diag(g) and D?:
(2.8) De = diag(i0g + (e --H O)(pg --H A)j)
6
Here 0 is a vector. Furthermore,
_			lie			Dr(g--HA)I
(2.9)			0 --H ?I?I + lie' li = max(max(			?(r0)			),max(max(iAi --H IgI,o)))
where 0 < ? < 1 is a constant. We use the Matlab [7j definition for the function max: max(x)
denotes the maximum component of a vector x and the value of max(x, y) is a vector whose
components are the maximum of the corresponding components of x and y.
The following lemma can be easily established from the definition of D8 and 0 (e.g., [5]). It
indicates that the diagonal matrix D6 behaves like diag(igi) globally.
LEMMA 1. Suppose 0 < ? < 1 and 1 < p < 00. Assume D0 is defined by (2.8). Then D0
satisfies
(2.10) (p --H 1 + (1 --H ?)min(0))diag(igi) < IDol < (p+ 1)diag(gI).
To complete the description for the algorithm, we only need to discuss our line search strategy.
The line search technique used in [5j exploits the special convex structure of the objective function
?(r) and the property of the descent direction determined by the GNCS algorithm. It is a finite
procedure when 1 < p < 2.
Let J denote the ordered (ascent) steplengths which result in zero residuals moving along a
descent direction dk, i.e.,
(2.11)
J = fPk? : Pk? = r,'k rj?dj? <0, Pk0 < Pk1 < ??,Pk1J
dik'
We call [Pk?) breakpoints. For notational convenience, we define Pk0 def
Let a? denote the optimal stepsize along dk for the following quadratic interpolation function
where 9k = V?(r?). Hence
From Lemma 1 and
= 1 m I9ikI r?2' + m 1kiki
2			Z(Iri?IP --H			--H Iri?I2),
Iri?I			i=1			2 Iri?I
ykTdk
d?Tdiag(p?r??P?2)d?
(ADekDr??lAT)dxk = --HAgk,
it is straightforward to verify that, for 1 < p < 2,
(2.12) (p--H l+(l--H?)min(0?)) < a? < (p+ 1).
7
lii addition, it can be shown that [5], for 1 < p < 2,
?rk + otkdk) < ?r?) + 1gTdk
2k
Hence, for 1 <p < 2, the sequence ?a?J is uniformly bounded away from zero and produces a
sufficient decrease of the objective function. However, when p > 2, all increase of the objective
function may occur with this stepsize because U(r) is not a dominant quadratic function for ?(r)
when p> 2.
We define a? for p> 2 as follows.
(2.13)			?k = --H--H X
p d?Tdiag(p(?r??)P?2)d?
Given a small constant E> 0. Define Pk* to be the first breakpoint, ill the interval (min[(p --H
1)E, ak), amax), at which the direction dk ceased to be descent:
Pk* %?ef rnjnf??? min?(p --H 1)E, a?J <Pk? < amax and y(r? + Pkidk)Tdk > 0).
Here a large upper bound amax > ?1is imposed on Pk* and Pk* is defined to be amax if no such
Pk? exists. This definition of Pk* is different from the one used in [5] where it was required that
Pk* > ak. We find this new definition computationally more efficient because it allows a? to be
better approximation to the exact minimum.
Assume 0 <p < 1. The stepsize is determined in the following way:
(2.14) ck = first(a E fPk*, l,ak,pak,p2ak? ) : ?(r? + ad?) < ?(r?) + ??1 ?jag?Td?),
p
i.e., ak equals the first stepsize among the ordered set (by appearance) fPk*, 1, a?,pa?,p2a?. 1
such that the sufficient decrease condition is satisfied. Notice that, for p = 1, Pk* is always taken.
If 1 < p < 2, 0Lk < ak always gives sufficient decrease and ?(r) is evaluated at most three times.
When p> 2, more evaluations may be required but our computation experience shows that ak,
defined above, often produces a sufficient decrease.
The points with exact zero residual should be avoided since the objective function ?r) may
be nondifferentiable at such points. We achieve this by applying a step-back technique, when
necessary:
(2.15)			ak & a? + r?(a? --H
where a? = max(P?? < a?). The amount of backtracking is determined by 7k which is given as
follows
(2.16)
r?=max(m1--H T)k
? + ?1k
8
Here r is a constant (e.g., 0.975). It is clear the backtracking distance converges to zero as the
iterates converge to the solution, i.e., when [nkY converges to zero. Moreover, it is easy to verify
that the same sufficient decrease condition is satisfied at the perturbed a? due to convexity of
A summary of the GNCS algorithm is given ill Figure 6. We want to emphasize that the only
difference between IRLSL and GNCS is the definition of scaling matrix Dk. In fact, there is only
one line difference in the actual codes. Moreover, the main computation is in Step 2: a weighted
least squares solve. For 1 <p < 2, the line search (Step 3) is a finite procedure and takes, in
the worst case 0(m) operations. For p > 2, we observe that the stepsize ok usually produces
sufficient decrease. The weighted least squares solutions dxk can be obtained either through a
direct solve or iterative solve. Notice that, for li problems, the above line search procedure is
slightly different from the exact line search technique used in [2].
3. Convergence Properties. The convergence property of the GNCS algorithm has been
studied in [5] for 1 <p < 2 and [2] for p = 1. Although the definition of Pk* is slightly different
from the one used in [5], it is easy to see that the convergence result established in [5] holds
because (Ph*Y is still uniformly bounded away from zero. For p > 2, the function ?(r) is at
least twice continuously differentiable and the convergence theory foliows from the conventional
unconstrained minimization convergence analysis [3]. Hence, for 1 < p < 00, fr?1 converges to
r*. Moreover, if [a? : = are linearly independent at r*, fAki converges and [r?) converges
to the solution.
The global converge results for li problems, however, are obtained in [5] under fairly restrictive
nondegeneracy assumptions. Even convergence to a vertex is established under the condition
that an l? problem is both primal and dual nondegenerate at every vertex. In this section, we
establish the global convergence results of GNCS for li under weaker assumption. Assume A is
of full rank n. We prove that the sequence f?(r?)1 converges and every limit point satisfies the
complementary slackness condition, i.e., diag(r)(g --H zTw) = 0 for some w. If we further assume
that (Ak) converges, every limit point of frkj is a minimum of ?(r).
We assume that p = 1 for the rest of this section. The foliowing recent result of Stewart [12]
is used.
THEoREM 2. Let D denote the set of all positive definite m X m real diagonal matrices. Let
A be an m X n real matrix of rank n. Then there exist constants XA and XA such that for any
D E D,
1. II(ATD?1A)?1ATD?1II < XA, and
? IIA(ATD?1A)?1ATD?1II < XA
Let Pk be the orthogonal projector onto the orthogonal space of ZDk, i.e.,
Pk = I --H DkzT(z(Dk)2zT)?1zDk
9
Given an initial point r0 = AT:o --H 6 with Irol > 0, A0, p, ?, 0 < Pj < 2' a small constant e and
a la?e constant Qmax>
p
fork=1,2, do
Step 1. Compute 9k = p(Ir?l??1sgn(r?) and
??ke			IDrk(Yk--HAk)I
0h = ?I9kI + Tike' Ilk = max(max( ?(r0) ).max(max(IAkI --H ykI,O)))
Let Drk = diag(Ir?I), Dek = diag(lpg? --H (e --H Ok)Aki) and define Dk =
(DrkDe?1)?;
Step 2. Compute the direction dk by
Dk?lATdxk ?=? --HDkgk
dk --H ATdx?;
Update : Ak+1 & DTk?1Dek4 + 9k;
9kTdk			?r?2?nn1o+1			rRt? 2fl? ii
Step 3. Compute ok = --H pd?Tdjag(p(?r??)P,?)??? ompu?e ??y??tS k?kJ
Pk* = min?P?? : minf(p --H 1)E, 0kY < PkS < omax and g(rk + Pki4)T4 > 0); Set
0Lk = first(o ? fPk*, 1,0k,P0k,P20k o+) : ?r? + Qdk) < ?r?) + ylp!oykT4);
Compute Tk by
r?=max(r,1--H Ilk
? + Ilk
and perturb 0k according to 0k Pk + Tk(ak --H 0k?) if necessary; Update:
Xk+I & Xk + c?d?;
Flo. 6. The GNCS Atgorithm
10
where Dk equals (IDrkDek?1I)21. Then
dk			=			?AT(A(Dk)?2AT)?1Agk
(3.1)			--H			--HDkPkDkgk
--H			--H(D?)2(g? --H Ak+1),
where Ak+1 = zTwk+l and wk+1 is the least squares solution to
DkzTwk+l ?? Dhgk.
Using Stewart's theorem we immediately have the following result.
LEMMA 3. The sequences ?wkJ and fAkJ are bounded, i.e., there exists XA > 0 such that
IIwkII < XA and IlAkil < XA
LEMMA 4. The sequences fdkl and tcx?41 converge to zero if lim??? inf??J > 0.
Proof. Since p = 1, the objective function b(r) is piecewise linear. Therefore
kftl(?(rj + (1--H rk)minfP31,amaxjd,) --H
min(1 --H r, "k) minfP31., omaxYgjTdj.
3=0
By assumption, there exists Xn > 0 such that iik > x,, for sufficiently large k. Using definition
(2.9) of 0k and the boundedness of ?AkY, there exists ?o > 0 such that 0k > xo for sufficiently
large k.
From (2.10), we have IDeki > ((1 --H ?)xe)' Hence
(3.2)			111k --H (e --H Ok)AkI > (1 --H
Using (2.11) and (3.1),
Pk1 --H I115k--H(1--HOjk)AjkI
1115k --H Ajk+iI
Hence Pk1 > \l7+%\X?e, we conclude that llm??oo gkTdk = 0. This is equivalent to lim??oo IIPkDkgkII =
0.
The boundedness of fDkl follows directly from the definition Dk = (IDrkDe??1 I)?21 (3.2)
and the boundedness of fr?J. Since dk = --HDkPkDhgh and llm??oo PkDkgk = 0, we conclude
that lim??? dh = 0. Moreover, from the line search assumption, we have oLk < amax. Hence
llm??oo a?4 = 0.			I
11
Next we establish the two main results of this section. Theorem 5 establishes that any
limit point is a vertex and Theorem 6 shows that any limit point is a minimum assuming the
convergence of fAk1.
THEoREM 5. The sequence f?(r?)) conve?es. Moreover, at any limit point r of ?r?1, there
ezists in such that diag(r)(g --H zTw) = 0.
Proof. We prove the results by considering two separate cases.
First we consider the case that zero is a limit point of ?iikJ This means that there exists
a subsequence ?!? converging to the solution r*. Since f?(r?)J is monotonically decreasing, this
means that ??(r?)1 converges to the minimum value f?(r*)). Therefore, any limit point r is a
minimum. Hence there exists in such that diag(r)(y --H zTu,-) = 0.
Now let us assume that zero is not a limit point of [,ikJ Since fr?J is bounded and,
using Lemma 4, ?()tkdk1 converges to zero, the limit points of ?rk1 form a closed connected set
(see [8]). Since f?r?)1 is monotonically decreasing, bounded below and ?(r) is continuous,
there exists r such that lim??oo ?r?) = ?(r). Using Lemma 4, [dkj converges to zero. The
equation (3.1) immediately implies that, at any limit point r of ?r?1, there exists in such that
diag(r)(g --H zTin) = 0.			I
THEoREM 6. Assume that (Ak = ZTink) converges to zTw. Then any limit point of (r?1 is
a solution of (2.1).
Proof. Assume that r is a limit point of (r?1. From Theorem 5, there exists A = zTin such
that diag(r)(y --H A) = 0. This means, by assumption, that the sequence (Ak) converges to A.
Therefore, there exists a constant K such that Aik remains the same sign for k > K if Ak $ 0.
Let ? <Lef (i : = 0) and 11 def (i : Tj $ 01. To prove that r is a solution, we need to
show lAil < 1, Vi Elo. We prove this by contradiction.
Assume the existence of i0 E 10 such that Ajo > 1. Then the rjo is zero at every limit
point since the complementarity condition is satisfied. In other words, (rio?) converges to 0. We
consider two cases.
Assume that there exists a k > K with rio?Aio? > 0. Then gjo?(gjo? --H Aiok+l) < 0. Hence
dio?rio? > 0 and Irio?+iI > Irio?I This is impossible because r?o = 0.
The only other possibility is that, for all k > K, rio?Aio? < 0. But we prove next that
?iok+l Ajo k+1 > 0 for some sufficiently large k > K which is a contradiction.
By definition (2.11) and (3.1), a breakpoint is given by the foliowing formula:
Pk? = Irjkgjk--H(1--HOjk)AjkI --H Igj?--H(1--HOjk)A,'kI
Irj?(gj? --H Ajk+l)I			lYik --H Ajk+lI
Notice that, Okj = 0ki for j $ i. For notation simplicity, we regard 0k as a scalar. We partition
the set 10 as follows
r?mall def (i E 10 : lAji > 1 or (lAji = 1 and Aigi? <0))
12
??I&?C def fi E			: jAil < 1 or (jAil = 1 and Aigi? > O)?.
From (3.1), gtkdtk <0 for all i E Iksmall. (Recall that Aillik < 0, for i E I?smaII.)
From the fact that 0 < 0 < 1, it can be easily proved that, for jAil > 1 and jAjI < 1
1+U?O)jAij < l+UffiO)Aj
1+ jAil			1+Aj
Moreover, for jAil > 1 and jAjI = 1, then for k sufficiently large, if gi?Ai <0 and 9ikAj > 0,
1+(1--H0k)jAikj < 1--H(1--HOk)gj?Aj?
1+jAik+11			l?gj?Aj?+1
and
1+(1--HOk)lAikl
1+ lAik+1l			< 1 ? ?max
Hence, when k is sufficiently large, the breakpoints ill I?l&?? will be larger than the breakpoints
in I?small.
In addition, from AA = 0, we obtain
Arlk9rlk =
(3.3)
By definition, dk --H ATdxk. Thus
g4T1A17??dx? (since ?ri? = g4i)
--HA??A1%dxk (from (3.3)).
Hence
(3.4)			g1i?Td1i? = A?Trodrok.
Now let g+ denote the new gradient after passing all the breakpoints in I?small. Then
= --H ? Yikdik+ Z gj?Td?j?+g1i?Td1i?
1EI?sJflaU
--H			? gi?di? + ? gjkTdjk --H AI%d?k			(from (3.4))
tEI?sma11			j??rge
--H z (?Ai?9ik)dik+ z (9jk?Aj)djk.
iEIkSW?			i??arge
But Aidik < Yikdik < 0, for i E Iksmall. Moreover, since gj?dj? < 0 if A,o+l < 1, --HjAjd?j?j >
gjkdjk for j E I?I&?C with jAjI < 1. In addition ?j? --H A5 = 0 if j E ?l?ge and gikAs = 1. Hence
z (--HAi --H Yik)dik < 0 and ? (gik --H Aj)djk < 0.
iEISkmall
13
We conclude g+Td? < 0 Therefore, rio?+iAio?+2 > 0 which is contradiction.
If we assume that all !i problem is primal and dual nondegenerate at every vertex as ill [2J,
then [Ak) converges and the result that the limit point of [r?) is the solution follows directly
from Theorem 6. Therefore, the assumption that [Ak) converges is weaker. In fact, from the
linear programming theory, [Ak) converges if the strict complementarity condition is satisfied at
every limit point of [r?).
Assume that a limit point r of [r?) is both primal and dual nondegenerate. The sequence
[r?) converges to r* superlinearly (see [2]). The local superlinear convergence property of the
algorithm suggests efficiency of the method when a good initial guess of the solution is available.
This will be demonstrated by a geophysical tomographic inversion problem later.
4. Comparison between GNCS and IRLSL. Before we examine the efficiency of the two
methods, we emphasize that the computational cost of GNCS and IRLSL is equivalent: a weighted
least squares solve (same structure) and the same line search procedure. The computational
superiority of the GNCS method over the IRLS method, when 1 <p < 2 has been indicated in
[5]. In this section, we make a more detailed comparison between the two methods. Moreover,
examples of the performance of the both methods when p > 2 are included. Some limited
experience with large sparse problems is also reported.
For the results reported in this paper, the parameters required by the aigorithms are set as
follows:
r			0.975,			P1			10--H?,			? & 0.99,			c & 0.01,			?max			io4.
Moreover, we stop the computation when
j?(rk+1)--H?(rk)I < ?			k
either			?rk+l)			or ?1 <r8 or itcount> 50
where Ts has been set to ?2110?11 and itcount denotes the number of iterations.
All the experiments are done in Matlab [7]. Each entry of a table is generated from 10 random
problem instances using Matlab random matrix generator. The maximum number of iterations
and the average number of iterations for problem of each size are reported.
For completeness, we include Table 1 to compare the typical behavior of GNCS and IRLSL
when 1 < p < 2. When p = 1, IRLSL fails to compute a solution with required accuracy after 50
iterations which is the maximum number of iterations allowed. This suggests that the efficiency
of GNCS is not solely due to the local superlinear convergence of GNCS.
In order to illustrate that the GNCS method is better than IRLSL even when one only seeks
a crude approximation, we have run both algorithms for 5 iterations only. For 1 < p < 1.6, the
computed function values for GNCS are always strictly smaller than that of IRLSL at the end of
the fifth iteration. Moreover, the closer p is to one, the better approximation the GNCS solutions
are. Therefore, GNCS is better than IRLSL even in the early stage of computation. When p is
14
Number of Iterations			( m = 200, n = 100)
p			1			1.1			1.2			1.3			1.4			1.5			1.6			1.7			1.8			1.9
IRLSL			50/50			27.9/30			18.2/21			15/17			12.2/15			7.1/8			8.5/12			6.4/8			6.1/8			4.5/5
GNCS			16.7/20			10.7/11			10/11			9.4/10			8.6/10			7.6/97.3/8			6.6/7			6.5/7			5.6/6
TABLE 1
Comparison between IRLSL and GNCS (1 < p < 2)
greater than 1.6, the computed solutions from both methods are fairly close to the solutions at
the end of the fifth iteration.
Table 2 lists comparisons between IRLSL and GNCS for p > 2. The two methods are
comparable in this case.
Number of Iterations			( m = 200, n = 166)
p			3			4			5			6			7 8			9			10			11			12
IRLSL			5.4/6			6.2/7			7.1/8			7.8/9			8.3/9			8.8/11			9.6/11			10.2/12			11.3/13			11.7/13
GNCS			6.4/8			7.1/8			7.7/9			8.4/10			8.5/9			9.1/11			10/11			10.7/13			11.5/13			11.6/13
TABLE 2
Comparison between IRLSL and GNCS (p > 2)
In Table 1, a least squares solution is chosen as a starting point. Theoretically, a starting point
has to satisfy the requirement that there is no zero residual, i.e., Irol > 0. The GNCS method
maintains this for subsequent iterations. Numerically, however, a positive constant about the size
of machine precision is added to the diagonal scallng matrix Dr and De to prevent numerical zero
due to the inexactness of the representation of real numbers on a computer. Hence, in practical
implementation, a starting point with zero residual may be allowed.
We experiment with the least squares solution and a vertex as starting points in Table 3.
The vertex starting point is determined by the solving the first n equations exactly. Since the
vertex starting point is chosen arbitrarily without attempting to minimize the objective function,
it is not surprising that the least squares starting point is better. Indeed, the function value ?(r)
of the least squares starting point is much lower than the vertex starting point we pick.
It is interesting to note, however, the reasonable good performance of the GNCS algoritlim
starting from an arbitrary vertex. The iterates often leave the starting vertex in one iteration
and approach the solution. This wffl be further illustrated by the tomographic inversion problem
described in the next section.
We want to emphasize that the GNCS method is a superbnearly convergent method and it
can start from anywhere. Nonetheless, the GNCS algorithm benefits from a warm start, i.e., a
point close to a solution. This is illustrated by the last row of Table 3 which provides the number
15
AVG/MAX Number of Iterations for li Problems
(200,20)			(200,100)			(200,180)			(300,50)			(300,150)			(300,280)
is			13.8/17			15.6/17			12.3/14			21.9/31			18.7/21			12.4/14
Vertex			18.8/23			26/33			19.2/25			28.5/36			32.7/40			20.1/25
NearOpt(0.001)			5.6/7			6.6/8			6.4/8			6.4/7			7.7/8			6.3/8
TABLE 3
Performance of GNCS with Different Starting Point
AVG/MAX Number of Iterations
(m,n)			(p=l)			(p=l.l)			(p=l.2)			(p=l.3)			(p=l.4) (p=l.S) (p=l.8)
(1000,100)			7.4/8			11.6/13			10/11			8.8/10			8.2/10			7.1/8			5.7/6
(2000,100)			8/9			12/13			10/12			8.9/11			8/9			6.7/8			5.3/6
(3000,100)			8.4/10			11.8/12			9.9/12			8.6/10			7.7/8			6.6/8			5.1/6
(1000,300)			17.8/20			11.5/13			10.4/11			9.1/10			8.2/9			7.5/9			6.1/7
TABLE 4
Performance of GNCS for Larger Problems
of average and maximum iterations required by GNCS starting from a perturbed optimal point
= x* +0.001*rand(n, 1). Here rand(n, 1)is a Matlab function which returns a random vector
in ??
We have also done experiments with some random large sparse problems. The results are
reported in Table 4. The problems have about 1% of nonzero entries.
Our experience clearly indicates that the GNCS method is much better than the IRLS method
when p is less than 1.5. in our implementation, a QR factorization is solved at each iteration
to obtain a descent direction; however, it is also possible to apply an iterative procedure, e.g.,
conjugate gradients, as has been done for IRLS in [10].
5. An Application ofl? Norm Minimization. In this section, we compare the perfor-
mance of IRLSL and GNCS for a synthetic geophysical tomographic inversion problem described
in [10]. The parameter settings are the same as before.
The origin of the word "tomography" comes from the Greek word "tomos" meaning cut or
slice and "graphos" meaning drawing. The objective of tomography inversion is to reconstruct
the slices of an object based on wavefield observations which have passed through the object and
are recorded on its surface [6]. The essential idea behind travel time tomography is based on the
following fact: the travel time for a seismic ray is the integration of slowness along the ray path,
16
r
18
16
14
?12
?1O
p
0o			2			4			6			8 10 12 14 16 18
HORZ?NTALDISTANCI
FIG. 7. Geometry for Vertical Seismic Profile
t(ray) = lay s(x,y,z)di
where s(x, y, z) is slowness (reciprocal velocity) at the point (x, y, z) and d? is the differential
length along the ray. Here the ray path itself depends on slowness. The idea of linearized
tomographic inversion is to assume all initial slowness model and use the difference between the
computed travel times and the observed travel times to invert for the slowness perturbations.
This results ill an overdetermined system of equation
Ax =
where b d=ef ?? are the differences between the travel times computed for the initial model and the
observed travel times, x d=ef ?? are the differences between the initial slowness model and the true
slowness. The a?? element of the rectangle matrix A is the distance the ith ray traveled in the
jth cell. The linearization process can be repeated to produce the improved velocity models.
The overdetermined system of equations is solved by minimizing a norm of the residuals. The
usual choice has been the 12 norm. However, increasingly, the Ii measure has shown increasingly
to be advantageous (e.g., [10]). The 1? solutions, 1 <p < 2, have been used but far less often.
Our intention is to indicate, through a synthetic tomographic inversion example, that 1? norms
with p between 1 and 1.5 may be eligible candidates as well.
We consider a synthetic vertical seismic profile, used in [10], in which a square velocity
anomaly is to be reconstructed. Figure 7 shows the geometry of this synthetic vertical seismic
profile: the square velocity anomaly appears in the middle layer. Figure 8 shows the ray illumi-
nation from 18 sources on the surface to 18 receivers on the left side of the model. Straight rays
17
14
2			4			6			8			10
souRo?
14			16			18
FIG. 8, Ray ii!uminzation from 18 sources and 18 reveivers
are assumed and the lower left triangular region was covered with 136 squares cells of constant
slowness.
The initial guess consisted of the three layers with constant slowness, i.e., without the square
anomaly in the middle layer. Figure 1 shows the graph of the exact model (see also [10] and [4]).
It is not difficult to see that all the 1? solutions are equivalent to the exact model.
Next we introduce four noise spikes of the magnitude 0.9 x max(IbI). Figure 9 graphs the
least squares solution in the presence of those spike errors. Figure 10 graphs the ii solution.
The ip solutions when 1.2,1.3 are shown in Figure 11 and 12. The number of iterations taken
are listed in Table 5. It is clear that the ip solutions, p = 1, 1.2, 1.3, recover the exact solution
well. The 12 solution, on the other hand, totally distort the exact solution. Furthermore, GNCS
computes a solution more quickly than IRLSL when p = 1.
p			1			1.1			1.2			1.3			1.4			1.5
GNCS			4			11			10			10			8			8
IRLSL			11			11			14			14			11			8
TABLE 5
Number of Iterations for Different p with Spike Errors
Now we consider the case where some Gaussian noise is present in travel times perturbation
b. Assume that the noise has a normal distribution with deviation of 0.01 x max(IbI). The li
solution is computed by GNCS in 16 iterations while IRLSL failed to compute a solution with the
required accuracy after 50 iterations. The least squares solution and the li solutions, computed
by GNCS, are shown in Figure 14 and 13. The li and 12 solutions look very similar. Once again,
18
0.1
0.1
0.05
?.05
0.1.
021
20			10
60			80			100 120 140
FIG. 9. Least Sqtiares Sojution with Spike Errors
0.014
0.012
0.01
0.008
0.006
0.004
0.002
0
0			20			40			80			80			100			120			140
FIG. 10. L1 So?tion with Spike Errors
19
x io?
14
12
0
B
.2;'
20			40
60			80			100 120 140
FIG. 11. l? Solution, p = 1.2, with Spike Errors
0.0
0.0151
0.01
0.0051
.0.005I
I--H
0.0
0			20			40
60			60			100			120			140
FIG. 12. l? Solution, p = 1.3, with Spike Errors
20
x 10
15
10
5
0--H
o			20			40			60			80			100			120			140
FIG. 13. L1 Solution with Gauss Errors
p			1.1 1.2 1.3 1.4 1.5
GNCS			23			13			11			10			8
IRLSL			50			20			16			11			10
ABLE 6
Number of Iterations for Different p with Spike and Gaussian Errors
GNCS is more efficient than IRLSL for p close or equal to one.
Finally, we consider the more realistic situation when both the Gaussian noise and isolated
spike errors described above are added simultaneously. Table 6 lists the number of iterations
taken by each method. The li, 12,11.3 solutions computed by GNCS are displayed in Figure 3, 2
and 5 respectively. The GNCS method is definitely preferable to IRLSL when 1 < p < 1.5.
The synthetic geophysical tomographic inversion problem illustrates that I? solutions 1 <
p < 1.5, have similar error resistant quality and the closer p is to one, the more error resistant
the solution is. GNCS is significantly more efficient than IRLSL for this range of p.
6. Conclusion. In this paper, we obtain the convergence of the GNCS method proposed
in [5] under much less restrictive conditions. The performance comparisons between IRLSL and
GNCS over a synthetic geophysical tomographic inversion problem are reported. The possible
application of the l? solutions, with 1 < p < 1.5, is illustrated. Both methods are extended for l?
problems with p> 2.
Our computational experience for both random generated problem and synthetic geophysi-
cal tomographic inversion problem clearly indicate the superiority of GNCS over IRLSL. While
the two methods are computationally cost-equivalent, GNCS computes a solution with required
accuracy much faster for 1 < p < 1.5. The performance of the two methods are comparable for
21
x 1O?
16
14
12
10
8
6
4
2
o			M
20			40			60			80			100 120 140
FIG. 14. Least Squares Solution with Gauss Errors
p> 1.5.
7. Acknowledgement. The author would like to thank Tom Coleman for his careful read-
ing of the manuscript and suggestions.
22
REFERENCES
[1] J. F. CLAERBOUT AND F. M?, Robust modeling with erratic data, Geophysics, 38 (1973), pp. 826--H844.
[2] T. F. COLEMAN AND Y. Li, A globally and quadratically convergent affine scaling method for linear li
problems, Mathematical Programming, 56 (1992), pp. 189--H222.
[3] J. E. DENNIS AND R. B. SCHNABEL, Numerical Methods for Unconstrained Optimization, Prentice-Hall, 1983.
[4] A. G. Io? A. SCALES AND 5. TREITEL, Fast Ip solution of large, sparse, linear systems application to
seismic travel time tomography, Journal of Computational Physics, 75 (1988), pp. 314--H333.
[5] Y. Li, A globally and quadratically convergent method for linear lp problems, SlAM Journal on Optimization,
(to appear).
[6] L. R. LiNES, Applications of seismic traveltime tomograph--Ha review, in Geophysical Inversion, 1992, pp. 333--H
337. SlAM.
[7] C. B. MoLER, J. LITTLE, 5. BANGERT, AND 5. KLEIMAN, ProMatlab User's guide, MathWorks, Sherborn,
MA, 1987.
[8] J. M. ORTEGA AND W. C. RHEINBOLDT, Iterative Solution of Nonlinear Equations in Several Variables,
Academic Press, 1970.
[9] M. R. OSBORNE, Finite Algorithms in Optimization and Data Analysis, Wiley Series in Probability and
Mathematical Statistics, 1985.
[10] J. A. SCALES, Tomographic inversion via the conjugate gradient method, Geophysics, 52 (1987), pp. 179--H185.
[11] II. SP?TH, Mathematical Algorithms for Linear Regression, Academic Press, 1992.
[12] G. W. STEWART, On scaled projections and pseudoinverse, Linear Algebra and its Applications, 112 (1989),
pp. 189--H193.
[13] H. L. TAYLOR, 5. C. B??ics, AND J. F. McCoy, Deconvolution with the li norm, Geophysics, (1979),
pp. 39--H52.
[14] R. YARLAGADDA, J. B. BEDNAR, AND T. L. WATT, Fast algorithms for lp deconvolution, IEEE Transactions
on Acoustics, Speech and Signal Processing, ASSP-33 (1985), pp. 174--H182.
23
