BIB-VERSION:: CS-TR-v2.0
ID:: CORNELLCS//TR92-1294
ENTRY:: 1993-10-14
ORGANIZATION:: Cornell University, Computer Science Department
LANGUAGE:: English
TITLE:: A Singular Loop Transformation Framework Based on Non-Singular 
        Matrices
AUTHOR:: Li, Wei 
AUTHOR:: Pingali, Keshav  
DATE:: July 1992
PAGES:: 22
ABSTRACT::
In this paper, we discuss a loop transformation framework that is based on 
integer non-singular matrices. The transformations included in this framework 
are called $\Lambda$-transformations and include permutation, skewing and 
reversal, as well as a transformation called loop scaling. This 
framework is more general than the existing ones; however, it is also more 
difficult to generate code in our framework. This paper shows how integer 
lattice theory can be used to generate efficient code. An added advantage of 
our framework over existing ones is that there is a simple completion 
algorithm which, given a partial transformation matrix, produces a full 
transformation matrix that satisfies all dependences. This completion 
procedure has applications in parallelization and in the generation of code 
for NUMA machines.
END:: CORNELLCS//TR92-1294
BODY::
A Singular Loop Transformation Framework
Based on Non-Singular Matrloes
Wel I-i
Keshav Pingali*
TR 92-1294
July 1992
Department of Computer Science
Cornell University
Ithaca, NY 14853-7501
*This work was supported by the Cornell Theory Center, NSF Presidential Young
investigator award CCR-8958543, by NSF grant CCR-9008526 and by a grant from
the Hewien-Packard Corporation.
This paper will appear in the Proceedings of the Fifth Annual Workshop on Language
and Compilers for Paralielism, New Haven, August, 1992.
A Singular Loop Transformation Framework
Based on
Non-singular -Matrices
Wei Li
Keshav Pingali *
Department of Compnter Science
Cornell University
Ithaca, New York 1?853
Abstract
In this paper, we discuss a loop transformation framework that is based on int?
ger non-singular matrices. The transformations included in this framework are called
A-transformations and include permutation, skewing and reversal, as well as a transfor-
mation called loop scaling. This framework is more general than existing ones; however,
it is also more dificult to generate code in our framework. This paper shows how in-
teger lattice theory can be used to generate efficient code. An added advantage of
our framework over existing ones is that there is a simple completion algorithm which,
given a partial transformation matrix, produces a full transformation matrix that sat-
isfies all dependences. This completion procedure has applications in parallelization
and in the generation of code for NUMA machines.
*This work was supported by the Cornell Theory Center, NSF Presidential Young Investigator award
#CCR-8958543, by NSF grant #CCR-9008526, and by a grant from the Hewlett-Packard Corporation.
This paper will appear in the Proceedings of the Fifth Annual Workshop on Language and Compilers for
Parallelism, New Haven, August, 1992.
1 Introduction
The importance of loop transformations in generating good code for vector and parallel ma-
chines is widely recognized [9, 1, 13]. A recent advance in this area is the work of Banerjee
who showed that thr?important loop transformations, namely permtttation, sketuing and re-
versal, can be modeled using unimodular matrices. Unimodular matrices have integer entries
and a determinant that is 1 or -1; therefore, they are closed under matrix product. It follows
that any sequence of these loop transformations can also be represented as a unimodular
matrix; conversely, any unimodular matrix can be interpreted as representing a sequence
of permutation, skewing and reversal transformations. The main benefit of the unimodular
abstraction is that it provides an approach to tackling the s?called `phase-ordering problem'
--H for many problems where there is no obvious order in which the transformations should
be performed, it is often possible to generate a unimodular matrix from which the desired
order of loop transformations can be determined easily. Banerjee has used this framework to
address the problem of generating parallel loops [3]; Wolf and Lam have used this framework
extensively to address both this problem and that of promoting data reuse for improving
cache performance [11, 12].
In this paper, we propose to use non-singular matrices, rather than unimodular ma-
trices, as a foundation for modeling loop transformations. Non-singular matrices include
unimodular matrices as a special case, and permit us to include a new transformation called
loop scaling in this framework. The real pay-off, though, is that it is easier to work with
non-singular matrices than with unimodular matrices. A typical algorithm that uses the
matrix framework, such as the generation of parallel outermost loops [3] or the exploitation
of locality in NUMA architectures [7], determines the first few rows of the matrix, and then
`pads out' the remaining rows to generate a matrix that represents a legal transformation.
It is easier to generate a non-singular matrix than a unimodular matrix since there are fewer
constraints to be satisfied, and in this paper, we give a completion procedure that produces
a non-singular matrix, given the first few rows. This completion procedure is non-trivial
since we must ensure that the result matrix respects the dependencies of the loop nest. Sur-
prisingly, it turns out that generating the transformed loop nest is somewhat more intricate
when non-singular matrices are used, than when unimodular matrices are used, and it is the
main concern of this paper.
The rest of the paper is organized as follows. In Section 2, we define the problem for-
mally, outline our loop restructuring framework and discuss the difficulties in generating the
transformed loop nest. In Section 3, we sketch the code generation technique for the case
of unimodular matrices, and discuss why it cannot be used directly for non-singular matri-
ces. In Section 4, we solve the code generation problem for non-singular matrices. The key
technical result is that a non-singular matrix can be decomposed into the product of a lower
triangular matrix with positive diagonal elements and a unimodular matrix. Using these
two matrices, we generate the transformed loop nest. In Section 5, we give a completion
procedure for non-singular matrices. The last section discusses related work.
2
2 Linear Loop Transformations
In this section, we intwduce integer lattices as a model of the iteration space of loops, and
non-singular matrices as a model of loop transformations.
2.1 Iteration.Spaces and Integer Lattices
Consider the loop nest in Figure 1(a) whose iteration space is shown in Figure 1(c). The
points in the iteration space of this loop can be modeled as integer vectors in the two
dimensional space Z2, where Z is the set of integers. For example, the iteration (i = 2,
j = 3) can be represented by the vector (2, 3). In general, points in the iteration space
of a loop nest of depth n can be represented by integer vectors from the space Z?. It is
convenient to use the theory of integer lattices [4] and view the points in the iteration space
as being generated by integral linear combinations of a set of basis vectors. For example, it
is easy to see that the points in the iteration space shown in Fi re 1(c) can be generated by
integral linear combinations of two integer vectors ( 01) and ol) Similarly, the iteration
space in Figure 1(d) is generated by linear combinations of the integer vectors ( ?12 ) and
For future reference, we define these concepts more precisely.
Definition 2.1 Let a1, a2,..., am be a set of linearly independent integer vectors. The set
A = ?A1a1 + A2a2 + ... + Amam A1,..., Am E ZJ is called an integer lattice generated by the
basisa?,a?,...,am.
We will call an integer matrix a basis matrix, if its columns are a basis.
The loop nest in Figure 1(a) has the property that every integer point within the loop
bounds is a point in the iteration space of the loop nest. We will call this a dense iteration
space. By contrast, Figure 1(d) shows a sparse iteration space because the integer point (2,
3), for example, is within the loop bounds but does not represent a point in the iteration
space of the loop. The notion of dense and sparse can be formally defined as follows.
Definition 2.2 An iteration space is dense, if for any two integer vectors v1 and v2 repre-
senting loop iterations, any integer vector v3 = Av1 + (1 --H A)v2 for some 0 < A < 1 also
represents a loop iteration. An iteration space is sparse if it is not dense.
The significance of this classification of iterations spaces is that it is considerably more
difficult to generate code for a loop nest if the iteration space is sparse, than if it is dense,
as we will show in Section 3.
Let e? be an n-dimensional vector with 1 in the ith entry and 0 elsewhere.
Theorem 2.1 The integer vectors from a n-dimensionai dense iteration space form an in-
teger lattice with the basis e1, e2,... e?.
Proof: Obvious. 0
3
for i = 1, 3
forj = 1, 3			___
A[?-2i+3, i+il =,;			2
(a) The original code
j
3
2
0
123
(c) The original iteration space
1<			i			<3
1<			j			<3
for tt = -2, JO step 2
for v = --H? + 3max(1,
--H? + 3min(3,
step 3
Aj?+3, v] = (`L+ 2v)/6;
(b) The target code
-2-11 1 2 3 4 5 6 7 8 9 10 u
(d) The target iteration space
--H2< u < 10
max((u + 6)/4, (6 --H u)/2) <v
v < min((u + 18)/4, (18 --H
(e)			Loop Bounds			(f) Image of Bounds
Figure 1: The working example
2.2 Loop ?ansformations
In this paper, we will focus on transformations that can be represented by linear, one-to
one mappings from the iteration space of the source program to the iteration space of the
t&get program. This class of transformations includes permutation, skewing and reversal,
as well as a new transformation called scaling. Examples of these transformations are shown
in Figure 2. These transformations are standard except for scaling which corresponds to
replacing a loop iteration variable by an integer multiple of it.
Linear, one-toone mappings between iteration spaces can be modeled using integer,
non-singitiar matrices. The reader can verify that the matrices shown in Figure 2 perform
4
for i = 1, 3
forj = 1, 3
A[i, 2j]=j
The original loop nest
fr?rtt = 1, 3
for v = 1, 3
A[v, 24 =
(a) loop interchange
for t' = 1, 3
for v = -3, -J
A[u, -24 = -v
(b) loop reversal
for u = 1, 3
for v = `L+J, t'+3
A[i, 2(v-n)j =
(c) loop skewing
for u = 1, 3
for v = 2, 6, 2
A[u, v] = v/2
(d) loop scaling
Figure 2: Primitive Transformations
5
the desired mappings from the source iteration space to the target iteration space, and are
integer and non-singular. Similarly, the points in the target iteration space of Figure 1(d)
are the image of the source iteration space points under the integer, non-singular matrix T
= ( i? 4i )
Definition 2.3 A loop transformation is called a A-transformation if it can be modeled by
an integer non-singular matrix.
Performing a sequence of transformations corresponds to composing the mappings be
tween iteration spaces, which, in turn, can be modeled as the product of the matrices rep-
resenting the individual transformations. Since the product of any number of integer, non-
singular matrices is integer and non-singular, it follows that the set of A-transformations is
closed under composition.
Conversely, we can show that the transformation represented by any integer non-singular
matrix can be viewed as a composition of the four basic transformations. More precisely, we
have the following result.
Theorem 2.2 The transformation represented by any integer non-singular matrix can be
viewed as a composition of permutation, skewing, reversal and scaling.
Proof: By applying the appropriate elementary row and column operations, an integer
non-singular can be reduced to a diagonal matrix [10]. The elementary operations can be
represented by multiplying interchange, reversal and skewing matrices on the left hand side
and on the right hand side the non-singular matrix. The diagonal matrix can be further
reduced to a product of scaling matrices. 0
As mentioned earlier, unimodular transformations are the subset of A-transformations
without loop scaling. An important property of unimodular transformations is the following.
Theorem 2.3 Unimodular transformations map a dense (sparse) iteration space to another
dense (sparse) iteration space.
Proof: The lattice remains the same, since only the basis is changed [10]. 0
2.3 Legality of A-transformations
Not every A-transformation is valid with respect to the data dependencies in the original
loop nest. Data dependencies can be represented by distance or direction vectors that are
lexicographically positive. For example, a distance vector d = (: ) means that the iteration
(i, j) depends on the iteration (i --H 3, j --H 2). Then Td is the dependence vector in the new
iteration space, since A-transformations are linear. A A-transformation T is legal if and only
if Td is lexicographically positive.
6
2.4 Generating Code
To generate code for the target loop nest, we must generate DO-loops that scan the points
of the target iteration space in lexicographic order, and replace occurences in the loop body
of the old loop indices with the new loop indices. The first problem is non-trivial and is
discussed in Sections 3 and 4. On the other hand, the problem of transforming the loop
body is relatively straight-forward and we sketch a solution here for completeness. If vectors
Si and Sj represent the source and target iteration variables, notice that Si = T??Sj. This
is just a set of equations expressing the old subscripts in terms of the new ones, and it can
be used to eliminate occurrences of the source iteration variables in the body of the loop in
favor of the new ones. For our running example, this set of equations is the following:
--H1/6
The transformed loop body is shown in Figure 1(b).
4/6			u
2/6			v
Note that T-1Sj will always be an integer point even though T-1 may be a rational
matrix; therefore, expressions like u/2 and (u + 2v)/6 in Figure 1(b) should be strength
reduced.
3 Difficulties in Generating Code
The difficulty in generating DO-loops to scan the target iteration space is that a A-transformation,
in general, does not preserve lexicographic order (two iterations may be performed in one
order in the source loop nest but in a different order in the target loop nest), so there is no
obvious way to use the source loop nest to generate code. As a first attempt, we can find the
image of the original bounds (the four inclined lines in Figure 1(d) for the running example),
and then generate a loop nest that visits in lexicographical order all the integer points in the
area bounded by the image. In this section, we show that this approach works well when
the target iteration space is dense; sparse iteration spaces will require additional machinery.
3.1 Computing Image of Bounds
There are many ways to compute the image of the original bounds; here, we describe a
simple method that uses Fourier-Motzkin elimination.
Given a non-singular matrix representing the transformation, the image bounds for the
target loop can be computed using the inverse of the transformation. Let Sj = (ii,.., j?)T
and S? = (ji, .., j?)T be source and target loop indices respectively under the non-singular
transformation T. Let the loop bounds for loop ?k be an affine function of loop indices
li, .?(k--H1) Each lower bound is in the form of a?1i1 + .. + %(k-1)?(k-i) + bj < ?k There may
7
be many such
bounds, there
bounds in the
lower bounds whose maximum is the lower bound for ?k Similarly for upper
may be many affine bounds whose minimum is the upper bound for ?k The
loop nest can be written in the following matrix form:
LbSi+bI<IlSI and i?Si<UtSi+bt
where Lb(Ub) is an mj x n x n) matrix, bi(bu) is a vector of length mt(m?). Ii(It) is an
identity matrix with some of its rows replicated to an ml X n (m? x n) matrix. Each row of
Lb (Ub) plus the corresponding row from b1(b?) form one lower(upper) bound.
Then the source iteration space is bounded by the following inequalities:
AS??<?b, where A= L6D41b -b1
The bounds for Sj are found by replacing Si by T-1Sj.
ATh1S <b			(1)
These inequalities can not be used directly as loop bounds, since the bounds for a loop
can only be a function of outer loop indices. We use the Fourier-Motzkin elimination alg?
rithm [5j suggested in [2] to compute the suitable bounds. The Fourier-Motzkin algorithm
may introduce redundant constraints, but these may be eliminated [2].
The Fourier-Motzkin algorithm is quite simple. Consider a system of linear inequalities
_			< bj,			i =
This system can be partitioned into three sets of inequalities according to the sign of the
coefficient of X?.
< Dj(?w), i=(1,..,p)
>--H Ej(?x), j=(1,..,q)
0 < Fk(x), k=(l,..,r)
where Di, Ej and Fk are linear functions of x = (xi, ..X(n?1)).
Now, we can eliminate r? from the system to get the following reduced sy8tem.
8
q
3
2
-1			1			2			3			4			5			p
--H1< p <5			--H1< p <5
max(1, 2 )<			o+P+3			____			p+3
q <m:n(3,2)			mar(1, [P+2l1)< q <min(3,[ 2
(a) Image of Bounds			(b) Exact Bounds
Figure 3: Dense Iteration Space
Ej(?x)			<			Di(?x),			i = (l,..,p), j =
0			<			Fk(?x),			k=(1,..,r)
This process can be repeated until there is exactly one variable left. The bounds for this
variable can be determined from inspection of the reduced system of equations.
Going back to our problem, consider the system of inequalities for Sj. The loop bounds for
j? can be computed by solving the inequalities for j?. The bounds for ?k can be computed by
first eliminating i(k+i), ..j? from the system using Fourier-Motzkin elimination, then solving
for ik etc.
Consider the working example. The iteration space (Figure 1(c)) is represented by the
integer vectors bounded by the system of linear inequalities in Figure 1(e). By computing
i,j in terms of u, v, replacing i,j by u, v in the inequalities and using the Fourier-Motzkin
eliminatiun, w? havt: the image of the source bounds (Figure 1(f)). Unfortunately, we cannot
use these inequalities directly to generate code for the target loop nest. There are two
9
problems. First, the lower and upper bounds may not even be integers --H for example, when
u = 4, the lower bound for v is ?2 Furthermore, even though the source iteration space is
dense, the target iteration space is sparse. This means that we must find some way to skip
over points (like (2,3) in our example) that are not in the iteration space of the target loop
nest.
3.2 Dense Spaces
For the special case when the target iteration space is dense (such as when a unimodular
matrix is used to transform a loop nest with a dense iteration space (Theorem 2.3)), both
these problems can be solved easily. If the target iteration space is dense, there is no need
to skip over points that are not in the iteration space of the target loop nest. Furthermore,
we can use floor and ceiling operations to get the nearest integers within the image bounds.
For example, consider the unimodular transformation U = ( ?o1 2i ) on the working
example.
For the source bounds in Figure 1(e), we can compute the image bounds shown in Fig-
ure 3(a). Since the target space is dense, we can use the ceiling and floor operations to
compute the exact bounds shown in Figure 3(b).
3.3 Discussion
For sparse iterations spaces, the ceiling and floor operations cannot solve the problem. For
the example in Figure 1(d), (4, 3) is the closest integer point to the boundary of v when
u = 4, but the starting point of the target loop nest is (4, 4). One possibility is to use
conditional tests in the loop body to avoid executing the loop body at points that do not
correspond to points in the target iteration space. This approach has been used by other
researchers (8], but it involves visiting integer points that are not necessary; moreover, the
conditional tests are expensive.
4 Algorithm for Code Generation
The key insight to solving the general problem is that an integer non-singular matrix T
can be decomposed into the product of a lower triangular matrix H with positive diagonal
elements and a unimodular matrix U. This decomposition is related to the Hermite normal
form of the transformation matrix (10]. We show that if U is used to transform the program,
10
the resulting program executes iterations in the same lexicographic order as the program
obtained by using T as the transformation matrix, We also show that the diagonal elements
of H correspond to loop step sizes. Putting these observations together gives an algorithm
that generates efficient code for the general case of non-singular matrices.
4.1 Auxiliary Iteration Space
By applying column operations to an integer non-singular matrix T, we can reduce it to
an integer lower triangular matrix with positive diagonal elements. This lower triangular
matrix is related to the Hermite normal form [lOj of the matrix T. It follows that T can
be written as the product of a lower triangular matrix H with positive diagonal elements
and a unimodular matrix U that represents the composition of the column operations. This
decomposition is not unique, but for our purpose, any such decomposition is adequate; to
avoid being pedantic, we will abuse terminology and refer to any such H as the Hermite
form of the transformation matrix T. Figure 4 shows how to compute H and U
Let T = HU, and let the source space be Si, and the target space be ?i' Define 5k = USi.
Then,
S5=T&=HU&=H&
Definition 4.1 The iteration space 5k is called the auriliary iteration space of Si with
respect to the decomposition HU
Theorem 4.1 The auxtUary iteration space is a dense space if the source space is dense.
Proof: Follows from Theorem 2.3 since U is unimodular. 0
Therefore the exact loop bounds of the auxiliary space can be computed using the alg?
rithm in Section 3.
An important property of the auxiliary space is that it executes iterations in the same
lexicographic order as the target iteration space. To see this, consider our running example.
H is lower triangular with positive diagonal elements, and U is unimodular. Consider
using U to transform the source program.
11
Input: An n X n integer matrix T.
Output: The Hermite form H of T and a unimodniar matrix U.
Algorithm Hermite
begin
U = I, where I is an n x n identity matrix.
For i = 1 to n do
/* Consider the s?bmatrix T[i:n, i:n] */
While T[i, i+J:n] ? O? do
Apply elementary colttmn operations Uc to make T[i, q
positive and T[i, i+J:n] zero.
U = U??1U
End. While
End-For
H =T
end
Figure 4: Computing the Hermite form
This is the unimodular transformation considered in Figure 3 in Section 3, and the bounds
of the auxiliary iteration space are shown in Figure 3(b). To develop the readers insight into
the relationship between the source, auxiliary and target iteration spaces, the mappings
between these spaces are shown below --H notice that both (p, q) and (tt, v) are traversed in
the same lexicographical order, but that this order is different from that of the source. This
can also be seen by comparing the iteration space diagrams in Figures 1(d) and 3.
(3, 1)			(-1, 1)			H			(-2, 4)
(2, 1)			(0, 1)			(0, 3)
(1, 1)			(1, 1)			(2, 2)
(3, 2)			(1, 2)			(2, 5)
(2, 2)			.			(2, 2)			(4, 4)
(1, 2)			(3, 2)			(6, 3)
(3, 3)			(3, 3)			(6, 6)
(2, 3)			(4, 3)			(8, 5)
(1, 3)			(5, 3)			(10, 4)
To show that this property is true in general, let ? be the lexicographical order.
Theorem 4.2 If the auxiliary iteration space is traversed in the lexicographical order, then
the target iteration space is also traversed in the lexicographical order.
12
Proof: ? = HSk, where H is a lower triangular matrix with positive diagonal. Let k? ?
be two iterations in the auxiliary iteration space, and d1 = k2 --H k1 be the distance of the
two vectors. C?early d1 ? 0. To see that the lexicographical order is preserved, consider the
new distance d2.
d?=3?-)?=Hk2-Hk? =Hd?
If d? (i), the ith element of c7 is			nonzero,			must be positive, since			? 0.
?, the?eading			d?(i)
T?hen?he leading nonzero of d2 is hiidi(i), which is also positive. Therefore ci; ? 0, and
Ji ?j2 0
This result yields a technique for code generation - decompose T into HU, generate
the DO-loops for traversing the auxiliary space using the technique of Section 3 (or any
other technique that works for unimodular matrices) and compute the target iteration space
variables in the loop body. Using the bounds for the auxiliary space computed earlier, the
target code for our running example is the following:
Jorp=-1, 5
for q = mar(1, [P+211), min(3, [P+23J)
(:)a(2i :)(;)
A[u+3,? =
Although this code avoids making conditional tests, it can be improved considerably.
Notice that the computation of u is invariant in the inner loop; moreover, u is a linear
function of the outer loop index and it can be strength reduced. Similarly, v is a linear
function of p and q and it can be strength reduced. Although such optimizations can be left
to a later optimization phase, it is preferable to use the induction variables u and v directly
as the loop control variables instead of p and q. We show how to do this next.
4.2 Target Iteration Space
Since H is lower triangular, it is easy to convert the bounds in the auxiliary space into
bounds in the target space. For our example, the relation between these two spaces is given
by the following equation:
From the first equation, it follows that the bounds foru are the bounds of p multiplied
by 2. Therefore, the bounds for u are the following:
13
Thput: The Hermite form H, and the bounds of the au?liary space Sk
Output: The bounds of the target space ?.
Algorithm Bounds(1k, uk) : (t?, ui)
begin
Sk = H?1Sj
For i = 1 to n do
/* Compute the offset by replacing k1, .., k(i?i) by ji, ?(i--Hl) */
Vj = hiiki + .. + hi(i?l)k(i?l) = fi(il, ..,i(i-i))
/* Compute lower bound with k1, .., k(i?i) in ltk
replaced by ii, ?(i-l) using Sk = H--H1S1. */
i'j =vi+hiil.k
/* Compute upper bound with k1, .., k(i?i) in ut
replaced by ii, ?(i--Hl) using Sk = H?1Sj. */
u?j = Vj + hijut
End-For
end
Figure 5: Computing the loop bounds
--H2 <u < 10
The bounds for v are the bounds of q multiplied by 3 with the offset --Hp which is
2
Therefore, the bounds for v are the following:
U2+1<< u			___
U2max,2?v??2min(3,[ 2
These bounds on v are constant, and the bounds on v depend only on v. Therefore,
these bounds can be used directly to construct the loop nest, as is shown in Figure 1(b).
The general algorithm is given in Figure 5.
The proof of correctness of this algorithm depends on the following lemma and the fact
that the diagonal elements of H are positive, and is omitted.
Lemma 4.1 Let Pi = (ji, ...,j?) be a point in the target iteration space that is the image
of a point P2 = (k1, ..., kn) in the auxiliary space. Any co-ordinate ii can be written as a
function of k1, ..., k?; similarly, kj can be written as a function ofji, ...,jj.
14
Proof: Follows from the observation that any leading principal sub-matrix H[1 : i, 1: i] of
H is lower triangular and non-singular, and that the inverse of a lower triangular matrix is
also lower triangular [6j. 0
To complete the generation of code, we need to skip over points within the bounds that
are not in the target iteration space. This would be difficult to do if these points appeared
in some irregular??attern within the loop nest bounds; fortunately, we can show that this is
not the case. In fact, we show that it suffices to use DO-loops with constant step sizes, and
that these step sizes are the integers in the diagonal of the Hermite form.
For the working example, H has the diagonal [2,3], which means that the loop step is 2
for the outer loop, and 3 for the inner loop. More generally, we have the following theorem.
Theorem 4.3 The positive integers on the diagonal of the Hermite form are the gaps in
each dimension.
Proof: Consider two points Pi = (k1, k2, ...kj, k(i+l)...kn) and P2 = (k1, k2, ...kj+i, a(i+i)...an)
in the auxiliary space which have the same coordinates in the first i --H 1 dimensions. Let
P3 and P4 be their images in the target space. From Lemma 4.1, it follows that P3 and P4
have the same coordinate for the first (i --H 1) dimensions; moreover, their difference in the
ith dimension is hjj since H is lower triangular. 0
Hence a loop nest can be constructed to traverse the target space directly. For our
running example, the target program is the following:
fortt = -2, 10 step 2
forv = --H? + 3max(1, ?-??i)? --H? + 3min(3, [122+3J)step 3
= (`L+2v)/6;
Notice the auxiliary space is used only to compute the bounds for the target space.
4.3 Sparse Source Iteration Space
So far, we have considered only the case when the source iteration space is dense. Our
technique also works when the source iteration Si is sparse as long as the source space is
?vlar? A regular sparse space is one that can be represented by an integer lattice Aj. The
base space Sb is always dense. The lattice basis Aj can be thought of as a A-transformation
from Sb to S?. The bounds for 56 can be computed by the Fourier-Motzkin elimination.
Then any A-transformation T on S? can be considered as a new A-trasnformation TAj on
the base space S?,?since A-transformations are closed under composition.
This observation lets us handle source loops in which lower bounds are affine functions
of loop indices, upper bounds are piece-wise affine functions of loop indices and step sizes
15
are constant, as shown below.
Theorem 4.4 The following loop nest with constant steps is regular.
for i1 = 0 to ubi step 81
for i2 = a21i1 to ub2 step 82
for ?? = a?iii + ?n2?2 + . + ?n(n?1)2n?1 to ubn step s?
Proof: Without loss of generality, the loop nest can be shifted so that 0 is a loop point.
This just changes the origin of the lattice. It is easy to show that the triangular matrix S
whose jth row comes from the coefficients of the lower bound and the loop step of the jth
loop forms a basis for the lattice of the loop points.
?i			0			0
=			a21			?2			0
a??			?n2			8n
5 Completion Procedure
One advantage of using integer non-singular matrices is that there is a simple completion
procedure that takes the first few rows of a desired transformation matrix and generates a
complete transformation matrix that respects dependences. As an example that illustrates
the need for a completion procedure, consider parallelizing the following program for a MIMD
machine (3j.
for i = 4, 8
forj = 3, 8
Aji, ji = Ab-3, j-2] + 1;
The dependence matrix for this program is D =
The outermost loop is parallel if and only if it does not carry any dependences; that
is, the first entry of every dependence vector is 0. In our example, the outermost loop is
not a parallel loop, since iteration i depends on iteration i --H 3. We can transform the loop
nest into one in which the outermost loop is parallel if we can find a transformation T such
that every entry in the first row of TD is 0. Therefore, the condition that must be satisfied
?ll ?l2
for transformation T --H ?21 ?22 to achieve the goal is that t11 t12 2 = 0. The
condition can be satisfied by choosing t11 = 2 and t12 = --H3. This determines the first row
of the transformation matrix, and now we must add additional row(s) to get a non-singular
matrix that respects all the dependences of the loop nest.
16
for i = ii, Uj			for u = lt, Ut
forj=l?, Uj			for v = 1v? Uv
B[j, 2i-j] = 0;			Bfrxpr, t'] = 0;
(a)			(b)
Figure 6: Transformation for Data Locality
A second example that illustrates the need for a completion procedure is the generation
of code for a NUMA architecture starting from a language like FORTRAN-D with user-
specified data distributions. This problem is discussed at length in a companion paper; here,
we will simply show an example [7]. Consider the program of Figure 6(a). Assume that a
two dimensional array B has a wrapped column distribution, i.e the columns of an array
are distributed in a round-robin manner to the processors: if N is the number of processors,
then processor 0 gets columns 0, ]V? 2N and so on, while processor 1 gets columns 1, N+1,
2N+J, etc.
Distributing iterations of the outer loop among the processors (Figure 6(a)) results in
processor p executing iterations p, p + ]V? etc. Consider accesses to elements of array B.
There will be many remote memory accesses. Now, consider the transformed program of
Figure 6(b). If we distribute the outermost loop among the processors as before, there are
no remote accesses to B.
For this particular goal, we need to transform the loop nest so that the subscript in the
column index of B is the new outer loop index. This means that if the transformation is T
= ?ii ?12 the first row of the transformation must be (2-1). Again, we need to invoke a
t21 t22
completion procedure to add additional row(s) to get the full matrix.
We will refer to this as the completion of a partial transformation.
5.1 Completion Procedure
Our completion procedure requires that the following precondition be satisfied.
Precondition: The partial transformation must have full row rank, and should not violate
dependences.
These conditions are reasonable: if a row of the transformation matrix is linearly de
pendent on the others, it is clearly impossible to generate a non-singular matrix by adding
additional rows. Similarly, if some row of the partial transformation violates one or more
dependences, this cannot be rectified by extending the matrix with additional rows.
First, we delete all dependence vectors that are carried by the loops corresponding to the
rows of the partial transformation1 since they do not have to be considered when filling in the
rest of the matrix. The completion procedure works by finding a vector that is independent
17
of the existing row vectors in the partial transformation and within 90 degrees of each
dependence vector. This vector is appended a new row to the partial transformation and all
dependencies carried by the loop corresponding to this row are dropped from consideration.
This technique is repeatedly applied until there are no further dependencies to be satisfied,
at which point we can apply standard linear algebra techniques to complete the generation
of a non-singular matrix.
To find?the desired rows, we make use of the following invariant:
Invariant: The dependence vectors are in the orthogonal complement of the subspace
spanned by the rows of the partial transformation.
We find a vector that is within 90 degrees of every dependence vector and strictly within
90 degrees with at least one dependence vector. This vector can be found by looking for the
first row of the dependence matrix with nonzero entries. Let k be that row index.
Lemma 5.1 ek is within 90 degrees of every dependence vector, and strictly within 90 degrees
of at least one dependence vector.
Proof: It is easy to check that ek satisfy the condition, since for any dependence vector d,
4d > 0, and Jkd> 0 for at least one d. 0
But ek is not necessarily linearly independent of the rows in the partial transformation.
Therefore, we project ek to the orthogonal complement of the subspace spanned by the
rows of the partial transformation so that the projected vector is linearly independent of
the existing rows. If pT is the partial transformation, it is easy to see that the projector is
Q = (I --H p(pTp)?ipT) The projected vector is y = Qe?. Let x = cy for some positive
scaling number c that makes all of the entries integers and relative primes. For pT --H (2-3)
and D =			the projection is shown in Figure 7.
Theorem 5.1 The projected vector y is linearly independent of the rows and within 90
degrees of every dependence vector.
Proof:
0
o+ Linear independence:
We can prove a even stronger result, i.e. y is orthogonal to the rows in pT
pTy = pT(1 --H P(PTP)?iPT)e? =0
o+ Within 90 degrees:
For any dependence vector d, yTd = eT?QTd. Q is symmetric and d is already in the
orthogonal complement by the invariant. This means QTd --H Qd = d. Then yTd = Jkd.
Hence y is within 90 degrees of every dependence vector by Lemma 5.1.
18
3
2
Pi
y
-1
-2
-3			P
Figure 7: Extending Partial Transformation by Projection
The complete algorithm is presented in Figure 8.
Theorem 5.2 The algorithm Completion generates a legal A-transformation.
Proof: immediately follows from Theorem 5.1. 0
We now apply the algorithm to the examples in the previous section. The first example
can be satisfied by choosing t21 = 3 and t22 = 2. The new dependence vector is iOa )
which means that the new outer loop is a parallel loop. The matrix (: 2? ) satisfies the
conditions. For the second example, there is no dependence, so the full transformation is
(: ,i).
5.2 Discussion
The completion technique discussed here works even when dependences are represented using
direction vectors. There is considerable flexibility in the choice of the projector, and we have
19
Thput: An m x n partial transformation matrix pT
and a dependence matrix D.
Output: An n x n legal transformation matrix T.
Algorithm Completjon(PT, D) : Matrix
begin
/*LetP?T be rowi 0f pT
For i = 1, m
fT =i?rD
D = D - dj,
End-For
wheref[j]> 0
and di be column i of D */
r = m + 1;
While D is not empty do
let k be the first nonzero row of D;
x = c(I --H P(PTP)?lPT)e?;
where c is a positive number that makes x
an integer vector and the entries relative primes.
D = D - d, if d is a dependence vector and d[k]> 0
--H xT;
r = r + 1;
End- While
R = I, where I is an n x n identity matrix.
For i = 1, r do
/* Consider the submatrix PT[i:m, i:n] */
apply the elementary column operations to make PT[j, i] nonzero
and PT[i, i+J:n] zero.
If columns i and j have been exchanged Then
exchange rows i and j of R
End-If
End-For
return(append(PT, R[r+J:n, J:n]));
end
Figure 8: Computing a Legal Full Transformation
20
shown just one possibility; the choice of the most desirable projector will depend on the
application.
6 Related Work
Generalized loop transformations have been studied by Banerjee who showed that unimod-
ular matrices can model loop interchange, skewing and reversal [3]. Wolf and Lam [11, 12]
extended the unimodular framework to deal with both distance and direction vectors and
to transform loop nests for cache locality. Ancourt and Irigoin [2] developed algorithms for
scanning polyhedra using loop nests. They considered only dense iteration spaces. Lu and
Chen [8] have used injective ft'nctions to model loop transformations, but they use condi-
tional tests in their target code; our technique eliminates the need for such tests. We are
not aware of any prior work on general completion procedures for partial transformations.
7 Conclusions
We have introduced a loop transformation framework called A-transformations based on
integer non-singular matrices. Efficient code can be generated for target loop nests using
integer lattice theory. We have also presented a simple completion algorithm that generates
correct transformations from partial transformations.
8 Acknowledgments
We would like to thank Mark Charney, Tom Coleman, Shmuel Onn, Danny Ralph, Paul
Stodghill, Eva Tardos, Mike Todd, Charles Van Loan, Zhijun Wu for listening to the ideas
and for their comments.
References
[1]
R. Allen and K. Kennedy. Automatic translation of FORTRAN programs to vector
form. ACM Transactions on Progamming Lang?ages and Systems, 9(4):491--H542, Octo
ber 1987.
[2] C. Ancourt and F. Irigoin. Scanning polyhedra with DO loops. In Third AUM Sympo-
sium on Principles and Practice of Parallel Programming, pages 39--H50, April 1991.
[3]
U. Banerjee. Unimodular transformations of double loops. In Proceedings of the Work-
shop on Advances in Languages and Compilers for Parallel Processing, pages 192--H219,
August 1990.
21
[4j J. W. 5. Cassels. An introduction to the geometry of numbers. Berlin, Springer, 1959.
[5] G. B. Dantzig and B. 0. Eaves. Fourier-motzkin elimination and its dual. Journal of
Combinatorial Theory(A), 14:288--H297, 1973.
[6] G. H. Golub and 0. F. Van Loan. Matrix Computations. Johns Hopkins University
Press, 1989.
(7] W. Li and K. Pingali. Access normalization: loop restructuring for NUMA compil-
ers. In Proc. 5th International Conference on Architectural Support for Programming
Languages and Operating Systems, 1992.
[8]
L. Lu. A unified framework for systematic loop transformations. In 3rd ACMSIGPLAN
Symposium on Principles and Practice of Parallel Programming, pages 28--H38, April
1991.
[9] D. Padua and M. Wolfe. Advanced compiler optimizations for supercomputers. Com-
munications ofACM, 29(12):1184--H1201, December 1986.
[10] A. Schrijver. Theory of Linear and Integer Programming. John Wiley & Sons, 1986.
[11]
M. Wolf and M. Lam. A data locality optimizing algorithm. In Proc. ACM SIGPLAN
91 Conference on Programming Language Design and Implementation, pages 30-44,
June 1991.
[12] M. Wolf and M. Lam. A loop transformation theory and an algorithm to maximize
parallelism. IEEE Transactions on Parallel and Distributed Systems, October 1991.
[13] Michael Wolfe. Optimizing Supercompilers for Supercomputers. Pitman Publishing,
London, 1989.
22
