BIB-VERSION:: CS-TR-v2.0
ID:: CORNELLCS//TR94-1431
ENTRY:: 1994-06-27
ORGANIZATION:: Cornell University, Computer Science Department
LANGUAGE:: English
TITLE:: The Lambda Loop Transformation Toolkit (User's Reference Manual)
AUTHOR:: Li, Wei 
AUTHOR:: Pingali, Keshav
DATE:: June 1994
PAGES:: 30
ABSTRACT::
Loop transformations are becoming critical to exploiting parallelism and data 
locality in parallelizing and optimizing compilers. This document describes 
the Lambda loop transformation toolkit, an implementation of the non-singular 
matrix transformation theory, which can represent any linear one-to-one 
transformation.

Lambda has a simple interface, and is independent of any compiler intermediate 
representation. It has been used in parallelizing compilers for multiprocessor 
machines as well as optimizing compilers for uniprocessor machines.

Keywords: Parallel programming, parallelizing compilers, loop transformations, 
linear transformations, nonsingular transformations.
END:: CORNELLCS//TR94-1431
BODY::
The Lambda Loop Transformation Toolkit
(User's Reference Manual)*
Wei Li
Keshav Pingali
TR 94-1431
June 1994
Department of Computer Science
Cornell University
Ithaca, NY 14853-7501
This work was partly supported by Cornell University, by a grant from the Hewlett-
Packard Corporation, and by the University of Rochester.
The
Lambda Lo6p Transformation Toolkit
(User's Reference Manual)
Wei Li
Department of Computer Science
University of Rochester
Keshav Pingali
Department of Computer Science
Cornell University
May 1994
University of Rochester Technical Report 511
Cornell University Technical Report 94-1431
Abstract
Loop transformations are becoming critical to exploiting parallelism and data lo
cality in parallelizing and optimizing compilers. This document describes the Lambda
loop transformation toolkit an implementation of the non-singular matrix transforma-
tion theory, which can represent any linear one-to-one translormation.
Lambda has a simple interface, and is independent of any compiler intermediate
representation. It has been used in parallelizing compilers for multiprocessor machines
as well as optimizing compilers for uniprocessor machines.
Keywords: Parallel programming, parallelizing compilers, loop transformations, linear
transformations, nonsingular transformations.
0This work was p&tly supported by Cornell University, by a grant from the Hewlett-Packard Corporation,
and by the University of Rochester.
Contents
1 Introduction
2 A Linear Loop Transformation Theory
3 Data Dependences
3.1 Data Types
3.2 Routines
4 Constructing Transformations
4.1 Data Types
4.2 Nonsingularity
4.3 Data Dependences
4
4
5 Code Restructuring
5.1			Computing Loop Bounds
5.1.1			Linear Expressions
5.1.2			Single Loops			.
5.1.3			Loop Nests
5.2			Computing Loop Body
6 Utility Routines
6.1 Print Routines
6.2 Vector Operations
6.3 Matrix Operations
6.4 Integer Operations
6.5 Data Dependences
7 How to Use Lambda?
7.1 Compiling and Linking
6
6
7
8
8
9
10
11
11
11
12
13
14
2
14
15
15
17
20
20
21
21
7.2 An Example
7.2.1 Data Dependences
7.2.2 Constructing Transformations
7.2.3 Code Restructuring
8 Summary
9 Acknowledgments
3
22
22
23
24
27
27
1 Introduction
Loop transformations are becoming critical to exploiting parallelims and data locality in
parallelizing and optimizing compilers. This document describes the Lambda loop transfor-
mation toolkit. The toolkit is an implementation of the non-singular matrix transformation
theory described in [6, 8].
The main benefit of this loop transformation theory is that it provides an approach
to tackling the socalled `phase-ordering problem', i.e. for many problems it is difficult to
decide the sequence of the primitive transformations such as loop interchange, loop skew-
ing, loop reversal [13], and loop scaling [8] should be performed. The loop transformation
theory, generating the unimodular matrix approach [2, 12], provides a framework to re?
resent compound transformations of these primitive transformations. In fact, this theory
can represent any linear one-toone transformation, of which the primitive transformations
mentioned above are special instances. This linear transformation theory has been extended
to handle loop aligment [1]. Other research on compound transformations can be found in
recent publications [4, 9,10, 11]. This toolkit has been used in parallelizing compilers for
multiprocessor machines [3, 7] as well as optimizing compilers for uniprocessor machines [5].
Lambda has a simple interface, and is independent of the intermediate representation
used in the compiler. There are four modules: the data dependence module, the transfor-
mation module, the code restructuring module, and the utility module.
Lambda is a research software, and distributed freely in the interest of advancement of
science. Due to its experimental nature, no warranty or liability are provided. A reasonable
level of support approriate to a research project may be expected.
The document is organized as follows. Section 2 gives a brief review of the non-singular
transformation theory. The data types and routines for handling data dependences are
described in Section 3. The routines for constructing loop transformations are presented in
Section 4. Section 5 shows the functions to restructure loops with a non-singular transfor-
mation. The utility functions such as print routines in Section 6 are helpful in the debugging
process. Some examples of using these routines are given in Section 7. We summarize in
Section 8.
2 A Linear Loop Transformation Theory
In this section, we briefly review the linear transformation framework developed in [8].
This transformation theory is based on the use of integer lattices as the model of loop nests
and the use of non-singular matrices as the model of loop transformations.
Consider the iteration space in Figure 1(a), which is derived from the loop nest in
Figure 1(c) where i is the outer loop and j is the inner loop. The outer loop is not parallel
because of the data dependence 2? ) For coarse-grain parallelism, it is preferable to have
a parallel outer loop, i.e. an outer loop that does not carry data dependences. This can be
4
2 3
(a) Iteration Space
for i = 1, 3
forj = 1, 3
A[i, il = A[i-1, j-2] # 1;
03
0
0			0
0
-l			I			2			3			4			5			u
(b) New Iteration Space
for u --H -1 5
for v = u+2max([(u+1)/2],1),
-u+2mind(u#3)/2j ,3), 2
A[(u+v)/2, v] = A[(u#v)/2 -1), v-2] # 1
(c)			Source code			(d) Transformed code
Figure 1: Loop transformation for coars&grain parallelism
accomplished by the loop transformation below:
The iteration space in Figure 1(b) is the transformed iteration space; in this iteration
space, loop u is a parallel loop as shown in Figure 1(d) It is possible to obtain this transfor-
mation by a sequence of simple loop transformations like skewing and interchange, but it is
non-trivial to determine the order in which these transformations must be applied to achieve
the desired effect. The matrix-based approach permits the generation of the transformation
matrix from the dependences, and thereby produces the composite transformation directly.
Given a loop nest and a non-singular matrix that represents a legal transformation of the
loop nest, it is non-trivial to generate the code for the transformed program. Fortunately,
we have shown that the Hermite normal form of the transformation matrix can be used
to solve this problem. The details of code generation may be found in a related paper [8].
This paper also gives a completion procedure which, given some initial rows of a desired
transformation matrix, produces a complete transformation matrix that represents a legal
restructuring of the original loop nest.
This framework can be used in a number of applications such as restructuring for ma-
chines exploiting coarsegrain or finegrain parallelism, for exploiting spatial and temporal
5
data locality and for enhancing data reuse.
3 Data Dependences
3.1 Data Types
A data dependence can be specified with either distance or direction. A distance is repre-
sented by an integer constant, and a direction is represented by the following symbols.
typedef eniun
LA?dK,
LA?dLT,
LA?dLEq,
LA?dEq,
LA?dGEq,
LA?dGT,
LA?dDOT,
LA?dLG,
LA?dSTAR
LA?dSIZE
> LA_DIR_T;
LA?dK represents distance, where its value is stored in another variable. The data type
for a dependence is a structure with two fields.
typedef struct 1a?dep ?
LA_DIR_T dir;
int dist;
LA?DEP?T;
+
*define LA?DEP?DIR(dep)			((dep) dir)
Sdefine LA?DEP?DIST(dep)			((dep) dist)
A data dependence vector is a vector (array) of dependences pointed by the field vector.
The field next is used to form a dependence matrix with a linked list.
typedef struct vector f
LA?DEP?T * vector;
struct vector *next;
*LA?DEP?V?T;
+
*define LA?DEP?V?VECTOR(v) ((v)->vector)
*define LA?DEP?V?NEXT(v) ((v)->next)
6
And a dependence matrix is a linked list of dependence vectors. The matrix has a header
that contains the dimension (dim) and the size (size) of the dependence matrix, i .e. the
number of dependence vectors in the matrix, in addition to the linked list of dependence
vectors.
typedef struct 1a?dep?m +
LA?DEP?V?T vectors;
int dim;
int size;
? *LA?DEP?M?T;
*define LA?DEP_M_vEcToRS(D)			((D)->vectors)
Idefine LA?DEP?M?DIM(D)			((D)->dim)
Idefine LA?DEP?M?sIzE(D)			((D)->size)
3.2 Routines
There are routines provided to operate on the dependences, some of which are listed here. A
legal dependence vector should be lexicographically positive. A data dependence analyzer
may return a dependence vector with illegal components, i.e. not lexicographically positive.
1a?dep?1egai deletes the illegal components. For example, the dependence vector (*, 1)
becomes (<, 1) and (=, 1). The illegal component (>, 1) is deleted.
o+ Creates a dependence vector.
LA_DEP_V_T 1a?dep?vec?newC int dim );
dim is the length of the dependence vector, i.e. the number of loops. This routine
allocates an array of size dim, and returns the pointer to the vector (array).
o+ Frees a dependence vector.
void 1a?dep?vec?free( LA?DEP?V?T d );
0 Creates a dependence matrix.
LA?DEP?N?T 1a?dep?matrix?new(int dim, int size);
This routine allocates the header for the dependence matrix. The dimension (dim)
and size (size, usually zero for initialization) must be provided. The pointer to the
vectors is nulled.
7
o+ Deletes the illegal components from a dependence matrix.
LA?DEP??(?T 1a?dep?1ega1(LA DEP_N?T D);
o+ Eliminates redundant data dependences.
LA?DEP?N_T la_dep?no?redundant(LA DEP_M_T D);
This routine is quite useful. The dependence vectors in a loop nest tend to be similar;
in fact many of them will be the same. The size of the dependence matrix tends to
decrease significantly after the redundant vectors are eliminated.
4 Constructing Transformations
4.1 Data Types
A matrix type is a two dimensional matrix, which is an array of arrays. The row?size is the
number of rows, and the co1?size is the number of columns in the matrix. The type can
represent any rational matrix A/d, where A is an integer matrix, and d is the denominator.
The field denom contains the integer denominator.
typedef struct _la_matrix ?
int ** matrix;
int row?size;
int co1?size;
int denom;
+ *LA?MATRIX?T;
*define LA?MATRIx(T)
o+define LA?MATRIX?ROWSIZE(T)
*define LA?MATRIX?COLSIZE(T)
Idefine LA?MATRIX?DENON(T)
((T)->matrix)
((T)->row?size)
((T)--H>co1?size)
((T)->denom)
To create a matrix structure with rowsize and coisize, we can use the following routine.
It allocates memory space for the two dimensional matrix, and sets the right values
to fields row?size and co1?size.
LA?MATRIX?T 1a?matrix?new(int rowsize, int colsize);
o+ Create a matrix struct. No space is allocated for the matrix pointed by the field
matrix.
8
LA?NATRIX?T 1a?matrix?a11ocate( void );
o+ Free the matrix struct. Any space pointed by matrix is not freed, and must be freed
before 1a?matrixjree is called.
void 1a?matrix?free( LA?NATRIX_T matrix);
4.2 Nonsingularity
This set of routines are useful in the construction of a non-singular transformation matrix.
o+ Checks if the transformation matrix is nonsingular.
int 1a?is?nonsingu1ar(LA?MATRIX?T T);
It returns 1 if the square matrix T is non-singular, 0 if the matrix is singular.
o+ Checks if the transformation matrix has full rank.
int 1a?is?fu11rank(LA?MATRIX?T pT);
pT can be any partial transformation, i.e. the number of rows is not equal to the
number of columns. The function returns 1 if the matrix has full rank, i.e. its rank
equals to the minimum of row size and column size, 0 otherwise.
o+ Computes the rank of a matrix.
int 1a?rank(LA?MATRIX?T pT);
It returns the rank of the partial transformation pT.
o+ Computes a base matrix.
LA?MATRIX?T 1a?base(LA?NATRIX?T pT);
A row base is computed by deleting the rows that are linearly dependent on the base
rows. The rows are processed in the top-down fashion, i.e. the ?rst row is considered
lirst, then the second row, and so on.
o+ Extends a base matrix to a nonsingular matrix.
LA_MATRIX?T 1a?padding(LA?MATRIX?T pT);
Once the row base matrix is obtained in pT, additional rows are added to form a
non-singular square matrix.
o+ Computes the inverse of a transformation.
LA?MATRIX_T la_inverse(LA?NATRIX?T T);
This routine is useful for computing the new loop body, where the inverse of the
transformation must be applied to the expressions in the loop body.
9
4.3 Data Dependences
This set of routines assists the construction of a legal transformation.
o+ Checks if a full transformation is legal with respect to the data dependences.
int 1a?is?1ega1(LA?NATRIX?T T, LA?DEP?M?T D);
It returns 1 if the complete transformation T does not violate the dependences in D,
o otherwise.
o+ Checks if a partial transformation is legal.
int 1a?is?1ega1?par(LA?MATRIX?T pT, LA?DEP?N?T D);
It returns 1 if the partial transformation T does not violate the dependences in D, 0
otherwise.
o+ Computes a legal base matrix by deleting the rows that violate the dependences.
LA?NATRIX?T 1a?base?1ega1 (LA?NATRIX?T pT, LA?DEP?M?T D);
A base matrix may violate data dependences. The rows that violate dependences
will be deleted from the matrix to form a legal partial transformation. The data
dependences that are satisfied by the the partial transformation are deleted from D.
o+ Extends the legal base matrix with respect to the dependence matrix.
LA?MATRIX?T 1a?padding?1ega1CLA?MATRIX?T pT, LA?DEP_M_T D);
Similarly, the legal base matrix will be extended to a legal complete non-singular
matrix by adding additional rows.
A data dependence will be transformed to a new dependence vector. The function
ia?dep?trans returns the transformed dependence vector of the dependence vector d
under the transformation T.
LA_DEP_V_T la dep?trans(LA?DEP?V?T d, LA?MATRIX?T T);
o+ Dot product of an integer vector and a dep vector.
LA?DEP?T 1a?vec?X?dep(1a?vect veci, LA?DEP?V?T d, int dim);
o+ An integer matrix times a dependence matrix.
LA?DEP?N?T 1a?matrix?X?depN(LA?NATRIX?T A, LA?DEP?N?T D);
10
5 Code Restructuring
A loop nest is represented by a list of loops, where each loop contains a lower bound, an
upper bound, and a step size. Each bound is a list of linear expressions. A lower bound
can be the maximum of a set of linear functions, and an upper bound can be the minimum
of a set of linear functions.
Symbolic variables are allowed in the loop bounds as long as they are loop invariants.
We call them blobs. There are design decisions in choosing blobs. For example, if the
expression 2x + 3y is one of the loop bounds, where both x and y are loop invariants. One
can make the whole expression a blob, then the number of blobs is the number of distinct
expressions that contain these variables. Or one can make each variable a blob, e.g. x is a
blob, and y is another blob, then the number of blobs is the number of such variables. In
general, a loop bound can have a linear expression of these blobs as a part of the bound.
We explain these data structures in the bottom-up fashion, i.e. from expression, loop,
to loop nest.
5.1 Computing Loop Bounds
First, simple integer vectors and matrices are defined. An integer vector is an array of
integers, and an integer matrix is an array of integer vectors.
typedef int * 1a?vect;
typedef 1a?vect * 1a?matrix;
5.1.1 Linear Expressions
A linear expression is a linear expression of the loop index variables and symbolic variables
(blobs). The linear expression type has a field for the coefficients (coel) of the loop index
variables and the coefficients (bThb?coej\ of the blobs. cO is the constant in the expression.
Again, denom can be used to represent a rational linear expression in the same way as
a rational matrix. The pointer next enables the representation of a linked list of linear
expressions.
typedef struct 1a?expr?
1a?vect coef;
int cO;
1a?vect b1ob?coef;
int denom;
struct 1a?expr *next;
> *LA?EXPR?T;
11
*define LA?EXPR?COEF(expr)
o+define LA?EXPR?CO(expr)
Idefine LA?EXPR?BLOB?COEF(expr)
Idefine LA?EXPR?DENOM(expr)
#define LA?EXPR?NEXT(expr)
((expr)--H>coef)
((expr)->cO)
((expr)->blob?coef)
((expr)->denom)
((expr)->next)
For example, let i,j be the loop index variables and x, y be two blobs in the linear
expression (i +j + 2 + ? + y)/3. The coefficient vector contains the linear coefficients (coef
(1 1) for i+j). The integer constant is in cO (co = 2). The blob coefficients are (1, 1), since
there are only two blobs and we consider x as blob 1, and y as blob 2. The denominator is
3.
We can allocate and free an expression with the dimension dim and total number of
blobs blobs using he following routines.
. LA?EXPR_T la_expr?new(int dim, int blobs);
This function allocates space for not only the loop structure but also the two coefficient
vectors. The size of the loop index coefficients is dim, and the size of the blobs
coefficients is blobs.
. void la?expr?free( LA?EXPR?T expr );
5.1.2 Single Loops
A loop has a lower bound, an upper bound, and a step size. The lower bound can be
the maximum of a set of linear expressions defined in the previous section, and a linear
expression with an integer denominator may have an implicit ceiling function since only
integer functions are allowed. An upper bound can be the minimum of a set of linear
expressions, where the floor function is implicit, when necessary.
For example,
DO i = max( ceil((i+j)/2), ...), min( floor((2i)/3), ...), step 2
The loop type has a list of expressions for both lower and upper bounds. The offset
expression (a linear expression of loop index variables) is used in both lower and upper
bounds.
typedef struct la?loop?
LA?EXPR?T low;
LA?EXPR?T up;
12
int step;
LA?EXPR?T offset;
> *LA?LOOP?T;
Idefine LA?LOOP?LOW(loop)
#define LA?LOOP?UP(loop)
#define LA?LOOP?STEP(loop)
Idefine LA?LOOP?OFFSET(loop)
((loop)->low)
((loop)->up)
((loop)--H>step)
((loop)->offset)
The new loop bounds after a nonsingular transformation can be more complex. The
bounds may have a linear offset, and an integer factor that equals to the step size. The
offset and factor are the same for both lower and upper bound in the same loop. The integer
factor is always the same as the loop step size.
For example,
DO v = u + 2*max( ceil((u+v)/2),
u + 2*min( floor(..),...), step 2
These two routines allocate and free a loop structure.
o+ LA?LOOP?T la?loop?new( void );
o+ void la?loop?free( LA?LOOP?T loop );
5.1.3 Loop Nests
A loop nest is an ordered array of loops with the ?rst being the outermost loop and the last
being the innermost. depth is the depth of the loop nest. blobs is the total number blobs.
typedef struct _la loopnest?
LA?LOOP?T *loops;
int depth;
int blobs;
+ *LA?LOOPNEST?T;
#define LA?NEST?LOOPS(nest)
Idefine LA?NEST?DEPTH(nest)
#define LA?NEST?BLOBS(nest)
((nest)->loops)
((nest)->depth)
((nest)->blobs)
These routines allocate and free a loop nest.
o+ LA?LOOPNEST_T la_nest?new(int depth, int blobs );
13
o+ void 1a?nest?free( LA?LOOPNEST?T nest );
la?nest takes a loop nest, and the transformation matrix and returns the new loop nest.
o+ LA?LOOPNEST?T 1a?nest(LA?LOOPNEST?T nest, LA?NATRIX?T T);
5.2 Computing Loop Body
The loop body can be considered as a set of simple vectors. Each vector can be transformed
using ia?vector.
A simple vector is a piece-wise linear function.
typedef struct 1a?vector ?
1a?vect coef;
int size;
int denom;
> *LA?VECThR_T
*define LA?VECTOR?COEF(v)
*define LA?VECTOR?SIZE(v)
*define LA?VECTOR?DENOM(v)
((v)->coef)
((v)--H>size)
((v)->denom)
The following routines allocate and free a simple vector.
o+ LA?VECTOR_T la_vector?new( int size );
o+ void la_vector_free( LA?VECTOR?T v
la?vector computes the new expression in the transformed loop nest. The input is the
inverse of the transformation.
o+ LA?VECTOR?T 1a?vector(LA?MATRIX?T invT, LA?VECTOR?T v);
6 Utility Routines
There are many other routines available in the toolkit. These routines include vector and
matrix operations, algebraic operations on dependences, print routines for debugging, and
so on.
14
6.1 Print Routines
Here is the list of print routines that are helpful for debugging.
0 Print a matrix.
void la?matrix?print( LA?NATRIX?T mat );
o+ Print a loop nest. If the start symbol is i, then index variables are i,j, k and so on.
void 1a?nest?print( LA?LOOPNEST?T nest, char start);
o+ Print a loop.
void la?1oop?print( LA?LOOP?T loop, int depth, int blobs, char start);
o+ Print an expression.
void la?expr?printC LA?EXPR?T expr, int depth, int blobs, char start);
o+ Print a dependence.
void la?dep?print(LA?DEP?T d);
o+ Print a dependence vector.
void la?dep?vec?print(LA?DEP?V?T d, int dim);
o+ Print a dependence matrix.
void la?dep?matrix?print( LA?DEP?M?T D);
6.2 Vector Operations
A vector is an array of integers ia?vect.
typedef int * la?vect;
In this section, unless specified otherwise, n is the length of the vector. Storage space
for all vector parameters must be allocated before calling the routines.
15
o+ Negate the vector veci, and the output is in vec2.
void 1a?vecNegate (1a?vect veci, 1a?vect vec2, int n);
o+ Multiply the vector veci by the constant c, and store the result in vec2.
void 1a?vecConst (1a?vect veci, 1a?vect vec2, int n, int c);
o+ Add two vectors veci and vec2 of size n, and the output is in vec3.
void 1a?vecAdd(1a?vect veci, 1a?vect vec2, 1a?vect vec3, int n);
o+ Add two vectors veci and vec2 of size n. vec3 is f1*vecl + f2*vec2.
void la_vecAddF(1a?vect veci, int fi, 1a?vect vec2, int f2,
1a?vect vec3, int n);
o+ Concatenate two vectors veci of size ni and vec2 od size n2. The output is in vec3.
void
1a?vecConcat(1a?vect veci, int 111,
1a?vect vec2, int n2,
1a?vect vec3);
o+ Make a copy of the vector veci of size n in vec2.
void 1a?vecCopy(1a?vect veci, 1a?vect vec2, int n);
o+ Check if every entry is zero. Returns 1 if yes, 0 otherwise.
int 1a?vecIsZero(1a?vect veci, int n);
o+ Clear a vector.
void 1a?vecC1ear(1a?vect veci, int n);
o+ Check if two vectors are equal. Returns 1 if yes, 0 otherwise.
int 1a?vecEq(1a vect veci, 1a?vect vec2, int n);
o+ Free a vector.
void la_vecFree(1a?vect vec);
16
6.3 Matrix Operations
A matrix ia?matrix is an array of vectors.
typedef ia?vect * ia?matrix;
In this section, unless speci?ed otherwise, m is the number of rows; and n is the number
of columns. Storage space for all matrix and vector parameters must be allocated before
calling the routines.
o+ Allocate a matrix of m by n.
ia?matrix la_matNew(int m, int n);
o+ Elementary operation: exchange two columns coil and coi2.
void ia?eExchange (ia?matrix mat, int m, int coil, int coi2 );
o+ Elementary operation: add an integral multiple (fact) of column coil to column coi2.
void ia?eAdd (ia?matrix mat, int m, int coil, int coi2, int fact);
o+ Elementary operation: negate the column coL
void ia?eNegate (ia?matrix mat, int m, int col);
o+ Elementary operation: multiply the column coi by the const fact.
void ia?eCoiConst(ia?matrix mat, int m, int col, int fact);
o+ Find the minimum non-zero in the vector from the position start to position n. Return
the array index of the minimum element.
int ia?eMinInVec(ia?vect vec, int n, int start);
o+ Return the first non-zero in the vector segment vec[i. . n]. Return n if all are zeroes.
int ia?eFirstNonZero(ia?vect vec, int n, int i);
o+ Apply the elementary column operations to decompose mat to the product of a lower
triangular matrix H and an unimodular matrix U. All matrices are of size n by n.
17
void 1a?matHermite(1a?matrix T, int n, 1a?matrix H, 1a?matrix U);
o+ Initialize the matrix mat ofsize n x n to identity.
void 1a?matId(1a_matrix mat, int n);
o+ Copy the matrix mall to the matrix mal2.
void 1a?matCopy(1a?matrix mati, 1a?matrix mat2, int m, int n);
o+ The matrix mall is negated and the output matrix is in mal2.
void 1a?matNegate(1a?matrix mati, 1a?matrix mat2, int m, int n);
o+ Compute the transpose ofthe matrix mall and the output matrix is in ma12.
void 1a?matT(1a?matrix mati, 1a?matrix mat2, int m, int n);
o+ Is an identity matrix? Returns 1 ifyes, 0 otherwise.
int 1a?matrix?is?id(LA?MATRIX?T M);
o+ Add two matrices mall and mal2, and the result is in mal3.
void 1a?matAdd(1a?matrix mati, 1a?matrix mat2,
1a?matrix mat3, int m, int n);
o+ Add two matrices with constant factors. mat3is fl*mall + f2*mal2.
void la_matAddF(1a?matrix mati, int fi, 1a?matrix mat2, int f2,
1a?matrix mat3, int m, int n);
o+ Matrix multiplication. mall is m x r; mat2is r x n; and mal3is m x n. mal3is the
product of mall and mal2.
void 1a?matNu1t(1a?matrix mati, 1a?matrix mat2, 1a?matrix mat3,
int m, int r, int n);
o+ Get a column from the matrix. The column colofthe matrix mat is returned in the
vector vec. m is the number ofrows in the matrix, and also the length ofthe column
vector.
void la_matGetCo1(1a?matrix mat, int m, int col, 1a?vect vec);
18
Th?matConcate concatenates two matrices by row. ml is the number of rows in mat1;
and m2 is the number of rows in mat2. The resulting matrix is in mat3, which shares
the storage space of rows with mati and mat2. Ia?matConcat?newM concatenates two
matrices, and allocates new storage space for the new matrix.
void 1a?matConcat(1a?matrix mati, int ml,
1a?matrix mat2, int m2,
1a?matrix mat3);
void 1a?matConcat?newM(1a?matrix mati, int ml, int n,
1a?matrix mat2, int m2,
1a?matrix mat3);
o+ Exchange two rows ii and i2.
void 1a?rowExchange (1a?matrix mat, int ii, int i2 );
o+ Add a multiple (fact) of row ii to row i2.
void 1a?rowAdd (1a?matrix mat, int m, int ii, int i2, int fact);
o+ Computes the inverse of the matrix. The result is in inv. Both matrices are n x n.
The function returns the determinant.
int 1a?matInv(1a?matrix mat, 1a?matrix inv, int n);
0 Free a matrix.
void 1a?matFree(1a?matrix mat, int m, int n);
o+ Multiply the matrix mat by the column vector vec, and the output vector is in vec-out.
void 1a?matVecNu1t(1a?matrix mat, int m, int II,
1a?vect vec, 1a?vect vec-out);
o+ Multiply the row vector vec by the matrix mat, and the output vector is in vec-out.
void 1a?vecMatNii1t(1a?vect vec, int m,
1a?matrix mat, int
1a?vect vec-out);
19
6.4 Integer Operations
o+ Return the greatest common divisor of two integers.
int la?gcd(int a, int b);
o+ Return the greatest common divisor of an array of integers.
int la?gcdV(la?vect v, int n
o+ Return the least common multiple of two integers.
int la?lcm(int a, int b);
6.5 Data Dependences
o+ Copy the dependence vector d of size dim to d-out.
void la?dep?vec?copy(LA?DEP?V?T d, LA?DEP_V_T d-out, int dim);
o+ Add the dependence vector d to the matrix as the ?rst column.
void la?dep?add to_list(LA?DEP?V?T d, LA?DEP?N?T D);
o+ Add the dependence vector d to the matrix as the last column. last is the pointer to
the last dependence vector in the linked list.
void la?dep?add to_list_last(LA?DEP?V?T d, LA?DEP?M?T D, LA?DEP?V?T last);
o+ Check if every element in the direction vector d is negative or zero. Return 1 if yes, 0
otherwise.
int la?allNegEq(LA?DIR?T *d, int n);
o+ Check if every element in the direction vector d is positive or zero. Return 1 if yes, 0
otherwise.
int la?allPosEq(LA?DIR?T *d, int n);
o+ Negate the direction vector.
void la?flagNegateCLA?DIR?T *d, int n);
20
o+ Compute the dot product of the integer vector vec and the dependence vector d. Both
vectors have size n.
LA?DEP?T 1a?vec?depV?mu1t(1a?vect veci, LA?DEP?V?T d, int dim);
o+ Multiply the integer vector vec by the dependence matrix D, and convert the resulting
dependence vector to direction vector in dir-out.
void 1a?depNu1t(1a?vect vec, LA?DEP?N?T D, LA?DIR?T *dir-out);
Delete the dependences with positive flags. The direction vector flag V has the same
size as the number of dependence vectors (columns) in the dependence matrix. flagV
is the vector of flags, each of which is a flag for the corresponding column in the de-
pendence matrix. When the flag is positive (LA?dL1), the corresponding dependence
vector is deleted from the matrix.
void 1a?de1PosDep(LA?DEP?M?T D, LA?DIR?T * flagv);
o+ Check if two dependence vectors are equal. Return 1 if yes, 0 otherwise.
int 1a?dep?vec?eq( LA?DEP?V?T dl, LA?DEP?V?T d2, int size);
o+ Check if two dependences are equal. Return 1 if yes, 0 otherwise.
int 1a?dep?eq( LA?DEP?T dl, LA?DEP?T d2);
7 How to Use Lambda?
This section describes how to use the toolkit.
7.1 Compiling and Linking
The file "Lambda.h" must be included, and the library "Lambda.a" must be linked. The
toolkit can be called from both C and C++. A typical Makefile looks like:
LIBS			= -lLambda
demo:			demo.o
$(CC) $(cFLAGS) -o demo demo.o $(LDFLAGS) $(LIBS)
21
7.2 An Example
This example shows how Lambda toolkit is used to transform a loop nest. The dependence
matrix, transformation matrix and loop nest (loop bounds) are all stored in the Lambda
format after being read from the data ?les In a compiler, they should be extracted from
the intermediate representation of the compiler. The following is the set of variables used
in the example.
int depth;
int blobs;
int i, j, k, r, n;
FILE *fp;
LA?MATRIX?T T;
LA_DEP_N_T D;
LA?L!)OPNEST?T nest, new?nest;
LA?LOOP?T loop;
LA?EXPR?T expr;
LA?DEP?V?T d;
int depSize;
7.2.1 Data Dependences
1* the nesting depth of loop nest */
/* number of blobs */
1* data file */
1* Transformation matrix */
/* Data dependence matrix */
/* Loop nests */
/* A loop */
/* An expression */
/* A dependence vector *1
/* number of dependence vectors */
The data dependence presentation in Lambda may be different from that is used in other
compilers. The conversion from other representation to the Lambda representation is quite
straightforward.
fscanf(fp, `?1.d', &depSize);
D = la?dep matrix_new(depth, depSize);
for (i=O; i<depSize; i++)
d = la?dep?vec?new(depth);
LA?DEP?V?NEXT(d) = LA?DEP?M?VECTORS(D);
LA?DEP_N_vEcToRS(D) =
for (j=O; j<depth; j++)
fscanf(fp, Il7??II, temp);
if (!strcmp(temp, "<`) )
LA?DEP?DIR( LA?DEP?V?VECTOR(d) [j]
else if (!strcmp(temp, 1=) )
LA?DEP?DIRC LA?DEP V_VECTOR(d)[j]
else if (!strcmp(temp, `>") )
LA?DEP?DIR( LA?DEP?V?VECTOR(d) [j]
else if (`.strcmp(temp, 1*11) )
= LA?dLT;
= LA?dE?;
= LA?dGT;
22
LA?DEP?DIR( LA_DEP_V?VECTOR(d)[j]
else if (!strcmp(temp, <=`) )
LA?DEP?DIR( LA?DEP?v?vEcToR(d) [j]
else if (Istrcmp(temp, ">=") )
LA_DEP_DIR( LA?DEP?V vEcTOR(d)[j]
else
= LA?dSTAR;
= LA?dLE?;
= LA?dGE?;
LA?DEP?DIR( LA?DEP?V?VECTOR(d)[j] ) = LA?dK;
sscanf(temp,
"Zd11
&(LA?DEP?DIST( LA?DEP?V?VECTOR(d) [j] ))
+
printf('1The data dependence matrix is :\n'1);
la_dep?matrix?print(D);
7.2.2 Constructing Transformations
The transformation T is represented by a two-dimensional array with the row size and
column size stored in the same structure. The print routine is always useful.
1* read the transformation matrix from the file */
T = la?matrix?new(depth, depth);
/* read in row size and col size */
fscanf(fp, "Y.d", &(LA?NATRIX?ROWSIZE(T)));
fscanf(fp, `Y.d", &(LA_MATRIX_coLSIZE(T)));
for(i=O; i<depth; i++)
for(j=O; j<depth; j++)
fscanf(fp, "7.d", &(LA?MATRIX(T) [i] [j]));
printf(`The transformation matrix is :Nn");
la?matrix?print(T);
A matrix may not contain a legal transformation. To be a transformation, the matrix
has to be non-singular and does not violate data dependences. If these conditions are not
satisfied, a legal transformation can be constructed using the routines for computing the
base matrix and for completing the partial transformation.
1* compute the legal dependence matrix *?
if( la?rank(T) == LA?NATRIX?COLSIZE(T) )
printf("?n\nThe transformation is non-singular :-) \n");
23
+
else
printf("?n\nThe transformation is singular or partial !! \n');
printf("Completing... \n'?);
1* computes the base matrix from parT *1
T = la?base(T);
printf("\nThe base matrix is:\n");
la?matrix?print(T);
/* computes the legal base matrix *1
/* dependences satisfied by T are deleted from D */
T = la_base legal(T, D);
printf("\nThe legal base matrix is:\n');
la?matrix?print(T);
printf("\nThe unsatisfied dependences:\n');
la?dep?matrix?print(D);
/* padding of the legal matrix w.r.t the dependence matrix. */
T = la?padding?legal(T, D);
printf('\nThe legal partial matrix after padding w.r.t depM is:\n");
la?matrix?print(T);
1* padding of the matrix to a legal non-singular matrix */
T = la?padding(T);
printf("\nThe non-singular matrix is:\n');
la?matrix?print(T);
+
if( la?is?legal(T, D) )
printf("\n\nThe transformation is legal :-) \n\n');
else
+
printf('\n\nThe transformation is illegal :-( \n");
exit(0);
7.2.3 Code Restructuring
The loop bounds from the perfectly nested loops are extracted from the intermediate rep-
resentation of the compiler, and stored in the Lambda format. The following code segment
reads in the bounds.
/* Create a structure for
/* Number of blobs should
fscanf(fp, `7.d", &blobs);
nest = la?nest?new(depth,
the loop nest. */
be computed from the program. */
blobs);
24
/* e.g DO k = max(I, i), min(j+b, N) */
for(i=O; i<depth; i++)
/* create a structure for the loop */
loop = la?loop?new();
/* k==O lower bound;
Note: there is an implicit CElL on each expression,
and an implicit MAX on the set of expressions. */
1* k==1 upper bound;
Note: there is an implicit FLOOR on each expression,
and an implicit MIN on the set of expressions. */
for(k=O; k<2; k++)
/* total number of expressions in the lower/upper bounds */
fscanf(fp, "Xd", &n);
for(r=O; r<n; r++)
expr = la?expr?new(depth, blobs);
/* read in the coef */
for(j=O; j<depth; j++)
fscanf(fp, "Y.d", &(LA?EXPR?COEF(expr) [j]));
/* read in the constant */
fscanf(fp, "Y.d', &(LA?EXPR?CO(expr)));
/* read in the blob coef */
for(j=O; j<blobs; j++)
fscanf(fp, "?/.d", &(LA?EXPR BLOB?COEF(expr) [j]));
/* read in the denominator */
fscanf(fp, `Y0d', &(LA?EXPR?DENOM(expr)));
if(k==O)
+
else
/* add the expr to the low bound list */
LA?EXPR?NEXT(expr) = LA?LOOP?LOW(loop);
LA?LOOP?LOW(loop) = expr;
1* add the expr to the upper bound list */
LA?EXPR?NEXT(expr) = LA?LOOP?UP(loop);
LA?LOOP?UP(loop) = expr;
+
+/* end of r-loop */
+1* end of k-loop */
/* read in the stepsize */
fscanf(fp, "7.d'?, &(LA?LOOP?STEP(loop)));
LA_NEST_LOOPS(nest)[i] = loop;
25
+
LA_NEST_BLOBS(nest) = blobs;
LA?NEST?DEPTH(nest) = depth;
printf("The original loop nest: \n");
?* The loop index variables are i, j, k...
la?nest?print( nest, `i');
/* The transformation is T. *1
new_nest = la?nest(nest, T);
printf('The new loop nest: \n'1);
/* The loop index variables are u, v, w...
la?nest?print( new?nest, `u');
The above code segment explained the data structure for storing the loop bounds, where
each lower bound can be the max of a set of linear expressions, and each upper bound can
be the min of a set of linear expressions. The function CElL is assumed for each expression
in the lower bound, and the function FLOOR is assumed for each expression in the upper
bound.
However, the new loop bounds can be more complex. We display the loop type be-
low again for convenience, and use a simple example to illustrate the loop bounds being
represented.
typedef struct la?loop?
LA_EXPR_T low;
LA?EXPR?T up;
int step;
LA?EXPR?T offset;
+ *LA?LOOP?T;
For example, the field low is a linked list of two expressions (i + j)/2 and (2i --H j)/3; the
field up is a linked list of three expressions i, j + 3 and (5i + 7j)/2; the step size is 2; and
the offset 1 is a linear expression 2i + 3j. Then the loop (call it loop k) has the following
bounds:
for k =
2*i+3*j + 2*max( ceil((i+j)/2), ceil((2*i--Hj)/3) ),
2*i+3*j + 2*min( i, j+3, floor((5*i+7*j)/2) ),
step 2
Note that the offset expression is in both the lower bound and the upper bound. Both
the lower bound and the upper bound are multiplied by the step size as well.
1The offset is always one single expression.
26
8 Summary
This document describes the Lambda loop transformation toolkit, an implementation of
the non-singular matrix transformation theory described in [8] that can represent any linear
one-toone transformation. The toolkit has a simple interface, and is independent of the
intermediate representation used in the compiler.
9 Acknowledgments
We would like to thank Donna Bergmark and David Presberg of the Cornell Theory Center,
and Richard Schooler and Bob Gottlieb of llewlett-Packard for their helpful feedback during
the development of the toolkit and the writing of this document.
References
[1]
[2]
E. Ayguade and J. Torres. Partitioning the statement per iteration space using non-
singular matrices. In Proceedings of The 1993 ACM International Conference on Su-
percomputing, July 1993.
U. Banerjee. Unimodular transformations of double loops. In Proceedings of the Work-
shop on Advances in Languages and Compilers for Parallel Processing, pages 192--H219,
August 1990.
[3] D. Bergmark and D. Presberg. Initial experiments in the integration of Parascope and
Lambda. Technical Report CTC93TR136, Cornell Theory Center, 1993.
[4]
W. Kelly and W. Pugh. Generating schedules and code within a unified reordering
transformation framework. Technical Report CS-TR-2995, Dept. of Computer Science,
University of Maryland, College Park, November 1992.
[5] W. Li. Compiler optimizations for cache locality and coherence. Technical Report 504,
Department of Computer Science, University of Rochester, April 1994.
[6] W. Li and K. Pingali. A singular loop transformation framework based on non-singular
matrices. In Proc. 5th Annual Workshop on Languages and Compilers for Parallelism,
August 1992.
[7] W. Li and K. Pingali. Access Normalization: Loop restructuring for NUMA compilers.
ACM Transactions on Computer Systems, 1993.
[8] W. Li and K. Pingali. A singular loop transformation framework based on non-
singular matrices. In U. Banerjee, D. Gelernter, A. Nicolau, and D. Padna, editors,
Languages and Compilers for Parallelism, Lecture Notes in Computer Science 757.
Springer-Verlag, 1993.
27
[9]
L. Lu. A uni?ed framework for systematic loop transformations. In 3rd ACM SIGPLAN
Symposium on Principies and Practice of Parallel Programming, pages 28--H38, April
1991.
[10] J. Ramanujam. Non-unimodular transformations of nested loops. In Proc. of Super-
computing, 1992.
[11] R. Thekkath. A general framework for iteration-reordering loop trans-
SIGPLA N `92 Programming Language and Implementation Conference,
V. Sarkar and
formations. In
June 1992.
[12] M. Wolf and M. Lam. A loop transformation theory and an algorithm to maximize
parallelism. IEEE Transactions on Parallel and Distributed Systems, October 1991.
[13] M. Wolfe. Optimizing Supercompilers for Supercomputers. Pitman Publishing, London,
1989.
28
Index
blob, 11
data dependence, 6
distance, 6
la?IlNegEq, 20
la?l1PosEq, 20
laffibase, 9
la?baseAegaI, to
laAep?ddftoJist, 20
1aAep?addftoJistJast, 20
LA1)EP?IR, 6
LAJ)EPJ)IST, 6
la?dep?eq, 21
ladepjegal, 8
LAd)EP?A)!M, 7
LAJ)EP??IZE, 7
LAJ)EPAlff, 7
LAJ)EP?LVECTORS, 7
laAepinatrix?ew, 7
IaAepjnatrix?print, 15
laAep?o?edundant, 8
Iadep?print, 15
LAJ)EP?T, 6
laAepftrans, 10
LAJ)EPLV?EXT, 7
LAJ)EP V T 7
LA?EPLV?VECTOR, 7
ladep?vec?copy, 20
ladep?vec?eq, 21
1aAep?vec?ree, 7
laMep?vec?new, 7
laMep?vec?print, 15
la?depMult, 21
laAepPosDep, 21
LAd)IRff, 6
la?eAdd, 17
la?eColConst, 17
la?eExchange, 17
la?eFirstNonZero, 17
Ia?eMinInVec, 17
1a?eNegate, 17
LATXPR?BLOB?COEF, 12
LA?EXPWC0, 12
LA?EXPR?COEF, 12
1a?exprIree, 12
la?expr?new, 12
LA?EXPR?EXT, 12
la?expr?print, 15
LA?EXPR?T, 12
IaAIagNegate, 20
1a?cd, 20
la?cdV, 20
laAnverse, 9
lajs?fu11rank, 9
laAsAegal, 10
lajsJegal?par, 10
lajs?nonsingular, 9
lajcm, 20
laAoop?ree, 13
LAlOOPIOW, 13
laJoop?new, 13
LAiOOP?OFFSET, 13
IaJoop?print, 15
LAiOOPTEP, 13
LAiOOP?T, 13
LAiOOP?UP, 13
LAiOOPNEST?T, 13
lainatAdd, 18
laj?atAddF, 18
la?natConcat, 19
la?atConcat?ewM, 19
1a?atCopy, 18
Ia?atFree, 19
Ia?matGetCol, 18
Iajnatllermite, 17
Iajnatld, 18
lajnatlnv, 19
la?natMult, 18
lajnatNegate, 18
la?atNew, 17
LAAlATRIX, 8
la?atnx, 11, 17
Ia?atrix?llocate, 8
LA?MATwxfoLSIzE, 8
29
LAA1ATR!X?DENOM, 8
Ia?matrixIree, 9
la?matrixAsJd, 18
la?atrix?ew, 8
la?atrix?pnnt, 15
LAA1ATRIX?ROWS!ZE, 8
LAA1ATRIX?T, 8
la?inatrix?AepM, 10
lajnatT, 18
la?atVecMult, 19
la?nest, 14
LA?NESTJ3LOBS, 13
LA?ESTJ)EPTll, 13
la?est?ree, 14
LA?EST1OOPS, 13
Ia?estjiew, 13
lajiest?print, 15
la?padding, 9
la?pa?dingJegal, 10
la?ank, 9
la?owAdd, 19
la?owExchange, 19
la?vecAepV?mult, 21
la?vec?Aep, 10
la?vecAdd, 16
la?vecAddF, 16
l??vecClear, 16
la?vecConcat, 16
la?vecConst, 16
la?vecCopy, 16
la?vecEq, 16
la?vecFree, 16
la?vecIsZero, 16
la?vecMatMult, 19
la?vecNegate, 16
la?vect, 11,15
la?vector, 14
LALVECTOR?COEF, 14
LA?VECTORJ)ENOM, 14
Ia?vector?ree, 14
la?edor?ew, 14
LAWECTORIZE, 14
LA?VECTORT 14
loop transformation, 4
compound, 4
nonsingular, 4
others, 4
unimodular, 4
primitive, 4
30
