BIB-VERSION:: CS-TR-v2.0
ID:: CORNELLCS//TR92-1282
ENTRY:: 1993-10-14
ORGANIZATION:: Cornell University, Computer Science Department
LANGUAGE:: English
TITLE:: Area-Universal Interconnection Networks for VLSI Parallel Computers
AUTHOR:: Bay, Paul Edwin
DATE:: May 1992
PAGES:: 82
COPYRIGHT:: Paul Edwin Bay 1992 - All Rights Reserved
ABSTRACT::
A central issue in the design of a general-purpose parallel computer is the 
choice of an interconnection network and an associated algorithm for routing 
messages through it. The main results of this thesis are two new 
interconnection networks, the pruned butterfly and the sorting fat-tree 
and deterministic routing algorithms for them. Both networks are 
area-universal, i.e., they can simulate any other routing network fitting in 
similar VLSI chip area with only polylogarithmic slowdown. Previous 
area-universal networks were either for the off-line problem, where the 
message set to be routed is known in advance and substantial precomputation is 
permitted, or involved randomization, yielding results that hold only with 
high probability. The two networks introduced here are the first that are 
simultaneously deterministic and on-line and they use two substantially 
different routing techniques. The performance of the routing algorithms 
depends on the difficulty of the problem instance, which is measured by a 
quantity $\lambda$ known as the load factor. The pruned butterfly algorithm 
runs in time $O$ ($\lambda log^2$ $N$), where $N$ is the number of possible 
sources and destinations for messages and $\lambda$ is assumed to be 
polynomial in $N$. The sorting fat-tree algorithm runs in $O$ 
($\lambda log N + log^2 N$) time for a restricted class of message sets 
including partial permutations.

Several related results are also presented in this thesis. A nontrivial lower 
bound on wire area is proven for a class of tree-based networks that do not 
modify the content of messages are shown to be subject to an area-time 
tradeoff.  This lower bound implies the sorting fat-tree's area-time 
performance is optimal for a wide range of possible values for $\lambda$. 
Other results of this work include a new type of sorting circuit and an 
area-universal VLSI circuit.
END:: CORNELLCS//TR92-1282
BODY::
Area-Universal Interconnection
Networks for VLSl
Parallel Computers
Paul Edwin Bay
Ph.D Thesis
92-1282
May1992
Department of Computer Science
Cornell University
Ithaca, NY 14853-7501
AR?EA-UNIVER?SAL INTER?CONNECTIOX NETWOR?NS
F'OR? VLSI PAR?ALLEL COMPUTER?S
A Dissertation
Presented to the Faculty of the Graduate School
of Cornell University
in Partial Fulfillment of the Requirements for the Degree of
Doctor of Philosophy
by
Paul Edwin Bay
May 1992
e Paul Edwin Bay 1992
ALL ?IGllTS R?ESEItVED
AREA-UNWERSAL INTERCONNECTION NETWORKS FOR VLSI
PARALLEL COMPUTERS
Paul Edwin Bay, Ph.D.
Cornell University 1992
A central issue in the design of a general-purpose parallel computer is the choice
of an interconnection network and an associated algorithm for routing messages
through it. The main results of this thesis are two new interconnection networks,
the pruned butterfly and the sorting fat-tree, and deterministic routing algorithms
for them. Both networks are area-universal, i.e., they can simulate any other rout-
ing network fitting in similar VLSI chip area with only polylogarithmic slowdown.
Previous area-universal networks were either for the off-line problem, where the
message set to be routed is known in advance and substantial precomputation is
permitted, or involved randomization, yielding results that hold only with high
probability. The two networks introduced here are the first that are simultane-
ously deterministic and on-line, and they use two substantially different routing
techniques. The performance of the routing algorithms depends on the difficulty
of the problem instance, which is measured by a quantity A known as the load
factor. The pruned butterfly algorithm runs in time O(Alog2 N), where N is the
number of possible sources and destinations for messages and A is assumed to be
polynomial in N. The sorting fat-tree algorithm runs in O(A log N + log2 N) time
for a restricted class of message sets including partial permutations.
Several related results are also presented in this thesis. A nontrivial lower
bound on wire area is proven for a class of tree-based networks that includes some
known area-universal networks. Routing networks that do not modify the content
of messages are shown to be subject to an area-time tradeoff. This lower bound
implies the sorting fat-tree's area-time performance is optimal for a wide range
of possible values for A. Other results of this work include a new type of sorting
circuit and an area-universal VLSI circuit.
Biographical Sketch
Paul Bay was born in 1964 near Des Moines, Iowa, and grew up on a small farm
near Rock Island, Illinois He lived for a year in Heidelberg, Germany, and, while an
undergraduate, spent summers working in Hannover and Manuheim, Germany, in
Boca Raton, Florida, and in Beaverton, Oregon. He completed bachelor's degrees
in Computer Science and Computer Engineering with distinction from Iowa State
University of Science and Technology in 1985. He married Gretta Anderson in
1986. He received the Master of Science and Doctor of Philosophy degrees in
computer science from Cornell University in 1989 and 1992 respectively.
111
To
my parents,
who taught me to love learning
`v
Acknowledgements
I would like to thank my advisor and friend, Gianfranco Bilardi. llis creative ideas
have influenced all of this thesis, but in particular, Chapters 2, 3, and 5 are as
much his work as mine, and are taken almost directly from a joint paper currently
being refereed for journal publication. It has been a great pleasure to collaborate
with Gianfranco. In addition to all he has taught me about the theory of parallel
computing, I have benefited tremendously from the example of his methodical,
carefully focussed research style. I thank him for sharing his intuitions about
which problems are hard and for his encouragement when my progress was slow.
Thanks to Keshav Pingali and Adam Bojanczyk for serving on my commit-
tee. Thanks to my officemates Michael Wilk and Christine Piatko for their proof-
reading help. Financial support was provided by a fellowship from the General
Electric foundation, teaching assistantships from the Cornell Computer Science
department, and by a research assistantship funded by Joint Services Electronics
Program contract F4962O-87-C-OO44.
Thanks to Dave Knudsen, Ellen Knudsen, Sam Otter, and Caverlee Carey
for support, camaraderie, and much needed humor during the early years. Thanks
also to my friends and fellow students Bill Aitken, Richard Chang, Robert Freimer,
Chet Murthy, Christine Piatko, Jim Russell, Jennifer Turney, and Michael Wilk
v
for sharing all the ups and downs of graduate student life.
Under the excruciating pressure of graduate school it is very easy to become
caught up in the collective academic delusion that one's human worth is contingent
on academic success. I would especially like to thank my wife Cretta Andersoii for
challenging that delusion with her laughter, patience, and unconditional love.
v'
Table of Contents
2
3
Introduction
1.1 Interconnection Networks for General-purpose Parallel Computers
1.2 Definitions
1.3 Previous Work
1.4 Preview
Deterministic On-line Routing on the Pruned Butterfly
2.1 The Pruned Butterfly .
2.2			Auxiliary Circuitry . . .			. .			. . .
2.3			Routing Algorithm
Deterministic On-line Routing on the Sorting Fat-tree
3.1			Flexible Sorters
3.2			The Sorting Fat-Tree .			. .
3.3			Routing Algorithm . . .			. . .			.
4 Load Factor and Area-universality in Tree-structured Networks
4.1			?-Channel-sufflcient Networks . . .			. . . .			. . .			. .
4.2 The Reference Load Factor
4.3 A Lower Bound on the Area of Certain Tree-structured Networks
5 An AT2 Lower Bound for Content-Preserving Routers
6 An Area-universal VLSI Circuit
6.1			Introduction . . .			. . . . . .			.
6.2			An Area-universal VLSI Circuit
6.3			Discussion . . .			. . .			. . . .			. . .			. . . .			.
7 Conclusion
Bibliography
4
7
10
13
13
17
19
26
27
33
35
40
41
49
52
61
68
68
69
75
77
80
vii
List of Figures
2.1
2.2
2.3
The pruned butterfly when N = 16. . . . . . . .
The pruned butterfly is a subgraph of the butterfly. .
The 11-tree layout of a 16-leaf subtree of the pruned butterfly.
3.1 A flexible sorter with s = 8 and 1 = 64
3.2 The layout of the sorting fat-tree
4.1			Example of a tree-structured network
4.2			Creating paths from a flow. . . .			.
4.3			Partition of layout of F
4.4			Modified partition of layout of F			.
14
15
17
29
34
43
45
o+ .			. . . .			55
56
v"'
Chapter 1
Introduction
1.1 Interconnection Networks for
General-purpose Parallel Computers
The most interesting and important feature of a parallel computer is the network
that connects its processors. For a special-purpose parallel computer, the best
interconnection pattern is one that reflects the communication structure of the
particular algorithm to be executed on it. For general-purpose parallel computers,
researchers disagree not only about which interconnection network is best, but,
more fundamentally, on what terms it should be argued that a given network is
general-purpose.
One way to show a network is general-purpose is to efficiently implement a
variety of important parallel algorithms on it. Manufacturers and users of parallel
computers typically try to justify the generality of their machines this way. A more
theoretical version of this style of argument is Preparata and Vuillemin's demon-
stration [PV8l] that the cube-connected cycles can execute in O(log N) time any
algorithm in the ASCEND/DESCEND class. Algorithms in this class include per-
2
mutation routing, the fast Fourier transform, and bitonic merge. The weakness of
these arguments is that unknown problems might require communication patterns
that have no efficient implementation on the allegedly general-purpose network.
Another method for arguing network I? is good for general-purpose comput-
ing is to specify a desired general programming model and show that R is able
to implement it efficiently. For instance, Ranade, Bhatt, and Johnsson [RBJ88]
describe an abstract parallel machine called "Fluent" which uses a shared memory
programming model. Fluent includes a generalized form of prefix computation as
one of its basic instructions. The authors prove that the butterfly interconnection
network supports the shared memory and the prefix computation efficiently, i.e.,
in O(log N) time with high probability. We call this a "primitive-based" argument
about the general-purpose nature of the butterfly, because it depends primarily on
the communication primitives that are the basis for the programming modd.
Peleg and Upfal [PU87,PU89] give a "primitive-based" argument for the use of
expander graphs as interconnection networks. They define an (m, K1, K2) commu-
nication request as a set of m messages to be exchanged among the N processors
with at most K1 originating at any given source and at most K2 terminating at
any destination. They exhibit a network based on an N-node expander graph
that can execute any (rn, K1, I?) communication request in O(log N + K1 + K2)
time. Only expander-based networks can achieve this performance, which matches
a known lower bound.
The PRAM [FW78] is the most popular abstract model of parallel computa-
tion, and a large body of literature [MV84,Ran87,UW87,AllMP87,NU88,ller90]
is devoted to efficient simulations of PRAMs on bounded-degree networks. This
3
work can also be regarded as "primitive-based" arguments for the general-purpose
nature of those bounded-degree networks. In the case of the deterministic simula-
tions, the networks are again based on expander graphs.
Three criticisms can be leveled at the above-mentioned research efforts. First,
the arguments about the general-purpose nature of the interconnection networks
are persuasive only to the extent that the communication primitives are flexible
and expressive. There might be other useful forms of communication that cannot
be easily cast in terms of the primitives provided. Second, the cost of a network
is taken to be the number of nodes, an unrealistic approximation that neglects
significant wiring costs. Third, most of the above research efforts are concerned
with worst-case execution time for the communication primitive. A network and
message routing algorithm that are optimal in the worst case may not execute all
instances of the problem efficiently. In fact, the network may be efficient only on the
worst-case instances. One could reasonably hope to find a network and algorithm
that execute easy instances of the communication primitive more quickly than the
hard ones.
This thesis addresses all three criticisms. First, we prove certain networks can
emulate all special-purpose networks of comparable cost with only a small penalty
in time. We call this an "emulative" argument that the networks are general-
purpose. Rather than relying on the breadth of a class of abstract communication
primitives, the argument compares the network against all other feasibly buildable
networks. An example of an "emulative" argument in the bounded-degree network
model can be found in [GP83]. There, Galil and Paul prove that the N-node cube-
connected cycles is universal, i.e., it can simulate any other N-node network with
4
only an O(log N) time penalty. "Emulative" arguments of the general-purpose
nature of computing devices have been made for the boolean circuit model of
parallel computation as well. In [C1185], Cook and Hoover design a boolean circuit
of depth 0(d) and size O(c3d/ log c) that can simulate any boolean circuit of size c
and depth d. It is thus depth-universal, because of the only constant factor increase
in depth. Related work can be found in [Val76] and [Gol82].
The second criticism is addressed in this thesis by carrying out the work in
Thompson's VLSI model of computation [Tho80]. In this model, the cost of a
network is the chip area it occupies in a two-dimensional1 layout. Wires have an
explicit cost proportional to their length, and high bandwidth networks like those
based on expander graphs have a cost equal or nearly equal to the square of the
number of nodes. Thus the problem of designing an interconnection network for
general-purpose computing has a different solution than in the bounded-degree
graph model.
The third criticism is addressed by expressing routing performance in terms
of load factor, a measure of the difficulty of an instance of the routing problem.
The next section will define this term more formally and clarify the model for
interconnection networks used in this thesis.
1.2 Definitions
We will usually think of a parallel computer as a set of N processors executing
arithmetic and logic operations independently and communicating by sending mes-
1 Thompson's model has been extended to three dimensions in [Ros8l]. Most of the results of
this thesis could probably be extended from two to three dimensions with simple adaptations. To
make comparisons easier, when describing results formulated in the literature in three dimensions,
we convert them to equivalent two-dimensional results.
5
sages through an interconnection network. Our main focus is the network, which
we model as a graph whose vertices represent constant-size logic gates, and whose
edges represent wires connecting the corresponding gates. The cost of the network
is the area required to lay out the graph on a constant number of superimposed pla-
nar manhattan grids, as described in [Tho80]. A distinguished subset of N vertices
called terminals are processor interface points and can be sources or destinations
of messages. The processors themselves are assumed to be outside the model, per-
haps on a different layer or connected in the third dimension, but in any case not
counted in the area cost of the network. The processors' instruction streams could
be the same or different, and communication could be overlapped with computa-
tion, but the interconnection network is synchronous; that is, some global timing
mechanism permits topologically distant gates to switch simultaneously.
A message is a triple (src, dst, body?, where src and dst are log N-bit terminal
id's and body is a b-bit string. (Typically we will assume b = 0- (log N).) The
messages of the input set Al are initially made available at their source terminals,
and the task of the interconnection network is to deliver the message bodies to
the destinations. (The interconnection network will usually be referred to as a
routing network, or simply as a router.) The degree of a message set, d(M), is
the maximum number of messages that have their source or destination at a given
terminal. If d(M) = 1, Al is called a partial permutation. Sometimes we will find
it useful to think of a set of messages M as a graph with one vertex for each of the
N terminals and an edge between vertices t and v if M contains a message with
source u and destination v.
There are two version of the routing problem: off-line and on-line. In off-line
6
routing, the sources and destinations of the messages are known before the routing
starts, and the network can be configured accordingly. The time to move messages
from source to destination is the primary concern, rather than the time needed
for configuring the network. Off-line routing can be used when the pattern of
communication is known at compile time. An interesting special case is when a
known fixed-connection network is being simulated. In on-line routing, sources and
destinations are not known until run time. Any general-purpose computer must
have an efficient on-line routing algorithm so it can run programs that generate
unpredictable, data-dependent message sets.
Let 1? be a N-terminal routing network where each edge e has a positive integer
capacity cap(e). (With the exception of some discussion in Chapter 4, edges will
have capacity 1.) The time to route a message set M on an N-terminal routing
network will be stated in terms of a parameter A(1?, M), known as the load factor
of the message set. If U and V are subsets of the terminals, a cut 5 is a set of edges
such that every path from a terminal in U to a terminal in V includes some edge
in 5. The capacity of a cut 5 is the sum of the capacities of its edges. The load
placed on cut 5 by M is defined as the number of messages that must cross it, and
denoted ?(R, M, 5). The load factor of a cut is ?(R, M, 5)/ ?eES cap(e), denoted
A(P, M, 5). The load factor on the entire network, A(1?, M), is the maximum
load factor of any cut. The quantity 6A(R, M) is a straightforward bandwidth-
based lower bound on the number of bit steps necessary to route Al. When the
context makes it clear which network the load factor is being computed for, the
first argument to A will be dropped.
7
1.3 Previous Work
An N-terminal routing network of area A is called area-un?versal if it can simulate
any other N-terminal network of area 0(A) with only a polylogarithmic slowdown
in time. Work on area-universal networks grew out of the study of the area-efficient
layout of graphs. In [BL84], Bhatt and Leighton showed how to use a graph known
as the "tree of meshes" to lay out an arbitrary graph efficiently. Their method was
to decompose the arbitrary graph hierarchically and then to use the decomposition
(after certain balancing operations) to embed the graph in the tree of meshes.
In a groundbreaking paper [Lei85b], Leiserson presented a routing network
called a fat-tree, with an overall structure and layout similar to the tree of meshes.
A fat-tree has the form of a complete binary tree with the terminals at the leaves.
At the internal nodes are subuetworks that perform switching flinctions. Two
adjacent nodes in the tree are joined by a group of wires called a channel. The
number of wires in a channel c is referred to as its capacity, written cap(c). The
channel capacities near the leaves are small constants, but grow by a factor of two
every two levels to reach o-(#N) at the root. Under certain assumptions about the
area of the switches at the internal nodes, any network of this form has a layout
of area O(N log2 N).
The switching device at the internal nodes of Leiserson's fat-tree has constant
depth and is based on a-partial concentrator graphs [Pip77]. Its detailed structure
is too complex to repeat here, but has the following essential characteristics. The
switch has three ports: one going to the parent channel, one to the left child, and
one to the right. A set of paths through the switch from one port to the other
two can be accommodated on disjoint wires provided the paths occupy no more
8
than a fraction a of the capacity of each participating channel. The constant a is
the same for the whole tree and is between 0 and 1. As the term "fat-tree" was
later broadened to include other networks, we will refer to the above network as
the concentrator fat-tree, or CFT for short.
For the N-terminal concentrator fat-tree, Leiserson gave an algorithm to route
a set M of messages off-line in time O(A(M) log2 N). The algorithm is based
on the following lemma, itself a direct consequence of the properties of a-partial
concentrators.
Lemma 1 (Leiserson) A message set Al satisfying for all c, A(CFT, INI, c) ? a,
can be routed by CFT on disjoint paths.
The first step of the algorithm is the off-line partition of the input set M into
subsets M1..., Mo(AlogN) such that each Mi satisfies max? A(CFT, Mi, c) < a,
where c ranges over all channels. Then each batch Mi is routed in turn on the
disjoint paths guaranteed by Lemma 1. The messages follow the unique simple
path from source to destination in the underlying binary tree, and are sent bit-
serially. The delivery time for a batch is proportional to the sum of the tree height
and the message length, or O(log N); hence the delivery time for the entire set M
is O(A log2 N).
Another contribution of FLei85b] was to relate a network's routing performance
to a guarantee about its efficiency in simulating an arbitrary network of similar
cost. This relation was made possible by adapting some of the techniques of [BL84]
to show that an arbitrary routing network can be hierarchically decomposed and
matched with the tree structure of the concentrator fat-tree. Once this matching
has been accomplished, the fact that the fat-tree's routing algorithm runs in time
9
proportional to the routing difficulty A(M) leads to a good bound on the simulation
slowdown. The concentrator fat-tree can simulate an arbitrary N-terminal network
of area O(N log2 N) with a slowdown in time of only O(log2 N). We will examine
these ideas in more detail in Chapter 4.
The first on-line routing algorithm for an area-universal network was proposed
in [GL85,GL89]. There, an N-terminal network of area O(N log2 N) uses ran-
domization to route in time O(A log2 N log log N) with high probability. A faster
randomized on-line routing algorithm, which permits en route combining of mes-
sages with the same destination, is offered in [LMR?88i, for a network with N
terminals, O(N log2 N) area, and O(A(AI) log N t log2 N) routing time with high
probability2. In [Gre90], area-universality is investigated under alternative assump-
tions about wire delay, and area-universal networks with processors of various sizes
are considered.
A complementary research effort based on the ideas of load factor and fat-trees
is [LM88]. Motivated by concern about the lack of realism of the popular PRAM
model of parallel computation, Leiserson and Maggs introduce a class of models
that force the algorithm designer to attend to communication costs. They give a
model for each graph that could serve as the interconnection pattern of a parallel
computer. When analyzing an algorithm in the model based on graph G, for each
communication step involving a message set M the programmer must add A(G, M)
to the running time. Thus the model underestimates the algorithms' running times
by whatever factor the routing performance differs from A(G, Al).
Leiserson and Maggs advocate a style of programming for this model that in-
2We have translated the results of [LMR88] from the word model to the bit model.
10
volves embedding a data structure into the network, calculating the load factor of
the data structure, and the being careful to use only communication steps that
follow the pointers of the data structure. They call this style of algorithm conser-
vative. If the load factor of the data structure is A, then the running time of a
T-step conservative algorithm in their model is bounded by AT. They show how
to solve a number of graph problems with conservative algorithms.
While still too closely tied to particular graphs, their models remove some of
the less relevant details of the machine from the algorithm designer's consideration,
most notably any routing algorithm inefficiency. Their models are thus likely to
be valuable for networks with good routing algorithms, small diameter, and with a
structure susceptible to a concise expression of load factor. Such networks include
the concentrator fat-tree and the networks presented in this thesis.
1.4 Preview
This thesis gives the first deterministic solutions for area-universal on-line rout-
ing. In Chapters 2 and 3 respectively we propose two new routing networks, the
pruned butterfly and the sorting fat-tree. The N-terminal pruned butterfly can
route an arbitrary message set of polynomial size in time O(A log2 N) time and area
O(N log2 N).3 The sorting fat-tree routes only message sets of constant degree,
but on this important class the N-terminal sorting fat-tree has better performance:
it takes area O(Nlog2 N) and routes on-line in time O(Alog Ntlog2 N). For both
networks, the routing time is within a factor of O(log N) of the ?(A log N) band-
width lower bound. For the special case of permutations with A = Q(log N), the
? A result similar to this one has been established independently, using a different construction,
by Leiserson and Park (personal communication).
11
routing algorithm for the sorting fat-tree actually achieves the bandwidth lower
bound. Note that this special case 15 common: for a random permutation, the
expected value of A is ?(?/?). Both the pruned butterfly and the sorting fat-
tree can simulate any N-terminal router of area O(N log2 N) with a slowdown of
O(log2 N), or any router of area 0(N) with slowdown O(log N).
In the construction of the sorting fat-tree, we deploy a new type of sorting circuit
of independent interest. It has area A and for any n such that MA < n < A/log A,
it can sort n words of (log n + 0(log n)) bits with optimal AT2 --H O(n2 log2 n).
Previously known VLSI circuits [BP85,Lei85a] achieve AT2 O(n2 log2 n), but
for a fixed value of A, different circuits are needed for different values of n. The
novel feature of our sorter is that the same circuit can process all input sizes in the
given range in optimal time.
Chapter 4 clarifies the relationship between routing performance and area-
universality. The pruned butterfly and the sorting fat-tree are each shown to be
able to simulate similar-cost networks with only a O(log2 N) penalty in time. The
chapter also introduces a class of tree-based networks that includes the N-terminal
pruned butterfly and the N-terminal concentrator fat-tree, proving a nontrivial
?(N log2 N) area lower bound for every member of the class. Chapter 5 continues
the exploration of load factor as a measure of routing difficulty by establishing an
AT2 lower bound for networks whose routing algorithms are not allowed to modify
the content of messages (e.g., by recoding a pair of messages into another pair).
In Chapter 6 we relax the model of computation described in Section 1.2 to
include all synchronous VLSI circuits, not just those that route messages among
processors. Surprisingly, it is possible to get a better bound on the area and time
12
loss required to simulate an arbitrary circuit in this more general model: we show
how to construct a (deterministic) VLSI circuit of area 0(A) that can simulate
any circuit of area A with an O(log A) slowdown.
Chapter 2
Deterministic On-line Routing on
the Pruned Butterfly
In this chapter we present the pruned butterfly switching network (Section 2.1)
which, augmented with some auxiliary circuitry (Section 2.2), supports an efficient
on-line routing algorithm for arbitrary message sets (Section 2.3).
2.1 The Pruned Butterfly
We give the name pruned butterfly to the graph G(V? E) defined as follows, for N
a power of 4.
V--H--Hf(?,j,k?:O?i<logN, O<j<2?, ????jN2--Ht?2i1,
log N 2??i
U U Eij,
i=1 j=O
where, for even
Eij = f((i,j,k?, (i --H 1, [j/2J,k?), ((i, 5, k?, (i --H 1, [5/2j,k t ?2[?i/2i?)
0< k < ?2--H[q2J1,
13
14
and for odd i,
Eij = f((i,j,k?,(i --H 1, [j/2J,k')) : 0< k < ?2--Ht??iJ.
(4,10,0?
Figure 2.1: The pruned butterfly when N = 16. The triples identify the vertices
on the unique path from leaf 10 to root 2.
As illustrated in Figure 2.1, the pruned butterfly has the structure of a fat-tree.
We use the term node to refer to the set of pruned butterfly vertices that together
form one vertex of the underlying complete binary tree. Node j in the left-to-right
ordering of all fat-tree nodes at depth i is the set f(i,j,k) 0 < k < ?/?2--Hti/2JJ
The edges in Ejj form a channel of capacity lEiil = VmN2--Ht(i--H')/2i, between a level
i node and its parent.
As shown in Figure 2.2, the pruned butterfly is a subgraph of the butterfly. A
similar graph has been used in [LMR88] for randomized routing.
Vertex (log N, 1,0) of the pruned butterfly is called leaf1 and vertex (0,0, r)
is called root r. Between leaf 1 and root r is a unique shortest path that includes
15
??Q -? -? -? -? -? -? -?
--H I --H --HI --H --H --H --H --H --H?--H??--H?--H			I I
Figure 2.2: The pruned butterfly is a subgraph of the butterfly.
exactly one vertex from each level of the tree. If 1 and r have binary representa-
tions alog N--H1 ao and P(log N)/2--H1 `Po1 respectively, then the level i vertex of
the path is (i, [1/2log N?ij , r mod ?[(1og N--Hi)/21) or, with slight abuse of notation,
(i, 0LIogN--H1 QIogN?t, P[(IogN--Hi)/21--HI ` Po?
We now regard the vertices of the pruned butterfly graph as switches and
the edges as wires, and consider movement of messages in the resulting switching
network. A set of s _ messages rno,7n1,... , rn??1 is called an expansion if mh
has source at root h and destination at leaf 1h, and 1h--H1 < 1h for h = 1,... ,s --H 1.
A similar message set is called a compression if rnh has source at leaf 1h and
destination at root h,and 1h--H1 ?1?forh--H1,...,s--H1.
Let us consider the following greedy routing strategy. Messages may be active
or inactive; initially all messages are active. At each step, each active message
tries to advance one level on its unique source-to-destination path. Two messages
conflict at an edge if their paths both contain the edge. A message mh becomes
16
inactive and advances no further if it conflicts with a message 7nk where k < h.
This simple strategy has the following useful property.
Lemma 2 If the greedy strategy is applied to an expansion and a conflict arises at
an edge of channel c, then all the edges of c are occupied by active messages.
Proof: We will establish the following property inductively: the messages ar-
riving at any fat-tree node have consecutive sources. The property is trivially true
at the root. Suppose the property holds for all nodes at depth i --H 1 or less, and
consider a level i --H 1 node u joined to its child v by channel c. Let Mv be the
subset of messages arriving at u whose destinations are in v's subtree. From the
induction hypothesis, the messages of AL, have consecutive sources, so if there is
no conflict at any edge of c, all of Mv remains active and arrives at v.
Now suppose there is a conflict at some edge of c. Since the messages of Mv
have consecutive sources, we conclude from the structure of their unique paths to
u that they arrive at vertices of u that are consecutive modulo cap(c). Therefore a
conflict can occur only between two messages whose sources differ by cap(c). In this
case the conflict resolution rule guarantees that the cap(c) messages in Mv with the
smallest destinations remain active and arrive at v. Thus not only do the messages
arriving at v have consecutive sources, establishing the induction hypothesis, but
every edge of c is occupied, establishing the lemma. E]
Corollary 1 If the load factor of an expansion does not exceed 1, the expansion
ts routed by the greedy strategy without conflicts. By symmetry, the same property
holds for a compression.
17
An interesting property of the N-leaf pruned butterfly (reportedly [GL89] al-
ready observed by Leiserson and Leighton) is that it embeds an N-leaf mesh of
trees with constant dilation and load. Therefore, the ?(N log2 N) lower bound
for the area of the mesh of trees [Lei84] transfers to the pruned butterfly. An
O(N log2 N) layout of the latter graph is easily achieved by the 11-tree method, as
shown in Figure 2.3. Moreover, mesh of trees algorithms can be readily adapted
to the pruned butterfly.
Figure 2.3: The 11-tree layout of a 16-leaf subtree of the pruned butterfly.
2.2 Auxiliary Circuitry
For the on-line routing of a message set, the switching structure of the pruned
butterfly needs to be augmented with some circuitry supporting auxiliary functions
such as buffering, counting, sorting, and partial-sum computation.
Each leaf node of the pruned butterfly contains a terminal, a constant-area,
18
bit-serial interface with a processor. The processor stores a set of messages, each
consisting of a record with an O(log N)-bit information field, a log N-bit destina-
tion field, and a [log log Ni -bit peak-level field. (A message's peak is the level in the
tree of the lowest common ancestor of the message's current position and destina-
tion.) The processor keeps the messages in a priority queue organized by peak-level
(minimum level at the top) which can receive every O(log N) bit steps either
an insert or a deletemin instruction from the leaf.
A leaf node is responsible for initializing the peak-level of the messages orig-
inating at the attached processor, and for updating the peak-level of a message
before inserting it into the queue. A leaf also maintains log N counters, each stor-
ing the number of messages in the queue with a given peak level. In addition a
leaf is equipped with a comparator for log N-bit numbers and a circuit to compute
a mod b where a and b are O(log N)-bit numbers. The leaf can perform any of
the operations mentioned in O(log N) time, and can be laid out in a square region
of side length O(log N). The leaf's area limits the length of the peak-level coun-
ters to O(log N), which in turn limits the network to handling message sets with
cardinality polynomial in N.
By adding a single-bit full adder and an O(log N)-bit shift register to each node
of the pruned butterfly tree, and using a straightforward bit-pipelined version of
the tree implementation of prefix computation algorithms [DS82,BP89], we have:
Lemma 3 Let T be an n-leaf subtree of the pruned butterfly routing network where
a register in leaf i stores an integer 0 < bj ? N. In O(log N) time T can compute
and store at leaf i (a) the partial sum _ b5 or (b) the total sum _ b3.
By adding yW O(log N)-bit buffers to the root node of every n-leaf subtree
19
T, and adapting the mesh of trees sorting algorithm [NMB83,Lei8t] to the pruned
butterfly, we have:
Lemma 4 Let T be an n-leaf subtree of the pruned butterfly routing network, and
suppose a message set M of cardinality at most ? is initially at the root of T
Then in O(log N) time the message set can be sorted by destination and output at
the root T
The node at the root of an n-leaf subtree can be laid out in area O(n + log N), and
a leaf can be laid out in area O(log2 N), so an fl-tree layout of the entire circuit
takes area O(N log2 N).
2.3 Routing Algorithm
Routing of a message set M is performed in log N stages: stage 0, ..., stage
log N --H 1. Let Mj be the set of messages with peak at level i at the beginning
of stage i. During stage i each message of AA is moved to a new leaf, possibly
different from the final destination, but always lowering the message's peak. A
message not routed to its true destination is said to be sidetracked and is processed
again in the stage associated with its new peak. A crucial property maintained by
the algorithm is that, for each i, A(AIj) is O(A(AI)).
Stage i is conveniently described in terms of the activity of a generic subtree T
with root at level i + 1. Such a subtree interacts only with its sibling, T', to which
it sends and from which it receives some messages. Let MT be the set of messages
in Mi with source in T, and let
Aa(MT) i?i?a?X?MT?c)/caP(c)
20
be its ascending load facton Stage i of the algorithm partitions MT into AT =
[Aa(MT)] batches and routes each batch to leaves ofT! in O(log N) time. Although
some messages of MT may not reach their true destinations, the routing policy
guarantees that AT is O(A(M)), so stage i takes O(A(M) log N) time.
We now describe and analyze the algorithm for stage i of the routing.
Partition of MT into batches. First, each subtree T rooted at level i + 1
computes AT and makes this quantity available at each of its leaves, as follows.
Each leaf loads the value of its peak-level counter for level i into a designated
register. By a leaf-to-root summation of these values, each tree node obtains
?(MT, c), where c is the channel connecting that node to its parent. The node
then computes [A(MT, c)i [C(MT, c)/cap(c)i. (As cap(c) is a power of two, this
amounts to a right-shift with truncation.) Another leaf-to-root computation makes
AT = maxc?T[A(MT, c)i available at the root of T, which can then broadcast it to
all the leaves.
Second, for each leaf 1, T computes xl, the number of messages of MT with
source to the left of 1. (This is accomplished by applying part (a) of Lemma 3 to
the leaves' peak-level counters for level i.) If 1 contains yj messages of MT, then it
contributes a message to batch B(zi+x)modA?, for each x such that 0 < x ? Yl. It
is easy to see that for each batch B so defined, [Aa(B)] = 1.
Thus far the algonthm involves a constant number of O(log N) time operations.
Therefore we have:
Lemma 5 In only O(log N) time, MT can be partitioned into AT batches,
Bo,...,BAT?1, such that Aa(Bo),...,Aa(BA??i) < 1.
Routing of the batches. The messages in a batch B are first routed to the root
21
of T by a compression operation. Specifically, by a prefix computation, each leaf
of T containing a message m in B determines the number h of messages of B to
its left (0 < h < number of vertices at the root of T). Then message m is routed
to vertex (i + 1,j, h') at the root of T. Since Aa(B) < 1 (Lemma 5), the routing
paths do not conflict (Corollary 1). The bits in the binary representation of h
correspond directly to the binary choices message m encounters at even levels on
its path to (i + 1,j, h?. Thus the routing can be done bit-serially in O(log N) time
if the message packets are organized with the least significant bits of h at the front.
Once a batch B reaches the root of T, it is transferred to the root of ?? and
sorted by destination. By Lemma 4, this takes O(log N) time. Now B is an
expansion and is routed to the leaves of T' by the following variant of the greedy
strategy described earlier.
A bit is prepended to each message, indicating whether the message is active or
sidetracked. The bit is initialized to "active". A direction bit is associated with each
pruned butterfly vertex and initialized to "left". Routing is bit-serial. Decisions
are decentralized and made on-the-fly by individual vertices of the pruned butterfly
according to the following rules. (a) If there is no conflict, an active message is
routed toward its destination. (b) If two active messages compete for the same
edge, then the one with the larger destination is sidetracked and its initial bit is
toggled. (c) An active message has precedence over a sidetracked one. (d) Each
time a vertex receives a single sidetracked message from its parent(s), it sends it
down the edge indicated by the direction bit of that vertex, and toggles the bit.
(Notice that exactly the same set of active messages reaches each node as in the
earlier greedy strategy; in particular, Lemma 2 applies to the present strategy as
22
well.)
Assume that the message is organized in a packet whose first field is the desti-
nation address, the currently-most-significant bit of which is the first to arrive at
a pruned butterfly vertex. Based on this bit, the vertex can implement the policy
described above in constant time. If the message is active, then the bit is stripped
away before the message goes to the next vertex. The only subtlety arises in the
implementation of rule (b), because a vertex cannot determine which of two con-
flicting messages has the larger destination by comparing only the most significant
address bits. However, the vertex can send the (identical) address bits down both
edges, delaying the routing decision until a distinguishing bit appears.
Since messages have O(log N) bits and paths have O(log N) vertices, the bit-
serial pipelined routing of a batch from the root to the leaves of T' takes O(log N)
time. Thus, the entire routing of one batch takes O(log N) time and, as there are
AT batches, we conclude:
Lemma 6 All messages in MT can be routed to leaves of T' in time O(AT logN).
Our next goal is to bound AT in terms of A(M). Note that MT can be partitioned
into MTS u MT0, where MTS is the set of messages destined to T' that have been
sidetracked to T during stages 0 through i --H 1, and MT0 is the set of messages
that have never been moved from their original sources at leaves of T. Clearly,
Aa(M?0) ? A(M), and since Aa(AIT) < Aa(MTS) t Aa(M?0), all that remains is to
bound Aa(MTS).
Let c' be the channel between the root of T' and its parent. By Lemma 2 applied
to ??, a batch contributes messages to A/'T3 only if cap(c) active messages from the
same batch traverse c'. By definition of A(M, c'), there can be at most [A(M, c')1 <
23
[A(M)] such batches. Since each batch B is delivered to destinations that ensure
A(B) <1 we conclude that [Aa(M?8)] < [A(M)], and AT = [Aa(MT)] <
In summary, we have:
Lemma 7 For each subtree T, AT < 2[A(M)]
Combining the above results, we arrive at the following theorem.
Theorem 1 The pruned butterfly of N terminals can route a set M of O(log N)
bit messages, with IMI polynomial in N, and with load factor A(M), in time
O(A(M) log2 N) and area O(N log2 N).
In the analysis of our routing algorithm, we have assumed that sidetracked mes-
sages reaching a given leaf of the pruned butterfly are temporarily stored in the
priority queue of the attached processor. The leaf itself stores only the peak-level
counters, as described in Section 2.2. We now want to bound the size of the pro-
cessors' priority queues. While it is easily established that at most O(A(M) log N)
messages (one per batch) can be sidetracked to the same leaf, a careful analysis
yields a tighter bound in terms of d(M), the message set's degree.
Lemma 8 At no time during the routing of a message set of degree d does any
processor store more than (d t 1) log N + d sidetracked messages.
Proof: Let T be a subtree of the pruned butterfly and let IT denote the number
of its leaves. Choosing a time between batch routings when each message is at
some leaf, we denote by s(T) the total number of sidetracked messages stored at
the leaves of T and by x(T) the maximum number of sidetracked messages stored
at a leaf of T. We shall show that the routing policy balances the storage load at
24
the leaves so that, for any subtree T,
x(T) --H s(T)/ITI < (d + 1) log ITI.
(2.1)
When T is the whole fat-tree, IT = N and s(T) < dN, so (2.1) yields the statement
of the lemma.
We will establish (2.1) by induction on ITI. For ITI = 1, (2.1) is trivially
satisfied, as both sides of the inequality are zero. For ITI > 1, let T0 and T1
denote the two subtrees of T so that s(T0) < s(T1), and inductively assume that
x(Th)--H2s(Th)/ITI ? (d+1)log(T/2),for h = 0,1. To prove (2.1) forT, observe
that x(T) = max(x(To),x(Ti)) and s(T) = s(T0) + s(Ti), and consider two cases.
Case x(To) > x(Ti). Here x(T) = x(To) and, since s(T) > 2s(To),
x(T) --H s(T)/ITI ? x(To) --H 2s(To)/ITI ? (d + 1) log ITI,
where the last step uses the inductive hypothesis on T0 and the relation jT 1/2 < ITI.
Case x(To) <x(Ti). Here x(T) = x(Ti) and
x(T) --H s(T)/ITI ? x(Ti) --H 2s(Ti)/ITI + (s(Ti) --H s(To))/ITI.
(2.2)
We need to bound the quantity s(T1) --H s(T0). To this end, let m be a sidetracked
message stored in Th, for h e fO, 1J, and let v be the pruned butterfly vertex at the
root of T from which m entered Th. Message m can be one of four types. Either
1. m's final destination is a leaf of Th, or if not, either
2. m arrived at v together with an active message destined to T1?h, or
3. m arrived at v together with a sidetracked message routed into T1?h, or
4. m arrived at v alone,
25
Clearly s(Th) = ?4a=1 sa(Th), where 5a(Th) is the number of messages in Th
of type a. Since at most dITI/2 messages could have destinations in Th, s1(T1) --H
s1(T0) < s1(T1) ? dITI/2 and s2(T1) --H s2(To) < s2(T1) < dITI/2. By definition,
s3(T1) --H s3(To) = 0. Finally, as single sidetracked messages arriving at v are sent
alternately to Th and T1?h, we have that s4(T1) --H s4(T0) < $mTI, where is
the maximum number of vertices at the root of T. Combining these four bounds
yields
s(T1) --H s(T0) ? dITI + $tT(			(2.3)
Substituting (2.3) into (2.2) and applying the inductive hypothesis on T1 produces
x(T) --H s(T)/ITI < (d + 1) log(ITI/2) + d + $mTI
? (d +1) log IT --H d --H1 + d +
< (d+i)log?TI,
where we have exploited the fact that, as IT >2, $7jTI ? 1.
In conclusion, (2.1) remains established in both cases.			c
Remark: A priority queue for storing n messages can be laid out in area
O(n log N) using a systolic implementation[Lei79]. If d(M) is constant, then by
Lemma 8 the priority queues could be stored in the leaves without changing the
network's O(N log2 N) area bound,
Chapter 3
Deterministic On-line Routing on
the Sorting Fat-tree
The partition of messages by peak level in the pruned butterfly may unnecessar-
ily serialize the routing of subsets of messages which use different channels and
hence could be routed simultaneously. The sorting fat-tree, to be described next,
circumvents this problem by first bringing all messages to their peaks and storing
them. Unlike the pruned butterfly, a given node v of the sorting fat-tree can then
reorganize all of the messages with peak v for more efficient transmission down to
their destinations. This strategy leads in certain cases to an optimal routing time
of O(Alog N). However, the strategy requires that all messages be present in the
routing network simultaneously and hence limits the class of message sets that can
be handled.
In Section 3.1 we describe the main component of each node of the fat-tree: a
new type of sorting circuit. Section 3.2 describes the sorting fat-tree the network
itself, and Section 3.3, the routing algorithm.
26
27
3.1 Flexible Sorters
The main task of the sorting circuit placed at node v of the fat-tree is to reorganize
the set of messages with peak v. Roughly speaking, for the entire algorithm to
route in time proportional to the load factor, node v must be able to sort in time
proportional to the size of its peak set. Qualitatively, we refer to a circuit that
sorts in time proportional to its input size as a flexible sorter. Although VLSI
sorting has been investigated extensively, the circuits proposed in the literature
are for sequences of a fixed length 1. Since they do not yield better performance on
sequences of length r < 1, they are not flexible. In this section, we design a flexible
sorter with the properties summ&ized by the following theorem.
Theorem 2 Let ? > 1 be a constant and let 1 = O(s?). Then there is a VLSI
c'rcuit that, for any r such that s _ r ? 1, can sort r words of b bits in time
O((b + log s)r/s), and that can be laid out in a rectangle of height 0(s) and length
O(s[b/logs] + (l/s)(b+ log 1)).
We use an extension of Columnsort[Lei85a], reviewed in Lemma 9 below, to sort
r = ws words by a number of phases. Each phase consists of w sorting operations
on sequences of size 5, and of one permutation of ws words. To sort 5 words, we
use the area-time optimal circuit of [BP85], reviewed in Lemma 10 below.
Lemma 9 Let P be a two-dimensional array with 5 rows and r/s columns. Then
P's elements can be sorted into column maior order by an algorithm consisting of
logr			2/log(3/2)
log(s/2)
28
phases. In each phase, the columns of P are sorted and a permutation is applied
to the entries of P. At a given phase, the permutation depends only upon r and 5.
Proof: Recall that Columnsort sorts an 5 x (r/s) array of words by sorting the
columns four times and, after each time, permuting the entries of the array accord-
ing to a fixed pattern. A sufficient condition for Columnsort to work properly is
that 5 > 2i/3r2/3. If this relation is not satisfied, one can resort to a recursive appli-
cation of Columusort. The input array P is partitioned into blocks of consecutive
columns, with each block containing 21/3r2/3 elements. (Here and below, we ignore
for simplicity that some quantities may not be integer; suitable adjustments could
make the analysis rigorous without altering the essence of the result.) These blocks
are then treated as "virtual columns" and are sorted by a recursive call to Column-
sort, unless their size is 5 or less. A straightforward analysis shows that, at the kth
level of recursion (the main call being at level 0), the size of the sorting problems is
2(i?(2I3)k)r(2I3)k. This size becomes <5 for k (log (j0giO???1T2?))/log(3/2) As
the recursion tree has degree 4, there are 4k --H ((logr)/log(s/2))2/10?(3/2) leaves.
A leaf call corresponds to sorting each of the r/s columns of the s x (r/s) array.
The number of permutation steps between the sorting phases corresponding to
two consecutive leaf calls is equal to the height in the recursion tree of their lowest
common ancestor. By combining these consecutive permutation steps into a single
permutation step, we obtain an algorithm with the desired structure. E]
Lemma 10 There is a VLSI circuit for sorting 5 words of length b in time O(b +
logs) and area 0(52 [b/ log s]). Furthermore the circuit fits in a rectangle of height
29
0(s) and length 0(s[6/logs]).
Proof: The paper [BP85] shows how to construct such a circuit to fit in an 0(s)
by 0(s) square when b = O(log s). The area bounds for arbitrary b cited above are
obtained without modifying the layout except to extend the registers that compare
words and to stretch wires that connect them to the rest of the layout. 0
We are now ready for the main proof of this section.
Proof of Theorem 2: To simplify the presentation, we assume s and 1 are powers
of two. The circuit consists of the sorter for s words of Lemma 10 and a permuter
for up to 1 words. The permuter consists of s cycles of l/s cells, interconnected
by an Omega network [Law75] as shown in Figure 3.1. Each cell, represented by a
Input/Out ut
0(-?'(btlog-31))			0(-??logs)			?0(s[b/logs1)
Figure 3.1: A flexible sorter with s 8 and 1 64
small rectangular box in the figure, consists of two ?bit bidirectional shift registers
and some associated circuitry. In constant time, the two registers can be exchanged
30
or one loaded with the contents of the other. Each cell is 0(1) high and 0(b) long
and is connected by a constant number of wires to its neighbors on either side. The
cells of each cycle are labeled from 1 to t/s starting from the right. As illustrated,
the cycles have switches that permit them to be closed off at any length that is a
power of two up to lis. The switches of the Omega network are one bit wide, and
are represented in the figure by vertical lines joining open circles. Some additional
registers not shown in Figure 3.1 are required to control the permutation routing.
Attached to each switch of the Omega network is a shift register with O(l/s) bits
laid out horizontally parallel to one of the two cycles joined by the switch. Attached
to each cell labeled with an even number between 2m--H1 and 2m is a shift register
with O(log2 l/s --H rn2) bits, laid out horizontally between it and the neighboring
cell.
The height of the circuit layout is proportional to s, the number of cycles. The
length of the layout of each cycle is
O(bl/s +
Iog?/s
? 2rn(1og2 l/s --H m2)) O(bl/s + l/s log 1/.s).
m=1
Since each stage of the Omega network has length O(l/s), the entire Omega net-
work has length O((l/s) logs). Adding in the length of the sorter from Lemma 10,
we get O(s [b/ logs] +(l/s)(b+log 1)) for the length of the flexible sorter, as claimed.
We now describe how the flexible sorter implements the algorithm of Lemma 9.
The r input words, each of length b, enter the circuit bit-serially in groups of size
5 on the s wires shown at the top of Figure 3.1. The words are shifted into the
rightmost [r/s] columns of the cell array. The cycles are then closed off at the next
larger power of two, say w. From this point on, the algorithm consists of a constant
number of alternating sorting steps and permuting steps. In a sorting step, each
31
of the w columns of 5 words is passed through the sorter by a counterclockwise
bit-serial rotation through the cycles. In a permuting step, the circuit simulates
the Benes permutation network [Ben65] using essentially the same technique as the
cube-connected cycles architecture [PV81].
Recall that when an n-line Benes network routes a permutation, each of the
2n log n switches requires a control bit to tell it whether or not to exchange the
words appearing at its inputs. Consequently our permuter requires control bits
when it simulates an sw-line Benes network. To route a permutation of 3W words,
each switch of the Omega network needs 2w control bits to determine which pairs of
words to exchange between the cycles it joins. Each cell with an even label between
1 and w needs 2 log w control bits to determine which words to exchange with its
odd neighbor. For any fixed permutation, the control bits can be precomputed in
polynomial time as in [Wak6S]
The set of permutations our flexible sorter must route is fixed at the time of the
sorter's construction: the circuit must be able to execute Columnsort on sw words
for any value of w a power of two between 1 and 1/s. Since every such w is O(s?),
by Lemma 9 at most a constant number of permutations, say ?, are needed for
each valid w. The control bits a Benes network would use to route these P log 1/3
permutations are precomputed and stored in the shift registers at the switches of
the Omega network and between odd and even numbered cells in the array. In
each register, the bits are arranged in log 1/s flelds, one for each valid w. The fields
are in order from smallest to largest w, and within each field the bits are in the
order that the p permutations are used by Columnsort. In an Omega network
control register, each field has 2Pw bits, giving a total of ?1??(??/S) 2P2k --H O(t/s)
32
bits. In a control register at a cell labeled with an even number between 2m--H1 and
2?, each field has 2P log w bits, for a total of ?I?%??!/S 2Pk = O(log2 1/s --H m2) bits.
After w is known and just prior to the first permuting step, all control registers
are simultaneously shifted so that the appropriate field is at the front of the shift
register. Then during the permuting steps the control bits can be read and used
without delaying the data movement.
Let us now analyze the time taken by the flexible sorter. Shifting the input
into the cell array takes time O(br/s). Shifting the permutation control registers
so the appropriate field is ready to be read takes time O(log2 w) for the registers
in the cell array and time 0(w) for the registers in the Omega network. Since each
column can be sorted in time O(b + logs), each sorting step of the algorithm takes
O(w(b + logs)). The cube-connected-cycles method of executing a permuting step
takes O(wb + log s) time. Observing that w = O(r/s) and sorting steps dominate
the other costs leads to the bound claimed. n]
Henceforth we will refer to the circuit of Theorem 2 as a (b, 5, 1)-flexible sorter.
A special case of the (b,s, 1)-flexible sorter works in optimal time for its area over
the entire range of input sizes.
Taking into account the known bisection-based lower bound for sorting[Tho8O,
Lei85a,Bil84], we have:
Corollary 2 For 6 = logs + 0- (logs) and 1 = 0(s2/logs), a (6,s, 1)-flexibTh sorter
has area 0(s2) and sorts in time O((n/s)logs), which is AT2-optimal.
33
3.2 The Sorting Fat-Tree
The sorting fat-tree is an N-terminal routing network whose structure is a hybrid of
a fat-tree and a mesh. Groups of log2 N terminals are interconnected by log N x
log N two-dimensional meshes. Each terminal in a mesh is allocated a square
region with side length O(log N). Thus each mesh occupies a square region with
side length O(log2 N). The N/ log2 N meshes are placed at the leaves of a fat-tree
laid out in the 11-tree style. (For convenience, we assume N and N/ log2 N are
powers of two.) The structure of the node at the root of a subfat-tree with n
terminals is different depending on whether n is an even or odd power of two. If
n is an odd power of two, the node contains a (6, YTn, n)-flexible sorter whose
v? outputs enter a channel of capacity A?? connecting the node to its parent.
If n is an even power of two, the node contains (6, 2#, n)-flexible sorter whose
2# outputs are multiplexed into a channel of capacity ? connecting the node
to its parent. Figure 3.2 illustrates two adjacent levels of the tree, with the flexible
sorters labeled by their parameters.
The sorting fat-tree also has some auxiliary circuitry similar to that in the
pruned butterfly fat-tree. A N-leaf tree structure with its leaves at the terminals
permits the the sorting fat-tree to execute prefix computations as in Lemma 3. A
second tree structure with its leaves at the internal nodes of the fat-tree permits
the network to synchronize, in O(log N) time, operations at all internal nodes. As-
sociated with each terminal are a pair of 6-bit shift registers that can be compared
and exchanged.
To analyze the area occupied by the sorting fat-tree, let ? be a constant large
enough so that a (6, ffin, n/2)-flexible sorter and a (6, 2??i, n)-flexible sorter each
34
fit in a rectangle of height ?? and length ?# log N, and mesh of log2 N ter-
minals fits in a square of side length ? log2 N. Let S(n) denote the length of
one side of the (square) layout of a subtree of the fat-tree containing n ter-
minals, where n is a power of four and n > log2 N. Assume inductively that
S(n/4) < ?fiF4log(Nn/4). Clearly S(n) is the maximum of V(n), the ver-
tical length, and H(n), the horizontal length of the layout. Referring to Fig-
ure 3.2, one can see that V(n) < max(??nlog N, ?? + 2S(n/4)) and H(n) <
H(n)
n/4 terminal (b, 2?n, n)
subtree
O(?) (6, VW? n/2)			(6, ?, n/2) V(n)
S(n/4)			S(n/4)
Figure 3.2: The layout of the sorting fat-tree
?# + 2max(S(n/4),??nlogN). Applying the induction hypothesis and some
straightforward calculations shows that 5(n) < max(H(n), V(n)) ? ??log(Nn),
thus reestablishing the induction hypothesis. The entire network fits in a square
region of side length 2?vWlogN, so its area is O(Nlog2N).
35
3.3 Routing Algorithm
The algorithm of this section works for any message set M of constant degree.
For simplicity, we describe the special case where M is a partial permutation; the
extension to constant degree message sets is straightforward. The following list of
steps outlines the routing algorithm.
1. Messages with source and destination in the same mesh are routed by a
standard technique.
2. Messages to be sent between meshes are partitioned into log N batches.
3. Within each mesh, messages are reorganized into row major order by batch
number.
4. The batches are routed to their peaks consecutively.
5. All the messages with the same peak are sorted in order of destination.
6. The messages are again partitioned into log N batches.
7. The batches are routed down to their destination meshes.
8. Within each mesh, the messages are routed to their final positions.
Step 1 can be accomplished in O(log N) time by standard mesh routing tech-
niques based on sorting and prefix computation. Step 2 is more involved. We will
first describe the partition, then how the network computes it.
Let M be the set of messages that must be routed between meshes. We denote
by ?a (M, c) the load placed by a message set M on a channel c during the source-to-
peak movement of the message set. For a channel c that is k levels down from the
36
root of the fat-tree, the ascending load factor Aa(M, c) is ?(M, c)/?k. Let Mi
be the set of messages in M with their peak at level i (0 < i < log N --H2 log log N).
Assign to each message in Mi a rank between 0 and Mii --H 1 according to increasing
order of the message sources. The set Mj is then partitioned according to rank
into Mi0 U Mi1 ?.. U Mi(Iog N--H1)? where Mij is the set of messages in Mi whose
rank modulo log N is j. Then the log N batches of the partition are given by
M*j = UiMij, for each 0 ? j < log N. This partition is useful because it divides the
work of routing M roughly equally among the batches, as shown by the following
lemma.
Lemma 11 For each 0 ? j <logN, and for each channel c?
Aa(M*jc) = O([Aa(M,c)/logN]).
Proof: Let us first show that, except for log N messages, the load placed by a
batch on any given channel is a factor of log N smaller than the load placed by
the full message set on the channel. The statement of the lemma will then follow
trivially from the definition of load factor. Let c? be a level k channel (k <
log N --H 2 log log N, since there are no channels within the meshes). Only the
messages with peak levels smaller than k can affect the load a batch places on ck,
so for every 5
k--H1
a(NI*j,c?) = S &(NAj,c?).			(3.1)
t=O
Since all messages in Mi that must cross ck have consecutive ranks, our choice of
Mij as those with rank j(mod log N) amounts to selecting every log Nth message
that must cross c?. Thus
&(Mij,c?) = ?a(Mi,ck			logN
logN			< 1+ a(Mi,ck)
(3.2)
37
Combining (3.1) and (3.2), we have:
Ca(M*j,c?)			1 k--H1			<logN+ a1(o)1??ck)
<k+?0g? Z?(A'i,ck)
Since every channel has capacity at least log N,
Aa(M*j,Ck) --H Ca(M*j,c?) < logN + a(M,ck) --H?
7w2k			jm/2k logN772k
?( Aa1%M??ck)
0
To compute the partition of M of Lemma 11, the network first labels each
message with its peak level i. This is a simple matter of comparing the source
and destination bits and can be done for all messages in parallel in O(log N) time.
The network then prepends i to the messages' destination fields to simplify later
routing operations. Computing each message's rank within the appropriate Mi
can be done with a prefix operation for each Mi. Since each prefix operation takes
O(log N) time, all the ranks, and hence all the batch numbers can be assigned to
the messages in O(log2 N) time.
The purpose of Step 3 is to prepare the messages so that when they are routed
upward, the entire channel will be busy at once: there will be no gaps between
messages. Since each mesh holds log2 N terminals and the capacity of a channel is
the square root of the number of terminals below it, the channel that joins a given
mesh to the fat tree has one wire for each column of the mesh. Regardless of the
mesh's orientation in the layout, for the purposes of defining row major order, we
will consider top of the mesh to be the side to which the channel is attached. Steps
3 and 8 are similar to Step 1 and can be accomplished in O(log N) time.
During Step 4 the batches of messages are routed in order from their source
meshes up the tree to their peaks. The nodes of the fat-tree operate in lockstep
38
under the control of a global synchronizer. The step consists of a sequence of stages
numbered consecutively. In the even stages, each node at an even level receives the
messages of a given batch from its children, reorganizes them, and sends them to
its parent. In the odd stages, nodes at odd levels do the same. A stage is completed
only when all the nodes have finished their jobs.
Before the messages of a given batch are sent across any channel in the fat-tree,
they are grouped for efficient transmission into sets of size cap(c) called waves.
Each wave is sent across the channel in time proportional to the message length,
with one message on each wire of the channel. A set B of messages is sent across
channel c in time O((IBIlogN)/cap(c)) O(A(B,c)logN).
At each stage, after the node has received all messages in the current batch
from both children, it separates messages that have their peak at the node from
those that must be sent up to the parent. This is easily accomplished by sorting on
the peak field previously prepended to the message destinations. As the messages
exit the sorter, they are already properly organized into waves. Those destined for
the parent are sent out immediately; those currently at their peaks are shifted into
the unused portion of the flexible sorter's permuter, where they remain until Step
5.
A fat-tree node with parent channel c can execute the sorting and transmission
operations on batch j in time O(Aa(M*j, c) log N). Thus by Lemma 11 each stage
takes O(maxc [Aa(M, c)/ log Ni log N) time. The number of stages is proportional
to the height of the fat-tree plus the number of batches, or O(log N). hence Step
4 takes time O((maxc Aa(NI, c) + log N) log N).
39
In Step 5 the flexible sorter in each node is applied to all the messages that have
their peaks at that node, taking time O(maxc A(M, c) log N). Partitioning M into
log N batches for the downward movement is simpler than the partitioning in Step
2. No prefix computations are required since the messages are already separated
by peak level, and assigning batch numbers is a simple matter of marking each
message with its position in the sorted sequence modulo log N. Sorting the messages
again by batch number completes Step 6 in time O(maxc A(M, c) log N). Step 7 is
symmetric with Step 4 and takes time O((maxc Ad(M, c) t log N) log N), where Ad
is the load factor of the peak-to-destination movement of M. Since Aa(M, c) and
Ad(M, c) are bounded by A(M, c), we have
Theorem 3 The sorting fat-tree of N terminals can route on-line a set M of
O(log N)-bit messages with constant degree in time O((maxc A(M, c)+log N) log N)
and area O(N log2 N).
Clearly max? A(M, c) < A(M), so the sorting fat-tree's (deterministic) perfor-
mance matches the O(A log Ntlog2 N) time of the best known randomized routing
algorithm for an area-universal network [LMR88].
Chapter 4
Load Factor and
Area-universality in
Tree-structured Networks
We have seen several examples of how a fattened complete binary tree can be
a useful interconnection pattern for a parallel computer. It retains the principle
advantages of trees, namely their low diameter and degree, while avoiding their
chief disadvantage, the communication bottleneck at the root. In this chapter we
study tree-based networks in more general terms. In particular, Section 4.1 defines
the class of tree-structured networks and considers conditions under which they can
be compactly represented. Section 4.2 restates Leiserson's theorem on a tree-based
decomposition of an arbitrary network and shows how it can be applied to prove the
pruned butterfly and sorting fat-tree are area-universal. This application elucidates
the connection between routing performance, expressed in terms of load factor, and
area-universality. In Section 4.3 we show all tree-structured networks with certain
40
41
reasonable architectural properties require a factor of fl(log2 N) more layout area
than one could predict from their bisection widths. This result generalizes the
complex and unusual wire area lower bound for the mesh of trees [Lei84].
4.1
Channel-sufficient Networks
One disadvantage of load factor as a measure of the difficulty of routing a message
set M on a network R is that, in general, it is not easy to compute. Recall that
A(R, M) is defined as the maximum load factor on any cut of R, and there are
exponentially many cuts. Load factor calculation on weighted trees is much easier,
as shown by the following lemma.
Lemma 12 Let T be any N-leaf complete binary tree and let M be a set of mes
sages whose sources and destinations are leaves of T. Then
A(T, M) =			A(T, M, e).
Proof" Let A and B be any two disjoint sets of leaves ofT and let E fe,, ..., ent
be a cut of T that separates A from B. It suffices to show that A(T, M, E) <
???eET A(T, M, e). Let A4 c M be messages with source in A and destination
in B or vice versa. By definition of load, A(T, M, E) A(T, M, E). Since every
message that places a load of 1 on  also places a load of 1 on at least one edge
e Ei E, ?T, M, E) ? ZeEE ?(T, M, e). Therefore
< Ze??(%M,e) ? max?(T,M,e) _
A(T,M,E) --H Ze??cap(e) --H eEE cap(e)
max
e?T cap(e)
= ?e??TX A(T, M, e).
? A proof similar to this one is given for Lemma 1 in [GL85,GL89], which is an entirely
different statement
42
The second inequality above is due to the following fact: if xl, ..., X? > 0 and
Y1,?",Yn >0, then ?jxj/?jyj < maxj(xj/m).			0
In other words, Lemma 12 says that to compute the load factor on a weighted
tree, one need consider only the 2N --H 2 cuts consisting of single edges. We now
introduce a class of interconnection networks that can be regarded as weighted
trees.
We say an N-terminal network R is tree-structuredif there is a mapping a from
the vertices of R to the nodes of an N-leaf complete binary tree TR,? such that
o+ exactly one terminal is mapped to each leaf, and
o+ for each edge (v,w) of R, a(v) and a(w) are either the same tree node or
adjacent tree nodes.
The N-leaf tree TR,?, with each edge given a capacity equal to the number of R's
edges that cross it, is called the refrrence tree for 1?. (When a particular mapping
is understood we will drop the subscript a from our notation and refer to R's
reference tree as TR.) Note that the pruned butterfly and the concentrator fat-tree
are tree-structured, but the sorting fat-tree and the mesh of trees are not.
We would like to use a tree-structured network R's reference tree TR to simplify
the calculation of load factor: the load factor A(TR, M) on the reference tree should
approximate the load factor A(R, M) on the network itself. When this condition
holds, then by Lemma 12 only the 2N --H2 cuts corresponding to R's channels need
to be considered to determine the difficulty of routing a given message set.
Unfortunately, although A(TR, M) is always a lower bound on A(1?, M), it is
a very weak lower bound when the reference tree doesn't accurately capture the
43
communication power of the network it represents. For example, Figure 4.1 illus-
trates a network with the same reference tree as the pruned butterfly, but no more
2			22			22 2			2 22			22 2			2
Figure 4.1: Example of a tree-structured network with very little communication
power, together with its reference tree.
"useful" communication bandwidth than a simple binary tree. The load factor on
the reference tree of the network of Figure 4.1 could underestimate the true load
factor by O-(?). The rest of this section investigates the conditions under which
a reference tree more closely characterizes communication power.
We will find it helpful to use network flows; for the reader's convenience we
repeat here the following standard definitions [Tar83]. Let R = (V? E) be an
undirected graph with a positive integer capacity cap(v, w) on each edge (v, w).
44
For convenience, we define cap(v, w) to be 0 if (v, w) is not an edge. Let A be a
distinguished set of source vertices and let B be a distinguished set of sink vertices.
A flow is a real-valued function f of vertex pairs satisfying three conditions:
1. Skew symmetry. f(v,w) =
2. Capacity constraint. f(v,w) < cap(v,w).
3. Flow conservation. For every vertex v other than sources and sinks,
Zw?v(v,w) = 0.
The value if I of the flow is the net flow out of the sources, ?vEA,wEV f(v, w). Let
s*(?, A, B) denote the value of a maximum flow from A to B in R.
Several known area-universal networks contain sets of disjoint paths with be-
havior patterned after corresponding maximum network flows.
Lemma 13 Let PB refer to the pruned butterfly, let TPB be its reference tree,
and let A and B be two disjoint sets of terminals of PB. Then there is a set of
s*(TPB, A, B) edge-disioint paths from A to B in PB.
Proof: Let f* be an integral maximum flow from A to B in TPB, i.e., the flow
through every edge of TPB is an integer. Such a maximum flow exists by Theorem
8.2 in [Tar83]. We will construct the If*i = s*(TPB, A, B) paths so that the number
of paths through each channel is equal to the flow through the corresponding edge
of TPB.
The paths are constructed starting at the root and moving downward toward
the leaves. The flows through the edges between the root and its two children
are equal, and the paths are established consecutively on the smallest numbered
45
edges. Let v be a generic pruned butterfly node, let z be v's parent, and let x and
y be v's left and right children respectively. Assume inductively that f*(z, v) edge-
disjoint paths enter node v from z on cyclically consecutive edges. As illustrated in
Figure 4.2, there are essentially only two cases for the structure of the flow through
node v. The flow direction is indicated by the arrows on the edges in the reference
tree; the cases where the flow is in the opposite direction are obviously analogous.
reference tree TPB
6/8
v
x			y
pruned butterfly PB
z			I I
3/8			:
x			y
Figure 4.2: Creating paths from a flow. For simplicity only an odd level of the
pruned butterfly is illustrated; a similar diagram would apply for the even levels.
In the first case, the flow comes from the parent and goes to both children. In
this case, the first f*(v, x) paths are made to continue downward on the (cyclically
46
consecutive) edges to the left child; the remaining f*(z,v) --H f*(v,x) = f*(v,y)
paths continue downward on cyclically consecutive edges to the right child.
In the second case, the flow comes from the parent and from one child and goes
toward the other child. Without loss of generality, assume the flow is directed out
of the right subtree. In this case the f*(z, v) paths continue downward on cyclically
consecutive edges to the left child. Then f*(y,v) = f*(v,x) --H f*(z,v) new paths
are created going from right child to left on cyclically consecutive edges beginning
with the edge immediately following the f*(z, v) paths already chosen.
Since the flow value of f* on a given edge of TPB never exceeds the edge's
capacity, there are always enough edges in the pruned butterfly's channels to com-
plete the path-creating step just described.			El
We can make a claim similar to Lemma 13 for the concentrator fat-tree as well.
Recall the property of ot-partial concentrators mentioned in Section 1.3, namely
that disjoint paths can be constructed through an a-partial concentrator as long
as the number of paths desired is no more than a fraction a of the capacities of the
three adjacent channels. If If* is the value of a maximum flow from A to B in the
concentrator fat-tree's reference tree, then there are aif* I edge-disjoint paths from
A to B in the concentrator fat-tree. These paths can be constructed in a manner
similar to the proof of Lemma 13, but with the inductive step of the construction
following immediately from the definition of the partial concentrator switch.
The existence of such a set of edge-disjoint paths is a property important enough
to warrant its own definition. Let R be a tree-structured network and let A and
B be disjoint sets of leaves in its reference tree TR. Then R is said to be ?-
47
channel-sufficient if there is a positive constant ? < 1 such that ?s*(TR, A, B)
edge-disjoint paths join A and B in R. As we have just observed, the pruned but-
terfly is i-channel-sufficient and the concentrator fat-tree is a-channel-sufficient.
The reason for the name of this edge-disjoint paths property (and of this section)
is made apparent by the following theorem. The theorem asserts that a network is
?-channel-sufficient precisely when its reference tree can be used to accurately mea-
sure load factor, i.e., when the channels are a sufficient set of cuts for determining
the load factor to within the constant ?.
Theorem 4 An N-terminal tree-structured network 1? is ?-channel-sufficient il,
and only if for all M, A(1?, M) ? A(TR, AI)/?.
Proof: (?) Suppose R is ?-channel-sufficient. Let M be any message set, let S
be any cut, and let A and B be the terminal sets S separates. It suffices to show
that A(R,M,S) < A(TR,AI)/?.
Since there are at least ?s*(TR, A, B) edge-disjoint paths from A to B, clearly
cap(S) > ?s*(TR, A, B). By the max-flow, min-cut theorem, there is a set E of
edges of TR that separates A from B and has total capacity equal to the value of
the maximum flow, i.e., a set E such that Ze?E cap(e) = s*(TR, A, B). Therefore
cap(S) > ?Ze?cap(e). So
A(1?,M,S) = (R,AJ,S)			____________
cap(S)			--H ?Zc??cap(e)
?(TR, M, E) _
?Ze??cap(e) --H A(TR,M,E)/? ?
(?) Suppose A(R,AI) < A(TR,M)/?, i.e., for every message set Al and cut
5, A(R, M, 5) < A(TR, M)/?. Let A and B be any two disjoint sets of terminals
and let m be the minimum number of edges whose removal separates A from B.
48
By Menger's theorem (specifically, the "sets of points" variant of Theorem 5.11 in
[Har69]), m edge-disjoint paths join A and B. Let 5 be some specific set of ni
edges whose removal separates A from B.
Given any integral flow in a tree, it is easy to construct a corresponding message
set with tree load factor at most 1 and cardinality equal to the flow value. In
particular, let M be a message set corresponding to a maximum integral flow from
A to B in TR. Message set M has sources in A, destinations in B, A(TR, M) < 1,
and IMI = s*(TR,A,B). By hypothesis A(R,M,S) < A(TR,M)/?. So
A(R,M,S) --H ?(1?,M,S) --H IMI _ s*(TR,A,B) ? A(TR,AI)/? <
cap(S)			rn			rn
which implies that m> ?*(TR,A,B).			5
If a tree-structured network is ?-channel-sufflcient, then by Theorem 4 and
Lemma 12, the edge capacities of its reference tree accurately describe the available
communication bandwidth. Of course, good routing performance depends upon
some features not captured by the reference tree, for example, small diameter. But
if a ?-channel-sufflcient tree-structured network is unexpectedly poor at routing it
will be for reasons other than hidden communication bottlenecks. In this sense,
while not a guarantee of communication ability, the edge capacities of the reference
tree are a compact summary of communication potential.
The minimum bisection width of a graph is a similar (but even more compact)
summary of communication potential, and it can be used to prove a lower bound
on the graph's layout area. In Section 4.3, we shall see that, together with some
other not-too-restrictive properties, ?-channel-sufflciency can be used to prove a
larger lower bound than the bisection width. Meanwhile, we turn our attention
49
to the connection between load factor and a guarantee that a network is a good
interconnection pattern in comparison with other networks.
4.2 The Reference Load Factor
Expressing a routing algorithm's performance in terms of the load factor measures
how close to optimal it is for that particular network, but says nothing about how
well the network can simulate other networks of similar area. To prove a network
is area-universal it is helpful to have a measure of routing difficulty that reflects
time constraints on routing that come from the network's layout. Consider any
N-terminal router laid out in area 0(N). Any rectangle of area O(N/21) in a hier-
archical decomposition of the layout would contain O-(N/2?) terminals enclosed by
a perimeter of length o(772i). At best, the number of edges entering or exiting
such a rectangle would be proportional to the perimeter. These communication
constraints on an ideal network are captured by a standard capacity reference tree
Tstd, which we define to be an N-leaf complete binary tree with edges at depth d
given a capacity of ?/2td/2i Then 71(M), the reference load factor, is defined to
be A(Tstd,M), i.e.,
=			?T3t?, M, e)/cap(e).
Notice that n(M) is a measure of routing difficulty that is independent of the
particular network, and that bn(M) provides a lower bound on the time to route
M on any network. Also notice that this measure is sensitive to the naming of
the messages in M: two message sets could be isomorphic as graphs but have very
different reference load factors. The usefulness of the reference load factor stems
from the following result, a simple variant of Theorem 10 in [Lei85bj.
50
Theorem 5 If the terminals of an N-terminal router can be labeled in such a way
that it can deliver any message set M in time O(bn(M)r(N)), then it can simulate
any N-terminal router of area A, losing at most a factor of O(77Nr(N)) in
time.
The following lemma is a concise restatement of the main result of Section V of
[Lei85b]. The interested reader is referred there for the proof, which involves some
interesting combinatorial lemmas.
Lemma 14 Let R be an N-terminal routing network with layout area A, let T be
a complete binary tree with N leaves, and let (Ue, Ve) be the natural biparlition of
T's leaves corresponding to the removal of edge e from T. Then there is a mapping
of R's nodes to the leaves of T such that
o+ exactly one terminal is mapped to each leaf and
o+ if edge e is at depth d in T, then at most 0(?/$2d) edges of R have one
endpoint mapped to Uc and one to Ve.
Proof of Theorem 5: Let l?? be an N-terminal routing network whose termi-
nals can be labeled in such a way that it delivers every message set M in time
O(bn(M)T(N)). Let Rc be any competing N-terminal routing network occupying
layout area A. Choose names from 0 to N --H 1 for Rc's terminals by mapping them
to the leaves of Tstd according to a function satisfying Lemma 14. Let Al be any
message set with messages of length b, let Tc(AI) be the time required by Rc to
deliver M, and let Tu(M) be the time required by 1??.
By Lemma 14, corresponding to each edge e at depth d(e) in Tstd is a cut of
Rc with capacity 0(?/2d(e)/2) During delivery of M, at least bC(T3td, Al, e) bits
51
must cross the cut, and each edge can transmit at most one bit per unit time, so
Tc(M) > ?(b ?Tst?, A', e)/(?/??/2d(e)I2)).
Since e was chosen arbitrarily, we have
Tc(M) >
--H ?(b77A 7I(A')).
max ?(2d(e)/26 ?(Tstd, A', e)/?)
eETst?
max (2d(C)/2(Tstd, A', e)/?))
CETstd
Since Tu = O(b7I(A')T(N)), the simulation time penalty Tu/Tc is
c
We now have the tools needed to give "emulative" arguments that the routing
networks presented in Chapters 2 and 3 are general-purpose.
Theorem 6 The N-leaf pruned butterfly can simulate any N-terminal routing net-
work of area A losing at most a factor of O($m/Nlog N) in time.
Proof: By Theorem 1, the pruned butterfly PB can deliver a set A' of O(log N)-bit
messages in time O(A(PB, A') log2 N). By Lemma 13 and Theorem 4, A(PB, M) <
A(TPB, A'). But the capacity of an edge at depth d in TPB is 2?/2[d/21, so
A(TPB, A') ? 27I(A'). Therefore PB can deliver a A' in time O(7I(A') log2 N) time.
Applying Theorem 5 with r(N) = log N yields the statement of the theorem. E
The special case of the above theorem that is of most interest is when the
competing network has within a constant factor of the same area as the pruned
butterfly, or O(N log2 N) area. In that case, the pruned butterfly simulates the
competing network with at most a loss of O(log2 N) time.
52
Theorem 7 The N-terminal sorting fat-tree can simulate any N-terminal routing
network of area A losing at most a factor ofO(4m/NlogN) in time.
Proof: By Theorem 3, the sorting fat-tree can route a set M of O(log N)-bit
messages with constant degree in time O((maxc A(SFT, M, c)+log N) log N). Since
the capacity of a channel at depth d in the sorting fat-tree is at most jmN/2d, and
an edge at depth d in Tstd has capacity only #N/2[d/2i,
max A(5FT,AI,c)			? max A(Tst?,M,e)			,i(Al).
cESFT			--H eETst?
So the time to route on the sorting fat-tree is O((n(M)+log N) log N), and applying
Theorem 5 with r(N) = log N completes the theorem.			E
4.3 A Lower Bound on the Area of Certain
Tree-structured Networks
The pruned butterfly and the concentrator fat-tree each have 0(N) nodes and min-
imum bisection width o-(??), but their 11-tree layouts take O-(N log2 N) area. An
N-node mesh takes 0(N) area, but is not area-universal. An interesting question
to ask is whether something about area-universality forces an extra log2 N factor of
area. Unfortunately, we don't know the answer to that question, but we can show
that certain architectural features shared by the pruned butterfly and concentrator
fat-tree lead to ?(N log2 N) area. We have already observed in Section 4.1 that
both networks are ?-channd-sufflcient, that is, between every pair of leaf sets A
and B they have a number of disjoint paths proportional to a maximum flow from
A to B in their reference trees. This section shows that a slight restriction on the
53
disjoint paths, namely that they follow the underlying binary tree structure, is a
sufficient condition to establish an ?(N log2 N) lower bound on their area.
If u and v are leaves of a tree-structured network fi, then the path
U = Wi, .2...., W? = V 15 said to be tree-constrained if the sequence of reference tree
vertices a(wi), a(w?), ..., a(w?) is the simple path from u to v in T?,g, but possibly
with some repeated vertices. Note that the paths constructed for the pruned but-
terfly in the proof of Lemma 13 are tree-constrained, as are the paths of Lemma 1
on the concentrator fat-tree.
Theorem 8 Let N be a power of two and let F be an N-leaf tree-structured net-
work satisfying the following conditions:
o+ Every channel of F at depth k has capacity at least ?/K2--Htk/2i
o+ A set of?s*(TF, A, B) tree-constrained edge-disjoint paths joins any two dis-
joint sets of leaves A and B.
Then any layout of F takes ?(N log2 N) wire area.
Proof: The proof is by induction on N, and closely follows Leighton.5 wire area
lower bound for the mesh of trees [Lei84]. Let us refer to a network satisfying
Theorem 8's hypotheses as a "natural fat-tree". (The pruned butt&fly and the
concentrator fat-tree are natural fat-trees.) Let W(n) denote the minimum wire
area of an n-leaf natural fat-tree. Assume inductively that W(n) > ?n log2 n for
all n --H 2k where n < N, and where ? is a positive constant satisfying
(?/6144o)2			(4.1)
Let F be an N-leaf natural fat-tree. The edges of P can be partitioned into
levels E0, ?.,..., Ejog N, where Fo is the edges in the root node and Ej is the edges
54
in all nodes at depth i and in the channels to their parents. It is not difficult to
verify that if the edges E0, ..., Ej?1 are removed from F, each of the 2? remaining
subtrees with N/2? leaves is itself a natural fat-tree with the same parameter ?.
Suppose we are given some layout of F occupying W(N) area. As shown in
Figure 4.3, we partition the layout into three vertical and three horizontal strips in
such a way that the number of leaves in regions V0, V2, H0, and H2 are each NIS.
Without loss of generality assume that the longer side of the rectangular region
V1 n H1is horizontal, and let p be the distance separating V0 from V2. Remove the
top (3 log N)/4 levels of edges of F to produce N314 natural fat-trees, each with
N'/4 leaves. Since there are at least N/2 leaves in region V1 n H1, at least N3/4/2
of these subtrees have at least one leaf in V1 n H1. At most 4p edges cross the
boundary of V1flH1, so at least N314/2--H4p subtrees are wholly contained in V1flH1,
and by the induction hypothesis, each has wire area W(N'/4) > oN114 tog2 N114
Therefore
p2 > area(Vi fl H1) > (N314/2 --H 4p)oN114 log2 N'14
Applying the quadratic formula and discarding the negative root yields
p > [VmivlogN1I4(16oN--H1/2 log2 N114 + 2)1/2 --H 4oN1/4 log2 N1141 /2
> (V-oN log N)/8 + k$2 --H 1)V-oNlogN --H oN114 log2 Ni /8
>--H (V-oNlogN)/8
(4.2)
where the last inequality is true when 0 < ($2 --H 1)2?/ log2 N, which follows
from (4.1).
As shown in Figure 4.4, draw a new vertical line that bipartitions the leaves
and splits region V1 into regions Via and Vib. At least one of the regions Via and
55
V0			V1			`V2
H0
p ________			H1
H2
Figure 4.3: Partition of layout of F
Vib has width at least p/2; assume without loss of generality that it is region Vib.
Call the leaves of the left subtree of F lefileaves; of the right subtree, righileaves.
Without loss of generality, assume that region V0 u Via contains more leftleaves
than rightleaves. Then it contains at least N/4 leftleaves. If region V2 contains at
least N/16 rightleaves then let D frightleaves in V2J and let C = fleftleaves in
V0 u Vial. Otherwise, since V2 contains N/8 leaves altogether, it contains at least
N/16 leftleaves. Furthermore, since the number of rightleaves in region V16 u V2
is less than 7N/16, there must be at least N/16 rightleaves in `4) U Via. So let
C = ?leftleaves in V2) and let D = frightleaves in V0 u Via). In each case, we have
defined subsets from the left and right subtrees of F, each of cardinality at least
Nil6, and separated in the layout by a geometric distance of at least p/2.
For A and B any two disjoint subsets of fat-tree leaves, let z(A, B) denote the
wire area used in this layout by the ?s*(Tf, A, B) tree-constrained edge-disjoint
56
V0			Via			Vib			V2
o+			0
H1
o+			H2
Figure 4?4: Modified partition of layout of F
paths joining A and B. If A c c and B C D, then
z(A,B) > (p/2)?s*(TF,A,B),
(4.3)
because all the paths cross region Vib.
57
Next, let J ?J0, ??` J?12?1J be the following partition of the leaves of
the left subtree of F into subsets of size ?. For any (log N)/2 --H 1 bit string
X = ?1og N)/2--H2?1og N)/2--H3 ?o, define Jx to be the set of leaves with id's whose
binary representations are of the form 0 * ?Iog N)/2--H2 * ?(Iog N)/2--H3 * ?*, where *
denotes 0 or 1. As the reader can verify, this partition has the following two useful
properties. Let T be any N/2k?leaf subtree rooted at level k and let Jx ? J Then
(a) no more than ?/2[k/2i members of Jz are leaves of T, and
(b) at most ?/2[k/2] different Jz's have some member in T
We similarly partition the leaves of the right subtree of F by letting A4 be the
set of leaves with id's of the form 1 * ?log N)/2--H2 * ?(Iog N)/2--H3?? * *. Analogous
properties hold for this partition K.
Given any Jx Ei J and any Ky E K, by an arbitrary matching one can construct
a set of ? messages that all go through the root and have sources in Jx and
destinations in Ky. By property (a) such a message set places a load factor of at
most 1 on F's reference tree, so s*(TF, Jx, A?) = #N. It follows that if A C Jx
and B C Ky, then
s*(TF, A, B) = min(IAI, IBI). (4.4)
Let A = fx : IJx fl CI > jN/16J. Then since the J??'s form a partition and
IJxI=ThN,
#2-1
ICI =			IJxflC =			? IJxflCI+ ? Jr ncI
_ xEA z?A
? IAIvW +			--H IAI)(?/16).
Since C > N/16, IA > ?/30. Let A1,..., A#3o be distinct sets of the form
Jr A C where x E A. Similarly let B1, ..., B#3o be distinct sets of the form
58
Ky A D where Ky A DI > ?/16. By (4.4), for each i,
s*(TF, Aj, Bi) = min(IAiI, Bit) >
(4.5)
Lemma 15 If for every i 1 < i < ?/3O, the ?s*(TF, Aj, Bi) tree-constrained
edge disjoint Aj H Bi paths are simultaneously embedded in F, no level k edge has
congestion greater than YN/2?k/21
Proof: Let e be an edge at depth k in the left subtree of F. For any specific
i, all the paths under consideration are edge-disjoint, so the congestion of e is no
larger than the number of (Aj, Bi) pairs whose paths could potentially contain it.
But since the paths are tree-constrained, a particular set of paths can contain edge
e only if Aj includes a leaf in the subtree rooted at e. Recall that property (b)
of the partition J guarantees that at most ?/2?k/21 different sets Jx contain a
leaf in the subtree rooted at e. Each Aj is a subset of exactly one Jz. Therefore
at most ?/2?k/21 different pairs (Aj, Bi) have paths that contain e. A similar
argument applies to edges in the right subtree of P E]
Let Lk denote the total wire area used by level k edges. By Lemma 15,
?/3o
IjN )ffii >- ?, z(Ai, Bi).
From (4.3) and (4.5),
A/3O			ThN/3o
z z(Ai,Bi)>? z
t=1			--H 2
5*(TF,Ai,Bi)>
0 16
Using (4.2) to substitute for p, combining the above results, and simplifying yields
logN Lk ___
2[k/2j			7680
59
From the induction hypothesis, for k = 0,1, ...,logN --H 1 we have
log N
? Lj > 2k+lw(N/2k+1) >--H ?Nl0g2(N/2k+l)
j=k+1
(4.6)
Clearly W(n) > L0 + . + LiogN, and L0+?? + LiogN is minimized when the
equations are satisfied with equality. At the minimum, we have
logN			Lk
L0 = ?6ffia0NlogN			Z 2tk/2i			(4.7)
For 1 <i < logN --H 1,
logN
Lj =
J=t
Lj --H ljN Lj = aN [(logN/2i)2 --H (logN/2i+l)2?
j=i+1			= aN?logN--H2i--H1),
and LiogN = 0. Substituting (4.9) into (4.7), we obtain
?#NlogN?(yN(2logN??) log N--H1
L0 --H 7680
>--H (%#w0 --H6a)NlogN.
From (4.9),
2--H[k/2J + 2&N
logN--H1
Z Lk = aN(log N --H 1)2
logN--H1
z
Combining (4.10) and (4.11), we get
W(N) > 7680 --H 6a)N log N + cN(log N --H 1)2
> oNlog2N+(?# --H8a)NlogN+aN
7680
> aN log2 N,
(4.8)
(4.9)
k2? [k/2J
(4.10)
(4.11)
where the last inequality holds when ?#/7680 --H 8a is positive, which follows from
(4.1).			E]
Chapter 5
An AT2 Lower Bound for
Content-Preserving Routers
Let R be an N-terminal router of area A. By decomposition tree techniques[BL84,
Lei85b], the terminals of 1? can be labeled in such a way that the time T to
route M satisfies the tradeoff AT2 ?(b2?2(AI)N), where 77(M) is the reference
load factor. This lower bound captures the bandwidth constraints imposed by a
certain set of cuts of the network. It is natural to ask whether there is a router
that can achieve this lower bound, delivering any message set with performance
AT2 --H O(b2?2(M)N). The answer is negative, at least for a wide class of routers,
as stated by Theorem 5 below.
We say that a routing algorithm is conieni preserving if the body of each mes-
sage is not modified while the message travels from source to destination. More
formally, we require that each bit of the message body trace a source-to-destination
path, with the property that if any two paths share an edge the corresponding bits
traverse that edge at different times.
60
61
Most routing algorithms proposed in the literature, including those in this the-
sis, are content preserving. (An exception is [Rab89], but there packet splitting
and recoding is for the sake of fault-tolerance, not efficiency.) Content-preserving
routers satisfy the following lower bound:
Theorem 9 Let R be an N-terminal routing network with a content-preserrnng
routing algorithm (either off-line or on-line). Then for each value of b and each
TI < ? there exists a set M of b-bit messages with reference load factor O-(?),
such that the time T to route M on 1? and the area A of It satisfy the tradeoff
AT2 = Q(b2n2Nlog2(N/n2)).
To prove Theorem 9, we first develop a few lemmas, for which we need a
number of definitions. Recall that a graph (V E) is an (c, fl)-expander if for any
S c V such that S ? PIVI, IF(S) --H ?1 > aISI, where F(S) is the set of
vertices adjacent to some vertex in 5. Let EXP(n, a, ?) denote an n-vertex (a, P)-
expander. We also recall [11ar69, pp. 21--H23] that given two graphs C1 = (U, E1)
and G2 = (V, E2), their cross product, denoted C1 x G2, is the graph (U x V, E)
wheret(ui,vi),(u2,v2)) ? Ewhenevereitheru1 v2 and tvi,v21 ? E2 or v1 =V2
and tui,t?] E E1.
Lemma 16 If G1 and G2 are (a,P)-expanders, then G1 x G2 is an (a/4,P2/2)-
expander.
Proof: First we develop some notation. Let G1 = (U, E1) and G2 = (V, E2),
where U = p and ?V = q. Let G = G1 x G2 = (w,E), where W = U x V =
?(ui,vj) : Uj ? U and Vj E V, for 0 < i ? p and 0 ? j < q). For X c w and
O<k<q,let??(X)=((uj,vj)EX:j=kJ. IfoneviewsGasqcopiesofCi,
62
where homologous vertices of the copies are connected according to the pattern of
U2,s edges, then ?k(W) is the vertex set of kth copy. A restricted form of adjacency
in 0, namely adjacency through copies of Ci `s edges, is defined as follows. For
X C W,F1(x) = t(ui,vj) ? W : (uj,vj)isadjacenttoavertexin?(X)?. Clearly
F1(X) c F(X). Mentally reversing the roles of Ci and 02, let ph(X) = f(uj, vj) ?
X : i = hi, for each 0 ? h <p, and let F2(X) = f(ui,vj) e W: (ui,vj) is adjacent
to a vertex in pj(X)J.
Let 5 C W such that 151 < p2IwI/2. We will isolate the portions of 5 that do
not expand well under the adjacency functions F1 and F2. Let I = : Ipi(S)I >
flq? and let J = : l?(5)l > ppj. The set I (respectively, J) is the indices of the
copies of G2 (Gi) that contain too large a part of 5 to be guaranteed expansion
through U2,s (Gi's) edges. Let 5lost = f(ui,vj) ? 5 i ? I and j E J?. 51ost
is the part of 5 that doesn't expand either through Gi's edges or through 02's.
Clearly
15lost1 ?			151151 __ 151 151.			(5.1)
?q Pp --H P2IWI
We now turn our attention to 5 --H 5lost Let Si = f(ui,vj) E 5 : j ? J? and
let 52 = f(ui,vj) ? 5 i ? I). Since for each j ? J, lmi(5i)l ? Pp, by the
expansion property of 01, IFi(Si) --H 511 > aISil. From the definitions ofF1 and
Si, one can see that F(5)--H5 ? F1(51)--H51. Similarly, IF2(52)--H52 > al5?l and
F(5) --H 5 ? F2(52) --H 52. It follows that
IF(S) --H51 > max(IFi(5i) --H511, IF2(S2) --H521)
> 0ma4ISiI,I5?I)
>?JI5iU52I=)IS 5lost1
63
From (5.1) and the upper bound on			we can therefore conclude
IF(S)--HSI > --H?(ISI--H?ISI) --H --H?(i--H??)151 > a15
2			P2IWI			2			P2IWI			4
Given a graph G whose vertices are identified with the terminals of a routing
network 1?, we denote by M(G) the set containing for each edge tu, v) of & a
message from u to v. Also, let area(G) denote the (minimum) layout area of graph
G. We are now ready to state a result that links the layout area of graphs to the
AT2 measure of content-preserving routers.
Lemma 17 Let & be a graph of constant degree. If a content-preserving router of
area A can route M(G) in time T, then
AT2 = ?(area(G x EXP(b,c,P))),
where b is number of bits in the body of a message in Af(&).
Proof: Because R's routing algorithm is content-preserving, the paths traversed
by the bits of the messages of M(G) form an embedding of b copies of & in R, with
congestion at most T. By a variant of a technique due to Thompson [Tho80], such
an embedding can be converted to a layout of area O(AT2) for & x EXP(b, ?,
whence the statement of the lemma.
We now outline the construction of the layout of & x EXP(b, a, ?). Assume
that R is laid out in two layers on a square grid ? = ((i, 5) : 0 <i <I 0 <j < J)7
with IJ = A. The layout of & x EXP(b, a, ?) will have six layers on a square grid
= ?(h, k : 0 < h < IT, 0 < k < JT). Intuitively, ? is viewed as the interleaving
of T stretched copies of g. Laid out on the tth copy of g are the wires of 1? that are
active at time tin delivering some bit of Ai(&). Nlore precisely, if a horizontal wire
64
of R is used at time t (for 1 < I ? T) to move some bit from point (i,j) to (i,j + 1)
on grid g, then a wire is placed in ? from (iT+t,jT+t) to (iT+t,(j + 1)T+t).
A similar mapping is defined for vertical wires.
In the resulting layout, for every sequence of wires of R traversed by a bit going
source to destination, there is a corresponding (disconnected) sequence of stretched
wires. The stretched wire sequences are converted to paths by adding an extra edge
(on a different layer) between each pair of adjacent stretched wires. Thus, we have
obtained a three-layer layout on ? of the edges of 6 copies of G. For each copy, we
then connect together the endpoints of the edges that are incident upon the same
vertex of G, and so obtain a four-layer layout of 6 copies of 0. Next we construct
expanders on each group of 6 homologous copies of a vertex of 0. As these copies
are placed in a T >c T region of g, and 6 < T the expander connections can be
added on two extra layers without increasing the layout area. E]
The graph that will play the role of 0 in Lemma 17 is the cross product of an
expander and MOT(n), the n-vertex mesh of trees [Lei81], to be defined next. Let
? be a power of two and let n be the largest integer of the form 3?22k --H 2k+1
not larger than N/?2. Place 22k of the n vertices in the even rows and columns
of a 2k+1 >c 2k+1 grid. These vertices are called leaves. In each occupied row and
column, construct a complete binary tree on the 2k leaves by adding vertices and
edges as needed. The 2 internal nodes of a tree are placed at the odd grid
positions. Thus an examination of the grid points in an even row from left to
right or in an even column from top to bottom, yields an in-order traversal of the
corresponding tree. The vertices of MOT(n) are named by their association with
the grid points, which are themselves labeled in "shuffle-major" order: the grid
65
point in row r and column c, for 0 < r, c < 2k+1, is labeled with the integer whose
binary representation is the bits of r interleaved with the bits of c.
Lemma 18 Under an appropriate naming of the vertices, the reference load factor
of MOT(n) x EXP(n2,a,P) is O-(?).
Proof: First we show that the reference load factor of MOT(n) is constant.
Recall the binary tree used to define the reference load factor. The labels on the
leaves of any complete subtree of height h correspond to the labels in a rectangular
region of the grid containing the mesh of trees. The number of edges of MOT(n)
crossing the boundary of any such rectangle is proportional to the length of the
perimeter. But the perimeter has length 0(2h/2), which is proportional to the
capacity of the edge leaving the root of the subtree. Thus the reference load factor
is constant.
When assigning names to vertices of MOT(n) x EXP(7?2,Q,P), place the ex-
pander part of each vertex's name in the least significant position. Thus, each
?2-vertex expander gets mapped to an ?2-leaf subtree when computing the refer-
ence load factor. The statement of the lemma follows.
Lemma 19 If H is any m-vertex expander, then
area(MOT(n) x			= ?(rn2nlog2n).
Proof: The graph MOT(n) x H is a supergraph of the expander connected
mesh of trees Qm?2k defined in [BL84]. The lower bound on wire area derived there
applies directly.			E
66
We are now ready to combine the lemmas into a proof of Theorem 9. The
message set M in question is the one corresponding to the graph MOT(n) x
EXP(ii2, a, ?). By Lemma 18, M's reference load factor is e(?). By Lemma 17,
any area A router that routes M in time T satisfies
By Lemma 16,
AT2 = ?(area(MOT(n) x EXP(?2, a, ?) x EXP(b, a,
AT2 = ?(area(MOT(n) x EXP(?2b,a/4,P2/2))).
Finally, by Lemma 19, and the fact that n = O-(N/ii2), AT2 = ?(?4b2nlog2n) =
?(?2b2N log2(N/?2)).
Chapter 6
An Area-universal VLSI Circuit
6.1 Introduction
Theorems 6 and 7 on the time penalty paid by the pruned butterfly and sorting
fat-tree to simulate an arbitrary computer were proven under the assumption that
the competing network conformed to the model outlined in Section 1.2, i.e., a
set of N processors communicating by sending sets of messages. Clearly not all
parallel computing devices fit this model. Some special purpose computers might
not have any entities recognizable as processors, nor communicate via anything
more message-like than single-bit signals sent on wires.
This chapter considers the problem of designing an area-universal computer
for a more general model where the computing devices do not necessarily have
processors. Ideally, we would like a computing device that could efficiently simulate
any target circuit of area A in Thompson's model. Unfortunately we don't know
how to measure time for an arbitrary asynchronous circuit. To give a meaningful
bound on how quickly the universal device carries out the simulation of a target
machine, there must be some definition of a "time step" of the target. We have
67
68
therefore chosen to assume that each gate in the target circuit has a latch at its
output which can change only at each tick of the (global) clock. Although this
may at first seem to unduly narrow the class of circuits our universal device can
simulate, as a practical matter, most digital devices are synchronous and have
only a few levels of combinational logic between registers. Assuming a latch at
the output of every gate will very likely slow the target circuit by only a small
constant, and we are concerned primarily with asymptotic results.
The reader should note that the N-terminal concentrator fat-tree, pruned but-
terfly, and sorting fat-tree could each be adapted to simulate any N-gate circuit
in the model just described. The adaptation would take the form of allocating
one gate to each terminal and, for each time step of the target circuit, generating
a constant-degree set of i-bit messages corresponding to the signals on the cir-
cuit's wires at that time step. However, each of the above networks requires area
O(Alog2 A) and time O(log2 A) to simulate one step of a circuit of area A. In the
next section, we construct a circuit of area only 0(A) that can be "programmed"
to simulate any circuit of area A with only O(log A) slowdown.
6.2 An Area-universal VLSI Circuit
Circuit Ca is said to simulate circuit Cb with slowdown t if (1) the output latch of
each gate of Cb is put in correspondence with a latch in Ca for the duration of the
simulation, and (2) for each tick of C6,s clock, the latches of Ca that correspond
to latches in C? are updated with the values of C6 within t ticks of Ca's clock. A
circuit Ctt is area-universal with gap f(A) if it can be laid out in area 0(A) and
for any circuit C taking up area A, control registers in C? can be loaded with
69
information that permits C? to simulate C with slowdown O(f(A)).
Theorem 10 There is an area-universal circuit with ?ap
We first show that a layout containing very long wires can always be reorganized
so that no wire is longer than the perimeter of the layout.
Lemma 20 Any VLSI circuit that can be laid out in a square of area A can be
laid out in a square of area 4A using only wires of length 8? or less.
Proof: We leave alone wires shorter than 2?5? and reroute the at most ?/2
longer wires. Each long wire has a source end and a destination end. We expand
the layout grid horizontally by adding an extra vertical track for the source of
each long wire. The extra tracks for the sources in a given column are added
between it and the the column immediately to the right of it, so the vertical tracks
that were part of the original layout may have unequal spacing in the modified
layout. The new wires are routed up the vertical tracks to the top of the layout.
A similar expansion of the layout in the vertical direction is made by adding an
extra horizontal track for each destination of a long wire. The layout now has
dimensions at most 3?/2 x 3jA/2, the sources have been routed to the top of
the circuit, and the destinations have been routed to the right. We now further
expand the layout by adding #A/2 new horizontal tracks at the top of the layout
and ?/2 new vertical tracks at the right of the layout. These tracks are used to
join the already partially routed long wires. E]
Proof of Theorem JOIn the grid model of VLSI computation, there is only a
constant number of possibilities for the contents of a given point at the intersection
70
of two tracks of the layout grid. For example, it can be empty, contain a pair of
crossing wires, a single wire bending, a gate with its inputs on the south and
west wires and outputs on the north and east, etc., or any rotations of these. It
is possible to construct a programmable finite state machine that can simulate
every possible circuit element with constant delay. We refer to such a finite state
machine as a universal circuit element. The universal circuit element has an 1/0
port for each of the four grid directions and a one-bit latch designated to hold the
output of the gate it will simulate, if any. It is programmed by loading a control
register with a constant number of bits that specify the type and orientation of
circuit element to be simulated. Associated decoding circuitry assures that the
appropriate boolean function of the inputs (either the NAND function in case of a
gate or the identity function in the case of wire) appears at the appropriate output
ports within constant time. This structure fits in constant area, and the universal
circuit G1L is a mesh of such universal circuit elements.
To configure GIL to simulate an arbitrary circuit G that can be laid out in area
A, we first normalize G to exclude wires longer than sA and to contain only
two-input NAND gates with fan-out at most two. This expands C's area by at
most a constant factor. The control registers of the universal circuit elements in Cn
are configured according to C's normalized structure, and G's gate latches are put
in correspondence with the designated one-bit latches in the appropriate universal
circuit elements.
The simulation of G by Cu proceeds in the natural way, with the delay on a
wire being proportional to its length instead of constant. Since all the wires are of
length O(?), the simulation can be completed in time O(?). E)
71
The above construction will be used to simulate small subcircuits in the proof
of the following theorem, the main result of this chapter.
Theorem 11 There is an area-universal circuit with gap log A.
We will need the following lemma, which was stated in a somewhat more general
form as Corollary 2 in [Lei85b].
Lemma 21 Let CFT denote an N-leaf concentrator fat-tree built with ?-partial
concentrators in which each channel has capacity at least 2 log N, and let M be
any message set. Then there is a partition of Al into O(A(AI)) subsets, with each
subset Mi satisfying
niiaxA(CF%Mj,c) <--H ?.
Proof of Theorem 11: The area-universal circuit Cn is based on an area-
universal routing network where a single-bit message is routed between each pair
of terminals that corresponds to the endpoints of a wire in the simutated circuit
C. Since C is fixed in advance, the sources and destinations of the messages never
change, and only off-line routing is needed. Our construction uses ideas from
[Lei85b] and [LMR88] to obtain an efficient solution using pipelining of messages.
Let ? be the constant factor by which each dimension of the layout is expanded
when a circuit is normalized as in Theorem 10 to exclude long wires and to contain
only two-input NAND gates. Then Cu consists of ?2A universal circuit elements
interconnected by a hybrid of a mesh and a fat-tree. Specifically, the universal
circuit elements are placed in A/log2 A meshes, each ? log A x ? log A, at the leaves
of a fat-tree. (For convenience, we assume A and A/log2 A are powers of two.)
72
The capacity of a channel at the root of a subfat-tree containing n universal circuit
elements is max(2 log A, #?/ log A).
At each internal node of the fat-tree is a concentrator switch of the form de-
scribed in [Lei85b], but augmented in two ways. First, as in [LMR88i, we insert a
chain of single-bit registers on every path through the switch from one child to the
other. If the node is at the root of a subtree containing n universal circuit elements,
the chains have length 2log(A/n). The purpose of the chains is to delay message
propagation so that all paths between leaves have equal length. The number of
chains is equal to the channel capacity, so the total area they add to the node is
0(2 log(A/n)?n/ log A)
The second augmentation to the concentrator switch is the ability to store
O(log A) switch settings and play them back consecutively in constant time each.
Since the concentrator has constant depth and bounded degree, one setting of
an s-input concentrator can be stored with 0(s) bits. The number of inputs to
a switch is proportional to the channel capacity O(max(log A, ?n/log A)), so the
O(max(log2 A, #n)) extra area needed to store the switch settings will not increase
the O(max(log2 A, n/ log2 A)) area already devoted to the concentrators by more
than a constant factor.
To analyze the area of Cu, let S(n) be the length of one side of the square layout
of a subtree containing n universal circuit elements. The recurrence relation for
S(n) is
O(log A)			if n log2 A
S(n)			2S(n/4) + O(log A)			if log2 A ? n ? log4 A
2S(n/4) + 0(#/ log A)			if log4 A ? n
73
The solution of this recurrence is S(A) = o(V)k), implying the total area of C? is
0(A).
The structure of C is loaded into the control registers in Ctt's universal circuit
elements according to the following scheme. We partition the layout of C into
subcircuits by overlaying a grid with cells of size (log A) >< (log A). Wires originally
crossing one or more of the grid lines are segmented by the partition. The initial
and final segments of such wires are rerouted to 1/0 registers on the border of
their respective subcircuits; the remaining segments are removed entirely. Each
subcircuit is then normalized, expanding its layout to (? log A) x (? log A). The
subcircuits of C are labeled in "shuffle-major" order: the subcircuit in row r and
column c of the grid, for 0 ? r c ? ???/ log A, is labeled with the integer 1 whose
binary representation is the bits of r interleaved with the bits of c. Subcircuit 1 is
mapped to the mesh at the 1th leaf of the fat-tree, with the 1/0 registers along the
border joining the mesh to the fat-tree. This "shuffle-major" assignment has the
property that each subtree of C? has a contiguous rectangular region of C mapped
to it.
Circuit Cn simulates one step of C in two phases, each taking O(log A) time. In
the first phase, each mesh simulates the subcircuit assigned to it. In essence, the
first phase is A/log2 A copies of the area-universal circuit of Theorem 10 working
independently in parallel, so it can clearly be accomplished in O(log A) time.
In the second phase, the wires between subcircuits are simulated by sending
single-bit messages through the fat-tree channels. Observe that the maximum
number of wires that can enter or leave any Mn x Mn region of C is proportional
to the perimeter, or o(#). The root channel of each subtree of Cu to which such
74
a square region has been mapped has capacity max(2 log A, #?/ log A) Therefore
when the signals on the wires between subcircuits are regarded as a set M of single-
bit messages, the load factor M places on any channel is O(log A). By Lemma 21,
M can be partitioned into d = O(log A) message subsets ?....., Md, each with load
factor on every channel at most OL. By Lemma 1, there is a set of disjoint routing
paths for any particular subset Mi. These paths can be specified by one setting
of each concentrator switch of the network. Thus O(maxc A(M, c)) = O(log A)
settings of the concentrator switches suffice to store routing paths corresponding
to all the wires that go between subcircuits. Since all paths between leaves of
the fat-tree are of equal length, the message subsets Al1, . .., Aid can be pipelined
through the switches for delivery in O(log A) time. 0
6.3 Discussion
At first glance, the circuit of Theorem 11 seems to be a better general-purpose
computer than those presented in Chapters 2 and 3, because it has a smaller
penalty in both area and time when simulating an arbitrary machine. But the
price we have paid for the better resource bounds on the universal device is the
loss of the helpful programming abstraction of message passing. Of course, the
question of how best to program a computer consisting of a network of processors
is far from solved, but at least a substantial number of useful programs have been
written for that model. In contrast, one clearly cannot expect a programmer, for
each problem to be solved, to design a circuit whose description would be loaded
into the concentrator switches of the area-universal circuit. One might imagine that
compiler technology could advance to the point where such a circuit description
75
could be produced automatically from a program written in an easy-to-understand
language. The design of such a compiler is one of the obstacles likely to keep this
line of research of only theoretical interest for quite some time.
Chapter 7
Conclusion
This thesis has investigated the problem of designing area-universal interconnection
networks and deterministic wuting algorithms for them. The main results are the
constructions of the pruned butterfly and the sorting fat-tree. Although the sorting
fat-tree has the better performance, the flexible sorter at the internal nodes of the
tree is probably too complex to be practical. The pruned butterfly, on the other
hand, has a very simple structure that may be practical.
As shown by Theorems 6 and 7 both the pruned butterfly and the sorting fat-
tree can simulate networks with area within a constant factor of their own area with
a slowdown of only O(log2 N). It is an open question whether this time penalty is
the smallest possible. However, one consequence of Theorem 9 is that Leiserson's
method of proving a network area-universal (summarized by Theorem 5) will at
best achieve an upper bound of O(log N). Any tighter bound on the simulation
slowdown, if achievable at all, would require drastically different methods.
The role of reference load factor can be better understood by relating some of
the routing results of this thesis to known bounds on the layout area of graphs. It
76
77
is well known[BL84] that a graph with a (minimum $2-) bifurcator of size F has
layout area at least ?(F2) and at most O(F2 log2(N/F)). Moreover, each of these
bounds is achieved by some graph. Therefore, the bifurcator determines the area
only to within a factor of log2(N/F).
An analogous situation holds for the reference load factor as a predictor of the
AT2 measure of routing complexity, albeit with a few more conditions. Suppose we
restrict our attention to content-preserving routers and constant-degree message
sets with 0 (log N)-bit messages and reference load factor ?, where ?(log N) ? 71 ?
O(N'/2--H?), for some fixed constant ? > 0. Then the AT2 measure for routing a
message set is at least ?(?2Nlog2 N), by arguments from [Lei85b], and at most
O(712N log2 N log2(N/?2)), by Theorem 3 on the sorting fat-tree. It is easy to
construct message sets for which AT2 = O-(712N log2 N), but there are also some for
which AT2 --H O-(712N log2 N log2(N/?2)), as established in Theorem 9. Therefore
the reference load factor determines the AT2 measure only to within a factor of
log2(N/?2).
It is natural to ask whether some parameter in addition to the load factor could
lead to tight characterization of the area-time complexity of routing. However, even
within the load factor framework, the conditions on the above analogy suggest
several avenues for further research. With regard to the lower bound, it would be
interesting to see if the restriction to content-preserving routers could be removed.
However, the main problem left open is to design an N-terminal network that can
route any message set in 0(71 log N) time and O(N log2 N) area, matching the lower
bound of Theorem 9. We conjecture that such an optimal router would combine
features from both of our fat-trees. To route message sets with constant load factor
78
in logarithmic time, a network should, like the pruned butterfly, have tree nodes
with only constant depth. Moreover, the routing algorithm should simultaneously
deal with messages that have peaks at different levels, as in our sorting fat-tree.
Bibliography
[AHMP87]
II. Alt, T. Hagerup, K. Mehihorn, and F. P. Preparata. Simulation of
idealized parallel computers on more realistic ones. SlAM Journal on
Computing, 16(5)808--H835, November 1987.
[Ben65] V. E. Benes. Mathematical Theory of Connecting Networks and Tele-
phone Traffic. Academic Press, New York, 1965.
[Bil84]
[BL84]
[BP85]
[BP89]
[C1185]
[DS82]
G. Bilardi. The Area-time Complexity of Sorting. Ph.D. dissertation,
University of Illinois, Urbana-Champaign, Illinois, December 1984. Co-
ordinated Science Laboratory Report R-1024.
5. N. Bhatt and F.
graph layout problems.
28(2):300--H343, 1984.
T. Leighton. A framework for solving VLSI
Journal of Computer and System Sciences,
G. Bilardi and F. P. Preparata. A minimum area VLSI network
for O(log N) time sorting. IEEE Transactions on Computers, C-
34(4):336--H343, April 1985.
G. Bilardi and F. P. Preparata. Size-time complexity of boolean net-
works for prefix computations. Journal of the ACM, 36(2):362--H382,
April 1989.
5. A. Cook and II. J. Hoover. A depth-universal circuit. SlAM Journal
on Computing, 14(4)833--H839, November 1985.
E. Dekel and 5. Sahni. Binary trees and parallel scheduling algorithms.
IEEE Transactions on Computers, C-32(3):307--H315, March 1982.
5. Fortune and J. Wyllie. Parallelism in random access machines. In
Proceedings of the 10th Annual Symposium on the Theory of Comput-
ing, San Diego, California, pages 114--H118, May 1978.
79
[FW78]
80
[GL85j
[GL89]
[Gol82j
[GP83]
[Gre9O]
R. I. Greenberg and C. E. Leiserson. Randomized routing on fat-trees.
In Proceedings of the 26th Annual Symposium on the Foundations of
Computer Science, pages 241--H249, October 1985.
R. I. Greenberg and C. E. Leiserson. Randomized routing on fat-trees.
In 5. Micali, editor, Randomness and Computation, pages 345--H374. JAl
Press, Inc., 1989.
L. Goldschlager. A universal interconnection pattern for parallel corn-
puters. Journal of the ACM, 29(3):1073--H1086, October 1982.
Z. Galil and W. J. Paul. An efficient general-purpose parallel computer.
Journal of the ACM, 30(2):360--H387, April 1983.
R. I. Greenberg. The fat-pyramid: A robust network for parallel com-
putation. In W. J. Daily, editor, Advanced Research in VLSI: Proceed-
ings of the Sixth MIT Conference, pages 195--H213, 1990.
[11ar69] F. llarary. Graph Theory. Addison-Wesley, Reading, MA, 1969.
[ller9O] K. T. Herley. Deterministic Simulations of Shared Memory on Bounded
Degree Networks. Ph.D. dissertation, Cornell University, January 1990.
[KU88] A. Karlin and E. Upfal. Parallel hashing - an efficient implementation
of shared memory. Journal of the ACAl, 35(4):876--H892, October 1988.
[Law75] D. II. Lawrie. Access and alignment of data in an array proces-
sor. IEEE Transactions on Computers, C-24(12):1145--H1155, December
1975.
[Lei79]
[Lei8l]
C. E. Leiserson. Systolic priority queues. Technical Report CMU-CS-
79-115, Department of Computer Science, Carnegie-Mellon University,
April 1979.
F. T. Leighton. Layouts for the Shuffle-Exchange Graph and Lower
Bound Techniques for VLSL Ph.D. dissertation, Massachusetts Insti-
tute of Technology, Cambridge, Massachusetts, September 1981.
[Lei84] F. T. Leighton. New lower bound techniques for VLSI. Mathematical
Systems Theory, 17:47--H70, 1984.
[Lei85a] F. T. Leighton. Tight bounds on the complexity of parallel sorting.
IEEE Transactions on Computers, C-34(4):344--H354, April 1985.
[Lei85b] C. E. Leiserson. Fat-trees: Universal networks for hardware-efficient
supercomputing. IEEE Transactions on Computers, C-34( 10)892--H
900, October 1985.
81
[LM88]
[LMR88]
[MV84]
[NMB83]
[Pip77]
[PU87]
[PU89]
[PV8l]
[Rab89]
[Ran87]
C. E. Leiserson and B. M. Maggs. Communication-efficient parallel al-
gorithms for distributed random-access machines. Algorithmica, 3:53--H
77, 1988.
T. Leighton, B. Maggs, and 5. Rao. Universal packet routing algo-
rithms. In Proceedings of the 29?h Annual Symposium on the Founda-
tions of Computer Science, White Plains, New York, October 1988.
K. Mehlhorn and U. Vishkin. Randomized and deterministic simu-
lations of PRAMs by parallel machines with restricted granularity of
parallel memories. Acta Informatica, 21(4) :339--H374, November 1984.
D. D. Nath, 5. N. Maheshwari, and P. C. P. Bhatt. Efficient VLSI net-
works for parallel processing based on orthogonal trees. IFEE Trans-
actions on Computers, C-32(6):569--H581, June 1983.
N. J. Pippenger. Superconcentrators. SIAAI Journal on Computing,
6(2):298--H3O4, 1977.
D. Peleg and E. Upfal. The generalized packet routing problem. The-
oretical Computer Science, 53:281--H293, 1987.
D. Peleg and E. Upfal. The token distribution problem. SIAAI Journal
on Computing, 18(2):229--H243, April 1989.
F. P. Preparata and J. Vuillemin. The cube-connected cycles: A ver-
satile network for parallel computation. Communications of the ACM,
24(5):300--H309, May 1981.
M. 0. Rabin. Efficient dispersal of information for security, load bal-
ancing, and fault tolerance. Journal of the ACAl, 36(2):335--H348, April
1989.
A. G. Ranade. liow to emulate shared memory. In Proceedings of the
28th Annual Symposium on the Foundations of Computer Science, Los
Angeles, California, pages 185--H192, October 1987.
[RBJ88] A. G. Ranade, 5. N. Bhatt, and 5. L. Johusson. The Fluent abstract
machine. In Proceedings of the 5th AHT Conference on Advanced Re-
search in VLSJ? pages 71--H93, March 1988.
A. L. Rosenberg. Three-dimensional integrated circuitry. In H. T.
Kung, B. Sproull, and G. Steele, editors, Proceedings of the CMU Con-
ference on VLSI Systems and Computations, pages 69--H80. Computer
Science Press, Rockville, Maryland, October 1981.
[Ros81]
82
[Tar83] R. E. Tarjan. Data Structures and Network Algorithms. Society for
Industrial and Applied Mathematics, Philadelphia, PA, 1983.
[Tho8O] C. D. Thompson. A Complexity Theory for VLSL Ph.D. dissertation,
Carnegie-Mellon University, August 1980.
[UW87] E. Upfal and A. Wigderson. How to share memory in a distributed
system. Journal of the ACM, 34(1):116--H127, January 1987.
[Val76] L. G. Valiant. Universal circuits. In Proceedings of the 8th Annual
ACM Symposium on the Theory of Computing, pages 196--H293,1976.
[Wak68] A. Waksman. A permutation network. Journal of the ACM, 15:159--H
163, 1968.
