BIB-VERSION:: CS-TR-v2.0
ID:: CORNELLCS//TR93-1370
ENTRY:: 1993-10-14
ORGANIZATION:: Cornell University, Computer Science Department
LANGUAGE:: English
TITLE:: Fast Algorithms for $N$-body Simulation
AUTHOR:: Sundaram, Sridhar
DATE:: August 1993
PAGES:: 87
COPYRIGHT:: Sridhar Sundaram 1993 - All Rights Reserved
ABSTRACT::
Many physical models require the simulation of a large number ($N$) of 
particles interacting through pair-wise inverse square law forces. $N$-body 
simulations are employed in fluid-dynamics, biochemistry, astrophysics, 
electrodynamics and molecular dynamics. The computational problem is 
intrinsically hard and these simulations are time-intensive.

Existing algorithms exploit either the spatial proximity of particles or the 
temporal proximity of states. In this thesis, we formally combine the two 
approaches and present an algorithm with sequential time complexity 
$O(N^{4/3})$ to integrate $N$ uniformly distributed particles in 3D over one 
crossing time against the $O(N^{8/3})$ complexity of the direct method. Under 
reasonable assumptions, our algorithm is optimal. The core of the algorithm 
is the temporal multipole expansion of the field in terms of the space-time 
coordinates of the field-point.

We also present efficient parallel algorithms on the 2D and 3D Mesh and 
Hypercube which amortize communication costs through temporal multipole 
expansions. The parallel algorithms offer an order of magnitude improvement 
over existing algorithms for even $10^{4}$ particles.

A sequential implementation of the algorithm for two-dimensional $N$-body 
systems shows the predicted asymptotic scaling. A parallel version on a 
16-processor Intel iPSC/860 machine is also in conformance with theoretical 
expectations.
END:: CORNELLCS//TR93-1370
BODY::
Fast Algorithms for
N-Body Simulations
Sridhar Sundaram
Ph.D Thesis
93-1370
August 1993
Department of Computer Science
Cornell University
Ithaca, NY 14853-7501
FAST ALGOWTHMS FOR? N-BODY SIMULATIONS
A Dissertation
Presented to the Faculty of the Graduate School
of Cornell University
in Partial Fulfillment of the Requirements for the Degree of
Doctor of Philosophy
by
Sridhar Sundaram
August 1993
o Sridliar Sundaram 1993
ALL R?IGllTS RESERVED
Fast Algorithms for N-body Simulation
Sridhar Sundaram, Ph.D.
Cornell University 1993
Many physical models require the simulation of a large number (N) of parti-
cles interacting through pair-wise inverse square law forces. N-body simulations
are employed in fluid-dynamics, biochemistry, astrophysics, electrodynamics and
molecular dynamics. The computational problem is intrinsically hard and these
simulations are time-intensive.
Existing algorithms exploit either the spatial proximity of particles or the tem-
poral proximity of states. In this thesis, we formally combine the two approaches
and present an algorithm with sequential time complexity o(N4/3) to integrate N
uniformly distributed particles in 3D over one crossing time against the 0(N813)
complexity of the direct method. Under reasonable assumptions, our algorithm
is optimal. The core of the algorithm is the temporal multipole expansion of the
field in terms of the space-time coordinates of the field-point.
We also present efficient parallel algorithms on the 2D and 3D Mesh and Hyper-
cube which amortize communication costs through temporal multipole expansions.
The parallel algoritlims offer an order of magnitude improvement over existing al-
gorithms for even io4 particles.
A sequential implementation of the algorithm for two-dimensional N-body sys-
tems shows the predicted asymptotic scaling. A parallel version on a 16-processor
Intel iPSC/86() machine is also in conformance with theoretical expectations.
Biographical Sketch
Born in Tirunelveli in India, Sridhar wandered all over Northern India from Ranchi
to Patna to Shaktinagar to Delhi in pursuit of education and his parents, elder
brother and sister, He received a Bachelor of Technology degree in Computer
Science from Indian Institute of Technology, Delhi in August 1989. He received
his Master's degree from Cornell in December 1991 and his Ph.D degree, also from
Cornell in August 1993.
iii
Acknowledgements
This thesis would be incomplete without a mention of all the people who worked
behind the scenes to make it possible, directly or indirectly. I would like to thank
the following people.
My advisers: John Hopcroft who instilled in me the vision and love of research
and also encouraged me to attack this problem; David Chernoff, who nurtured
my ideas gently steering me away from numerous pitfalls, encouraged me when I
was down and whose patient help and guidance made this thesis possible; Keith
Marzullo who lent a patient ear when I discussed my problems with him and made
time for me from his busy schedule.
My friends in the department: Karl I3o?hnnger, Suresh Chari, Prasad Jayanti,
Mike Kalantar, Yanhong Liu, Dave Pearson, T.V. R?aman, Pankaj Rohatgi, Ar-
avind Srinivasan, Kjartan Stefansson, Ida Szafranska.
Juris llartmanis, who gave me a clear perspective on research; Rich Zippel,
who gave me a clear perspective on the N-body problem; Leslie Greengard, Feng
Zhao, Sverre Aarseth and Vasudha Govindan who sent me various pieces of codes
for testing purposes; Anne Gockel, who coaxed the Intel llypercube to work at a
iv
crucial time and numerous others whose help and advice was invaluable.
Of course, none of the above people are responsible in any way for any mistakes
or errors in the thesis. All such mistakes are the sole responsibility of Gilling, the
workstation on which this thesis was written!
v
Table of Contents
2
Introduction
1.1 Modeling N-body Problems
1.2 Brief History
1.2.1 Sequential Algorithins
1.2.2 Parallel Algoritlims
1.3 Outline of the Thesis
1.3.1 Philosophy.
1.3.2 Contributions
and Outline
Background
2.1			Physical Model . .			.
2.1.1 Conventions
2.2 The Direct Algorithm . . . .
2.3 Exploiting Spatial Proximity
2.3.1 Multipole Expansions
2.3.2 Fast Multipole Algorithm
2.3.3 Adaptive Multipole Algo?thm
Exploiting Temporal proximity
2.4.1 Individual Time-step Scheme
Complexities
2.4
2.5
3 The Temporal Multipole Expansion
3.1 The Temporal Expansion . .
3.1.1 Temporal Expansion for
3.1.2 Temporal Expansion for
3.2 Temporal Multipole Expansions
3.2.1 Temporal Multipole Expansions for Cells
4 Sequential Algorithm
4.1 Informal Description of Algorithm .
a Particle . .
Collection of Particles .
vi
3
3
4
4
4
5
9
9
10
10
11
11
12
17
19
20
20
22
22
23
26
28
29
30
30
4.2
4.3
4.4
4.5
Transient Particle Consistency . .			. .
4.2.1 Particle Consistency Operator
4.2.2 Manipulation of Consistency Operators
4.2.3 Compensating for Transient Particle Potentials
Temporal Multipole Algorithm with Particle Consistency
Complexity Analysis and Optimality
Blocking of Time-steps .
4.5.1			Time-step Evolution . . .			.
4.5.2			Block Updates and Efficiency
Parallel Algorithm: Homogeneous
5.1			Parallel Processing Model			. .
5.2			Communication.
5.3 Estimating the Validity Duration T?
5.4 The Parallel Temporal Multipole Algorithm
5.5			Complexity Analysis
5.5.1 Communication Efficiency
5.5.2 Time and Space Complexity
Parallel Algorithm: Non-Homogeneous
6.1 Load Balancing
6.1.1 The Center of Work Heuristic
6.2 Communication Efficiency
Parallel Algoritlim on Various Architectures
7.1			Relaxed Pyramid
7.1.1			Homogeneous . 			. .
7.2			Hypercube
7.2.1			Homogeneous
7.3			Mesh. .
7.3.1			Homogeneous
7.4			3D Algorithin on 2D Mesh			. .
7.5			A Distributed System
Implementation
8.1 Implementation Details
8.1.1			Scaling . . .			. . .
8.1.2 Tree of Cells
8.1.3 Taylor Series Integration
8.1.4 Temporal Multipole Expansions
vii
5
6
7
8
31
32
33
34
35
36
39
39
41
43
43
44
44
46
49
49
49
51
52
52
54
58
58
59
60
60
61
62
63
63
65
65
65
65
66
66
8.2
8.3
8.4
8.1.5 Optimizations of Multipole Manipulations
8.1.6 Parallel Optimizations .
Verification
8.2.1			Accuracy of Multipoles
8.2.2			Accuracy of Integration
Sequential Algorithm: Results			. .
8.3.1			Homogeneous Distribution			. .
8.3.2			Power-Law Distributions
8.3.3			Asymptotic Scaling . .			. .
Parallel Algorithm: Results
8.4.1 Computation and Communication
8.4.2 Scaling with Number of Processors
A Force-determined Time-steps
A. 1 Pathology with Force-determined Time-steps
Bibliography
viii
66
67
67
67
68
69
71
72
73
74
76
77
83
83
85
List of Tables
1.1			Time Line of Sequential N-body Algontlims			7
1.2			Time Line of Parallel N-body Algoritlims			.			. . .			. .			8
2.1 Space-Time Complexities in 2D and 3D for integrating N particles 21
4.1 Time Complexity for integrating N particles over 1 crossing time
in 3D			.			. .			. .			. . .			. 42
8.1
8.2
8.3
8.4
8.5
8.6
8.7
8.8
8.9
Multipole Order and Force Error .
Accuracy of Integration and Errors.
Time taken (in s) for different algoritlims: RANDOM
Time taken (in s) for different algontlims: POWER LAW 1
Time taken (in s) for different algontlims: POWER LAW 2
Asymptotic Scaling with N of Temporal Multipole algoritlim
Time taken (in s) for different algoritlims on 16 Processors
Time taken (in s) for different algoritlims on 4 Processors
Time taken (in s) for different algoritlims on 1 Processor
ix
67
69
80
81
81
81
81
82
82
List of Figures
1.1 Modeling Nature through N-body Simulations
2.1
2.2
2.3
2.4
2.5
2.6
2.7
Cell hierarchy for fast multipole algorithm . .			. .			.
Changing the origins of multipole expansions .			. .			.
Near and far potentials at a cell
Forming ?? for cell i not at bottom level			. .			. .			.
Forming ?? for cell zo+ not at top level
Cells for fast multipole algorithm in 2 dimensions . . .
Cells for non-uniform particle distributions in the adaptive algorithm
3.1 Temporal Expansion for a Particle
4.1 Particle Transitions across cell boundaries
4.2 Correcting ? with Consistency Operators
5.1
2
13
14
15
15
16
17
18
23
32
35
Validity duration of temporal expansion of particles inside sphere
at origin
5.2 Pyramid of Processors for the 1D System of Figure 2.1
6.1 Load Balancing: Distribution of Particles and Processors
6.2 Work redistribution based on center of work heuristic
7.1			Principle of Node chaining			.			.
8.1
8.2
8.3
8.4
8.5
8.6
8.7
8.8
8.9
45
o+ 47
o+ 51
53
o+ 60
Error in force computed using multipoles . . .
Integration precision against energy, position and velocity errors
Test Distributions of 500 Particles
Time taken by various algon'thms for homogeneous distributions
Time taken by various algorithms for POWER LAW 1 distribution
Time taken by various algorithms for POWER LAW 2 distribution
Asymptotic scaling with N of Temporal Multipole algorithm
Time taken by multipole algorithms on 16 processors o+ .
Tcomp, Twait and ToH for various configurations on 16 processors
68
70
71
72
73
74
75
76
77
x
8.10			Computation v/s Communication on 4 and 16 processors78
8.11 Time taken by temporal multipole algorithm on 1,4 and 16 processors 79
A.1			Pathology with Force-determined Tim&steps .			.			. . . . 84
xi
List of Assumptions
3.1.3 Statistical Averaging . .
4.2.1
4.4.1
4.4.2
4.4.3
27
Particle Movements
Minimum Computation			. .
Validity period for parent cell expansion
Non-exponential Density . .
5.5.1			Large Cell .			. .			. .
6.2.1 Evolution and Load Balancing . .
(3.2.2 Composition of homogeneous subsystems
xii
32
37
37
38
49
54
55
Chapter 1
Introduction
The N-body problem involves studying the evolution of a large ensemble of particles
interacting through a Coulombic or gravitational force. Given the state of N
particles at some initial time, the objective is to find the state of the particles
at a later time. This problem has applications in astrophysics, fluid mechanics,
molecular dynamics and electrodynamics. The problem is chaotic, characterized
by extreme sensitivity to initial conditions, and no analytical solutions are known
even for the 3-body problem. Computer simulations are often used to study N-
body phenomena.
1.1 Modeling N-body Problems
Modeling N-body phenomena on a computer requires first a physical model of the
system. Figure 1.1 depicts one model for the gravitational N-body problem. Stars
or galaxies are treated as point particles interacting through Newtonian gravity.
Given the positions fxj? and masses tmit of N particles, the evolution of each
particle is governed by Newton's second law of motion.
d2Xj			N
fl% dt2 ?jZ#iFii			fori			l...N			(1.1)
Here, Fi,j is the force exerted on particle i by particle I. For instance, the gravi-
tational force is given by
Fi,j			Cm?m?Xj--H Xj (1.2)
Ixi			--H Xj
The N coupled differential equations given by (1.1) govern the evolution of the
system. The system can be evolved by repeatedly solving the equations for the
2
COMPUTATIONAL MODEL
point panticles
finite time-steps
finite Tayk?r series
Figure 1.1: Modeling Nature through N-body Simulations
forces on each particle and advancing all particles through an infinitesimal time-
step.
For a computer simulation, the solution process can be conceptualized as fol-
lows. Infinite Taylor series expansions model the force of one particle on another
and integration of the expansions through infinitesimal time-steps evolves the sys-
tem. To make this physical model computationally feasible, a finite Taylor series
and finite time-steps are used, If all position derivatives up to the q'th are known
for a particle, it can be advanced through time-step At by straightforward Taylor
series integration.
xj(t t At) q (At?d5xi(t)
j=1 j!			dti
fori			1...N			(1.3)
This finite treatment introduces an unavoidable error. The error will be small in a
single integration step if the particles are advanced through a small distance and
the potential on the particles changes by a small amount. This alone, however,
is not sufficient since the chaotic nature of the N-body system causes extreme
sensitivity to initial conditions. Conventionally, the measure of divergence of the
two models is the difference in positions (and velocities) of the particles. Two N-
body configurations initially differing by an arbitrarily small amount can diverge
exponentially away from each other with increasing time. Thus, although the
3
error in each integration step is small, the total error could be large. Reif and
Tate [RT93], show that integrating N particles with 2--HN0?1? accuracy through
N0(1) time-steps is PSPACE-complete. This suggests that there is no polynomial
time computable closed form representation of the trajectory of N-body systems
and that the problem is intrinsically hard.
However, the statistical behavior of the system has been empirically observed
to be stable. Although the exact positions and velocities of an N-body integration
are unreliable, statistical properties turn out to be reliable even for a fixed precision
in each time-step. Dynamical systems theory offers some theoretical justification
for this in the form of "shadowing lemmas". The interested reader is referred to
Guckenheimer and Holmes [GH83J. Fortunately, researchers are usually interested
only in statistical properties of N-body systems. Large N (1o4?1o6) is often re-
quired for statistical validity of the results and efficient algorithms are therefore
very important.
1.2 Brief History
A lot of work has been done on N-body algorithms. Most of the work is on
sequential algorithms but in recent years, some work has also been done on parallel
algorithms.
1.2.1 Sequential Algoritlims
N-body algorithms can be classified on the basis of whether they exploit spatial
location of particles or temporal proximity of successive states to reduce simulation
time. See Table 1.1.
The particle-particle or direct method advocated by von Hoerner [vH6O] ex-
ploits neither, Aarseth's individual time-step algorithm [Aar63, Aar85J exploits
proximity in time while the Ahmad-Cohen [Ac731 scheme tries to exploit both
space and time proximity. All of these methods perform direct summation of
forces at each particle. Particle-Mesh algorithms exploit the spatial uniformity
of the particle distribution and obtain the potential by solving the Poisson equa-
tion on a mesh. They are fast but have limited resolution and are unsuitable for
non-uniform particle distributions. Particle-Particle/Particle-Mesh algorithms use
particle-particle methods to compute short-range interactions and particle-mesh
methods to compute long-range interactions, in an attempt to get both speed and
accuracy. They are very effective for nearly uniform distributions. Hockney and
Eastwood [HE89] provide a good review of most of these algorithms. Tree methods
organize the particles as a hierarchical tree and can be classified into three cate-
4
gories. Appel's algorithm [App85] which clumps particles based on nearest neigh-
bor relationships, the Barnes-Hut [BH86] tree algorithm which groups particles by
oct-trees and the fast multipole algorithm by Greengard and Rokhlin [GR87,Gre87]
(and independently discovered by Zhao [Zha87]) which is also oct-tree based, all
exploit only the spatial proximity of particles. While the first two have a time
complexity of O(NlogN) to calculate potentials on all particles, the last has a
complexity of 0(N). Each of these tree algorithms can be generalized to exploit
temporal proximity as well. Jernighan and Porter [JP89] generalize Appel's al-
gorithm, Aarseth and McMillan1 [MA93j generalize Barnes-Hut and this thesis
generalizes the fast multipole algonthm.
1.2.2 Parallel Algorithms
Considerable effort has gone into vectorization of N-body codes for fast execution
on a supercomputer. McMillan [McM86] discusses vectorization of the individ-
ual time-step method. Hernquist ?Her9O], Makino [Mak9OJ, Barnes [Bar9O] and
Schmidt and Lee [SL91] discuss vectorization of tree codes. In another direction,
GRAPE-3 (GRavity PipE-3) [OME+92], a specialized computer for gravitational
N-body simulations has been developed.
Some work has been done on parallel implementations also. See Table 1.2.
Salmon [Sal91] implements a parallel version of the Barnes-Hut algorithm on the
n-CUBE. Singh et al [SHT+92] study load balancing issues in parallelization of
the Barnes-Hut and Fast Multipole algorithms on the Stanford DASH multipro-
cessor. Parallel versions of the fast multipole algorithm have been implemented
on various architectures --H Zhao [Zha87i on an 8k-processor CM-2, Greengard and
Gropp [GG9O] on an 18-processor Encore Multimax, Zhao and Johnsson [ZJ9l]
on a 16K-processor CM-2 and Leathrum and Board [LB91] on a 8-processor B012
transputer torus. In all of the parallel tree code implementations, particles are
advanced synchronously in lock-step.
1.3 Outline of the Thesis
1.3.1 Philosophy
In this thesis, we attempt to provide more rigorous justifications for some empir-
ically well-known concepts. These justifications are necessary to understand the
new algorithms that we introduce. However, the N-body problem is intrinsically
chaotic and we are often forced to make assumptions which are valid empirically
1Came to the author's knowledge while this thesis was being written
5
but difficult to prove formally. We explicitly state such assumptions with intuitive
reasons for making them whenever possible.
The new algoritlims we introduce are intended for practical implementations.
The presentation format, however, has been chosen for ease of theoretical justifi-
cation. In the chapter on implementation, we explain how the algorithms would
actually be implemented.
1.3.2 Contributions and Outline
This thesis presents the temporal multipole algorithm which exploits both spatial
and temporal proximity.
In chapter 2, we briefly describe the fast multipole algorithm and the individual
time-step algorithm which are building blocks in our new algorithm.
In chapter 3, we derive the individual time step algorithm rigorously and de-
velop the new concept of temporal multipoles which expand the potential in space-
time coordinates. Unlike previous spatio-temporal algorithms, which are mostly
empirical, we give rigorous error bounds. Tools for manipulating temporal multi-
pole expansions are also discussed.
In chapter 4, we present a sequential algorithm based on temporal multipoles.
The notion of particle consistency across cells is introduced and blocked time-steps
are discussed. The algorithm is shown to be essentially optimal for any N-body
simulation.
A parallel version of the algorithm for homogeneous particle distributions is
described in chapter 5. The algorithm is designed to work on coarse-grained parallel
machines. It is shown that the communication costs can be effectively amortized
by using temporal multipole expansions and that the algorithm remains optimal.
Chapter 6 discusses non-homogeneous distributions. A local solution to the
issue of load-balancing is proposed and analyzed where processors redistribute work
among neighbors based on the "center-of-work" heuristic. For systems of particles
which consist of locally homogeneous subsystems, this load-balanced algorithm
would be nearly optimal.
In Chapter 7, we adapt the parallel algorithms to the mesh and the hypercube.
These algorithms are an order of magnitude improvement over existing algorithms
both in theory and in practice and make feasible simulations which previously were
inordinately expensive. For the hypercube, the algorithm is optimal while for a
mesh, it is almost optimal.
Finally, chapter 8 discusses implementation results for both the sequential and
parallel versions of the algorithms for two dimensional N-body systems. The se-
quential algorithm is tested against standard algorithms for three different input
distributions. In each case, it performs substantially better. The algorithm ac-
6
tually shows the predicted asymptotic scaling of time-complexity with number of
particles. The parallel algorithm, implemented on a 16-processor machine config-
ured as a 2D-mesh also does substantially better than the conventional algorithms.
While the performance scaling is not linear for 16 processors, it is expected to be
almost linear for a larger number of processors.
7
Table 1.1: Time Line of Sequential N-body Algoritlims
Year			Spacea Timeb			Algorithm			Complexityc
1960:			x			x			0(N813)
1963:			x			0(N7/3)
1973:			v'			v'			0(N25112)
v'			x			O(N513 log N)
v'			x			O(N513 log N)
1985:			v'			x			O(N513 log N)
1986:			v'			x			Tree			O(N5/3logN)
1987:			v'			x			0(N513)
1989:			v'			v'			O(N413 log N)
1993:			v'			v'			O(N4/3 log N)
1993			v'			v'			Temporal Multipole			o(N4 3)
von Hoerner
Aarseth
Ahmad-Cohen
Particle-Mesh
Particle-Particle/Particle-Mesh
Appel Tree
Barnes-Hut
Greengard Tree
Jernighan-Porter
Aarseth-McMillan
aExplojts Space Proximity
6Exploits Time Proximity
CTime to integrate homogeneous 3D system of N particles for one crossing time
8
Table 1.2: Time Line of Parallel N-body Algorithms
Year			Algoritlim			Architecture			Complexitya
1987:			Zhao			Hypercube			O(N213 log N)
1990:			Greengard-Gropp			Shared Memory			O(N213 log N)
1991:			Salmon			Hypercube			O(N213logN)
1991:			Zhao-Johnsson			Hypercube			O(N213 log N)
1991:			Leathrum-Board			iD Torus			0(N513)
1992:			Singh et al			DASH multiprocessor			0(N7/6)
1993 :			Temporal Multipole			Hypercube			0(N113)
1993 : Temporal Multipole 3D Mesh O(N'/3logN)
aTime to integrate a homogeneous 3D system of N particles for one crossing time
with 0(N) processors. Estimated where unavailable.
Chapter 2
Background
The total time complexity of an N-body simulation algorithm can be expressed as
a product of its spatial complexity and its temporal complexity. The first depends
on how efficiently it exploits the spatial proximity of particles while the second
depends on how efficiently it exploits the temporal proximity of states.
We briefly describe the fast multipole algorithm which exploits spatial prox-
imity of particles and the individual time-step algorithm which exploits temporal
proximity of states. These are used as building blocks in our new algorithm.
2.1 Physical Model
We will consider a set of N particles in a three dimensional coordinate system
interacting through a gravitational or coulombic force. To keep the treatment
general, we will use the following force model. If a single charge of intensity m is
located at x', the field and potential at any point x is given by
x --H
E(x,x') =			x --H x'13
x --H			= m			1
--H XI
(2.1)
The electrostatic and gravitational force laws can be recast in this form by suitable
normalization of variables.
Some physical systems are adequately described by two-dimensional analogues
of these equations where all the particles are confined to a plane. In such cases, a
two dimensional simulation is conceptually simpler and computationally cheaper.
In two dimensions, the field and potential at point x due to a charge of intensity
9
10
m located at x'is given by
E(x,x')
?x--Hx'I) --H--H
x --H
rn x--Hx'j2
mlog( ? 1
(2.2)
Note that a point charge in two dimensions corresponds to a line charge in three.
2.1.1 Conventions
We will often explain concepts in one dimension, which although unphysical, allows
a clearer exposition. As is conventional, we will denote the magnitude or norm
of a vector x by x. Since the potential is a scalar function of a scalar variable,
while the force is a vector function of a vector variable, we will use the potential
to simplify theoretical development. This entails no loss of generality because the
force can be obtained as the gradient of the potential. The physical potential will
always be represented by ?. We will use ? (which is somewhat confusing but by
now standard) to denote the multipole expansion of the far potential due to a
collection of particles. While ? is the actual potential, ? is a representation of
that potential as a multipole expailsion.
2.2 The Direct Algorithm
The simplest and most straightforward method of integrating an N-body system
is the brute-force or direct method. The disadvantage of this simple method is
its time-complexity. All N(N2--H1) pairs of particles are interacted in each time-step
giving an 0(N2) time-complexity per time-step. This is too high even for N as
small as 1000. The algorithm is sketched below.
1. Find force on each particle due to every other particle by direct pairwise calcula-
tion.
2. Assume that the force does not change for a short time-step ?? and advance all
particles through the time-step At.
3. Go to step 1.
11
2.3 Exploiting Spatial Proximity
Spatial proximity of particles can be exploited to decrease the amount of com-
putation required. Particles which are close together experience approximately
the same force from far away particles. This intuition translates into the formal
concept of multipole expansions.
2.3.1 Multipole Expansions
The multipole expansion expresses the potential due to a particle as a Taylor series
expansion in the coordinates of the point at which the potential is to be evaluated.
Example 2.3.1 Let the potential, in one dimension, be given by ?(r) = ri at
distance r from a point charge. Let a unit charge be placed at point x'. Then, the
potential at some point x, x > w',is given by
1			1 x'
_ x ?1 x			(2.3)
This series sum, expressing the potential at any point x due to the charge at point
x`,is called the multipole expansion of the potential due to the charge. If all terms
_ and higher in the series are ignored, the error in the approximation will be
p
of order ? . This truncated series corresponds to a p-term multipole expansion
of the potential. The multipole expansion converges only when x > x'. If x > 2x1,
the error is very small at O(2--HP) for a p-term expansion and in this sense the
expansion converges rapidly outside a circle of radius 2w'. Being only interested
in finite multipole expansions, we will use the term convergence to mean rapid
convergence.
If we have n charges at points x'0, x?1,... , x?, such that x > 2x'? for i ranging
from 0 to n, then, the potential due to the charges at x can be found as summations
of the potentials due to the individual charges as given by (2.3). ?
For an upper bound on the error of? atruncated multipole expansion with O(log ?e')
terms suffices. The field at a point due to a collection of particles can be obtained
as the sum of the multipole expansions of the particles. By exploiting spatial
proximity cleverly to combine multipole expansions, the forces on all N particles
can be computed in 0(N) time (as against 0(N2) for direct).
Note that the sum of the multipole expansions of a set of particles does not
converge rapidly in the neighborhood of the set. In tree algorithms, the hierarchical
structure induced by a tree is used to sum up the expansions fast when particles
are far away. For particles close by, the potential is found by pairwise direct
interaction.
12
2.3.2 Fast Multipole Algorithm
We describe the fast multipole algorithm ([GR87]) in one dimension. We will
describe how to evaluate the potential at a point. The force can be obtained as
the gradient of the potential.
Consider N particles arranged along a straight line. The fast multipole algo-
rithm takes a cell-oriented rather than a particle-oriented view. The line is divided
into equal length intervals which we will call cells. The cells touching a cell (in-
cluding itself) are its nezghbo?s.
2.3.2.1 Near and Far Potentials
The potential at a particle a located in cell ? is partitioned into the fac field and
the neac field so that
= ?a,farfidd t ?a,ne&fidd			(2.4)
The near field potential, due to particles in the neighbors of cell i, is calculated by
direct particle-particle interaction. The far field potential, attributed to particles
outside the neighbors of cell i, is evaluated as a multipole expansion at cell i.
This expansion is used to compute the far potential at each particle in the cell.
The partitioning is necessary for two reasons. Firstly, the multipole expansion
of particles in a cell does not converge sufficiently rapidly in its neighbor cells1.
Secondly, the near field is often modified/softened in computer simulations for
physical reasons.
To efficiently find the far potential at a cell, a tree-structured hierarchy of cells
is introduced. The cells into which the line is divided are at level n, the bottom
level. They are connected in a binary tree fashion as shown in Figure 2.1. Two
children cells at level h combine to form a parent cell at level (It --H 1). The cell at
level 0, the top level, is identified with the 0'th cell and does not have a parent.
2.3.2.2 Notation
Before describing the algorithm, we introduce some notation.
Parent(i) parent of cell ?, i not at level 0 (e.g. Parent (1 0) is 4 in Figure 2.1)
Children(i) set of children of cell i, i not at level n (e.g. Children (1 0) is ?21,22J)
Neighbor(i) set of neighbors of cell i (including i) (e.g. Neighbor(1O) is f9,lQllJ)
1In 3 dimensions, convergence is not sufficiently rapid even in cells next to neighbor cells. In
this case, the near field is redefined as being due to particles in the cell's neighbors and the cells
next to the neighbors. The far field is also appropriately redefined.
13
pointers to children cells
I,			0
Mm			Mm
Level
lo
I 1
2
7			8			9			10			11			12			13			14			?
1516-17-1819202+,4--- 23 24 25 26 27 28 29 30 4
Line of particles cell
Figure 2.1: Cell hierarchy for fast multipole algorithm
Ilist(i) set of cells which are children of neighbors of i's parent, but which
are not neighbors of cell i (e.g. IList(1O) is f 7,8,12?)
?? the p-term multipole expansion with origin at center of cell i,
describing potential outside Neighbor(i) due to all particles inside cell i
?? the p-term multipole expansion with origin at center of cell i,
describing potential inside cell i due to all particles outside Neighbor(i)
The potential, ? (?), is represented as a multipole expansion about an origin point
in space and converges outside (inside) a sphere about that origin. Two multi-
pole expansions about different origins cannot be directly added. Some technique
to shift the origins of multipole expansions is required that guarantees conver-
gence. We abstractly describe some operators for this purpose following Katzenel-
son [Kat88j (for details see [GR87j). See Figure 2.2.
Tran5late??j :			??,			k e Children(i)
Shfftk,j :			??,			k Parent(i)
C0nvert?,j :			??, k c IList(i)
The hatted symbols denote partial contributions to the appropriate potential at
that cell. While the functions ?anslate and Shift introduce no error for finite
14
Translat?,?
cellk
Shiftk,i
CO?V?rt?,i
Figure 2.2: Changing the origins of multipole expansions
multipole expansions, Convert does introduce an error proportional to 2--H? for a p-
term multipole expansion. Let `t' be an operator which adds two p-term multipole
expansions term-wise to produce a new p-term multipole expansion. Using these
operators, we can specify equations to form ? and ? for each cell.
--H			?			Translat%,j(?j)			(2.5)
jEChildren(i)
Shiftparent(i),i(?Parent(i)) t s C0nvert?,j(??)			(2.6)
kEiList(i)
0			(2.7)
The recursive equations above are the driving force behind tree algorithms in
general and the fast multipole algorithm in particular.
2.3.2.3 The Algorithm
In essence, the fast multipole algorithm is an efficient method of finding the far
potential ? at each cell at the bottom level due to all particles outside the neighbors
of the cell. See Figure 2.3. The far potential on a particle can be found by
evaluating the ? of its cell at its position. The near potential due to particles in
the neighbor cells is found by direct pairwise interactions. The total potential is
15
Contribute to far potential
Non--Hneighbors of cell
Neighbors of cell
Contribute to near potential
Figure 2.3: Near and far potentials at a cell
5			k
Figure 2.4: Forming ?? for cell i not at bottom level
just the sum of the far and near potentials. We first give an informal overview of
the algorithm followed by a more formal description.
In the first pass of the algorithm, ?, the far potential outside each cell due to
particles within the cell is formed. For each bottom level cell, ? is directly formed
as the sum of multipole expansions of all particles within. By equation (2.5), ?
for a lower level cell can be formed when the ? for each of its children is known.
This is shown in Figure 2.4. Therefore, ? can be formed for each cell in the tree
moving from bottom to top.
In the second pass, ?, the far potential at each cell due to all particles outside
the neighbors of the cell is formed. For the 0'th cell, which comprises the entire
system, ? is zero. For higher level cells, ? is formed using equation (2.6) from top
to bottom in the tree. This is shown in Figure 2.5.
16
I Parent(z) I
??arent(i) ?			k
j			k
Figure 2.5: Forming ?? for cell ? not at top level
Once ? is available at each bottom level cell, the far potential can be found
at each particle. For each particle in each bottom level cell, the particle's position
coordinates are substituted into the multipole expansion ? of the cell to obtain
the far potential at the particle. The near potential at each particle is found by
direct interaction with all particles in the neighbors of the cell.
We give a more formal description of the fast multipole algorithm below.
0. Choose number of levels in the tree n log2 N, a precision 6, and order of
multipole expansion p = [--H log2 c?
1. For each cell i at level n: Compute ?? as the sum of the multipole expansions
of the particles in the cell about the center of cell i.
2. For each cell ? at levels less than n: Compute ?? using equation 2.5 moving up
the tree of cells.
3. For each cell z at levels greater than 0: Compute ?? using equation 2.6 moving
down the tree of cells.
4. For each particle a in each cell i at level n:
(a) Evaluate ?? at a's position to get the far field potential at the particle.
(b) Find the near field potential at a due to interactions with particles in Neigh-
bor(i).
(c) Add far and near potentials to get total potential on a.
Note that in this algorithm, accuracy is controlled entirely by the order of
the multipole expansion p unlike the Barnes-Hut algorithm which has a tunable
parameter 0 in addition to the multipole expansion order.
17
Remark Time and Space Complexity: Shifting, translating and converting mul-
tipole expansions are all 0(1) (actually O(p2) in 2D and O(p4) in 3D) operations
once p, the number of terms in the multipole expansion has been fixed, Each cell
at the bottom level contains 0(1) number of particles and for a uniform distri-
bution of particles (see next subsection), the total number of cells at all levels is
0(N). Hence, the amount of computation in conversions, translations and shifts
which occur in the algorithm is 0(N) as well. The work done in the near field
computation is also 0(1) for each cell at the bottom level. Thus, the total work
done to compute forces on all particles is linear in N, the number of particles
(per time-step). The space complexity is easily seen to be 0(N) for homogeneous
distributions.
Remark Highe? Dimensions: Generalization of the fast multipole algorithm to
higher dimensions is straightforward. Figure 2.6 depicts cells in two dimensions.
?I1:,1 1:'1i1:,1
I...
1:1			1:111:1
I --H--H--H--Hi--H.
11:n fl?fli11
`1 :fl X:fli 1			1
I...
1 :,fl fl:fli 1			1
11:1			1:111:1
I			:			:			I
I			I
I			:			:
:			:			I
Figure 2.6: Cells for fast multipole algorithm in 2 dimensions
Cells at different levels are shown with different kinds of lines.
Cells marked n are in Neighbor(x). Cells marked i are in IList(x).
Each cell now has 4 children, 9 neighbors (including itself) and 27 ilist members.
The rest of the algorithm remains unchanged.
2.3.3 Adaptive Multipole Algorithm
For non-uniform particle distributions, the algorithm described in the previous
section creates a large number of empty or nearly empty cells and its time com-
18
plexity per time-step is no longer linear in the number of particles. An adaptive
algorithm, which starts with one large cell containing all the particles and divides
it recursively into smaller cells only when the number of particles in a cell is suffi-
ciently large, performs much better. With this algorithm, the cell hierarchy does
not have a fixed number of levels, and leaf cells need not be of the same size. See
Figure 2.7.
Figure 2.7: Cells for non-uniform particle distributions in the adaptive algorithm
There is at most one particle in any box.
The disparity in cell sizes complicates the algorithm. The function "Convert"
cannot be used to move the multipole expansion of a cell k in IList(i) to cell i,
when the size of cell k is larger than the size of cell i. The converted multipole
expansion does not converge within cell k. Instead, conversion has to be done on
a particle by particle basis. Imagine an infinitesimal cell around each particle in
cell k. "Convert" can be used to move this cell's multipole expansion to the center
of cell i. Similarly, "Convert" cannot be used when cell k in IList(i) is smaller in
size than cell i but not sufficiently far away. In this case, the far potential due to
the multipole expansion of cell k has to be directly evaluated at each particle in
cell i. For details see tGR87].
With these modifications, it can be shown that the adaptive multipole algo-
rithm has a linear complexity in both time and space for arbitrary particle distri-
butions per time-step.
19
2.4 Exploiting Temporal proximity
The time-proximity of particles can also be exploited to decrease the amount of
computation required. The intuition is that the force on a particle is almost the
same from one instant of time to the next. Therefore, the force can be extrapolated
avoiding expensive recalculation. In simulations, where particles evolve on different
time-scales in different regions, exploiting time proximity is particularly effective.
We will formalize this notion in the next chapter, but algorithms based on this
approach already exist ?C73,JP89,MA93?.
Time-steps in the computational model are usually chosen so that each particle
takes O(El) steps to reach its closest neighbor, with 6 chosen suitably small. This
ensures that the computational model is close to the physical model. When all
particles share the same time-steps, this time-step is determined by the closest
pair of particles in the entire system. With individual time-steps this restriction
is not imposed and particles evolve at their own rates.
Example 2.4.1 We can estimate the savings obtained using individual time-steps
in integrating a homogeneous two-dimensional distribution of particles for a con-
stant amount of physical time. Consider a homogeneous distribution of N particles
inside a unit circle. (Below, indicates proportionality.)
Individual Time Steps: The average inter-particle separation of N 1 deter-
MN
mines the time-step for the individual time-step scheme. On average, each
particle will, take ? number of time-steps to evolve through a fixed
amount of physical time. The number of time-steps for the individual time-
step scheme is
Shared Time Steps: The expected value of the minimum inter-particle sepa-
ration can be calculated to be Q. Therefore, on average N N shared
time-steps will be required to evolve the system through one crossing time.
The number of time-steps for the shared time-step scheme is
Similar estimates can be made for power law and other distributions with more
effort. In each case, the individual time-step scheme requires a much lower number
of iterations. E]
For homogeneous 3D distributions, Makino and Hut [MH89] estimate that in-
dividual time steps yield a gain of O(N?) over shared time-steps for integration of
all particles over one crossing time2.
2The crossing time (more precisely the half-mass crossing time) is the time needed to traverse
the inner part containing half the mass of a self-gravitating system of particles at a speed equal
to the dispersion velocity.
20
2.4.1 Individual Time-step Scheme
The fast multipole atgorithm imposes a common or shared time-step on all par-
ticles and all particles move forward in lock-step fashion. In individual time-step
schemes, each particle has its own time-step based on how rapidly the force experi-
enced by it changes. As we just saw, individual time-step schemes are a substantial
improvement over shared time-steps. However, care has to be taken to ensure sta-
bility. We describe Aarseth's individual time-step algorithm [Aar85] briefly. Step
2 in the algorithm has been chosen to ensure stability.
Let the current time be I and suppose that particle b needs to be moved at
time 16.
1. Find particle a such that (Ia --H I) is minimum.
2. Project all particle positions to time 1a using derivatives of their old forces (the
old Taylor series expansion).
3. Calculate force on particle a due to all other particles (by direct interaction).
4. Find new Taylor series expansion for particle a and use this expansion to predict
a new time-step ?a
5. Set I to Ia and Ia to Ia t &Ia. Go to step 1.
2.5 Complexities
Usually, N-body simulations are done for a duration which is some multiple of a
physical time. This physical time is usually the half-mass crossing time. The time
complexity of an algorithm is the number of computational steps required to evolve
the system from the initial conditions to the desired time. It is instructive to look
at the complexities of the algorithms we have discussed so far for homogeneous 3D
particle distributions. See Table 2.1.
The direct pairwise algorithin uses neither space nor time optimizations. It
requires 0(N2) time per iteration. Since shared time-steps are used, the number
of iterations required is o(N2/3). The total time complexity is 0(N813).
The fast multipole algorithm uses sophisticated space optimizations to reduce
the time per iteration to 0(N) but still uses shared time-steps. Its total time-
complexity is 0(N5/3).
The individual time-step algorithm does no space-optimization whatsoever but
reduces the required number of iterations to 0(N113). Its total time complexity is
o(N7/3).
21
Table 2.1: Space-Time Complexities in 2D and 3D for integrating N particles
Complexitya			2D			3D
Algoritlim ? Space Time Total Space Time Total
2			8
Direct Shared			2			1			3			2			3
1			5			1			7
Direct Individual			2			2			2			2			3
2			5
Fast Multipole			1			1			2			1			3
1			3			1			4
Temporal Multipole			1			2			2			1
aExponent q of N in O(N?) to integrate homogeneous system of N particles over 1 crossing time
The temporal multipole algorithm of this thesis combines the space optimiza-
tions of the multipole algorithm and the time optimization of the individual time-
step algorithm to obtain a total time-complexity of o(N4/3).
Chapter 3
The Temporal Multipole
Expansioii
Multipole expansions express the potential due to a set of particles as a series
expansion in the spatial coordinates. We introduce the more general ternporal
rnuUzpole expanstons which express the potential due to the set of particles as a
series expansion in space-time coordinates.
The benefits are twofold. First, a temporal multipole expansion can exploit
both spatial and temporal locality of particles. Within its range of validity, it
can be used to calculate potentials inexpensively and accurately. This can lead
to substantial speedups as discussed in the next chapter. Secondly, the time of
creation of a temporal expansion is decoupled from the time of its use. Parallel
algorithms based on temporal expansions can, therefore, amortize communication
costs over successive N-body iterations very efficiently. This will be discussed in
chapter 5.
3.1 The Temporal Expansion
We first discuss the temporal expansion in isolation. Combining the temporal
expansion with the multipole expansion yields the temporal multipole expansion.
Temporal expansions are based on the following simple idea. Consider the far
force at a point due to a set of particles. Changes in the far force are caused by
movement of particles in the set. By characterizing this movement in terms of
derivatives of the particle positions with respect to time, we can predict future
values of the force.
Let a particle be integrated through one time-step, using a Taylor series ex-
pansion of the potential at the particle. The physical model requires an infinite
22
23
o x(t0 t T)
x(to)
0
Figure 3.1: Temporal Expansion for a Particle
number of terms in the Taylor series and infinitesimal time-steps, while in any
computational model, only a finite number of terms and finitely large time-steps
are feasible. This can lead to divergence of the computational model from the
physical model as discussed in Section 1.1. Nevertheless, we can characterize the
divergence between physical and computational models provided both use finite
Taylor series expansions. The following lemma examines how this error behaves
with the number of terms in the series and the size of the time-step.
3.1.1 Temporal Expansion for a Particle
We will express the potential at a space-time point, (r, t?, as a time-varying func-
tion represented by ?((r,t?).
Lemma 3.1.1 (Temporal Expansion for a Particle) Consider a unit point charge
located at (x(t0),t0?. The potential at the origin (o,t?, fort ? [to--HT,to+r], where
T is the time-step through which the charge is integrated, is given by
ffl&t? where bj			1 dkb(x(			(?to)k--Hi			(3.1)
i=O			7!			dtktO))
k=i
Let TI be a fixed positive constant with 7? ? [0, 1%?2? Let time-step T be chosen such
that
y d?x(t0) ? 7?? x(t0)
dt?
where i > 1			(3.2)
24
Then, for any integer q, q > 1 and for d-dimensional potential function ?,
q (t--Ht0)?d??(x(t0))
i=O			dt?
q+1
I?(x(to)) d--H2			7?
_			(3.3)
Proof. We first expand ?((o, t?) as a Taylor series with respect to t.
--Ht0)?d??((o,t0?)			(t--Ht0)?d??(x(t0))
= ?			i!			dt?			i!			dt?
i=O			i=O
Using MacLaurin's theorem to express the series in powers of t we obtain (3.1).
To obtain the error bound (3.3), note that the error E is given by
q (t--Ht0)id??(x(t0))			oo			i d??(w(t0))
dt?			--H			?			dt?
i=O			i=q+1
(3.4)
(3.5)
Using an elementary theorem of calculus on differentiating implicit functions [Edw92j,
T?d??(x(t))			T?dr?(x(t0))			(3.6)
n!			&?			= ? ?Kr r!			dxr
r=1
where ?Kr is the coefficient of T? in (x(to t T) --H x(t0))r
We can bound x(to + r) --H x(to) using (3.2) as follows.
Ix(to+T)--Hx(to)I ?			d?x(t0)
w1).			dt?
 --H7??x(t0) ? (e? --H 1)x(to)			(3.7)
i-i
Therefore,
Ix(to + T) --H x(to)Ir ? (e" --H 1)rX(t0)r ? IT? ?KrI ? [(e" --H 1)? : 7?fl]X(t0)r (3.8)
def
where [Z : 7??] = term involving 7?? in Z.
Now, consider the 3D potential function ?(x(t)) =
function, ?(x(t)) = --H logx(t), the argument is similar.)
dr?(x) _ r!?x)
dxr			xr
(For the 2D potential
(3.9)
25
Combining (3.6), (3.8) and (3.9),
T?d??(x(to))
n!			dt?
Using this in (3.5), the error E is given by
E ? ?x(to))I
T? dr?(x(to))
?Kr
r!			dxr
?(x(t0)) ?ke" --H 1)? :
? ?(x(t0)) Q[(e? --H i)r rn]
_ ?(X)f[2 --H
(3.10)
i=)+i ?2$e? : 7Ii]			(3.11)
To upper bound the summand, we look at the function [(1 --H ffl)?1 : ?jk?, k> 1.
??1??e?7??i .7?k1			? e??			k			k k			k
2			:?]			?????????Io?)idi			(3.12)
Z[2i
The integral above can be simplified as follows.
ooykd			ooykd			1			0o-zkd			F(k+1)			______
o2Y? --H ??ylog2Y			(10g2)k+l0CZZ			(log2)k+1 --H (log2)k+l
(3.13)
Combining (3.11), (3.12) and (3.13),
_________			17q+1
E ? ?(x(t0)) ?%+i 2(log'2)i+1 ? ?(x(?o))i 2(log2)?(log2 --H (3.14)
For ? less than I%?2, the error bound in (3.3) follows. [Z
Remark Finite Number of Terms: While (3.2) requires an infinite set of inequali-
ties to be satisfied, note that if only q derivatives of x are maintained, the inequali-
ties for higher derivatives are automatically satisfied, since they are all taken to be
zero. Similarly, for a q-term temporal expansion, the error-bound proof requires
summation only up to q terms (and not co) to evaluate the coefficients bj.
Corollary 3.1.2 In lemma 3.1.1, the error after k time-steps with parameter ?
$ ? log 2) is approximately the same as the error after 1 time-step with parameter
where n' is k? for worst case error and nY? for expected error, provided n? c
[0, l%?21
26
Proof. Since, ?1 is small, the size of the time-step remains approximately the
same in successive iterations. This justifies the worst case error bound. For the
average case, note that errors and changes in positions in successive iterations can
be thought of as normally distributed random variables. So their sum will have an
expected error corresponding to rffik. ?
llere, r plays the role of an efficiency parameter. For smaller values of r? the
computational model is more faithful to the physical model, at the expense of
increased computation. With a fixed value for r? the error in potential is relative
to the original potential and could be large in close encounters. If 77 is made a
function of r, the absolute error in potential can be controlled. We will discuss
later how r is determined. For a desired precision, 6, the number of terms in the
temporal expansion can be determined from (3.3).
We define the validity pernod of a particle temporal expansion as [t0 --H T, t0 t r],
with to and T as in the above lemma. The temporal expansion can be used with
the appropriate error bounds within its validity period. The time-step r will be
referred to as the validity duration of the temporal expansion.
The validity period has been chosen based on position and position derivatives.
This is because, conventionally, position is thought of as the independent variable
and errors are measured with respect to position and its derivatives. Force or po-
tential derivatives can also be used to determine time-steps. Since, ddtx2is equal
to V?(x), this is meaningful and equally effective in the absence of close encoun-
ters. This choice is used in Aarseth's Individual Time-step scheme discussed in
section 2.4.1. Although Aarseth's scheme is widely used, this is the first rigorous
justification of its fidelity to a physical model using a finite Taylor series. (See
discussion in section 1.1).
With close encounters, however, time-steps based only on potential/force deriva-
tives (i.e. ignoring the first derivative of position since only the second and higher
derivatives are thus accounted for) can on occasion lead to inaccuracies. Appendix
A illustrates such a situation. The main problem is that the particle velocity is not
bounded and can lead to anomalous situations. Moreover, higher time derivatives
of potential can be non-zero even if only q derivatives of x or ? are used (since
the potential is an implicit function of x). This makes it difficult to enforce the
inequalities (3.2).
3.1.2 Temporal Expansion for Collection of Particles
The temporal expansion due to a collection of particles is just the sum of indi-
vidual particle temporal expansions. We should require the validity period of this
expansion to be the minimum of the validity periods of the constituent particles.
27
However, this is a very stringent requirement. Although the potential due to a
distant cluster of particles might change negligibly at a point, it will have to be
re-evaluated frequently. Instead, we will give a statistical interpretation to the far
potential due to a cluster of particles.
In other words, the temporal expansion of a collection of particles has a validity
and meaning beyond that of the particles of which it is comprised. It statistically
averages out random fluctuations of the particle expansions and remains an accu-
rate representation of the potential due to the collection of particles. In practice,
this interpretation has been used before in N-body codes ([AC73,MA93j) with ex-
cellent results. Considering that N-body simulation itself has only a "statistical"
validity, this is a reasonable interpretation.
Thus, we make the following assumption.
Assumption 3.1.3 (Statistical Averaging) The temporal expansion of a spa-
tially proximate collection of particles can be used with the appwpriate erwr bounds
within an empirneally determined validity period even though individual particle ex-
pansions may not be valid.
Derivatives of the potential (rather than the positions) are used to determine the
time-step in the theorem below. This is not theoretically justified as we have
already discussed but has been empirically found to be very satisfactory and has
been in use for almost three decades.
Theorem 3.1.4 (Temporal Expansion) Suppose that the k `th of 5 charges has
strength mk, space-time coordinates (xk, tk?, with xk > R, and temporal expansion
?k((0? t?) at the space-time point (o, t?, with corresponding coefficients bk (3.1)
and error-bound (3.3), where k 1,... ,
Then, at time t, the potential ?((o, t?) induced by the charges is given by
=			where			= ?m'kbk,j			(3.15)
j=O			k=1
In the above, t E [T?,L, T?,HJ, where [T?,L, T?,H], the validity period of the
expansion, is defined by
= [?=m1ax tk--HT?, min tk+T?1 (3.16)
 s			ki
			s
with T? chosen such that
d??((o,t?) ? p? I?((o, t?)
dt?
i> 1			e [0 log2
(3.17)
_			`2
28
Furthermore, under reasonable assumptions, for the d-dimensional potential func-
tion ? and anu q > 1
q
___			A			(3.18)
j=O
d-2 ___
s2
where A			? m? in the worst case, and A =
k=1			k=lk
in the expected case.
Proof. The form of the temporal expansion (3.15) is an immediate consequence
of the preceding lemma and the fact that ?(o,t) = Z?--Hi ?k((o,t?).
For the error bound, note that since 1? is a lower bound for xk, ?(?) must be an
upper bound for ?(x?). Using this fact, (3.18) is obtained as the sum of the errors
due to the individual particles in the worst case. For the expected case, we simply
treat individual particle errors as normally distributed random variables. El
Note that, the error bounds in the above proof are strictly valid only in the
time range when individual particle temporal expansions are valid. However, we
are using the expansion even when individual particle expansions themselves are
no longer valid. Therefore, although theorem 3.1.4 is motivated by lemma 3.1.1, it
does not follow automatically and is valid only under assumption 3.1.3. Another
way to look at it is that, by corollary 3.1.2, the error is small after a few iterations.
The evolution of the potential at (o, t? in this time range is a good indicator of
how the potential will vary at future times. Therefore, although the validity period
determined by the potential derivatives is not strictly valid, for non-pathological
particle distributions, the error will remain small. If we wish to be conservative, we
can force the temporal expansion t() have a validity period which is the intersection
of the validity periods of its constituent particles.
3.2 Temporal Multipole Expansions
If we think of the potential function as being specified by the multipole expan-
sion, the temporal expansion of this potential is the temporal multipole expansion.
The gradient of this potential expansion is the force. All the error bounds and
operations on multipole expansions are automatically valid on temporal multipole
expansions because differentiation is a linear operation and the temporal expansion
is just the summation of time derivatives of the spatial multipole expansion.
Theorem 3.2.1 The functions Translate, Shift and Convert can be used to ma-
nipulate temporal multipole expansions, with error bounds identical to those for
multipole expansions (as in [Gre87,Zha87j).
29
The error in the actual potential is now the sum of the individual errors due to
the multipole expansion and the temporal expansion. The number of terms p in
the multipole expansion and q in the temporal expansion are chosen so that the
two errors are of the same magnitude. The resulting (p, q) term expansion will be
referred to as a temporal multipole expansion. Note that, the temporal expansion
is not entirely symmetric in the four space-time coordinates since the three space
coordinates of a particle depend on the time coordinate.
3.2.1 Temporal Multipole Expansions for Cells
We will use temporal multipole expansions to combine the fast multipole algorithm
and the individual time-step scheme. To that end, we discuss temporal multipole
expansions for cells.
For leaf cells, temporal multipole expansions are simply formed as the sum of
the temporal multipole expansions for individual particles. For lower level cells,
temporal multipole expansions are formed recursively as in (2.5),(2.6) and (2.7).
At the same time that the expansions are formed, a validity period is also asso-
ciated with them beyond which they should not be used. The validity periods
[T?j,L? T?j,H]? for ??, the far potential at cell ? due to outside particles, is cal-
culated directly using (3.16). The validity period [T?j,L? T?z,H]? for ??, the far
potential due to particles inside cell i, is calculated in the manner of (3.16) b?
taking the minimum validity period computed at each cell j's center such that jis
in IList(j). Specifically,
(3.19)
mill t[T?,L,T?,H1 calculated with center of j as origin in (3.16)1
fj:iEiList(j)1
In the next chapter, we use the machinery created here, to develop the sequen-
tial temporal multipole algorithm.
Chapter 4
Sequential Algoritlim
We first describe a sequential temporal multipole algorithm. The algorithm uses
temporal multipoles to combine the fast multipole algorithm and the individual
time-step algoritlim. We show that this algorithm is the best possible under rea-
sonable assumptions.
4.1 Informal Description of Algorithm
The algoritlim combines the space optimization of the fast multipole algoritlim
and the time optimization of the individual time-step scheme. The framework is
the same as in the individual time-step scheme. Each particle has an update time
up to which it can be integrated using its old Taylor series. The particle is updated
by recalculating the force on it and finding a new Taylor series expansion. Near
forces are directly calculated and far forces are evaluated using temporal multipole
expansions for efficiency.
The essential new idea in the algorithm is to be "lazy" or to do the minimum
amount of work necessary. A multipole expansion is recomputed only when its
validity period has expired. When recomputing the expansion, another expansion
might need to be used whose validity period has also expired, in which case that
expansion is recursively recomputed and so on. We will refer to the process of
checking the validity of an expansion and recomputing it if necessary as validalion.
The algorithm is described below in more detail. We use the same notation
used in the description of the fast multipole algorithm and the individual time step
scheme in chapter 2. ? and ? are now, however, (p, q)-term temporal multipole
expansions.
1. Particle a, with the earliest update time, is identified. Let cell i contain a
and let update time for a be ?a
30
31
2. All particles in all the neighbors of? (including itself) are integrated to time
ta using their old Taylor series expansions.
3. The near force on a is found as the sum of the forces exerted on a by all the
particles in ? and its neighbors.
4.
5.
The far force on a must now be calculated. If ??, the far potential at cell ?,
is not valid at time Ia, it has to be recalculated. This is done by recursively
validating the ? for the parent of cell i. Then, ?, the far potential due to
particles in the cell, for each cell in the interaction list of cell ?, is validated
(also recursively). Finally, ?? is formed using the ? of the parent cell and
the ? of the IList cells, and evaluated at the position of particle a to find the
far force on particle a.
The sum of the near and far forces is the total force on particle a. This
force is used to form a new Taylor series expansion. The expansion is used
to determine a new update time for particle a.
6. The above procedure is repeated beginning from step 1, until all particles
have been advanced through the required simulation time.
4.2 Iransient Particle Consistency
The far potential at a cell could be computed from temporal potential expansions
of two cells calculated at different times. A particle transiting between such cells,
could contribute twice or not at all to the computed potential. This transient
potential error is due to the d?sc?eteness of particles: a particle is either in the cell
or not in the cell and no intermediate state is possible. See Figure 4.1. This error
is small in magnitude since it is a "surface" effect while most of the particles are
in the "volume". Simulations ignoring these effects by Aarseth-McMillan [MA93]
have yielded satisfactory results.
One simple way of correcting such errors is to update a cell's temporal expansion
whenever a particle crosses its boundary. But since cells at many levels share
the same boundary, this is expensive and also difficult to parallelize. We discuss
below a different method of correcting such errors, where the idea is to maintain a
consistent interpretation of location of particles across cell boundaries. For these
corrections to work, we are forced to sacrifice the complete asynchrony (i.e. never
calculating an expansion unless its validity period has expired) in the algorithm
described in the previous section. The algorithm is modified so that whenever ?
is computed for a cell, ? for all its children cells is also recomputed. We need an
assumption on particle movement here.
32
Figure 4.1: Particle Transitions across cell boundaries
Transiting particles are indicated by arrows
Assumption 4.2.1 (Particle Movements) Within the validity period of a cell's
temporal expansion, a partieTh can only move from that cell to some adjacent cell.
The validity-period of a cell expansion is such that this will be true in general.
4.2.1 Particle Consistency Operator
To account for particles crossing cell boundaries, we introduce the particle consis-
tency operator, x. It makes a cell expansion formed at an earlier time consistent
with another formed at a later time with respect to particles crossing the cell
boundary, so that, each particle contributes exactly once to the potential at any
cell. We develop the concept more formally below.
Let the temporal expansions for cell i be ? (i,tj? where the time-coordinate ti
denotes the time of creation of the corresponding expansion. For adjacent cells i
and j, we say that temporal expansions ?(i,t,) and are consistent, denoted
?(i,tj) ?U,tj)? if they have the same interpretation of which particle is on which
side of the boundary between i and j. In what follows, we use the convention
that two temporal expansions agreeing on the time coordinate of their creation are
consistent. For instance, ?(i,t) is consistent with The particle consistency
operator makes expansions created earlier in time consistent with those created
later in time. More technically, particle consistency operator ?(j,tj?,(i,ti) operates
on ?(j,tj? to make it consistent with ? where t is assumed to be smaller than
(i,tj?
33
In practice, X(j,tj?,(i,t?) represents corrections due to particles moving from cell
i to cell j in the time interval [tj, ti]. A natural choice for its representation is as a
temporal expansion about the center of cell j. Then X(j,tj?,(i,ti? is simply the sum
of temporal expansions about the center of cell j of particles moving from cell i to
cell j in time range [tj, ti].
The particle consistency operator has the following properties:
Consistency:			(4.1)
Exclusivity:			(4.2)
Space Composition: X(j,tj),(i,t,)			(4.3)
X(j,tj),(i,ti) + ?(j,tj)
Either ? 0 or X(i,ti?,(j,tj) ? 0 but not both.
? Tran5late?,j (X(k,tj),(1,ti?)
k?ChiIdren(j)
lEChildren(i)
Time Composition: X(j,tj?,(i,t? = t X(j,tj),(i,t?
(4.4)
The consistency property is just the definition. Exclusivity follows from the fact
that the operator only makes expansions created earlier in time consistent with
those created later in time and not vice versa. The space composition property
is due to our representation of the consistency operator for a cell as a tempo-
ral expansion about its center. Time composition corresponds to breaking up a
time interval [tj, t] into two parts, [tj, ti] and [ti, ti. Particles transiting between i
and j must have done so in one of the two sub-intervals, and the corresponding
consistency terms can be added up.
4.2.2 Manipulation of Consistency Operators
Consistency operator X(j,tj?,(i,ti? is needed for any two cells i and j where ?(i,tj)
and combine to form another expansion. Thus, ? and j could be neighbors
at the same level with i and j being members of IList(k). In this case, ?(i,tj? and
?(j,tj) combine to form ?(k,tk? Alternatively, i and j could be cells at consecutive
levels with i in IList(k) and j in IList(Parent(k)). In this case, ?(i,tj) combines
with ?(j,tj) (indirectly through ?(Parent(k),tparent???)) to form
Each cell i maintains X(j,tj),(i,ti? and ?(5,tParent(i)?,(i,tt) for each neighbor j of i.
For bottom level cells consistency terms are directly formed from the definition.
For higher level cells, they are formed from the childrens' consistency terms. The
following lemma shows how particle consistency terms can be maintained as new
cell temporal expansions are formed. For simplicity, we show only the lemma for
two cells at same or adjacent levels. Extension to cases when their levels differ by
more than one is straightforward.
34
Lemma 4.2.2 Consider adjacent celisi andj, at leveThL? andL? with ILi --H Li <
1, with temporal expansions ?(i,tj) and and consistency terms X(j,tj),(i,ti? and
X(j,tp&ent??'??,(i,ti? Let ? be reformed for cell i at time t. Then the consistency terms
are updated as follows:
X(j,tparent(i) ?,(i,t? Y??,tParent(i) ?,(i,tt'? + Th((j,tj) ,(i,t)
X(j,t5),(i,t? X'(5?tj?,(i,ti? + X(j,tj?,(i,t)
where			is given by
Lj = Lj :			=			Tr??5l?t?k,j(?(k,tj?,(i,t?)
Lj ? Lj. X(j,tj?,(i,t) =
Lj > Lj : X(j,tj),(i,t? =
(4.5)
(4.6)
s			(4.7)
kechildren(j)
iEChildren(i)
(4.8)
kENeighbor(j)?Chi1dren(i)
z T???5l?t?k,j(X(k,tj?,(i,t))(4?9)
kEchildren(j)flNeighbor(i)
Proof. (4.5) and (4.6) follow from the time composition property. The other
equations are simple consequences of the following observations. First, if particles
move from cell i to j, they must also have moved from children of cell i to children
of cell j. Secondly, the consistency term for a cell is a temporal expansion expressed
about the center of the cell.
For instance, if Lj = Lj, consistency terms from children of i to children of j
have to be translated to j's center and summed. The other cases are similar. ?
4.2.3 Compensating for Transient Particle Potentials
The following lemma shows how to use consistency operators to make particle
transitions consistent across a cell boundary. See Figure 4.2.
Lemma 4.2.3 (Correction) Let cell k be a member of the interaction list of
cell i. Then the contribution of cell k to the far field potential at cell i at time tj
is given by
= C0nvert?,j ?(k,tk? + ? )Q'?k,tk?,(j,tj? (4.10)
jEL(k)
where L = Li u L2 U L3 and Li = IList(i) n Neighbor(k) and L2 =
Children(Neighbor(i) fl Neighbor(k)) and L3 = Parent(Neighbor(k) --H (IList(i) U
Neighbor(i))).
35
k
I			L3			L2:
Li
Figure 4.2: Correcting ? with Consistency Operators
Proof. Any particles which have transited to cell k after ?(k,tk? was formed
could have come only from cells adjacent to k. We wish to make the expansion of
cell k consistent with respect to the other expansions which will form ??. These
expansions correspond exactly to cells in list L. Li consists of cells of the same
size as k, L2 consists of cells smaller than k and L3 consists of cells larger than k.
In each case, particle consistency is achieved by adding in the appropriate particle
consistency term. Since X(k,tk?,(j,tj? is about the center of cell k in each case, the
terms can be added in before the "Convert" operation. E]
For the adaptive algorithm, consistency corrections can be applied in a similar
manner although the details are more complicated.
4.3
Temporal Multipole Algorithm with
Particle Consistency
We describe the temporal algorithm more formally now and incorporate the particle
consistency operators in the description. We drop the time subscripts from the
expansions for simplicity. Thus, ? (i,ti? is denoted by ?? and ?(i,t%?,(j,tj? by Xii
Algoritlim
Initialization
Choose refinement level n, a precision 6, and corresponding p, q [--H log2 61.
(Choice of q also depends on eThciency parameter 7').
Form (p, q)-term temporal-multipole expansions for all cells at all levels.
Set update time tb for each particle b.
Update Particle
Find particle a such that ta minb=1N tb. Let cell i contain particle a.
36
Find Near Force:
Integrate each particle in Neighbor(?) to time ta using its old Taylor-series
expansion.
Find near force on a by directly interacting it with all particles in Neighbor(?).
Find Far Force:
Let U ? cell 5 ?? is required to form ?? but ta ?
Let V f cell 5 ?? is required to form ?k with k E U but ta ?
[T%.,L, T?j,H])
For each 5 in V, recompute ?? and ?jt moving bottom up in the tree using
( 2.5); compute [T?,L,T?H] using (3.19).
For each 5 in U, recompute ?? moving top down in the tree using (2.6)
and lemma 4.2.3; compute [T?j,L?T?j,H] using (3.16).
Evaluate ?? at position of particle a to get far force on a.
Set New Time-step:
Find total force on a and find new Taylor-series representation for it.
Use this series to predict new time-step Ta using emma 3.1.1 and set new
update time for particle a to ta t Ta.
Go to Update Particle.
End of Algorithm
Remark Boundarg Conditions: Dirichiet, periodic or other boundary conditions
can be imposed on the above free space description of the algontlim exactly as
discussed by Greengard [Gre87] but using temporal multipoles instead of just mul-
tipoles.
Remark Convergence Issues: Convergence problems can occur for a particle near
a cell boundary, since, its spatial multipole expansion, and therefore the temporal
expansion, is not guaranteed to converge once the particle crosses the boundary.
In practice, this is not significant because it happens infrequently and also self-
corrects. A particle near the boundary causes the temporal expansion to change
more rapidly, reducing its validity period.
4.4 Complexity Analysis and Optimality
The space complexity of the algorithm is easily seen to be 0(N), just as in the fast
multipole algorithm. The number of cells is linear in N, and 0(1) space is required
per cell and per particle. We briefly describe how to analyze the time complexity
of the algoritlim and show that under reasonable assumptions, it is optimal.
37
The time taken by the algorithm is intimately related to the number of in-
tegration steps through which particles are advanced. Let 9 be the number of
integration steps required to advance all the particles for a standard length of
physical time Tphys. 9 is inversely proportional to the integration time-step. The
number of integration steps required for particle a with time-step Ta(t) at time t,
is
Then
9a			?Tphys dt			(4.11)
Ta(t)
N			N ?P[Ta]dm			(4.12)
9 -			9a -			___
a=1			a=1 ?			Ta
where P[Ta]dTa is the probability that the time-step ties in the range [Ta, Ta t dTa]
We make the following reasonable assumption about any N-body algorithm.
Assumption 4.4.1 (Minimum Computation) Any N-body algorithm requires
a number of computational steps proportional to 9 to integrate an N-body system,
where 9 is the number of time-steps required to integrate all particles in the system
over the desired amount of physical time.
The assumption is justifiable since our objective is fidelity of the computational
model to the physical model. The results of R,eif and Tate [RT93], indicate that
there are no shortcuts for the process of integrating all particles through sufficiently
small time-steps using sufficiently long Taylor series appro?mations.
To analyze the computational complexity of the algorithm, we make one more
assumption, which is almost the same as the statistical averaging assumption 3.1.3,
but more specific.
Assumption 4.4.2 (Validity period for parent cell expansion) The tempo-
ral expansion for a parent cell is formed at most ? times (? < 1) as often as the
temporal expansion is formed for its child cell(s).
The justification for the assumption is the statistical averaging which occurs. The
potential due to a larger number of particles changes less rapidly and small fluc-
tuations due to individual particle motions cancel out. A second reason is that
higher levels in the tree correspond to far potentials from particles which are fur-
ther away. Therefore, the rate of change of this potential is smaller. We will
explore this argument in more detail in theorem 5.3.1 in the next chapter.
We will make an additional assumption on the particle distribution, whereby,
particles are assumed to be distributed in a manner such that their number density
does not change drastically from a cell to its neighbor. This assumption is not
necessary for the complexity analysis of the algorithm, but simplifies the proof
considerably.
38
Assumption 4.4.3 (Non-exponential Density) The change in partieTh den-
sity with distance is sub-exponential in U?e distance i.e. p(r) is a sub-exponential
function of r.
The implication of the assumption is that in the adaptive algorithm, the neighbors
of a leaf cell are either at the same level, or one level higher or one level lower.
This will almost always be true for particle distributions in practice.
Theorem 4.4.4 (Optimality) The temporal multipole algornthm is optimal to
within a constant factor, under reasonable assumptions.
Proof. We show that the time complexity of the temporal multipole algorithm
is 0(g). In the next section, we will show that selection of the next particle to
integrate takes only 0(1) time. The time-complexfty is the sum of the times taken
for selecting and updating particles, the time for calculating the near forces and
the time to calculate the far forces. The first two are simply 0(g) since only 0(1)
time is required per particle integration.
Total Cost 0(g) + C? + C? + %
(4.13)
Here, C? (likewise, C? and C?) is the cost of forming ? (likewise, ? and ?) for all
cells for the entire simulation. C? can be broken up into C?,1eaf and C?,non?Ieaf
which are the respective costs for leaf cells and non-leaf cells respectively. C?,Ieaf
is at most 0(g) since ? for a cell is formed less frequently than the particles in it
are moved. Using assumption 4.4.2, ? for a parent cell is formed at most ? times
as often as its children. Moreover, the number of parent cells with leaf cells as
children is at most half the number of leaf cells.
1			11
--H) + ?C? leaf(1 + --H + --H) +..
=			C?,1eaf + ?C?,1eaf(l + 2			2			4
?			2C?Ieaf(l+?+? +...)
2
leaf
0(g)			(4.14)
Using the non-exponential density assumption, a similar argument shows that
C? is also 0(g). The cost C? can be broken up into two parts: cost for creation of
x and cost for use of ?. Now, ? is only created for a cell when ? is and the number
of ? terms is a constant for each cell. Thus the creation cost of ? is 0(C?). Use of
? occurs only when ? is created and therefore the cost of the use of x can similarly
be accounted as 0(C?). Therefore, Cx is 0(C? + C?). Putting all this together,
Total cost = g+C?+C?+C? ? O(g+g+g+g) = 0(g)
(4.15)
39
See Table 4.1 for a comparison of the time complexities of various algorithms for
integration of N particles in three dimensions over one crossing time. Complexities
are given for homogeneous distributions and power law distributions. Power law
distributions are spherically symmetric about the origin and have a particle density
p given by p(r) oc r?? at radius r. Particle velocities are isotropic and given by
v(r) oc r(2--Ha)12. Derivations and details are available in Makino and Hut [MH88].
Asymptotically, the temporal multipole algorithm is significantly better. For N
about io4, gains are visible in practice despite the larger constant factor.
4.5 Blocking of Time-steps
In the optimality proof above, we ignored the time required for selecting the next
particle to be integrated. If a priority queue or heap is naively used, this complexity
can be as large as O(log N). We show below how this selection can be done in
0(1) time.
The idea is to block the time-steps (?McM86,MA93]). Time-steps are con-
strained to be multiples of a power of 2 with the characteristic smallest time-step
for the system. If the predicted time-step is T and the characteristic time-step is
To, then the new time-step is chosen to be 2kTo where 2kTo < T ? 2k+170 The k'th
block, consisting of particles with time-step 2k, is advanced periodically at time
intervals of 2kTo.
4.5.1 Time-step Evolution
When a particle's predicted time-step moves it into a different block, its next
update must occur before the time-step expires. The next lemma shows that the
new time-step for a particle is at least half the previous time-step.
Lemma 4.5.1 (Time-step Evolution) Consider a partic& being integrated in
accordance with lemma 3.1.1, with time-step T at time to. At time t0 + T, its
time-step T1 is at least as large as 2T if r is less than log 43
Proof. The time-step is determined from (3.2). First, we only consider the
constraint on the time-step due to the first derivative of x. Note that
dx
Tt			___			d
x			x			??log(t)			(4.16)
where x(?)denotes d?x the i'th derivative with respect to time of x. So we can just
dtt?
consider the derivative of the function log x. Within the time range of interest,
bounds on x(t) give corresponding bound for log x(t). We start with (3.7), in which
40
the lower bound has all Taylor series terms with a negative sign while the upper
bound has all terms with a positive sign.
(2 --H e"(t?to)IT)x(to) ? x(t) ? e?(t?to)/Tx(to)
log(2 --H e?(t?to)/T) + logx(to) ? logx(t) ? l0ge"(t?to)/T + logx(to?4.17)
If all the Taylor series terms are taken to be of the same sign in each bound on
as we have done, their derivatives bound the derivative of logx(t), as well,
f?log(2 --H e?(t?to)/T)			?			?)logw(t)			?			d
?tloge?(t?tO)/T
e"(t?to)/T ?			7?
2 --H e"(t--Hto)/T r			--H			?d?log x(t)			?			(4.18)
At time t0 + T, the bound is given by
____ 7?			x(')(to+T)			7?
--H 2--Hc?T --H x(to+T)			(4.19)
Therefore the new time-step Tt as determined using only the first derivative of x
in (3.2) is bounded by
)? 2$e? 7?T ? ?			(4.20)
--HT
_ 2
where in the last step we use the fact that 7? ? log ?34 whence e" ?
3.
For constraints on the time-step due to higher derivatives, we can assume that
the time-step is tightly constrained by every derivative, since, this would maximize
the change in the time-step. But in this case, we can bound log(x(?)/x(i?1)) as
above.
x
x
x
7?
T1
x??) x(i?1)			x(1)
x(i--Hl) x(?--H2)			x
x???			x(i?1)
x(i--H1) x(i--H2)
--H e? T
??_7?
--H 2--H C? T
which gives exactly the same bound on T?. El
x
(4.21)
41
To maintain fidelity to the physical model and for stability reasons, the time-step
is usually not allowed to increase by a factor of more than 2 from one time-step to
the next. (Aarseth [Aar85j recommends a factor of at most Y2). The implication
is that after a particle is integrated, either it stays in the same block or moves one
block higher or one block lower. Therefore, moving a particle to the appropriate
block is an 0(1) operation.
4.5.2 Block Updates and Efficiency
The periodic update scheme in Sundaram [Sun92] (also see [McM86,MA93]) ex-
tends to the idea of block updates. At the jth step, particles in all blocks up to
and including the r'th block are updated, where ? is the maximum power of 2
which divides j. This automatically ensures the required periodicity of 2k for the
k'th block.
Blocking of time-steps also turns out to be more efficient. On the positive
side, a large number of particles are advanced together. Some intermediate predic-
tions of particle positions become unnecessary and temporal expansions need to
be validated only once for an entire block of particles. Cells can also have a block
time-step making recalculation of temporal expansions less frequent. On the nega-
tive side, since, the blocked time-step can at worst be half as large as the predicted
time-step, the number of integration steps can double. Overall, the gains from
blocking more than compensate for this increased number of integration steps.
42
Table 4.1: Time Complexity for integrating N particles over 1 crossing time in 3D
Distribution H Homogeneous			Power Lawa
Algoritlim ?			Comple?tyb			Complexityc
Number of steps (s)			4			24
3			6--H2?			11
Direct Shared			8			12 --H 3?
3			6 --H			_
Direct Individual			7			6--H			24
3			6 --H 2?			11
42 --H 13?			24			31 --H
Ahmad-Cohen			25			24--H8?			11			9
V2
1 6 --H ?			31--HTh97			?
____ t
--H 4			6 --H 2?			9
5
Fast Multipole			3			6 --H 2?			2 ? ? 3
4			24
Temporal Multipole			3			6 --H 2?			11
aSpherically symmetric density p(r) oc r?? at radius r; isotropic particle velocity v(r) oc
bExponent q of N in O(N?) is specified
C1f not mentioned for 2 < ? ft the power law complexity is the same as for homogeneous
Chapter 5
Parallel Algorithm:
Homogeneous
Temporal multipole expansions can be used without toss of accuracy within their
validity periods. This allows communication of multipoles to overlap computa-
tion. In this chapter, we examine how the temporal multipole algorithm can be
parallelized for homogeneous particle distributions.
In parallelizing the algorithm, we have to address two issues: communication
and load balancing. Communication between processors is a necessary evil: nec-
essary for information exchange but inordinately expensive compared to compu-
tation. We will discuss how temporal multipoles can exploit the locality of the
problem and minimize communication. For full efficiency, each processor should
do approximately the same amount of work. For homogeneous distributions, this is
automatically true. We address tbe issue of load balancing for non-homogeneous
distributions in the next chapter. Since gravitational and Coulombic forces fall
away rapidly with distance, they have a strong local bias. Our objective is to give
local solutions which exploit this local bias.
5.1 Parallel Processing Model
We are interested only in MIMD models of computation consisting of a small
network of powerful processors. Parallel computing in the next decade seems to be
moving toward such machines, exemplified by the Intel PARAGON, the nCUBE,
Thinking Machines CM-5, IBM Power Parallel, the Parsytec Supercluster and the
new CRAY T3D.
The communications involved in the parallel algorithm correspond to the pyra-
mid (explained later). We will therefore explain the concepts involved assuming
43
44
processors interconnected in a pyramid architecture. Communication takes place
by message-passing and sending or receiving a message will be assumed to take
0(1) time. In Chapter 7, we will show how to implement the algorithm on various
parallel architectures.
5.2 Communication
In the sequential temporal multipole algoritlim, the computation of multipoles
is need-driven or demand-driven. Multipole expansions are computed only when
some particle needs to use them. In the parallel setting, cells are distributed across
processors and the communication time involved in making a request and receiving
a response is too large. For efficiency, we make the parallel algoritlim synchronous
above the cell level and expansions are periodically computed and communicated.
5.3 Estimating the Validity Duration T?
The temporal expansion of one cell should be still usable when it reaches another.
If the communication time from cell A to I? is T, the time for which the expansion
of cell A is valid at B should also be at least T. Below, we estimate T?(D),
the validity duration of a temporal expansion at distance D, for a homogeneous
distribution. This estimate is subsequently used to show that the parallel temporal
multipole algorithm is optimal.
Theorei? 5.3.1 Consider the potential at the origin due to particles contained
inside a sphere of radius R with center at distance (D + R) from the origin, where
D > R. See Figure 5.1. Let the length of time, for which the temporal ewpansion
of the particles inside the sphere is valid at the origin, be T?(D). Then T?(D) is
at least ?, where vo is the dispersion velocity' of particles inside the sphere.
Proof. The length of time for which the temporal expansion is valid is deter-
mined using equation 3.17. Rearranging this equation,
1/i
7?
dt'
(5.1)
We wish to estimate the minimum possible value for T?(E). The minimum value is
constrained by the first derivative of the potential since we are only considering far
potentials. Let us consider the potential function ?(x) ??? The proof is similar
1Dispersion velocity is just the root mean square velocity i.e. v0 = 4m?3?1v,2Is for s particles.
45
ORIGIN
Figure 5.1: Validity duration of temporal expansion of particles inside sphere at
origin
for ?(x) = --H log x. Consider a unit charge in the sphere located at point x. Its
potential at the origin is given by ?(x) and the first derivative by:
d?(x) _ d?(x)dx _			1dx (5.2)
dt			dx dt			x2 dt
Let there be s particles inside the sphere with positions x1, x2,... , X, and charge
intensities m1,m2,... , m? such that M = Z%S=l m?. The minimum value for T?
will be obtained when the numerator of (5.1) is minimized and the denominator is
maximized. The numerator is
s			3
= ?7?mj?xj) =			D +21?			D +21?			(5.3)
izi
The denominator can be expressed as
dt			--H Smi dt
__			__			(5.4)
m?dx?			1			dx?
x? dt			D2			_			dt
We can approximate the summation by the worst case of all particles moving to-
wards (or away from) the origin with a velocity equal to v0, the dispersion velocity.
This bulk flow is highly improbable but will give the most stringent bound on T?.
Note that as the number of particles within the sphere s becomes larger, "bulk
flow" becomes more and more improbable. At the other end, in the limit when s
is just 1, "bulk flow" can occur with high probability.
3 m?v0			WIv0
dt i=1 D2 --H D2
(5.5)
46
Using these values for the numerator and denominator of (5.1), r? is bounded by
?MD2
--H D + 2R ?`[vo			3vo			(5.6)
where we used the fact that D is greater than R. ?
Remark Validity Duration for Steady-State Distributions: For steady-state dis-
tributions like the Plummer law distribution, "bulk flow" is very highly improbable.
For the more reasonable assumption that the velocities are normally distributed
random variables, we get a lower bound where r?(D) oc DR3?2 when ?(x) = 1
This larger value of r? leads to a more efficient parallel algorithm on the mesh.
5.4 The Parallel Temporal Multipole
Algorithm
For homogeneous particle distributions, all leaf level cells have the same size, con-
tain roughly the same number of particles (say s) and these particles have time-
steps of the same magnitude. For ease of exposition, we assume that each cell
has one processor assigned to it, so that processors can be identified with cells.
Further, we suppress details about force calculations within leaf cells.
Each cell needs to communicate with its children, its parent, its neighbors and
its interaction list. Recalling that the interaction list of a cell consists of cells which
are children of its parent's neighbors, it is sufficient for a cell to communicate with
only its parent, children and neighbors. The communication structure is that of a
pyramid of processors. Figure 5.2 shows the pyramid for a one dimensional system
of particles.
The parallel algorithm just pipelines the temporal expansion communications.
Communication and computation proceed concurrently. Cells internally calculate
validity periods for expansions and if necessary wait for a valid expansion to arrive.
However, by the validity duration estimate of theorem 5.3.1 temporal expansions
almost always reach their destinations in time to be used. For simplicity, we
will assume in the description below that temporal expansions always reach their
destinations within their validity periods.
A formal pseudo-code description of the algorithm follows. The computations
performed at a typical cell i at level h are detailed.
47
------- ---- . y---.--.			. .------.-.- .. .----r-%--. .- -- . . -`--------?------ . ?--.-t-'
?.----:----			?--			u-'-.---.---			?			vi--j			vi4--A-
Q.)."-- Processor Node
Neighbor-Neighbor link
Child-Parent Link
Figure 5.2: Pyramid of Processors for the iD System of Figure 2.1
Initialization
Choose the number of levels n logN desired precision ?, the order of the
temporal expansion p, q --H log2 ?, and number of particles per cell s.
Do one pass through the algorithm without pipelining (i.e. wait for all commu-
nication to complete) and initialize all cell values.
Choose Next Time-step at each cell at level n:
Select next time ta such that (ta --H t) is minimum and there is a particle a in the
cell with update time ta. Propose time ta to neighbors.
Synchronize time with neighbors choosing least time proposed by all neighbors
as current time t.
All particles within cell with update time t (possibly none) will be advanced.
Upward Pass
Comment [Form expansions ?? of potential field due to particles in cell? about
the center of the cell.J
Cell ? at level n:
Form (p, q)-term temporal multipole expansions ??, Xii using some method
within cell ?.
Send expansions ?? and Xi,j to parent of cell ?.
48
Cell z at level h, 0 ? h <
Send multipole expansions ?? and Xii to parent of cell i.
Receive from each child k its multipole expansions ?k and Xk,t
Form new ?? and new Xi,j using lemma 4.2.2.
Downward Pass
Comment [Form local expansions of potential ??, due to particles outside cell
i and its neighbors, about the center of cell ?]
Cell i at level 0:
=0
Send ?o to each child k ofcell i.
Cell? at level Ii, 0 ? h ?
Send to each neighbor j of cell ? the temporal multipole expansions and
consistency terms of children of cell ?.
Receive from each neighbor j the expansions and consistency terms of j's
children.
To each child k of cell i send ?k formed according to lemma 4.2.3.
Receive new ?? from parent of cell ?.
Cell ? at level n:
Receive ?? from parent of cell ?.
Computation of Force/Potential Comment [Use some method within each
leaf cell to advance all particles with update time t. Exchange temporal
multipoles and runaway particles with neighbors.]
Remark Nei9hbor Synckrvn?zat?on: In each cell, particles are independently in-
tegrated when their next time-step occurs. This can potentially cause neighboring
cells to have different local times. Since near neighbor interactions are performed
directly on particles advanced to the same time, neighbor synchronization must
be maintained. This is ensured by having each cell advance its time by the min-
imum of the time-steps of all its neighbors. If a cell's time-step is larger than its
neighbors, it waits for its neighboring cells to catch up.
Remark Integration Method within Cell: The method used within each cell to
advance particles depends on how large the number of particles per cell, s, is. For
large s, N> 10?) the temporal multipole algorithm itself can be used for maximum
efficiency. Further, within each cell, regularization of close encounters can also be
done.
49
5.5 Complexity Analysis
As stated before, each cell is assigned a processor. We make the following assump-
tion to analyze the complexity of the algorithm. The assumption essentially states
that processors do not have to wait in an inactive state while neighbors catch up.
Assumption 5.5.1 (Large Cell) The number of particles per cell, s, is suffi-
cientl? large for both of the following conditions to hold. At least one particle is
integrated in each pass through the algorithm in each cell and the size of a cell is
much larger than the average inter-particle distance.
5.5.1 Communication Efficiency
Let the time taken for the temporal multipole expansion of a cell to reach another
cell at distance D from it be Comm(D). We state the following self-obvious no-
waiting theorem.
Theorem 5.5.2 (No Waiting) If Comm(D) is at most T?(D) for a parallel al-
gorithm, then no processor has to wait for valid temporal expansions to reach it.
Let the width of a cell be R0. Comm(D) for the pyramid is just the time to
go up and down a k0 x RDo pyramid which is 2 log ?RDo communication steps. For
each communication step there is an integration step of average length To. For a
homogeneous distribution, TO is approximately given by nftr0, where ro is the average
inter-particle separation. Thus Comm(D) is 2log ?RDoTo which is 2?log RDo JLv?o For
theorem 5.5.2, we require
Comm(D) ?
r0			D			D
2?--Hlog--H ?
vo			R0 --H			3vo
D			D
2log--H ?
R0 --H 3r0
(5.7)
Under the assumption that cells are sufficiently large (so that R0 > ro), this will
be true.
5.5.2 Time and Space Complexity
We show the analysis for two dimensions. For three dimensions, the analysis is
exactly analogous. There are N particles and at least 45 particles per cell at the
finest level. The number of cells at the finest level of refinement is at most 4N
corresponding to a x $7N pyramid with 0(48N) processors.
50
Each communication step takes 0(1) time and each processor spends 0(1) time
on communication in each pass. Under the large cell assumption 5.5.1, the total
time spent on communication in all processors is 0(g), where 9 is the total number
of particle integration steps. The the time spent on computation is also 0(g). The
total time taken is 0(g) and this is optimal.
Each cell at the finest level stores at most s particles. Each higher level cell
needs only 0(1) space. So the space used per processor is 0(8).
Blocked time-steps will reduce the constant significantly in the time complexity
of the algorithm. All particles in a block get updated at the same time and the
compute/communicate ratio increases, making the algorithin more efficient.
Chapter 6
Parallel Algorithm:
Non-Homogeneous
Since a homogeneous distribution is symmetric, each processor does roughly the
same amount of work. For other distributions, a simple assignment of cells to
processors can cause severe load imbalances leading to inefficiency.
Consider the particle distribution shown in Figure 6.1. Particles are concen-
Physical Space: Particles
Computational Space: Processors
Figure 6.1: Load Balancing: Distribution of Particles and Processors
trated in the two corners of a square in physical space. The processors are arranged
in a uniform grid in computational space. It is clear that overlaying physical space
on the computational space can create load imbalances, with a few processors do-
ing most of the work. This chapter discusses how to distribute the work fairly
across all processors so that the simulation can be done more efficiently. Given
51
52
the local nature of the problem, our emphasis will be on local solutions. We will
be less formal than in previous chapters and only attempt to sketch out a solu-
tion. However, it seems to the author to be a better approach to load-balancing
than other cell-based approaches which do not exploit the locality inherent in the
problem.
6.1 Load Balancing
Ideally, we would like to have more computing power available where there are
more particles. Define the computation density at a point as ? where p is the
space density of particles and T is the average integration time-step for a parti-
cle at that location. The average computation density of a cell is a measure of
the computation requirements of the cell. Informally, we wish to distort physical
space-time with varying computation density on to computational space-time with
uniform computation density. Sparse regions shrink so that few processors are al-
located to them and dense regions expand and are allocated more processors. This
will then correspond to a load-balanced computation. Since particles move as the
simulation progresses and communication is expensive, the load balancing scheme
should be dynamic and local.
Locality is extremely important for efficiency. Most of the communication in
the algorithm occurs between processors containing neighboring cells when they
compute the near force. Near force computation requires communication of de-
tailed information on particles and cells. In contrast, far force communication
costs are small since only the multipole expansions have to be communicated and
when using temporal expansions, even this gets amortized. Therefore, the load-
balancing scheme would be very efficient if neighbors in physical space are also
neighbors in computational space. Before discussing such a scheme, we make a
simple but useful observation.
Fact 6.1.1 If N particles are arbitrarnl? partitioned into two sets, the force on any
particle is the sum of the forces on it due to particles in the two sets.
Thus physical space can be duplicated so long as particles are not, and a cell can
exist in more than one processor simultaneously.
6.1.1 The Center of Work Heuristic
Load balancing is based on the center of work heuristic. Neighboring processors
exchange particles to equalize the amount of work they do. Load balancing steps
are scheduled periodically. The duration between successive load-balancing steps,
Tbal, is fixed by mutual agreement among neighbors. The details follow.
53
6.1.1.1 The Center of Work
Consider a set U of particles. The i'th particle is located at Xj and has time-step
tj at a load-balancing step. Let Wj be the work done on the i'th particle in the
previous Tbal time. The center of work, (Xu, Tu), for the set U is defined to be the
work weighted mean of the space coordinates and the time-step of all the particles
in U.
6.1.1.2 Work Redistribution
(Xu,Tu) --H ?iEU(WiXi,Wi?)
ZiEU W?
(6.1)
Work redistribution occurs based on the "center of work" heuristic. At the junction
J of any four processors, the current center of work (Xc, Tc) for all the particles
in all four processors is found. See Figure 6.2. The spatial component Xc is used
P4			P3
New center of work
to be aligned with J.
New physical space for Pi (shared)
Figure 6.2: Work redistribution based on center of work heuristic
to balance the load by exchanging particles and the temporal component Tc is the
duration of the next load-balancing step. Specifically, physical space is distorted by
aligning Xc with the junction point J of the four processors. After this alignment,
each processor is in charge of all the particles falling within its domain for the next
Tc Tbal units of time (with respect to junction J).
This realignment has the following properties:
Locality: Physically adjacent regions will remain adjacent in processor space.
54
Linearity: The physical space allocated to a processor undergoes only a linear
transformation. Since, every processor has non-zero work to begin with, the
new junction point will always lie within the physical space owned by one of
the four participating processors. Thus, the physical space assigned to any
processor by this method will always be a convex quadrilateral.
Fairness: The load on processors after the realignment will be more equitable.
Processors with more work in the previous Tbal units of time will have less
physical space (and fewer particles) and vice versa.
Note that this realignment occurs independently at every single junction J of every
4 processors. Thus "space" assigned to a processor is not fixed but changing.
Fact 6.1.1 can be used to simplify book-keeping. Each processor finds the square
bounding its quadrilateral and simulates all cells within this square. As a practical
matter, to avoid unnecessary movement of particles between two processors with
approximately the same load, the alignment of Xc with J is done only within some
tolerance factor and not exactly. In other words, particles are transferred from one
processor to another only if there is a significant difference in the amount of work
done by the two processors. The load-balancing time-step may also be scaled by
a multiplicative constant, so that it tracks the evolution of the system.
6.2 Communication Efficiency
Since physical space is distorted by the load balancing scheme, it is no longer clear
that temporal multipoles are available where they are required and when they are
required. To show this, we make the following assumption.
Assumption 6.2.1 (Evolution and Load Balancing) The rate of evolution of
the system is much sThwer than the rate of load-balancing so that the computation
density is the same at every processor.
The idea is that if the processors are load-balanced at the beginning of the simu-
lation, local particle exchanges through the center of work heuristic are sufficient
to ensure that they remain load-balanced throughout the simulation. This is a
reasonable assumption to make since load-balancing steps are frequent and in each
integration step the distance moved by a particle is very small. However, the
assumption can break down for pathological situations. For non-pathological sit-
uations, if Nj is the number of particles and Tj the average time-step at processor
i, we then have
constant for all processors i			(6.2)
Tj
55
From theorem 5.3.1, we know that the validity duration is 3Dvo at distance
D from a temporal expansion of particles with dispersion velocity vo. However,
the dispersion velocity is not the same throughout the system. To simplify the
discussion, we make the following assumption about the N-body system.
Assumption 6.2.2 (Composition of homogeneous subsystems) The N-body
system is composed of subsystems each of which is homogeneous.
Being primarily interested in the mesh model of parallel computation, we will
only consider nearest neighbor mode of communication. Processors pass on infor-
mation to their nearest neighbors and we wish to ensure that temporal expansions
are received within their validity periods at the destination processors. Therefore,
Comm(D), the time for communication of temporal multipoles to distance D, is
just the sum of the communication times across these subsystems. With the as-
sumption that the system is comprised of sufficiently large load-balanced locally
homogeneous subsystems, we can characterize the velocity with which temporal
multipoles are communicated across such a subsystem.
Lemma 6.2.3 Consider one homogeneous subsystem, located inside a single pro-
cessor, of a load-balanced N-body system. Let the subsystem have density p and an
average integration time-step of length T. Then U, the velocity of communication
of temporal expansions within the subsystem is given by
U pi???i?i?? --H p(3?d)I(2d)
where d is the number of dimensions.
(6.3)
Proof. Let N be the number of particles in the subsystem and 1? be its physical
size. Since, the subsystem is homogeneous, N oc pild Once load-balanced, all
subsystems have the same constant computation density. Therefore,
N _ J?dp _
T --H			--H constant
T
(6.4)
Now, the expansion is passed on to the next subsystem in time T. Therefore it
travels a physical distance R in time T and its velocity of communication is given
by
R			1
T			pl/dTl--H1/d
p and T are related by T Oc p?112. Hence,
Ux			1 p(d?3)/(2d)
pl/dp--Hl/2+l/(2d) oc
E]
(6.5)
(6.6)
56
By the above lemma, U is a constant in three dimensions while for two dimensions,
U oc p?114. Now we will estimate the time taken by a temporal expansion to travel
across the entire N-body system. Physically, the expansion is always despatched
by each processor on its path along the shortest physical path to its destination.
This physical path however is dilated by load balancing.
Theorem 6.2.4 Consider a load-balanced N-body system consisting of 5 locally
homogeneous subsystems. Let the i `th subsystem have density p?, size Dj and aver-
age integration time-step Tj. Then, Comm(L), the time for a temporal expansion
to travel a distance D across the entire system is given by
where K is a constant.
Comm(D)			? DiKp,(3?d)I(2d)			(6.7)
i=1
Proof. The expansion is sent along the shortest path in physical space which
might pass through all of the s locally homogeneous subsystems. The time for the
expansion to travel a distance D is simply the sum of the times it takes to travel
through each subsytem on its path.
By lemma 6.2.3,
Comm(D) ? Comm(D?) and D ?
i=1			i=1
Uj oc ?%(d?3)I(2d)
Let the time taken to cross the i'th subsystem be Comm(D?). Then
Comm(D?) = fi --H KD.p(3?d)/(2d)
Uj
Therefore, Comm(D) can be calculated as
Comm(D) = ? Comm(d?) = z DiKp%(3?d)/(2d)
i=1
(6.8)
(6.9)
(6.10)
(6.11)
D
r(D), the validity duration of the temporal multipole at distance D, is simply 3??
where Vj is the dispersion velocity of the set of particles whose potential the expan-
sion represents. If Comm(D) is less than T(D), temporal multipole expansions will
reach their destinations within their validity periods. To ensure this, we require
D
ffi DiKpt(3?d)/(2d) --H 3vi			(6.12)
57
This will be true if ??jp?(3?d)/(2d) is less than ? for all z and for all j. For 3
3
dimensions, we just need Kvj to be less than ?31 for all I.
If we ensure that Comm(D) is less than T(D) for this nearest neighbor corn-
munication model, the algorithm can be implemented on a mesh architecture in a
manner similar to that discussed for homogeneous systems.
Chapter 7
Parallel Algoritlim 011 Various
Architectures
In this chapter we discuss the implementation of the parallel temporal multipole
algorithm on various parallel architectures. We discuss the algorithm for a two
dimensional system. The extension to three dimensions is straightforward. We
will refer to steps where particle integrations actually occur as integration steps.
Those steps which only involve communication of temporal multipole expansions
will be referred to as commtunication steps. By requiring that the number of com-
munication steps per integration step be 0(1), we can ensure that the algorithm
remains optimal.
The parallel algoritlim on the hypercube is optimal and has the same asymp-
totic complexity as the sequential algorithm, namely 0(g). On the mesh, we
present an algorithm which is inefficient by a multiplicative factor of log P on P
processors.
7.1 Relaxed Pyramid
For conveniently mapping the algorithin on to existing processor architectures, we
first define a relaxed puramid on which the algorfthm works as efficiently as on a
pyramid. Relaxed pyramids can be easily embedded in hypercubes and meshes.
Consider an M x M pyramid. The corresponding relaxed pyramid also has
an M x M base and properly contains the pyramid. We will refer to the nodes
of the relaxed pyramid which are also present in the pyramid as pyramid nodes.
Between a pyramid node i at level h and its parent pyramid node j at level (h --H 1),
the relaxed pyramid has a child-parent chain of (Th --H 1) additional chain nodes:
v0. v1			vT			i where This [2AK1i.
I =--H ?, %)`???` ? --H			Here, A is a parameter to be chosen
58
59
depending on the architecture. vJ. is connected to Vjkj?l and is its only child. The
physical interpretation is that all the nodes Vjkj, k # 0 represent the same pyramid
node but hold temporal expansions corresponding to different time-steps. The
number of nodes in the relaxed pyramid is 2AM2 (A > 32) as against -43M2 in the
pyramid and the number of levels is A(M 1) as against log M.
Particle consistency terms are formed as usual. However, there is one techni-
cality, regarding the child-parent chains. Consider particle consistency terms at a
pyramid node i at level h of the pyramid. A new consistency term xti,j or simply
xt (suppressing subscripts for convenience) arrives at time t. Let the consistency
terms of the previous Th time-steps be %t?1, xt?2,.. . , ,y,t?Th?1 Then the consis-
tency term at node i at time t can be formed using the time-composition property
(4.4)
Th--H1
xt			? xt?i			(7.1)
j=O
This is true for each neighbor j of i. From this description, it might appear that
a level h pyramid node needs to store 0(2n--Hh) consistency terms. If h is 0 this
requires 0(M) space.
This requirement to store 0(M) terms can be avoided by the node-chaining
technique. The essential idea is to treat nodes in a series or chain of nodes as
storage space. Suppose node i receives data from nodes in chaini and wants to
receive the same data (in the same order) after a delay of 2T. See Figure 7.1.
Then, it just chooses a suitable chain of processors chain2 of length T and sends
the data from chain1 into chain2. The last node in chain2 "reflects" the data so
that node i receives the data at the end of time 2T as required.
Now, note that at time t, the consistency term xt can also be formed as
xt			xt-1 --H )ct?Th t xt			(7.2)
Only the terms ?t, xt--H1 and xt?Th are required at time 1. The intermediate terms
can be sent back along the child-parent chain halfway in such a way that they
arrive at the node exactly in time to be used.
7.1.1 Homogeneous
The temporal multipole algorithm is the same on a relaxed pyramid except that
now we allow cells with a single child. Cells in the chain (except for those also
present in the original pyramid) do not participate in neighbor-neighbor interac-
tions. A multipole expansion created at time t in some leaf cell reaches another
leaf cell at distance D from it after first going up the relaxed pyramid then go-
ing down. Comm(D), the number of communication steps required to travel to a
T
60
celti
D
chaini
chain2
Figure 7.1: Principle of Node chaining
Node? receives data from chain1 once again after delay 2T by sending it along
chain2 of length T-
processor at distance D, is equal to twice the height of the relaxed pyramid with
base D x D, which is 2A(D --H 1). If we choose the nuinber of communication steps
executed per computation step to be 2A, Comm(D) satisfies theorem 5.5.2 (under
the large cell assumption).
7.2 Hypercube
To embed a pyramid on the hypercube, we proceed as in Zhao and Johnsson [zJ9lj
and as described in Ho and Johnsson [HJ87,HJ9O]. Successive levels of the pyramid
correspond to finer grids. The finest level grid is embedded using gray codes
independently for each axis, Since the number of cells is a power of two along each
axis, this is easily done by partitioning the hypercube connections among the two
axes. Higher level grids are embedded on to a subset of the same set of hypercube
processors. Processors at level k consist of all processors with coordinates which
are a multiple of 2n--Hk along both axes. The distance between neighbors on any
level is at most 4 (for non-diagonal neighbors it is 2). The parent-child distance is
at most 4 as well, since the parent is also mapped onto one of the neighbors. In
this embedding, communication can occur only one level at a time.
7.2.1 Homogeneous
Although the embedding is that of a pyramid, the simulation will be that of a
relaxed pyramid. Conceptually, each child cell simulates its child-parent chain in
61
a sense made more precise below.
In communication step t, level h(t) of the relaxed pyramid is simulated, where
h(t)			M			k = maxf(t mod Al) mod 2? =			(7.3)
2k'			i>O
M
Level 2h is simulated once in every ? communication steps. This being the case,
particle consistency terms for intermediate steps need not be formed at all. (Note
that only levels present in the actual pyramid are simulated but exactly after a
delay corresponding to data movement through a relaxed pyramid.) The child-
parent delay is 2n--Hh for a child cell at level h. However, since data is not pumped
continuously up the child-parent chain but only at intervals of 2n--Hh, old data is
used for the intervening 2n--Hh steps also. Hence, the algorithm functions as it would
on a relaxed pyramid with a value 4 for A. So in this case, 2A or 8 communication
steps are required per integration step.
The number of processors required, for N particles, is ThsN and space required
per processor is O(?21 log ffiN + s) since one processor could be simulating ?21 log 4sN
pyramid processors.
7.3 Mesh
We will assume that the mesh neighbor structure reflects the algorithm neighbor
structure. Thus, a processor has 8 neighbors in 2D and can communicate with each
of them in 0(1) time. First, we show how to embed an M x M relaxed pyramid in
an M x M x log M mesh with 0(1) slow down of the temporal multipole algorithm.
Since, an M x M mesh can simulate an M x Al x logM mesh with O(logM) slow
down, the temporal multipole algorithm can be implemented on an M x M mesh
with O(log M) slowdown.
The embedding of the M x M pyramid in an M x M x log M mesh is done
in the obvious way. Let both be placed symmetrically about the origin of some
coordinate reference frame. Each processor on the pyramid is mapped to the
nearest processor in the mesh. The bottom layer is the same in both and the ?i'th
level of the pyramid is mapped onto the Ii'th level of the mesh. The relaxed pyramid
has the same embedding except that each child-parent chain is embedded on the
parent layer and corresponds to the shortest path between child and parent. In
this embedding, child-parent communication (on the relaxed pyramid) takes 0(1)
time. However, on the h'th layer, neighbor-neighbor communications take 0(1)
time on the relaxed pyramid but O(M2h) time on the mesh. We will postpone a
discussion of this delay and its effects.
62
When the data from its children arrives at a node, it stores this data and for-
wards a copy to each of its neighbors. When the data from the neighbors arrives,
the far potential is computed for each child and dispatched to the child. Since all
the neighbors are equidistant, their data arrives simultaneously. So corrections for
intervening time-steps are accumulated and applied individually, after computa-
tion, to each child's potential.
The data received from children is part of the data sent to a cell's neighbor
and also later used in computations. This data should therefore be stored, until
the neighbor data arrives. But the storage required for this, over successive time-
steps, could be as large as 0(M) at one processor. We can alleviate this problem
by node-chaining. In Figure 7.1, the child-parent chain is chain1 and half the
neighbor-neighbor chain is chain2. The modified child-parent chain passes twice
through the parent. Think of data being pumped pipeline fashion on this chain.
The first time the children's data passes through the parent, a message to the
neighbor can be sent. (Due to symmetry, data from all children arrives at the
same time). The second time, the children's data arrives simultaneously with the
neighbors' data. Required results can be computed and sent out immediately and
no storage is required.
7.3.1 Homogeneous
With node-chaining of the child-parent chain, the delay for a child-parent commu-
nication at level h is 3 * 2n--Hh--H1, which corresponds to a relaxed pyramid with A
equal to 3. To satisfy theorem 5.5.2, 2A or 6 communication steps are needed per
integration step.
Coming back to neighbor-neighbor communication delay, this delay is 2n--Hh on
level h. The distance traveled by a particle in one integration step is proportional
to the inter-particle distance r0, while the distance travelled by the temporal ex-
pansion in one communication step is proportional to the inter-cell distance R0.
By the assumption 5.5.1, R0  r0. It is easy to see that the distance that a particle
can travel in 2n--Hh communication steps is less than the size of a cell at level h+ 1.
Therefore, particle consistency terms remain valid.
In summary, the number of processors required is o(ffiN), the time per N-body
iteration is 0(21 log %N t s) and space per processor is 0(21 log ffi + s).
For the sake of completeness, we mention that an optimal algorithm on the
mesh seems possible. Preliminary results indicate that if the embedding of the
relaxed pyramid is not static but dynamic, the logarithmic multiplicative factor
can be removed from the time-complexity.
63
7.4 3D Algoritlim on 2D Mesh
Conventional parallel machines have a two-dimensional mesh topology while the
N-body algorithm is usually studied in three dimensions. We, therefore, require a
way of mapping the computations of the 3D mesh on to a 2D mesh of processors.
Specifically, we wish to map computations on an M x ?I x M mesh on to
an M x M mesh. This is a well-studied problem [Lei92j. Moreover, since each
processor now simulates M other processors, there is more computation involved
per communication step. This can be used to hide communication costs and it is
straightforward to change the algorithm so that the log M inefficiency is removed
making the algorithm optimal.
7.5 A Distributed System
It would be desirable to be able run N-body codes on conveniently available work-
station networks. There are two problems with such networks. They have a high
communication latency and they a?e typically ethernets with broadcast or global
communication primitives. The rough analysis below, under extremely optimistic
assumptions, shows that the main limitation in implementing the temporal multi-
pole algorithm on such networks comes from the latter. With current technology,
processors will spend most of their time waiting for communication to occur.
For simplicity, let us look at a homogeneous distribution of particles. Let there
be 8 particles per cell and C cells per processor. Let the information in bytes
communicated to a neighbor be Po bits per particle and C0 bits per cell. Only
cells on the common boundary and their ancestors need be communicated to a
neighbor. Then, the amount of information communicated to each of 8 neighbors
is appwximately VT(2C0 + sPo) (in 2 dimensions). So the total data sent out by
a processor per integration step is 8?(2C0 + sPo). Assuming a message transfer
rate of B bits per second,
Time Taken for message send = 8?(2C0+sPo)
B
(7.4)
Optimistically, let us assume that receiving messages from neighbors also requires
only this much time. Let the communication time be overlapped with computing
the near force. This is to avoid processors spending most of their time waiting for
messages. With C cells, this will involve approximately O(8Cs2) force calculations.
If a force calculation takes time Tf, computation time is 8C1s2Tf. For maximum
efficiency, the computation time and the communication time should be roughly
64
equal so that the processor spends no time waiting. Therefore, we want
8Cs2Tf 2x8?(2C0tsPo)
B
2
(7.5)
For optimistic values: 500 bits for P0, 5 x i04 bits for C0, 10 x 106 bits per second
for B, 100 for s, 1 x 10--H6 second for Tf, the obtained value for C is 10 which
corresponds to a 1000 particles per processor roughly. The required latency is 0.1s
which also is not too stringent.
The problem is that the ethernet is intrinsically a broadcast network. If there
are p processors on the network, all working on the N-body problem, the latency
will be p times the intrinsic 0.1s latency since each processor has to wait for every
other processor to finish. Then the number of particles per processor has to increase
enormously to hide this latency of O.lp and in turn causes the latency to go up
since more particles per processor requires more information to be communicated.
The essential problem is one of trying to communicate local information glob-
ally. The N-body problem is more suited to networks which do not suffer from
this problem.
Chapter 8
Implementation
The sequential algorithm in two dimensions was implemented on a Sun sparc-
station. The parallel algorithm, also in two dimensions and without any load
balancing, was implemented on a 16-node Intel iPSC/860 hypercube. The code
written in C consisted of around 4000 lines for the sequential implementation and
around 5200 lines for the parallel. Since each processor itself employed the tem-
poral multipole algorithm, the parallel code was a superset of the sequential code.
8.1 Implementation Details
Below we discuss some of our implementation decisions which might be useful for
future implementations.
8.1.1 Scaling
Particle masses (and charges) were scaled so that the total mass and charge were
each 1. Thus, for N particles, each particle had a charge and a mass of N1 All
particles were placed in a unit square centered at the origin.
8.1.2 Tree of Cells
A tree of cells with each leaf cell containing a list of particles within it was main-
tained. The tree was incrementally updated. When a particle moved from cell i to
cell j, it was deleted from cell i's list and inserted into cell j's list. When the num-
ber of particles in a leaf cell increased beyond 8, it was divided into children cells.
Likewise, if the number of particles in all the children of a cell decreased below 8,
that cell was made a leaf and its children were deleted. Temporal expansions were
65
66
suitably updated in each case. The number of particles per cell was fixed at 40 in
all the tests reported here.
8.1.3 Taylor Series Integration
Particles were integrated using a Taylor series. Time steps for particles as well as
cell temporal expansions were determined by the criterion used by Aarseth [Aar85].
Blocked time-steps were used for efficiency.
8.1.4 Temporal Multipole Expansions
Instead of explicitly forming temporal multipole expansions and time derivatives,
divided differences as described by Ahmad-Cohen [AC73] and Aarseth [Aar85]
were used. Particle forces in the previous four time-steps were used to form the
differences. The same divided-difference technique was extended to form tem-
poral expansions for cells also. For each coefficient in the multipole expansion,
divided differences were individually formed up to the 4th order. These differences
were used to extrapolate the expansions and also to predict the next update-time.
Fourth order divided differences need values in 4 previous time-steps to work and
leapfrog integration with shared time-steps was used for initialization.
Divided differences have the advantage of simplicity and speed over forming the
complete temporal expansion. The disadvantage is that they are not sufficiently
sensitive to close encounters and tend to have a time-lag effect.
Particle consistency terms only at the leaf level were formed and used. At lower
levels, particle consistency was ignored.
8.1.5 Optimizations of Multipole Manipulations
Some optimizations, originally suggested and implemented by Greengard in his
code were used in modified form in the multipole manipulation procedures. The
optimizations take advantage of the symmetry of the IList cells to "Convert" 4
expansions at the same time with cost O(4p + p2) instead of O(4p2). Also, fewer
number of terms were converted for IList cells which were further away. Multipole
expansions for the forces, instead of potentials, were directly formed. Powers of
the complex number vector from cells to their ilist members and children were
pre-computed and stored. This allowed expansions to be evaluated rapidly.
67
8.1.6 Parallel Optimizations
Only information about cells on the shared boundary with a neighbor and multi-
pole expansions for the ancestors and siblings of these cells were communicated to
the neighbor. For each processor, a processor tree was formed. This tree refined
the entire system simulation space top-down although only that processor's par-
ticles were present in this space. Processor level cells corresponding to neighbor
processors and interaction list processors were also formed. This allowed temporal
multipoles to be updated conveniently.
An effort was made to hide communication latency in the parallel version.
After sending out messages to all neighbors, a processor instead of idly waiting
for responses, calculated the near forces in between particles within the processor
itself.
8.2 Verification
Some tests were done to verify the fidelity of the algorithm. The first concern was
whether the accuracy of force calculation using multipoles is sufficient. The second
was whether the accuracy of integration improved when the temporal multipole
algorithm was used and all parameters were tuned for increased accuracy.
8.2.1 Accuracy of Multipoles
Table 8.1: Multipole Order and Force Error
Order (p) Precision (2--HP) Force Errora
4 6.25 x 10--H2			3.0814 x 10--H?
7 7.81 x i0--H3			1.3525 x io--H4
10			9.76 x			i0--H4			2.1064 x			10--H05
14			6.10 x			10--H?			4.2115 x			10--H08
17			7.63 x			10--H6			4.1295 x			10--H09
aMaximum Absolute Relative Error = maxffi1 4&
Fj
Accuracy of multipole calculations were tested using a simple static 2-body
problem. Force calculation should show exponential convergence to the exact force
68
SCALING OF FORCE ERROR WITH MULTIPOLE ORDER
1O?
1O?
O? 1o?
`r
w
Lii
0 6
`r10
0
L
10j
1O?
10 ______ _____ _____ ______
4			6			8			10			12			14			16
MULTIPOLE ORDER
18
Figure 8.1: Error ill force computed using multipoles
as the order of multipoles is increased. Table 8.1 and Figure 8.1 show the results.
The force error used was the maximum relative absolute error i.e.
Error = all particles Actualforce--HM?o$tiPAeforce (8.1)
The semilog plot of multipole order against the force error is almost a straight line
as expected. The convergence is not exactly linear because the multipole expansion
is a mixed-sign Taylor series as discussed by Zhao [Zha87]
8.2.2 Accuracy of Integration
The fidelity of integration was tested using a uniformly distributed set of 100 par-
ticles. The temporal multipole algorithm was compared against the individual
time-step scheme for position and velocity errors. Relative error in energy con-
servation was also observed. Increased accuracy in integration involved changing
69
three parameters. The initialization error (leapfrog scheme was used in initial-
ization) was made smaller by decreasing the leapfrog time-step (TL). The force
computation error was made smaller by increasing the order p of the multipoles
used. (q was fixed at 4 since we used only 4th order divided differences). Finally,
the efficiency parameter ? was made smaller to decrease the integration error.
Since leapfrog integration is second order, these parameters have to be scaled in
the proportionality TL2 776 N 2--H?772 since that is the extent of error in position
due to initialization, integration and force calculation respectively.
Table 8.2 and Figure 8.2 show the results. Again convergence is observed on
the log-log plot of integration precision against the errors. Note that at the highest
accuracy of integration, there is no improvement in energy error. This seems to be
due to round-off error since the direct scheme also has this level of energy error.
Table 8.2: Accuracy of Integration and Errors
Order			Accuracy			Error
p			2--H?			Positiona			Velocityb			EnergyC
4			6.25 x			10--H2			8.62			x 10--H11			1.75 x 10--H6			6.67			x io--H9
7			7.81 x			io--H3			2.94			x 10--H12			8.80 x 10--H8			3.53			x 10--H11
10			9.76 x			i0--H4			2.30			x 10--H13			4.48 x iO--H9			4.69			x 10--H12
14			6.10 x			iO--H5			2.97			x iO--H15			9.53 x 10--H11			1.24			x 10--H13
17			7.63 x			10--H6			3.64			x 10--H16			1.06 x 10--H11			1.37			x 10--H13
a??er?ge Absolute Error = z,ffi1 lAxi /N
bMaximum Absolute Error --H maxN lAvi
cAbsolute Relative Error = AEE I
8.3 Sequential Algorithm: Results
In the sequential case, three different distributions were used to test the algorithm
against other standard algorithins. One was a homogeneous distribution ("RAN-
DOM") with a uniform distribution of particles. The other two were power law
distributions with density p(r) oc ??" and velocity v(r) oc ?(1--H?)/2 "POWER
70
SCALING OF ERROR WITH PRECISION
In
106
10?
O? 10'
w
101:
0
x
VELOCITY
6
10uj
0
x
ENERGY			+
10 10?
*
10?
POSITION
A			3
10			10			10
PRECISION IN INTEGRATION
10?
Figure 8.2: Integration precision against energy, position and velocity errors
LAW 1" corresponded to a value 1 for a and "POWER LAW 2" corresponded to
a value 1.5 for a. Figure 8.3 shows a sample distribution of 500 particles for each
of the three cases. (The `+` shaped empty region in POWER LAW 2 at the origin
seems to be an artifact of the pseudo-random number generator used to gener-
ate the distribution.) Finally, the asymptotic scaling of the sequential temporal
multipole scheme was tested.
All results are with a multipole expansion of order 7 (accuracy of 10--H2 in
force calculations). The time-steps for the shared time-step schemes were chosen
to be 70?0N1 for RANDOM, 0.0025 for POWER-LAW 1 and 07N02 for POWER-LAW 2.
A
These were approximately the minimum time-step chosen by the direct individual
time-step scheme for that distribution.
A simple Barnes-Hut algorithm was also implemented for comparison purposes.
(The opening parameter in the algorithm, 0 was set to 0.5.) So we had three al-
gorithms: Direct, Barnes-Hut and Multipole. In addition, we had two integration
methods: shared time-step with leapfrog and individual time-steps with divided
71
??:?:
A
r?
??
;;? ??x? ?
IL
POWER LAW 1
RANDOM			POWER LAW 2
Figure 8.3: Test Distributions of 500 Particles
differences. (Note that multipole with shared time-steps is the same as the fast
multipole algorithm while multipole with individual time-steps is the same as the
temporal multipole algorithm.) These six schemes were all tested for a varying
number of particles for each of the three distributions mentioned above. In each
case, energy and momentum were monitored for conservation. To make computa-
tions for differing numbers of particles comparable, each distribution was integrated
for a time equivalent to 50 shared time-steps.
8.3.1 Homogeneous Distribution
Table 8.3 and Figure 8.4 summarize the timing results. Cases in which time
taken was excessive were omitted. The energy and momentum errors are also
shown. In comparing the algorithms, the time taken is not the sole criterion. The
errors in conservation of energy and momentum can be equally or more important
depending on the application. In each algorithm, these errors can be controlled by
changing parameters, in which case the time taken will change.
Nevertheless, we are most interested in the algorithmic complexity and the time
taken. Both of the direct algorithms have complexities proportional to 0(N2) as
is visible from the graph. The individual time-step scheme performs considerably
better than the shared time-step scheme for all three methods. The Barnes-llut
algorithm has an almost linear performance but the slope is much steeper than
for the multipole algorithms. Nevertheless, it catches up in performance with the
direct shared time-step method at a 1000 particles. The sudden increase in slope
between 1000 and 2000 particles for the multipole algorithm is apparently because
the asymptotic regime has not yet been reached, and an increase in the number of
200
180
160
140
a?
z 120
z
Lu
? 100
w
80
60
40
20
72
RANDOM: TIME TAKEN ViS N
0
500			1000
__ Direct			--H--H
o shared time-steps
x
-x
0
-x
1500			2000			2500
N = NUMBER OF PARTICLES
Multi
3000			3500			4000
... BarnesHut
x individual time-steps
Figure 8.4: Time taken by various algoritlims for homogeneous distributions
levels of the tree affects the performance.
8.3.2 Power-Law Distributions
Tables 8.4, 8.5 and Figures 8.5, 8.6 show the obtained results.
The direct schemes show no change in performance. However, all multipole
schemes perform slightly worse since the number of tree levels is larger. In par-
ticular, Barnes-Hut performs significantly worse. Since there are 9 levels in the
tree for POWER? LAW 2 for a 1000 particles, this is to be expected. The direct
scheme with individual time steps performs as well as or better than all the shared
time-step schemes even for a 1000 particles, for POWER LAW 1, demonstrating
the importance of temporal optimization.
Since, we implemented particle consistency corrections only at the bottom level,
150
z
z
w
w
73
POWER LAW 1: TIME TAKEN VIS N
0			>c
0			- ?- -
Yoo			200			300			400			500			600			700
N = NUMBER OF PARTICLES
--H Direct			--H--H Multi
o shared time-steps
800			900			1000
... Barnes?Hut
x individual time-steps
Figure 8.5: Time taken by various algorithms for PONvER? LAW 1 distribution
the tree algorithms with individual time-steps did not conserve momentum as well
as the other schemes for POWER? LAW 2. For a 1000 particles, the tree individual
schemes conserved momentum to only 1 part in io4, while the direct individual
scheme conserved momentum to 1 part in io5. In general, it also seems to be true
that the shared time-step schemes conserve momentum better while the individual
time-step schemes conserve energy better. Put this could be an artifact of the
different integration methods used.
8.3.3 Asymptotic Scaling
To test the asymptotic scaling, the temporal multipole algorithm was run on dif-
ferent numbers of uniformly distributed particles, for a fixed physical time. A
distribution with more particles is treated as a higher resolution sampling of a
74
160
POWER LAW 2: TIME TAKEN VIS N
140
1201
Cnz?100l
z
? 801
w
? 601
40
20
o
Yoo			200			300			400			500			600			700
N = NUMBER OF PARTICLES
__ Direct			--H --H Multi
0 shared time-steps
800			900			1000
... Barnes?llut
x individual time-steps
Figure 8.6: Time taken by various algorithms for POWER LAW 2 distribution
homogeneous distribution.
Table 8.6 and Figure 8.7 depict the results. As expected, the log-log plot of
time taken against number of particles is a straight line. The slope of the line is
1.535 in close agreement with the value of 1.5 predicted theoretically. We use only
N above 4000 since the asymptotic regime seems to be reached only beyond this
number of particles.
8.4 Parallel Algorithm: Results
In the parallel case, only the direct algorithm with shared time-steps and the multi-
pole algorithms with shared and individual time-steps were tested for homogeneous
distributions (no load balancing). Most of the powerful parallel machines being
75
TIME TAKEN vis N
E
? 10
w
w
0
4000			N (Iogarithmic)			6000
7000			8000
Figure 8.7: Asymptotic scaling with N of Temporal Multipole algoritlim
built have 2D mesh architectures. Therefore, even though the implementation was
on a hypercube1 only 2D mesh connections were actually used. A hypercube im-
plementation was not done since it would not have given any new insights for as
small a number of processors as 16 or 32.
Since the multipole algorithm requires a square domain whose side is a power
of 2, the number of processors is constrained to be a power of 4. The algorithm
was tested on 1, 4 and 16 processors. The iPSC/860 has a relatively high latency
of 75 microseconds, while one floating point operation takes 0.3 microseconds.
That is, communication is very expensive compared to computation. We break up
the total time taken into into three parts: Tcomp, the time spent on computation;
Twait, the time spent waiting for messages; and ToH the time spent in preparing,
sending and receiving messages1. The communication cost Tcomm is the sum of
Twait and ToH.
1Due to measurement problems, the assignment of costs to Twait and ToH is not very accurate.
ToH also includes some time spent in busy-wait loops.
76
8.4.1 Computation and Communication
Table 8.7 and Figure 8.8 show the computation and communication times for
16 processors for the different algorithms for different N. The direct algorithm is
not shown in the figure since communication time is negligible and computation
time is excessive, The temporal multipole algorithm is more efficient in both
communication and computation than the fast multipole algoritlim. Note also that
the communication time for the temporal algorithm remains almost unchanged
with the number of particles.
100
90
80
70
Co
z 60
z
w
? 50
w
? 40
30
20
10
TIME VIS N ON 16 PROCESSORS
-o
x			>c
"0
1000
__ Total time
0 shared time-steps
2000			3000			4000			5000			6000
N = NUMBER OF PARTICLES
... Communication time
x individual time-steps
7000
Figure 8.8: Time taken by multipole algorithms on 16 processors
Figure 8.9 depicts the percentages of times each algorithm spends in computing
versus communicating. The direct algorithm spends almost all its time comput-
ing while the shared time-step multipole algorithm spends more than half its time
communicating. The temporal multipole algorithm also spends spends a large per-
77
Percentage Times (16 processors)
100%
80%
60%
40%
20%
0%
N=1600			N-3200			N=6400
_			_			=
_ _			_			==
=
--H			ci,			=			ci,
-+			.-+			-+			.--+
0			--H			+			0			--H			-+			--H+			.-+
-			0			=
=			=			0			=			=			0			=			=
E			=			E			=			E			=
E			E			E
--H comm overheads [ll? wait time
computation
Figure 8.9: Tcomp, Twait and ToH for various configurations on 16 processors
centage of its time communicating but this percentage goes down as the number of
particles goes up. Contrast this with the corresponding times taken by 4 processors
as listed in Table 8.8.
8.4.2 Scaling with Number of Processors
Figure 8.10 compares the compute/communicate ratio for 16 processors with that
for 4 processors. The 4-processor version spends more than 90% of its time com-
puting. Part of the reason for this is that during initialization the shared time-step
algorithm is used which is very communication intensive. With 4 processors, all
processors being adjacent to each other, this cost does not show up. This initial
communication cost can be estimated using the communication costs of the shared
time-step multipole algorithm in the table as ? 4.48 for 800 particles. Subtracting
this out, we obtain a much smaller value of 2.09 as the cost of communication for
800 particles on 16 processors. The second reason is that with 4 processors, each
78
Computation vis Communication
100%
90%
80%
70%
60% -
50% -
40% -
30% -
20% -
10%
0%
N-800			N=1600
16
4
1 6			Number of Prncessors			4
? computation =? communication
Figure 8.10: Computation v/s Communication on 4 and 16 processors
processor has at most 3 neighbors while with 16, some have as many as 8 neigh-
bors. Figure 8.11 shows how the time taken scales with the number of processors.
The cost of communication seems to be increasing with the number of processors.
llowever, the local nature of the algorithm guarantees that the communication
cost will become constant when most processors have 8 neighbors and the integra-
tion is performed long enough to amortize initialization costs. Beyond this point,
performance will increase almost linearly (remember the logarithmic factor) with
the number of processors. For the sake of completeness, computation times for 1
processor are given in Table 8.9.
79
2000			3000			4000			5000
N = NUMBER OF PARTICLES
Communication time
N VIS TIME TAKEN
18
1000			6000
__ Total time
11
14
Co 1:
z
z?10
w
uJ
B
2
016
04j			04
0
7000
Figure 8.11: Time taken by temporal multipole algorithm on 1,4 and 16 processors
80
Table 8.3: Time taken (in s) for different algoritlims: RANDOM
Direct			Multi			BarnesAlut
Shared			Individual			Shared			Individual			Shared			Individual
T6			1.29			0.58			1.49			0.48			1.96			0.66
100			EC			4.45e-5			3.55e-6			4.45e-5			3.55e-6			4.45e-5			3.55e-6
Md			1.81e-18			4.10e-18			2.71e-18			1.34e-18			1.1e-8			1.15e-8
T			4.74			2.68			3.68			2.24			8.33			3.78
200			E			1.78e-5			4.00e-6			1.78e-5			3.93e-6			1.78e-5			3.93e-6
M			5.42e-19			3.53e-4			2.53&8			4.00e-3			7.1254e-8			4.00e-3
T			29.01			8.18			14.54			4.69			32.03			9.51
500			E			1.04e-5			8.34e-7			1.04e-5			8.36e-7			1.04e-5			8.34e-7
M			7.58e-19			6.89e-8			5.40e-9			6.35e-8			3.04e-8			6.40e-8
T			140.02			38.04			26.34			8.44			84.21			25.33
1000			E			1.03e-5			8.23e-7			1.03e-5			8.24e-7			1.03e-5			8.23e-7
M			9.21e-19			3.82e-9			3.76e-9			9.91e-9			1.16e-8			1.78e-8
T			-			-			75.74			22.74			-			65.46
2000			E			-			-			2.56e-6			2.05e-7			-			2.05e-7
M			-			-			3.07e-9			6.21e-9			6.38e-9
T			-			-			111.01			35.09			-			194.29
4000			E			-			-			2.54e-6			2.03e-7			-			2.03e-7
M			-			-			1.53e-9			1.39e-9			-			3.83e-9
aNumber of particles
bTime taken (in s)
CEnergy conservation: ABE I
dMomentum conservation: IA Zi mivil
82
Table 8.8: Time taken (in s) for different algorithms on 4 Processors
Number of			Direct Shared			Multi Shared			Multi Individual
Particles Tcomp Twait ToH Tcomp Twait ToH Tcomp Twait ToH
400			16.78			0.33			0.02			5.01			2.94			0.87			1.93			0.50			0.28
800			68.82			0.64			0.02			7.36			2.73			0.87			3.40			0.23			0.25
1600			276.00			1.23			0.02			22.41			1.56			1.25			8.13			0.14			0.30
Table 8.9: Time taken (in s) for different algorithms on 1 Processor
Number of Direct Shared Multi Shared Multi Individual
Particles			Tcom			Tcomp			Tcomp
100			8.04			2.49			1.05
200			33.73			5.65			2.30
400			136.75			17.41			6.23
Appendix A
Force-determined Time-steps
A.1 Pathology with Force-determined
Time-steps
Consider the system of 4 particles shown in figure A.1. The velocity of m is much
larger than that of mi or m2. The Y-component of the forces exerted by mi and
m2 on m cancel each other out. Further, in traveling distance r, the force on rn
changes negligibly since M and I? are both very large. In short, the force on m is
large and the derivative of the force on rn is small. The time-step determined for m
using force derivatives is very large and rn "whizzes" past mi and m2 disregarding
a possible close encounter.
Now, the positions of mi and m2 can be adjusted in such a way that the
Y-component of the force exerted by mi on m does not exactly cancel the force
exerted by m2 on ni. The difference is insignificant at distance r. When m is
much closer, however, the difference in their forces on m becomes appreciable and
can change the trajectory of rn. With force-determined time-steps, this effect will
not be discovered by the algorithm. The position oriented scheme, on the other
hand, will correctly discover this problem. In a normal N-body simulation such
situations will occur with very low probability.
83
84
m2
m a
Y
r
ml
R
Figure A.1: Pathology with Force-determined Time-steps
m ? M and ? ? R
A?row at each pa?tic1e indicates velocity (magnitude and divection).
Bibliography
[Aar63j
[Aar85i
Sverre J. Aarseth. Dynamical evolution of clusters of galaxies, I.
Monthly Notices of the Royal Astronomical Society, 126:223,1963.
Sverre J. Aarseth. Direct methods for N-body simulations. In
Jeremiah U. Brackbill and Bruce I. Cohen, editors, Multiple Time
Scales, pages 377--H418. Orlando: Academic Press, 1985.
[AC73] A. Alimad and L. Cohen. N-body gravitational problem. Journal of
Computational Physics, 12:389--H402,1973.
[App85] A. Appel. An efficient program for many-body simulation. Siam Jour-
nal on Scientific and Statistical Computing, 6(1):85--H103, 1985.
[Bar90] Joshua E. Barnes. A modified tree code - dont laugh - it runs. Journal
of Computational Physics, 87(1):161--H170, 1990.
[BH86] Joshua Barnes and Piet Hut. A hierarchical O(N log N) force calcula-
tion algorithm. Nature, 324:446--H449,1986.
[Edw92] Joseph Edwards. An Elementary ?reatise on the Differential Calculus.
New York: MacMillan and Co., 1892.
[GG9o] L. Greengard and W. D. Gropp. A parallel version of the fast multipole
algorithm. Computers and Mathematics with Applications, 20(7):63--H
71, 1990.
[GH83]
J. Guckenheimer and P. Holmes. Nonlinear Oscillations, Dynamical
Systems, and Bifurcation of Vector Fields. New York: Springer-Verlag,
1983.
[GB87] L. Greengard and V. R?okhlin. A fast algorithm for particle simulations.
Journal of Computational Physics, 73:325--H348,1987.
85
86
[Gre87]
[HE89J
[Her90]
[HJ87]
[HJ90J
[JP89]
Leslie Greengard. The Rapid Evaluation of Potential Fields in Particle
Systems. Ph.D. dissertation, Yale University, 1987.
R. W. Hockney and J. W. Eastwood. Computer Simulation Using
Particles. New York: A. Hilger, 1989.
Lars Hernquist. Vectorization of tree traversals. Journal of Computa-
tional Physics, 87(1): 137--H147,1990.
C. T. Ho and 5. L. Johnsson. On the embedding of arbitrary meshes
in boolean cubes with expansion two dilation two. In International
Conference on Parallel Processing, pages 188--H191,1987.
C. T. Ho and 5. L. Johnsson. Embedding meshes in boolean cubes by
graph decomposition. Journal of Parallel and Distributed Computing,
8:325--H339,1990.
J. G. Jernighan and D. H. Porter. A tree code with logarithmic re-
duction of force terms, hierarchical regularization of all variables and
explicit accuracy controls. Astrophysical Journal Supplement Series,
71(817), 1989.
[Kat88] J. Katzenelson. Computational structure of the N-body problem.
Technical Report Al Memo 1042, Al Lab, MIT, 1988.
[LB91]
J. F. Leathrum, Jr. and J. A. Board, Jr. Parallelization of the fast
multipole algorithm using the b012 transputer network. Transputing
91. Proceedings of the World TYansputer User Croup (WOTUC) Con-
ference, pages 296--H310,1991.
[Lei92J F. Thomson Leighton. Introdution to Parallel Algorithms and Architec-
tures, volume 1, pages 234--H235. California: Morgan Kauffman, 1992.
[MA93] Stephen L. W. McMillan and Sverre J. Aarseth. An O(N log N) inte-
gration scheme for collisional stellar systems. Preprint, 1993.
[Mak90j Junichiro Makino. Vectorization of a treecode. Journal of Computa-
tional Physics, 87(1):148--H160, 1990.
Stephen L. W. McMillan. The vectorization of small-N integrators. In
P. Hut and 5. L. W. McMillan, editors, The Use of Supercomputers in
Stellar Dynamics, page 156. New York: Springer-Verlag, 1986.
[McM86]
87
[MH88j
[MH89]
[OME+92j
[RT93J
[Sal9 1]
[SllT+92]
[5L91]
[Sun92]
Junichiro Makino and Piet Hut. Peftormance analysis of direct N-body
calculations. The Astrophysical Journal Supplement Sernes, 68:833--H
856, 1988.
Junichiro Makino and Piet Hut. Gravitational N-body algorithms - a
comparison between supercomputers and a highly parallel computer.
Computer Physics Repo?s, 9(4):199--H246, 1989.
S.K. Okumura, J. Makino, T. Ebisuzaki, T. Ito, T. Fukushige, D. Sug-
imoto, E. Hashimoto, K. Tomida, and N. Miyakawa. Grape-3: Highly
parallelized special-purpose computer for gravitational many-body sim-
ulations. In V. Milutinovic, B.D. Shriver, J.F. Nunamaker, and R.H.
Sprague, editors, Proceedings of the Twenty-Fifth Hawaii h?ternational
Conference on System Sciences, volume 25, pages 151--H160,1992.
John II. Reff and Stephen R. Tate. The complexity of N-body sim-
ulation. In International Colloquium on Automata, Languages and
Programming, 1993.
John K. Salmon. Parallel Hierarchical N-Body Methods. Ph.D. disser-
tation, California Institute of Technology, 1991.
Jaswinder Pal Singh, Chris Holt, Takashi Totsuka, Anoop Gupta, and
John L. Hennessy. Load balancing and data locality in hierarchical N-
body methods. Technical Report CSL-TR-92-505, Computer Systems
Laboratory, Stanford University, 1992.
K. E. Schmidt and M. A. Lee. Implementing the fast multipole al-
gorithm in 3 dimensions. Journal of Statistical Physics, 63(5-6):1223--H
1235,1991.
Sridhar Sundaram. Periodic updates in processor networks. Technical
Report TR 92-1295, Computer Science Department, Cornell Univer-
sity, 1992.
[v1160j 5. von Hoerner. Z. Astrophys., 50:184--H214,1960.
[Zha87] Feng Zhao. An 0(N) algorfthm for three-dimensional N-body simula-
tions. Masters dissertation, MIT, 1987.
[ZJ9l] Feng Zhao and 5. L. Johnsson. The parallel multipole method on
the connection machine. Siam Journal on Scientific and Statistical
Computing, 12(6):1420--H37, 1991.
