BIB-VERSION:: CS-TR-v2.0
ID:: CORNELLCS//TR94-1440
ENTRY:: 1994-08-26
ORGANIZATION:: Cornell University, Computer Science Department
LANGUAGE:: English
TITLE:: Implementing and Exploiting Static Speculation on Multiple 
        Instruction Issue Processors
AUTHOR:: Moudgill, Mayan 
DATE:: August 1994
PAGES:: 182
ABSTRACT::
Trends in processor architecture and design suggest that static speculation 
will become a candidate for implementation on future high-performance 
processors. In this dissertation, we shall examine issues related to the 
implementation and exploitation of static speculation.  

There are four primary results:

1) Precise Exceptions: Prior work in static speculation has not examined the 
interaction between exception handling and speculative instructions in any 
great detail. We investigate this interaction, exhibiting certain problematic 
subtleties that arise, and show how they can be overcome.

2) Speculative Tagging: Earlier proposals for implementing speculative 
instructions tended to have several drawbacks, including restricted 
applicability. We introduce speculative tagging, a new, more general, 
mechanism for specifying static speculation, and show it is possible to 
optimize exception recovery through this mechanism.

3) Whole-DAG Scheduling: Recently, there has been some work on scheduling 
regions of acyclic code larger than a basic block so as to take advantage of 
static speculation. We describe another such algorithm, known as whole-DAG
scheduling, that contains innovations that make it more flexible, and allow 
it to use better heuristics.

4) Dynamic Speculation: The work on static speculation and exceptions 
suggested an alternative approach to implementing dynamic speculation. This 
approach results in simpler hardware than prior schemes, and is consequently 
cheaper to implement and potentially has a lesser impact on cycle-time.

Additionally, we report results of experimental studies measuring the
effectiveness of whole-DAG scheduling. In it, we show, among other
things, that our scheduling technique can result in near-optimal
schedules, and that the selection heuristics we adopt are superior to
those used in earlier algorithms.
END:: CORNELLCS//TR94-1440
BODY::
Implementing and Exploiting Static Speculation
On Multiple Instruction Issue Processors
Mayan Moudgill
Ph.D Thesis
TR 94-1440
August 1994
Department of Computer Science
Cornell University
Ithaca, NY 14853-7501
IMPLEMENTING AND EXPLOITING STATIC SPECULATION
ON MULTIPLE INSTRUCTION ISSUE PROCESSORS
A Dissertation
Presented to the Faculty of the Graduate School
of Cornell University
in Partial Fulfillment of the Requirements for the Degree of
Doctor of Philosophy
by
Mayan Moudgill
May 1994
Qc Mayan Moudgill 1994
ALL RIGHTS RESERVED
IMPLEMENTING AND EXPLOI?NG STA?C SPECULA?ON ON MUL?PLE
INSTRUCTION ISSUE PROCESSORS
Mayan Moudgill, Ph.D.
Cornell University 1994
Trends in processor architecture and design suggest that static speculation will become a
?andidate for implementation on future high-performance processors. In this dissertation, we shall
examine issues related to the implementation and exploitation of static speculation. There are four
primary results:
Precise Exceptions Prior work in static speculation has not examined the interaction between
exception handling and speculative instructions in any great detail. We investigate this
interaction, exhibiting ceflain problematic subtleties that arise, and show how they can be
overcome.
Speculative Tagging Earlier proposals for implementing speculative instructions tended to have
several drawbacks, including restricted applicability. We introduce speculative tagging, a
new, more general, mechanism for specifying static speculation, and show it is possible to
optimize exception recovery through this mechanism.
Whole-DAG Scheduling Recently, there has been some work on scheduling regions of acyclic
code larger than a basic block so as to take advantage of static speculation. We describe
another such algorithm, known as whole-DAG scheduling, that contains innovations that
make it more flexible, and allow it to use better heuristics.
Dynamic Speculation The work on static speculation and exceptions suggested an alternative
approach to implementing dynamic speculation. This approach results in simpler hardware
than prior schemes, and is consequently cheaper to implement and potentially has a lesser
impact on cycle-time.
Additionally, we report results of experimental studies measuring the effectiveness of whole-
DAG scheduling. In it, we show, among other things, that our scheduling technique can result
in near-optimal schedules, and that the selection heuristics we adopt are superior to those used in
earlier algorithms.
Biographical Sketch
Mayan Moudgill was born on 1 ith June, 1967 in the city of Bhopal in India After a migratory
childhood, during the course of which he attended 7 schools, he finally graduated in 1984 from
La Martiniere College, Lucknow. He received the Bachelor of Technology in Computer Science
and Engineering in 1988 from the Indian Institute of Technology, Kanpur. He joined the graduate
program of the Department of Computer Science at Cornell University in August, 1988.
ili
This dissertation is dedicated to my parents and my wife, for all the support they have given me
over the years.
iv
Acknowledgements
I wish to thank Keshav Pingali and Stamatis Vassiliadis for the help and guidance they have given
me over the years. It is difficult to quantify how much their advice and support have meant to me,
and how much they have contributed to this dissertation; suffice it to say that, without them, I could
never have achieved what I have.
I would like to thank my thesis committee, i.e. Bard Bloom, Geoff Brown and Keshav Pingali
for their guidance and support.
I have been fortunate to work several summers at IBM. The people I worked with there--HBob
Iannucci, Paul Suhler, Kei Hiraki, Katamudhari Eknadham, Ragunathan Raikumar, Mike Hale, Jim
Chambers, Bill Virun--Hmade my stays there enjoyable and educational. I would especially like to
thank Steven Gregor for teaching me about system architecture. He and Tom Jeremiah evaluated
the design in Chapter 7 and were instrumental in producing a gate-count for it.
I would like to thank my fellow students, especially Micah Beck, Richard Johnson, Wei Li,
Paul Stodghill, Vladimir Kotlyar, Sudeep Gupta, Indu Prakas Kodukola and Richard Huff, for
interesting conversations.
A special thanks to my various house-mates, V. V. Bhaskar, Sandip Bose, Navin Budliiraja,
Suresh Chari, Radha Jagadeesan, Jayesh, and Thshar Deepak Chandra for the many stimulating
discussions that have influenced me and hence this thesis.
Finally, I'd like to thank my wife, Priti Srivastava, for marrying me, for her support, and for
keeping me sane.
v
Table of Contents
2
3
Introduction
1.1 Problems, Previous Approaches, and Our Approach . - -
1.1.1 Mechanisms for Static Speculation
1.1.2 Static Speculation and Exceptions
1.1.3 Static Scheduling for Multiple Instruction Issue
1.1 A Performance of Static Speculation
1.2			Significance and Contributions
1.3			Organization			. . .
Background and Related Work
2.1			Instruction Level Parallelism
2.1.1			Pipelining
2.1.2			Multiple Instruction Issue			. -
2.2			Constraints on ILP. . .
2.2.1			Resource -
2.2.2			Data Dependence
2.2.3			Control Dependence - . . .			. .
2.2A Increasing Exploitable ILP Dynamically
2.3 Scheduling
2.3.1			Basic Block .
2.3.2			Cyclic Code
2.3.3			Acyclic Code
2.4			Speculation .
2A.1			Vocabulary			- . . .
2.4.2			Mechanisms
2.5			Surnmary			. . .			.
Exception Handling and Static Speculation
3.1 Introduction
3.1.1			Precise Exceptions . . -			. . .
3.1.2 Exception Handlers and Precise Exceptions
3.1.3			Abstractions			. -
vi
4
4
5
5
6
7
8
9
9
10
12
12
12
14
15
17
19
19
20
21
22
22
24
29
30
30
30
31
32
33
34
34
34
37
37
39
40
40
41
41
44
44
45
47
47
47
48
49
49
51
53
54
55
56
56
57
57
58
60
61
61
61
64
65
66
69
69
69
70
74
3.1 A Origin vs. Commit: Producing Speculative Code
3.1.5			Assumptions-
Precise			Exceptions & Static Speculation
3.2.1			The Breakdown of Precision. .
3.2.2			Equivalence
3.2.3			Sequentialization . .			. .			. -
3.2.4 Sequentialization Without Shadow Registers
Tnin1?			t;nam
3.2
3.3			,????men?? ecise ?cepUoiis			-
3.3.1			Constraining Speculation
3.3.2			A Proof of Equivalence
3.3.3			Restart. . .
3.3.4			Writing Exception Handlers			for Static Speculation
3.3.5			Discarding Precision. .			. .
Conclusions
3A
4
Speculative Tagging
4.1 Speculation Tags - .
4.1.1 Success and Commits
4.1.2 Failure and Flushing
4.1.3 Register Files -
4.1 A Memory Operations
4.2 Multi-Path Speculation.
4.3 Implementing Fxceptions
4.3.1 Register Pressure
4.3.2 Restart Blocks
4.3.3 Lazy Restart
4.4 Miscellanea -
4.5 Design .
4.5.1 Summary
4.5.2 Cost Comparison
4.6 Conclusions .
5
Scheduling with Speculative Instructions
5.1 Background
5.1.1 Information Representation
5.1.2			List Scheduling
5.1.3			Selection Heuristics			-
5.1 A Global Code Motion
5.1.5 Probabilities in Scheduling
5.2 Whole-DAG Scheduling
5.2.1			Frontiers . . . .			. .
5.2.2 Readiness and Ready Lists.
5.2.3 Miscellanea
vii
5.3
5.4
5.5
5.2A Summary			. -
Selecting Instructions
5.3.1 Computing Late
5.3.2 Selection Heuristics
Comparison with Prior Work.
Summary -
Results
6.1			Preliminaries - .			-
6.1.1			The Pidgin Compiler
6.1.2			Measuring DAGs
6.1.3			Benchmarks
6.2 The Benefits of Speculation -
6.2.1 The Importance of DAGs
6.2.2 Non-numerical Programs
6.2.3 Dynamic Speculation
6.3 The Scheduling Algoritlim
6.3.1			The Variables
6.3.2			The Scenarios
6.3.3			Evaluating Strategies
6.3A			Evaluating Heuristics			- -
6.3.5			Other Concerns
6.4			Tag Usage
6.4.1 The Influence of Strategies
6.4.2			Architecture			-
6.5			Conclusion
Dynamic Speculation
7.1 Prior work . .
7.2 Our approach
7.3 Reclaiming physical registers
7A Branches and speculation
7.5			Precise interrupts.
7.6			Summarizing the design
7.7			Optimization
7.8			Implementation
7.9			Comparison
7.10			Conclusion . .			-
6
7
8
Conclusions and Future Work
8.1 Conclusions and Contributions -
8.2 Future Work
8.2.1 Non-Speculative Architectures
viii
74
76
76
78
80
81
82
82
82
84
85
85
85
86
88
89
89
90
93
96
96
104
104
107
107
109
109
112
114
117
120
122
123
124
127
128
129
129
130
130
8.2.2
8.2.3
8.2A
Register Allocation - - -
May-Dependence Speculation
Dynamic Multi-Path Speculation
A Proof of Speculative Equivalence
Al The Model - -
A.l.1 Atoms
A.l.2 Map
A.1.3 Program
A.1.4 State...
A.1 .5 Sequences
A.1.6 Command
A.1.7 Program Execution
A.2 Similarity
A.3 Compiler Constraints
AA Speculative Equivalence
A.5 Proof - -
B Implementation of Whole-DAG Scheduling
B.1 Preprocessing Functions - - - -
B.1 .1 Splitting Control-Edges
B.1 .2			Isolating DAGs - -			-
B.2			Computing EARLY and			LATh -
B.2.l			Computing Early
B.2.2			Computing LATh			-
B.3			Scheduling DAGs
B.3.l			Determining			Readiness
B.3.2 Recomputing LATh - -
BA Best-3 and Tags
B.4.1 Duplicate Speculation -
BA.2 Live-through - -			-
Bibliography
ix
130
130
130
132
132
132
132
134
134
135
135
139
141
142
143
144
150
150
151
151
151
151
156
157
157
158
159
160
160
163
List of Tables
6.1
6.2
6.3
6A
6.3
6.6
6.7
6.8
6.9
6.10
Input Programs & DAGs
Dynamic Execution Frequency
Overall Improvement
Dynamic Frequency of DAGs in C Code
Scheduling Algorithms -
Result Latencies . -
Actual Performance: Balanced Probabilities
Actual Performance: Profiled Probabilities
Selection Heuristic Performance			- . -
Code Expansion
x
86
87
87
88
90
91
94
94
97
98
List of Figures
1.1
1.2
1.3
lA
1.5
1.6
2.1
2.2
2.3
2.4
2.5
2.6
2.7
2.8
2.9
2.10
2.11
2.12
2.13
2.14
2.15
2.16
2.17
2.18
2.19
2.20
2.21
3.1
3.2
3.3
3A
3.6
Original Code Sequence.
Optimal Code Sequence
Sample Loop
Loop Unrolling
Static Speculation
Exceptions
Instruction 1-evel Parallelism
Pipelined Execution
Multiple Instruction Issue
Data Dependencies
Removing False Dependencies
Control Dependencies
Preserving Data Dependencies
Control-Independent Code Motion
Register Renaming
3--H1 ALU - .
Speculation
Software Pipelining
Speculation			. .
Speculation			- .			.
Boosting.
Boosting with "don't cares . . -
Speculation Inexpressible with Boosting
Poison-Bit
Register Manipulation in Boosting
Original Code for Fig. 2.19 -
Speculation with Renaming
Pipelined Bxceptions
Representing Speculation . . .			-
Origin vs. Commit . . -
Speculation Changes Inputs .			-
Prodiicin? a Non-Seouentializable Pro?am
xi
2
2
3
3
5
9
11
13
14
15
15
16
17
17
18
18
20
22
24
25
26
26
27
28
28
29
31
32
34
35
36
3.5
3.7
3.8
3.9
4.1
4.2
4.3
4A
4.5
4.6
4.7
4.8
4.9
4.10
5.1
5.2
5.3
5A
5.5
5.6
5.7
5.8
5.9
5.10
5.11
5.12
5.13
5.14
5.15
5.16
5.17
5.18
6.1
6.2
6.3
6A
6.5
6.6
6.7
6.8
Problems with Exceptions
Sequentialization			. . -
Restarting
Re-executing Speculative Instructions
Speculative Tagging
Grouping using Speculative Tagging
Speculation with Flushes
Renaming Memory
Tagging with Renaming
Tagging with Copying
Multi-Path Memory Speculation
Multiple Memory Speculation
Illegal Speculation
Restarting after an Exception
Control-Flow Graph.
Def-Use Chains --			. .			. -
Dependence Flow Graph			-
List Scheduling - .			. .
Safe Code Motions
Unsafe Code Motions			. .			-
Predication
Frontier . .
Whole-DAG Scheduling: Top-Level			Structure
Computing Ready
Controlling Code Motion			.			. .
Implementing Code Motion
Eliminating Redundancies			-			. .			. .
Whole-DAG Scheduling: Detailed			Breakdown
Computing late
Selection Heuristic--HCase I.
Selection Heuristic--HCase II
Selection Heuristic--HCase III
The Pidgin Compiler			. .			. -
Instrumentation . .
Scheduling and Simulation . .			. .
Computing Average Execution Time
Distribution of Branch Probabilties .			- .			.
DAG vs. Register Pressure: 4 issue, best 5
DAG vs. Register Pressure: 8 issue, best 8			.
Weighted DAG vs. Register Pressure: 4 issue, best 5
xii
36
38
42
45
48
49
50
50
52
52
53
54
54
55
62
62
64
65
67
68
68
70
71
71
72
75
76
77
78
79
79
80
83
83
84
85
92
98
99
100
6.9
6.10
6.11
6.12
6.13
6.14
7.1
7.2
7.3
7.4
7.5
Weighted DAG vs. Register Pressure: 8 issue, best 8
DAG vs. Increase in Register Pressure: 4 issue, best 5 over none
DAG vs. Increase in Register Pressure: 8 issue, best 8 over none
Tag Live Through Block -			.
Two Tags for a Basic-Block
Tag Usage: best-5, 4-issue, balanced			. .
Mapping.
Reclaiming Registers
Correctly Predicted Branch			.
Mispredicted Branch
Instruction Fetch Stage
8.1 May-Dependence Speculation
A.1 Atoms and Function ?T?pes - . -
A.2 J = Pair(?', I, ? a0,
B.1
B.2
B.3
BA
B.5
B.6
B.7
B.8
B.9
B.lO
B.11
B.12
B.l3
B.14
B.15
Whole-DAG Scheduling and Simulation: Top-Level Structure
Maximal Acyclic SESE
EARLY-.
DFG Bdge Weights for			EARLY			-.			-
EARLY using def-use chains			-
EARLYusingDFG			-			.
EARLY: Data Flow			Equations
LATE: Data Flow Equations			.			- .
Computing READY. - 			-
Recomputing LATE			.			. -			-
Duplicate Speculation: I -			-
Duplicate Speculation: II			-			. -
Live-through: I -
Live-through: II			- .			- .
Live-through: III
xiii
101
102
103
105
105
106
113
116
119
121
125
131
- . . 133
141
150
152
153
154
154
155
156
157
158
159
160
161
161
162
162
Chapter 1
Introduction
Modern microprocessors are increasingly being built with the ability to issue multiple instructions
per cycle. Current generation microprocessors, such as the Motorola MC88l 10 [lii and the IBM
PowerPC [5], can issue 2 to 4 instructions every cycle. In the future, 8 or 16may not be uncommon.
This raises the question: how can this capability be fully utilized?
One way of finding and issuing the necessary number of instructions every cycle is for the
processor to include dynamic scheduliizg hardware. Every cycle, this hardware scans a large
portion of the instruction stream, from which it extracts the issuable instructions. For instance,
given the code fragment shown below in Fig. 1.1, the dynamic scheduler for a 2-issue processor
would examine all 3 instructions. It would recognize that (1) and (3) could be issued in the
current cycle, but that (2) used the result produced by (1) and so could not be issued till (1)
had completed. It would therefore issue (1) and (3) immediately, and then issue (2) in the next
cycle.
(1)			ri			:--H--H r2			+ r3
(2)			rV			:= ri			+ 8
(3)			r5			:--H--H r9			--H r2
Figure 1.1: Original Code Sequence
To keep the complexity within realistic bounds, the dynamic scheduling hardware examines
a "window" of only 16 to 32 instructions every cycle. This size appears to be practical for a
2-issue processor. However, even with this small a window, the dynamic scheduling hardware
required is complex and expensive, and could potentially increase the cycle time. To support
issue rates of 8 to 16, the window will have to be substantially larger. The complexity of the
dynamic scheduling hardware goes up at least as the square of the window-size; thus, the dynamic
scheduling hardware for higher issue rates could prove to be prohibitive to implement, and, even
if possible, will potentially be the botfieneck in reducing cycle times.
An alternative is to use less ambitious, and therefore simpler hardware. For instance, the
processor could be designed to issue instructions in the order they appear in the original program
each cycle, till it has issued as many as it can process in that cycle, or till it encounters an instruction
2
that can not be issued. In this case, if the processor is to be fully utilized, the compiler must order
instructions in the code it produces so that adjacent instructions can be issued simultaneously. This
approach is known as static schedulitig.
As an example, consider the code in Fig. 1.1. Using the in-order issue technique described
above, a 2-issue processor would issue (1) in the first cycle. (2) uses the result of (1), so
it could not be issued until the second cycle. Finally, (3) would be issued in the third cycle.
If, however, the compiler had ordered the instructions as shown in Fig. 1.2 with (1) and (3)
adjacent, the processor would have been able to issue (1) and (3) in the first cycle, followed by
(2) in the second cycle--Ha saving of a cycle.
(1)			ri			r2			+ r3
(3)			r5			r9			--H r2
(2)			r7			ri			+ 8
Figure 1.2: Optimal Code Sequence
This leads to another question: is it possible for the compiler to arrange instructions so that
the processor can find the necessary number of instructions to issue every cycle? This is clearly
dependent on the program; for instance, in the code shown in Fig. 1.2 at most 2 instructions can
be issued in the first cycle, no mafler what the compiler does. Several studies have shown that
the average number of instructions in a program that can be issued simultaneously under ideal
conditions is fairly large, possibly greater than 8 [67,52,8,35]. However, they have also indicated
that when the compiler is restricted to reordering instructions solely within basic blocks, this
number drops to between 2 and 3 [62,17,30]. Clearly, the compiler can exploit the multiple-issue
capability of processors, but to do so it must schedule regions larger than basic blocks.
One obvious extension is loops. A loop body may not contain enough simultaneously-issuable
instructions to use a processor to full capacity. However, it has been observed that instructions from
different iterations of the same loop can usually be issued simultaneously. A simple-minded way
to exploit this observation is to unroll the loop till the new loop body contains enough instructions
to fully utilize the processor [15]. The example shown in Fig. lA illustrates the procedure for a
2-issue processor using the loop shown in Fig. 1.3.
DO I = I, 16
x[i1 = x[i] + 1
ENDDO
Figure 1.3: Sample 1-oop
Clearly, a straightforward compilation of the original loop (Fig. 1.4(a)) will issue only one
instruction per cycle most of the time. In particular, only (3) and (4) can be overlapped. Thus,
it requires 4 cycles to execute one iteration of the loop. However, if the loop was unrolled once,
as shown in Fig. 1.4(b), then instructions from different iterations of the loop could be arranged to
execute simultaneously. In this case, 2 instructions could be issued every cycle most of the time,
resulting in an execution time of 5 cycles for every two iterations.
3
LO:
(1) ri
(2) ri
(3) St
(4) rO =
(5) br<=
= id x[rO?
= ri + 1
x[roi, ri
rO + 1
rO, 16, LO
(a)
Figure 1.4: Loop Unrolling
LO:
(1) ri
(1') r9
(2) r1
(2') r9
(3) st
(3') st
(4) rO
(5) br<=
x[rO]
x [rO+1]
+1
= id
= id
= ri
= r9 +
x[rO], r1
x[rO+1], r9
= rO + 2
rO, 16, LO
(b)
There has been a large amount of research on techniques that overlap instructions from sev-
eral iterations of the same loop, especially on a more sophisticated method known as software
pipelining [31,59,34,2,13,44,68,24]. These techniques can build code sequences that issue as
many instructions per cycle as the processor is capable of issuing. While it may be premature to
say that scheduling loops to fully utilize the capabilities of multiple issue processors is a mature
technology, software pipelining does appear to produce fairly good results.
The remaining problem area is the scheduling ofacyclic codes. In such codes, it is not possible
to build a good schedule by simply "borrowing" instructions from other iterations of a loop.
Instead, it becomes necessary to speculate--Hexecute an instruction before it is known whether the
instruction should have been executed.
Consider the code in Fig. 1.5(a) below. A 2-issue processor would first execute (1) , use its
result to evaluate the branch, and then, if the branch was not taken, execute (2).
(1) ri := r2 + r3
br== r1, 0, LO
(2) r5 := r9 --H r2
(a)
Figure 1.5: Static Speculation
(1) r1 := r2 + r3
(2*) r5 := r9 --H r2
br== r1, 0, LO
Thus, the processor only issues one instruction per cycle, leaving some free capacity. One way
the compiler could utilize this excess capacity is to "speculate" that the branch would not be taken
and schedule (2) from below the branch so it could be issued simultaneously with (1) , thereby
producing the code shown in Fig. 1.5(b)'. This may sometimes be useless; in the case that the
branch was taken, the speculation would have been futile. If, however, the speculation was correct,
and the branch not taken, the execution time of the program would have been decreased.
1In this dissertation, speculated instructions will be indicated by adding a * to the instruction label, or at the end of
the instruction.
4
For a compiler to be able to use speculation in a general fashion requires that the hardware
provide an additional mechanism for static speculation. To understand why, consider a program
execution where (2) in the Fig. 1.5(b) caused an exception such as overflow. Clearly, this exception
can not be reported immediately, since it is not known whether (2) should have been executed at
all. Instead, the hardware must delay reporting the exception till it is known that the instruction
should have been executed, i.e., the branch is not taken. Thus, the static speculation mechanism
must allow the compiler to inform the processor whether an instruction has been speculatively
scheduled, and if so, when its exceptions should be reported.
This dissertation deals with both the mechanisms necessary to implement static speculation,
and the techniques that enable the compiler to take advantage of static speculation when scheduling
acyclic code. It shows that such a static speculation mechanism, in combination with the appropriate
scheduling algorithm, can result in acyclic code being executed on multiple issue architectures in
a near-optimal manner.
1.1 Problems, Previous Approaches, and Our Approach
This dissertation focuses on 4 problem areas in acyclic code scheduling using static speculation: a
mechanism for static speculation, the interaction of static speculation with exception handling and
recovery, static scheduling for multiple-issue processors, and the performance of static scheduling
& speculation. We shall briefly review these areas; detailed discussion shall be deferred to later
chapters.
1.1.1 Mechanisms for Static Speculation
As indicated above, a speculated instruction needs to be treated specially, since it is now being
executed in some program runs where it would not otherwise have been executed. This becomes a
particular problem when the instruction could cause an exception.
The simplest solution, which was adopted in the early static scheduling compilers, is to speculate
only those instructions that are guaranteed not to cause an exception. This category includes
operations such as unsigned addition in C that are guaranteed by the language to never cause
exceptions. It also includes instructions where the compiler can guarantee no exceptions through
analysis.
Of course, this solution considerably restricts opportunities for speculation. However, with
additional hardware, it is possible to overcome these restrictions, and speculate instructions that
can cause exceptions. A common approach is to add special speculative instructions. These
instructions are identical to normal instructions except that exceptions caused by them are not
reported immediately; instead, through the co-operation of the hardware, any such exception is
buffered, and reported only when (and if) it is determined that the speculative instruction should
actually have executed. In Chapter 2, we shall discuss in detail two approaches to implementing
speculative instructions, boosting [36,57] and safe-bit [14,53]. It shall be seen that though these
approaches do allow fairly general speculation, they still impose some restrictions. We shall propose
another approach that is more general than these and is more amenable to use by a compiler.
5
1.1.2 Static Speculation and Exceptions
Speculative instructions eliminate spurious exceptions. However, there are other problems caused
by the interaction of static speculation and exceptions, related to exception handling and recovery.
They shall be discussed at length in Chapter 3. We shall illustrate one such problem using the code
shown below in Fig. 1.6.
(1)			ri			r2			+ r3
br== ri, 0, LO
(2)			r5			r9			--H r2
r5
print
(a)
Figure 1.6: Exceptions
(2*) r5			r9 --H r2
(1) ri := r2 + r3
br== ri, 0, LO
print r5
Assume that the non-speculative instruction (1) caused an exception. At this point, the
program would start executing an exception handler. Assume that this exception handler, as part
of its recovery action, changed the value of r2 to 0. In non-speculative programs, an exception
handler resumes execution after processing the exception by branching to the instruction following
the excepting instruction. Ifthis approach is taken, the value of r5 as printed by the speculative and
non-speculative versions ofthe program will differ. This may be acceptable, or it may be necessary
to come up with a mechanism which will ensure that after recovery the values are consistent. In
either case, the problem must be addressed.
Previous work on static speculation has either ignored exception recovery, or treated it infor-
mally. This has resulted in restrictions being implicitly placed on the kinds of exception recovery
actions allowed, and on the kinds of speculation permissible. We shall treat the interaction of
exceptions and static speculation formally, and, in particular, derive the restrictions that exception
recovery places on static speculation.
1.1.3 Static Scheduling for Multiple Instruction Issue
The earliest schedulers for multiple issue processors were derived from those for pipelined proces-
sors, and, like them, concentrated on scheduling within a basic block. As mentioned above, only
2-3 instructions can be issued simultaneously with this approach. However, this was adequate for
the early multiple issue processors, since they were not capable of issuing more instructions than
this every cycle.
The first scheduling algorithms that processed acyclic regions larger than a basic block con-
sidered sequences of basic blocks. These regions were generalized to trees, and then to arbitrary
acyclic graphs. Bach increase in the region being considered causes a major increase in the com-
plexity of engineering the scheduling algorithm. Unfortunately, it has never been determined
whether this additional complexity results in a concomitant performance boost.
6
A similar issue arises when comparing static scheduling against dynamic scheduling. In static
scheduling there is no reason to restrict speculative movement of instructions to only one side of a
branch; however, because of the cost of fetching instructions past both sides of multiple branches
in hardware, dynamic speculation is tends to speculate past exactly one side of a branch. Thus,
static speculation theoretically has an advantage over dynamic speculation in that it can speculate
past both sides of multiple branches; it is not clear, however, how much of a performance gain this
results in.
Most scheduling algorithms work by repeatedly computing the set of instructions ready for code
motion within the region, selecting some of them to be scheduled based on available resources,
and then performing any necessary code motion on the selected instructions. While selecting
instructions, at times there will be a choice between whether or not to speculate an instruction, or
between speculating one of two instructions. A simple selection heuristic is: when scheduling a
group of simultaneously issuable instructions, first schedule the non-speculative instructions. If
there are more resources available, then schedule speculative instructions, preferring those that are
most likely to be executed. The likelihood of execution is determined by some heuristic, such as
the number of branches the instruction is moved past. The selection heuristic used by previous
algorithms tend to be similar to the one described, or even simpler. However, more complex
heuristic may yield better performance.
We have developed a new scheduling algorithm, whole-DAG schedulii?g, for scheduling general
acyclic regions for execution on multiple-instruction issue processors. Whole-DAG scheduling,
like other algorithms, computes a set of instructions ready for code motion; however, unlike other
algorithms, the set of basic blocks from which the ready instructions are computed is easily con-
trolled. This allows whole-DAG scheduling to control code-motion, take into account restrictions
of the static speculation mechanism, and to simulate the effect of algorithms that schedule less
general regions. Further, whole-DAG scheduling uses selection heuristics more sophisticated than
those used in previous algoritlims.
1.1.4 Performance of Static Speculation
There are two varieties of performance studies: so-called limit studies, and attained performance
studies. Limit studies address the question:
What is the best performance possible (under a certain set of assumptions) for a
program, assuming ideal (possibly unrealizable) hardware and compilation?
Attained performance studies measure the performance that is actually achieved on realistic hard-
ware for code produced by an existing compiler. ?T?rpically, the numbers reported are measured for
a set of representative programs, such as the SPEC benchmarks.
The earliest limit studies [62,l7j ignored the possibilities of speculation and found that, at best,
2-3 instructions could be issued simultaneously, a result duplicated in [30]. However, limit studies
that did allow for speculation [52,67,8,35] found that the performance was much higher, with an
average of around 7 instructions per cycle even under fairly restrictive assumptions.
Attained performance studies measure the performance of code compiled for a particular
multiple-issue processor by running it on the processor (or on a simulation) [29,40,6,25). One
7
way these studies are used is to measure the benefits of a particular architectural feature. Another
use for them, of more interest to us, is used to measure the performance improvements caused by
using a more sophisticated compiler. The performance numbers are reported as a speed-up, i.e., the
run-times of programs compiled using the old compiler versus the run-times of the same programs
compiled using the new compiler.
These studies have one serious drawback, from our perspective--Hthey report performance for
the program as a whole. This makes it impossible to isolate the benefits of specific features, such as
speculation or acyclic region scheduling. Consider the limit studies that report large speedups once
speculation is added. Much of this speedup comes from the fact that speculation makes it possible
to overlap of several iterations of loops. However, this is still possible to do without speculation,
by using software pipelining. So, it is not clear how much additional performance can be achieved
solely through the addition of speculation. A similar criticism can be made of attained speculation
studies that report speedups achieved by a compiler that uses static speculation over a base-line
compiler does not implement software pipelining.
Another problem with attained performance studies is that they are incomparable. The compilers
used in the studies target different architectures, with radically different timings, ruling out any
quantitative comparison. Further, compilers contain many different optimizations and phases,
which interact in some non-obvious ways. Thus, when a performance improvement is reported for
some compiler by the addition of some feature, it is not clear how much improvement would result
by adding the same feature to another compiler, even for the same architecture. The situation is
even worse for studies that compare two radically different compilers, rather than modifications of
the same compiler.
We seek to isolate the performance benefits of static scheduling of acyclic regions, and of
speculation. To achieve this, we perform both limit and attained performance studies that are
restricted to acyclic regions. These studies seek to measure the performance improvement achiev-
able and achieved by adding static speculation. We report our results using a new path-averaged
metric that may make it easier to isolate the benefits of different scheduling algorithms, even when
implemented on different compilers.
1.2 Significance and Contributions
This dissertation shows that it is possible to implement static speculation in a such manner that it
can be used by a compiler to achieve significant performance gains on acyclic code, even while
taking error recovery issues into account. More specifically, the contributions are as follows:
o+ We formalized the treatment of error recovery in code containing speculative instructions,
and derived from it a set of constraints on speculation. (Chapter 3)
We developed a new mechanism for specifying static speculation, one of whose benefits is
that it allows the compiler greater flexibility while scheduling. We have also showed how it
can be used to optimize error recovery. (Chapter 4)
8
We developed a novel acyclic region scheduling methodology, called whole-DAG schedul-
ing, that allows for more control than existing techniques, and uses more sophisticated
heuristics. This algorithm was implemented on an existing FOR?IRAN compiler. (Chapter
5)
We measured both the limits of performance of acyclic regions and the performance attained
by the code produced by our scheduler. These measures are presented using a novel path-
averaged metric. They show that, given sufficient resources, results close to optimal can be
achieved; moreover, multi-path speculation out-performs single-path speculation both in the
rn,
ijinit and in achieved tertbrmance. ??laPter Gi.
We used some ofthe insights that arose from our work on speculative execution and exception
recovery to develop a technique that allows us to implement speculation, exception handling
and register renaming in hardware. This technique is less complex than existing techniques,
with accompanying benefits in hardware costs and cycle-time penalties. (Chapter 7)
1.3 Organization
Chapter 2 of the dissertation reviews the background for this dissertation and reviews prior work
in the area. Chapter 3 describes the implications of exception handling for static speculation. It
formalizes concepts that were previously left to intuition, and shows how how exception handling
can constrain speculation. Chapter 4 describes our mechanism for implementing static speculation.
It includes a detailed comparison with prior proposals. Chapter 5 introduces our scheduling
algorithin. Chapter 6 reports performance numbers for various programs, including both limit and
attamed performance on various target architectures. Chapter 7 describes a dynamic speculation
mechanism developed as spin-off of the work on the static speculation, that can be implemented at
low cost. Finally, Chapter 8 contains our conclusions and suggestions for future work.
Chapter 2
Background and Related Work
This chapter surveys related work in static speculation and scheduling for general acyclic regions.
Section 2.1 describes hardware mechanisms that can simultaneously execute multiple instruction.
Section 2.2 focuses on factors that can inhibit the ability of such hardware to execute several
instruction simultaneously. Section 2.3 discusses the development of compiler techniques that
seek to overcome these inhibiting factors. Section 2A defines the vocabulary of speculation, and
discusses the mechanisms used to implement static speculation. Finally, a summary is provided in
Section 2.5.
2.1 Instruction Level Parallelism
The area ofInstruction Level Parallelism (ILP) studies the set ofhardware and compiler techniques
that improve performance by overlapping the execution of instructions. This area includes multiple
instruction issue hardware, and the compiler techniques required to exploit it. For a detailed survey
of issues and work in ILP, refer to [50]. In this section, we shall focus on the hardware required to
exploit ILP.
Two instructions in a program are said to be parallel if they can be executed in any order
without changing the result of the programI For instance, consider the following code fragment
showninFig.2.llnit,instructions (1) & (2), (1) & (4) and (2) & (3) areparallel,and
their execution could be overlapped.
(1)			ri			r2			+			r3
(2)			r4			r5			*			r6
(3)			r7			ri			+			r8
(4)			r9			r4			--H			rO
Figure 2.1: Instruction Level Parallelism
1Note that instructions do not need to he parallel for their execution to he overlapped; we shall show exarrples of
this later in the chapter.
9
10
2.1.1 Pipelining
Pipelined processors [32] issue only a single instruction per cycle. However, the execution of
an instruction is broken up into several pipeline stages. A pipeline architecture exploits ILP by
having several instructions executing simultaneously, each in a different pipeline stage, as shown
in Fig. 2.2. There are usually at least 4 stages, known as:
Instruction Fetch This stage is responsible for fetching the next instruction(s), usually from an
instruction cache.
Operand Fetch During this stage, the instruction is decoded, and the values to be operated on are
fetched from the register file.
Execute An instruction may go through more than one execute stage. It is in this stage (or stages)
that the result is actually computed. In Fig. 2.1, for example, the multiply takes 3 execute
stages.
Write-Back During this stage, the result produced by the instruction is actually written back to
the register file.
Ifan instruction is waiting for a result produced by another instruction, then it cannot be issued.
Such an instruction stalls the pipeline, halting the fetching of any further instructions till its input
becomes available, at whichpoint it is issued and execution canproceed. As can be seen, instruction
(4) forces the pipeline to stall for a single cycle.
It might appear that the example in Fig. 2.1 is wrong; that an instruction's result should be
written to the register file before it is read by any instruction that needs it. However, a mechanism
called bypassing obviates the need for this, by providing the value produced by the last execute
stage directly to the instruction in the operand fetch stage, if that instruction needs the value.
Two numbers are used to characterize the execution of instructions in a pipeline: the result
latency and the issue latency.
Result Latency Also simply called latency, it is the number of cycles that must elapse before an
instruction that uses the result can be issued. In the example, it is 3 for the multiply and 1
for all the other instructions.
Issue Latency This is the number ofcycles that must elapse before the next operation of a similar
kind can be issued. For instance, in an architecture with a load issue latency of 2, a load
instruction cannot be immediately followed by a load instruction. If there are two adjacent
loads, then the pipeline will stall.
Instructions with multi-cycle result latencies provide an opportunity for pipelined processors
to exploit ILP. If a multi-cycle latency operation is followed immediately by an instruction which
uses its result, then the processor would stall. However, if there were instructions that could be
issued in parallel with the second instruction, the compiler could schedule them to execute before
it. This would allow the processor to do some useful work instead of stalling. The code sequence
11
0
0
?L4 ?
? 0 ? ? I
x
>4
Lr)
>4
0
>4
0
0
>4
a
0
0
+			+
Ln
II Ii
0
0
0
x
a)
a)
12
in Fig. 2.2 illustrates both these situations. The multiply in (2) is a 3-cycle latency operation.
This implies that the instruction that uses its result, (4), cannot be issued for 2 cycles. If the
compiler could find two instructions that could be issued in parallel with (4), it could schedule
them between the two, and thus avoid all stalls. However, it could only find one such instruction,
(3) , and thus only one of the two stalls can be avoided.
2.1.2 Multiple Instruction Issue
Even single-issue processors are usually implemented with several distinct execution pipelines,
e.g., a floating point pipeline and an integer pipeline. Each execute pipeline is capable of starting
off a new execution every cycle. However, a pipelined architecture is capable of only sending one
instruction for execution every cycle. Thus, all but one of the pipelines must idle. Performance
could be increased (at some extra cost) by allowing an operation to be issued to all pipelines every
cycle.
Multiple instruction issue machines, also known as multiple issue machines, issue more than
one instruction per cycle. Fig. 2.3 shows an example of multiple instruction issue. This example
assumes that the processor had a multiplier as well as an adder. In such a case, the multiple issue
architecture would be able to issue instructions (1) and (2) simultaneously, thereby executing
the code fragment in one less cycle.
The issue width of such a processor is the maximum number of instructions that the processor
can issue per cycle. ?pically, a processor with an issue width of N cannot issue any N instructions.
For instance, if a processor can issue an add and a multiply instruction every cycle, it has an issue
width of 2. However, it cannot issue 2 multiplies simultaneously.
2.2 Constraints on ILP
There are several factors that preventtwo instructions from being scheduled in a way thatmaximizes
parallelism. Some of these constrnints on parallelism occur because the processor does not have
enough resources to exploit available parallelism.
Dependencies, on the other hand, force instructions to be ordered, so that one instruction can
not be issued until some prior instruction has been issued or completed. This reduces the amount
of instruction level parallelism available to be exploited. Dependencies, as will be shown in the
following sections, are a function of the program.
In this section we shall review the factors that can inhibit instruction-level parallelism.
2.2.1 Resource
Resource constraints arise when the machine does not have enoughresources to exploitthe available
parallelism. There are several kinds of resources that can bottleneck performance.
Functional Units If there are 3 add operations which could be issued simultaneously, but only 2
integer units, then only two instructions of the three instructions can be issued.
13
LO
0
0			0
+			+
Lfl
II			II			II
H
H
0
u
0
0
0
14
Issue Latency Iftwo instructions attempt to use the same functional unit in successive cycles, but
the first instruction has an issue latency of 2, then the second instruction cannot be issued in
the next cycle.
Issue Width If the processor has an issue width of 2, then the maximum number of instructions
that can be issued in parallel is 2. Even ifthere were 3 instructions that were ready to execute
on 3 different functional units, only 2 of them could be issued.
Registers A program may need to keep 60 values in registers to optimally execute a program.
If the machine has only 32 registers available, the additional values will have to be kept in
memory, and moved back and forth when needed, decreasing performance.
The availability of resources may constrain instructions from being issued in parallel.The
amount of resources available vary from processor to processor. Accordingly, any scheduling
algorithm should take into account the resource availability on the processor for which it is
producing code.
2.2.2 DataDependence
A data dependence arises when two instructions access the same location. This location can be a
memory location or a register. These two accesses may need to be serialized, and thus prevent the
instructions from executing in parallel. There are several kinds of data dependencies, which are
illustrated by the code in Fig. 2A.
(1)			ri			r2			/ r3
(2)			r3			ri			* 5
(3)			ri			:= r?			+ r9
(4)			r4			ri			--H 4
Figure 2.4: Data Dependencies
Instruction (2) gets one of its inputs from r1, which is written to by r1. Clearly, (2) cannot
start execution till (1) has written its results, otherwise it would read the wrong value. This is an
example of aflow dependence.
Instruction (3) cannotwrite itsresults before instruction (1) has written itsresults. Otherwise,
(1) would overwrite the value of (3), and instruction (4) would read the wrong value from
register ri. This is called a output dependence.
Further, instruction (3) cannot write its resultuntil instruction (2) has read the value produced
by instruction (1). Otherwise, (2) would read the result of (3) --Hthe wrong value. This is called
a anti dependence.
15
(1)			ri
(2)			r3
(3)			ri
(4)			r4
(a)			(b)
r2			/ r3			(1)			ri			r2			/ r3
ri			* 5			(2)			r3			ri			* 5
r?			+ r9			(3)			riO			r?			+ r9
ri			--H 4			(4)			r4			riO			--H 4
Figure 2.3: Removing False Dependencies
anti and output dependencies are also known as false depei?encies. This is because if there
were enough registers available, these dependencies could be removed by using different output
registers. Fig. 2.5 shows how, by renaming the output registers, instructions that were anti and
output dependent in Fig. 2.4 could be made independent. The technique of changing the output
registers to remove false dependencies will be revisited later in this chapter.
2.2.3 Control Dependence
Branches create another kind of dependence, called control dependence. Every branch alters
control flow, and thereby makes it impossible to determine which instruction should be issued next.
Consider the code fragrnents shown in Fig. 2.6.
(1)			ri			r2 * r3
(2)			br== ri, 0, Li
(3)			r4 := r5 / r3
(1)			ri			r2 * r3
(3*) r4			r5 / r3
(2) br== ri, 0, Li
(a)
Figure 2.6: Control Dependencies
In Fig. 2.6(a), clearly (3) can be executed in parallel with (1) . However, until it is determined
whether the branch will be taken or not, it is not possible to execute it. There is said to be a control
dependence between the branch and the instructions that follow it--Hthey must be issued after the
branch direction is determined, thereby reducing the amount of parallelism available. A similar
phenomenon occurs when the branch direction is known (as it is in an unconditional branch), but
the address is not.
One might be tempted to ignore control dependencies, and execute instructions from below the
branch before the branch direction is determined, as is shown in Fig. 2.6(b). This can be dangerous.
Consider the situation where r3 has value 0. Executing (3) would result in a divide-by-zero error.
However, this error would not have arisen in the code in Fig. 2.6(a); there, r 1 would have had value
0, the branch would have been taken, and consequently (3) would never have been executed.
Thus, circumventing control dependences requires care. It can be done with the help of special
specilation techniques, which shall be described below.
16
(0) read r4
(1)			br==			r1,			0, Li
(2)			r4			r5			* r3
(3)			r4			r4			--H r2
L1:
(4)
(0)			read			r4
(2)			r4			r5			* r3
(1)			br==			ri,			0, Li
(3)			r4			:=			r4			--H			r2
L1:
(4)
ri			r4 + 1
(a)
(0) read
(2) r9
(1) br--H--H
(3)
L1:
(4)
r4
r5			* r3
ri, 0, L1
r4			r9			- r2
ri			r4 + 1			ri			r4 + 1
(b)			(c)
Figure 2.7: Preserving Data Dependencies
This is not the only problem with moving code past branches. One must take care to preserve
data dependences. Consider the code in example Fig. 2.7(a). Blindly speculating (2) , as shown
in Fig. 2.7(b), will result in the value of r4 read by (0) being overwritten whether the branch is
taken or not. Clearly, this is incorrect in the situation where the branch is taken. This problem can
be easily solved by judicious renaming of registers, as shown in Fig. 2.7(c).
It is important to note that the fact that an instruction occurs after a branch does not necessarily
imply that it is control-dependent on it. An instruction is control dependent on a branch only if
the branch direction can determine whether or not the instruction is to be executed. Consider the
situation shown in Fig. 2.8(a). Here, (3) is below an if-then. Clearly, it will be executed no
matter which branch target is taken. Such an instruction is control-independent of the intervening
branch. Some compilers [6] take advantage of this fact to optimize code by moving instructions
past non-dependent branches, as shown in Fig. 2.8(b). Note that this movement is not speculative.
17
(1)			ri			r2 * r3
(2) br== ri, 0, Li
Li:
(3)			r4			r5 / r3
(a)
(1)			ri			r2			* r3
(3)			r4			r5			/ r3
(2) br== ri, 0, Li
Li
(b)
Figure 2.8: Control-Independent Code Motion
2.2.4 IncreasingExploitableILPDynamically
Several hardware techniques have been developed that increase ILP, letting the hardware issue
more instructions per cycle. We shall briefly consider some of them.
Register Renaming
As mentioned above, using the same register name to hold two distinct values creates anti and
output dependencies that can force otherwise independent instructions to be issued sequentially.
An implementation can provide additional registers, and, in hardware, re?iame one of the two
values; i.e. uses a diiferent register for it, as shown below in Fig. 2.9.
(1)			ri
(2)			. .			ri
(3)			ri
(4)			. .			:= ri
(1)			ri
(2)			. .			ri
(3)			p5
(4)			..			p5
Figure 2.9: Register Renaming
By renammg, the hardware removes the false dependencies, permitting the instructions to be
issued in parallel. Thus, in the example, instead of all 4 instructions having to be issued serially,
now (1) & (3) and (2) & (4) can be issuedinparallel.
3-1ALUs
There is no way to remove flow dependence5. However, it is possible to build hardware that can
execute pairs of instructions with a flow dependence between them in the same time as it would
take to do one instruction. The most common of these is a floating point multiply-add [43], which
can be used to combine floating-point multiply and add instructions into a single instruction as
shown below, in Fig. 2.10.
18
(1)			fi			f3 *
(2)			f8			fi + f2
(a)
Figure 2.10: 3-1 ALU
(1)			fi			f3 * f? + f2
(b)
There are many other combinations that can be implemented, including almost all pairs of
simple (i.e. not including shift, multiply or divide) integer instructions [66,65]. These 3-to-i
ALUs effectively remove a flow dependence, by allowing two 2-1 instructions serialized by a flow
dependence to execute simultaneously.
Dynamic Speculation
Speculation is used to execute instructions before it is known whether they should be executed--H
i.e., before an intervening branch has been evaluated. Speculation is usually used in conjunction
with some static or dynamic mechanism for predicting branch targets [37]. The issue hardware will
fetch and issue instructions from the predicted target. Ifthe prediction turns out to be incorrect, the
result is discarded. Consider the example in Fig. 2.11.
(1)			r3
(2)			ri
(3) br==1 ri, 0, LO
(4)			r3
(a)
Figure 2.11: Speculation
(1) r3
(2) ri
(4*) r3
(3) br== ri, 0, LO
(4) is along the predicted (not-taken) path, and is executed before it is known whether or
not the branch will be taken. If it turns out that the branch is taken, the result will be discarded;
in particular, the value in r3 will be restored to the value produced by (1). Otherwise, if the
prediction was successful and the branch was not taken, the value of r3 would be that produced
by (4). Thus, dynamic speculation removes the control dependence that would otherwise have
forced (4) to be executed after the branch.
Out-of-Order Execution
The techniques outlined above would provide only a limited amount of additional performance if
the processor was constrained to issue instructions in the order that they appear in the instruction
stream. Consider the code obtained through register renaming in Fig. 2.9. If the processor was
restricted to in-orderissue, it would issue (1) , then have to wait till it completed before issuing
any other instructions. An out-of-order issue implementation, would however examine and issue
19
from several instructions every cycle. In particular, such a processor could notice that (3) was
ready to issue, and execute it in parallel with (1)
Prior Work
Out-of-order issue was implemented on the CDC 6600 [60]. The earliest implementation of
ILP-inaeasing hardware dates back to the 1960s. Register-renaming combined with out-of-order
execution was implemented in the IBM 360/91, using the Tomasulo algorithm [63]. A survey of
the early work in this area can be found in [31].
Some work has concentrated on implementing exactly one of these techniques (such as [12,26,
58] on out-of-order issue or [20] on register renaming); however, most recentwork integrates several
of these features. The more ambitious of these proposals choose to integrate register renammg,
out-of-order issue and dynamic speculation [39,49]. [29] explores the costs and tradeoffs of
implementing and integrating these proposals.
3-1 ALUs are a more recent innovation. ?T?rpical1y, the only time they are used is when the
compiler produces special 3-1 instructions. In this form, they can be combinedwith any ofthe other
ILP-increasing proposals. There exists at least one proposal [66,42] that attempts to dynamically
detect when adjacent pairs of 2-1 instructions can be combined to use a 3-1 ALU. However, we
are not aware of any proposal that integrates such a combining mechanism with any other dynamic
ILP increasing mechanism.
2.3 Scheduling
2.3.1 Basic Block
The earliest work in scheduling to exploit ILP was for pipelined processors. These processors need
to exploit only minimal amounts of instruction-level parallelism for full utilization, since they only
need to cover the delay of instructions with multiple-cycle result latencies. The amount of ILP
required is less than 2 for most processors [67]. Since a basic-block typically has an ILP of 2-3,
this provided little incentive for scheduling regions larger than a basic block.
Early theoretical work includes proofs that various flavors of the optimal scheduling for
pipelined processors are NP-complete [10]. A summary of such work can be found in [36].
There are several heuristic scheduling techniques that tend to work fairly well in practice [21,18,
3]. Later work in the area focuses on the interaction with other phases of compilation, especially
register allocation [19,7]. Much of this work is surveyed in [33].
The advent of multiple-issue processors that exploit more ILP than available in a basic block
motivated research into scheduling regions bigger than basic blocks. This research is discussed
next.
20
Original Loop
doi= 1,n
ri :=ldx[i]
r3 ri + r2
r4 r3 + 1
2.3.2 Cyclic Code
Filled Gap
ri :=ldx[i]
ril :=ldx[i+1]
r3 ri + r2
r4 r3 + 1
Figure 2.12: Software Pipelining
Final Loop
ri :=ldx[i]
doi= 1,n
ril :=ldx[i+1]
r3			ri + r2
r4 := r3 + 1
ri :=r11
Cyclic code should be scheduled differently from acyclic code, since it has an additional degree of
freedom. In an acyclic code fragment, the only source for parallelism is that between instructions
in the body of the code. In cyclic code, on the other hand, one can also overlap instructions from
different iterations of the loop.
Software pipelining [59,34,2,13,51] attempts to find a sequence of instructions, which may
include instructions belonging to different iterations, that completely utilizes available resources.
It may not be possible to schedule the original loop body so as to use all available resources. These
"gaps" are filled by instructions from some succeeding iteration. This, in turn, may produce gaps.
These gaps are filled in turn. Finally, a repeating pattern is obtained. This pattern becomes the new
loop body. Fig. 2.12 illustrates this process.
It is interesting to note that, for Do/for loops, it is not necessary to use speculation to
obtain a schedule that uses all available resources efficiently, even in those cases where the loop
body contains branches, There exist techniques, such as hierarchical scheduling [34] or reverse
ifrconversion [68J, that can yield fairly dense schedules when such branches are present.
Speculation does pay off when attempting to software pipeline while loops [61J. In a
DO/ for loop, the number of iterations that will be executed can be computed before the loop body
is entered. This makes it safe to produce code that overlaps execution of several iterations--Hthe
software pipelined loop is executed only as long as it is safe to execute all instructions in the body,
then an epilogue is executed, which finishes executing the remaining instructions from the last few
iterations. In a while loop, there is usually no way of determining a priori how many times the
loop will execute. Therefore, any gaps between the beginiling of the loop and the determination of
whether the next iteration will be executed cannot be safely filled with instructions from the next
iteration. However, ifstatic speculation is available, these gaps can be filled, by issuing instructions
speculatively from the next iteration speculatively.
2.3.3 Acyclic Code
21
Acyclic code scheduling techniques can be distinguished on the basis of how big a region of the
control flow graph (CFG) they pick to schedule, and on the kinds of code motions they allow.
Trace scheduhng [15] considers paths in the CFG. These paths, known as traces, are selected
using information about branch probabilities. The scheduler selects a trace and schedules it. It then
repeats the process till the entire CFG is scheduled. During the scheduling process, instructions
may be moved past forks in the control flow graph both ways, and up through merges. During this
process additional bookkeeping code needs to be inserted to preserve semantic correctness. Trace
scheduling results in fairly good schedules for numerical code. However, it can lead to excessive
code expansion. Also, it is inherently restricted to considering a single path at a time.
Superbiock scheduling [25] is based on trace scheduling. Mter trace selection, all joins on a
trace are removed by tail duplicaflon, i.e., by copying all code below the join. This simplifles the
bookkeeping associated with moving an instruction up past a join.
SRDAG scheduling [38] extended trace scheduling to regions that form trees of basic blocks.
Results reported in that paper suggested that it could perform much better than trace scheduling;
however, it has apparently never been implemented in a compiler2.
Work is underway on an extension to trace scheduling, known as trace schedullng-2 [16]. This
work will attempt to schedule general acyclic regions.
Hyperbiock scheduling [41] extends superblock scheduling to deal with multiple paths. Initially,
the CFG is converted into a basic block by if-converting the instructions that are conditionally
executed. During if-conversion, the conditional instructions are associated with a predicate or
guard that is true only if the instruction should be executed. The resulting basic block is scheduled
normally. Then, reverse if-conversion is applied: basically, the guards are converted into if
statements.
Percolation scheduling [2] provides mechanisms for scheduling instructions on all paths, rather
than just one path. However, the policy used to apply this mechanism is not so clear. One particular
strategy, greedy scheduling, can extract massive parallelism, but is very complex, and results in
uncontrolled production of speculative operations.
Instructions can be safely moved from a basic block to a basic block with equivalent control
dependence, without generating any copies. For instance, an instruction can be moved from below
an if-then-else to the basic block above it, as long as the data dependences permit it. Both the
boosting optimizer [57] and global scheduhng [6] use this code motion optimization.
Selective scheduling [45,44] is a general multiple-path method. This scheduling method is not
restricted to acyclic code; it can be applied to schedule cyclic code, producing a software pipelined
schedule. It, like global scheduling and the boosting optimizer, was designed to implement
speculation. We shall discuss these techniques at greater length in Chapter 6.
2The original paper dealt with microcode compaction
22
(1)
(2)
L1:
(3)
(4)
r4			r2			* r7
br== r4, 0, LO
ri			r2			/ r3
r9			r5			+ 1
(a)
2.4 Speculation
2.4.1 Vocabulary
Figure 2.13: Speculation
(4*)			r9			r5			+ 1
(1)			r4			r2			*
(2)			br==			r4,			0, LO
Li:
(3)			ri			r2			/ r3
An instruction is executed speculatively ifit is issued and may complete before it is known whether
control flow will actually reach the instruction. For instance, consider the code shown in Fig. 2.13.
(4) is speculated, since it is executed, and should complete, before it is known whether the branch
(2) is not taken.
Note that speculation has been used to denote other, less aggressive, behaviors (usually in
the context of hardware speculation). In "fetch" speculation, instructions are fetched from the
predicted path, but not issued. In "execute" speculation, the predicted instructions are allowed to
execute, but their results are not written back till the branch is executed.
The speculation can either be dynamic, i.e., performedby the hardware, or static, i.e., performed
by the compiler. In the case of static speculation, the architecture must provide instructions that
are the speculative equivalents of normal instructions. The compiler will produce a program in
which an instruction that should be speculated has been converted to its speculative equivalent and
moved (by the compiler) past one or more branches.
A speculated instruction must be treated specially in several ways.
Any exceptions caused by a speculated instruction must not be reported until it is known that
the instruction should have been executed. Only after that must the exception be reported.
For instance, if (4 *) in the Fig. 2.13(b) overflowed, the overflow must not be reported if
the branch is taken.
Register updates can be dealtwith in two ways: the outputregister ofa speculative instruction
such as r9 caneither be updated immediately, or later, after it is known whether the instruction
should have been executed. In the second case, speculative instruction register updates must
be handled specially.
There must be some way of indicating to the hardware where the speculative instruction
originally came from, and thus at which point its exceptions should be reported.
Before proceeding, let us define some terms and phrases that shall be used in this dissertation
to discuss speculation. Some of these definitions have two flavors. First, there is a definition
exclusively in terms of a speculative program. Then, there is an intuitive description which
23
assumes that the speculative program was derived from a non-speculative program, and appeals to
the original program and the transformation. It is important to recognize that there is no necessity
for there to be an original non-speculative program.
Origin The origin of a speculative instruction is the position of its equivalent instruction in the
original non-speculative program. The origin is defined only in those cases where the
speculative program was derived from a non-speculative program.
Commit Point The commit point of a speculative instruction is the position in the speculative
program where any interrupt caused by the speculative instruction is reported. The commit
and origin points can be different. Further, it is possible for an instruction to have multiple
commit points. An instruction is said to have been committed when control flow reaches its
commit point, and all actions associated with the commit point have been accomplished.
Speculated Past A speculative instruction is said to be speculated past a branch if there is a
path containing the branch from the speculative instruction to some commit point for that
instruction. In the context ofa transformation from a non-speculative program, an instruction
is said to be speculated past a branch if the instruction used to be "below" a branch (in the
control flow sense), and was moved "above" it. Thus, (4) was speculated past branch (2).
Fail A speculative instruction is said to fail if control flow reaches a point from which it is
impossible to reach the commit point withoutfirst executing the speculative instruction again.
This happens when control flow reaches a branch which the instruction was speculated past,
and then branches in the "wrong" direction. The instruction is said to have failed at that
branch. In the example, (4) would fail at (2) `if the branch was taken.
Succeed An speculated instruction is said to succeed if control flow reaches its commit point. An
instruction is said to succeed at a brnnch if it does not fail, i.e., control flow at a branch
goes in a direction from which it is still possible to reach the instructions commit point. For
instance, (4) succeeds at branch (2) if the branch is executed and not taken.
Resolve A speculative instruction is resolved when the instruction either succeeds or fails.
Outstanding A speculative instruction is said to be outstanding at a point if it has been executed
but not resolved at that point.
The control flow graph in Fig. 2.14 illustrates the movement ofnon-speculative I from its origin
point (which is also its commit point) above branches BI and B2, converting it into speculative
instruction I *. In the process, the instruction has been speculated past branches B 1 & B2. Clearly,
the speculation will fail at B1 ifcontrol flows to the true side ofthe branch, similarly at B2. I * will
succeed only if control flows to the false side of both branches, thereby reaching its commit point.
It will be resolved if, after executing I *, either control reaches the commit point, or ifcontrol flows
to the true side ofeither B1 or B2. The instruction will be outstanding as long as control is at some
intermediate point.
24
1*
`4
Figure 2.14: Speculation
2.4.2 Mechanisms
We ? brieflv dr?--Hihed dvnamic sneculation. and ?rooosals for im?lementing it in Section 2.2.4.
?ve
A more detailed study of such mechanisms can be found in Chapter 7. In this section, we shall
introduce the two main proposals for implementing static speculation. These are boosting [55,57,
56] and safe-bit [14,53]. These schemes differ, as will be seen below, both in the way the commit
points of the speculative instructions are determined, and in the way register updates are handled.
Boosting
Boosting defines the commit point for a speculated instruction by annotating the speculated in-
struction with the path to the commit point. The path to the commit point is described by listing
the number of conditional branches on the path to the commit point, and whether the path is along
the takeninot-taken side of the branch.
Consider the example in Fig. 2.15. The speculated instruction, (1*) has been moved past the
two branches, first past the left (L) side of a branch, and then past the right (R) side of a second
branch. The path from the speculated instruction to the actual instruction is simply the reverse
of this path. Thus, as shown, the additional information added to the speculated instruction in
boosting will be that the path to the commit point is LR, indicating that the instruction will succeed
if the first branch encountered goes to the left, and the second branch encountered goes to the right.
A modification made to boosting was the addition of the ability to use X or "don't care". This is
used when an instruction has been speculated past both sides of a branch, as is shown in Fig. 2.16.
25
(1*) r5.LR:=r8/r2
MM?R
(l)r5 :=r8/r2
Figure 2.15: Boosting
A point worth noting is that not all speculations can be described by boosting, even after this
addition. One such speculation is shown in Fig. 2.17.
An instruction's commit point is necessarily immediately after the last branch in its path. Thus,
any exception caused by the instruction will be initially reported at the first branch it was speculated
past. Notice also that several instructions can be simultaneously committed at branch. This allows
the commit actions to be performed en bloc, which may be more efficient than committing each
instruction separately.
Safe-Bit
In the safe-bit approach, every register has an added safe-bit. If a speculative instruction causes
an interrupt, its output register is u,zsafe, i.e. the safe-bit for its output register is set, and some
interrupt information is stored. Should some speculative instruction have a safe value as an input,
then it, too, is made unsafe. If a non-speculative instruction reads an unsafe value, the interrupt
is reported. Thus, committing a speculative instruction requires an explicit "commit" instruction.
This is any non-speculative instruction which reads the output register for the speculative register.
The safe-bit approach defines the commit point for a speculated instruction using another
instruction, which reads the output register of the speculated instruction. Consider the example in
Fig 2.18. The instruction, (1) , has been speculated to (1*). (1 *) will be committed at the first
read of its result register, r5, by a non-speculative instruction, in this case (2). This assumes, of
course, that there are no intervening writes to r5.
Notice that the safe-bit approach allows very fine-grain control of commit point placement.
This comes at the cost of having to add an extra instruction that reads the output register of the
speculative instruction at the point where we wish to commit it3.
3ofcourse, sometimes there will already he an instruction which reads the output register; in that case it comes for
26
LJ?
M
(1)r5:=r8/r2 - -
Figure 2.16: Boosting with "don't cares"
`I
Figure 2.17: Speculation Inexpressible with Boosting
27
(1*)r5=r81r2
L			-,			R
(1)r5=r81r2
Figure 2.18: Poison-Bit
The use of output registers to link speculative instructions with their commit points means that
a different mechanism has to be used for instructions that have no output register, such as stores.
(53] has proposed such a speculation.
In safe-bits, a speculation fails when the output register ofa speculative instruction is written to
by a non-speculative instruction without having been read. At this point, the safe-bits are cleared.
of course, there is still the problem of detecting failure of instructions with no output register.
Register Access
Another way in which boosting and the poison-bit scheme differ is in their treatment of registers.
In the poison-bit scheme, there is exactly one register file, and all register names refer to entries
in this file. Thus, a register update is visible to all instructions immediately. In boosting, register
updates made by speculative instructions are hidden; in particular, a non-speculative instruction
may not read a value written by a speculative instruction till that instruction commits.
To implement this hiding, boosting uses multiple register files. One of these is the "regular"
register file, which is manipulated by non-speculative instructions in the usual fashion. The others
are known as shadow register files. They are used to buffer the results of speculative instructions.
Each shadow register file is associated with a certain commit point. All instructions are associated
with some file, the non-speculative instructions with the regular file, and the speculative instructions
with the shadow file that has the same commit point as the instruction.
The rules for accessing and manipulating registers are somewhat complicated, and will be
illustrated using Fig. 2.19. The files are arranged in a hierarchy based on the distance to the commit
free. Actually, we can usually choose the commit point ofa speculative instruction to he a point at which there already
exists a read of the output register.
28
Regular			Shadow
Registers Registers
17			17
24			24
18			18
24			24
18			18
24			20
18			18
1			20
(1) ro ri + 1
(3*) r1			ro + 2
(2)r1 :=rO>8
(4*).,			.. r1
Figure 2.19: Register Manipulation in Boosting
point, with the regular file at the top of the hierarchy. In the example, there are only two files, and
the shadow file is necessarily below the regular file. Every instruction reads from its associated
register file. Thus, (3 *) from Fig. 2.19 reads from the shadow file. It writes to its associated
register file. `Thus (3 *) modifies r 1 in the shadow file. It will also modify that register in
all register files lower in the hierarchy, unless the register has previously been written to by some
instruction associated with some file lower in the hierarchy. `Thus, (1) modifies r 0 in both register
files, while (2) only modifies r 1 in the regular file, since r 1 in the shadow file has already been
writtentoby (3*),
If a speculation succeeds, its shadow file is committed by copying its values into the regular
file. After committing, or after the speculation fails, the shadow register file can be reused. This
requires reselling it, which is done by copying values from the lowest active register file.
The original program that was transformed into the speculative version used in Fig. 2.19 might
have been the one shown below in Fig. 2.20.
(1)
(2)
rO			r1 + 1
r1 := rO > 8
branch
r1			rO + 2
r1
(3)
(4)
Figure 2.20: Original Code for Fig. 2.19
It is not possible to blindly perform the same speculation using poison-bits, as is obvious from
29
the code in Fig. 2.21(a). In this case, instruction (4 *) incorrectly reads the value written to r1 by
(2) , instead of that written by (3*). The fix, shown in Fig. 2.21(b), is to use register renaming.
Here, the compiler simply chooses different registers (in this case r4) to connect (3*) and (4 *)
Notice that any other instruction that read the value written to r 1 by (3 *) must now instead read
from r4.
(1) rO
(3*) ri
(2) ri
(4*)
ri +			1			(1)			rO			ri			+ 1
rO +			2			(3*)			r4			rO			+ 2
rO >			8			(2)			r1			:= rO			> 8
ri			.			(4*)			. .			. .			r4
branch			branch
(a)
Figure 2.21: Speculation with Renaming
In general, a scheme that uses M register files containing N registers as proposed in boosting4
makes less efficient use of registers than a scheme that uses a single register file containing
M * N registers. A compiler can always use renaming in the single file scheme to implement any
speculation possible using the multiple file scheme. However, it has more flexibility in how it uses
registers. For instance, it could choose to use all the registers for non-speculative operations, and
thereby avoid spilling registers to memory. Actually, as shall be seen in the next chapter, it may
be necessary to use a multiple file scheme if it is necessary to support a certain strict model of
exception recovery.
2.5 Summary
This chapter introduced the concept ofinstruction-level parallelism, showing how it can be exploited
in hardware. We also showed how ILP can be inhibited, and techniques for increasing it, both
dynamically in hardware, and statically, via the compiler. We described efforts that have been made
in the past to generate good schedules for processors that can exploit ILP. Finally, we defined the
vocabulary we shall use in this dissertation while discussing static speculation. We also described
two schemes for static speculation.
4It must be noted that the register structure is somewhat onhogonal to the commit point specification mechanism;
for instance it would be possible to combine the annotations of boosting with a single register file.
Chapter 3
Exception Handling and Static Speculation
In this chapter, we discuss the interaction of exceptions and speculative code. In Section 3.1, we
introduce concepts used in the rest of the chapter, including an abstract mechanism for specifying
speculation. In Section 3.2, we describe the issues that arise when writing exception handlers for
programs that contain speculative exceptions. We extend the precise exception model so that it
simplifies the task of writing exception handlers for such programs. In Section 3.3, we describe
the details of implementing this extended model. This includes constraints that the speculative
program must satisfy, and certain other requirements imposed by exception handling requirements.
Finally, a summary is given in Section 3A.
3.1 Introduction
3.1.1 Precise Exceptions
Normally, a processor fetches and executes instructions from some body of code, under the control
of the program counter for that program. Occasionally, however, this regular execution sequence
is interripted by an exception, and control is transferred to a piece of code known as the exception
handler, whose purpose is to process the exception. The exception handler takes action appropriate
to the exception, and then, possibly, resumes normal execution of the program.
As an example, consider a low-cost implementation of an architecture that includes an integer
divide instruction. The implementation is designed with no integer divide hardware. Instead,
it emulates the divide instruction in software, by interrupting normal execution whenever the
processor encounters a divide instruction (via an exception such as "unimplemented instruction")
and transferring control to an exception handler. The handler contains code which performs the
divide using implemented instructions such as shifts and subtracts.
Handling any exception after it has been reported requires that the hardware satisfy certain
constraints. ?I?rpically, these are satisfied on most non-speculative architectures by implementing
precise exceptions [54]. An exception is said to be precise if, at the time the exception is reported,
the machine state (i.e., registers, memory, etc.) is that which would be obtained if the following
conditions were satisfied:
30
31
o+ All instructions occurring before the excepting instruction in the program order have corn-
pleted.
o+ No instruction occurring after the excepting instruction has been issued.
o+ The program counter points to the excepting instruction.
It is possible to add dynamic speculation to an implementation, and still satisfy these conditions.
Unfortunately, as will be shown, this definition of precise exceptions breaks down when applied to
architectures with speculative instructions. Further, no adequate extension or replacement has been
proposed in the literature; the interaction between exceptions and static speculation has usually
been treated informally.
3.1.2 Exception Handlers and Precise Exceptions
In this section, we shall discuss the implementation ofexception handlers on a pipelined processor,
since the problems encountered in implementing exceptions on a pipelined processor are the similar
to those encountered in implementing exceptions on an architecture with static speculation.
On a pipelined processor, by the time the hardware recognizes that an instruction will cause
an exception, it is possible for several instructions issued after the excepting instruction to have
completed and updated the machine state. Further, there may be instructions from before the
excepting instruction that are still in execution. Such a situation can arise in the code fragrnent
shown in Fig. 3.1. Assume that the latencies of the divide, multiply, and add are 11, 6, and 4
respectively. Further, assume that the multiply, (2) , detects an exception, such as overflow, at the
last stage of execution. At this point, instruction (3) will have completed and modified r5, while
(1) willstillbeexecuting.
(1)			ri			r2			/ r3
(2)			r4			r5			* r6
(3)			r5			r8			+ r9
Figure 3.1: Pipelined Exceptions
The exception cannot be reported immediately; in particular, the processor must wait for all
instructions prior to the excepting instruction to complete, in case they also cause an exception.
If one of them does except, then that is the exception that must be reported. Thus, in Fig. 3.1,
reporting the exception caused by (2) must be delayed till (1) completes. If it happens to cause
an exception, then that is the exception that must be reported.
Generally, the exception handler must be able to read the inputs to the excepting instruction to
determine the appropriate corrective action. This requires that the effects of any instruction issued
after the exception be undone before executing the exception handler. For instance, in the example,
(3) overwrites the input to (2) stored in r5. Without the ability to restore the original value of
r5, it may be impossible for the exception handler to determine the appropriate action.
Thus, in general, an exception handler requires that, at the point an exception is reported,
the machine state should reflect the execution of all preceding instructions and no succeeding
32
(1)			r4			r2 *
(2) breq r4, 0, LO
Li:
(3)			r9			r5 --H 1 *
(a)
(3*)			r9.F			r5			--H 1
(1)			r4			r2			*
(2)			breq			r4,			0,			LO
L1:
(c)
Figure 3.2: Representing Speculation
(3*)			r9			:= r5			--H 1
(1)			r4			:= r2			*
(2)			breq r4,			0, LO
Li
commit (3)
(3*)			r9			:= r5			--H 1
(1)			r4			r2			*
(2)			breq r4,			0 LO
L1:
r9 :--H--H r9
instructions--Hrequirements that are satisfied if precise exceptions are implemented.
Notice that the complicating factor in the handling of exceptions is out-of-order completion,
i.e., instructions finishing execution while instructions occurring prior to them in the program are
still executing. `Thus, (2) had to wait for (1) to complete before reporting its exception, since
it finished before (1). Similarly, the effects of (3) had to be undone because it finished, and
updated the machine state, before (2) completed. Out-of-order completion, as will be seen, occurs
naturally in architectures with static speculation.
After processing the exception, the exception handler typically resumes normal program ex-
ecution by branching to the excepting instruction (or the instruction after it). Note that this may
cause instructions to be re-executed in a modified state; for instance, if the handler for the overflow
of (2) modified r5, (3) would produce a result different from that produced the first time it was
executed. This has repercussions when dealing with restart in speculative architectures, as snaIl be
seen later.
3.1.3 Abstractions
Instead of choosing to discuss the interaction between exception handling and speculation using
either boosting or safe-bits, we shall use an abstract approach: each speculative instruction's
commit point will be explicitly represented. This allows us to ignore the mechanism used to link
a speculative instruction and its commit point. Thus, the speculation of (3) in Fig. 3.2(a) will
be represented abstractly, as shown in Fig. 3.2(b), instead of using either boosting (Fig. 3.2(c)) or
safe-bit (Fig. 3.2(d)).
Speculation performed using either boosting or safe-bits can be represented using this abstract
approach. Clearly, the only concern is finding the commit point for each speculative instruction.
In boosting, each speculative instruction is commifled immediately after the last branch in its
path; in the abstract representation, we add a commit for that instruction after the branch. In the
33
safe-bit approach, a speculative instruction is committed when its output register is read by some
non-speculative instruction. This is represented in the abstract approach by adding a commit for
the instruction before its first use by a non-speculative instruction.
Another distinction between the two speculation mechanisms is in the way they handle the
output of speculative instructions; in boosting the output values are buffered till commit, while in
safe-bits the output registers are committed immediately. However, we can abstract this difference
in behavior to some degree. Let us define the following terms:
Speculative State: The values produced by uncommitted speculative instructions, and the lo-
cations in which they are stored. In boosting, this would include the shadow registers; in
safe-bits, this would include all registers written to by "uncommitted" speculative registers.
Non-Speculative State: The values produced by non-speculative instructions and by committed
speculative instructions, and the location in which they are stored. In safe-bits, a regis-
ter would be logically moved from the speculative state to the non-speculative state by
committing it, i.e., by testing its safe-bit.
Non-Speculative Location: For a value in the speculative state, the location to which the value
is "copied" when the instruction that produced it is committed. For a value in the non-
speculative state, the location in which it is stored. In boosting, this is the regular register
with the register number specified by the output register for the instruction.
Thus, in both schemes, a speculative instruction first writes its output to the speculative state.
Mter the instruction is committed, the non-speculative location is updated with this buffered output
value. In boosting, this update copies the value from the speculative state to the non-speculative
state. In safe-bits, no value is physically copied; instead, the register in which the value resides is
now considered part of the non-speculative state.
3.1.4 Origin vs. Commit: Producing Speculative Code
In the examples discussed above, we have speculated code so that the commitpoint ofan instruction
is the same as its origin point. Fig. 3.3 illustrates, using boosting, a case where it is not. In the
example, (2) is moved from below (1) to a point above the branch. However, boosting always
commits instructions immediately after the branch. Thus, as indicated in Fig. 3.3(c), the instruction
is committed before (1) ,rather than after it, i.e., the commit point of (2 *) is different from its
origin point.
In practice, however, this distinction is vacuous. ?I??pically, the code for a speculative architec-
ture will be produced by a compiler from some high-level language. The compiler will produce
only the speculative code, as in Fig. 3.3(b). Since there is no non-speculative code to compare
against, there is no way of defining the origin point. Further, if there was a mechanism to force
the compiler to produce both a speculative and non-speculative version of the same program, it
is probable that the non-speculative code produced by the compiler would have identical commit
and origin points. Thus, in our example, the non-speculative code produced would be that of
Fig. 3.3(d).
34
branch... Ii)
(1)rl :=r2+4
(2)r7 :=i9*r2
branch... LO
(2)r7:=r9*r2
(1)rl :=r2+4
(2*) r7.F			r9 * r2			(2*) r7			r9 * r2
branch... LO			branch... LO
(1)			ri			r2 + 4			commit (2*)
(1)			rl:=r2+4
(a) Original			(b) Boosted			(c) Abstract			(d) Equivalent
3.1.5 Assumptions
Figure 3.3: Origin vs. Commit
We shall, as far as possible, use the abstract representation of speculation. Unless otherwise
mentioned, the examples used can apply to either scheme or we shall explicitly mention which
approach is being used. Also, we shall defer distinguishing between cases where the speculative
instruction results are written directly to the regular register file, or are first buffered.
Further, since the focus of the remainder of the chapter is the impact of static speculation on
exceptions, unless otherwise mentioned, we shall ignore complications introduced by pipelining,
parallel issue, etc. `The unpiementarion we shall be assuming i5 a ?rial (i.e., sequential, in-order,
?inp1inp.d?			1?			itation au?mented bv soeculative instructions.
s?? ?sue, non-?`
3.2 Precise Exceptions & Static Speculation
3.2.1 The Breakdown of Precision
Our goal is to make it possible to write exception handlers for programs with speculative instruc-
tions. On a non-speculative architecture, this can be achieved by implementing precise exceptions;
i.e., atthe time an exception is reported, all prior instructions must have executed, and no succeeding
instructions must have executed.
The words "prior' and "succeeding" imply some order on the instructions. In the case of non-
speculative programs the order was the program order. Now, a speculative instruction actually has
two positions in the program--Hits issue point and its commit point. Intuitively, if two instructions
exception, and the exception of one instruction is reported before the other, then that instruction
should precede the other; thus, the natural order in the case of speculative instructions would be
to assume that the speculative instructions are ordered by the position of their commit point in the
program. In that case, the instructions preceding a speculative instruction are those that precede
its commit point, and the instructions that occur after a speculative instruction are those that occur
after its commit point.
35
One reason behind the rule that no instructions after the excepting instruction should have
executed was to ensure that the inputs to the excepting instruction were unchanged. Unfortunately,
this goal is not satisfied by the rule when it comes to speculative instructions, as can be seen from
Fig. 3A. In this example, if instruction (3 *) exceptions, even though no succeeding instruction
(under the order defined) has executed, the input r0 to (3 *) has been overwritten by (2).
Furthermore, as shall be shown later, it is perfectly reasonable for a compiler to produce such code.
(1)
(3*)
(2)
rO
ri			rO + r2
rO
br== r2, 0, LO
commit (3*)
Figure 3.4: Speculation Changes Inputs
Assume, for the sake of argument, that we could restore the state to that which was present at
the time (3 *) was issued, in which case the value in r 0 is the value used by (3 *) This will let
the exception handler read the necessary values. However, now we are faced with the problem of
restart. Exception handlers can modify arbitrary state; assume that the exception handler for (3 *)
set the value of r2 to 0. Ifexecution resumed at the point (3 *) was issued, then the branch would
be taken, resulting in the paradoxical situation where the excepting instruction (3 *) should not
have been executed, and therefore should not have caused an exception.
So the alternative is to resume execution at the commit point. This in turn implies that all
modifications must be done on the commit state. Consider the base case, where the exception
handler makes no change to the state: intuitively, any succeeding instructions must see the value
rO as defined by (2) , not the value at the issue point. Thus, an exception handler is expected to
read from the issue state and write to the commit state--Ha somewhat paradoxical situation.
Previous authors have tried, informally, to sidestep this dilemma by considering only speculative
programs derived from non-speculative programs via a set of structured transformation rules that
preserve the input-output semantics of the original, non-speculative program. They then define a
precise state for the exception of a speculative instruction to be the state obtained if the equivalent
instruction in the original non-speculative program had caused the exception. Thus, if speculative
instruction (4 *) in Fig. 3.5(b) causes an exception, its exception state is said to be precise if the
state is identical to that obtained if (4) in the original non-speculative program of Fig. 3.5(a) had
caused a precise exception.
36
x:=...			(l)vr():=
y:=x*2			(2)vrl :=vr()*2
if(...) (			.......
z := x + 1
x :=...
(3)vr2 :=vx()+ 1
LO:
(4)vr4 :=...
(1) vr():=...
(3*)vr2:=vr()+ 1
(2) vrl :=vr()*2
breqi... LO
commit (3*)
(1) r<):=...
(3*) rl			rO + 1
(2) ro:=r()*2
breqi... LO
commit (3*)
LO:			LO:
(4)			vr4 :=...			(4) r2 :=...
(a)			(b)			(c)			(d)
High-Level			Assembly with			Speculated			Register Allocated
Code			Virtual Registers			Assembly Code			Assembly
(1)
(2)
Li:
(3)
(4)
Figure 3.6: Producing a Non-Sequentializable Program
r4			r2 *
breq r4, 0, LO
ri := r2 / r3
r9			r5 + 1
(a)
(4*)			r9			:--H--H			r5			+ 1
(1)			r4			r2			*
(2)			breq			r4,			0, LO
Li:
(3)			ri			r2			/ r3
commit (4*)
Figure 3.5: Problems with Exceptions
The approach of appealing to a non-speculative equivalent program is attractive; it appears to
model our intuitive notion ofwhat it means to, and where we should, speculate. Unfortunately, it is
as yet informal. For instance, what should we do in the case where we do not start by transforming
from a non-speculative program? It should be clear from Fig. 3A that we can write speculative
programs for which there exist no obvious non-speculative equivalent. Note that, in that example,
any exceptions caused by the speculative instruction (3 *) will be reported after the branch, at
the commit point. At this point the input to (3 *), r0, has been over-written by (2). Now, any
equivalent non-speculative program must execute (3) after the branch. However, at that point,
the input to (3) is unobtainable, since it has been over-written. It is possible to find an equivalent
non-speculative program, but only by rearranging the registers.
As mentioned earlier, it is entirely reasonable for an optimizing compiler to produce such a
program. Consider the compilation sequence in Fig 3.6, which yields the code similar to that in
Fig 3A. Fig. 3.6(a) shows an example of some high-level code. Note that x is defined above the
I f, and that its last use is inside the I f statement. The compiler converts it into assembly, using
"virtual registers" to represent the values computed in the program. Then, it schedules the code, as
37
shown in Fig 3.6(c), introducing speculation. Finally, it allocates processor registers to the values
representedinthe virtual registers. At (1) itallocates rO forvro, andat (3*), ri forvr2. At
(2) ,the compiler notes that the last use of the value in rO has occurred, and so it is free to reuse
rO for vrl. This action makes it impossible to directly convert back to a sequential program; in
such a program, the value of vrO and therefore rO would need to be available at the speculated
instructions origin point.
3.2.2 Equivalence
In the previous section, we introduced the concept of "equivalence" where, intuitively, a non-
speculative program is said to be "equivalent" to a speculative program if an instruction in the
speculative program causes an exception exactly when the "equivalent" instruction in the non-
speculative program does , the machine states of the two programs are identical at the point
the exception is reported, and exceptions in the non-speculative program are precise. Before
proceeding to formalize the concept, let us consider why the notion of having such an "equivalent"
non-speculative program is so attractive,
Assume that there exists an equivalent non-speculative program for some speculative program.
Since exceptions in the non-speculative program are precise, the machine state will be such that
all preceding instructions have executed, and no succeeding instruction will have modified the
state. Thus, it is possible to write an exception handler that can examine the machine state,
including the inputs to the excepting instruction, determine the cause of the exception, make
arbitrary modifications to the state, and resume execution, with the expected behavior. Now, by
our (informal) definition of equivalence, the machine state in the speculative program, when the
"equivalent" exception is reported, is identical to the machine state in the non-speculative program;
hence, it is possible to use the same handler to process the exception. Thus, by ensuring the
existence of an "equivalent" non-speculative program for some speculative program, we ensure
that we can write exception handlers for that speculative program; further, these exception handlers
will be identical to those written for non-speculative programs.
Note that this notion of equivalence is stronger than equivalence by input-output behavior.
For instance, the speculative program in Fig. 3.6(d) has the same input-output behavior as the
high-level program and any non-speculative code generated from it. The only time the behavior
of the speculative assembly code can be distinguished from that of some non-speculative assembly
code is if an intermediate state of the program is examined.
3.2.3 Sequentialization
It is clear that there are some speculative programs for which there exist no equivalent non-
speculative programs. However, we have still not defined when a non-speculative program is
equivalent to a speculative program. Nor have we shown, other than at an intuitive level, how to
apply this notion of equivalence to defining exception models. In this section, we shall attempt
to formalize the approach of using a non-speculative program to define precise exceptions for a
speculative program.
38
breqi r3, 0, L5
(l)r5 :=r7* 3
(2)r9 :=r5+4
(3)rl :=r2+r6
(3*)rl :=r2+r6
(l*)r5:=r7*3			?
breqlr3,0,L5			breqi r3, 0, L5
commit (3*)			`?			(3) ri r2 + r6
commit(1*)			(l)r5 :=r7*3
(2)			r9:=r5+4			??			(2)r9:=r5+4
(a) Original			(b) Speculative			(c) Sequentialized
Figure 3.7: Sequentialization
A speculative program for some architecture with static speculation and a non-speculative
program for the same architecture are said to be eqi?ivaIent if they satisfy the following conditions:
o+ The two programs have the same input-output behavior.
There is a bijective mapping between those instructions, speculative and non-speculative,
of the speculative program, and those instructions of the non-speculative programs that can
cause an exception.
Under the same input, one program has an exception exactly when the other program does;
further, the instructions that cause the exception are equivalent to each other under the
previous mapping.
o+ At an exception, the machine state presented to the exception handler is identical for both
the programs.
A speculative program is said to be sequennallzabTh if there exists a equivalent, non-speculative,
program. Fig. 3.7 gives an example of a sequentializable speculative program. Equivalent instruc-
tions are shown connected with dashed lines.
Of course, this is true only if the architecture specifies that all speculative register updates
are buffered, as in boosting. If so, it is easy to verify that if the speculative and sequentialized
programs are run, and one of them exceptions, the other will do so. Further, the two excepting
instructions will be equivalent, and the machine states will be identical. Notice that the equivalent
non-speculative program is not the same as the original non-speculative program.
Exceptions in a program with static speculation are precise if the program is sequentializable,
and the exceptions in the non-speculative program are precise. We call such a program precise.
39
3.2.4 Sequentialization Without Shadow Registers
Formally, it is necessary for all effects of a speculative instruction to be buffered from the machine
state till the instruction is committed. This includes writes to registers. For instance, if, in example
Fig. 3.7, instruction (3*) caused an exception, then the state should not reflect the execution of
speculative instruction (1 *) ; in particular, register r5 must not have been modified by (1 *) - `This
implies that an approach similar to that adopted by boosting is necessary--Hspeculative instructions
must write their results to auxiliary storage such as shadow registers, which will be copied into the
machine state only when the instructions are committed.
`This is inefficient: the auxiliary registers are equivalent to the normal registers except that they
can be accessed only by speculative instructions. There are situations in which performance would
be improved if the non-speculative instructions had access to all available registers. Further, there
is an overhead for copying at the commit point: while it is possible to achieve the effect ofa commit
without copying, it complicates the register access hardware. Lastly, either spilling speculative
registers (by explicit load instructions or on a context switch) is not possible, with the attendant
inefficiency, or special instructions are required to access the speculative registers.
For reasons of efficiency, the idea of using exactly one register file, and having the results
of speculative instructions update the register file is extremely attractive. If we use such an
architecture, it is clear that exceptions will be imprecise. However, we can define a weaker notion
of precision. We call an exception of a speculative program on such an architecture precise upto
speculative writes.
Consider an archietecture in which speculative instructions modify their output registers im-
mediately: a speculative program and a non-speculative program for the same architecture are said
to be equivalent ipto speculative writes if they satisfy the following conditions:
o+ The two programs have the same input-output behavior.
There is a bijective mapping between those instructions, speculative and non-speculative,
of the speculative program and those instructions of the non-speculative programs that can
cause an exception.
Under the same input, one program has an exception exactly when the other program does;
further, the instructions that cause the exception are equivalent to each other under the
previous mapping.
At an exception, the machine state presented to the exception handler is identical for both the
programs ignoflng those registers that were last wntten to by uncommilled (i.e. outstanding
orfailed) speculative instn?ctions at that point in the speculanve program.
Exceptions in the speculative program are precise upto speculative write if there exists a non-
speculative program that is equivalent upto speculative writes, and all exceptions in the non-
speculative program are precise. Such a speculative program will be called precise upto speculative
writes.
Under this definition, the speculative program in Fig 3.7 is precise upto speculative writes, with
the equivalent program being the sequentialized program shown in the example. This assumes that
40
in the underlying architecture all speculative output register writes occur as soon as the correspond-
ing instruction completes, and that exceptions for non-speculative instructions are precise. If (3 *)
causes an exception, then all the registers but r5 will be the same in both programs. However, r5
was the output register for (1 *) , which is an outstanding speculative instruction at the point ofthe
interrupt. Therefore, the two register states are the same but for those registers that were written to
by outstanding speculative instructions, and (at least with respect to (3 *) ) the two programs are
equivalent upto speculative writes.
3.3 Implementing Precise Exceptions
3.3.1 Constraining Speculation
The burden of ensuring that a program with static speculation satisfies either of the two definitions
of precision is mostly the responsibility of the compiler. The architecture determines whether it
will be possible to achieve precision, or just precision upto speculative writes. After that it is the
compiler which must ensure that the code is sequentializable.
Proving that an arbitrary program is sequentializable is undecidable. However, it is possible
for a compiler to produce speculative programs that are guaranteed tu be sequentializable, and
whose equivalent program can be found using some simple transform. One intuitive transform is
as follows:
Bach speculative instruction is assumed to be similar to some non-speculative equivalent,
where a non-speculative instruction is said to be similar to some speculative isntruction if
they have exactly the same behavior except that the non-speculative instruction except that
- reports exceptions immediately and
--H refers to the non-speculative locations of all values read/written by the speculative
instruction.
For each speculative instruction, add at its commitpoint a similar non-speculative instruction.
If two (or more) instructions share the same commit point, arrange their non-speculative
instructions in the order in which any exceptions caused by them are ordered, i.e. if the
exceptions of one of the two speculative instructions will be reported first, put the non-
speculative instruction similar to that instruction first.
o+ Delete all speculative instructions, and all instructions that are used only to support specula-
tion.
The bijection under which the two programs should be equivalent is the obvious one: each
speculative instruction is mapped to the non-speculative instruction that replaces it, and each non-
speculative instruction is mapped to itself. Such non-speculative and speculative programs are said
to be similar to each other.
41
`The compiler can ensure that the code it produces is sequentializable, and that the equivalent
program can be found using the simple transformation rule mentioned above, by ensuring that the
produced code satisfies the following constraints:
1. A non-speculative instruction cannot access values in the speculative state; i.e., it cannot
read the result of an uncommitted speculative instruction.
2.
If control flow reaches an instruction's commit point, then it must have passed through the
instruction; specifically, a speculative instruction must dominate its commit points, and lie
on every cycle that contains a commit point.
3. Any value read by a speculative instruction must be available at the instructions commit point
at its non-speculative location. Note that this implies that if a speculative instruction reads a
value produced by another speculative instruction, then its commit point must be preceded
on all paths by the commit point of that other instruction.
4. At the commit of a speculative instruction, the non-speculative location must be updated by
the value produced by the instruction, and not some other value.
All of these constraints can be easily satisfied with current compiler technology. In particular,
Constraint 3 and Constraint 4 can be modeled by assuming that there is a "use" of all the inputs
and outputs of a speculative instruction at the instructions commit point. The other conditions can
be satisfied by speculating instructions in a controlled manner.
Of course, it is necessary that the implementation give the same effect as a serial issue ar-
chitecture. Thus, the implementation must implement precise exceptions for all non-speculative
instructions. Further, for architectures that are precise upto speculative writes, it may be desirable
to ensure that all speculative instructions issued before the exception is reported must have updated
the register state, and that none issued after that point should have modified the register state.
3.3.2 A Proof of Equivalence
A formal proofthat a non-speculative program similar to speculative program will be speculatively
equivalent to it if the compiler constraints outlined above are satisfied is deferred to Appendix A.
There, we shall provide a formal model of execution of speculative and non-speculative programs.
We shall define similarity, compiler constramts and speculative equivalence in the context of that
model, and show that similarity and compiler constraints imply speculative equivalence.
3.3.3 Restart
Restarting in the presence of static speculation is somewhat complicated. Mter processing an
exception, the exception handler may have modified the state in an arbitrary fashion. Normally,
under the model of serial execution that we have assumed, no further action needs to be taken.
However, static speculation introduces out-of-order execution. For instance, if (2 *) in Fig. 3.8(a)
exceptions, then (3 *) will have been executed, out-of-order. Intuitively, (3 *) should be re-
executed in the modified context. Thus, if the exception handler modified the value of r3, (3 *)
42
(2*)			r3			r?			--H
(1)			r5			r?			/
(3*)			ri			r3			+
(4)			br			LO
(5)			(2*)
(6)			(3*)
commit
commit
(b) Speculative Code
r9
r2
Figure 3.8: Restarting
Restart (2*):
(1) r3 :=
(3*) ri := r3
jmp (5)
- r9
+1
Restart (3*):
(3) ri := r3 + 1
jmp (6)
(b) Restart Blocks
should be re-evaluated using this new value. In fact, should any instruction exception, then
all speculative instructions that are outstanding at the point the exception is reported mi?st be
re-executed in the new context.
To re-execute all outstanding speculative instructions requires that the exception handling
system must be able to identify them. The direct approach, keeping a hardware log of all such
instructions, can become prohibitively expensive ifwe allow unlimited speculation. If, for instance,
the hardware can keep track of the 32 outstanding speculative instructions, what happens to
programs which has 33 outstanding speculative instructions at some point? Also, this log becomes
additional state that must be saved on context switches. If this approach is used, it is reasonable
to restrict the compiler to produce code with no more than, in this case, 32 outstanding speculative
instructions at any point.
There is an interesting software-based solution to this problem. In most cases the compiler
(or programmer) that produces the program will be able to identify, for each point in the program,
the speculative instructions that must be re-executed should an exception occur at that point. So,
the compiler adds additional information to the program. This information enables the exception
handler (perhaps with some minor additional hardware support) to identify and re-execute the
outstanding speculated instructions after processing the exception. The exact details of this scheme
depends on the architecture it is applied to; it was first developed for boosting, and a description of
it can be found in [55].
Basically, adapting it to the abstract scheme we have been using requires that the compiler
do the following: for each commit point, the compiler produces a block of code known as the
restart block for that commit point. It contains the non-speculative version of the instruction that
is committed at the point, followed by all the speculative instructions outstanding at the commit
point. The restart block is terminated by a jump to the instruction following the commit point. The
compiler also builds an auxiliary table that maps each commit point to the beginning of its restart
block. This is probably a hash-table, indexed by the address of the commit points.
The restart blocks for (2 *) and (3 *) in Fig. 3.8(a) are shown in Fig. 3.8(b). Note that the
restart block for (2 *) contains the non-speculative equivalent for (2 *), and (3 *), which is
outstanding at the commit of (2 *). This restart block is terminated by a jump back to (5) , the
43
instruction following the commit point.
When an excepting speculative instruction is committed, the corresponding branch handler
is re-executed. Instead of jumping back to the commit point after processing the exception, the
exception handler finds the beginning of the restart block for the commit point. It then branches
to the first instruction in the restart block (or to the instruction after the first one), and resumes
normal execution. Thus, before the original program is re-entered, all the outstanding speculative
instructions will have been re-executed.
In this example, if (2 *) causes an exception, the exception handler is entered at the commit
point (4). Assume that during the execution of the handler, it changes the value of r?. Mter
completing execution, itresumes execution at ReStart (2 *). Thus, it re-executes (2) , changing
the value of r3. It also re-executes the outstanding instruction (3*). Finally, it jumps back to the
original program.
If some non-speculative instruction causes an exception, then all outstanding speculative in-
structions need to be executed. It is too expensive to build a restart block for every instruction.
Fortunately, there is a way to avoid this. This optimization, which was developed for the specula-
tion mechanism technique described in the next chapter, depends on the following observation: an
instruction that is outstanding at some point continues to be outstanding till either its speculation
fails, or till it is committed.
Now, assume that it was possible for the exception handler to determine the next commit point
to be executed. Then, the instructions in that commit block include all instructions that were
outstanding at the excepting instruction, less those whose speculations failed, plus some other
speculative instructions issued in the interim. So, what the exception handler does is set a flag,
called the restart flag, which causes the restart block for the next commit point to be executed in
its entirety. This will have the desired effect, as is indicated by the following observations:
o+ All instructions in the restart block that were outstanding at the time ofthe original exception
will get re-executed in the state as modified by the exception-handler.
Those instructions that were outstanding at the original exception, but are not in the restart
block are those that failed; they did not need to be re-executed, since their results could not
be used by any instruction that did not itself fail.
Those speculative instructions that are added in the interim need to be executed in the state
created by the re-execution of prior speculative instructions in the state as modified by the
exception handler.
Of course, these observations are applicable only if the code was compiled to obey the rules
mentioned above.
Thus, assume that in our example, (1) causes an exception. The exception handler, while
processing the exception, changes the value of r?. Now, it is necessary to re-execute (2*) with
this new value, and therefore the value in r3 is invalid. To ensure that (2 *) gets re-executed in
the new state if necessary, the exception handler sets the restart flag, and resumes execution. The
program executes (3 *) using the invalid value for r3. It then executes the branch. If the branch
is taken, the speculation failed. Therefore, the results of (2 *) and (3 *) are never going to be
44
used, and there was no need to re-execute them. If, however, the branch is not taken, then these
instructions must be re-executed. If the branch is not taken, the commit for (2 *) is executed.
Due to the restart flag being set, the entire restart block is executed, resulting in (2) and (3 *)
being re-executed. This time, (2) uses the value of r? as modified by the exception handler,
and produces the correct value for r3. Then (3 *) executes using this corrected value. Thus,
exceptions of the non-speculative instruction (1) is handled correctly, by using the restart flag.
There is another problem: how are context-switches to be implemented? There will be state in
the processor that keeps track of the speculative instructions, including details such as the shadow
register values and which, if any, speculative instruction has caused an exception. One approach
is to dump out all of this state. The other approach is to recreate it, using the restart technique
outlined above. After a context switch, the first time a commit point is reached, the restart flag is
set. This causes the speculative state to be rebuilt from the saved non-speculative state.
3.3.4 Writing Exception Handlers for Static Speculation
Putting the pieces together, it is clear that writing and executing exception handlers for speculative
programs that are precise by our definition is the same as writing exception handlers for non-
speculative programs with precise exceptions'. Basically, the exception handler should written
with respect to the equivalent non-speculative program. It will not be able to distinguish between
a precise speculative program and its non-speculative equivalent However, before resuming
execution, it will have to use one of the techniques discussed above to re-execute all outstanding
speculative instructions.
In the cases the underlying architecture restricts the program to being precise upto speculative
writes, the exception handler will not be able to distinguish between the speculative program and its
equivalent as long as it does not examine those registers last written by an uncommitted speculative
instruction. Thus, as long as the exception handler is transparent to (e.g., does not read) the values
in those registers, it can be written with respect to the non-speculative equivalent program, and
ignore the fact that it is actually being written for a program with speculative instructions. The
handler must, however, restart differently, by re-executing all outstanding speculative instructions.
3.3.5 Discarding Precision
Implementing precise exceptions in the presence ofspeculative instructions creates two problems:
Register lifetimes are unnecessarily extended. If a register value is read or written by a
speculative instruction, the register cannot be reused after the last time the value is read;
instead, the value must be preserved till all speculative instructions that read the value have
been resolved.
o+ Restarting requires re-executing all the outstanding speculative instructions.
1Note that in the case of non-speculative architectures, precision is usually enforced by the hardware, whereas in
speculative architectures, it requires the cc-operation of software and hardware.
45
A technique that would allow us to implement exceptions imprecisely might allow us to get
around these problems. There are various approaches to weakening precision for non-speculative
architectures [46]. These schemes preserve the ability to execute the exception handlers for those
exceptions critical to the operation of a machine, such as virtual memory exceptions, but exchange
the ability to handle all exceptions for simpler hardware. Unfortunately, in the case of exceptions
on architectures with static speculation, these approaches do not solve the problem. It is still
necessary to re-execute all outstanding speculative instructions after returning from an exception,
and this in turn requires that all register values read by a speculative instruction be preserved. For
instance, consider a ThB-miss exception on the speculative load (1*) in Fig. 3.9. There may
be speculative instructions that would have used the result of the load, such as (2 *). So after
the exception is issued and handled, at the commit of (1 *) , there may be outstanding speculative
instructions which used the value of the load, and need to be re-executed.
(1*)
(2*)
(3)
Li:
r9			id
ril			r9 + riO
breq r4, 0, LO
commit
commit
(1)
(2)
Figure 3.9: Re-executing Speculative Instructions
Fortunately, it may be that all critical exceptions can be processed immediately; for instance, the
`1U3-miss of (1 *) could have been processed before executing any of the remaining instructions.
This would do away with the necessity of restarting any instructions. Of course to get any benefit
from this approach one must save all the state that was associated with speculative instructions
at a context switch; otherwise, recreating it would still require the re-execution of all outstanding
speculative instruction.
3.4 Conclusions
In this chapter, we examined the interaction of static speculation and exceptions. We showed that
the standard model of exception handling on non-speculative architectures needs to be extended to
apply to architectures with speculative instructions. We formalized two such extensions, one for
architectures which buffered a speculative instruction's result register updates till the instruction
was committed, and the other for architectures in which speculative instructions updated their result
registers immediately.
It is possible to produce speculative code in a fashion such that it is impossible to satisfy the
extended precise exception model we have proposed; thus, it is necessary for speculative code to
satisfy certain properties if it is to comply with the exception model. We describe these properties.
We discuss the problems associated with restarting a program after handling an exception. The
major problem is that the outstanding speculative instructions in a program must be re-executed.
46
We show why hardware solutions are inappropriate, and discuss a software approach.
Thus, we arrive at a framework for implementing exception handlers for programs with specu-
lative instructions. The programs are compiled so as to be precise upto the limits ofthe underlying
architecture. The exception handlers are written as though for non-speculative programs with
precise exceptions. The only dilference between these handlers, and those that are written for
non-speculative programs is that they must re-execute outstanding speculative instructions.
Finally, we show how to weaken the extended precise exception model, so that it is still possible
to handle those exceptions such as virtual-memory faults that are essential to the functioning of a
"correct" program, but in which all exceptions cannot be handled. This approach makes it possible
to make better use of available resources.
Chapter 4
Speculative Taggirig
In this chapter, we describe an alternate mechanism for specifying speculative instructions. This
mechanism, known as speculative tagging, is introduced in Section 4.1. Section 4.2 investigates
the problems associated with multi-path speculation, and shows how speculative tagging can be
adapted to overcome them. In Section 4.3 we show how speculative tagging can be optimized for
exception handling and recovery In speculative programs. These modifications, as well as some
others discussed in Section 4.4, are summarized in Section 4.5. Finally, we conclude in Section 4.6.
4.1 Speculation Tags
In Chapter 2, we described two mechanisms that had been proposed in the literature for imple-
menting speculative instructions--Hboosting and safe-bit. We showed that they both had certain
strengths and weaknesses. In particular, safe-bits allowed instructions to be speculated in a fairly
general manner. Boosting, on the other hand, could not represent all possible speculations. How-
ever, boosting could allow all instructions to be speculated, while safe-bits had trouble speculating
instructions with no result registers, such as stores. Further, boosting allowed the hardware to
group identically speculated instructions. This grouping made it possible optimize various actions;
for instance, boosting can simultaneously commit a group of instructions.
Neither boosting and safe-bits are completely satisfactory. Ideally, we would like to come up
with an approach that combines the uniformity and grouping of boosting with the flexibility of
safe-bits. Our mechanism, speculation tagging, attempts to do so.
4.1.1 Success and Commits
In the speculative tagging mechanism, which we shall usually abbreviate to tagging, every specu-
lative instruction is annotated with some tag. A speculative instruction is committed when its tag
is committed. Thus, the speculation desired in Fig. 4.1(a) will be represented using speculative
tagging as shown in Fig. 4.1(b).
Notice that we use a special commit instruction to commit (the speculative instructions
associated with) a tag. The combination of tag and commit provides the same flexibility as the
47
48
In a processor that implements speculation, certain effects of a speculatively executed instruction
are buffered. These effects include exceptions, memory updates, and possibly, as in boosting,
register updates. If the instruction is successful, then these effects are committed, and the buffer
space can be reused. If the instruction fails, the hardware must discard, or flush, the buffered
effects, so as to free the buffer space for later reuse. Therefore, to allow reuse of buffer space, it is
important to indicate failure as well as success in an architecture that implements static speculation.
Failure ofspeculation is fairly straightforwardto detect in boosting--Hifcontrol flows offthe path
specified by the boosted instruction, then the instruction has failed. In safe-bits, a speculation fails
when the output register of a speculative instruction is written to by a non-speculative instruction
without having been read.
We indicate failure of speculation by means of explicit flush instructions. When a tag is
flushed, all speculative instructions with that tag are assumed to have failed, and the resources
4.1.2 Failure and Flushing
output-register/register-read mechanism used in safe-bits. Further, since tags are independent of
output registers et. al., they can be applied uniformly to all speculative instructions.
Multiple speculative instructions can have the same tag. These instructions are considered to
be in the same group. This allows us to specify and take advantage of grouping; for instance,
identically speculated instructions can be given the same tag, and all of the speculative instructions
with the same tag are committed when the tag is committed. In Fig. 4.2, both (1) and (2)
have been speculated from the same region. They are given the same tag; hence, when (3) is
executed, they will both be committed together. Notice that it is straightforward to determine if
two speculative instructions are in the same group; simply examine the tags.
Figure 4.1: Speculative Tagging
(2) commit :3
(a)
(1)r5 :=r8/r2
(l*)r5:3 :=r8/r2
A
A
A
A
49
1A noteaboutterminology: (r1) denotes the contents oftheaddress stored in r1. Thus (r1) : = . . is a store
to (rl),and . . := (ri) isareadfrom (ri)
It is possible to use speculative tagging with either a single register file, or with some mechanism
that buffers speculative register updates, such as the shadow register scheme used in boosting. As
has been shown, the single file approach tends to make better use of the available registers. We
shall, in this chapter, and through the rest of the dissertation, assume a single file approach, unless
otherwise stated. This, of course, restricts us to being precise upto speculative writes.
being used to buffer their effects are reclaimed. A flush instruction is typically added right after
a branch direction that, if taken, causes the speculation to fail. Thus, the code produced for the
speculation shown in Fig. 4.1(a) would actually be that shown in Fig. 4.3
Notice that one of the resources that is reclaimed is the tag itself. When a tag is flushed (or
committed) it, too, is reset for reuse; among other things, all speculative instructions that were
associated with the tag cease to be so associated, and a new group is started.
Figure 4.2: Grouping using Speculative Tagging
Upto this point, we have not considered the problems associated with speculating memory opera-
tions, such as stores and loads. It turns out to be necessary to add additional hardware to support
such speculation. To understand why, consider Fig. 4.4(a)1; assume that it is desirable to speculate
(3) up past the branch and issue it before (2). Clearly, doing this naively will result in (3)
overwriting the value to be read by (2)
(a)
(3) commit :3
4.1.4 Memory Operations
4.1.3 Register Files
(l)r5 :=r8/r2
?(2)r4:=r3+3
(l*)r5:3 :=r81r2
50
(l*)r5:3 :=r81r2
(4) flush :3
(2) commit :3
Figure 4.3: Speculation with Flushes
A possible solution is to use different memory locations, as shown in Fig. 4A(b). Originally
both stores wrote to the address in ri. In the transformed version, the speculative store, (3*) is
to some other memory location, whose address is stored in r7. The instructions that used to read
the value stored by (3) , such as (4) , must be modified to now use the address in r?.
(1)			(ri)			.			(1)			(ri)
(3*)			(rV)
(2)			(ri)			(2)			. .			(ri)
branch
(3)
(4)
(5)
branch
(ri)
(a)
(ri)			(4)
(r8)			(5)
Figure 4.4: Renaming Memory
(r8)
(b)
Unfortunately, there are situations where this approach will fail. Consider instruction (5) in
Fig. 4.4(b). Clearly if r8 has the same value as ri when the program is executed, (5) , too,
should be modified to use the new address. However, if r8 and ri are different, it should be left
untouched. Unfortunately, is not always possible for the compiler to determine which of the two
behaviors is correct. Assume (4) was a read of x [I] and (5) of x , where x is some array.
Then, r1 will be the same as r8 exactly when 1 is equal to j. Obviously, there will be situations
when it is impossible for a compiler to determine whether this is so. Worse still, there may be
situations in which the two are sometimes equal, and sometimes not. This is another problem
caused by ahasing.
Aliasing forces us to add extra hardware to support the speculation of memory operations. The
51
mechanism used is an adaptation of the store-buffers used in non-speculative processors. In such
processors, when a store operation is executed, the results of the store must be written to the cache
(and possibly to memory, such as when the cache is a write-through cache). It may be the case
that it is not possible to write the store into cache immediately, either because of a cache-miss, or
because of the cache organization. In this case, the store is buffered in the store-buffer till it is
possible for it to be released to the cache. All loads must look in the store-buffer, as well as the
cache. If there is an address match both in the store-buffer and cache, then the value from the
store-buffer is returned.
The modifications that must be made to the store-buffer to support speculative memory opera-
tions seem fairly straight-forward. Briefly:
o+ A speculative store is first buffered in the store-buffer.
o+ If the speculation succeeds, the store is written to the cache.
o+ If the speculation fails, the store is discarded.
o+ A load looks in the store-buffer for a match.
o+ If there is a match, then the matched value is returned.
Notice, however, that these modifications do not cover the case when there are multiple matches
in the store-buffer. In the original, non-speculative, store-buffer the solution was to return the value
of the match added last to the buffer. This solution will work in the presence of speculation only
if the order of memory references to the same address is preserved during the speculation. In the
presence of aliasing, this condition can result in having to preserve during speculation the order of
all memory operations.
Notice also that the maximum number of speculative stores that can be simultaneously out-
standing is limited by the size of the store-buffe? This size must be known to the compiler, which
must ensure that this limit is not exceeded while speculatively scheduling stores.
4.2 Multi-Path Speculation
Previous work on speculation mechanisms has mostly dealt with speculation past exactly one side
of a branch; it has generally ignored the complications caused by speculating instructions from
both the taken and not taken sides of a branch.
One complicationthat arises in multi-path speculation occurs when the same destination register
is being used for instructions being speculated from both sides of a branch. Obviously, one of the
two has to be renamed. All the uses of this register have to be renamed, too. This is possible as
long as the uses can access only one of the two values, as is the case in Fig. 4.5. If, however, there
exists an instruction that can read both values, such as (3) in Fig. 4.6, then it becomes necessary
to "undo" the renaming by adding a copy operation at the appropriate point.
Memory operations have a similar problem. If memory operations are moved past both sides
of a branch, then they may interfere with each other. Further, because of aliasing, it may not
be possible to determine whether or not the operations interfere. Nor is it possible to solve this
52
(1*) rt:3 :=...
(2*) r7:7
commit :3
flush :7
commit :7
flush :3
(4) o+=r7
Figure 4.5: Tagging with Renaming
(1)rl :=...			(2)rl
(1*)rl:3
(2*) r7:7
commit :7
flush :3
(4)rl :=r7
Figure 4.6: Tagging with Copying
53
(2) ..:=(r3)
Th
(1*) (rl):3
(3*) (r3):7
(2*) ..:3			(ri)
(4*) ..:7			(r3)
(3)			(r3) :=..			collirnit :3			corninit :7
(4)			.. :=(rl)			flush :7			flush :3
Figure 4.7: Multi-Path Memory Speculation
problem using the store-buffer. Consider the case illustrated in Fig. 4.7. Clearly, (2*) should read
the value at address r3. When r3 = r1, this should be the value written by (1 *) However, the
store-buffer as described above will incorrectly return the value stored by (3 *) In fact, unless
r3 is always different from or always the same as r1, no order of the four speculative instructions
is correct.
`This problem arises because a store (3) that would not have been executed simultaneously
with the load (2) in the sequential program is visible to the load in the speculative program. So,
the speculative load instruction must somehow be able to specify those instructions that are visible
to it. One simple technique is to add a set of tags to each speculative load instruction, called the
path mask. When a speculative load is matched in the store-buffer, all stores whose tags are not in
the path mask are ignored. In the example, (2*) could be written . . : 3 : = (ri) [1 . Since
(3*) `s tag, : ?, is not in the mask, it will not be considered. Thus, if r3 = ri, the value stored
by (1*) willberetumed.
Path masks in effect declare all stores that should be visible to the load operation. This property
makes path masks useful for other purposes as well. For instance, path masks can be used to allow
stores from different basic blocks to be speculated past memory operations from other basic blocks,
as is shown in Fig. 4.8. Each load simply masks out the stores that were speculated from below the
load; as a special case, non-speculative loads ignore all speculative stores.
4.3 Implementing Exceptions
The previous chapter showed how exception recovery issues imposed certain constraints on the way
the compiler could speculate instructions. In this section we shall discuss hardware modifications
tailored for speculative tagging that help alleviate some of these restrictions.
54
(l)(rl) :=
(2) ..
(1*) (rl):5 :=
(3*) (rl):2
(2*) ..:5 := (rl)[5]
(4*) ..:2 := (r3)[2,5]
commit :5
Figure 4.8: Multiple Memory Speculation
4.3.1 Register Pressure
One constraint on the compiler is that it has to ensure that, for each speculative instruction, the
instruction's inputs are available at the commit point. Since more than one instruction can be
committed with the same instruction, this implies that a register can be used as the output register
for at most one instruction with a given tag. Consider the speculation shown in Fig. 4.9, which
uses ri as the output register for both (1*) and (2 *) Even though this code has the correct
input-output behavior, it is illegal, since it will result in imprecise exception for (2 *). In particular,
the value in r 1 that is used by (2) would have been overwritten by (3). It is still possible to
perform this speculation by using a different register as the output register of (1 *) ; however, this
increases the register pressure.
(1)
(2)
(3)
branch
ri
= .. ri
ri
(a)
(1*)
(2*)
(3*)
Figure 4.9: Illegal Speculation
ri:3
:3
rl:3 :=
branch
commit :3
ri
It is possible to modify the hardware to make this sequence legal. The hardware keeps track of
whether any instruction associated with a tag has caused an exception; if so, it treats all subsequent
55
branch
ri
r2
(1)
(2)
branch
(3) r4
(a): Original Code
(2*) r2:3
(3*) r4:1
(1*) r1:3
branch
C3: commit :3
Restart :3
(2') r2
(1') ri
(3?) r4:1
jmp c3 + 1
Restart :1
branch			(3') r4
commit			imp Cl + 1
Cl:			:1
(b): Speculated Code
Figure 4.10: Restarting after an Exception
(c): Restart Blocks
speculative instructions with that tag as no-ops (until the tag gets committed or flushed, of course).
This can be implemented by adding an exception flag for each tag. With this modification, the
above sequence would be legal. If (2 *) caused the exception, (3 *) would never be executed,
and the value in ri would be the necessary mput to (2*).
Note that if the values were written by instructions with different tags it is always possible for
the compiler to relieve register pressure. The compiler simply spills the registers, i.e., stores the
register value in memory, and loads it into the necessary register where appropriate, such as before
the commit of the instruction's tag.
4.3.2 Restart Blocks
One possible exception recovery mechanism that can be used is the restart block. let us review
the use of the restart block: when an exception is reported at a commit point, the exception handler
is entered. After the exception handler finishes execution, the code branches to the restart block
for that commit point. In particular, it branches to that instruction corresponding to the excepting
56
speculative instruction. In Fig. 4.10, if (1*) causes an exception, then the restart block should be
enteredat (1').
Finding the beginning ofthe restart block will probably involve using the address ofthe commit
instruction to hash into a table that will provide the address ofthe corresponding restart block. Thus,
the address of the instruction commit :3 would be used to determine the address Restart :3.
However, finding the address of the instruction within the restart block at which execution is to
resume is more of a problem. The hardware can aid this process: specifically, it can keep track of
the number of successful instructions issued for each tag. This is implemented by adding an issue
counter to each tag. The instruction in the restart block corresponding to the instruction that caused
the exception is offset from the beginning of the restart block by the value of the issue counter. If
(1 *) caused the exception, only 1 instruction, (2 *) would have been issued. Thus, the excepting
instructionisRestart:3 + 1.
4.3.3 Lazy Restart
Restart blocks can result in a significant amount of code replication. Part of expansion can be
avoided by using a scheme called lazy restart. In this scheme, the restart block for a commit
instruction contains only those instructions corresponding to the ones that are committed by that
commit; thus, it does not contain copies of instructions that are outstanding at the commit point.
Now, if a commit point reports an exception, all instructions with the same tag issued subsequent
to the instruction that caused the exception will be re-executed using the restart block. All the
outstanding speculative instructions still need to be re-executed. Notice that these instructions are
available in the restart blocks corresponding to their commit points. So, the hardware simply forces
the entire restart block to be executed when (and if) control reaches the commit point.
To implement lazy restart, a restart flag is added to every tag. When an commit reports an
exception, it sets the restart flags for all tags other than the current one. Before trying to commit a
tag, the instruction examines the tag's reset flag. If it is set, the commit branches to the beginning
of its restart block, and starts executing it. The restart flag is cleared at each commit or flush
instruction.
In the example above, instruction (3 ?) would not be part of the restart block. Instead, if (1)
should report an exception, the restart flag oftag : 1 would be set. Then, ifcontrol reached commit
1, the program would start executing the corresponding restart block, and in particular would
re-execute (3). Of course, if control never reached the commit instruction, it would eventually
encounter a flush, and the restart flag would be cleared.
4.4 Miscellanea
Just as registers can be saved to memory and later restored, so should the information associated
with tags. This would be be useful in the same way that register spilling is useful:
o+ When the number of tags required exceeds the number available.
o+ To preserve tag values across function calls.
57
o+ To preserve machine state on a context switch.
This would need additional instructions which moved the state associated with tags, including
exception information, the various flags, and store buffer entries, to and from memory.
As mentioned above, one way of dealing with tags at a context switch is to save their values,
and restore them when the process is swapped in. The contents of the store-buffer must also be
saved.
However, it is not necessary to save this "speculative" state. An alternative mechanism, based
on lazy restart, saves only the state that would be saved in a non-speculative architecture (i.e.,
registers, PC, etc.). Instead, the restart flag for all tags is set when a process is swapped in. This
will cause the program to regenerate the speculative state by re-executing all speculative instructions
that were outstanding at the time of speculation. During this re-execution, any exception that was
caused by a speculative instruction, and thereby lost on a context switch, will be reported when
that instruction's equivalent in a restart block is executed.
4.5 Design
4.5.1 Summary
ri h?ni? f
?Iementin? static sneculation with a
Ibe arctiitecture we ?opose nas va??3us mec??????ms ? 1II1? -
single register file. These mechanisms are built around a new resource, the speculation tag. A
processor has several speculation tags, which can be uniquely identified (written as : n, where n is
some number). Each tag has the following state associated with it:
Issue Counter keeps track of the number of speculative instructions with that tag that have been
issued. Is incremented every time such an instruction is issued, unless some previous
instruction with that tag would have caused an exception.
Exception Flag is set when some speculative instruction with that tag would have caused an
exception. Once set, all further instructions with that tag are converted to no-ops.
Restart Flag is used to signal that the entire restart block associated with the tag needs to be
executed.
Additionally, there may be some state keeping track of exception information.
Each speculative instruction has a tag. This tag is used to buffer and the instruction's "specu-
lative" state, i.e., information about any possible exceptions caused by the instructions, and about
the store-buffer.
Tags can be manipulated in various ways. They can be cornmitted,ffi?shed or restarted. Further,
they can be spilled to and loaded from memory. There may exist special instructions that perform
some combination of these tasks simultaneously. A good example would be an instruction that
takes two sets of tags as inputs, committing one, and flushing the othe? Further, this instruction
could be combined with some other instruction, such as a branch.
58
Loads use tags, in the path-mask, to mask out unwanted stores while searching the store-buffer.
I?rpically, the path-maskfor all loads with the same tag will be the same. Thus, instead ofspecifying
a path-mask with each load, a special instruction can be used that associates a path-mask with each
tag. Then, every load instruction with that tag will implicifly use that associated path-mask.
One important question that we have not addressed in this chapter is the number oftags needed.
Every tag takes up bits in the instruction word; we would therefore like to make provision for as
few tags as possible. Note that the ability to spill tags allows us to reuse tags and therefore perform
more speculation than would be otherwise possible; however, this solution comes at the cost of
extra spills. We would like to keep the number of spills to a minimum. One way of doing that is
to suitably restrict the amount of speculation performed. The scheduling algorithm described in
Chapter 5 incorporates techniques for doing so.
We shall see in Chapter 6 that the amount of speculation required to fully exploit a processor
depends on the issue-width of the processor. The results in that chapter indicates that, for a 4 issue
processor, the best performance is obtained when instructions are speculated into a basic block
from its 5 most likely descendant blocks, while for a 8 issue processor, the best performance is
obtained using the 8 most likely descendants. The results described in that chapter will indicate
that, under these circumstances, one can avoid spilling 98% of the time by providing 5 and 8 tags
respectively. Further, to capture most of the performance, we will need only 3 and 5-7 tags to get
most of the performance from 4 and 8 issue processors.
4.5.2 Cost Comparison
There are several ways in which static speculation increases the cost of building the architecture. It
requires extra bits in instruction words. Some hardware is required to buffer the exceptions caused
by speculative instructions. Finally, extra hardware is needed to "shadow" or buffer the effects
of speculative instructions. We shall roughly estimate these costs for the three static speculation
mechanisms we have discussed: boosting, safe-bit and speculative tagging.
Instruction Cost
The safe-bit approach adds an extra bit to each instruction, indicating whether it is speculative or
not However, this is not the only additional bits added to the stream by the addition of speculation.
In the safe-bit approach, it is necessary to read the output of a speculative instruction to trigger the
reporting of any exception caused by it. For some speculative instructions, the result may be read
at the commit point as part of natural program execution. Otherwise, dummy read intructions will
need to be added to commit the result. This additional overhead may be significant. Even if only
1 in 10 instructions is a speculated instruction that requires an additional (32 bit)2 dummy read
instruction, it still results in an additional 3 bits per speculated instruction.
The number of additional bits required by boosting depends on the maximum number of
branches that need to be speculated past, and how flexible the path specification should be. Assume
2Clearly, there is no need to use a full 32 bit instruction for the dummy read; however, this is the rnethod used in
the published literature on safe-bit speculation.
59
that we pick the most flexible variant ofboosting, which includes "don't-care" branch specification.
That variant requires 2 bits per level. Our results suggest that, in the symmettric 4-issue case,
restricting speculation so that instructions are moved past at most 2 branches is adequate. Thus
boosting requires an additional 4 bits per instruction. Note that boosting needs no additional bits
at the commit point.
In speculative tagging the additional bits required depend on the number of tags required.
Results for symmettric 4-issue processors suggest that 3 tags will be adequate. In that case,
we need an additional 2 bits per instruction. Further, we will need to commit/flush these tags.
Assuming that the commit and flush operations are folded into the branches, as described above,
we add an additional 6 bits per branch instruction. Assuming that conditional branches form 20%
of the total instructions, and that all branches incur this 6 bit penalty, we add an average of 1.2 bits
per instruction to the instruction stream, for a total overhead of around 3 bits per instruction.
Exception State
The safe-bit approach potentially needs to save exception information for every register. Thus,
there needs to be one exception vector for per register.3 Most processors today have 64 registers
(32 integer and 32 floating-point). Thus, safe-bit requires at least 64 exception vectors.
Boosting requires one exception vector per path. Assuming the 2 level boosting described
above, 4 exception vectors are needed.
Tagging requires an exception vector per tag. Assuming 3 tags, for the reasons described above,
3 exception vectors are needed.
Bach exception vector holds roughly 64 bits of information. Implemented as single-ported
SRAMs, this would require roughly 400 transistors per vector. Thus, the safe-bit approach will
require 24000 transistors more than the other two.
Other Buffer State
The other buffer state added because of speculation that we have considered includes shadow
register files and store-buffer. Note that neither is necessary, and the existence ofeither is orthognal
to the technique chosen to specify speculation. For instance, if there is no store-buffer, then stores
cannot be speculated. This restriction is akeady present in the original safe-bit proposal; we could
add it to either of the other two, and do away with the need for a store-buffer. Similarily, though
boosting as originally proposed used shadow registers, it is possible to combine immediate register
update by speculative instructions with the path-to-commit technique used by boosting to specify
speculative instructions.
An store-buffer is basically a hash-table. It contains 32 (or 64) bits of data and a 32 (or 64) bit
address that acts as the key. Implementing an 8-entry 32-bit store buffer using associative storage
3Earlier safe-bitproposals had suggested tnat itwould be possible to save hardware by saving some of the exception
information in the register; however, this introduces complications. For instance, extra data-paths will be required,
from the exception producing "unit" to the register file. This in turn requires either more pons in the register file
or additional arbitration logic. In light of these complications, we shall assume that exception vectors are stored in
locations distinct from the register file.
60
would require roughly 10000 transistors. Note that [53] state that adding such a buffer can improve
performance of the safe-bit technique by 3%-8% by permitting the speculation of stores.
As argued above, replicating the register file to create shadow register files is not as beneficial
as simply doubling the number of registers. However, if one chose to do so (perhaps for reasons
related to exception recovery), each copy of a register file containing 32 32-bit registers would
require 7000 transistors. However, this estimate significanfly underestimates the actual costs of
replicating the register file, since it ignores the fact that the registers will be multi-ported. It may
be off by as much as a factor of 5.
4.6 Conclusions
In this chapter, we proposed a novel mechanism for implementing static speculation, called specu-
lative tagging. Speculative tagging combines some of the strengths of prior proposals. It provides
the flexibility of the safe-bit approach, thereby enabling it to perform several types of multi-path
speculation not expressible in boosting. It allows speculative instructions to be grouped, which
minimizes the number of commit points in the program, thereby allowing exceptions to be imple-
mented with fewer and smaller restart blocks.
Grouping has other advantages, which stem from the fact that instructions with the same tag
belong to the same group. Two uses of this were shown while explaining paths; first, instead of
having to specify all speculative stores visible to a speculative load, we simply specified the tags.
second, instead of defining a path for each load, we specified a path per tag.
We investigated the problems associated with multi-path speculation. We showed how to
specify multi-path speculation using speculative tagging. Finally, we showed how to modify
speculative memory operations, so as to support multi-path speculation of memory operations.
Lastly, we considered implementing precise exceptions. We showed how our scheme could be
modified to somewhat reduce the register pressure caused by the requirements of implementing
precise exceptions on a statically speculative architecture. We showed how to modify our scheme
to optimize restart. for instance we introduced the instruction counter that allowed speculative
tagging to quickly identify the instruction in the restart block where execution should resume after
exception handling. We also described a technique called lazy restart that enables us to reduce the
size of the restart blocks.
Chapter 5
Scheduling with Speculative Instructions
In this chapter, we describe our scheduling algorithm, called whoTh-DAG schediihng. Section 5.1
provides the details of prior work from which our algorithm was developed. Section 5.2 details
the whole-DAG scheduling algorithin. A more detailed description of the selection heuristic used
in the algorithin is given in Section 5.3. A brief comparison with prior work is give in Section 5.4.
Finally, we conclude with a summary in Section 5.5.
5.1 Background
In this section we shall discuss certain aspects of scheduling: information representation, list
scheduling and selection heuristics, global code motion, and branch probabilities.
5.1.1 Information Representation
?I?rpically, the code to be scheduled is presented to the scheduler as a directed graph, the controlflow
graph (CFG) [lj. We shall briefly review some terms associated with the CFG. The nodes of a
control-flow graph are basic blocks. Basic blocks are linear sequences of instructions. There is a
directed edge between two basic blocks if it is possible for control flow to reach the second basic
block after passing through the first basic block. A fork in the CFG occurs when a basic block
has more than one successor. A join in the CFG occurs when a basic block has more than one
predecessor. Fig. 5.1(b) shows the CFG for the code in Fig. 5.1(a).
Scheduling algorithins make extensive use ofdependence information. One way ofrepresenting
data dependences between instructions is by using def-ztse chains. Def-use chains connect two
dependent instructions by a directed edge. Thus the def-use chains for the flow dependences in
the example in Fig. 5.1 would be represented by the def-use edges shown in the dependence graph
shown in Fig. 5.21. There should also be an edge from (1) to (?), representing the output
dependence between them caused by r 1.
We list some terms associated with the data dependences of a program below.
1Note that, for clarity, we have not shown the branch instrrlction.
61
62
(1)rl
(2)r2:=..
(3)br==...Ll
LO:
(4)... :=r2...
(5) goto L2
Li:
(6)... :=r2...
(7)rl :=...
L2:
(8)... :=rl
Fork
Th?
(1)rl
(2)r2:=..
(3) br== ... Li
M
(4)...
Merge			(8)... ri
(7)rl
Figure 5.1: Control Flow (3raph
(1)rl
Figure 5.2: Def-Use Chains
(a)
63
Dependence Path A dependence path is a path in the dependence graph, i. a sequence of
instructions each of which is dependent on the previous instruction in the sequence.
Path Length The length of a path is the time it would take to execute the instructions in the path.
Critical Path The critical path in a basic block is the dependence path with the greatest length. It
is a lower bound on the amount of time to execute the instructions in a basic block.
Critical Instruction An instruction on the critical path is known as a critical instruction.
Distance The distance between two instruction is the length of the longest path joining them.
Start The staat instructions are those instructions that are not dependent on any other instruction.
For engineering reasons, a dummy instruction called the start instruction is often added to
the CF(3, and all otherwise independent instructions are made dependent on it.
End
The end instructions are those instructions that have no instruction dependent on them. Again,
a dummy instruction called the ei?d instruction is often added to the CFG, and it is made
dependent on all instructions that otherwise have no other instructions dependent on them.
Early The early of an instruction is the distance between start and the instruction. It represents
the earliest time after which an instruction can be issued in any execution.
Late The late of an instruction is the distance between the instruction and end. It represents the
least interval possible between the execution of the instruction and the execution of the last
instruction in the block.
While def-use chains represent the data dependences between instructions adequately, they
ignore the effects of control dependences. Therefore, we use an alternative dependence represen-
tation technique, the dependenceflow graph (DFG) [48,28]. In the DFG, dependence edges that
cross basic block boundaries are "pinned". In particular, a data dependence edge that crosses a
fork is split into several edges by the addition of a dummy node associated with the fork known
as a switch. Similarly, a data dependence edge that crosses a join is split into several edges by
the addition of a the merge node, which is associated with the join. Further, some of these newly
created edges are combined. Consider the flow dependences for register r2 in the example of
Fig. 5.1. In the DFG representation, the dependence edge between (1) and (4) is broken into
two edges: between (1) and a switch, and between the switch and (4). The dependence between
(1) and (6) is broken up similarly. Finally, the same edge (and switch) can be used for the (1)
to switch component of both dependences. Thus, the two dependences are represented in the DFG
by 3 dependence edges: (1) to switch, switch to (4) and switch to (6).
Note that there is no difference between representmg just data dependences and using a DFG
representation as long as the dependences are confined to a single basic block.
A detailed discussion ofthe properties, construction and uses of DFGs can be found in [27]; for
our purposes it is enough to note that no dependence in the DFG can cross a basic block boundary
without also passing through a switch or merge node, and we can discover the required control
dependence information by examining these switcWmerge nodes.
64
(l)rl
(2)r2:=..			??
(3)br...LO			I			Switch
r
(4)... :=r2...,'			(6)... :=r2...
Merge
Figure 5.3: Dependence Flow Graph
Our code typically computes properties of the program by setting up and solving data-flow
equations on the DFG. Since the implementation relies fairly heavily on the properties of DFGs,
we shall defer all such details to Appendix B. In this chapter, we shall attempt to explain the
algorithms and heuristics used without reference to DFGs.
5.1.2 List Scheduling
In this section, we shall describe a technique used for scheduling within single basic blocks, known
as llst-scheduling. It forms the basis for several global scheduling techniques, including ours.
List-scheduling operates on a cycle-by-cycle basis. At each cycle, upto N instructions are
scheduled, where N is the issue width of the processor. For correctness, all instructions that an
instruction depends on must already have been scheduled before the instruction is itself chosen for
scheduling.
Further, for optimal performance, the instruction(s) scheduled should be selected so as to
minimize the likelihood of a stall. For instance, the processor will stall if it issues an instruction
with result latency 1 and then encounters an instruction which uses that instruction's result before 1
cycles have elapsed. To minimize the chances ofthis happening, the scheduler will start considering
instructions that use the result of a 1 latency cycle for scheduling only after 1 (scheduler) cycles
have elapsed.
These observations form the basis for the ready-list. The ready-list at any cycle is a list of those
unscheduled instructions that are ready at that cycle, where an instruction is considered ready if it
satisfies both of the following conditions:
65
Original			Cycle
(1)rl:=r4*7			1:
6( (2)i;3:=r2+2
(3)r5:=rl+r3
(4)r6:=r3+8
Ready List			Scheduled
(1)rl :=r4*7
(2)r3 :=r2+2
2:			(2)r3:=r2+2
3: (4)?:=r3+8
4: (3)r5:=rl+r3
Figure 5.4: List Scheduling
(1)rl :=r4*7
(2)r3 :=r2+2
(4)r6:=r3+8
(3)r5 :=rl +r3
o+ All instructions on which it is dependent have been scheduled.
o+ If it uses the result of an instruction with latency 1, then that instruction must have been
scheduled 1 cycles (or more) earlier.
At each cycle, the scheduler schedules upto jV instructions from the ready-list, and then increments
the cycle. As a consequence of this, additional instructions may be added to the ready list.
This process is illustrated by the example in Fig. 5.4, in which the target processor is a single
issue pipelined processor with a multiply result latency of 6. Initially, at cycle 1, the ready-list
consists of those instructions which are independent of all instructions in the basic block, namely
(1) and (2). Out of these, the scheduler picks one for scheduling, say (1). It then considers
cycle 2. At this cycle, the only ready instruction is (2) , which it then schedules. At cycle 3, (3)
becomes ready: it uses the result of single latency operation (2), which was scheduled in the
previous cycle. This is the only instruction on the ready list, and is scheduled. At cycles 4-6, the
ready-list is empty, and no instruction can be scheduled. Finally, at cycle 7, (4) becomes ready,
and is scheduled.
5.1.3 Selection Heuristics
While list scheduling, there will be some cycles in which the number of instructions that are ready
exceeds the number of instructions that can be issued. Some heuristic has to be used to choose
between the available instructions.
Clearly, the time taken to execute the instructions in a basic block can be no shorter than the
length of the critical path. If this bound is to be met, critical instructions should be scheduled as
66
soon as they become ready. Otherwise, if a critical instruction is scheduled later, then executing the
instructions as scheduled wilI necessarily take longer than the critical path length. Thus, critical
instructions must be scheduled before non-critical instructions.
One can extend this idea further. An instruction cannot be scheduled before its early time
without stalling the processor, and thus lengthening the time to execute the program. Similarly,
an instruction cannot be scheduled less than late cycles away from the end. Thus, an instruction
cannotbe scheduledany laterthancr?ticalpath length --H latewithoutimpairingperformance.
However,itcanbescheduledinanycyclewithinthesebounds. Thisistheslackof the instruction.
slack crit?calpath length --Hlate--Hearly
Not surprisingly, the slack of a critical instruction is 0.
In Fig. 5.4, the critical path was (1), (3), with a length of 7. (2) was independent of all
other instructions, so its early was 0. (4) , with a latency of 1, depends on it, so (2) had a late
of 1. Thus, (2) could be scheduled at any cycle between 1 and 6 (inclusive) without hurting
performance; consequently, it has a slack of 6.
Now, an instruction with greater slack has more cycles in which it can be scheduled without
hurting performance. If a conflict for resources arises, it is bener to allocate the resource to the
instruction with less freedom. The instruction with more freedom will have more chances to find
a cycle in which there is no conflict for resources. Thus, a scheduling heuristic that can be used
is, when selecting between instructions, prefer those with less slack. For instance, in the example,
(1) had a slack of 0, while (2) had a slack of 6. Thus, we scheduled (1) first. This still left 5
other cycles in which (2) could be scheduled without impairing performance. As it turned out, it
was simple to find another cycle in which to schedule (2).
The slack selection heuristic can be greatly simplified when using list scheduling. List schedul-
ing schedules cycle-by-cycle. In particular, when it is attempting to schedule an instruction, it does
not matter what the instruction's early was--Hthe instruction cannot be scheduled any earlier than
the current cycle. Thus, a more appropriate value for slack would be
slack cr?ticalpath length--H late--Hcurrentcycle
Of these terms, only late is instruction dependent. In particular, choosing the instruction with
minimum slack is equivalent to choosing the instruction with maximum late. This simplified
heuristic of choosing the instruction with maximum late is commonly used in list scheduling.
5.1.4 Global Code Motion
The scheduler is typically given a region of code. It rearranges the instructions within this region,
so as to decrease the amount of time taken to execute the region. When the region is a basic
block, the scheduler simply reshuffies the instructions within the basic block. However, if the
region contains several basic blocks, then instructions may be moved between basic blocks. This
isknownasglobalcodemotio7i.
Global code motions can be divided into two categories: safe and i?iisafe. The safe code
motions are illustrated in Fig. 5.5. Briefly, they include:
67
I'
I'
I,
(a) Down Past Fork			(b)Up Past Join
Figure 5.5: Safe Code Motions
Movement down past fork (Fig. 5.5(a)): When an instruction is moved down past a fork, it is
replicated and added to each destination basic block of the fork.
Movement up past join (Fig. 5.5(b)): When an instruction is moved up past a join, it is replicated
and added to each source basic block of the join.
The unsafe code motions (Fig. 5.6) are those that add instructions to paths on which they would
otherwise have not been present. If one of these instruction caused an exception, then the program
would incorrectly have caused an error. Hence, some mechanism must be present to prevent
this from occurring. The unsafe code motions, and possible hardware mechanisms that allow the
compiler to safely perform these motions, are listed below:
Movement up past fork (Fig. 5.6(a)): To allow an instruction to be moved up past a fork,
some support for speculative execution is required. The approach we are studying involves
convertmg the instruction being moved into a speculative instruction.
Movement down past join (Fig. 5.6(b)): Predicated execution [47] allows the movement of
instructions down past joins. It uses predicated instructions, i.e. instructions with a third
input known as the predicate. Predicated instructions are executed only if the predicate
is true. To safely move an instruction down past a join, it is converted into a predicated
instruction with an appropriate predicate input, as shown in Fig. 5.7.
Our work ignores the possibility of movement down past joins. We shall use the terms
movement up and down to indicate safe code motions up past joins and down past forks. We will
use speculative movement to denote movement up past forks.
68
(a) Up Past Fork
(b) Down Past Join
Figure 5.6: Unsafe Code Motions
(1) ifnot ri, LO
(2)			r3			r5 + r?
LO:
(1) ifnot ri, LO
LO:
(2*) r3 r5 + rV if ri
(a)
Figure 5.7: Predication
69
5.1.5 Probabilities in Scheduling
Certain regions of code get executed more often than others; consequently, the effect of reducing
their execution time on overall execution time is much greater. In fact, in certain cases it may be
beneficial to increase the execution time of some region, if it decreases that of some other region,
since the execution time of the program as a whole may still go down. Such considerations play a
major role in most global scheduling algorithms.
The relative execution frequencies of various regions can be derived from the branch proba-
bilities, i.e. the probability that a branch will be taken/not taken. Typically, the method used to
identify branch probabilities is profilitig. In profiling, an instrumented, unoptimized, version ofthe
program to be compiled is executed using representative sample data. For each branch, the number
of times it was taken and not taken is recorded. These numbers are used to estimate the branch
probability. Then the program is compiled and optimized using these probabilities.
Of course this approach may not always be acceptable; there may be no representative sample
inputs, the user may not wish to compile/execute the program twice etc. In this case, the compiler
will need to estimate the probabilities from the structure of the program. Recent work has shown
that the compiler can do this with reasonable accuracy [4].
5.2 Whole-DAG Scheduling
Whole-DAG scheduling extends list scheduling to general acyclic CFGs. Such regions form
directed acyclic graphs or DAGs, hence the name. The extensions made by whole-DAG scheduling
make it possible to use modified versions of heuristics such as late as selection criteria. Further,
the extensions make it easy to control both inter-block code motion and the amount of speculation
one can represent or is willing to allow.
5.2.1 Frontiers
Some global scheduling algorithms allow the movement of branches past branches, and other
transforms that change the control structure of the program being scheduled. We, however, do not
allow any such code motions, and shall preserve the shape of the original graph. Thus, we know
the shape of the final DAG a priori--Hit is the same as that of the input DAG.
This suggests a scheduling strategy similar to that of list scheduling: we start with the first
basic block, and start scheduling instructions in it, in a cycle-by-cycle fashion. When the block
becomes "full", i.e., when we decide not to schedule any more instructions in it, we start scheduling
instructions in its successor basic blocks.
This set of blocks in which we are scheduling instructions at any one time is called afroiitier,
and the basic blocks in it are known as fro,ztier blocks. The frontier blocks always constitute
a node-cut of the DAG. Initially, the frontier consists of the entry nodes in the DAG. When all
predecessors of a block become "full", they are deleted from the frontier, and the block is added to
the frontier. Thus, at any point the status of the DAG is as shown in Fig. 5.8--Hthe frontier blocks
70
Scheduled
Frontier
(Partly Scheduled)
Unscheduled
Figure 5.8: Frontier
are partially scheduled, the blocks above the frontier have been completely scheduled, while the
blocks below the frontier are unscheduled.
The top-level structure of whole-DAG scheduling is summarized in Fig. 5.9. As should
be evident, this algorithin schedules instructions cycle-by-cycle. In each cycle, it schedules
instructions in each frontier block. The instructions to be scheduled in a frontier block are chosen
from a ready list unique to that block. Thus, conceptually, there is a ready list per frontier block.
Note that the ready list may contain instructions from descendant basic blocks; further, such an
instruction could belong to several ready lists. If an instruction from a descendant basic block is
chosen, it must be moved into the current basic block. This may involve replicating the instruction,
or conversion into a speculative instruction, or both. Note also that the ready list will never contain
branches; they are handled separately.
5.2.2 Readiness and Ready Lists
Clearly, the ready lists used in whole-DAG scheduling differ from those used in list-scheduling in
several ways. To reconcile these differences, we must answer the following questions:
o+ When is an instructionready? An instruction's inputsmay be available on some path through
the graph, but not all.
o+ How are the ready lists for each frontier blocks maintained?
o+ How can we identify and use instructions that are ready only on some paths through the
graph?
We shall consider these, and other questions, in this section.
71
frontier = entry blocks of DAG
cycle = 1
while( frontier is not empty
for( each block in frontier
if( block is not full
for( i = 0 to issue width of processor
from ready list for block
select an instruction
move selected instruction into block
schedule selected instruction
add/delete frontier blocks
cycle = cycle + 1
Multi-Path Ready
Figure 5.9: Whole-DAG Scheduling: Top-Level Structure
Consider Fig. 5.10. If program execution was such that control entered the basic block on the right
side of the join, then the multiply instruction that defines r2 would, presumably, be executed at
cycle 3. Assuming that the multiply has a latency of 12, any instruction that uses r2 cannot be
scheduled till cycle 15. Thus, (3) will not be issued till cycle 15. However, if control entered
from the left side, (3) could be issued at cycle 6. So, when should we consider (3) ready? We
can pick cycle 6, cycle 15, or some cycle in between.
As was seen earlier, we use readiness as a way of determining when an instruction can be
scheduled so that it would not cause any stalls. While considering the instruction ready prior to
5->6 (l)r2:=rl+3
(2) r2 := ri * 9
MM
(3)... :=r2+5
Figure 5.10: Computing Ready
3-> 15
72
(a)
Figure 3.11: Controlling Code Motion
cycle 13 might not cause a stall if control entered from the left side of the join, it would certainly
stall the processor when control enters from the right side. Clearly, we must choose the maximum
value--Hin this case, cycle 13.
Between cycle 6 and cycle 13, the instruction is paflially ready, i.e. ready on at least one
control-flow path. We can take advantage of such partial readiness. Details of this are deferred till
a later section.
The Ready List & Code Motion
The ready list of a frontier block, as mentioned earlier, can contain ready instructions from different
basic blocks. In the most general case, the ready list will contain all ready instructions from all
basic blocks that are descendants of the frontier block2. However, there is no reason to choose all
descendants.
By restricting the descendants from which we choose ready instructions we can control the
kinds and amount of code-motions allowed. For instance, if we choose to select a block that is
connected to the frontier block via a join, then we may select an instruction from that block for
scheduling. In that case, the instruction would have to be moved up past the join, as shown in
Fig. 3.11(a). Thus, by choosing a descendant that is below a join, we have implicitly allowed
code motion up past joins. Similarly, if we select a block below a fork, as shown in Fig. 3.11(b),
we implicitly allow speculation. If we select blocks below both sides of a fork, we are allowing
multi-path speculation. It is possible to first select an instruction from the block below one side of
the fork, and then an instruction from the block on the other side of the fork, resulting in speculation
past both sides of a branch.
A systematic policy for choosing descendants of frontier blocks so that their ready instructions
can be considered for the ready list of the frontier block is called a code-motion strategy or, simply,
a strategy. Possible strategies and their consequences on code motion are:
o+ Choose only descendants that are past joins--Hno speculation
2It is incorrect to put an instruction from a non-descendant block into the ready list for tne frontier block. Such an
instruction can never be safely moved into the frontier block. Funher, such a code motion would be of no practical
value.
73
o+ Choose only descendants that along the most probable path to end--Hanalogous to trace
scheduling, results in single path speculation.
The choice ofstrategy shall turn out to be important in determining the performance ofthe scheduled
code.
Maintaining the Ready List
Instead ofmaintaining a ready-list for each frontier block, where some instruction may be present in
multiple ready-lists at any one time, we maintain distributedper-block ready-lists. These per-block
ready-lists can be combined to provide the ready-list for a frontier block.
Every block has an owned-list. Initially, it contains the instructions in the block in the original
DAG. As instructions are moved around in the DAG while scheduling, instructions may be added
to/deleted from the owned-list of a block. The per-block ready-list of a block is simply those
instructions in the owned-list for the block that are ready.
Clearly, the ready-list of a frontier block can be found by picking a set of descendant basic
blocks according to the strategy being used, and then combining the per-block ready-lists for those
blocks.
Partial Readiness
Partially ready instructions can be scheduled in the basic block(s) in which they are ready, but not
in all basic blocks. A simple way of allowing the mechanisms described above to exploit a partially
ready instruction is to move it up past the join. By doing so, the instruction will get replicated.
Some of these copies will become ready, and will be handled appropriately by the scheduling
mechanisms described above.
We still have to identify the partially ready instructions. To do so, we shall make use of late3
and the following observation: if an instruction is dependent on another instruction, then it's late
is less than that of the other instruction's late. Consequently, if an instructions late value is greater
than that of all unscheduled instructions in a (frontier) block, it is not dependent on any of those
instructions, and is ready in that block. Actually, it is a little more complicated than this: there
may be some 1 cycle latency instruction that was scheduled less than 1 cycles ago, on which the
instruction could be dependent. Thus, for accuracy, we have to consider the late values of all
instructions that are unscheduled, or have been not yet "completed".
Based on this observation, we come up with the following technique to handle partial readi-
ness. Basically, for each frontier block we find the maximum late value of the unscheduled and
"incomplete" instructions. We select the minimum of these maximum late values. All instructions
with late values greater than this get moved up past a join (if they are below a join).
3Actually, the late used in whole-DAG scheduling is somewhat different from that used in list-scheduling, and is
described later. However, the observation still holds for the modified late.
5.2.3 Miscellanea
TinnIernent?n?Code Motion
74
Upward code motion of an instruction from a basic block to a frontier block is implemented as
a series of moves along a path joining the frontier block to the basic block. At each step, the
instruction is moved from its current block to the next basic block in the path, till it eventually
reaches the frontier block. In the process, the instruction may be replicated or made speculative.
This process is shown in Fig. 3.12.
Note that by moving this instruction in a step-wise fashion, we may have added redundant
copies--Hi.e. copies of the instruction that recompute the value, even though it is available. After
moving an instruction, we execute a total availability algorithm to eliminate these copies. Note that
the algorithm is fairly restricted--Hit computes availability only for the instruction that was moved,
and only for the region between the frontier block and the original basic block for the instruction.
The result of this computation for our example can be seen in Fig. 5.13
Moving Below Forks
So far we have only considered upward movement of instructions, i.e. speculation and movement
up past joins. We have not considered movement down past switches. Downward movement of
code in whole-DAG scheduling is caused by the scheduling of a branch. When a branch in a basic
block is scheduled, all unscheduled instructions in the blocks "owned-list" are moved down past
the fork, and copied into the block's immediate successors.
We have a choice as to when to schedule a branch. In our algorithm, we have chosen to schedule
a branch as soon as possible, i.e. as soon as it becomes ready. This may cause code expansion,
without necessarily improving performance. If this expansion turns out to be a problem, it may be
desirable to choose a different policy.
5.2.4 Summary
A more detailed break-down of the whole-DAG scheduling algorithm is shown in Fig. 3.14. Note
the use of the following policies:
o+ A frontier block is considered full when all instructions on its "owned-list" have been
scheduled.
o+ A branch at the end of a frontier block is scheduled as soon as it becomes ready.
There are still two policies that have not been discussed. One, the selection heuristic, shall be
discussed in detail in the next section. The other, the choice of code motion strategy, is a subject
of study in the next chapter. As shall be shown, by varying the strategy, we can obtain different
pefformance results.
75
I			_____________________________________________________________________
ri
:trl*
I'
Fisnire 5.12: Imp1emenfln? Code Motion
76
4
Figure 5.13: Eliminating Redundancies
5.3 Selecting Instructions
The selection heuristics we shall use shall be based on a extension of late for DAGs, and on
branch probabilities. We shall first describe how we compute the extended version of late and then
describe the heuristics.
These heuristics are aimed at improving the averageperformance of the DAG--Hi.e., improve
the expected time to execute the DAG, based on the given branch probabilities. This means that, at
times, we shall degrade the performance along a less likely path so as to improve the performance
along a more likely path.
5.3.1 Computing Late
Multi-Path Late & Early
late, in list scheduling, is a measure of the distance to the end of the basic block. Extending this
idea to a path in the CFG is simple: late is the distance to the end of the path. In a more general
DAG, however, there can be multiple paths to the end. One solution to computing late for an
instruction is to enumerate all paths through the DAG containing the instruction, compute the late
for each path and somehow combine them. We could use several methods to combine the late
values; one appealing method is to average the late value for the paths, weighted by the relative
probabilities of each path. This gives greater weight to the more likely late values, but does not
ignore the effect of the late values on less likely paths. However, this is impractical; the number
of paths through a DAG is exponential in the size of the DAG.
We can approximate this by a different approach, shown in Fig. 5.15. We average late values at
each fork, weighted by the branch probabilities at that fork. This averaged late value is then used
77
frontier = entry blocks of DAG
cycle = 1
while( frontier is not empty
handle partial readiness
for( each block in frontier
if( block has ready branch
schedule branch
move instructions below fork
else if( block has unscheduled owned instruction
/* not full */
choose descendants according to strategy
build ready list by combining per-block
ready-lists of descendants
for( i = 0 to issue width of processor
from ready list for block
select an instruction
move selected instruction into block
schedule selected instruction
add/delete frontier blocks
cycle = cycle + 1
Figure 5.14: Whole-DAG Scheduling: Detailed Breakdown
78
10.25
(1)r1 :=...
(4)... :=r2+5			1
2 (2) r2			r1+3			(3) r2			r1*9			13
Figure 5.15: Computing late
Late & Branches
to compute the late value of other instructions. Thus, (1) has a late of 2 on the 25% probability
branch. and a late of 13 on the 75% probability branch. Its late is therefore 10.25.
We compute multi-path early similarly. In this case, we combine at joins, averaged by the
relative probabilities of the blocks preceding the join.
When computing late, instructions that have no other instruction depending on them are given late
value of 0, indicating that they can be scheduled as close to the end as desired. This causes a
problem with branches. A branch has no output; thus no instruction is data dependent on a branch.
Consequently, branches have a late value of 0. and the late for the instructions that compute the
inputs to the branch are also close to 0. This results in these instructions, and therefore the branch,
getting scheduled late, resulting in a sub-optimal schedule.
To improve the quality of the schedule, it was found necessary to give more importance to
scheduling branches, by initializing their late value to something other than 0. The way we
compute this initial value is to compute the early for all instructions in the graph. The late values
of branches is set to the difference between the largest early in the DAG and the early for the
branch.
5.3.2 Selection Heuristics
The selection heuristics determine, in the case where two instructions are competing to be scheduled,
which ofthe two instructions should be preferred. These heuristics can be broken into three cases:
1. The two instructions are in the same basic block.
79
Figure 5.16: Selection Heuristic--HCase I
J(4O)
Figure 5.17: Selection Heuristic--HCase II
2. There is a path containing both instructions.
3. There is no path containing both instructions.
Case I: Same Basic Block
Ifthe two instructions come from the same basic block, as shown in Fig. 5.16, it suffices to compare
their late values. Of the two instructions, prefer the one with the larger late. In the example, the
numbers in parentheses indicate the late for that instruction. Thus, we would prefer J over I.
Case II: Common Path
When the two instructions have a common path between them, but come from different basic
blocks, as shown in Fig. 5.17, it is not enough to compare the late values; the relative probabilities
of execution of the two instructions must be taken into account. Consider the case shown in the
example; even though J has a larger late, it is only executed 1/100 as often as 1 is. Thus, 99/100
of the time, scheduling J instead of I would be useless. So, instead of comparing just the late
values, we compare:
late x probabil?ty of cx ecut?on
In the example, if probability of I being executed is 0.2, then the probability of J being executed
is 0.002. The products of late and probability are 4.0 and 0.08 respectively, indicating that I is to
be preferred.
80
M 80
Figure 5.18: Selection Heuristic--HCase III
Case Ill: No Common Path
If there is no path through both instructions, as shown in Fig. 5.18, the situation becomes even
more complicated. In fact, we have to go back to the original idea of slackness to come up with a
heuristic for selecting between instructions. The reason should be apparent from the example. J
has a larger late than I. However, that block also contains an unscheduled instruction, M with late
80. Thus, J is not on the critical path; indeed it has a slack of at least 40. I, on the other hand, has
a slack of 0. Clearly, I is more "critical", and should be preferred over J.
The reason that late cannot be used to estimate slack is that there are multiple paths to end,
one through the block containing I and the other through the block containing J, and these paths
have different critical path lengths. To estimate the critical path lengths, we examine instructions
in the respective blocks, and identify the instruction with the largest late. We use this value as an
estimate the critical path length. Of course, we must also account for probability. Thus the metric
which we use when selecting an instruction is:
(late --H max late) x probability of execution
We pick the instruction with the largest such value. Note that this value will be non-positive.
In the example, J has value (40 --H 80) x 0.5 = --H20, and 1 has value (20 --H 20) x 0.5 = 0.
Thus, we will prefer to schedule I over J.
5.4 Comparison with Prior Work
Our approach improves over prior global scheduling algorithms in several ways. The earliest such
global algorithms, such as trace scheduling [15] and the boosting optimizer [57], scheduled only a
single path at a time. This could result in the performance of that path being optimized at the cost
of the performance of the DAG as a whole.
Another difference between our approach and other approaches is in the introduction of code
motion strategies. In previously proposed algorithms, either all ready instructions were considered
for scheduling, as in selective scheduling [44], or some fairly restricted region was considered,
such as speculation past exactly one fork [6]. They did not have the ability to vary the amount
of speculation possible. It shall be seen in the next chapter that the flexibility to choose different
strategies is important, and that the best strategy to choose is not obvious.
81
The most prominent difference between our work and prior work lies in our selection heuristics.
?T??pical heuristics [44,6J tend to schedule all non-speculative instructions first. Only after that do
they consider scheduling a speculative instruction. Between two speculative instructions, the more
probable instruction is selected. late for an instruction was computed is the end of its basic block.
It is used only to break ties between instructions from the same basic block.
We, on the other hand, make our selections based on both a global late and probability. No
one factor completes dominates the selection process. Thus, our approach can result in situations
where a speculative instruction is scheduled in preference to a non-speculative instruction. As shall
be shown in the next chapter, factors like this make our heuristic superior to the one described in
the previous paragraph.
5.5 Summary
We have presented a scheduling algorithm known as whole-DAG scheduling that schedules an
acyclic region of code, i.e. a DAG. It is derived from list-scheduling, and uses many of the same
concepts and heuristics, extended to DAGs. The extensions make fairly extensive use ofprobability
information to generate a schedule that optimizes the expected execution time of the DAG.
Chapter 6
Results
In this chapter, we will describe the results of various experiments we performed to measure
the potential benefits of static speculation, and to gain some insight into appropriate methods of
scheduling for static speculation.
We start off in Section 6.1 by describing the foundations for our experiments--Hthe compiler,
the benchmarks etc. The next three sections describe the inferences drawn from the results of the
experiments. Section 6.2 examines the benefits to be gained from static speculation. Section 6.3
compares several variations in the scheduling algorithm, and shows which are better. This section
also includes more details on the experiments performed. In Section 6.4, we use the performance
information to indicate a good choice for the number of speculative tags an architecture should
have. Finally, we summarize our results in Section 6.5.
6.1 Preliminaries
In this section, we describe the foundations on which our experiments are based. This includes a
description of the compiler on which we implemented our scheduler, the benchmark programs we
measured and the methods we used for measuring and reporting performance.
6.1.1 The Pidgin Compiler
Our scheduling algorithrn is integrated into the Pidgin compiler, which was implemented as part
of the Typhoon project. The compiler is designed to be re-targetable, so that one should be able
to change either the language being compiled or the machine being targeted fairly simply. This is
achieved by dividing the compiler into the following 3 modules:
Front-end Reads the high-level language input program and translates it into a compiler interme-
diate representation.
Optimizer Reads the compiler intermediate representation of a program and performs various
optimizations on it. It outputs the optimized version ofthe program in a compiler intermediate
representation, which may be the same as the input representation.
82
83
Front-end
Optimizer
Back-end
* Fortran to TIL			Mi			*TILtoC
* Constant Propagation
* Load Elimination
* Common Subexpression
Elimination
Figure 6.1: The Pidgin Compiler
Front-end			Optimizer Instrument			Back-end
IJ
Figure 6.2: Instrumentation
Back-end Reads the output of the optimization phase and translates it into a form suitable for
execution on the target architecture.
In theory, all that needs to be done to compile a new language is to write a new front-end for that
language. similarly, to compile all the existing languages that are supported for a new architecture,
it should be sufficient to write a new back-end.
The modules we shall be using are shown in Fig. 6.1. As shown, the front-end translates
FORmAN to ?L, our compiler intermediate form. The optimizer reads the ?L representation of
the program, performs the standard set of optimizations [1] on it, and outputs the optimized version
in ?L. The back-end, instead ofproducing machine code for a particular architecture, converts the
?LimoCcode.
For our purposes, we modify the compiler in two ways. First, as shown in Fig. 6.2, we add an
instrumentation module. This alters the C code produced by the back-end so that, on execution, it
will collect various statistics, such as branch probability information. Second, as shown in Fig. 6.3,
we add a scheduler that extracts and schedules DAGs from the optimized code. This ?L code is
then directly simulated.
84
I			I
I			I
Simulate
Front-end			Optirnizer			Schedule			:
I			I
 J			I			J
6.1.2 Measuring DAGs
Figure 6.3: Scheduling and Simulation
Our work deals with the impact of static speculation on DAGs. So, we need to focus on the DAG
code in programs. To this end, we selectively extract DAGs from the programs. Our criteria for
selecting DAGs are:
1 - It must be a single-entry, single-exit region [27].
2. It must not be a basic block, i.e. must contain at least one fork and one join.
3. There must be less than 1000 distinct control flow paths through it.
Criteria 1 is a consequence of our representation of control structure, which makes it easy to
isolate DAGs that are single-entry single-exit regions. Criteria 2 filters out those DAGs that could
not possibly benefit from the introduction of speculation. Criteria 3 is necessary to control the
amount of time required to evaluate the performance of the DAG. It is not very restrictive; in
particular, over the set of benchmarks we used, it caused the rejection of exactly 2 DAGs.
After isolating the DAGs in a program, we schedule them, and simulate the performance of the
scheduled code. While simulating a DAG, we separately measure the time taken to execute each
control flow path through it. We then compute a weighted average of the execution times for the
paths, where the weight is the path probability'. This number gives the average execution time for
the DAG.
To illustrate this process, consider the DAG shown in Fig. 6.4. There are two control-flow
paths through the DAG, ABD and ACD, with execution times of 15 and 5 cycles respectively.
From the branch probabilities, it follows that the two paths are executed 20% and 80% of the time
respectively. Thus, the average execution time for the DAG is 15 * .2 + 5 * .8 = 7 cycles.
We can obtain a lower-bound on the average execution time using a similar approach. Basically,
we measure the critical path through each path in the DAG, ignoring the effect of branches. We
then average the critical path lengths weighted by the path probabilities. This gives us the optimal
average time to execute the DAG, i.e., the average time that it would take to execute the DAG on
a machine with an infinite amount of resources which performs an infinite amount of speculation.
`Perhaps it would be more appropriate to use the term path frequency, since the path "probability" is really the ratio
of times path is executed to toLal number of times DAG is executed.
85
A
15cycles
D
Scycles
Figure 6A: Computing Average Execution Time
6.1.3 Benchmarks
The DAGs we measure are obtained from standard FORTRAN benchmark programs, including the
linpack benchmark and programs from SPEC-89 and Perfect Club benchmark suites. A summary
of the programs and the DAGs extracted from them is given in Table 6.1.
We chose the 4 out ofthe 13 programs from the Perfect Club benchmarks: APS, LGS, TFS and
TIS. Of the 6 SPEC-89 FORTRAN benchmarks, 013. spice2g6 was not considered because
its size prevented our compiler from processing it. 030. mat rix300 was rejected because it is
basically a simple, triply-nested loop, and would therefore would yield no insight into the use of
static speculation for DAGs.
We isolated a total of 234 DAGs from the 9 programs. Interestingly, only 183 of these are ever
exercised when the programs are run using the standard input data provided with the benchmark.
Control never enters the other 51 DAGs.
6.2 The Benefits of Speculation
In this section, we report the benefits to be had from using static speculation to improve the
performance ofDAGs. This includes measurements ofthe importance ofDAGs inboth FORTRAN
and C programs, as well as estimates of the speedups to be obtained by using static speculation
with our scheduling algorithm.
6.2.1 The Importance of DAGs
FORTRAN programs are notorious for spending most of their time executing inner-loop bodies.
Worse yet, most of these loops are simple basic blocks. So, it would seem that a techmque that
improves the performance ofDAGs would have little impact on the performance of the program as
a whole. This appears to be borne out by the figures in Table 6.2, which measures the fraction ofthe
86
Table 6.1: Input Programs & DAGs
Program			Size			Total			Executed
DAGs			Insns			DAGs			Insns
APS			6105			86			9899			64			8081
LGS			2389			24			2118			20			1947
ThS			1986			30			4517			26			2509
485			5			110			2			65
0l5.doduc			5336			64			9809			51			8403
020.nasa7			1108			4			117			4			117
042.fpppp			2717			13			771			11			676
047.tomcatv			195			1			48			1			48
linpack			793			7			312			4			222
total			21114			234			27701			183			22068
dynamically executed instructions that are in DAGs. Overall this number is 4% of all instructions
executed. As we shall show in the next section, adding speculation improves the performance of
DAGs by 13%-17%. At a first glance, it appears that the overall improvement in performance
obtained by the addition of speculation in DAGs will be minor.
However, on closer examination, it becomes apparent that the data is badly skewed by
020. nasa7 and TFS. If we ignore them, 15.5% of the instructions executed by the remain-
ing 7 programs are in DAGs. Thus, it is possible that static speculation can have a significant
impact on some programs.
As we can see, from Table 6.3, which summarizes the improvement in overall performance
for each function, static speculation does have a significant impact on some programs. Thre
benchmarks, i.e., LGS, 015. doduc, and 047. tomcatv show a substantial improvement in
performance. The overall improvement, averaged over all programs, is only around 1%. However,
ifwe again ignore the effects of TES and 020. na s a7, the average improvement for the remaining
7 programs is around 4%.
6.2.2 Non-numerical Programs
Our experiments have been done using numerical FORTAAN programs. It is believed that non-
numerical C programs would have more DAGs than the programs we have considered, and thus
static speculation would be more beneficial to such codes. Unfortunately, we did not have access
to a C compiler. However, there exists an incomplete C front-end to the Pidgin compiler. By
using it, we were able to estimate the fraction of the dynamic "instructions" executed that occur in
DAGsfor3ofthe4SPEC-89Cbenchmarks. We couldnotprocess 001 .gccl .35, sinceituses
set jmp () / long jmp () . The numbers for the other 3 are summarized in Table 6A.
Note that the "instructions" measured here are intermediate between a statement and a RISC
instruction. In particular, they have not been converted to a form suitable for execution on a
87
Table 6.2: Dynamic Execution Frequency
Program			Total Insns			DAG Insns			%age
(in millions)			(in millions)
APS			5905.2			701.7			11.9%
LOS			2482A			527.8			21.3%
TFS			%73.2			16.5			0.2%
2.0			1.6			80.0%
015.doduc			1290.8			474.8			36.8%
020.nasa7			24302.1			0.0			0.0%
042.?ppp			143.0			3.0			2.1%
047.tomcatv			2273.4			195.5			8.6%
linpack			242.4			13.4			5.5%
total			46214.3			1934.3			4.2%
Table 6.3: Overall Improvement
Program			%age Improvement			%age Improvement
(4 issue)a			(8 issue)b
APS			0.3			1.1
LOS			12.4			12.6
TFS			0			0
0.2			0.3
015.do:luc			4.5			7A
020.nasa7			0			0
042.?ppp			0.6			0.7
047.tomcatv			4.3			4.9
linpack			0.5			0.8
1.0			1.3
aprofiled, best-5 over none
hprofiled, best-7 over none
cw?jgh?e(j by total instructions (see Table 6.2)
88
Table 6A: Dynamic Frequency of DAGs in C Code
Program			Number			DAG "Insns"			Total "Insns"			%age
of Dags			(in millions)			(in millions)
008.espresso			240			343.1			1332.2			25.8
022.li			55			1155.3			5348.8			21.6
023.eqntott			30			498.8			819.9			60.8
total			325			1997.2			7500.9			26.2
load-store architecture. They are in a form akin to CISC instructions. However, the numbers
still suggest that C programs spend far more of their time executing instructions from DAGs--Han
average of 26.2%, or around double that for the top 7 FORThAN programs.
From the numbers shown above and to be shown later, it appears that static speculation can
improve the performance of DAG code by around 20%. Assuming that DAGs constitute 25% of
all code executed by C programs, and assuming that this ratio holds good for C programs, static
speculation could improve performance by around 5%.
Interestingly, (6J actually measured an improvement of 4.9% for 0.22li by adding static spec-
ulation to the RS/6000, and scheduling for it. The other programs showed little (less than 0.5%)
improvement. Note that their scheduling technique constrains speculation to movement past at
most one branch, and that the RS/6000 would, at best, correspond to a 2-issue architecture. clearly,
a technique that performed more speculation targeted for an architecture with more resourceswould
show larger improvements.
6.2.3 Dynamic Speculation
Our measurements suggest that the introduction of static speculation can produce modest improve-
ments in overall performance, of the order of a few percents. Studies of dynamic speculation,
however, appear to indicate that adding dynamic speculation can improve performance by an order
ofmagnitude more. For instance, [29] shows an improvement 36% by adding dynamic speculation.
The reason for this discrepancy is simple: dynamic speculation allows instructions to be
speculated past loop back?dges as well as past branches in acyclic code. When instructions
are speculated past a loop backedge, in effect several iterations of the loop are being executed
concurrently. Thus, dynamic speculation improves the performance of loops as well as DAGs.
Similar performance improvements for loops can be obtained without speculation, by using a
compilation technique known as software pipelining [24,51]. Software pipelining schedules the
loop body so that instructions from different iterations of the loop are being executed concurrently,
thereby increasing the utilization of resources and improving performance. Various studies of
software pipelining [24,34] suggest that it can schedule 90%+ of loops so that they execute
optimally.
Dynamic speculation studies typically do not use code containing loops that have been software
pipelined, and thus the performance gains reported show additional improvements obtained through
89
the overlapping loop iterations. It would be interesting to see the performance improvementyielded
by dynamic speculation on codes that have been software pipelined. It is our hypothesis that the
improvements on such codes will be similar to the improvements we have shown.
6.3 The Scheduling Algorithm
In this section, we measure the performance ofwhole-DAG scheduling, and validate certain choices
made by the algorithm. In particular, we focus on the choice ofan appropriate basic-block selection
strategy and the tie-breaking instrution selection heuristic. We will compare the consequences of
using our techniques with the results obtained by prior work. We also measure the code expansion
and register pressure caused by our scheduling algorithm.
6.3.1 The Variables
Most of the global DAG scheduling algorithms are fairly similar in that they are derived from
list-scheduling. However, they differ in two major areas:
Basic-block strategy: When attempting to schedule instructions into a basic-block, the schedul-
ing algoritlim can choose instructions for the ready list from different basic blocks. The
number of such basic blocks chosen and the rules for choosing them, i.e. the strategy, varies
among aIgorithms.
Selection heuristic: When there are two or more instructions in the ready list competing to be
scheduled, the algorithm must use some heuristic to break the tie. This heuristic is another
point of difference among algorithms.
The basic-block strategies used can be divided up into the following:
Single-Path: Typically, these strategies make use of branch probability information. When
scheduling instructions in a block, only instructions in descendant blocks that are on the most
probable path to the end of the DAG are considered.
Unlimited: Here, when scheduling instructions in a block, instructions from all descendant basic
blocks are considered.
Limited: Instead of examining all descendant basic blocks, only instructions from some subset
are considered. For instance, [6] restrict the instructions to be those that are:
o+ In a basic block which is control dependence equivalent to the basic block in which
instructions are being scheduled, or,
o+ Joined to it by at most one branch.
The selection heuristics proposed include:
Late: Use the distance to the end of the DAG, i.e. similar to our LATE.
90
Table 6.5: Scheduling Algoritlirns
Algorithm			Strategy			Heuristic
Trace Scheduling			Single-Path			Late
Boosting			Single-Path			Late
Superblock Scheduling			Single-Path			Late
Global Scheduling			Limited			DOS
SRDAG Scheduling			Unlimiteda			Late
Trace-2 Scheduling			Unlimited			Speculative Yield
Selective Scheduling			Unlimited			DOS
Whole-DAG Scheduling			Variable			Full
aCould also be called Hinited
Speculative Yield: Proposed in [l6j, this uses Late * probability as the tie breaking metric.
Depth of Speculation: Also known as DOS. In this heuristic, distance to end of block is used to
break ties if the two instructions being compared are from the same block. Otherwise, the
more probable of the two is picked.
Full: Our novel selection heuristic, as described in Chapter 5.
Table 6.5 summarizes (our interpretation of) the basic-block selection strategy and the instruc-
tion selection heuristic adopted by the various scheduling algorithms introduced in Chapter 2,
including those used by our algorithin.
Note that we have, as yet, not picked a particular basic-block selection strategy to pick. In
the Section 6.3.3, we shall investigate the behavior of various strategies under different scenarios.
Further, in Section 6.3.4, we shall compare our instruction selection heuristic with those adopted
by previous proposals.
6.3.2 The Scenarios
Target Machine Model
We are compiling for a load-store architecture. All operations are assumed to be fully-pipelined,
with an issue latency of 1. The result latencies of those instructions with latencies greater than 1
are listed in Table 6.6.
We shall not be varying the latencies of the operations during the course of our experiments.
Instead, we vary the number of functional units and issue-widths. We shall consider 2 machines,
which are capable of any 4 and any 8 instructions per cycle respectively. These counts do not
include the number of branches; the machines are assumed to be capable of executing any number
of branches concurrently.
91
Table 6.6: Result Latencies
Instruction			Latency
load			2
fp-add, f?-sub			4
fp-mul			5
int-mul			6
fp-div			11
int-div			12
Branch Probabilities
Branch probability information is used both while scheduling and for simulation. One way of
obtaining these probabilities is to instrument each benchmark program so that it will record the
number of times each branch direction is taken, and then execute the program. We derive the
branch probabilities from the measured branch direction frequencies. We shall refer to these
branch probabilities as profited probabilities.
The branches in numerical FORTRAN benchmarks are notorious for being skewed, i.e. for
taking one direction almost exclusively. Actual measurements, shown in Fig. 6.5, confirm this
belief. Almost 80% of the branches choose one target over the other at least 90% of the time.
However, some programs, particularily non-numerical codes, are believed to exhibit less skewed
branching behaviour. We would, therefore, like to investigate the behavior of various strategies on
code with more balanced branches. To this end, we consider an alternate set of branch probabilities
in which all branches are perfectly balanced, i.e., a branch will pick each of its targets equally
often. We call these branch probabilities balanced probabilites.
As mentioned earlier, we had isolated 234 DA(3s from the benchmarks. Of these only 183
were ever executed. Clearly, the profiled probabilities make sense only when for these 183 DAGs.
Thus, when considering profiled probabilities, we shall restrict ourselves to the 183 DAGs which
are executed at least once. When considering balanced probabilities, we shall consider all 234
DAGs.
Strategies
Our scheduling algorithm is flexible in the basic block scheduling strategy we can use. Instead of
choosing one strategy for selecting basic blocks from which to move instructions into a frontier
block, we shall investigate the effects of different strategies. In particular, we shall look at the
following 9 strategies:
best-N choose N most-probable (connected) descendant basic-blocks, AT between 2 and 8.
rnpp choose descendant basic-blocks that are on most-probable path to end.
92
223
0
0
51
28
18
15
14			12
6
50-55 55-60 60-65 65-70 70-75 75-80 80-85 85-90 90-95 95-100 100
%age Probability of Most Likely Direction
6
6
Figure 6.5: Distribution of Branch Probabilties
93
none choose only descendant basic-blocks that are not past a fork, i.e. those that are connad to
the frontier block through a sequence ofjoins.
It is interesting to see how the strategies we have picked compare with those used by j,Wious
work. Clearly, mpp captures the behavior of single-path fairly well. Both of them will allco+?ode
motion past at most one side of a branch, the side being determined by the probability. Si*ily,
of the strategies described above, best-8 is the closest to the behavior of unlimited strate?. It
could be that a more accurate picture of the behavior of unlimited strategies could be fo-I by
extrapolating the behaviour of best-9 and upwards from the behaviour of best-2 through kt-8;
however, it turns out not to be necessary.
The strategy used in the limited case, in particular Global Scheduling, resembles best-3?w-
ever, there is one difference2: Global Scheduling will pick both blocks below a branch. 1?the
balanced probability scenarios, so will best-3 (or best-2). However, in the case of skewed?ba
bilities, best-3 may pick only one of the blocks below a branch, if the descendants of that bl& are
more probable to be executed.
Summary
There are 36 different scenarios for which each DAG is scheduled and simulated. These co?ond
to the cross-product of the 2 machine models, the 2 branch probability models, and the 9 sti??jes
considered.
The DAGs are scheduled assuming no constraints on speculation. In particular, thisaans
that all instructions, including stores, can be speculated. Further, instructions speculano? not
constrained by the amount ofresources required to represent the speculation. Thus, the re? are
independent of any particular method ofimplementing speculation.
We have already described how we determine the average actual and optimal executionimes
for a DAGs. The number we report for each scenario is the sum of these average executioflimes
for over all the DAGs being considered. Note that we could have chosen to weight these ? the
relative importance (i.e. execution frequency) of the DAG. This does not really make senso+then
using our assumed balanced probabilities. Further, since we are trying to compare the beha'i?ir of
the scheduling algorithins, rather than measure the gain in performance, we have chosen toiport
the unweighted sums.
6.3.3 Evaluatirig Strategies
Optimaj Performance
The optimal performance will be different for the two branch probabilities, both becattt?ey
have different numbers of basic-blocks, and because the optimal performance of a DAG is ?th-
probability weighted average for the optimal performance for each path in the DAG. The -1 of
the optimal performance numbers for the DAGs being considered are:
2Global Scheduling also picks basic-blocks with equivalent control dependence, while we, in general, doramis
prevents a direct application ofthe numbers found by our algorithm to Global Scheduling.
94
Table 6.7: Actual Performance: Balanced Probabilities
Strategy			4 Issue			8 Issue
Cycles %age over Optimal Cycles %age over Optimal
best2			10799			26.3%			10035			17.6%
best 3			10633			24.6%			9833			13.0%
best 4			10599			23.9%			9776			14.3%
best5			10333			23.2%			9671			13.1%
best6			10589			23.8%			9651			12.8%
best7			10583			23.7%			9636			12.7%
best 8			10687			25.0%			9617			12A%
mpp			11258			31.6%			10549			23.3%
none			11951			39.7%			11475			34.2%
Table 6.8: Actual Performance: Profiled Probabilities
Strategy			4 Issue			8 Issue
Cycles %age over Optimal Cycles %age over Optimal
best 2			8664			28.7%			8037			19.4%
best 3			8663			28.7%			7936			17.9%
best 4			8659			28.6%			7889			17.2%
best 5			8641			28A%			7836			16A%
best 6			8679			28.9%			7837			l6A%
best 7			8712			29.4%			7777			15.5%
best8			8681			29.0%			7780			15.6%
mpp			8599			27.8%			8098			20.3%
none			9555			42.0%			9190			36.5%
Profiled Probabilities: 6731 cycles.
Balanced Probabilities: 8553 cycles.
Actual Performance
The actual performance numbers for the various scenarios are shown in Table 6.7 and Table 6.8.
Interestingly, even without speculation, we can achieve performance within 35%-42% of
optimal. This means that the best we can hope to do is improve the performance of the DAG
portion of the code by about 40%; in practice we can improve the DAG performance by an about
16%. This means that for code which spends, say 25%, of its total cycles executing DAGs, adding
speculation will yeild an average performance improvement of around 4%.
Not surprisingly, as we increase the issue-width of the processor, the performance goes up.
95
Note that the gap between the performance of speculation and no speculation also increases, by
about 6% of the --timal DAC1 performance. Clearly, speculation can make better use of av?able
up
processor resources.
As we increase the number of blocks considered in the best-N strategy, the performance of
the scheduled code begins to level-off and then decreases. For the 4-issue scenarios, performance
bottoms out at best-5, while for the 8-issue, unbalanced scenarios, there is a similar peak 8' best-
7. This occurs because the scheduler is finding and scheduling more speculative operafloiB than
the hardware can actually usefully exploit. These numbers suggest that considering an unlimited
number of blocks does not necessarily improve performance; a more restricted approach is more
beneficial.
Multi-path speculation strategies out-performmpp in all cases but the one where the issuewidth
is 4 and the branch probabilities are skewed. Even in this case the difference is marginal--H().6%.
This is the best scenario for the mpp strategy. Because the issue width is low, the processor cannot
exploit the additional parallelism to be found by speculating down the less probable paths. F?ther,
since the branches are skewed, there is less benefit to be had even when it is possible to expl?t this
parallelism. If either of these assumptions is changed, multi-path strategies out-perform mpp. In
the most extreem case, when neither assumption holds, it is possible for a multi-path strategy to be
11% closer to optimal than mpp.
Ideally, if we are constrained to pick one strategy, the strategy to pick would be some f?rni of
limited, best-N, strategy. A good choice for an 8-issue architecture would be best-7, while for a
4-issue architecture it would be best-5. As we shall see in the Section 6A, issues of implemeinfion
cost may cause us to restrict the amount of speculation possible. In that case, a best-3 sirategy
may be a good choice for the 4-issue architecture, since it yeilds over 99% of the performance of
a best-5 strategy. It may be an equally good strategy to pick for an 8-issue architecture as well,
where it delivers about 97% of the performance of the best-7 strategy.
CollcIusions
The experimental numbers suggest that:
o+ Single-path strategies are appropriate when compiling code with skewed branch probabilities
for machines with a low issue-width. However, on non-skewed code, or when compiling for
architectures with large amounts of resources, another strategy should be picked.
o+ Unlimited strategies are more apporpriate for machines with large amounts of resources.
However, even then they may cause some degradation in performance.
o+ Limited strategies appear to do reasonably well. In particular, best-3, while not optimal, is
always within 3% of the optimal.
The optimal strategy to pick appears to be intermediate between limited and unlimite? The
appropriate strategy is influenced both by the target architecture and the kind of code (more
specifically, its branch behavior) being compiled.
96
6.3.4 Evaluating Heuristics
The Variables
We have used a selection heuristic more complex than those reported in earlier work. To actually
justify this complexity, we shall compare it against the following other heuristics:
LATE Prefers instructions with larger LATE values.
Weighted LATE Uses the product of LATE and probability of execution for an instruction as the
selection heuristic. Basically, our heuristic without Case III.
Probability Compares instructions using LATE if they are from the same basic block, otherwise
chooses the instruction with the greatest probability of execution.
There are close correspondences between these heuristics, and those proposed earlier in the
literature. In particular:
LATE resembles the Late heuristic used by trace-based schedulers. However, the distance
to end they use is an unweighted distance to end along the most-probable path, while we use
a weighted distance along all paths.
weighted LATE is similar to speculative yield, in that it uses the product of probability and
distance to end. Again, the distance to end that they propose to use is not the weighted
distance to end we use; instead they use the maximum distance to end.
. probability resembles the DOS heuristic. There are two sources of difference:
- Instead ofusing probability information directly, they estimate it by counting the number
of branches between the instruction and the frontier block.
- They use either a global maximum distance to end, or use the distance to the end of the
block to break ties.
Performance
The results are summarized in Table 6.9. These results have been computed for 4-way issue using
balanced probabilities. The percentage is the improvement in performance as a percentage of
optrmal performance that is gained by using our full heuristic.
Note that our heuristic tends to outperform the others in most cases. The exception occurs in the
mpp strategy. Here, the probability heuristic tends to perform a little better than ours. However,
to obtain best performance, it is still necessary to use our heuristic.
6.3.5 Other Concerns
There are two major concerns about the code produced by a scheduler:
o+ By how much does it expand the program?
o+ How many registers does it require?
97
Table 6.9: Selection Heuristic Performance
Strategy			Full			late			Weighted late			Probability
Cycles			Cycles			%age			Cycles			%age			Cycles			%age
best2			10799			11613			7.5%			11582			7.3%			11604			7.5%
best3			10633			11129			4.5%			10951			2.8%			10983			3.1%
best4			10399			10982			3.6%			10760			1.5%			10848			2.3%
bestS			10535			10990			4.3%			10771			2.2%			10821			2.7%
best6			10589			10961			3.5%			10712			1.2%			10748			1.5%
best7			10583			11010			4.0%			10745			1.5%			10851			2.5%
best8			10687			11066			3.5%			10801			1.1%			10822			1.3%
mpp			11258			11255			-0.0%			11223			-0.3%			11188			-0.6%
none			11951			11961			0.1%			12002			0A%			12100			1.2%
Code Expansion
Table 6.10 shows the code expansions for the various strategies. As can be seen the code expands
by around 50-70%.
Register Pressure
A more serious concern is register pressure. Fig. 6.6 shows a histogram of the number of DAGs
versus maximum register pressure using the best 5 strategy on a 4 issue implementation, and
Fig. 6.7 shows the same information for the best 8 strategy on an 8 issue implementation. Both
consider schedules for balanced probabilities.
It seems, from Fig. 6.6, that most DAGs can be scheduled using a small number of registers;
in particular, with 40 registers, one can directly use the produced schedule for 90% of the DAGs.
However, this is deceptive. The schedules for the bigger DAGs tend to use more registers. Once
we weight the DAGs with their size, as shown in Fig. 6.8 and Fig. 6.9, the number of DAGs
schedulable at 40 registers with the best 5 strategy on 4 processors drops to
However, perhaps a more important figure of merit is the increasein the register pressure due
to speculation. This is show in Fig. 6.10 and Fig. 6.11 for 4 and 8 issue processors respectively.
The base-line is the maximum register pressure for the schedule produced using the none strategy
for that issue width. Note that even this strategy is fairly aggressive, and may lengthen life-times
more than some other non-speculative schedule; thus, the actual increase in register pressure may
be greater.
As can be seen, in some cases the register pressure can rise dramatically, with 10% of the DAGs
requiring more than 40 additional registers in the 8 issue case. However, the increase in register
pressure is generally more controlled. 80% of the DAGs require less than 20 additional registers
3More accurately, the 60% of all instructions are in a DAG that can be scheduled using less than 40 registers.
98
Table 6.10: Code Expansion
Strategy			Balanced			Profiled
4 issue 8 issue 4 issue 8 issue
best2			1.45			1A3			1.47			1A4
best3			1.54			1.51			1.56			1.53
best4			1.54			1.52			1.56			1.53
bestS			1.58			1.54			1.59			1.56
best6			1.65			1.61			1.66			1.64
best7			1.65			1.62			1.70			1.65
best8			1.71			1.65			1.71			1.68
tree			1.60			1.57			1.61			1.59
mpp			1.50			1A7			1.52			1.47
none			1.44			1.42			1.46			lA3
Co
0
0
40			.			.			. .
20
0
0			20			40			60			80			100			120			140			160			180			200
Live Values
Figure 6.6: DAG vs. Register Pressure: 4 issue, best 5
100
99
Co
0
0)
0,
cg
loo
0
0			20			40			60			80			100			120			140			160			180			200
Live Values
Figure 6.7: DAG vs. Register Pressure: 8 issue, best 8
`00
`Uu
80
a)
CI)
a)
a)
(0
0
20			,--H
4: .::::.I			_
0 20 40 60 80 100 120 140 160
Live Values
180			200
Figure 6.8: Weighted DAG vs. Register Pressure: 4 issue, best 5
101
a)
N
Co.-
a)
a)
Co
a
0
a)
loo
60			-
0
0			20			40			60			80			100			120			140			160			180			200
Live Values
Figure 6.9: Weighted DAG vs. Register Pressure: 8 issue, best 8
102
w
0
0,
loo
0
0			20			40			60			80			100			120			140			160			180			200
Live Values
Figure 6.10: DAG vs. Increase in Register Pressure: 4 issue, best 5 over none
103
Co
0
a)
loo
0
0			20			40			60			80			100			120			140			160			180			200
Live Values
Figure 6.11: DAG vs. Increase in Register Pressure: 8 issue, best 8 over none
104
in the 8 issue case while in a 4 issue schedule, 85% of the DAGs require less than 10 additional
registers.
`This increase in register pressure is not as big a concern it would be on a single-issue imple-
mentation. On such an implementation, every extra load and store operation added to support
spilling can cause performance degradation by displacing an instruction that would otherwise have
been issued in that slot. In a multiple-issue implementation, it is fairly likely that the extra spill
code will fit into existing gaps in the schedule. Thus, it seems fairly likely that a post-pass register
allocator will perform adequately.
6.4 Tag Usage
In the previous sections, we have shown how much performance improvement is to be gained
through the use of static speculation, and how much can be gained by using a particular strategy.
In this section, we shall correlate number of tags required to represent the speculation with the
various strategies for speculation, and thereby provide some insight into the number of tags that
should be provided by an architecture.
6.4.1 The Influence of Strategies
A best-N strategy moves code into a basic block from at most N other basic blocks. Assuming that
one tag is needed to represent the speculation from each block, only N tags will be live within a
basic block. However, there are two reasons why more than N tags can be needed.
First, even though only N tags are required to represent the speculations within a basic block,
there may be a tag live through the basic block. An instance where this can happen is shown in
Fig. 6.12. Assume that while scheduling basic-block A, an instruction was selected from E, and
moved into A. Clearly, the tag used by A to represent the speculation from E cannot be reused till
the branch at the end of E has been executed. Now, assume that while scheduling instructions in
X, E was not picked by the selection strategy. Now, x may need upto N tags to represent its own
speculative instructions; further, it will need an additional tag to represent the speculation from E
into A.
Second, we may need more than one tag to represent the speculation of instructions from a
single basic block. This arises when an instruction is speculated past at least two branches along
different paths, and its inputs are different along the different paths. Such a situation is illustrated
inFig. 6.13.
In practice, neither of these cases arise very often, and consequently, in most cases, a best-N
strategy can be implemented using N or less tags. This is illustrated in Fig. 6.14, which shows the
number oftags required to implement the schedule produced by the best-5 strategy in the balanced,
4-issue scenario. As can be seen exactly 3 DAGs require more than 5 tags.
Further, it is impossible for either of these situations to arise when using either a best-2 or
best-3 strategy with the whole-DAG scheduling algorithm as currently implemented4. Thus, if a
4Ths will be shown in Appendix B.
105
B
E
Th
B
Figure 6.12: Tag Live Through Block
,1
A			AT"
` ,
li
MW
flush 1,4
A
M
check 4,6
Figure 6.13: Two Tags for a Basic-Block
check 1
flush6
106
134
0
40
21
13
2
8
7
9
10			11			12			13
1			2			3			4			5			6
Maxinium Tags Needed
Figure 6.14: Tag Usage: best-5, 4-issue, balanced
107
best-3 strategy is adopted, at most 3 tags will be required to obtain the speedups described in the
previous section.
6.4.2 Architecture
The maximum number oftags available is determined by the architecture. Therefor, the architecture
should be designed with a sufficiently large number of tags. From the numbers described in the
previous section, the optimal strategies seem to be best-5 for a 4-issue architecture and best-7 for a
8-issue architecture. Further, best-5 and best-7 use at most 5 and 7 tags respectively, in about 99%
of the cases. Thus 5 and 7 tags seems to be good numbers.
However, picking the largest number of tags required may be over-kill. Consider using 5 tags
for a 4-issue architecture. Clearly, most instruction have to have extra bits attatched indicating
whether they are speculative, and if so, which tag they use. This information requires adding 3 bits
to each instruction5. If, however, we chose to provide only 3 tags, the overhead would be 2 bits per
instruction. This would force us to use the best-3 strategy. But the difference in DAG performance
between best-3 and the optimal strategy is less than a percent; thus, for resource reasons, it makes
sense to use only 3 tags on a 4-issue architecture.
Thre reasoning is not so clear-cut with an 8-issue architecture. Going from 7 to 3 tags represents
about a 3% loss in DAG performance, which, depending on how important DAGs are overall, may
or may not cause a signfficant drop in overall performance. There is little point in choosing some
number between 3 and 7; they will all add an overhead of 3 bits per instruction.
6.5 Conclusion
In this chapter, we described the results of experiments we conducted to measure the performance
of code scheduling for static speculation. Some of the major conclusions that we derived from the
results we obtained include:
Static Speculation
o+ Static speculation can improve the performance of numerical programs by as much
as 12% by increasing the performance of DAGs; however the performance gain is
generally low, because DAGs are usually an insignificant part of the entire program.
o+ When used with an appropriate scheduling algorithm, static speculation improves the
performance of DAGs by an average of 13%-I7%.
Scheduling Algorithm
o+ Ideally, the scheduling algorithm should move instructions from a limited number of
descendant DAGs, ranging between 3 and 7.
5Actually, the overhead is 1092(6) per instruction. It may be possible to achieve this, but it will complicate decoding.
108
o+ To gain full benefit from static speculation, the scheduling algorithm must move m-
structions past both sides of a branch.
o+ Our slack-based speculation heuristic is an improvement over other proposals; in par-
ticular, it improves DAG performance by about 3%.
Speculative Tags
o+ At lower issue widths, we can use 3 tags and not suffer a significant performance
penalty. This necessitates an overhead of 2 bits per instruction.
o+ At higher issue widths, we may need upto 7 tags. This would require an overhead of 3
bits per instruction.
Chapter 7
Dynamic Speculation
In this chapter, we shall use the perspective gained in implementing static speculation to derive a
dynamic speculation mechanism that also integrates register renaming and precise interrupts, and
can be implemented at low cost. The rest of the chapter is organized as follow. Section 7.1 summa-
rizes the state of the art of dynamic register renaming and speculation. Section 7.2 andSection 7.3
describe our approach to register renaming. Section 7.4 and Section 7.5 specify extensions for
dynamic speculation and precise interrupts. Section 7.6 sketches a design. Section 7.7 discusses
an optimized design that merges register renaming, dynamic speculation and precise interrupt
mechanisms. Section 7.8 elaborates the performance-critical portion of the design, and argues that
it should not impact cycle time. Section 7.9 compares our approach against previous approaches.
Finally, Section 7.10 summarizes the results of the paper.
7.1 Prior work
Tomasulo's Algorithm: The earliest implementation of register renaming was for the floating-
point unit of the IBM 360/91 [63]. The IBM 360 is a CISC architecture, but in the following
description, we will restrict our attention to register-to-register operations.
Every register is augmented by a busy bit. If the bit is off, the register holds the value needed
by any instruction that references it. If the bit is set, the register holds the tag of the instruction that
will produce the need value.
When an instruction enters the decode stage, it is assigned a tag. It reads each source register,
obtaining an input value or a tag, depending on the busy bit. The instruction writes its tag into
its output register and sets that register's busy bit. The instruction and its input values/tags are
buffered in one of several reservatioti statio7is, depending on which functional unit it will execute
At each cycle, a functional unit scans the instructions stored in its reservation station. An
instruction may be ready, i.e., have both input values available, or waiflflg, i.e., one or more of its
inputs is a tag, and the instruction cannot execute until some other instruction produces that input.
The functional unit will select a ready instruction and start executing it.
109
110
When an instruction completes, its result and tag are broadcast to all waiting instructions.
Each waiting instruction compares this tag against the tags it is waiting on. If the tags match, the
instruction overwrites the tag with the result. This may change the instruction's status from waiting
to ready. The result is also sent to the register file. If the tag in the output register is the same as
that of the instruction, the value is written to the register file.
Tomasulo's algorithm must be extended to implement precise interrupts [54]. Johnson [29]
describes one such extension, based on the fi?ti?re file mechanism [54]. The registers described
above correspond to the future file. There is also a duplicate register file called the i,i-order file,
and a reorder buffer. Instructions issue as described, using the future file'. Additionally, results are
written to the reorder buffer. The reorder buffer is a FIFO queue. When an instruction is issued,
it receives a slot at the bottom of the queue. When the instruction completes, its result is written
to the allocated slot. If the instruction at the top of the queue has completed, its result is used to
update the in-order file, and the queue is advanced.
It is easy to see that the reorder buffer updates the in-order file with the results ofthe instructions
in the order in which the instructions were issued. At the time an instruction's results are written to
the in-order file, any exception caused by it is signalled. Therefore, when an instruction causes an
exception, the in-order file contains the register state obtained by executing all instructions prior to
the excepting instruction--Hexactly that required by a precise interrupt.
Johnson's Algorithm: Johnson, in [29], describes another approach to register renaming. This
approach replaces the future-file and reorder buffer of the previous section with an associative
reorder buffe? As in the previous method, results are written first to the reorder buffer, which
releases them to the in-order file in program order. The reorder buffer is also extended in the
manner described below.
When an instruction is allocated a slot in the reorder buffer, its output register name and its tag
are stored in that slot. When the instruction completes, the tag is overwritten with the result value.
To obtain an input value, the in-order file and the reorder buffer are accessed in parallel. If the
reorder buffer contains no entry whose register matches the input register, the value returned by
the in-order file is used. If there is some entry in the reorder buffer with a matching register, the
tag or value of the youngest such entry is returned. This, of course, requires an associative lookup
prioritized by age, using the register name as the key.
Metafiow's DRIS: The Metafiow architecture [49] implements out-of-order execution and dy-
namic speculation as well as register renaming. Most of the mechanisms required are integrated
into the DRIS (deferred-scheduling, register-renaming instruction shelf). Unlike the approaches
described above, but like [12], instructions are not deferred in reservation stations at each execute
unit, but in one central structure. Every cycle, this centralized structure is searched for instructions
which can be executed and which are then dispatched to the appropriate execute unit. Apart from
that, register renaming using the DRIS is very similar in spirit to the previous approach.
1For simplicity, we ignore the changes made in [29] to support speculation.
ill
The DRIS is a queue, much like the reorder buffer. When an instruction is issued, it is allocated
a slot at the bottom of the DRIS. When an instruction completes, its result is written to the allocated
slot. If the top instruction in the DRIS has completed execution, it is popped from the DRIS, and
the register file is updated with the instruction's result.
Every time an instruction is added to the DRIS, the DRIS is searched for instructions whose
result registers match the incoming instruction's source registers. If none is found, the required
value can be read from the register file. If some match is found, then the source register is associated
with the youngest such instruction. When the associated instruction completes, the required value
can be read from that instruction's DRIS entry.
This mechanism differs from the tag-and-result broadcast ofTomasulo`s algorithm, in that only
the tag (i., the DRIS entry) is broadcast. It indicates that the result is available. An instructions
in the DRIS waiting for that result mark it as available, and may become ready. The actual value
is read from the DRIS entry2 when the instruction is dispatched to an execute unit.
RS/6000 Floating Point Renarning: The register renaming mechanism in the floating point
unit of the RSI6O()() [20] is quite different from those discussed above, both in motivation and in
implementation. One of the motives for removing dependences was to enable the fixed-point unit
to perform floating-point loads. Instead of having the fixed-point unit keep track of floating-point
false dependences, the architects chose to rename the destination register of the load to eliminate
me clepenoences.
A? implemented, the floating-point unit has more physical registers than register names. There
is a map-table containing a mapping of register names to physical registers. There is a queue
containing free physical registers (roughly, those that are not mapped). When a floating-point
instruction, other than a load, is decoded, its registers are renamed according to the map. When
a floating-point load is encountered, a free physical register is removed from the queue, and used
as the target of the load. The map is updated to reflect the fact that all subsequent references to
the target register name will use the newly allocated physical register. We will not discuss this
mechanism in more detail since it is subsumed by the mechanism discussed below, and by the one
we are proposmg.
ES/9000: The IBM ES/9OOO [39] extends the renaming mechanism described above to all
registers for most instructions. It also implements dynamic speculation past at most 2 conditional
branches.
As in the RS/6OOO, there are more physical registers than register names, and a map-table
(called the decode-time register assignment list, or DRAL) is used to keep track of the mapping of
logical register names to physical registers. In the decode stage, the source register names in every
instruction are renamed using the map. The instruction's output register is mapped to a new free
physical register, and the map is updated appropriately, displacing the old physical register mapped
by the output register.
2? from the register file, if the associated instruction has completed and heen removed from the DRIS hefore the
incoming instruction is issued.
112
Registers are controlled using information stored in an content-addressable structure called the
ACL (array control list). Whenever all instructions prior to an instruction have completed, the ACL
is associatively searched using the instruction's identifier, and the physical register displaced by
that instruction is freed.
`The value of the map at a branch that is speculated past is saved in structures called the
BRAL (backup register assignment list). On a mispredicted branch, the DRAL is restored to the
appropriate state by copying from the appropriate BRAL. When a branch is mispredicted and all
instructions prior to the branch have completed, the ACL is associatively searched to identify
physical registers allocated to instructions issued after the branch. These registers are then freed.
Finally, the ACL contains information for precise register state recovery. For every physical
register, it keeps track of whether the physical register contains a value that is part of the precise
state. On an interrupt, the precise state is recovered by searching the ACL for entries in the precise
state, and using them to update the map appropriately.
7.2 Our approach
The approach we are proposing implements register renaming in a direct fashion, using a single
storage area. There is no separation of storage locations into an architected register file and some
auxiliary storage. Instead, there is one set ofphysical registers, from which all storage is allocated
and into which all values are written. The architected registers are mapped onto some subset of
the available physical registers. This mapping changes as instructions are issued. The association
of architected register name with physical register is maintained in a mappilig table [31]. The
mapping table has an entry for each register name, which contains the physical register currently
associated with, or mapped by, that register name. There is afree pool of physical registers, used
to provide new physical registers when necessary.
After an instruction is fetched from the I-cache, but before it accesses the physical registers,
the instruction's source register names are mapped to physical registers. This is accomplished by
indexing into the mapping table using the source register name.
The output register name is mapped to some new physical register, allocated from the free
pool. The mapping table is modified to reflect this new association. The old physical register in
the output register name's entry is replaced by the new physical register. The register name is said
to have been remapped to the new physical register. The old physical register is said to have been
unmapped.
The mapping process is illustrated in Fig. 7.1. The processor used in this example is a 4 stage
pipelined processor with 4 register names r1-r4. The implementation uses 8 physical registers
pl-p8. Free contains a physical register from the free pool.
After the first instruction, r1 : r3+r2, is fetched, its source register names are translated using
the map to p3 and p2 respectively. The result of this instruction will be placed in the physical
register p5. The map is updated to reflect the fact that ri is now mapped to p5. The remaining
stages of the pipeline execute the instruction p5 : =p3+p2.
When the next instruction is fetched, it is translated using the modified map. Thus, the source
register r 1 is mapped to p5. The result is to be written to p6, the new physical register, and the
113
Physical
Map			Register
Free
m
L?l
m
1?R
12p
Insn-Fectch
I I
r1:=r3+r2
Mapdtop5
I			I
I			I
I			I
I			I
ri := ri + r4
Maprltop61
I I
I I
I I
I			I
r3 := r2 * r3
Map r3top7
I I
I I
I I
r2 := r3 - ri
Map r2 to 1)8:
I I
I I
I I
I I
I I
I			I
I			I
I I
I I
Op-Fetch
p5:=p3+p2
p6:=p5+p4
p7:=p2*p3
p8 :=p7-p6
IExecute			I
p5:=p3+p2
p6:=p5+p4
Ip7:=p2*p3
Write-Back I
p5:=p3+p2
Write p5
p6:=p5+p4 I
Figure 7.1: Mapping
114
map must be modified to show that ri is now associated with this physical register.
Notice that the mapping table has only to keep around the most recent physical register for
each register name. This information is adequate since instructions are issued to the op-fetch stage
in program order. A source register name, therefore, always refers to the value produced by the
closest, and therefore most recently processed, instruction with the same (output) register name.
Hence, the instruction uses the value stored in the physical register mapped most recently to that
register name--Hwhich is exactly what the mapping table stores.
After the mapping, the rest of the stages in the processor use the physical registers. They never
see the original register names. One way ofviewing this transformation is to look at the mapping as
a translation of instructions from an architecture with a small number (number of register names)
of registers to an architecture with a larger number (number of physical registers) of registers,
performed in a manner that increases instruction-level parallelism.
7.3 Reclaiming physical registers
Since there are only a finite number of physical registers, the processor must start reusing them at
some point. A physical register can be reused when it is guaranteed that the value in it can never
be used by any later instruction. This is true if all of the following hold:
1. The value has been written to the physical register.
2. All issued3 instructions that need the value have read it.
3. The physical register has been unmapped.
The motivation for the first two conditions should be obvious. The third condition is motivated
by the following: as long as the physical register is mapped by some register name, some later
instruction may have that (source) register name, and therefore require that value. However, once
that register name has been remapped (and, consequently, the physical register unmapped), no
instruction can access that value.
The process of reclamation is illustrated by the following code fragment:
(1) ri
(2)			:= ri +			.
(3)			ri
(4)			ri --H			.
When instruction (1) is executed, it is allocated a new physical register, say pil. This is the
register which the result of (1) will be written to. pil cannot be freed until (1) completes. The
next instruction refers to r1, so it will read out of p1i. Obviously, p1i cannot be reused until (2)
has been issued. Now, (3) again uses r 1 as an output. So, it will be allocated a new output register,
say p23. r1 will be mapped to p23, and pli is unmapped. Now, any subsequent reference to
3In this paper, an instruction is said to be issued when it enters the instruction-decode/operand-fetch stage.
115
ri, such as in (4), will use p23. Clearly, no further uses of pi1 can be produced. Sop11 can be
freed once (1) writes to it and (2) reads from it.
These conditions are easily detected using an approach which waits until the physical register
has been written to and has been unmapped, and then waits for all instructions that use the physical
register to complete. A counter per physical register is used to keep track of issued but unexecuted
instructions in the processor pipeline that read that physical register. Since the total number of
issued but unexecuted instructions in the processor is bounded and small, the counter does not have
to be very big. An instruction that uses the value increments the counter as soon as it is fetched,
and decrements the counter when it has read the value. A physical register can be reused when it
has been written to, unmapped and its counter is zero.
To summarize--Hthe processor maintains the following state for each physical register:
Complete Flag: The complete flag is false if the physical register has not been written into. It is
set to false when the physical register is removed from the free pool, and is set to true when
it is written to.
Unmap Flag: The unmap flag is false as long as the physical register is mapped by some register
name. It is set to false when the physical register is first added to the mapping table, and is
set to true when it is unmapped.
Counter: The counter keeps track ofthe number ofinstructions issued that need the value stored (or
to be stored) in the physical register, but which have not yet read it. Every time an instruction
is issued that needs this value, the counter is incremented. Every time an instruction reads
the value, it is decremented.
A physical register can be reclaimed if the complete and unmap flags are true, and the counter
value is zero.
Initially, each architected register name is mapped to the corresponding physical register. These
registers are in state F 0 T. The remaining registers are in the free register pool, with states T 0
Fig. 7.2 illustrates the process of reclaiming physical registers. The processor implementation
has been extended by the addition of the update flag, the counter, and the complete flag for each
physical register. If these fields have the value T 0 T, the register is reclaimable. One of the
reclairned register is inserted in the free slot.
The first instruction unmaps pi. pi is now reclaimable, since it has been written to, and there
are no outstanding reads to it. Register p5 is mapped by ri, soits unmap flag is set to false in cycle
1. The next instruction has r1 as a source register name, so the counter for p5 is incremented. In
cycle 3, the fetched instruction has r1 as an output register. This means that p5 is now unmapped,
and its unmap flag is set to true. Also, the result of the first instruction is forwarded to the second
instruction. Since the second instruction has read the value for p5, it decrements the counter for
p5. Finally, in cycle 4, the first instruction completes, writes its result to the p5, and sets the
complete flag for p5 to true. p5 can now be reclaimed, and is placed in the free slot.
116
Complete
Counter
Unmap
E?zi
w
w
Insn-Fetch
ri ---H r2 + r3
I Mapritop5
Unmappi
Incrp2,p3
r3 := ri xr2
Map r3 to p6
Unmapp3
Incr
p5,p2
I I
ri := r2 - r3
I Map ri to pi I
Unmap p5
Jncrp2,p6
I			I
ri := ri + ri
Mapritop3
Unmap pi
Incr pi, pi
I I
I I
I I
I			I
I			I
Op-Fetch
p5 := p2 + p3
Read p2, p3
Decr p2, p3
Execute			Write-Back
p6:=p5xp2			p5:=p2+p3
Read p5, p2			Forward p5
Decr p5, p2
pi :=p2-p6			p6:=p5xp2
Read p2, p6			Forward p6
Decr p2, p6
p3 := pi + pi			pi:= p2 - p6
p5:=p2+p3
Write p5
Complete p5
p6:=p5xp2
Figure 7.2: Reclaiming Registers
117
7.4 Branches and speculation
Branches are another cause of serialization. There may be an interval between the time a branch
is issued and the time it is resolved, i.e., the branch's direction and target are determined. Halting
instruction issue until a branch is resolved results in a serious performance loss. The alternative,
predicting a target for the branch, and continuing to issue instructions from that target before the
branch is resolved, can cause problems when the prediction proves to be incorrect. For best results,
the processor should be able to deal with several simultaneously unresolved branches. To simplify
the hardware, branches are assumed to be resolved in program order.
An instruction that is issued from below a branch before the branch is resolved is said to have
been speculatively issued. If a speculative instruction causes an exception, then the exception must
not be reported till it is determined whether the predicted target was correct. If it was not correct,
i.e., the branch was mispredicted, the interrupt must be ignored. Otherwise the interrupt must be
reported in a manner consistent with the program order.
Branches are mispredicted fairly often. The mechanism that recovers from a mispredicted
branch must be efficient. Prior work on recovery treats a mispredicted branch as a kind of interrupt,
and modifies the precise-interrupt recovery mechanism to implement mispredicted branches. But
mispredicted branches do not require the full power of the precise-interrupt recovery mechanisms.
Consider the following example:
ri r2 / r3
brgtO r5, r8
Assume that at the time the branch is resolved, the divide has not completed. Precise interrupt
recovery mechanisms would halt further execution till the divide completed. However, it is not
necessary to do so just to recover from a mispredicted branch.
What actions must be taken to recover from a mispredicted branch? Consider the following
example:
(1)
(2)
(3)
LlO:
(4)
ri			r2			+ r3
r3			:= r7			/ r4
brgtO r5,			LlO
ri			:= ri			- r4
ri r8 * ri
Assume that the branch was predicted (incorrectly) to be not taken. Then, to correctly execute
instruction (4) after resolution, source registers r 8 and r1 must contain the values they would
have had, assuming none of the speculative instructions had executed. This means that the value
read from ri must not be the value produced by (3), but the value produced by (1). Further,
all values produced by the speculative instructions should be discarded. Specifically, the value
produced by (3) could be used only by other speculative instructions, and since they, too, are
mispredicted, their reading the value is irrelevant. Lastly, all interrupts produced by speculative
instructions must also be discarded.
118
Modifying our design to satisfy these aims turns out to be fairly simple. Since multiple
occurrences ofthe same output register name are given different physical registers, values stored in
the register are never overwritten until the register is reclaimed and reassigned to another instruction.
Therefore, the design must be modified to ensure that a physical register is not reclaimed before the
branch is resolved. When the branch is discovered to have been mispredicted, the logical register
names must be set to map to the physical registers containing the old values. In our design, this
means that the mapping table must be restored to the state it was in at the time the branch was
issued. These changes are described in more detail below.
The mapping table is small enough that it can be saved each time we speculate past a branch.
If the branch is correctly predicted, the saved map is discarded. If the branch was mispredicted,
the saved map is used to restore the mapping table, and all other saved maps are discarded. The
depth of speculation in the scheme being presented is constrained by the number of maps that can
be saved. There is an alternative scheme, described later, that does not require the map to be saved
at each branch; instead it recovers the map from the information saved for precise interrupts. This
scheme has no a priori limits on the depth of speculation.
In addition, the unmapping rule must be modified. A physical register is now considered
4
unmapped if it is not mapped in the current mapping table, or any of the saved mapping tables
Under the previous definition, a physical register would be considered unmapped (and therefore
reclaimable) if its associated register name was the output register for some speculated instruction.
This case would arise in the example below (assume branch is predicted not taken)
(1)			r5			rlO/r2
(2)			ri			r2 + r3
brgtO			r5, r8
(3)			ri			r3 * r4
Reclaiming the physical register storing the result of (2) would be correct as long as the
branch was correctly predicted. However, if it was mispredicted, the value might be needed by
some instruction on the other path. Therefore, it cannot be reclaimed as long as the branch is
unresolved. The modified unmapping rule (in combination with the fact that saved maps are
discarded as soon as they are resolved) ensures that physical registers are not reclaimed too soon,
and that they can be reclaimed if necessary after the branch is resolved.
As long as branches are predicted correctly, the complete flag and the counters are leftuntouched.
Resetting them at a mispredicted branch is closely tied to the way the pipeline is flushed of the
speculative instructions. If, for instance, all mispredicted speculated instructions in the pipeline
are run to completion, and then discarded, complete flags and counters can be handled unchanged.
An interesting case is when the processor waits for all instructions issued before a mispredicted
branch to complete before starting recovery, and then flushes all instructions in the pipe. At that
point, since all instructions prior to the branch have completed execution, all values must have
been read, so all counters must be set to zero. Further, all instructions prior to the branch have
completed, so that all complete flags can be set to true. The speculative instructions do not matter,
since they are all being discarded.
4There is an equivalent, but more complex definition, that can be implemented more cheaply.
119
Insn-Fetch			Op-Fetch
brgtr3r2
Incrp3,p2
SaveMap
I			I
I			I
I			I
r2 := ri +r3			brgtp3,p2
Map r2 to p5			Read p3, p2
Incr pi, p3			Decr p3, p2
I			I
I			I
r2:=r2-r4			p5:=pl+p3
Map r2 to p6			Read pi, p3
Unmapp5			Decrpl,p3
Incrp5,p4
I			I
I			I
I			I
I			I
I			I
I			I
I			I
I			I
I			I
I			I
I			I
I			I
Execute
w
Ez'
rn
fl T			A
I?I 0 ITI 19? I
n ?
brgt p3, p2
p6:=p5-p4			p5 :=pl+p3
Read p5, p4			Forward p5
Decr p5, p4
Figure 7.3: Correctly Predicted Branch
p6 p5 - p4			p5			pi + p3
Write-Back I
brgtp3,p2
I			Pop Map			I
I			Unmap ?2			I
120
The examples in Fig. 7.3 and Fig. 7.4 illustrate the way our design handles speculation. The
architecture is further extended by the addition of a 2-entry map save stack. However, the current
map is always stored in this stack. Therefore, the processor, as shown, can speculate past at most
1 branch. Mispredicted branches are assumed to be issued only after all prior instructions have
completed. Counters and complete flags are handled according to the above scheme.
When the branch instruction is issued, the processor saves the current map on the map save
stack. The next instruction remaps r2 to p5. However, the previously mapped physical register
p2 is still not consider unmapped, since it is mapped in the map save stack. The next instruction
again remaps r2. p5 is considered unmapped, since it is no longer mapped by the current or any
of the saved maps. Finally, the branch is resolved in the 4th cycle.
Fig. 7.3 shows the processor behavior when the branch was correctly predicted. The saved map
is popped from the stack. p2 is now no longer mapped by any map, and its unmap flag is set to
true. In fact, it can now be reclaimed.
Fig. 7.4 shows the processor behavior if the branch was mispredicted. All but the top saved
map are flushed from the map save stack. This map is used to restore the mapping table. The
physical registers not mentioned in this map are readied for reclamation by setting their unmap and
complete flags, and zeroing their counters. Of course, the instructions currently in the pipeline are
flushed.
7.5 Precise interrupts
Any one of several precise interrupt mechanisms [54] can be adapted to the proposed design. The
major modification necessary is that when an interruptoccurs, not only must the register state reflect
the precise state, but the mapping table must be restored to its value at the time the interrupting
instruction was issued. Actually, it is not necessary for all the physical registers to be restored to
their value at the time the interrupting instruction was issued; it is enough to restore only those
registers that were actually mapped at that point.
For instance, if we wish to adopt a checkpoint-retry mechanism [26] then, at the checkpoint, we
would need to save the current map, and the values of the physical registers mapped by the current
map. On an interrupt, the checkpointed values would be used to reset the map and the relevant
physical registers.
The history buffer mechanism is a particularly attractive candidate. A history buffer is a
FIFO, to which every instruction is added when it is fetched. When the instruction writes to the
register file, the value it overwrites is saved in with it. Instructions are removed from the top of
the history buffer when they complete. Any exception caused by an instruction is reported only
when it comes to the top of the stack. At that point, the register file values are restored by "roll
back"--Hbasically, starting from the bottom of the buffer, all the saved register values are written
back to the register file. The straight-forward adaptation of a history buffer to our scheme will save
with each instruction the old physical register mapped by its output register. On an interrupt, both
the map and the register file are restored to their precise state by rolling back the history buffer.
121
?r n T
brgtr3r2
Incrp3,p2
Save Map
I			I
I			I
I			I
I			I
I			I
I I
r2 := r2 - r4
Map r2 to p6
Unmap p5
Incr p5, p4
I			I
I			I
I			I
I			I
I			I
Insn-Fetch			Op-Fetch
brgt p3, p2
Execute			Write-Back
m
m
rn
Figure 7.4: Mispredicted Branch
brgtp3,p2
Pop all but top Map
Complete p5,p6
Unmap p5,p6
Clear Counters
Restore Map
p6 := p5 - pi			p5 := pi + p3
Bubble			Bubble
r2:=ri+r3			brgtp3,p2
Map r2 to p5			Read p3, p2
Incr pi, p3			Decr p3, p2
p5 := pi + p3
Read pi, p3
Decrpi,p3
Bubble
rn
122
7.6 Summarizing the design
In this section, we shall bring ?ogether the different components that we have introduced above.
The different pieces comprising the design are summarized below;
Mapping Logic: The mapping hardware is responsible for translating each source register
specifier to the corresponding physical register. This component is discussed at length in Section
9.
Map Modify Logic: The map modify logic is responsible for updating the current map after
each cycle. Briefly, the mapping for each architected register is updated as follows:
o+ On a reset or a context switch, the entry is set to some default value, such as the entry number.
Thus, initially, architected register N maps to physical register iV.
0 On a mispredicted branch, the map is updated from the map at the top of the map save stack.
o+ Ifthe architectedregister is the same as the outputregister from one ofthe fetched instructions,
then it is replaced by the free physical register allocated to that instruction.
o+ Otherwise it is set to its old value.
Map Save Table: The map save table, as indicated before, is a FIFO that saves the map values at
each branch. On a branch, the current map is saved at the bottom ofthe table (ifpossible; otherwise
fetching is stalled). On a mispredicted branch, the table is flushed, and the current map set to the
top map in the table. On a correctly predicted branch, the top saved map is removed from the table.
Free Physical Register Logic: Freeing physical registers is possibly the most complex part
of the design. It involves keeping track of when an instruction has completed, the number of
outstanding uses, and whether it has been unmapped. Completion is easy; whenever a physical
register is written, the register's completion flag is set. Keeping track of outstanding reads involves
a multi-inputup-down counter, which is incremented by the number of instructions that are fetched
each cycle, and decremented by the instructions executed each cycle. Unmapping, as discussed
above, is implemented by setting the unmap flag for a register when it is not present in the current
or any of the saved tables. There exists a inexpensive implementation for detecting when a register
is no longer in the tables, and can be unmapped.
Precise Interrupt Logic: A mechanism, such as the history buffer, is needed to ensure that
interrupts are precise.
123
7.7 Optimization
The design described above is fairly general, and can be made even more general. For instance,
it could be adapted to handle branch prediction resolution in any order. However, providing this
generality requires extra hardware. In this section, we shall introduce certain constraints on the
implementation. These constraints enable us to greatly simplify the implementation, by combining
many of the different components listed above into one simple structure.
One major source of complexity is determining when physical registers can be reused. This
happens when the register has been written to, there are no outstanding reads and the associated
architected register has been remapped. These must all be true if all instructions upto and including
the instruction that displaced the physical register have completed. Clearly, the instruction that
was allocated the physical register must have completed, and therefore the register must have been
written to. All the instructions that could access this register mustprecede the displacing instruction,
and since they must have completed, all reads ofthe register must have been accomplished. And, of
course, the register must have been remapped. This criteria--Hall instructions upto and including the
instruction that remaps the physical register must have completed--His fairly simple to impltment.
The implementation can be built around a history buffer. An instruction is added to the bottom
ofthe history buffer when it is fetched. An instruction is removed from the top ofthe history buffer
after it completes. Thus, the history buffer contains instructions in program order. To support
register freeing, when an instruction is added to a history buffer, the physical register it remapped
is saved with it. When the instruction is removed from the top of the history buffer, this physical
register is freed.
A mechanism that frees registers as described above can be used to simplify precise state
recovery significantly. We only need to recover the values in those registers that were mapped at
the time the excepting instruction was issued. In our scheme, a register can only be overwritten
after it is freed, and it is freed only after all instructions that were issued while it was mapped
complete. Thus, all values which are part of the precise state are present in physical registers,
which have not been freed. Therefore, when an instruction traps, only the map has to be restored
to what it was when the instruction was issued; none of the registers that were mapped at that time
would have been freed and overwritten.
Interrupt recovery and misprediction recovery can be facilitated by maintaining an in-order
map, i.e., the map used by the instruction at the top of the history. This can be implemented by
saving, for each instruction in the history buffer, the physical register which the instruction's output
register was mapped to before being remapped by the instruction. When an instruction is retired
from the top of the history buffer, the in-order map is updated appropriately. If exceptions are
reported only when the excepting instruction reaches the top of the history buffer, the map can be
restored in a single cycle by copying the in-order map instead of rolling back the history buffer.
Similarly, ifmispredicted branches are handled only when the branch reaches the top of the history
buffer, the map can be reloaded using the in-order map.
The in-order map also suggests another optimization: there is no need to save the physical
register that was displaced in the history buffer to support freeing. Instead, the register to be freed
can be selected from the inorder map by using the architected output register of the instruction at
124
the top of the history buffer.
To summarize: the history buffer has an entry for every instruction. The information contained
in this entry includes the instruction, the physical register displaced from the current map by the
instruction when it was issued, and a bit indicating whether the instruction has completed. There
is an in-order map, which satisfies the invariant that it is the same as the current map used by the
instruction at the top of the history buffer. The actions performed are the following.
o+ When an instruction is fetched, it is added to bottom of the history buffer. The physical
register to which the instruction's output register is remapped is also added (assuming, of
course, that the instruction has an output register).
o+ When the instruction at the top ofthe history buffer completes, it is removed from the history
buffer. The physical register in the in-order map that corresponds to the architected output
register of the instruction is freed. It is replaced by the physical register saved with the
instruction being removed.
o+ If the instruction at the top of the history buffer is a mispredicted branch, then all registers
not in the in-order map are freed. The in-order map is copied to the current map.
o+ The behavior on an interrupt is identical to the behavior at a mispredicted branch.
Thus, it is possible to dispense with the counters, map save stack, register save stack etc., and
implement physical register freeing and mispredict recovery economically using the history buffer
mechanism described above. This mechanism also allows us to implement single-cycle precise
state recovery. A" this is accomplished at the cost of some generality.
7.8 Implementation
We have discussed the need for additional hardware, including logic to perform the mapping,
modify the map, free registers, etc. We also assumed the existence of logic for branch prediction
and an instruction dispatch mechanism5. However, lack of space prevents us from discussing the
detailed implementation of all of this logic. Instead, we shall concentrate on the mechanism for
actually performing the mapping, since this is the only logic which could potentially increase the
machine cycle time.
In the past, renaming has been implemented in the instruction dispatch stage. This stage is
considered to be one of the most critical stages in superscalar processors in determining cycle
time (23,64]. We propose to implement our renaming scheme in the instruction fetch stage, as
shown in Fig. 7.5. The critical paths are shown by the solid lines; paths shown by the dashed lines
represent feed-back paths, i.e., values used in the next cycle. As indicated in Fig 7.5, mapping
takes place before set selection; this means that all sets go through the mapping logic, which must
be replicated according to the associativity of the cache.
5We have deliherately not specified whether instruction dispatch uses reservation stations or a centralized mecha-
nism; our approach works with both.
123
Tag
Directory
Tag Match
Logic
Instruction Cache
Source Registers
Map Modify Logic
Map
Insns
Mappihg Logic
Remapped Sources
Th
Free Register
Pool
Instruction Dispatch Setup Logic
Figure7.5: InstructionFetchStage
Free Registers
126
The proposed scheme requires that each source register name be mapped using the mapping table
to its associated physical register. If there are 32 architected registers, the mapping is implemented
in the mapping logic as a 32:1 mux. Using a pass-transistor based multiplexor implementation in
CMOS, this will probably involve 4 logic stages. The source register fields must control several
multiplexors; this requires they be initially powered up to obtain a fanout estimated to be about 40.
As shown in the figure, the critical path is determined either by the tag match logic required for
cache operations or by the mapping logic discussed previously. The path through the tag match
logic is influenced by four factors, namely, the associativity of the cache, its size, the address size
and the fan-out requirements of the tag compare. For the purposes of this discussion, we assume
that the cache is not direct mapped and that the address size is 32 bits.
The tag match logic starts off by checking, depending on the cache size, possibly the high
18-24 bits of the address and the program counter for equality, which can be implemented in about
4 stages of logic stages. The result bits then serve as control for a number of multiplexors. If the
implementation attempts to issue 4 instructions per cycle, this implies a fanout of about 1286. This
requires a powering tree that will add 2-3 logic stages (assuming some of the powering is buried
in the compare).
Putting all the requirements together, we observe that both paths require about the same number
of logic stages: about 6-7. This suggests that, assuming CMOS technology and a set associative
cache, the proposed scheme may not increase the machine cycle time.
Notice that only the source registers go through the mapping logic; specifically, the time at
which the opcodes become available to the instruction decode stage does not change. This may
allow us to overlap some of the mapping with the initial phases of the instruction decode in the
decode stage, if necessary.
None of the other mechanisms required add to the cycle time. Updating the map to reflect
any remapped registers is not time critical. It can be done in the next cycle, in parallel with the
I-cache access. The mechanism to free physical registers is also not time-critical. All that matters
is that there be enough free physical registers, not which ones, i.e. throughput matters, not latency.
Adding an extra cycle to the time to free a physical register can be compensated by expanding the
number of physical registers by the number required per cycle.
So far in the discussion, we have assumed that all instructions are 3-address instructions. There
will be several instructions with no output register (e.g., branches), and some with only one input
register. Superscalar issue adds a related problem--Hifsome source register name is the same as the
result register name of some instruction fetched at the same time, but prior in the program order,
it must use the new physical register allocated to that register name, rather than the entry in the
mapping table. It would seem that we would need to do some decoding in the instruction-fetch
stage.
However, we can delay these decisions till the next cycle. All instructions are initially mapped
as though they were independent 3-address instructions. In the next cycle, as part of the remap
logic, the instructions are decoded to determine whether they actually had an output, and source
and result registers are compared for identity. If the instruction did not have an output, or multiple
6The fanout can be of this order, or worse, even if less instructions are issued per cycle, depending on the physical
organization and read out of the cache ?ays.
127
instructions had the same output register, the mapping is updated appropriately. Thus, decodes are
not added to the critical path.
A similar fixup has to be performed in the instruction decode stage. Determiningthe appropriate
register to use can be done in parallel with register access. If the physical register provided by
mapping table is incorrect, the value read can be discarded. This results in no loss in performance.
Either the instruction did not need that value, or the wrong register was accessed because some
other instruction fetched in parallel had the same output register name. In the second case, the
value needed is the one that will be provided by the other instruction. That instruction, obviously,
cannot have completed, so the instruction with the incorrectly mapped register cannot be issued
immediately, anyway. Thus, there is at least a full cycle in which to determine the correct register
mapping.
7.9 Comparison
The pertormance game(I by adding a new mechanism has to be balanced against the amount of
hardware required to implement the mechanism, and that implementation's potential impact on the
cycle-time of the processor. Both of these are extremely difficult to quantify without actually an
actual implementation, and even with such an implementation it is sometimes difficultto isolate the
impact caused solely by the new mechanism, or by its interactions with other processor elements.
So, the comparison of our scheme with other schemes that implement register renaming, dynamic
speculation and precise interrupts will be more qualitative than quantitative.
One critical path in register renaming is the path from the instruction fetch to execution. On a
vanilla RISC processor, this path is implemented in two stages, Instruction-Fetch and Instruction-
Decode (also called Operand-Fetch). Logically, it follows the following steps:
Fetch Decode ReadValues Execute
Adding renaming adds an extra step:
Fetch Decode Rename ReadValues Execute
If these steps are implemented serially, obviously either an extra stage will be required, as in [20,
49), or the cycle-time of the existing two stages will be increased. The only way to avoid this is to
implement some of the steps in parallel. [29] discusses several ways of folding together register
renaming with register read, all of which require content-addressable storage. The time to perform
an associative lookup is obviously going to be greater than that of a register access. It may turn out
that the associative access will end up increasing the cycle time.
Our approach chooses to fold renaming with instruction fetch. A preliminary evaluation
suggests that the mapping mechanism used can execute in parallel with elements of the instruction
fetch stage without impacting the critical path through that stage. The rest of the path should not
change from that in a vanilla RISC processor. In particular, reading values involves a normal
register access.
128
Associative lookup requires content-addressable storage to implement efficiently. `This storage
requires complex and expensive hardware. The designs proposed by [20] and [29] use associative
lookup to access register values. Similarly, the implementation described in [39] uses content-
addressable storage for register control. Our scheme, on the other hand, requires no such complexity.
Finally, our scheme, like [20] and [29], but unlike [39], enables us to recover precise register
state in a single cycle. This in turn lets us implement single cycle interrupt recovery and speculation
past an arbitrary number7 of branches.
7.10 Conclusion
In this paper, we have presented a proposal that is in line with the current thrust of processor
design and implementation. It exploits both the regular nature of RISC instructions and the fast,
pass-gate based, multiplexors available in CMOS technology to implement renaming efficiently in
the instruction fetch stage. This early implementation of renaming has simplifying repercussions
for the rest of the design.
The optimized implementation proposed here integrates speculation, renaming and precise
interrupts. The synergy resulting from this integration results in a simple and potentially cheap
ti?n e'fthe th
ee m??iaiii????.
It seems fairly clear that the hardware costs of our proposal are fairly modest, especially in the
context of the 3 million or more transistors being used to implement current microprocessors. The
open question, therefore, is the following: can the proposal be implemented without increasing the
cycle-time? We hope to add our mechanisms to an existing microprocessor, and demonstrate, at
least via simulation, that the addition will not impact the cycle-time of the processor.
7Not entirely arbitrary; it is limited to the number of slots in our hIstory buffer.
Chapter 8
Conclusions and Future Work
In this dissertation, we have examined issues related to the implementation and exploitation of
static speculation. Here, we summarize our research and present the significance of this work. We
also sketch out future directions of research in this area.
8.1 Conclusions and Contributions
In this dissertation, we showed that static speculation should be a feature of multiple-issue archi-
tectures with a high issue-width. This required addressing pragmatic concerns such as exception
handling in the presence of speculative instructions and the implementation of speculative instruc-
tions, as well as performance related questions.
Past work in the area of static speculation glossed over the problem of supporting exceptions.
The need for addressing exception handling was recognized, since it is key to implementingfeatures
like virtual memory; however, the solutions proposed were somewhat ad-hoc. We systematically
examined the interaction between exception handling and static speculation. Our investigation
showed that it is possible to produce code such that there is no difference in handling exceptions
for non-speculative and speculative programs. We derived the compiler modifications necessary
to make this possible.
Existing techniques for implementing speculative instructions are not uniformly applicable.
They impose restrictions on the kinds of speculation expressible, especially when multi-path
speculation is being performed. We introduced speculative tagging, a mechanism for implementing
speculative instructions that was designed with both error-recovery and multi-path speculation in
mind. Speculative tagging reduces the restrictions and cost of error-recovery. It is uniform in
that it allows all types of instruction to be speculated, and allows all multi-path speculations to be
expressed.
We developed a new algorithm, whole-DAG scheduling, for exploiting static speculation when
scheduling acyclic code. Our approach differs from previous work in that it provides fine grain
control over the amount of speculation desired. Also, it uses a significantly better instruction-
selection heuristic than previous algorithms. A performance evaluation shows that our scheduling
algorithm can achieve results that are within 10% of optimal.
129
130
8.2 Future Work
8.2.1 Non-Speculative Architectures
It is unlikely that speculative instructions will be added to current generation micro-processor
families. Note, however, speculative instructions are needed primarily to support instructions
that could cause exceptions. Clearly, there will exist instructions that will not cause exceptions
even when speculated, and thus can be moved above branches even on architectures without
speculative instructions. It would be interesting to investigate whether a compiler can detectenough
such instructions to justify implementing "safe" speculation in a compiler for non-speculative
architecture5.
8.2.2 Register Allocation
Register allocation is an aspect of code-generation that we have not addressed in this dissertation.
Speculative tagging makes it simple to spill registers, so a standard graph-coloring based post-pass
register allocator [9] may be adequate. However, it may be possible to do better. One possibility is
to implement an integrated scheduling and allocation algorithm, such as by adapting the integrated
algorithm described in [19]. Another possibility is to use a post-pass allocation algorithin that
substitutes register-to-register moves at basic block boundaries for spills, in a manner similar to the
"chameleon intervals" of [22].
8.2.3 May-Dependence Speculation
We saw earlier that performance gets retarded when the compiler cannot determine whether or
not two memory operations use the same address, i.e. when aliasing is present. `This results in a
may-dependence: the two instructions may or may not be dependent on each other. One way of
obtaining performance even in the presence ofmay-dependences is to split the instruction into two,
optimize one assuming the instructions are dependent and the other assuming they are independent,
and then use speculation to appropriately schedule the instructions. This process is illustrated in
Fig. 8.1.
Unfortunately, this approach leads to exponential code?xpansion. Consider an instruction
which is may-dependent on two other instructions. To circumventboth these dependencies rcquir?s
4 copies of the instruction, one for each assumption of dependence/independence. An alternative
solution might be to allow hardware to perform this kind of speculation. However, this also
apparently leads to hardware that is exponential in the number of may-dependences bypassed.
It would be interesting to investigate combined hardware/software techniques that can increase
pefformance in the presence of may-dependence5, without exponential code expansion.
8.2.4 Dynamic Multi-Path Speculation
In the near future, itmay be practical for a processor to dynamically speculate down multiple paths.
The number ofpaths theprocessor canfollow will be restricted--Hpossibly 2-4. The processor must
131
(1)			x[i] = t			(1) x[i? = t
(2)			z = x[ji + 1			if( I == j
(2) z = t + I
(a)
else
(3)
z = x?jj + 1
(3*)
(1)
(2*)
if
z = x[j] + 1
x[i] = t
z=t+l
I == j
commit (2)
else
commit (3*)
Figure 8.1: May-Dependence Speculation
allocate these paths in a manner that optimizes expected performance. Work should be performedto
discover the best strategies for allocating these paths; perhaps some ofthe work we have performed
to discover optimal static speculation strategies will be applicable.
Appendix A
Proof of Speculative Equivalence
In this chapter, we shall present a formal proof that a non-speculative program that is similar to a
speculative program is also speculatively equivalent to it, if the speculative program follows the
constraints outlined in Chapter 3.
In Section A.l, we describe the formal model of execution on which we will base the proof.
In Section A.2 through Section AA, we shall define similarity, the compiler constraints, and
speculative equivalence in terms of the model. Finally, in Section A.5, we shall give the formal
proof.
A.l The Model
In this section, we shall describe the components of our model. Their types are summarized in
Fig. A.l. A more elaborate description follows, including the restrictions we place on the various
atoms and functions, and their interaction.
A.1.1 Atoms
The atomic or primitive types used in the model are Va/u?, Locn, Lab?1, and Cnid. Value is
the set of values stored in locations and manipulated by commands. Locn, the set of locations, is
partitioned into two disjoint sets of speculative and non-speculative locations. Label is the set of
program labels. It is, as indicated, totally ordered. We shall make use of this total ordering; in
particular, we shall use SuCCLab?t, the successor function implied by the total ordering. Cmd, the
set of commands is also divided into speculative and non-speculative commands. We shall show
how the behavior of a command is described later in this section.
A.1.2 Map
For each speculative location, there exists a non-speculative location. The functions, M, defines
a (possibly many-to-one) mapping from speculative locations to non-speculative locations. This
function is defined for all speculative locations; i.e.,
132
Atoms:
u,v			C			Value
k,l			?			Loen
E			Label
c,d			E			Crnd
Primitive Objects:
?			Program
E			State
= SpecLoen u+ NonLoen
= SpeeCmd u+ NonCmd
Functions:
M : SpecLoen H iVonLoen
133
set of values
set of locations
totally ordered set of labels
set of commands
Label			H			Cmd
NonLoen			(Value Loen) I
SpecLoen			H			? (VaThe Loen), Label> I
Rd			:			Cmd x State Loen
Op			:			Cmd H (Value Loen) H (Value Locn)?
Swr			:			SpecCmd H (Value Loen)* H SpecLoen*
Nwr			NonCmd (Value Loen) Nrn?Locn
Semantic Functions:
g : Label x Cmd x State			State
AT : Label x Crnd x State H Label
Program x Numberx ? State, Label >?? State, Label>
Auxiliary Functions:
MapContent : (Value Locn) ?? (ValuelNonLoen)
Content			:			State x Locn H (Value Loen)
LabelOf			:			State x SpecLoen H Label
Pa?r Program x Numberx ? State, Label >?? Number I
Figure A.1: Atoms and Function Types
134
Vk' c SpecLoen : e NonLoen : k =
The associated auxiliary function, MapContent, reads a value or a location, and returns a value
of the respective type. However, if the input is (the address of) a speculative location, it converts
it to the corresponding non-speculative location, by M
Definition 1:
MapContent(u) = MU(u) uU ? Valt?e U NonLoen
SpecLoen
A.1.3 Program
A program is a mapping from labels to commands. Note that in our model, we the labels are a
totally ordered set. We assume that there exists a distinguished label Co in which the program is
supposed to start executing.
A.1.4 State
A state is a mapping from locations to values. However, we shall restrict ourselves to considering
legitimate states, where a legitimate state is one that satisfies the following properties. First, we
restrict the state so that only speculative locations to hold pointers to speculative locations.
Property 1:
Vk c NonLoen : a(k) ? SpecLoen
Further, we want to ensure that at most one speculative location updates a non-speculative
location at a commit. So, all legitimate states will satisfy the following property as well:
Property 2:
Vk',1' ? SpecLoen : (a(k') =? u,a> Aa(1') =? v,a > A??(k') =
? = 1'
We shall later show that any executing any command in a legitimate state produces a legitimate
state.
We use some auxiliary functions to isolate the values associated with locations in a state.
Content: returns the value contained in a location. In particular, if the location is a speculative
location, it extracts the value from the tuple associated with the location.
Definition 2:
u			k c NonLocn A a(k) = it
Content(a, k) =			v			k c SpecLoen A a(k) =? v, a>
I			el?e
LabelOf: returns the label associated with the value stored in a speculative location; i.e., the
second element of the tuple.
Definition 3:
LabelOf(a, k)' =			) aei;?) =? v, a>
135
It will be necessary to compare the values in the non-speculative locations of two states for
equality. Ifthey are equal, we say that the non-speculative components oftwo states are equivalent,
written ?=NonSpec?. Thus,
Definition 4:
?=--HNonSpec? ? Vk E NonLoen : 0(k)
A.1.5 Sequences
We shall, in several places, be using sequences of values and/or locations. A sequence will be
indicated with an vector overbar. Thus:
u E (Value Locn)*
? Loen*
The 1-th element of a sequence will be indicated by subscripting with an integer. ?T?rpically
the subscripting variable will be implicitly quantified as ranging between 1 and the length of the
sequences. Thus:
VI:Prop(ki) =--HVI,1 ? I ? I?I : Prop(ki)
Through an abuse of notation, we assume that each of the functions, when applied to a state
and a sequence of values of the appropriate type, returns a sequence of the same size containing
the function			to each			put. Forexample:
appl,ied			element of the in
Content(a, k) = 1 where VJ : ii = Content(a, ki)
A.1.6 Command
We shall not differentiate between most commands, beyond distinguishing speculative and non-
speculative commands. Instead, as shall be shown in the next section, we shall define their behavior
in abstract terms. We shall identify certain properties that these behaviors must satisfy, and use
those properties inour proofs.
However, two instructions are interesting in the way they modify state. The first is commit
and the other skip. Also, two instructions, if and goto are unique in the way that they modify the
control-flow through the program.
The execution of a statement is modeled using two functions. The first, ?, models the effect
executing a statement has on the program state. The second,)V, models the effect executing the
statement has on control-flow, i.e., the next statement to be executed.
Non-speculative Commands
The behavior of a general non-speculative command (i.e., not including commit or skip) is
summarized by the equation given below, with appropriate definitions for Rd, Op and Nwr:
Definition 5:
?(o,c c NonCrnd,a) = ?, where
k = Rd(c, a)
136
= Content(a, k)
t;= Op(c?u?
1 = Nwr(c, u?
?n)= aV)n)
fl = 1H
else
In particular, the command reads the values in locations identified by Rd. Using these values,
the command produces output values as specified by Op, and uses them to modify the locations
specified in Nwr
The flinctions defining the behavior of a non-speculative command must satisfy the following
properties:
o+ The command reads only non-speculative locations.
Property 3:
Ve e NonCrnd, a E Stale: k ? Rd(e, a) ? k E NonLoen
Since the command reads only non-speculative location, as a consequence of Prop. 1, the
values read by the command are either values or (the addresses of) non-speculative locations.
Property4:
Ve ? NonCmd, a C State : u E Content(a, Rd(c, a))
? k c NonLoen U VaThe
o+ The number of values produced and the number of locations modified by the command are
identical.
Property5:
e NonCrnd, a E State : ii = Content(a, Rd(c, a))
? IOp(eit7) = jNwr(e?u?j
The locations modified by the command are non-speculative. Note that this follows from the
type of Nwr.
Property6:
Ve E NonCmd, ii E (Value Loen)* :1 c Nwr(c, u? ? 1 c NonLoen
We want to ensure that any state produced as a result ofexecuting a non-speculative command
is legitimate (assuming, of course, that the command was executed in a legitimate state).
This requires that the following property hold:
Property7:
Vec NonCmd,u? (Value NonLoen)* : v C Op(c,?)
? v E NonLoen U Value
Additionally, commands must satisfy the property detailed below. It states that the 1-th location
read by a command depends only on the command and the values in the previous locations accessed.
Note that this property must be satisfied by both speculative and non-speculative commands.
137
Property 8:
Vc ? Cmd, a, ? C State:
let
in
= Rd(Q a)
1 = Rd(e,?)
Kilfo
v (k1 = li
A (VI:
(VH 1 ? il ? I: kH = lii A Content(a, k11) = Content(?, 7H))
? k1 = 1i))
Speculative Command
The specification ofa speculative command is much like that ofa non-speculative command, except
that the speculative locations in which the values produced by the command are to be stored till
committed need to be specified. This is modeled by Swr. Thus,
Definition 6:
C SpecCmd,a) = a', where
P = Rd(c',a)
= Content(a, k')
v' = Op(e',?T')
= Swr(c', u?)
? %%, &>			n =
I			n E SpecLoen A n ?
=			ALabelOf(a,n) = &			(1)
a(n)			else
Note (1) in particular: this ensures that the effects of any previous execution of the same
speculative instruction are discarded. This is exactly what is done by safe-bits; in boosting, the
discarding could occur earlier.
Certain restrictions imposed on the functions that model a speculative command include:
o+ The number of values produced and the number of locations modified by the command are
identical.
Property 9:
Ve' C SpecCrnd, a E State : t7' = Crn?tent(a, !?d(c', a))
? IOp(e',ii') = ISwr(e',u?)I
o+ All locations modified immediately by a speculative command are are speculative. Note that
this follows from the type of Swr.
138
Property 10:
Vc' ? SpecCmd, u? C (Value Loen) : m' C Swr(c', u?)
? rn' E SpecLoen
o+ We wish to ensure that the state produced as a result of executing an instruction satisfies
Prop. 2. To do so, we restrict Sw'r as follows:
Property ii:
? SpecCmd, u' E (VaThe Loen)* : V?', n' E Sw?(c', u?)
A4(??')=A4(n')?rn' =n'
Commit and Skip
The commit command moves the values produced by a speculative instruction from speculative
locations to non-speculative locations. Briefly,
Definition 7:
?(o, commit P, a) = a', where
MapContent(w)			rn' C SpecLoen A k =
=			Aa(rn') =? w, f3>
a(k)			else
The behavior of the commit function can be divided into two parts. One is to copy certain
values stored in speculative locations to non-speculative locations. The other has to deal with the
fact that there could be a pointer to some speculative location stored in one of speculative locations
being committed. Any such pointers are updated to point to their non-speculative location.
skip is the usual, do-nothing, instruction. Thus,
Definition 8:
?(a, skip, a) = a
Control Flow
Most of the commands. speculative and non-speculative, have no effect on the control flow; the
next statement is executed. The other two, if and goto are also as expected. goto changes control
flow unconditionally, while if does it conditionally. Note that both statements are non-speculative.
This implies that the location referred to in the if command, k is a non-speculative location.
Also, we have used succLabet, the successor function on Label, mentioned above. We assume the
existence of a value true, which controls the behavior of the if command.
The effect of the various instructions on control flow is summarized below:
Definition 9:
)V(a,ff(k)P,a) = ?, where
= s?ucc?a?ct?a? a(k) = true
else
139
Definition 10:
)\f(a,goto(P),a)			p
Definition 11:
= suCC?a?ct(a)
A.1.7 Program Execution
We are interested in a point by point comparison ofthe behavior of speculative and non-speculative
programs; therefore, instead oftrying to define the result ofexecution of a program, we shall define
the result ofexecuting I commands of the program. Thus:
Definition 12:
1?(?, 0, < ao, ao>) = ? ao, o?o>
Definition 13:
)?(?, I, ? ao, Qo>) =
let
zn
?p >= ?(?, J --H l,? a,?>)
<; g(?,??),Af?3,??) >
It will often be convenient to be able to extract either member of a State-Labeltuple. The
following auxiliary functions will be convenient:
Definition 14:
Label(? a,o>) =
Definition 15:
State(? a,a>) = a
Legitimate States
We had earlier restricted the set of states we considered to those that are legitimate, i.e. satisfy
Prop. 1 and Prop. 2. We have, therefore, to show that executing a command in a legitimate state
results in a legitimate state.
We shall first show that it is not possible for the execution of a command in a state that satisfies
Prop. 1 to result in a state that violates Prop. 1, i.e.
Vc,a:
let
= ?(?, c, a)
140
Vk E NonLocn : a(k) ? SpecLocn
? Vi E ATonLocn : ?(l) ? 5ThecLocn
The only commands that modify locations in NonLocn are the non-speculative commands
and commit. The non-speculative commands, from Prop. 7, cannot update the state by (the
address of) a speculative location. commit, as can be seen from Defn. 7 updates locations by
MapContent(m), which cannot be a speculative location. The unmodified locations, of course,
contmue to not contain speculative locations. Consequently, if any command is executed in a state
satisfying Prop. 1, the resultant state satisfies Prop. 1.
We shall now show that it is not possible for the execution of a command in a state that satisfies
Prop. 2 to result in a state that violates Prop. 2, i.e.
Vc, a:
let
(p= ?(?, c, a)
zn
c SpecLocn : (a(k') =? u?a > Aa(l') =< v,a >
AM(k') = M(l')) ? = 1')
? ((Vk',l' c SpecLoen : (?(M) =? u,a > A?(l') =? v,a>
AM(k') = M(l')) ? = 1')
The only commands that can modify speculative locations are commit and speculative-
commands. commit sets certain speculative locations to I, clearly not causing a violation of
Prop. 2. Executing a speculative command, by Prop. 11, also cannot cause Prop. 2 to be violated.
Hence, if a command is executed in a state satisfying Prop. 2, the resulting state also satisfies
Prop. 2.
Thus, if a command is executed in a legitimate state, the state produced is also legitimate.
Pairs
Whenever we encounter a commit, we will need to refer to the instruction that it is going to
commit. More precisely, we need to refer to a specific instance of that instruction. This instance is
defined for commit p, if the command at t3 was executed and no intervening commit p has been
executed (see (2)). Further, it is the closest such execution of the instruction at p (see (3)). This
is illustrated by Fig. A.2. The J-th command executed is the command being committed by the
commit instruction at I. No instruction in between can have either label a or
Definition 16:
Pair(?', 1, ? ao, ao>) =
let
zn
a = Label(9?(?',I,? ao,ao>))
if(?(a) = commit ?
A ai o ? j ? I: i3 = Labcl(9Z(?,J,? ao,ao>))
AVH,J < il ? I: ? ? Label(1?(?,Th,? ao,ao>))
(2)
141
Instructions
Executed
0
Label of			Instruction
Instruction from P'
J
I			0:			commit ?
Possible ?
Figure A.2: J Pair(P-', J, ? a0, o?>)
No ? or
commit ?
then
J
else
AVfl,J ? H ? I:
commit 73 ? P'(Label(?(?', II, ? a0,
(3)
A.2 Similarity
In this section, we shall define what we mean when we say that a non-speculative program is
speculatively equivalent to a speculative program. First, we shall define what we mean when we
say a non-speculative command c is similar to a speculative command c', written c?c'
Definition 17:
c?d' : c E NonCrnd, d' c SpeeCn?d ?
(Va, ? c State:
let
k = Rd(c, a)			= Rd(d', ?)
u = Content(a, k?			ji' = Content(?, k')
v = Op(c, u)			= Op(d', ii')
1 = Nwr(c, u?			ff?' = Swr(d', ?7')
142
MapContent(Content(?, P)) = Content(a, M (k7'))
k=M(k%A
u = MapContent(t7') A
v = MapContent(??') A
7= ??(ff?')
We use the definition of ? listed above to define a transformation. A non-speculative program
? is similar under transformation to a program?', written ???` if:
Definition 18:
dom(?) = dorn(?') A
Vc c ?: c C NonCmd A
VQ C dom(?'):
if(?'(a) E SpecCrnd) then ?(?) = skip
else ?f(?'(&) = commit p) then ?(a)??'(J3)
else ?(a) =
Note that, if it is possible to find a non-speculative similar instruction for every speculative
instruction, the Defn. 18 allows us, for every speculative program, to construct an non-speculative
program that is similar under transformation to it.
A.3 Compiler Coristraints
The compiler defines some distinguished initial status <a0, ?o> in which program execution
should be initiated. The compiler produces code so that it satisfies the conditions listed below if
execution is initiated by executing the instruction at label &0 in initial state a0. Note that the initial
state is assumed to be legitimate, i.e. satisfy both Prop. 1 and Prop. 2.
The compiler must ensure that, ifa commit is executed, then there must have been an execution
of the command being committed, without an intervening commit of that command; i.e., for every
commit, its Pair must exist.
Property 12:
VI:< a', a' >= 1?(?', I, < ao, ao>) A ?`(a') = commit ?
aJ: J = Pair(?', I, < a0,
The compiler must ensure that, if a command is being committed, then the values that it used
must be available. In particular, if the values were stored in non-speculative locations, they must
be available in the same locations, while if they were stored in the speculative locations, they must
have been previously committed to the appropriate non-speculative location. Further, if one of the
values read was a pointer to a speculative location, then that value must have been updated to point
to some non-speculative location.
Property 13:
143
VI :? a', ? >= ?(?, J, ? ao, <>o>) A ?(&) commit ?
let
J = Pa?r(?', I, ? ao,
? ?, p >= ?(?`, J, < a0, (?>)
Vk' ? Rd(?(p), ?)
MapContent(Content(?, k')) = Content(a', ?\4
Finally, the compiler must ensure that the appropriate actions occur at a commit. In particular,
this implies that all values that were produced by the speculative command are available in the
appropriate speculative locations, and will be copied to the locations determined when the command
was executed. Note that the values may be pointers to speculative locations; Some of these will
have been updated to point to the equivalent non-speculative location, the others will need to be
updated to point to the equivalent non-speculative location at the commit.
Property 14:
VI:< a',&' >= 1?(?,J, ? ao,&o>) A ?`(?) = commit? ?
let
J = Pa?r(?', I, < ao,
? ?,p >= ?(?`,J, <
= Rd(d', ?)
u? =Content(?,k?)
= Swr(d',?7')
VH : a(rn1,1) =? %%i,73> va(n?14) =? MapContent(i%%),iJ'>
A.4 Speculative Equivalence
We say that two programs ? and Q are speculatively equivalent with respect to initial state
<ao, ?o > written ? =<?,ao> Q when the following holds:
Definition 19:
=--H<ao,?o> Q ?
VI:
let
Q? >= 1?(?,I,? a0, a0>)
? a', a' >= ?(?-`, 1, ? a0, ao>)
a = a' A ?=?Nonspcc?'
144
A.5 Proof
We have to show that given a speculative program ?`, and a non-speculative program? that is
similar under transformation to it, i.e. ???`, then, if all the compiler constraints are satisfied, the
programs are speculatively equivalent, i.e. ? =--H<?o,ao> Q. Thus, we have to prove:
VI:
let
<Qa >= 1?(?,I,< a0, a0 >)
<&, a' >= ]Z(?', 1, < a0, a0>)
tn
a = a' A ?=Nonspcc?'
We shall proceed by induction on the value of 1.
Base: I = 0:
By Defn. 12
<a, a >= ?(?, 0, < a0, ao>) = < aO, a0>
<a', a' >= ?(?`, 0, < a0, a0>) = <a0, ao>
Consequenfly,
Vk ? NonLocr? : a(k) = a0(k) =
A a = a0 = a'
Induction Step:
Assume true for I, i.e.
let
<a1, a1> 1?(?,J,< ao,ao>)
<ai, a'1 >= 1?(?', I, < a0, ao>)
zn
a?(k)=??onspeca1,(k) A a1 = a'1
Toproveforl+ 1
<a1+1,a1+1 >= ?(?,I+ 1,< a?,ao>)
<aKi, a'1+1 >= ??(?`, I + 1, < a0, ao>)
By induction hypothesis
<a1, a1 >= 1?(?,1,< a0, a0>)
<a1,, a'1 >= ?(?`, 1, < a0, a0>)
By definition of 1? (Defn. 13)
a1+1 = ?(a1,?(a1),a1)
a1,+1 = ?(a1,?'(a1),a1,)
and
145
aI+1 = )V(ai,?(ai),ar)
a'1+1 = )V(Qi, P'(Qi), ai')
Thus, it is enough to prove
?(?i ?(ai), ai)=??onsp?c?(?i, ?`(?i), a11)
A Af(Qi, ?(a1), ai) = W(a1, ?(?i), a'j)
We shall split this proof into two steps. First, we shall show that aI+I =NonSpe??i'+i holds.
Then we shall show that a1+1 = holds
Step I: a1+I=?Nonspec?'i+i
By cases, based on type of ?`(ai)
let
c =
Case I: c' C SpecC??d
By Prop. 6, the only locations modified by c' are speculative locations
Vk c NonLoen : a11+1(k) =
From definition of
c = skip
By definition of skip:
Vk E Loen : a1+1(k) = a1(k)
By the induction hypothesis:
Vk c NonLoen : aj(k) = a11(k)
Consequently
Vk e NonLocn : a1+i(k) = a1(k) = a11(k) = a11+1(k)
and therefore
aI+1 =Nonspec?I'+
CaseII:c'=commit?
First, let us compute the value of a'?+?.
By Prop. 12, Pair(?', I t 1, ? ao, ao >) exists
let
and
J = Pair(?', I + 1, ? a0, Q0>)
? ?,p >= ?(?,J?? ao,&o>)
= Rd(d', ?)
= Content(?, P)
v' = Op(d',?7')
rn' = Swr(d', u?)
(4)
From the definition of commit, Defn. 7
146
MapContcni(w) m' E SpecLocn A n = M(rn')
Vn : a11+1(n) =			Aa1,(m') =? w,fl>
a11(n)
ByProp.14
VH: a11(ff?'11) =? v?,?> V&i(m'?) =? MapContent(v?'),t3>
Further, from the definition of speculative commands, Defn. 6, all locations with previous
value of ? w, ? > were set to I when d' was executed. From the definition of Pa?r, Defn. 16, no
intervening execution of a command with label f3 can have been executed. Therefore, these are the
only locations s.t. a1,(m') =? w, ?>. Thus:
VH : n = M(rn'H) ?
a11+1(n) = MapContent(Content(a11, m'?))
E fMapContent(v;'), Alapconten!(AiapContent(v4'1)))
= ? i1\1apCont?nt(??b) ?
and therefore
VH : n = ?\4(ffz;') ? a11+1(n) = AJapcontent(t;')
from which it follows that
Vn : ati+1(n) =			;1,)sconient(J;') n
eThe
Now, let us compute the value of a1+i.
From definition of?
c?d'
let
= 1?d(e,ai+i)
= Content(ai+i, k)
v = Op(c, u)
7= ivwr(c, u)
From Prop. 14
VH: MapContent(Content(?, f?)) = Content(a1,, M (kb))
where (p'is the state at the execution of the instruction being committed, from (4).
Further, by the induction hypothesis:
Vk ? NonLoen : a1(k) = a11(k)
Consequenfly, since
Vk': M(k') c NonLoen
from Prop. 13, it follows
VH: MapContenI(Content(?, kb)) = (?ontent(a1, ?\4
Now, by the definition of ?, Defa. 17:
MapContent(Content(?', k')) = (?rn?tent(ai, Al
k = M(k?) A
u = MapContent(u?) A
v = MapContent(?7") A
1= M(ffY)
147
Consequently
1H = MY?H)
= MapContent(v??)
By the definition of non-speculative commands, Defn. 5
Vn : aj+1(n) =			vH = MapCrn?tu?t(?A\1) n =			=
else
By the induction hypothesis:
Vk ? NonLoen : a1(k) =
Consequently
NonLocn : a1+1(n) = v41 = MapContent(v?') n = l?i =
Vn E			a1(n) = a1,(n)			else
which is the same as for a'1+1.
Thus, we can conclude:
Vn c NonLocn : a1+1(n) = &1+1(n)
and therefore
let
aI+1 =Nonspee?I+1
CaseIII:c'CNonCrnd
k7' = Rd(e',a1,)
= Content(a1,, k')
= Op(c', u?)
7' Nwr(&,u?)
FromDef?. 5
Vk: a1,+1(k) =			k =4'
else
By definition of%+
c =
let
k=1?d(c,ai)
= Content(ai, k)
??= Op(e,ii)
7= Nwr(? u?
By the induction hypothesis
Vk E NonLocn : ai(k) = a11(k)
From this it is clear that both c and c' will read the same values if they read from the same
non-speculative locations. By Prop. 3, they only read from non-speculative locations. Further,
from Prop. 8, it can be shown that, since the two commands are identical, they will read exactly
the same ?quence of values. Therefore:
k =
u = u
v = v
148
1=1'
Thus			k = kH = k'H
vH =
Vk c NonLocn : ?1+1(k) =			= &1(k) e1?e
Consequently
Vk ? NonLocn : a1+1(k) = a1,+1(k)
and therefore
aI+1 =?NonSpec?1,+
Thus, we have shown that, for all ?(?i) and ?`(ai)
?I+1=NonSpec?i+i
Step II: o?I+I = a'1+1
As shown above, to prove 01+1 = 0'i+i, given the induction hypothesis it is enough to show that
AT(oi, P(oi), ai) = AP(&i, ?(ai), a1,)
Again, we shall proceed by cases, this time distinguishing the if and goto commands.
Case I: ?`(01) = goto(P)
By the definition of N
= p
Since goto(P) is non-speculative, by the definition of?
= goto(P)
And
Ar(oj,goto(P),a1) = p
Thus
= Ar(a?,?(aj),ai)
CaseII:P'(a1)=ff(k)p
By the definition of?\f
)\F(ai+i, ff(k)p, a11) = s?ucc?a?et (oi+i) a11(k) = true
else
Since if(k)P is non-speculative, by the definition of?
= if(k)P
By the definition of AT
AT(01+1, ff(k)p, ai) = s?ucc?a?ei (01+1) a1(k) = true
else
By the induction hypothesis
Vk E NonLocn : a1(k) = a11(k)
Consequently, since k ? NonLocn
)\f(aI+I, ff(k)P,a11) = s?ucc?a?ci?a?+i?
a11(k) = al = true
else
And
Af(oi, ?`(oj), a11) = Af(oj, ?(oi), a1)
Caseffi:?`(oi)=someotherinstruction
By the definition of? if c' = ?`(ai) is not a goto or a if, neither is c = P(o1). Therefore, both c
149
and c' must execute the next instruction, i.
At(ai, ?`(ai), a11) = succ?a?ct(ai)
Thus, we have shown:
Af(?i, ?(oi), aj) = Af(a1, ?`(&,)? a11)
and therefore
a1+1 =
Conclusion
It follows, by induction, that the claim is true for all I, i.e.
VI:
let
<a, &!+I >= 1?(1?, 1, < a0 ?o
<a', ?i+i' >= 1?(?', I, < ao, a0>)
?=Nonspec?' A aj+j = aI+1
Thus, we have shown that, if the compiler constraints are satisfied for prowams executing with
initial state < a0, ?o > then,
p =--H<?o,ao> Q
Appendix B
Implementation of Whole-DAG Scheduling
In ChapterS, we had described our algorithm, without giving concrete details ofhow certain aspects
of the algorithm were realized. This was because the implementation we used relied intimately
on the way DFGs represented information. In this appendix, we shall provide those details. We
assume that the reader is familiar with the details of dependence flow graphs (DFG), single-entry
single-exit regions (SESE), and the program structure tree (PST).
Broadly, each function in the program is completely processed before going to the next one. For
each function, each of its DAGs is isolated, scheduled and completely simulated before processmg
the next one. This process is ouflined in Fig. B.l. Our description shall follow this order.
B.1 Preprocessing Functions
The first step is to pre-process the function so as to convert the program into a form which makes
scheduling easier, and then to isolate the DAGs that will be scheduled and simulated.
for each function in program
preprocess function
for( each acceptable DAG in function
compute EARLY/LATE
schedule DAG
simulate scheduled DAG
Figure B.l: Whole-DAG Scheduling and Simulation: Top-Level Structure
150
151
B.1.1 Splitting Control-Edges
The control structure of the program is modified so that there is no control-flow edge that is both a
member of a fork and a join. To accomplish this, we add dummy basic blocks to the control-flow
graph. This allows us to always be able to move code up past both sides ofajoin, without worrying
about the possibility of introducing it on a path where it was not akeady present.
Note that all branches in the control-flow graph as presented to us are either unconditional,
or are 2-way conditional branches. Consequenfly, the number of edges in the graph is at most
twice the number of instructions; even if we split every edge in the graph, we would no more than
quadruple its size.
B.1.2 Isolating DAGs
Mter the control flow graph has been altered, the maximal acyclic SESE regions in the graph are
identified. An SESE is maximally acyclic if it is itself acyclic, and is not contained in another
acyclic SESE region. Such regions are easily found by a top-down walk of the PST, as shown in
Fig. B.2.
These acyclic SESEs are then examined for control structure. Basically, if they have a condi-
tional branch (a switch in DFG parlance), then they are considered interesting.
However, this is not enough. Our simulation technique evaluates each path through the DAG
individually. Ifthe DAG contains too many control-flow paths, the simulation process will take too
long. So, what we do is enumerate the number of paths, and reject the DAG if the number crosses
1000. Note that this is not a restriction on the scheduling algorithin; it is a constraint imposed on
the selection process to control the time taken by the simulation process.
B.2 Computing EARLY and LATE
The scheduling heuristics use a metric called LATE. Determining it requires that we first compute
a fairly similar metric called EARLY. In this section, we shall describe how a LATE and EARLY
value is computed for each instruction.
Note that this the computation of LATE is done once before scheduling, by means of a single
pass through the DFG. These numbers are never re-evaluated for an instruction unless the instruction
is replicated or moved in the graph. In that case, updating the value of LATE for the instruction
can be done by examining the instructions neighbors in the DFG; it is not necessary to walk the
graph again.
B.2.1 Computing Early
The EARLY value of an instruction is the instruction's distance from the beginning ofthe DAG. In
the dependence-flow graph, it can be found by weighting all dependence out-edges with the latency
oftheir source instruction, and then finding the longest such path in the graph to the instruction. The
152
find_sese_regions(PST_Node node, Queue DAG_Q)
DAG *
if
/*
/*
dag;
node is acyclic && node is not empty &&
node has no sequentially composed predecessor I I
predecessor of node is cyclic )
found the beginning of a sequentially composed chain */
dag = NULL;
add node to dag;
while( sequentially composed successor exists
node = sequentially composed successor of node;
add node to dag;
add dag to DAG_Q;
else
go look in the children of the node */
for( child = child of node in PST )
find_sese_regions (child, DAG_Q);
Figure B.2: Maxinial Acyclic SESE
133
START
0
K
EARLY=O
EARLY =0
EARLY=2
EARLY=3
Figure B.3: EARLY
weight on an edge joining two instructions is also known as the distance between them. Similarly,
the distance between two instructions is the sum of the weights on the longest path joining them.
This is illustrated by the example in Fig. B.3. START has an EARLY value of 0, and a latency
of 0. Every other instruction has an EARLY value that is the maximum of the sum along the paths
to the instruction. In particular, I depends only on START, soit has an EARLY of 0. I has latency
2, so any instruction that depends on it must have an EARLY of at least 2. In particular J depends
only on I, and so has EARLY of 2. K depends on both I and J. The longest path to K is through I
and J with total length 3. Thus K has an EARLY of 3. We can also say that the distance from I to
Kis3.
Anti, Output and Flow
The rationale behind EARLY is that it represents the earliest point in the schedule that an instruction
can be scheduled. Thus, by weighting each dependence edge with the latency of its source
instruction, we are in effect saying that any instruction that depends on the source cannot be issued
till the source instruction completes. This is clearly true for both output and flow dependencies.
They are, therefore, weighted by the latency of the instruction.
However, If two instructions are joined by an anti dependence, then the dependent instruction
is not waiting for the instruction to complete, just for it to be issued. Thus, it can be issued in the
cycle after the source instruction is issued, and consequently its weight should be 1. These weights
or distances are summarized in Fig. B.4.
Merges
In a DAG, an instruction can have different distances from START for different control-flow paths
through the DAG. One must somehow combine these distances. One approach is to ignore the
154
DIST(START, I) =0
DIST(I, J) = latency(I)
DIST(I, J) = latency(I)
DIST(I, J) = 1
DIST(S, J) =0
DIST(I, M) =0
DIST(I, END) = latency(I)
I			J			K
L
(a)
V insns I that are successors of START
V insns J that are flow dependent on I
V insns J that are output dependent on I
V insns J that are anti dependent on I
V insns J that depend on switch 5
V merges M that depend on instruction I
V insns I on which END depends
Figure BA: DFO Edge Weights for EARLY
3			13
(b)
Figure B.5: EARLY using def-use chains
L
(c)
fact that the effect of control-flow and continue to take the maximum of all dependence-paths from
START to the instruction. However, if an EARLY for an instruction is very large on a very low
probability path, it will lead to bad choices when we use that value in our heuristic. Instead, we
would like to use a value of EARLY which takes the probability into account. One possible way
of computing a probability-weighted EARLY for an instruction is to compute the distance from
START separately for each of the control-flow paths on which the instruction lies, and then take a
weighted average of the values, where the weight is the probability of each path.
Unfortunately, with a def-use chain based dependence representation, there is no connection
between the dependence and the control structure representation of the program. Consider Fig. B.5
Assume that 1, J and K are each executed 50% of the time, and have an EARLY value of 0.
From the dependence edges shown in Fig. B.5(a), we cannot determine the control-flow structure
of the program. and therefore we cannot determine how we should average the EARLY values.
Depending on the control structure, the average EARLY value should be 13 (Fig. B.5(b)) or 8
(Fig. B.5(c)).
In the DFG representation, however, dependence edges that cross basic block boundaries' are
1Actually, its more complicated than that; b?ipassing allows us to avoid considering dependence edges where there
is no need for them. [27] contains the details.
155
mg			mg
L			L
(a)			(b)
Figure B.6: EARLY using DFG
"pinned" by switch and merge nodes. Thus, the dependences of Fig. B.5(b) and Fig. B.5(c) would
be represented as shown in Fig. B.6. Note how it is possible to identify the links between control
and dependence structure of the program using the DFG representation.
The Data-Flow Equations
The definition ofEARLY we posited in the last section is impractical. It could involves enumeratmg,
for each instruction, all ofthe control flow paths in a DAG, separately computing the EARLY value
for the instruction, and then averaging it. Since the number of control flow paths in a DAG is
potentially exponential in the size of the DAG, we clearly cannot use this definition directly, and
would like to come up with an approximation.
The approximations we make are very simple, and are an immediate consequence of the
structure of the DFG. Basically:
The EARLY value of a node2 other than a merge is found by first computing, for each of
its predecessors, the sum of EARLY for the predecessor node and distance between the two
nodes, and then using the maximum such value.
For a merge node, we first compute the value for each of the control flow directions separately,
i.e. for each control-flow direction being joined, we find the maximum EARLY plus distance
of all edges entering the merge from that direction. We then compute the weighted average
of these maximums, where the weight is the relative probability of each of the directions.
It is possible to compute the EARLY values for each instruction by solving the data-flow equa-
tions shown in Fig. B.7. We shall not discuss how these equations can be implemented efficiently;
see [27J for details. In a DAG, solving these equations involves visiting each dependence-edge
exactly once in a forward walk of the DAG. The order in which the edges are visited forms a
topological sort of the edges; i.e., no edge is visited until all its predecessor edges have been
visited, and the value of EARLY has been computed for all preceding nodes.
2Remember, a node in a DFG may be an instruction, a switch or a merge node
156
EARLY(START) =0
EARLY(J) = max(EARLY(N)+DIST(N,J))
V nodes N on which instruction J depends
EARLY(M) = Zt prob(Th)*(max(EARLY(N??) + DIST(Nj,M)))
V basic-blocks H? that are predecessors of merge M
where N? are nodes in Rj on which M depends.
EARLY(S) = max(EARLY(N)+DIST(N,S))
V nodes N on which switch 5 depends
EARLY(END) = max(EARLY(N)+DIST(N,END))
V nodes N on which END depends
Figure B.7: EARLY: Data Flow Equations
B.2.2 Computing LATE
LATEis the exactconverse ofEARLY; where EARLY ofan instruction is the path-averaged distance
from START, LATE of the instruction is its path-averaged distance to END. It will be computed
analogously, by solving a data-flow equation. However, the solution will involve a backward walk
through the DAG, in which every edge is processed once, in some reverse topological order. There
are some subtleties in the computation ofLATE that were not present in the computation ofEARLY
which shall be detailed in this section.
EARLY and branches
LATE is a measure of the distance of an instruction to END along the dependence edges. Now,
no instruction has a data-dependence from branches; thus, if we take only data-dependences into
account, a branch is not on the critical path, and can be scheduled as close to END as desired.
Correspondingly, any instructions that are used solely to compute the conditional for a branch can
also be executed fairly close to END.
However, we have to account for the effect of control dependence. Consider the situation
without speculation. Ifthe execution of a branch gets delayed, then the execution of all instructions
which are control-dependent on the branch will also get delayed. Therefore, branches and the
instructions that compute the conditionals for those branches should be scheduled early. The
question is, how soon before END should we execute a branch, so as to take into account the effects
of control dependence?
Assume that the DAG could be scheduled so that all instructions would execute as soon as
possible. Then, on the average, an instruction would be executed EARLY cycles after the start of
the computation. Further, the distance between the execution of the instruction and END would be
the difference between their EARLY values. This is the value we use for the LATE of a branch:
the difference between the EARLY of END and the EARLY of the branch.
157
LATE(START) = max(LATE(N)+DIST(N,START))
V nodes N which depend on START
LATE(BR) = EARLY(END) - EARLY(BR)
V branch instructions BR
LATh(I) = max(LATh(N)+DIST(I,N))
V nodes N which depend on instruction I, I not a branch
LATh(M) = max(LATh(N)+DIST(M,N))
V nodes which depend on merge M
LATE(S) = ?? wob(Th)*(max(LATE(N?) + DIST(Nj,S)))
V basic-blocks !?? that are successors of the fork
where Nj are nodes in Rj which depend on switch 5
LATE(END) =0
Switches
Figure B.8: LATE: Data Flow Equations
Just as we combined FARLYs values from different control flow paths at joins by using merge
nodes, so, while computmg LATE, we combine LATE values from different control flow paths at a
fork by using switch nodes. This can be seen in the data-flow equations we use to compute LATE,
shown in Fig. B.8. Note that it is the existence of both switch and merge nodes makes it possible
to solve both of these problems in a symmetric fashion.
B.3 Scheduling DAGs
Most of the details of scheduling DAGs were covered in Chapter 5. In this section, we shall skimp
on those parts, and focus on the areas not previously covered. In particular, we examine how we
determine when an instruction is ready, and how we compute LATE for instructions that are moved
or replicated.
B.3.1 Determining Readiness
An instruction will become ready only after all its predecessors have been scheduled. At this point,
it is possible to compute the exact cycle at which the instruction will become ready. So, basically,
every time we schedule an instruction, we examine all its data-dependent successors. If, for any
successor, this instruction was the last unscheduled predecessor, we compute the READY value
for that successor. The READY value of an instruction represents the earliest cycle at which the
instruction can be scheduled. It can be found by adding the SCHED time (i.e. the cycle at which
actually scheduled) of all predecessors to the distance between the predecessor and the instruction,
and then taking the maximum such value. This process is sketched in Fig. B.9.
158
sched_node(PST_Node sched, int cyle)
/* sched is to be scheduled this cycle */
/* Book-keeping, code-motion and other actions */
SCHED(sched) = cycle;
for( node in descendants of sched
if( all predecessors of node have been scheduled
ready = 0;
for( pred in predecessors of node
ready = max(ready, SCHED(pred)+DIST(pred, node))
READY (node) = ready;
if( node is switch or node is merge
/* schedule node */
SCHED(node) = ready;
set_ready successors (node, ready);
else
add node to appropriate ready queue.
Figure B.9: Computing READY
Note that we handle switches and merges specially. If an instruction is ready, all other instruc-
tions that depend on it may become ready. These instructions may be in different basic blocks,
in which case the there will be an intervening DFG node such as a switch or a merge. So, if
scheduling the instruction makes the switch or merge node ready, we immediately start processing
all successors of the node for readiness. We had mentioned in Chapter 5 that an instruction is
not considered ready till all its predecessors on all control-paths were ready. As can be seen from
Fig. B.9, we do not consider a merge node ready till all its predecessors on all control paths are
ready; thus, any instruction that depends on the node will be examined only after the merge node
is ready, and, consequently, the instructions predecessors on all paths are themselves ready.
B.3.2 Recomputing LATE
During the scheduling process, instructions are moved and/or replicated. As a consequence, their
LATE values may need to be recomputed. Note that when an instruction is moved up through
139
I LATE=3
to
LATE=1 I			I LATE =5
LATE=O J			K LATE=O			fi			K?5
(a)			(b)
Figure B.lO: Recomputing LATE
a join, its LATE value remains the same. Consequently, the LATE value of any instructions (as
opposed to merge nodes) that it depends on also remain untouched. Similarily, when an instruction
is moved up through a branch, its LATE value remains the same. When an instruction is moved
up through a branch, it is being speculated. Consequently, there are no additional dependences
introduced, to instructions on the other side of the branch. Thus, its path(s) to end remain the same,
and consequently its LATE value. Further, since the only time an instruction is moved through a
switch is when it is being scheduled, all of its predecessors must alsready have been scheduled.
There is no need to update their LATE values.
The only time an instruction's LATE value needs to be updated is when an instruction is moved
down past a branch, and replictated on both sides. As shown in Fig. B.lO, instead of there being
two paths to end, that need to be merged at the switch nodes, there are two instructions, each with
one path. Updating the LATE values is fairly trivial; it mearly requires examining the immediate
successors of the node, and applying the equations given in Fig. B.8.
When moving multiple instructions below a branch, we first move those instructions that do not
depend on any instruction that is to be moved. Mter moving those, and recomputing their LATE
values, we then move those instructions that do not depend on any of the remaining instructions to
be moved, and so on. This bottom-up order of moving instructions ensures that all LATE values
will be correctly updated.
B.4 Best-3 and Tags
In this section, we shall briefly argue why the speculations performed using the best-3 strategy can
be implemented using 3 tags, without requiring any spills. As we had mentioned in Chapter 6,
there are two reasons why a best-N strategy might require more than N tags for some basic block:
Duplicate speculation
Live-through
We shall argue that neither of these cases can arise with the best-3 strategy m the current
implementation.
160
A			A
c			I-			c
Figure B.ll: Duplicate Speculation: I
B.4.1 Duplicate Speculation
For duplicate speculation to arise, an instruction must be speculated past two branches. Thisi the
situation shown in Fig. B.1 1. Note that J*j will succeed if the both branches are true, w? the
1*2 will succeed if the first branch is false and the second is true. As was discussed in Ch?r 6,
we need two tags to represent the speculation of I; one to represent the true-true speculanca and
the other the true-false speculations.
On the basis of Fig. B.l 1, it may appear that the best-3 strategy can give rise to du?ate
speculation; simply pick blocks A, B, and C. Notice, however, that the X to B edge is a
member of a fork and a join. During the preprocessing step, we would have split this ed?into
two, giving rise to the DAG shown in Fig. B.12. As shown in that figure, if blocks A, B, andChad
been picked, and I had been speculated, it would have resulted in 1*2 being copied into D. ?ince
D is not a block picked by the best-3 selection strategy for code motion into X, the 1*2 cansever
be moved into A. A similar situation would arise if block D, B, and C were chosen; in thMcase,
one of the copies of the I would be moved into A, and then no further. Clearly, we need to ? all
4 blocks, A, B, C, and D, before duplicate speculation becomes possible.
B.4.2 Live-through
For a speculated instruction, 1*, to be live through a block X, there must be some earlier Mock
A into which the instruction was moved from its original home block H. The speculationt 1*
must be unresolved through X. Thus, there must be a path from A through X to H. The *cks
picked for speculation into X itself, however, must not include H. This implies that there lust,
for the best-N strategy, be N descendants of X that are connected to X by paths not includ?? N.
Further, the best-N strategy picks the best descendants of a block. If a best-N strategy for Xdces
not pick H, then the best-N strategy for A will pick H only if there is an alternate path frosiA to
H (not including X) on which H has a higher probability. The smallest situation in which aIl?ese
conditions can be satisfied is shown in Fig. B.13.
161
mA
B
c I'			c
Figure B.l2: Duplicate Speculation: II
A			A
x			Th
H `?			H
Figure B.l3: Live-through: I
Note, however, that it contains 2 edges that are members of both a fork and a join. Splitting
them would yield the graph shown in Fig. B.14. Now, if A picks B, H and some other descendant,
possibly X, the resulting speculation as impThmented would not be as shown. In particular, the
speculation would not be live through X. This is a consequence of our redundancy elimination
algorithm. It halts as soon as it reaches a branch that does not lead to another node. Therefore, it
will not eliminate 1*2, and thereby make 1* live through X.
Not that if we were using a best-4 strategy, and A picked B, H, X and C, then the elimination
would occur, yeilding the situation shown in Fig. B.15.
162
H1
B
H
Figure B.14: Live-through: II
H
Figure B.15: Live-through: III
Bibliography
[1] A. Aho, R. Sethi, and J. Uliman. Compilers: Principles, Techniques and Tools. Addison-
Wesley, Reading, MA, 1986.
[2]
A. Aiken and A. Nicolau. Optimal loop parallelization. In Prnceedfligs ofthe 1988 SIGPLAN
Conference on Programming Language Design and !mpThmentation, pages 308-317, June
1988.
[3] M. Auslander and M. Hopkins. An overview of the PL.8 compiler. In Proceedings of the
1982 SIGPLAN Symposium on Compiler Constnwtion, pages 22-31, June 1982.
[4]
t Ball and J.R. Larus. Branch prediction for free. In Proceedings of the 1993 SIGPLAN
Conference on Programming Language Design and lmpThmenrntion, pages 300-313, June
1993.
[5] M.C. Becker, M.S. Allen, C.R. Moore, J.S. Muhich, and D.P. Tuttle. Powerpc 601 micropro-
cesor. IEEE Micro, 13(5):54-68, October 1993.
[6]
[7]
[8]
D. Bernstein and M.R. Rodeh. Global instruction scheduling for superscalar machines.
In Prnceedings of the 1991 SlGPLAN Conftrence on Prngrnmming Language Design and
ImpThmentation, pages 241-253, June 1991.
D.G. Bradlee, S.J. Eggers, and R.R. Henry. Integrating register allocation and instruction
scheduling for RISCs. In Fourth Internafional Symposium on Architectural Sitipport for
Programming Languages and Operating Systems, pages 122-13 l, April 1991.
M. Butler, t Yeh, Y. Patt, M. Alsup, M. Shebanow, and H. Scales. Single instruction level
parallelism is greater than two. In Proceedings of the 18th Annual h?tenwtiotuzl Symposium
on Computer Architecture, pages 276-286, May 1991
[9] GJ. Chaitin. Register allocation and spilling via graph coloring. In Proceedings of the 1982
SIGPLAN Symposium on Compiler Constn?cnon, pages 98-105, June 1982.
[10] E.G. Coffman, editor. Computer and Job-shop Scheduling Theory. John Wiley and Sons,
New York, NY, 1976.
[11] K. Diefendorff and M. Allen. Organization of the motorola 88110 superscalar risc micropro-
cessor. IEEE Micro, 12(2):40-63, April 1992.
163
164
[12]
[13]
H. Dwyer and H.C. Tomg. Out-of-order superscalar processor with speculative execution
and fast, precise interrupts. In Prnceedings of the 25th Annual International Symposium on
Micrnarchitecfl?re, pages 272--H281, 1992.
K. Ebcioglu. A compilation techmque for software pipelining of loops with conditional
branches. In Proceedings of the 20th Annual Micrnprngrnmming Workshop, pages 69-197,
October 1987.
[14] K. Ebcioglu. Some design ideas for a VLIW architecture for sequential natured software. In
Parallel Processing, pages 3--H21, April 1988.
[15] J.R. Ellis. Bulldog: A Compilerfor VLIW Architectures. MIT Press, Cambridge, MA, 1986.
[16]
J.A. Fisher. Global code generation for instruction-level parallelism: Trace scheduling-2.
Technical Report HPL-93-43, Computer Research Center, Hewlen-Packard Company, June
1993.
[17] C. Foster and E. Riseman. Percolation ofcode to enhance parallel dispatching and execution.
IEEE Transactions on Computers, C-21(10): 1411--H1415, December 1972.
[18]
[19]
Phillip B. Gibbons and Steven 5. Muchnick. Efficient instruction scheduling for a pipelined
architecture. In Proceedings of the 1986 SJGPLAN Symposium on Compiler ConstructThn,
pages 11-16, 1986.
J.R. Goodman and W.-C. Hsu. Code scheduling and register allocation in large basic blocks.
In Proceedings of the 1988 biternational Conference on Supercomputing, pages 442--H452,
July 1988.
[20] G.F. Grohoski. Machine organization ofthe IBM RISC System/6000 processor. IBM Journal
ofResearch atulDeveThpment, 34(1):37-58, January 1990.
[21] T.R. Gross and J.L. Hennessy. Optimizing delayed branches. In Prnceedings of the IEEE
Micrn-15, pages 114-120, October 1982.
[22]
LJ. Hendren, G.R. Gao, E.R. Aitman, and C. Mukerji. Register allocation using cyclic
interval graphs: A new approach to an old problem. Jourmal of Programming Languages,
1(3), 1993. Draft.
[23] J.L. Hennessy and N.P. Jouppi. Computer technology and architecture: An evolving interac-
tion. Computer, 24:18-29, September 1991.
[24] Richard Huff. Lifetime-sensitive modulo scheduling. In Prnceedings of the SIGPLAN `93
Conference on Programming Language Design and Implementation, pages 78-89, June 1993.
W.M. Hwu, S.A. Mahlke, W.Y. Chen, RP. Chang, N.J. Warter, R.A. Bringmann, R.G. Ouel-
lette, R.E. Hank, T. Kiyohara, G.E. Haab, J.G. Holm, and D.M. Lavery. The superblock:
An effective technique for VLIW and superscalar compilation. Joun?l of Supereomputing,
7(1-2):229-248, May 1993.
[25]
163
[26]
W.M. Hwu and Y.N. Patt. Checkpoint repair for out-of-orderexecution machines. InProceed-
ings of the 14th Annual !nteniatioiial Sjvmposium 071 Computer Architecfl?re, pages 18--H26,
June 1987
[27] R. Johnson. Dependence Flow Graphs. Phi) thesis, Department of Computer Science,
Cornell University, to appear 1994.
[28]
Richard Johnson and Keshav Pingali. Dependence-based program analysis. In PrnceedThgs
of the SlGPLAN `93 Conference 071 Programming Language Design a?ul impThmenrntion,
pages 78-89, June 1993.
[29] W.M. Johnson. Superscaiar Microprocessor Design. Prentice-Hall, Englewood Cliffs, NJ,
1991.
[30]
N. Jouppi and D. Wall. Available instruction-level parallelism for superscalar and super-
pipelined machines. In Third h?ternational Symposium on Architecturni Support for Pro-
gramming Languages arut Opernfing Systems, pages 272-282, April 1989.
[31] R.M. Keller. Look-ahead processors. Compuflng Sur"eys, 7(4):177-195, December 1973.
[32] P.M. Kogge. The Architecture ofPipelined Computers. McGraw-Hill, New York, 1989.
[33] 5. Krishnamurthy. A brief survey of papers on scheduling for pipelined processors. ACM
SIGPLAN Notices, 25(7):97-1[)6, July 1990.
[34] M.S. Lam. A Systolic Array Opflmizi'tg Compiler. Kluwer Academic Publisher, Norwell,
MA, 1989.
[33] M.S. Lam and R.P. Wilson. Limits of control flow on parallelism. In Proceedings ofthe 19th
Annual h?ternaflb,uzl Symposium on Computer Architecture, pages 46-37, June 1992.
[36] Eugene Lawler, Jan Karel Lenstra, Charles Martel, and Barbara Simons. Pipeline scheduling:
A survey. Technical Report RI 3738, IBM Research Division, San Jose, CA., 1987.
[37] J.K.F. Lee and AJ. Smith. Branch prediction strategies and branch target buffer design.
Computer, 17(1):6-22, January 1984.
[38]
J.L. Linn. Srdag comapction - a generalization oftrace scheduling to increase the use ofglobal
context information. In PrnceedThgs ofthe 19th Annual Micrnprngrnmmii? Workshop, pages
11-20, October 1983.
[39] J.S. Liptay. Design of the IBM Enterprise System/9000 high-end processor. IBM Journal of
Research and DeveThpment, 36(4):71 3--H731, July 1992.
P.0. Lowney, S.M. Freudenberger, T.J. Karzes, W.D. Lichtenstein, R.R Nix, J.S O'Donnell,
and J.C. Ruttenberg. The multiflow trace scheduling compiler. Jounuzl of Supercomputh?g,
7(1-2):51-142, May 1993.
[40]
166
[41]
S.A. Mahike, D.C. Lin, W.Y. Chen, R.E. Hank, and R.A. Bringmann. Effecflve compiler
support for predicated execution using the hyperbiock. In Proceedings of the 25th Annual
Symposium on Micrnachitecnre, pages 45-54, December 1992.
[42] N. Malik, R. Eickemeyer, and 5. Vassiliadis. Interlock collapsing ALU for increased
instruction-level parallelism. In Prnceedings of the 25th Annual Internaflbnal Symposium
on Micrnarchitecn?re, pages 149-157, 1992.
[43] R.K. Montoye, E. Hokenek, and S.L. Runyon. Design ofthe IBM RISC Systeni/6000 floating-
point execution unit. lfiMjouriwl ofResearch andDeveThpment, 34(1):59-70, January 1990.
[44] S.-M. Moon. Compile-time Parallelization ofNon-numencal Code; VL!W and Superscalar.
PhD thesis, Department of Computer Science, University of Maryland, May 1993.
[45]
S.-M. Moon and K. Ebcioglu. An efficient resource-constrained global scheduling technique
for superscalar and VLIW processors. In Proceedings of the 25th Annual Thter,iatio,?l
Symposh?m on Micrnarchitecture, pages 55-71, 1992.
[46] M. Moudgill, K. Pingali, and 5. Vassiliadis. Pipelining and precise interrupts. Technical
report, Department of Computer Science, Cornell University, to appear 1993.
[47] J.C.H. Park and M. Schlansker. On predicated execution. Technical report, Hewlett-Packard
Labs, May 1991.
[48]
K. Pingali, M. Beck, R. Johnson, M. Moudgill, and P. Stodghill. Dependence Flow Graphs:
An algebraic approach to program dependencies. In Proceedings ofthe 18th ACM Symposium
on Principles ofProgrnmmThg Languages, pages 67-78, January 1991.
[49] V. Popescu, M. Schultz, J. Spracklen, G. Gibson, B. Lightner, and D. Isaman. The Metaflow
architecture. IEEE Micro, 1 1(3):10--H13,63--H73, June 1991.
[50] B.R. Rau and J.A. Fisher. Instruction-level parallel processing: History, overview and
perspective. Journal ofSupercompuflng, 7(1-2):9--H50, May 1993.
[51]
B.R. Rau and C.D. Glaeser. Some scheduling techniques and an easily schedulable horiwntal
architecture for high performance scientific computing. In Proceedings of the 14th Annual
Micrnprngrnmming Workshop, pages 184-197, October 1981.
[52] lE. Riseman and C. Foster. The inhibition of potential parallelism by conditional jumps. IEEE
Trnnsaaions on Computers, C-21(10):1405-141 1, December 1972.
S.A.Mahlke, WY. Chen, W.W. Hwu, B.R. Rau, and M.S. Schlansker. Sentinel scheduling
for VUW and superscalar processors. In Fifth internaflonal Symposium 011 Architecfl?ral
Support for Programming Languages and Operating Systems, pages 238-247, November
1992.
[53]
167
[54] J.E. Smith and A.R. Pleszkun. Implementation of precise interrupts in pipelined processors.
In Proceedings ofthe 12th Annual !nten?ational Symposhm on ComputerA7chitecfl?re, pages
36-44, June 1985.
[55] M.D. Smith. SupportforSpeculative Execunbn in High-Pefformance Processors. PhD thesis,
Department of Electrical Engineering, Stanford University, November 1992.
[36]
[57]
[58]
M.D. Smith, M.S. Lam, and M.A. Horowitz. Boosting beyond static scheduling in a super-
scalar processor. In Prnceedings of the 17th Annual Inteniational Symposium on Computer
Architecture, pages 344-354, May 1990.
M.D. Smith, M.S. Lam, and M.A. Horowitz. i:?fncient superscalar performance through
boosting. In Fifth International Symposh?m on Architectural Support for Programming
Languages and Operating Systems, pages 248-259, November 1992.
0. Sohi and 5. Vajapeyam. Instruction issue logic for high-performance, interruptable
pipelined processors. In Proceedings of the 14th Annual internanbizal Symposium on Com-
puwr Architecture, pages 27-34, June 1987.
[59] B. Su, 5. Ding, and L. Jin. An extension to urcr for software pipelining. In Proceedings of
the 19th Arniual Microprogramming Workshop, pages 94-103, October 1986.
[60] J.B. Thornton. Design of a Computer--HThe Control Data 6600. Scott, Foresman and Co.,
Glenview, IL, 1970.
[61] P. Tirumalai, M. Lee, and M. Schlansker. Parallelization of loops with exits on pipelined
architectures. In IEEE Proceedings ofSupercomputing `90, pages 200-212, November 1990.
[62] 0.5. Tjaden and M.J. Flynn. Detection and parallel execution of independent instructions.
IEEE Iransactions 071 Computers, C-19(10):889-895, October 1970.
[63] R.M. Tomasulo. An efficient algorithm for exploiting multiple arithmetic units. IBM Journal
ofResearch and Development, 11(1 ):25-33, January 1967.
[64] 5. Vassiliadis, B. Blaner, and R. Eickemeyer. SCISM: A scalabale compound instruction set
machine. IBM Journal ofResearch and Development, to appear 1993.
[65] 5. Vassiliadis and J. Phillips. Proofofcorrectness ofhigh performance 3-1 interlock collapsing
ALUs. IBM Journal ofResearch andDeveThpment, 37(l):l2-21, January 1993.
[66] 5. Vassiliadis, J. Phillips, and B. Blaner. Interlock collapsing ALUs. IEEE Transactions on
Computers, 42(7):825-839, July 1993.
[67] D.W? Wall. Limits of instruction-level parallelism. In Fourth Thtenuinonal Symposium 07Z
Architecturnl Supportfor Prngramming Langz?ges and Opernting Systems, pages 176-188,
April 1991.
168
NJ. Warter, J.W. Bockhaus, G.E. Haab, and K. Subramanian. Enhanced modulo scheduling
for loops with conditional branches. In Proceediiigs of the 25th Annual Symposium on
Micrnachitecfl?re, pages 170--H179, December 1992.
[68]
