Pipelining and Hazards

Prof. Hakim Weatherspoon
CS 3410, Spring 2015
Computer Science
Cornell University

See P&H Chapter: 4.6-4.8
Announcements

Prelim next week

Tuesday at 7:30.

Go to location based on netid

[a-g]* → MRS146: Morrison Hall 146
[h-l]* → RRB125: Riley-Robb Hall 125
[m-n]* → RRB105: Riley-Robb Hall 105
[o-s]* → MVRG71: M Van Rensselaer Hall G71
[t-z]* → MVRG73: M Van Rensselaer Hall G73

Prelim reviews

TODAY, Tue, Feb 24 @ 7:30pm in Olin 255
Sat, Feb 28 @ 7:30pm in Upson B17

Prelim conflicts

Contact Deniz Altinbuken <deniz@cs.cornell.edu>
Announcements

Prelim1:

- Time: We will start at 7:30pm sharp, so come early
- Location: on previous slide
- Closed Book
  - Cannot use electronic device or outside material
- Practice prelims are online in CMS

- Material covered: everything up to the end of this week
  - Everything up to and including data hazards
  - Appendix B (logic, gates, FSMs, memory, ALUs)
  - Chapter 4 (pipelined [and non] MIPS processor with hazards)
  - Chapter 2 (Numbers / Arithmetic, simple MIPS instructions)
  - Chapter 1 (Performance)
  - HW1, Lab0, Lab1, Lab2, C-Lab0, C-Lab1
Goals for Today

RISC and Pipelined Processor: Putting it all together

Data Hazards

• Data dependencies
• Problem, detection, and solutions
  – (delaying, stalling, forwarding, bypass, etc)
• Hazard detection unit
• Forwarding unit

Next time

• Control Hazards
  What is the next instruction to execute if a branch is taken? Not taken?
MIPS Design Principles

Simplicity favors regularity
  • 32 bit instructions

Smaller is faster
  • Small register file

Make the common case fast
  • Include support for constants

Good design demands good compromises
  • Support for different types of interpretations/classes
## Recall: MIPS instruction formats

All MIPS instructions are 32 bits long and come in 3 formats

### R-type

<table>
<thead>
<tr>
<th>Field</th>
<th>Width</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>op</code></td>
<td>6 bits</td>
</tr>
<tr>
<td><code>rs</code></td>
<td>5 bits</td>
</tr>
<tr>
<td><code>rt</code></td>
<td>5 bits</td>
</tr>
<tr>
<td><code>rd</code></td>
<td>5 bits</td>
</tr>
<tr>
<td><code>shamt</code></td>
<td>5 bits</td>
</tr>
<tr>
<td><code>func</code></td>
<td>6 bits</td>
</tr>
</tbody>
</table>

### I-type

<table>
<thead>
<tr>
<th>Field</th>
<th>Width</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>op</code></td>
<td>6 bits</td>
</tr>
<tr>
<td><code>rs</code></td>
<td>5 bits</td>
</tr>
<tr>
<td><code>rt</code></td>
<td>5 bits</td>
</tr>
<tr>
<td><code>immediate</code></td>
<td>16 bits</td>
</tr>
</tbody>
</table>

### J-type

<table>
<thead>
<tr>
<th>Field</th>
<th>Width</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>op</code></td>
<td>6 bits</td>
</tr>
<tr>
<td><code>immediate</code></td>
<td>26 bits</td>
</tr>
</tbody>
</table>
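
The field widths above map directly onto shifts and masks. The following is a minimal Python sketch, assuming the conventional MIPS32 layout with `op` in the most significant bits and the fields packed in the order listed. The helper names `bits`, `decode_r`, `decode_i`, and `decode_j` are illustrative and do not come from the course material.

```python
def bits(word: int, hi: int, lo: int) -> int:
    """Return bits hi..lo (inclusive) of a 32-bit instruction word."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)

def decode_r(word: int) -> dict:
    """Split an R-type word into op, rs, rt, rd, shamt, func."""
    return {
        "op": bits(word, 31, 26),
        "rs": bits(word, 25, 21),
        "rt": bits(word, 20, 16),
        "rd": bits(word, 15, 11),
        "shamt": bits(word, 10, 6),
        "func": bits(word, 5, 0),
    }

def decode_i(word: int) -> dict:
    """Split an I-type word into op, rs, rt, and the 16-bit immediate."""
    return {
        "op": bits(word, 31, 26),
        "rs": bits(word, 25, 21),
        "rt": bits(word, 20, 16),
        "immediate": bits(word, 15, 0),
    }

def decode_j(word: int) -> dict:
    """Split a J-type word into op and the 26-bit immediate."""
    return {"op": bits(word, 31, 26), "immediate": bits(word, 25, 0)}

# 0x00221820 is the standard MIPS encoding of an add with rd=3, rs=1, rt=2.
print(decode_r(0x00221820))
```

The immediate is returned as a raw bit pattern; sign or zero extension is left to the caller.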
Recall: MIPS Instruction Types

Arithmetic/Logical

• R-type: result and two source registers, shift amount
• I-type: 16-bit immediate with sign/zero extension

Memory Access

• load/store between registers and memory
• word, half-word and byte operations

Control flow

• conditional branches: pc-relative addresses
• jumps: fixed offsets, register absolute
Recall: MIPS Instruction Types

**Arithmetic/Logical**
- ADD, ADDU, SUB, SUBU, AND, OR, XOR, NOR, SLT, SLTU
- ADDI, ADDIU, ANDI, ORI, XORI, LUI, SLL, SRL, SLLV, SRLV, SRAV, SLTI, SLTIU
- MULT, DIV, MFLO, MTLO, MFHI, MTHI

**Memory Access**
- LW, LH, LB, LHU, LBU, LWL, LWR
- SW, SH, SB, SWL, SWR

**Control flow**
- BEQ, BNE, BLEZ, BLTZ, BGEZ, BGTZ
- J, JR, JAL, JALR, BEQL, BNEL, BLEZL, BGTZL

**Special**
- LL, SC, SYSCALL, BREAK, SYNC, COPROC
Pipelining

Principle:
Throughput increased by parallel execution
Balanced pipeline very important
Else slowest stage dominates performance

Pipelining:
• Identify pipeline stages
• Isolate stages from each other
• Resolve pipeline hazards (this and next lecture)
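
To see why balance matters, the sketch below compares a balanced and an unbalanced five-stage pipeline with the same total logic delay. The stage delays are hypothetical numbers chosen for illustration, and the model assumes no hazards and ignores pipeline-register overhead.

```python
def clock_period(stage_delays_ps):
    """The pipelined clock must be long enough for the slowest stage."""
    return max(stage_delays_ps)

def pipelined_time(n, stage_delays_ps):
    """Time for n instructions: fill the pipeline once, then one per cycle."""
    k = len(stage_delays_ps)
    return (k + n - 1) * clock_period(stage_delays_ps)

def unpipelined_time(n, stage_delays_ps):
    """Time for n instructions when each one runs all stages back to back."""
    return n * sum(stage_delays_ps)

balanced = [200, 200, 200, 200, 200]     # hypothetical delays in ps
unbalanced = [100, 100, 600, 100, 100]   # same total, one slow stage

n = 1000
for name, delays in (("balanced", balanced), ("unbalanced", unbalanced)):
    speedup = unpipelined_time(n, delays) / pipelined_time(n, delays)
    print(f"{name}: speedup {speedup:.2f}x")
```

With these numbers the balanced pipeline approaches a 5x speedup, while the unbalanced one stays below 2x because every cycle is as long as its 600 ps stage.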
Basic Pipeline

Five stage “RISC” load-store architecture

1. Instruction fetch (IF)
   - get instruction from memory, increment PC
2. Instruction Decode (ID)
   - translate opcode into control signals and read registers
3. Execute (EX)
   - perform ALU operation, compute jump/branch targets
4. Memory (MEM)
   - access memory if needed
5. Writeback (WB)
   - update register file
Pipelined Implementation

- Each instruction goes through the 5 stages
  - Each stage takes one clock cycle
    - So slowest stage determines clock cycle time
The pipeline achieves

A) Latency: 1, throughput: 1 instr/cycle
B) Latency: 5, throughput: 1 instr/cycle
C) Latency: 1, throughput: 1/5 instr/cycle
D) Latency: 5, throughput: 5 instr/cycle
E) None of the above
Latency: 5 cycles
Throughput: 1 instr/cycle
Concurrency: 5

CPI = 1
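
These numbers can be checked with a small scheduling sketch in Python. It assumes an ideal pipeline with no hazards, one instruction fetched per cycle; the names `schedule` and `total_cycles` are made up for this example.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def schedule(num_instrs):
    """Cycle in which instruction i occupies each stage, assuming no hazards."""
    return [{stage: i + s + 1 for s, stage in enumerate(STAGES)}
            for i in range(num_instrs)]

def total_cycles(num_instrs):
    """Cycles until the last instruction leaves WB."""
    return len(STAGES) + num_instrs - 1 if num_instrs > 0 else 0

plan = schedule(5)
latency = plan[0]["WB"] - plan[0]["IF"] + 1   # cycles spent by one instruction
print("latency:", latency)
print("cycles for 5 instructions:", total_cycles(5))
print("cycles per instruction for 1000 instructions:", total_cycles(1000) / 1000)
```

Each instruction still takes 5 cycles from fetch to writeback, but for long instruction streams the cycles per instruction approach 1.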
Pipelined Implementation

- Each instruction goes through the 5 stages
  - Each stage takes one clock cycle
    - So slowest stage determines clock cycle time

- Stages must share information. How?
  - Add pipeline registers (flip-flops) to pass results between different stages
Pipelined Processor

Fetch → Decode → Execute → Memory → WB

[Figure: five-stage pipelined datapath; blocks labeled Memory, Register File, ALU, Control, New PC, Extend, Compute Jump/Branch Targets, Addr, Memory.]

And is this it?
Not quite....
Hazards

3 kinds

• Structural hazards
  – Multiple instructions want to use same unit

• Data hazards
  – Results of instruction needed before ready

• Control hazards
  – Don’t know which side of branch to take

Will get back to this

First, how to pipeline when no hazards
Example: Sample Code (Simple)

```
add     r3, r1, r2;
nand    r6, r4, r5;
lw      r4, 20(r2);
add     r5, r2, r5;
sw      r7, 12(r3);
```
Example: Sample Code (Simple)

Assume eight-register machine

Run the following code on a pipelined datapath

```
add  r3, r1, r2    ; reg 3 = reg 1 + reg 2
nand r6, r4, r5    ; reg 6 = ~(reg 4 & reg 5)
lw   r4, 20(r2)    ; reg 4 = Mem[reg 2 + 20]
add  r5, r2, r5    ; reg 5 = reg 2 + reg 5
sw   r7, 12(r3)    ; Mem[reg 3 + 12] = reg 7
```
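
Before following the code through the pipeline, it helps to know what the program should compute when run one instruction at a time. The Python sketch below executes the five instructions in program order. The initial register and memory contents are assumptions for illustration: r1 = 36, r2 = 9, r5 = 7, and r7 = 22 echo values that appear in the walkthrough, while r4 = 18 and Mem[29] = 99 are chosen so the results are easy to check.

```python
def run(regs, mem):
    """Execute the five sample instructions sequentially (no pipeline)."""
    regs[3] = regs[1] + regs[2]        # add  r3, r1, r2
    regs[6] = ~(regs[4] & regs[5])     # nand r6, r4, r5
    regs[4] = mem[regs[2] + 20]        # lw   r4, 20(r2)
    regs[5] = regs[2] + regs[5]        # add  r5, r2, r5
    mem[regs[3] + 12] = regs[7]        # sw   r7, 12(r3)
    return regs, mem

# Assumed initial state of the eight-register machine (illustrative only).
regs = [0, 36, 9, 12, 18, 7, 41, 22]
mem = {29: 99}
run(regs, mem)
print(regs)
print(mem)
```

A correct pipeline must leave the registers and memory in exactly this state, regardless of how the instructions overlap in time.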
At time 1, Fetch

add r3 r1 r2

[Figure: pipelined datapath animation. Each slide advances one clock cycle and shows the sample code moving through the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers.]

Time 2/3: fetch nand 6 4 5 behind add 3 1 2

Time 3/4: fetch lw 4 20(2) behind nand 6 4 5 and add 3 1 2

nand of 18 and 7:

18 = 01 0010
7 = 00 0111
------------------
-3 = 11 1101

Time 7: no more instructions to fetch. Still in the pipeline: sw 7 12(3), add 5 2 5, lw 4 20(2)

Register File at time 7

- R0: 0
- R1: 36
- R2: 9
- R3: 45
- R4: 99
- R5: 7
- R6: -3
- R7: 22
Takeaway

Pipelining is a powerful technique to mask latencies and increase throughput

- Logically, instructions execute one at a time
- Physically, instructions execute in parallel
  - Instruction level parallelism

Abstraction promotes decoupling

- Interface (ISA) vs. implementation (Pipeline)
Hazards

See P&H Chapter: 4.7-4.8
Hazards

3 kinds

• Structural hazards
  – Multiple instructions want to use same unit

• Data hazards
  – Results of instruction needed before ready

• Control hazards
  – Don’t know which side of branch to take
Next Goal

What about data dependencies (also known as a data hazard in a pipelined processor)?

e.g. add r3, r1, r2
     sub r5, r3, r4

Need to detect and then fix such hazards
Data Hazards

- register file reads occur in stage 2 (ID)
- register file writes occur in stage 5 (WB)
- next instructions may read values about to be written
  - i.e. instruction may need values that are being computed further down the pipeline
  - *in fact, this is quite common*
Data Hazards

add r3, r1, r2
sub r5, r3, r4
lw r6, 4(r3)
or r5, r3, r5
sw r6, 12(r3)

[Figure: pipeline timing diagram; clock cycles 1-9 run along the time axis and each instruction passes through IF, ID, Ex, M, W.]
How many data hazards are due to r3 only?

A) 1
B) 2
C) 3
D) 4
E) 5
```
add r3, r1, r2    ; r3 = 10 before, r3 = 20 after
sub r5, r3, r4
lw  r6,  4(r3)    ; r6 = Mem[r3 + 4]
or  r5, r3, r5
sw  r6, 12(r3)
```
Data Hazards

- Register file reads occur in stage 2 (ID).
- Register file writes occur in stage 5 (WB).
- Next instructions may read values about to be written.

E.g.,

```
add r3, r1, r2
sub r5, r3, r4
```

How to detect?
Detecting Data Hazards

IF/ID.Ra ≠ 0 &&
(IF/ID.Ra == ID/Ex.Rd ||
 IF/ID.Ra == Ex/M.Rd ||
 IF/ID.Ra == M/W.Rd)
Data Hazards

• register file reads occur in stage 2 (ID)
• register file writes occur in stage 5 (WB)
• next instructions may read values about to be written

How to detect? Logic in ID stage:

\[ \text{stall} = (\text{IF/ID.Ra} \neq 0 \land (\text{IF/ID.Ra} = \text{ID/EX.Rd} \lor \text{IF/ID.Ra} = \text{EX/M.Rd} \lor \text{IF/ID.Ra} = \text{M/WB.Rd})) \lor (\text{same for Rb}) \]
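
The stall condition can be written as a small predicate. This is a Python sketch of the logic, not the hardware itself; it assumes `None` stands for a pipeline register whose instruction writes no register (for example a bubble or a store), and the names `uses` and `stall` are illustrative.

```python
from typing import Optional

def uses(src: int, rd: Optional[int]) -> bool:
    """True when a nonzero source register matches a pending destination."""
    return src != 0 and rd is not None and src == rd

def stall(ra: int, rb: int,
          id_ex_rd: Optional[int],
          ex_m_rd: Optional[int],
          m_wb_rd: Optional[int]) -> bool:
    """Stall the ID-stage instruction if Ra or Rb is still being produced."""
    pending = (id_ex_rd, ex_m_rd, m_wb_rd)
    return any(uses(ra, rd) or uses(rb, rd) for rd in pending)

# sub r5, r3, r4 in ID while add r3, r1, r2 is in EX: must stall.
print(stall(3, 4, id_ex_rd=3, ex_m_rd=None, m_wb_rd=None))
# Same instruction once the add has left the pipeline: no stall.
print(stall(3, 4, id_ex_rd=None, ex_m_rd=None, m_wb_rd=None))
```

Register 0 never triggers a stall, matching the `Ra ≠ 0` term in the condition.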
Detecting Data Hazards

add r3, r1, r2
sub r5, r3, r5
or r6, r3, r4
add r6, r3, r8
Data hazards occur when an operand (register) depends on the result of a previous instruction that may not be computed yet. A pipelined processor needs to detect data hazards.
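
The same dependencies can be found by inspecting the program text. The sketch below lists read-after-write pairs in which the producer is at most three instructions ahead of the consumer, which corresponds to the three pipeline registers (ID/EX, EX/M, M/WB) checked by the detection logic. The tuple encoding of instructions and the name `raw_hazards` are assumptions made for this example, and the sketch does not account for a register being overwritten between producer and consumer.

```python
def raw_hazards(program, window=3):
    """List (producer index, consumer index, register) pairs within `window`."""
    hazards = []
    for j, (_, srcs) in enumerate(program):
        for i in range(max(0, j - window), j):
            dest = program[i][0]
            for r in sorted(set(srcs)):
                if dest != 0 and r == dest:
                    hazards.append((i, j, r))
    return hazards

# Each entry is (destination register, (source registers)).
program = [
    (3, (1, 2)),   # add r3, r1, r2
    (5, (3, 5)),   # sub r5, r3, r5
    (6, (3, 4)),   # or  r6, r3, r4
    (6, (3, 8)),   # add r6, r3, r8
]
for producer, consumer, reg in raw_hazards(program):
    print(f"instruction {consumer} reads r{reg} written by instruction {producer}")
```

All three later instructions read r3 while the first add may still be in flight.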
Next Goal

What to do if data hazard detected?
Resolving Data Hazards

What to do if data hazard detected?

A) Wait/Stall
B) Reorder in Software (SW)
C) Forward/Bypass
D) All the above
E) None. We will use some other method
Resolving Data Hazards

What to do if data hazard detected?

A) Wait/Stall

B) Reorder in Software (SW)  

C) Forward/Bypass

D) All the above

E) None. We will use some other method

Discuss today
Next Goal

What to do if data hazard detected?

Options

• Nothing
  – Change the ISA to match implementation

• Stall
  – Pause current and subsequent instructions till safe

• Forward/bypass
  – Forward data value to where it is needed
Stalling

How to stall an instruction in ID stage

• prevent IF/ID pipeline register update
  – stalls the ID stage instruction

• convert ID stage instr into `nop` for later stages
  – innocuous “bubble” passes through pipeline

• prevent PC update
  – stalls the next (IF stage) instruction
Detecting Data Hazards

add r3, r1, r2
sub r5, r3, r5
or r6, r3, r4
add r6, r3, r8

[Figure: pipelined datapath with a "detect hazard" unit in the ID stage. It compares Ra and Rb of the instruction in IF/ID against the Rd fields held in ID/EX, EX/MEM, and MEM/WB; on a hazard the signals labeled WE=0, MemWr=0, and RegWr=0 are driven.]
Stalling

add r3, r1, r2
sub r5, r3, r5
or r6, r3, r4
add r6, r3, r8

[Figure: pipeline timing diagram over clock cycles 1-8 (time runs left to right). add r3, r1, r2 passes through IF, ID, Ex, M, W and changes r3 from 10 to 20; the dependent sub r5, r3, r5 is stalled for 3 cycles, and or r6, r3, r4 and add r6, r3, r8 wait behind it.]

While the hazard persists, a nop is sent down the pipeline:

NOP = If(IF/ID.rA ≠ 0 &&
         (IF/ID.rA == ID/Ex.Rd ||
          IF/ID.rA == Ex/M.Rd ||
          IF/ID.rA == M/W.Rd))
Stalling

How to stall an instruction in ID stage

• prevent IF/ID pipeline register update
  – stalls the ID stage instruction
• convert ID stage instr into `nop` for later stages
  – innocuous “bubble” passes through pipeline
• prevent PC update
  – stalls the next (IF stage) instruction
Data hazards occur when an operand (register) depends on the result of a previous instruction that may not be computed yet. A pipelined processor needs to detect data hazards.

Stalling, preventing a dependent instruction from advancing, is one way to resolve data hazards.

Stalling introduces NOPs ("bubbles") into a pipeline. Introduce NOPs by (1) preventing the PC from updating, (2) preventing the IF/ID pipeline register from updating, and (3) preventing writes to memory and the register file.

*Bubbles in pipeline significantly decrease performance.*
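
The cost of bubbles can be estimated with a small cycle-level model. The Python sketch below is a simplified simulation, assuming a stall-only five-stage pipeline with no forwarding and no register file bypass, in which the ID-stage instruction waits while any older instruction in EX, MEM, or WB will write one of its source registers. Instructions are encoded as hypothetical `(rd, ra, rb)` tuples.

```python
from typing import List, Optional, Tuple

Instr = Tuple[int, int, int]  # (rd, ra, rb) register numbers

def hazard(in_id: Optional[Instr], older: Optional[Instr]) -> bool:
    """True when the ID-stage instruction reads a register `older` will write."""
    if in_id is None or older is None:
        return False
    _, ra, rb = in_id
    rd = older[0]
    return rd != 0 and rd in (ra, rb)

def run_with_stalls(program: List[Instr]) -> Tuple[int, int]:
    """Return (total cycles, stall cycles) for a stall-only pipeline."""
    fetch = decode = execute = memory = writeback = None
    pc = cycles = stalls = 0
    if program:
        fetch, pc = program[0], 1
    while any(s is not None for s in (fetch, decode, execute, memory, writeback)):
        cycles += 1
        stall = any(hazard(decode, s) for s in (execute, memory, writeback))
        writeback, memory = memory, execute   # later stages always advance
        if stall:
            execute = None                    # bubble enters EX; IF and ID hold
            stalls += 1
        else:
            execute, decode = decode, fetch
            fetch = program[pc] if pc < len(program) else None
            pc += 1
    return cycles, stalls

program = [
    (3, 1, 2),   # add r3, r1, r2
    (5, 3, 5),   # sub r5, r3, r5
    (6, 3, 4),   # or  r6, r3, r4
    (6, 3, 8),   # add r6, r3, r8
]
print(run_with_stalls(program))
```

Under these assumptions the four instructions need 3 stall cycles, so they finish in 11 cycles instead of the 8 an ideal pipeline would take.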
What to do if data hazard detected?

A) Wait/Stall

B) Reorder in Software (SW)

C) Forward/Bypass
Forwarding bypasses some pipeline stages by passing a result directly to a dependent instruction's operand (register).

Three types of forwarding/bypass

- Forwarding from Ex/Mem registers to Ex stage (M→Ex)
- Forwarding from Mem/WB register to Ex stage (W→Ex)
- RegisterFile Bypass
Forwarding Datapath 1

Ex/MEM to EX Bypass

- EX needs ALU result that is still in MEM stage
- Resolve:
  Add a bypass from EX/MEM.D to start of EX

How to detect? Logic in Ex Stage:

\[
\text{forward} = (\text{Ex/M.WE} \land \text{Ex/M.Rd} \neq 0 \land \text{ID/Ex.Ra} = \text{Ex/M.Rd}) \lor (\text{same for Rb})
\]
Three types of forwarding/bypass

- Forwarding from Ex/Mem registers to Ex stage (M→Ex)
- Forwarding from Mem/WB register to Ex stage (W→Ex)
- RegisterFile Bypass
add r3, r1, r2

sub r5, r3, r1
Forwarding Datapath 2

Mem/WB to EX Bypass

• EX needs value being written by WB
• Resolve:
  Add bypass from WB final value to start of EX

How to detect? Logic in Ex Stage:

\[
\text{forward} = (\text{M/WB.WE} \land \text{M/WB.Rd} \neq 0 \land \text{ID/Ex.Ra} = \text{M/WB.Rd} \land \neg(\text{Ex/M.WE} \land \text{Ex/M.Rd} \neq 0 \land \text{ID/Ex.Ra} = \text{Ex/M.Rd})) \lor (\text{same for Rb})
\]

Check pg. 311
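
The two forwarding conditions combine into a priority choice for each ALU operand: prefer the newer value in Ex/M, otherwise take the value in M/WB, otherwise use the register file read. The following Python sketch shows that selection for operand A; the function name `forward_a` and the string labels are illustrative only, and operand B would use the same logic with Rb.

```python
def forward_a(id_ex_ra: int,
              ex_m_we: bool, ex_m_rd: int,
              m_wb_we: bool, m_wb_rd: int) -> str:
    """Pick the source for ALU operand A: 'EX/M', 'M/WB', or 'REG'."""
    from_ex_m = ex_m_we and ex_m_rd != 0 and id_ex_ra == ex_m_rd
    from_m_wb = (m_wb_we and m_wb_rd != 0 and id_ex_ra == m_wb_rd
                 and not from_ex_m)
    if from_ex_m:
        return "EX/M"    # newest value: result of the instruction just ahead
    if from_m_wb:
        return "M/WB"    # value about to be written back
    return "REG"         # no hazard: use the value read in ID

# sub r5, r3, r1 in EX directly behind add r3, r1, r2 (now in MEM).
print(forward_a(3, ex_m_we=True, ex_m_rd=3, m_wb_we=False, m_wb_rd=0))
# or r6, r3, r4 in EX two instructions behind the add (now in WB).
print(forward_a(3, ex_m_we=True, ex_m_rd=5, m_wb_we=True, m_wb_rd=3))
```

The `not from_ex_m` term mirrors the negated clause in the second condition: when both stages hold a value for the same register, the Ex/M copy is the more recent one.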
Three types of forwarding/bypass

- Forwarding from Ex/Mem registers to Ex stage (M → Ex)
- Forwarding from Mem/WB register to Ex stage (W → Ex)
- RegisterFile Bypass
add r3, r1, r2
sub r5, r3, r1
or r6, r3, r4
Register File Bypass

• Reading a value that is currently being written

Detect:

((Ra == MEM/WB.Rd) or (Rb == MEM/WB.Rd))
and (WB is writing a register)

Resolve:

Add a bypass around register file (WB to ID)

Better: (Hack) just negate register file clock
  – writes happen at end of first half of each clock cycle
  – reads happen during second half of each clock cycle
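
The effect of writing in the first half of the cycle and reading in the second half can be modeled in software by ordering the two operations inside one call. This Python sketch is a behavioral model only, assuming a 32-entry register file with register 0 hard-wired to zero; the class and method names are made up for this example.

```python
class RegisterFile:
    """Behavioral model: write in the first half-cycle, read in the second."""

    def __init__(self, size: int = 32):
        self.regs = [0] * size

    def cycle(self, we: bool, rd: int, wdata: int, ra: int, rb: int):
        """Perform one clock cycle and return the values read for Ra and Rb."""
        if we and rd != 0:            # first half: WB writes (r0 stays zero)
            self.regs[rd] = wdata
        return self.regs[ra], self.regs[rb]   # second half: ID reads

rf = RegisterFile()
rf.regs[3] = 10                       # old value of r3
# WB writes r3 = 20 in the same cycle that ID reads r3 and r1.
a, b = rf.cycle(we=True, rd=3, wdata=20, ra=3, rb=1)
print(a, b)
```

Because the write lands before the read, the instruction in ID sees the new value of r3 without any extra bypass wires.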
Three types of forwarding/bypass:

- Forwarding from Ex/Mem registers to Ex stage (M→Ex)
- Forwarding from Mem/WB register to Ex stage (W → Ex)
- RegisterFile Bypass
add r3, r1, r2
sub r5, r3, r1
or r6, r3, r4
add r6, r3, r8
Forwarding Example

add r3, r1, r2
sub r5, r3, r5
or r6, r3, r4
add r6, r3, r8

r3 = 10 before add r3, r1, r2 takes effect; r3 = 20 afterwards.

[Figure: pipeline timing diagram over clock cycles 1-8. Each instruction passes through IF, ID, Ex, M, W.]
[Figure: pipeline timing diagram over clock cycles 1-8 (time runs left to right). Each instruction below passes through IF, ID, Ex, M, W.]

add r3, r1, r2
sub r5, r3, r4
lw r6, 4(r3)
or r5, r3, r5
sw r6, 12(r3)
Three types of forwarding/bypass

- Forwarding from Ex/Mem registers to Ex stage (M→Ex)
- Forwarding from Mem/WB register to Ex stage (W→Ex)
- Register File Bypass
Data hazards occur when an operand (register) depends on the result of a previous instruction that may not be computed yet. A pipelined processor needs to detect data hazards.

Stalling, preventing a dependent instruction from advancing, is one way to resolve data hazards. Stalling introduces NOPs ("bubbles") into a pipeline. Introduce NOPs by (1) preventing the PC from updating, (2) preventing the IF/ID pipeline register from updating, and (3) preventing writes to memory and the register file. Bubbles (nops) in the pipeline significantly decrease performance.

Forwarding bypasses some pipeline stages by passing a result directly to a dependent instruction's operand (register). This gives better performance than stalling.
Data Hazard Recap

Stall

- Pause current and all subsequent instructions

Forward/Bypass

- Try to steal correct value from elsewhere in pipeline
- Otherwise, fall back to stalling or require a delay slot

Tradeoffs?