

# **Pipelining**

# Hakim Weatherspoon CS 3410

Computer Science Cornell University



[Weatherspoon, Bala, Bracy, McKee, and

# Review: Single Cycle Processor

- Advantages
  - Single cycle per instruction makes logic and clock simple
- Disadvantages
  - Since instructions take different amounts of time to finish, memory and functional units are not efficiently utilized
  - Cycle time is the longest delay
    - Load instruction
  - Best possible CPI is 1 (actually < 1 with parallelism)
    - However, lower MIPS and longer clock period (lower clock frequency); hence, lower performance

### Review: Multi Cycle Processor

- Advantages
  - Better MIPS and smaller clock period (higher clock frequency)
  - Hence, better performance than Single Cycle processor
- Disadvantages
  - Higher CPI than single cycle processor
- Pipelining: Want better Performance
  - want small CPI (close to 1) with high MIPS and short clock period (high clock frequency)

# Improving Performance

- Parallelism
- Pipelining
- Both!

# The Kids

Alice

Bob



They don't always get along...

# The Bicycle



### The Materials



#### The Instructions

N pieces, each built following same sequence:



### Design 1: Sequential Schedule



Alice owns the room

Bob can enter when Alice is finished
Repeat for remaining tasks
No possibility for conflicts

## Sequential Performance



Latency: 4 hours/task

Throughput: 1 task/4 hrs

Concurrency: 1

CPI = 4

# Design 2: Pipelined Design

Partition room into *stages* of a *pipeline*

One person owns a stage at a time

- 4 stages
- 4 people working simultaneously
- Everyone moves right in lockstep

It still takes all four stages for one job to complete

### Pipelined Performance

What if drilling takes twice as long, but gluing and painting take ½ as long?

Latency: 4 cycles/task

Throughput: 1 task/2 cycles

CPI = 2

#### Lessons

- Principle:
- Throughput increased by parallel execution
- Balanced pipeline very important
  - Else slowest stage dominates performance
- Pipelining:
  - Identify pipeline stages
  - Isolate stages from each other
  - Resolve pipeline hazards (next lecture)
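
The numbers in the bicycle example can be reproduced with a short calculation. The Python sketch below is illustrative only. It assumes that one job's latency is the sum of the stage times and that throughput is set by the slowest stage. The function name `pipeline_metrics` and the stage durations are assumptions, with the stage the example does not name left at 1 time unit.

```python
def pipeline_metrics(stage_times):
    """Return (latency, throughput, cpi) for a pipeline with the given stage times.

    Assumed model: latency is the sum of the stage times, and one finished
    job emerges every `max(stage_times)` time units.
    """
    latency = sum(stage_times)
    slowest = max(stage_times)
    throughput = 1 / slowest  # jobs completed per time unit
    cpi = slowest             # time units between completed jobs
    return latency, throughput, cpi


# Balanced pipeline: four stages of one time unit each.
balanced = pipeline_metrics([1, 1, 1, 1])

# Unbalanced: drilling doubled, gluing and painting halved (assumed stage order).
unbalanced = pipeline_metrics([1, 2, 0.5, 0.5])

print("balanced:  ", balanced)
print("unbalanced:", unbalanced)
```

Both pipelines have the same latency, but the unbalanced one finishes a job only every two time units because the slowest stage dominates.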

# Single Cycle vs Pipelined Processor

### Single Cycle → Pipelining

#### Single-cycle

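insn0.fetch, dec, exec → insn1.fetch, dec, exec

#### **Pipelined**

| insn  | Cycle 1     | Cycle 2     | Cycle 3    | Cycle 4    |
|-------|-------------|-------------|------------|------------|
| insn0 | insn0.fetch | insn0.dec   | insn0.exec |            |
| insn1 |             | insn1.fetch | insn1.dec  | insn1.exec |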

### Agenda

- 5-stage Pipeline
- Implementation
- Working Example







#### Hazards

- Structural
- Data Hazards
- Control Hazards


# Pipelined Processor





# Time Graphs



Latency: 5 cycles

Throughput: 1 insn/cycle

Concurrency: 5

CPI = 1
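
A CPI of 1 is the steady-state value once the pipeline is full. As a rough sketch (assuming an ideal pipeline with no hazards or stalls; the function names are made up for illustration), the total cycle count and the effective CPI for a short program can be computed like this:

```python
def pipelined_cycles(n_insns, n_stages=5):
    """Cycles needed to finish n_insns on an ideal, hazard-free pipeline."""
    if n_insns == 0:
        return 0
    # The first instruction takes n_stages cycles; each later one finishes
    # one cycle after its predecessor.
    return n_stages + (n_insns - 1)


def effective_cpi(n_insns, n_stages=5):
    """Average cycles per instruction, including the pipeline fill time."""
    return pipelined_cycles(n_insns, n_stages) / n_insns


for n in (1, 5, 1000):
    print(n, pipelined_cycles(n), effective_cpi(n))
```

For a single instruction the CPI equals the 5-cycle latency; as the instruction count grows, the fill time is amortized and the CPI approaches 1.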

## Principles of Pipelined Implementation

- Break datapath into multiple cycles (here 5)
  - Parallel execution increases throughput
  - Balanced pipeline very important
    - Slowest stage determines clock rate
    - Imbalance kills performance
- Add pipeline registers (flip-flops) for isolation
  - Each stage begins by reading values from latch
  - Each stage ends by writing values to latch
- Resolve hazards



# Pipeline Stages

| Stage     | Perform Functionality                                                                                      | Latch values of interest                                                                                       |
|-----------|------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------|
| Fetch     | Use PC to index Program Memory, increment PC                                                               | Instruction bits (to be decoded); PC+4 (to compute branch targets)                                             |
| Decode    | Decode instruction, generate control signals, read register file                                           | Control information, Rd index, immediates, offsets, register values (Ra, Rb), PC+4 (to compute branch targets) |
| Execute   | Perform ALU operation; compute targets (PC+4+offset, etc.) in case this is a branch; decide if branch taken | Control information, Rd index, *etc.*; result of ALU operation; value in case this is a store instruction      |
| Memory    | Perform load/store if needed, address is ALU result                                                        | Control information, Rd index, *etc.*; result of load; pass result from execute                                |
| Writeback | Select value, write to register file                                                                       |                                                                                                                |

# Instruction Fetch (IF)

Stage 1: Instruction Fetch

#### Fetch a new instruction every cycle

- Current PC is index to instruction memory
- Increment the PC at end of cycle (assume no branches for now)

#### Write values of interest to pipeline register (IF/ID)

- Instruction bits (for later decoding)
- PC+4 (for later computing branch targets)


### Decode

- Stage 2: Instruction Decode
- On every cycle:
  - Read IF/ID pipeline register to get instruction bits
  - Decode instruction, generate control signals
  - Read from register file
- Write values of interest to pipeline register (ID/EX)
  - Control information, Rd index, immediates, offsets, ...
  - Contents of Ra, Rb
  - PC+4 (for computing branch targets later)



## Execute (EX)

- Stage 3: Execute
- On every cycle:
  - Read ID/EX pipeline register to get values and control bits
  - Perform ALU operation
  - Compute targets (PC+4+offset, etc.) in case this is a branch
  - Decide if jump/branch should be taken
- Write values of interest to pipeline register (EX/MEM)
  - Control information, Rd index, ...
  - Result of ALU operation
  - Value in case this is a memory store instruction



## MEM

- Stage 4: Memory
- On every cycle:
  - Read EX/MEM pipeline register to get values and control bits
  - Perform memory load/store if needed
    - address is ALU result
- Write values of interest to pipeline register (MEM/WB)
  - Control information, Rd index, ...
  - Result of memory operation
  - Pass result of ALU operation



## WB

- Stage 5: Write-back
- On every cycle:
  - Read MEM/WB pipeline register to get values and control bits
  - Select value and write to register file





Consider a non-pipelined processor with clock period C (*e.g.*, 50 ns). If you divide the processor into N stages (*e.g.*, 5), your new clock period will be:

- A. C
- B. N
- C. less than C/N
- D. C/N
- E. greater than C/N
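
One way to reason about this question: the pipelined clock period is set by the slowest stage plus the delay of the pipeline register, so in practice it comes out greater than C/N. A minimal sketch follows; the stage delays and the 1 ns register overhead are invented numbers, and `pipelined_clock_period` is a hypothetical helper.

```python
def pipelined_clock_period(stage_delays_ns, register_overhead_ns):
    """Clock period of a pipeline: slowest stage plus pipeline-register overhead."""
    return max(stage_delays_ns) + register_overhead_ns


C = 50  # non-pipelined clock period in ns
N = 5   # number of stages

# Imbalanced split of the 50 ns datapath (assumed numbers).
imbalanced = pipelined_clock_period([10, 8, 12, 11, 9], register_overhead_ns=1.0)

# Even a perfectly balanced split still pays the register overhead.
balanced = pipelined_clock_period([10, 10, 10, 10, 10], register_overhead_ns=1.0)

print("C/N =", C / N)
print("imbalanced period =", imbalanced)
print("balanced period   =", balanced)
```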


# Takeaway

- Pipelining is a powerful technique to mask latencies and increase throughput
  - Logically, instructions execute one at a time
  - Physically, instructions execute in parallel
    - Instruction level parallelism
- Abstraction promotes decoupling
  - Interface (ISA) vs. implementation (Pipeline)

# RISC-V is designed for pipelining

- Instructions same length
  - 32 bits, easy to fetch and then decode
- 4 types of instruction formats
  - Easy to route bits between stages
  - Can read a register source before even knowing what the instruction is
- Memory access through lw and sw only
  - Access memory after ALU


# Example: Sample Code (Simple)

```
add  x3 ← x1, x2
nand x6 ← x4, x5
lw   x4 ← x2, 20
add  x5 ← x2, x5
sw   x7 → x3, 12
```

Assume 8-register machine



Example: Start State @ Cycle 0 (figure: the pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB initially hold nops; at time 1, fetch add x3, x1, x2)







Cycle 4: Fetch add, Decode lw, ...





Cycle 6: Decode sw, ...
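
The cycle-by-cycle figures can be approximated with a small table generator. The sketch below assumes an ideal pipeline with no stalls and assumes that cycle 1 is the cycle in which the first `add` is fetched. `STAGES`, `PROGRAM`, and `occupancy` are names made up for this illustration.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

# The sample program, identified by opcode and destination/source register.
PROGRAM = ["add x3", "nand x6", "lw x4", "add x5", "sw x7"]


def occupancy(program, cycle):
    """Map each stage to the instruction occupying it in the given cycle (1-based)."""
    result = {}
    for stage_index, stage in enumerate(STAGES):
        # The instruction in stage k during cycle c was fetched in cycle c - k.
        insn_index = cycle - 1 - stage_index
        if 0 <= insn_index < len(program):
            result[stage] = program[insn_index]
        else:
            result[stage] = "nop"
    return result


for cycle in range(1, len(PROGRAM) + len(STAGES)):
    print(cycle, occupancy(PROGRAM, cycle))
```

Under these assumptions, cycle 4 fetches the second `add` while decoding `lw`, and cycle 6 decodes `sw`, which matches the captions above.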









Pipelining is great because:

- A. You can fetch and decode the same instruction at the same time.
- B. You can fetch two instructions at the same time.
- C. You can fetch one instruction while decoding another.
- D. Instructions only need to visit the pipeline stages that they require.
- E. C and D





### Hazards

Correctness problems associated w/ processor design

#### 1. Structural hazards

Same resource needed for different purposes at the same time (Possible: ALU, Register File, Memory)

#### 2. Data hazards

Instruction output needed before it's available

#### 3. Control hazards

Next instruction PC unknown at time of Fetch

## Dependences and Hazards

#### **Dependence**: relationship between two insns

- Data: two insns use same storage location
- Control: 1 insn affects whether another executes at all
- Not a bad thing, programs would be boring otherwise
- Enforced by making older insn go before younger one
  - Happens naturally in single-/multi-cycle designs
  - But not in a pipeline

# Hazard: dependence & possibility of wrong insn order

- Effects of wrong insn order cannot be externally visible
- Hazards are a bad thing: most solutions either complicate the hardware or reduce performance

#### Data Hazards

- register file (RF) reads occur in stage 2 (ID)
- RF writes occur in stage 5 (WB)
- RF written in first half, read in second half of cycle

```
x10: add x3 ← x1, x2
x14: sub x5 ← x3, x4
```

1. Is there a dependence?
2. Is there a hazard?

A) Yes

B) No

C) Cannot tell with the information given.

#### Data Hazards

- register file (RF) reads occur in stage 2 (ID)
- RF writes occur in stage 5 (WB)
- RF written in first half, read in second half of cycle

x10: add x3 ← x1, x2

x14: sub x5 ← x3, x4

1. Is there a dependence?
2. Is there a hazard?

A) Yes for both

B) No

C) Cannot tell with the information given.

# iClicker Follow-up

Which of the following statements is true?

- A. Whether there is a data dependence between two instructions depends on the machine the program is running on.
- B. Whether there is a data hazard between two instructions depends on the machine the program is running on.
- C. Both A & B
- D. Neither A nor B


## Where are the Data Hazards?





## iClicker

How many data hazards due to x3 only?

add x3, x1, x2

sub x5, x3, x4

lw x6, x3, 4

or x5, x3, x5

sw x6, x3, 12

A) 1

B) 2

C) 3

D) 4

E) 5







### **Data Hazards**

- register file reads occur in stage 2 (ID)
- register file writes occur in stage 5 (WB)
- next instructions may read values about to be written

e.g.

add x3, x1, x2

sub x5, x3, x4

How to detect? Logic in ID stage:

```
stall = (IF/ID.Rs1 != 0 &&
         (IF/ID.Rs1 == ID/EX.Rd ||
          IF/ID.Rs1 == EX/M.Rd ||
          IF/ID.Rs1 == M/WB.Rd))
        || (same for Rs2)
```
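
The detection logic can be written out as a small function. This is a sketch rather than the course's reference implementation. It assumes that register index 0 (x0) stands for "no register", so instructions that write nothing carry `rd = 0`. The `Insn` record and `must_stall` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Insn:
    """Register indices carried by an instruction; 0 means x0 / unused."""
    rd: int = 0
    rs1: int = 0
    rs2: int = 0


NOP = Insn()


def must_stall(if_id, id_ex, ex_mem, mem_wb):
    """Stall if the instruction in ID reads a register that an older,
    still in-flight instruction is going to write."""
    pending_writes = {id_ex.rd, ex_mem.rd, mem_wb.rd}

    def hazard(src):
        return src != 0 and src in pending_writes

    return hazard(if_id.rs1) or hazard(if_id.rs2)


add = Insn(rd=3, rs1=1, rs2=2)  # add x3, x1, x2
sub = Insn(rd=5, rs1=3, rs2=4)  # sub x5, x3, x4

print(must_stall(sub, add, NOP, NOP))  # add is in EX while sub is in ID
print(must_stall(sub, NOP, NOP, NOP))  # add has left the pipeline
```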



## Takeaway

Data hazards occur when an operand (register) depends on the result of a previous instruction that may not be computed yet. A pipelined processor needs to detect data hazards.

## **Next Goal**

What to do if data hazard detected?

## iClicker

What to do if data hazard detected?

- A) Wait/Stall
- B) Reorder in Software (SW)
- C) Forward/Bypass
- D) All the above
- E) None. We will use some other method

## Possible Responses to Data Hazards

### 1. Do Nothing

- Change the ISA to match implementation
- "Hey compiler: don't create code w/data hazards!"

(We can do better than this)

#### 2. Stall

 Pause current and subsequent instructions till safe

### 3. Forward/bypass

 Forward data value to where it is needed (Only works if value actually exists already)

### How to stall an instruction in ID stage

- prevent IF/ID pipeline register update
  - stalls the ID stage instruction
- convert ID stage instr into nop for later stages
  - innocuous "bubble" passes through pipeline
- prevent PC update
  - stalls the next (IF stage) instruction
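
The three stall actions above can be sketched as a single update of the front of the pipeline. The sketch assumes a 4-byte instruction size and represents instructions as plain strings. `step` is a hypothetical helper, not part of any real simulator.

```python
def step(pc, if_id, fetched, stall):
    """Advance the front of the pipeline by one cycle.

    pc      -- current program counter
    if_id   -- instruction currently held in the IF/ID register
    fetched -- instruction just read from instruction memory at pc
    stall   -- True if the hazard logic in ID asked for a stall

    Returns (next_pc, next_if_id, next_id_ex).
    """
    if stall:
        # Hold the PC and IF/ID, and send a bubble down the pipeline.
        return pc, if_id, "nop"
    # Normal operation: everything moves forward by one stage.
    return pc + 4, fetched, if_id


print(step(0x14, "sub x5, x3, x4", "lw x6, x3, 4", stall=True))
print(step(0x14, "sub x5, x3, x4", "lw x6, x3, 4", stall=False))
```

While stalled, the same instruction stays in ID and the same instruction is fetched again on the next cycle, so nothing is lost.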









NOP = If(IF/ID.Rs1 ≠ 0 && (IF/ID.Rs1 == ID/Ex.Rd || IF/ID.Rs1 == Ex/M.Rd || IF/ID.Rs1 == M/W.Rd))

STALL CONDITION MET: IF/ID.Rs1 == ID/Ex.Rd

STALL CONDITION MET: IF/ID.Rs1 == Ex/M.Rd






## Takeaway

Data hazards occur when an operand (register) depends on the result of a previous instruction that may not be computed yet. A pipelined processor needs to detect data hazards.

Stalling, preventing a dependent instruction from advancing, is one way to resolve data hazards.

Stalling introduces NOPs ("bubbles") into a pipeline. Introduce NOPs by (1) preventing the PC from updating, (2) preventing the IF/ID pipeline register from changing, and (3) preventing writes to memory and the register file. Bubbles in the pipeline significantly decrease performance.


## Forwarding

- Forwarding bypasses some pipeline stages by sending a result directly to a dependent instruction's operand (register).
- Three types of forwarding/bypass
  - Forwarding from Ex/Mem registers to Ex stage (M→Ex)
  - Forwarding from Mem/WB register to Ex stage (W→Ex)
  - RegisterFile Bypass



# Add the Forwarding Datapath




## Forwarding Datapath 1: Ex/MEM → EX



Problem: EX needs ALU result that is in MEM stage

Solution: add a bypass from EX/MEM.D to start of EX

Detection logic in Ex stage:

forward = (Ex/M.WE && Ex/M.Rd != 0 && ID/Ex.Rs1 == Ex/M.Rd) || (same for Rs2)
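
The forwarding decision for one ALU operand can be sketched as a multiplexer select. The Ex/M check follows the condition given above. The M/WB check and the choice to prefer the younger Ex/M result when both match are assumptions based on the usual textbook design. `Latch` and `forward_operand` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Latch:
    """The fields of a pipeline register that matter for forwarding."""
    we: bool = False  # register-file write enable
    rd: int = 0       # destination register index
    value: int = 0    # result travelling down the pipeline


def forward_operand(rs, reg_value, ex_mem, mem_wb):
    """Pick the value the ALU should use for source register rs."""
    if ex_mem.we and ex_mem.rd != 0 and ex_mem.rd == rs:
        return ex_mem.value  # M→Ex: result of the previous instruction
    if mem_wb.we and mem_wb.rd != 0 and mem_wb.rd == rs:
        return mem_wb.value  # W→Ex: result from two instructions back
    return reg_value         # no hazard: use the value read in ID


newer = Latch(we=True, rd=3, value=11)
older = Latch(we=True, rd=3, value=7)

print(forward_operand(3, 0, newer, older))    # both match, the newer result wins
print(forward_operand(3, 0, Latch(), older))  # only M/WB matches
print(forward_operand(4, 42, newer, older))   # no match, register value is used
```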

## Forwarding Datapath 2: Mem/WB → EX

Problem: EX needs value being written by WB

Solution: Add bypass from WB final value to start of EX

#### **Detection Logic:**

## Register File Bypass

Problem: Reading a value that is currently being written

Solution: just negate register file clock

- writes happen at end of first half of each clock cycle
- reads happen during second half of each clock cycle


# Forwarding Example 2

| Clock cycle    | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  |
|----------------|----|----|----|----|----|----|----|----|----|
| add x3, x1, x2 | IF | ID | Ex | M  | W  |    |    |    |    |
| sub x5, x3, x5 |    | IF | ID | Ex | M  | W  |    |    |    |
| lw x6, x3, 4   |    |    | IF | ID | Ex | M  | W  |    |    |
| or x5, x3, x6  |    |    |    | IF | ID | Ex | M  | W  |    |
| sw x6, x3, 12  |    |    |    |    | IF | ID | Ex | M  | W  |

Backwards arrows require time travel.

## Load-Use Hazard Explained



### Data dependency after a load instruction:

- Value not available until after the M stage
- → Next instruction cannot proceed if dependent

#### THE KILLER HAZARD

# Load-Use Stall





## Load-Use Detection



Stall = If(ID/Ex.MemRead && (IF/ID.Rs1 == ID/Ex.Rd || IF/ID.Rs2 == ID/Ex.Rd))
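
Written as a function, the load-use check might look like the sketch below. It assumes register indices are small integers, and it adds a check that the load's destination is not x0, which is not part of the condition above. `load_use_stall` is a hypothetical name.

```python
def load_use_stall(id_ex_mem_read, id_ex_rd, if_id_rs1, if_id_rs2):
    """True if the instruction in ID needs the result of a load that is in EX."""
    return bool(
        id_ex_mem_read
        and id_ex_rd != 0  # assumption: x0 never creates a hazard
        and id_ex_rd in (if_id_rs1, if_id_rs2)
    )


# lw x6, x3, 4 in EX; or x5, x3, x6 in ID: the load result is needed next cycle.
print(load_use_stall(True, 6, 3, 6))

# Same registers, but the producer is an ALU instruction: forwarding suffices.
print(load_use_stall(False, 6, 3, 6))
```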

## Incorrectly Resolving Load-Use Hazards

Most frequent 3410 **non-solution** to load-use hazards.

Why is this "solution" so so so so so awful?

# iClicker Question

Forwarding values directly from Memory to the Execute stage without storing them in a register first:

- A. Does not remove the need to stall.
- B. Adds one too many possible inputs to the ALU.
- C. Will cause the pipeline register to have the wrong value.
- D. Halves the frequency of the processor.
- E. Both A & D


# Resolving Load-Use Hazards

#### RISC-V Solution: Load-Use Stall

- Stall must be inserted so that load instruction can go through and update the register file.
- Forwarding from RAM is not an option.
- In some cases, real world compilers can optimize to avoid these situations.

# Takeaway

Data hazards occur when an operand (register) depends on the result of a previous instruction that may not be computed yet. A pipelined processor needs to detect data hazards.

Stalling, preventing a dependent instruction from advancing, is one way to resolve data hazards. Stalling introduces NOPs ("bubbles") into a pipeline. Introduce NOPs by (1) preventing the PC from updating, (2) preventing the IF/ID pipeline register from changing, and (3) preventing writes to memory and the register file. Bubbles (nops) in the pipeline significantly decrease performance.

Forwarding bypasses some pipeline stages by sending a result directly to a dependent instruction's operand (register). It gives better performance than stalling.

Find all hazards, and say how they are resolved:

```
add x3, x1, x2
nand x5, x3, x4
add x2, x6, x3
lw x6, x3, 24
sw x6, x2, 12
```

Stall + Forwarding from M/W→Ex (W→Ex)

Hours and hours of debugging!

# Data Hazard Recap

## Delay Slot(s)

Modify ISA to match implementation

#### Stall

Pause current and all subsequent instructions

## Forward/Bypass

- Try to steal correct value from elsewhere in pipeline
- Otherwise, fall back to stalling or require a delay slot

#### Tradeoffs?


## A bit of Context

```
i = 0;
do {
   n += 2;
   i++;
} while (i < max)
i = 7;
n--;

Assume: i → x1, n → x2, max → x3

x10        addi x1, x0, 0      # i = 0
x14  Loop: addi x2, x2, 2      # n += 2
x18        addi x1, x1, 1      # i++
x1C        blt  x1, x3, Loop   # i < max?
x20        addi x1, x0, 7      # i = 7
x24        subi x2, x2, 1      # n--
```

## **Control Hazards**

- instructions are fetched in stage 1 (IF)
- branch and jump decisions occur in stage 3 (EX)
  - → next PC not known until 2 cycles after branch/jump

```
x1C blt x1, x3, Loop
x20 addi x1, x0, 7
x24 subi x2, x2, 1
```

Branch not taken? No problem!

Branch taken? Just fetched 2 insns → Zap & Flush

Zap & Flush

- prevent PC update
- clear IF/ID latch



| Instruction        | 1  | 2  | 3  | 4   | 5   | 6   | 7   | 8  |
|--------------------|----|----|----|-----|-----|-----|-----|----|
| 1C blt x1,x3,L     | IF | ID | Ex | M   | W   |     |     |    |
| 20 addi x1,x0,7    |    | IF | ID | NOP | NOP | NOP |     |    |
| 24 subi x2,x2,1    |    |    | IF | NOP | NOP | NOP | NOP |    |
| 14 L:addi x2,x2,2  |    |    |    | IF  | ID  | Ex  | M   | W  |



# Reducing the cost of control hazard

Problem: Zapping 2 insns/branch

#### 1. Resolve Branch at Decode

- Some groups do this for Project 3, your choice
- Move branch calc from EX to ID
- Alternative: just zap 2nd instruction when branch taken

#### 2. Branch Prediction

 Not in 3410, but every processor worth anything does this (no offense!)



# Soln #1: Resolve Branches @ Decode



## **Branch Prediction**

## Most processors support Speculative Execution

- Guess direction of the branch
  - Allow instructions to move through pipeline
  - Zap them later if guess turns out to be wrong
- A must for long pipelines

# Speculative Execution: Loops

## Pipeline so far

- "Guess" (predict) that the branch will not be taken

#### We can do better!

- Make prediction based on last branch
- Predict "take branch" if last branch "taken"
- Or Predict "do not take branch" if last branch "not taken"
- Need one bit to keep track of last branch

# Speculative Execution: Loops

What is the accuracy of the branch predictor?

Wrong twice per loop! Once on loop entry and once on exit.

We can do better with 2 bits.

```
While (x3 ≠ 0) {.... x3--;}
Top:  BEQ x3, x0, End
      ...
      J Top
End:

While (x3 ≠ 0) {.... x3--;}
Top2: BEQ x3, x0, End2
      ...
      J Top2
End2:
```
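
The difference between one and two bits of history can be checked with a short simulation. The sketch below models a single loop-exit branch such as the `BEQ` above: it is not taken while the loop runs and taken once on exit. The loop shape (4 iterations, executed 3 times), the initial predictor states, and the function names are assumptions. The two-bit scheme shown is the common saturating counter, which the slides do not spell out.

```python
def mispredictions_1bit(outcomes, last_taken=False):
    """Predict that the branch does whatever it did last time."""
    misses = 0
    for taken in outcomes:
        if taken != last_taken:
            misses += 1
        last_taken = taken
    return misses


def mispredictions_2bit(outcomes, counter=0):
    """Two-bit saturating counter: 0 and 1 predict not taken, 2 and 3 predict taken."""
    misses = 0
    for taken in outcomes:
        predicted_taken = counter >= 2
        if predicted_taken != taken:
            misses += 1
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return misses


# Loop-exit branch: not taken for 4 iterations, then taken; the loop runs 3 times.
loop_branch = ([False] * 4 + [True]) * 3

print("1-bit misses:", mispredictions_1bit(loop_branch))
print("2-bit misses:", mispredictions_2bit(loop_branch))
```

With these assumptions the one-bit predictor misses on each exit and again on each re-entry, while the two-bit counter misses only on the exits.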

## Speculative Execution: Branch Execution



# Summary

#### Control hazards

- Is branch taken or not?
- Performance penalty: stall and flush

#### Reduce cost of control hazards

- Move branch decision from Ex to ID
  - 2 nops to 1 nop
- Branch prediction
  - Correct. Great!
  - Wrong. Flush pipeline. Performance penalty
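
The effect of the flush penalty on performance can be estimated with a back-of-the-envelope formula: effective CPI = base CPI + (fraction of branches) × (fraction of branches that flush) × (penalty in cycles). The sketch below uses invented workload numbers (20% branches, half of them taken) purely for illustration.

```python
def cpi_with_branches(branch_fraction, flush_fraction, penalty_cycles, base_cpi=1.0):
    """Effective CPI when some fraction of branches flush the pipeline."""
    return base_cpi + branch_fraction * flush_fraction * penalty_cycles


# Assumed workload: 20% of instructions are branches and half of them are taken.
resolve_in_ex = cpi_with_branches(0.2, 0.5, penalty_cycles=2)  # zap 2 insns
resolve_in_id = cpi_with_branches(0.2, 0.5, penalty_cycles=1)  # zap 1 insn

print("branch resolved in EX:", resolve_in_ex)
print("branch resolved in ID:", resolve_in_id)
```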


# Hazards Summary

#### Data hazards

- register file reads occur in stage 2 (ID)
- register file writes occur in stage 5 (WB)
- next instructions may read values soon to be written

#### Control hazards

- branch instruction may change the PC in stage 3 (EX)
- next instructions have already started executing

#### Structural hazards

- resource contention
- so far: impossible because of ISA and pipeline design

# Data Hazard Takeaways

Data hazards occur when an operand (register) depends on the result of a previous instruction that may not be computed yet. Pipelined processors need to detect data hazards.

Stalling, preventing a dependent instruction from advancing, is one way to resolve data hazards. Stalling introduces NOPs ("bubbles") into a pipeline. Introduce NOPs by (1) preventing the PC from updating, (2) preventing the IF/ID pipeline register from changing, and (3) preventing writes to memory and the register file. Nops significantly decrease performance.

Forwarding bypasses some pipeline stages by sending a result directly to a dependent instruction's operand (register). It gives better performance than stalling.

# **Control Hazard Takeaways**

Control hazards occur because the PC following a control instruction is not known until the control instruction is executed. If branch is taken → need to zap instructions. 1 cycle performance penalty.

We can reduce cost of a control hazard by moving branch decision and calculation from Ex stage to ID stage.



# Have a great February Break!!

