# **Data and Control Hazards** CS 3410, Spring 2014 Computer Science Cornell University See P&H Chapter: 4.6-4.8 ## **Announcements** Prelim next week Tuesday at 7:30 Upson B17 [a-e]\*, Olin 255[f-m]\*, Philips 101 [n-z]\* Go based on netid **Prelim reviews** Friday and Sunday evening. 7:30 again. Location: TBA on piazza **Prelim conflicts** Contact KB, Prof. Weatherspoon, Andrew Hirsch Survey Constructive feedback is very welcome ## **Administrivia** #### Prelim1: - Time: We will start at 7:30pm sharp, so come early - Loc: Upson B17 [a-e]\*, Olin 255[f-m]\*, Philips 101 [n-z]\* - Closed Book - Cannot use electronic device or outside material - Practice prelims are online in CMS - Material covered everything up to end of this week - Everything up to and including data hazards - Appendix B (logic, gates, FSMs, memory, ALUs) - Chapter 4 (pipelined [and non] MIPS processor with hazards) - Chapters 2 (Numbers / Arithmetic, simple MIPS instructions) - Chapter 1 (Performance) - HW1, Lab0, Lab1, Lab2 ## **Hazards** ## 3 kinds - Structural hazards - Multiple instructions want to use same unit - Data hazards - Results of instruction needed before - Control hazards - Don't know which side of branch to take - 1) Structural: say only 1 memory for instruction read and the MEM stage. That is a structural hazard. We have designed the MIPS ISA and our implementation of separate memory for instruction and data to avoid this problem. MIPS is designed very carefully to not have any structural hazards. - 2) Data hazards arise when data needed by one instruction has not yet been computed (because it is further down the pipeline). We need a solution for this. - 3) Control hazards arise when we don't know what the PC will be. This makes it not possible to push the next instruction down the pipeline. Till we know what the issue is. ## How to handle data hazards - What to do if data hazard detected? - Options - Nothing - Change the ISA to match implementation - Stall - Pause current and subsequent instructions till safe - Slow down the pipeline (add bubbles to pipeline) - Forward/bypass - Forward data value to where it is needed Nothing: Modify ISA to match implementation (not great as an option, violates abstraction. Requires restructuring by application. Might not always be possible). Stall: Pause current and all subsequent instruction Forward/Bypass: Steal correct value from elsewhere in pipeline # **Forwarding** Forwarding bypasses some pipelined stages forwarding a result to a dependent instruction operand (register) # **Forwarding** Forwarding bypasses some pipelined stages forwarding a result to a dependent instruction operand (register) Three types of forwarding/bypass - Forwarding from Ex/Mem registers to Ex stage (M→Ex) - Forwarding from Mem/WB register to Ex stage (W→Ex) - RegisterFile Bypass | Note that the arrow is not going backward in time. | | | |----------------------------------------------------|--|--| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | See where the hazard detection is. Forward unit is in the execute stage, because that is where you actually need the data. Why is the forward unit here. Because it does not need to be earlier. # Forwarding Datapath 1 #### Ex/MEM to EX Bypass - EX needs ALU result that is still in MEM stage - Resolve: Add a bypass from EX/MEM.D to start of EX How to detect? Logic in Ex Stage: ``` forward = (Ex/M.WE && EX/M.Rd != 0 && ID/Ex.Ra == Ex/M.Rd) || (same for Rb) ``` assumes (WE=0 implies rD=0) everywhere, similar for rA needed, rB needed (can ensure this in ID stage) # **Forwarding Datapath 2** ## Mem/WB to EX Bypass - EX needs value being written by WB - Resolve: Add bypass from WB final value to start of EX How to detect? Logic in Ex Stage: forward = (M/WB.WE && M/WB.Rd != 0 && ID/Ex.Ra == M/WB.Rd && || (same for Rb) Is this it? Not quite! assumes (WE=0 implies rD=0) everywhere, similar for rA needed, rB needed (can ensure this in ID stage) # Forwarding Datapath 2 ``` add r3, r1, r2 add r3, r1, r2 sub r5, r3, r5 or r6, r3, r4 or r6, r3, r4 add r6, r3, r8 add r6, r3, r8 ``` How to detect? Logic in Ex Stage: ``` M/WB (WE on, Rd != 0) and (M/WB.Rd == ID/Ex.Ra) also NOT(Ex/M.Rd == ID/Ex.Ra) and (WE, Rd!= 0)) ``` Rb same as Ra assumes (WE=0 implies rD=0) everywhere, similar for rA needed, rB needed (can ensure this in ID stage) # **Register File Bypass** ## **Register File Bypass** Reading a value that is currently being written #### Detect: ``` ((Ra == MEM/WB.Rd) or (Rb == MEM/WB.Rd)) and (WB is writing a register) ``` #### Resolve: Add a bypass around register file (WB to ID) #### Better: just negate register file clock - writes happen at end of first half of each clock cycle - reads happen during second half of each clock cycle # Are we done yet? add r3, r1, r2 lw r4, 20(r8) or r6, r3, r4 add r6, r3, r8 # **Memory Load Data Hazard** What happens if data dependency after a load word instruction? ## **Memory Load Data Hazard** - Value not available until after the M stage - So: next instruction can't proceed if hazard detected often see lw ...; nop; ... we will stall for our project 2 | requires one bubble | | |---------------------|--| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Need to go back in time | | |-------------------------|--| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | requires one bubble | | |---------------------|--| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It happens in an earlier stage. You stall since you don't want to go ahead from the ID stage unless you are ready. Walk through what happens with a # **Memory Load Data Hazard** #### Load Data Hazard - Value not available until WB stage - So: next instruction can't proceed if hazard detected #### Resolution: - MIPS 2000/3000: one delay slot - ISA says results of loads are not available until one cycle later - Assembler inserts nop, or reorders to fill delay slot - MIPS 4000 onwards: stall - But really, programmer/compiler reorders to avoid stalling in the load delay slot #### For stall, how to detect? Logic in ID Stage - Stall = ID/Ex.MemRead && (IF/ID.Ra == ID/Ex.Rd || IF/ID.Rb == ID/Ex.Rd) often see lw ...; nop; ... we will stall for our project 2 # Quiz add r3, r1, r2 nand r5, r3, r4 add r2, r6, r3 lw r6, 24(r3) sw r6, 12(r2) Find all hazards, and say how they are resolved ``` Quiz add r3, r1, r2 nand r5, r3, r4 add r2, r6, r3 lw r6, 24(r3) sw r6, 12(r2) ``` Find all hazards, and say how they are resolved Find all hazards, and say how they are resolved # **Data Hazard Recap** ## Delay Slot(s) • Modify ISA to match implementation #### Stall • Pause current and all subsequent instructions ## Forward/Bypass - Try to steal correct value from elsewhere in pipeline - Otherwise, fall back to stalling or require a delay slot - 1: ISA tied to impl; messy asm; impl easy & cheap; perf depends on asm. - 2: ISA correctness not tied to impl; clean+slow asm, or messy+fast asm; impl easy and cheap; same perf as 1. - 3: ISA perf not tied to impl; clean+fast asm; impl is tricky; best perf # Why are we learning about this? Logic and gates Numbers & arithmetic States & FSMs Memory A simple CPU Performance **Pipelining** Hazards: Data and Control ## **Control Hazards** #### What about branches? A control hazard occurs if there is a control instruction (e.g. BEQ) and the program counter (PC) following the control instruction is not known until the control instruction computes if the branch should be taken e.g. 0x10: beq r1, r2, L 0x14: add r3, r0, r3 0x18: sub r5, r4, r6 0x1C: L: or r3, r2, r4 Q: what happens with branch/jump in branch delay slot? A: bad stuff, so forbidden Q: why one, not two? A: can move branch calc from EX to ID; will require new bypasses into ID stage; or can just zap the second instruction Q: performance? A: stall is bad ## **Control Hazards** #### **Control Hazards** - instructions are fetched in stage 1 (IF) - branch and jump decisions occur in stage 3 (EX) - i.e. next PC is not known until **2 cycles** after branch/jump What happens to instr following a branch, if branch taken? #### Stall (+ Zap/Flush) - prevent PC update - clear IF/ID pipeline register - instruction just fetched might be wrong, so convert to nop - allow branch to continue into EX stage Q: what happens with branch/jump in branch delay slot? A: bad stuff, so forbidden Q: why one, not two? A: can move branch calc from EX to ID; will require new bypasses into ID stage; or can just zap the second instruction Q: performance? A: stall is bad # Reduce the cost of control hazard? Can we forward/bypass values for branches? - We can move branch calc from EX to ID - will require new bypasses into ID stage; or can just zap the second instruction What happens to instructions following a branch, if branch taken? • Still need to zap/flush instructions Is there still a performance penalty for branches Yes, need to stall, then may need to zap (flush) subsequent instructions that have already been fetched Q: what happens with branch/jump in branch delay slot? A: bad stuff, so forbidden Q: why one, not two? A: can move branch calc from EX to ID; will require new bypasses into ID stage; or can just zap the second instruction Q: performance? A: stall is bad ## **Control Hazards** #### **Control Hazards** - instructions are fetched in stage 1 (IF) - branch and jump decisions occur in stage 3 (EX) i.e. next PC is not known until 2 cycles after branch/jump Can optimize and move branch and jump decision to stage 2 (ID) i.e. next PC is not known until 1 cycles after branch/jump ### Stall (+ Zap) - prevent PC update - clear IF/ID pipeline register - instruction just fetched might be wrong one, so convert to nop - allow branch to continue into EX stage Q: what happens with branch/jump in branch delay slot? A: bad stuff, so forbidden Q: why one, not two? A: can move branch calc from EX to ID; will require new bypasses into ID stage; or can just zap the second instruction Q: performance? A: stall is bad # **Takeaway** Control hazards occur because the PC following a control instruction is not known until control instruction computes if branch should be taken or not If branch taken, then need to zap/flush instructions. There still a performance penalty for branches: Need to stall, then may need to zap (flush) subsequent instructions that have already been fetched We can reduce cost of a control hazard by moving branch decision and calculation from Ex stage to ID stage. This reduces the cost from flushing two instructions to only flushing one. # Reduce cost of control hazard more? ## **Delay Slot** - ISA says N instructions after branch/jump always executed - MIPS has 1 branch delay slot - i.e. whether branch taken or not, instruction following branch is *always* executed | Pipeline is humming | | |---------------------|--| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ## **Control Hazards** #### **Control Hazards** - instructions are fetched in stage 1 (IF) - branch and jump decisions occur in stage 3 (EX) i.e. next PC is not known until 2 cycles after branch/jump • Can optimize and move branch and jump decision to stage 2 (ID) i.e. next PC is not known until **1** cycles after branch/jump #### Stall (+ Zap) - prevent PC update - clear IF/ID pipeline register - instruction just fetched might be wrong one, so convert to nop - allow branch to continue into EX stage #### **Delay Slot** - ISA says N instructions after branch/jump always executed - MIPS has 1 branch delay slot Q: what happens with branch/jump in branch delay slot? A: bad stuff, so forbidden Q: why one, not two? A: can move branch calc from EX to ID; will require new bypasses into ID stage; or can just zap the second instruction Q: performance? A: stall is bad GPUs have such long pipelines it is just not realistic to add delay slots. So the penalty of branching is very high typically. Various architectural approaches to address that. ## **Takeaway** Control hazards occur because the PC following a control instruction is not known until control instruction computes if branch should be taken or not. If branch taken, then need to zap flush instructions. There still a performance penalty for branches Need to stall, then may need to zap (flush) subsequent instructions that have already been fetched. We can reduce cost of a control hazard by moving branch decision and calculation from Ex stage to ID stage. This reduces the cost from flushing two instructions to only flushing one. Delay Slots can potentially increase performance due to control hazards by putting a useful instruction in the delay slot since the instruction in the delay slot will *always* be executed. Requires software (compiler) to make use of delay slot. Put nop in delay slot if not able to put useful instruction in delay slot. # Reduce cost of Ctrl Haz even further? ## **Speculative Execution** - "Guess" direction of the branch - Allow instructions to move through pipeline - Zap them later if wrong guess - Useful for long pipelines Q: How to guess? A: constant; hint; branch prediction; random; ... # **Speculative Execution: Loops** ### Pipeline so far • "Guess" (predict) that the branch will not be taken ### We can do better! - Make prediction based on last branch - Predict "take branch" if last branch "taken" - Or Predict "do not take branch" if last branch "not taken" - Need one bit to keep track of last branch # **Speculative Execution: Loops** What is accuracy of branch predictor? Wrong twice per loop! Once on loop enter and exit We can do better with 2 bits While (r3 ≠ 0) {.... r3--;} Top: BEQZ r3, End J Top End: Sates is a 2 bit prediction scheme # **Summary** ### Control hazards - Is branch taken or not? - Performance penalty: stall and flush #### Reduce cost of control hazards - Move branch decision from Ex to ID - 2 nops to 1 nop - Delay slot - Compiler puts useful work in delay slot. ISA level. - Branch prediction - Correct. Great! - Wrong. Flush pipeline. Performance penalty # **Hazards Summary** ## Data hazards ## **Control hazards** ### Structural hazards - resource contention - so far: impossible because of ISA and pipeline design Q: what if we had a "copy 0(r3), 4(r3)" instruction, as the x86 does, or "add r4, r4, 0(r3)"? A: structural hazard