### **Processor**

CS 3410, Spring 2014
Computer Science
Cornell University

See P&H Chapter: 4.1-4.4, 1.4, Appendix A



# Iclicker

How many stages of a datapath are there in our single cycle MIPS design?

- A) 1
- B) 2
- C) 3
- D) 4
- E) 5











## **Takeaway**

The datapath for a MIPS processor has five stages:

1. Instruction Fetch
2. Instruction Decode
3. Execution (ALU)
4. Memory Access
5. Register Writeback

This five-stage datapath is used to execute all MIPS instructions.

# Iclicker

How many types of instructions are there in the MIPS ISA?

- A) 1
- B) 3
- C) 5
- D) 200
- E) 1000s



### **MIPS Instruction Functions**

### Arithmetic/Logical

- R-type: result and two source registers, shift amount
- I-type: 16-bit immediate with sign/zero extension

### **Memory Access**

- load/store between registers and memory
- word, half-word and byte operations

#### **Control flow**

- conditional branches: pc-relative addresses
- jumps: fixed offsets, register absolute

### **Arithmetic Instructions: Shift**

| op     | -      | rt     | rd     | shamt  | func   |
|--------|--------|--------|--------|--------|--------|
| 6 bits | 5 bits | 5 bits | 5 bits | 5 bits | 6 bits |

R-Type

| op  | func | mnemonic          | description                         |
|-----|------|-------------------|-------------------------------------|
| 0x0 | 0x0  | SLL rd, rt, shamt | R[rd] = R[rt] << shamt              |
| 0x0 | 0x2  | SRL rd, rt, shamt | R[rd] = R[rt] >>> shamt (zero ext.) |
| 0x0 | 0x3  | SRA rd, rt, shamt | R[rd] = R[rt] >> shamt (sign ext.)  |

ex: r8 = r4 \* 64 is the same as r8 = r4 << 6, written as SLL r8, r4, 6
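The difference between the three shifts is easiest to see on a value with the top bit set. The following is a minimal sketch in Python, assuming 32-bit registers held as unsigned integers; the function names `sll`, `srl`, and `sra` are illustrative and simply mirror the mnemonics in the table.

```python
MASK32 = 0xFFFFFFFF  # registers are 32 bits wide


def sll(rt, shamt):
    """SLL: shift left, dropping bits that fall off the top."""
    return (rt << shamt) & MASK32


def srl(rt, shamt):
    """SRL: shift right, filling the vacated high bits with zeros."""
    return (rt & MASK32) >> shamt


def sra(rt, shamt):
    """SRA: shift right, filling the vacated high bits with the sign bit."""
    rt &= MASK32
    signed = rt - (1 << 32) if rt & 0x80000000 else rt
    return (signed >> shamt) & MASK32


print(hex(sll(5, 6)))           # 5 * 64 = 320, prints 0x140
print(hex(srl(0x80000000, 4)))  # prints 0x8000000
print(hex(sra(0x80000000, 4)))  # prints 0xf8000000
```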





For ADDI, the immediate is sign-extended, so it can be positive or negative, and the instruction signals overflow. ADDIU also sign-extends the immediate, but it does not signal overflow.

ADD vs. ADDU: ADDU does not signal overflow.
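As a rough illustration of the note above, the sketch below models both instructions in Python. It assumes 32-bit registers and returns an overflow flag instead of raising a trap; the helper names (`sign_extend16`, `to_signed32`, `addi`, `addiu`) are made up for this example.

```python
def sign_extend16(imm):
    """Interpret the low 16 bits of imm as a two's-complement value."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def to_signed32(x):
    """Interpret the low 32 bits of x as a two's-complement value."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def addi(rs, imm):
    """ADDI: returns (result, overflow). Hardware would trap on overflow."""
    total = to_signed32(rs) + sign_extend16(imm)
    overflow = not (-(1 << 31) <= total < (1 << 31))
    return total & 0xFFFFFFFF, overflow


def addiu(rs, imm):
    """ADDIU: same sign-extended addition, but the result wraps silently."""
    return (rs + sign_extend16(imm)) & 0xFFFFFFFF


print(addi(5, 0xFFFF))              # immediate is -1, prints (4, False)
print(addi(0x7FFFFFFF, 1))          # prints (2147483648, True)
print(hex(addiu(0x7FFFFFFF, 1)))    # prints 0x80000000
```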









Alternatively, add an extra input to the ALU B mux from before/after the extender (or add an extra mux after the ALU).

# **Goals for today**

### MIPS Datapath

- Memory layout
- Control Instructions

### **Performance**

- How fast can we make it?
- CPI (Cycles Per Instruction)
- MIPS (Million Instructions Per Second)
- Clock Frequency






### **Memory Instructions**

| op     | rs     | rd     | offset  |
|--------|--------|--------|---------|
| 6 bits | 5 bits | 5 bits | 16 bits |

| op   | mnemonic           | description                         |
|------|--------------------|-------------------------------------|
| 0x20 | LB rd, offset(rs)  | R[rd] = sign_ext(Mem[offset+R[rs]]) |
| 0x24 | LBU rd, offset(rs) | R[rd] = zero_ext(Mem[offset+R[rs]]) |
| 0x21 | LH rd, offset(rs)  | R[rd] = sign_ext(Mem[offset+R[rs]]) |
| 0x25 | LHU rd, offset(rs) | R[rd] = zero_ext(Mem[offset+R[rs]]) |
| 0x23 | LW rd, offset(rs)  | R[rd] = Mem[offset+R[rs]]           |
| 0x28 | SB rd, offset(rs)  | Mem[offset+R[rs]] = R[rd]           |
| 0x29 | SH rd, offset(rs)  | Mem[offset+R[rs]] = R[rd]           |
| 0x2b | SW rd, offset(rs)  | Mem[offset+R[rs]] = R[rd]           |

ex: SW r1, 4(r5) and SB r1, 3(r5)

Half-word accesses must be half-word aligned and word accesses must be word aligned; otherwise an error is signaled. We will talk about traps and exceptions later.
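The alignment rule and the `sign_ext` / `zero_ext` operations from the table can be sketched in a few lines of Python. This is only an illustration: the function names are assumptions, and a `ValueError` stands in for the hardware exception that will be covered later.

```python
def check_aligned(addr, size):
    """Raise if addr is not a multiple of the access size (1, 2, or 4 bytes)."""
    if addr % size != 0:
        raise ValueError(f"unaligned {size}-byte access at {addr:#x}")


def sign_ext(value, bits):
    """Sign-extend a `bits`-wide value to 32 bits (as LB and LH do)."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return ((value ^ sign) - sign) & 0xFFFFFFFF


def zero_ext(value, bits):
    """Zero-extend a `bits`-wide value to 32 bits (as LBU and LHU do)."""
    return value & ((1 << bits) - 1)


print(hex(sign_ext(0x80, 8)))  # LB of byte 0x80, prints 0xffffff80
print(hex(zero_ext(0x80, 8)))  # LBU of byte 0x80, prints 0x80
check_aligned(8, 4)            # a word access at address 8 is fine
```

Calling `check_aligned(6, 4)` raises, because 6 is not a multiple of the 4-byte word size.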



The term "endian" comes from Gulliver's Travels.

Examples (big/little endian):

r5 contains 5 (0x00000005)

| instruction   | result       |
|---------------|--------------|
| SB r5, 2(r0)  |              |
| LB r6, 2(r0)  | R[r6] = 0x05 |
| SW r5, 8(r0)  |              |
| LB r7, 8(r0)  | R[r7] = 0x00 |
| LB r8, 11(r0) | R[r8] = 0x05 |

| address    | value |
|------------|-------|
| 0x00000000 |       |
| 0x00000001 |       |
| 0x00000002 | 0x05  |
| 0x00000003 |       |
| 0x00000004 |       |
| 0x00000005 |       |
| 0x00000006 |       |
| 0x00000007 |       |
| 0x00000008 | 0x00  |
| 0x00000009 | 0x00  |
| 0x0000000a | 0x00  |
| 0x0000000b | 0x05  |

Big endian means the most significant byte (MSB) is stored first, at the lowest address.
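To see how byte order changes what a later LB returns, the sketch below replays the store/load sequence from the example on a small Python `bytearray`. The helpers `sb`, `sw`, and `lb` are illustrative stand-ins for the instructions, and the 16-byte memory is an assumption made for the demo.

```python
mem = bytearray(16)  # tiny byte-addressable memory, all zeros


def sb(value, addr):
    """Store the low byte of value at addr."""
    mem[addr] = value & 0xFF


def sw(value, addr, byteorder):
    """Store a 32-bit word at addr in the given byte order."""
    mem[addr:addr + 4] = (value & 0xFFFFFFFF).to_bytes(4, byteorder)


def lb(addr):
    """Load one byte and sign-extend it to 32 bits."""
    b = mem[addr]
    return (b - 0x100 if b & 0x80 else b) & 0xFFFFFFFF


r5 = 5
sb(r5, 2)
r6 = lb(2)
sw(r5, 8, "big")
r7, r8 = lb(8), lb(11)
print(r6, r7, r8)        # big endian: prints 5 0 5

sw(r5, 8, "little")
print(lb(8), lb(11))     # little endian: prints 5 0
```

With a big-endian store the least significant byte (0x05) lands at address 11; with a little-endian store it lands at address 8.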





Why should the offset be shifted left by 2? To keep the jump address word aligned, and to avoid wasting two bits that we \*know\* are going to be 0.

#### **Control Flow: Absolute Jump**

00001010100001001000011000000011

| op     | immediate |
|--------|-----------|
| 6 bits | 26 bits   |

J-Type

| op  | mnemonic | description                                      |
|-----|----------|--------------------------------------------------|
| 0x2 | J target | $PC = (PC+4)_{31..28} \bullet target \bullet 00$ |

Where • is used to concatenate

Absolute addressing for jumps

- (PC+4)<sub>31..28</sub> will be the same
- Jump from 0x30000000 to 0x20000000?
- But: jumps from 0x2FFFFFFF to 0x3xxxxxxx are possible, but not the reverse
- Trade-off: out-of-region jumps vs. 32-bit instruction encoding

MIPS Quirk:

- jump targets computed using *already incremented* PC
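The concatenation can be written out with masks and shifts. The following Python sketch is one way to model it; `decode_j` and `jump_target` are hypothetical helper names, and the word-aligned address 0x2FFFFFFC is used here as an assumed stand-in for the last instruction of the 0x2xxxxxxx region.

```python
def decode_j(word):
    """Split a 32-bit J-type instruction into (op, 26-bit target field)."""
    return word >> 26, word & 0x03FFFFFF


def jump_target(pc, target26):
    """PC = (PC+4)[31..28] concatenated with target and two zero bits."""
    upper = (pc + 4) & 0xF0000000          # top 4 bits of the incremented PC
    return upper | ((target26 & 0x03FFFFFF) << 2)


op, target = decode_j(0b00001010100001001000011000000011)
print(hex(op), hex(target))                 # prints 0x2 0x2848603

# The top 4 bits come from PC+4, so a jump stays within the same region...
print(hex(jump_target(0x30000000, 0x40)))   # prints 0x30000100

# ...except from the last word of a region, where PC+4 is already
# in the next region.
print(hex(jump_target(0x2FFFFFFC, 0x40)))   # prints 0x30000100
```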





| op     | rs     | -      | -      | -      | func   |
|--------|--------|--------|--------|--------|--------|
| 6 bits | 5 bits | 5 bits | 5 bits | 5 bits | 6 bits |

R-Type

| op  | func | mnemonic | description |
|-----|------|----------|-------------|
| 0x0 | 0x08 | JR rs    | PC = R[rs]  |

ex: JR r3



# **Examples**

E.g. Use Jump or Jump Register instruction to jump to 0xabcd1234

But what about a jump based on a condition? Assume $0 \le r3 \le 1$:

- if (r3 == 0) jump to 0xdecafe00
- else jump to 0xabcd1234






In the book, the ALU is used to determine the branch comparison. We are choosing to do it separately.

In any case, the ALU cannot be used for both the branch comparison and the PC computation; we can use it at most once per instruction. The book uses it for the branch comparison.

### **Control Flow: More Branches**

| op     | rs     | subop  | offset  |
|--------|--------|--------|---------|
| 6 bits | 5 bits | 5 bits | 16 bits |

almost I-Type; the offset is signed

| op  | subop | mnemonic        | description                                |
|-----|-------|-----------------|--------------------------------------------|
| 0x1 | 0x0   | BLTZ rs, offset | if R[rs] < 0 then PC = PC+4+ (offset<<2)   |
| 0x1 | 0x1   | BGEZ rs, offset | if R[rs] ≥ 0 then PC = PC+4+ (offset<<2)   |
| 0x6 | 0x0   | BLEZ rs, offset | if R[rs] ≤ 0 then PC = PC+4+ (offset<<2)   |
| 0x7 | 0x0   | BGTZ rs, offset | if R[rs] > 0 then PC = PC+4+ (offset<<2)   |

ex: BGEZ r5, 2

if (R[r5] ≥ 0) then PC = PC+4 + 8
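The next-PC computation for BGEZ can be sketched as follows. This is an illustrative Python model, not the datapath itself; `sign_extend16`, `to_signed32`, and `bgez_next_pc` are assumed names, and the starting PC of 0x1000 is arbitrary.

```python
def sign_extend16(imm):
    """Interpret the low 16 bits of imm as a two's-complement value."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def to_signed32(x):
    """Interpret the low 32 bits of x as a two's-complement value."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def bgez_next_pc(pc, rs_value, offset16):
    """BGEZ: branch to PC+4+(offset<<2) if R[rs] >= 0, else fall through."""
    if to_signed32(rs_value) >= 0:
        return (pc + 4 + (sign_extend16(offset16) << 2)) & 0xFFFFFFFF
    return (pc + 4) & 0xFFFFFFFF


print(hex(bgez_next_pc(0x1000, 7, 2)))           # taken: prints 0x100c
print(hex(bgez_next_pc(0x1000, 0xFFFFFFFF, 2)))  # R[rs] = -1: prints 0x1004
print(hex(bgez_next_pc(0x1000, 7, 0xFFFF)))      # offset -1: prints 0x1000
```

Because the offset is sign-extended before the shift, a negative offset branches backwards, which is how loops are encoded.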



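#### **Control Flow: Jump and Link**

Why? Function/procedure calls

00001100000001001000011000000010

| op     | immediate |
|--------|-----------|
| 6 bits | 26 bits   |

J-Type

| op  | mnemonic   | description                                                                                                          |
|-----|------------|----------------------------------------------------------------------------------------------------------------------|
| 0x3 | JAL target | r31 = PC+8 (+8 due to branch delay slot, discussed later); $PC = (PC+4)_{31..28} \bullet target \bullet 00$           |

ex: JAL 0x1000000

r31 = PC+8; $PC = (PC+4)_{31..28} \bullet 0x4000000$

For comparison:

| op  | mnemonic | description                                      |
|-----|----------|--------------------------------------------------|
| 0x2 | J target | $PC = (PC+4)_{31..28} \bullet target \bullet 00$ |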
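A small Python sketch of the JAL semantics is given here as an illustration. The register file is modelled as a plain list, `jal` is a made-up helper name, and the starting PC of 0x00400000 is an arbitrary assumption chosen so that the top four bits are zero.

```python
def jal(pc, target26, regs):
    """JAL: save the return address in r31, then jump like J."""
    regs[31] = (pc + 8) & 0xFFFFFFFF            # +8 skips the delay slot
    upper = (pc + 4) & 0xF0000000               # top 4 bits of PC+4
    return upper | ((target26 & 0x03FFFFFF) << 2)


regs = [0] * 32
new_pc = jal(0x00400000, 0x1000000, regs)
print(hex(new_pc))     # target field 0x1000000 << 2, prints 0x4000000
print(hex(regs[31]))   # return address, prints 0x400008
```

A later `JR r31` returns to the saved address, which is what makes JAL suitable for function calls.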



## **Goals for today**

#### MIPS Datapath

- Memory layout
- Control Instructions

### Performance

- How to get it?
- CPI (Cycles Per Instruction)
- MIPS (Million Instructions Per Second)
- Clock Frequency

### **Pipelining**

Latency vs throughput

## Questions

How do we measure performance?
What is the performance of a single cycle CPU?

How do I get performance?

See: P&H 1.4
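The usual way to relate these metrics is execution time = instruction count × CPI / clock frequency, with the MIPS rating equal to clock frequency / (CPI × 10^6). The Python sketch below just evaluates those two relations; the 100 MHz clock and one-million-instruction program are made-up numbers for illustration, while CPI = 1 reflects a single-cycle design.

```python
def exec_time_s(instr_count, cpi, clock_hz):
    """Execution time in seconds = instructions * cycles/instruction / (cycles/second)."""
    return instr_count * cpi / clock_hz


def mips_rating(cpi, clock_hz):
    """Million Instructions Per Second = clock rate / (CPI * 10^6)."""
    return clock_hz / (cpi * 1e6)


# Hypothetical single-cycle CPU: CPI = 1, 100 MHz clock, 1,000,000 instructions.
print(exec_time_s(1_000_000, 1, 100_000_000))  # prints 0.01
print(mips_rating(1, 100_000_000))             # prints 100.0
```

Halving the CPI or doubling the clock frequency each halves the execution time, which is why both appear in the goals list.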

# Which instruction has the longest path?

- A) LW
- B) SW
- C) ADD/SUB/AND/OR/etc
- D) BEQ
- E) J



# **Performance**

How do I get it?

Parallelism

Pipelining

Both!

### Performance: Aside

Speed of a circuit is affected by the number of gates in series (on the *critical path* or the *deepest level of logic*)



# **4-bit Ripple Carry Adder**



- Carry ripples from lsb to msb
- First full adder, 2 gate delay
- Second full adder, 2 gate delay
- ...
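The rippling behaviour can be mimicked in Python by chaining 1-bit full adders, each waiting for the carry of the one before it. This is a sketch only: `full_adder`, `ripple_add`, and `ripple_delay` are illustrative names, and the delay model simply counts 2 gate delays per stage as stated above.

```python
def full_adder(a, b, cin):
    """Add three 1-bit values; return (sum bit, carry out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def ripple_add(a, b, width=4):
    """Add two width-bit numbers one bit at a time, lsb to msb."""
    carry = 0
    result = 0
    for i in range(width):
        # Stage i cannot finish until stage i-1 has produced its carry.
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        result |= s << i
    return result, carry


def ripple_delay(width):
    """Gate delays on the carry chain, assuming 2 per full adder."""
    return 2 * width


print(ripple_add(0b0111, 0b0001))  # prints (8, 0)
print(ripple_add(0b1111, 0b0001))  # prints (0, 1)
print(ripple_delay(4))             # prints 8
```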



## **Adding**

Main ALU, slows us down

Does it need to be this slow?

#### Observations

- Have to wait for Cin
- Can we compute in parallel in some way?
- CLA: carry look-ahead adder

# **Carry Look Ahead Logic**

Can we reason independent of Cin?

• Just based on (A,B) only

When is Cout == 1, irrespective of Cin?

If Cin == 1, when is Cout also == 1?



# **1-bit Adder with Carry**



### Full Adder

- Adds three 1-bit numbers
- Computes 1-bit result and 1-bit carry
- Can be cascaded

| Α | В | C <sub>in</sub> | C <sub>out</sub> | S |
|---|---|-----------------|------------------|---|
| 0 | 0 | 0               | 0                | 0 |
| 0 | 1 | 0               | 0                | 1 |
| 1 | 0 | 0               | 0                | 1 |
| 1 | 1 | 0               | 1                | 0 |
| 0 | 0 | 1               | 0                | 1 |
| 0 | 1 | 1               | 1                | 0 |
| 1 | 0 | 1               | 1                | 0 |
| 1 | 1 | 1               | 1                | 1 |

### 1-bit CLA adder



Create two terms: propagator, generator

g = 1 generates Cout: g = AB

• Irrespective of Cin

p = 1 propagates Cin to Cout: p = A + B

p and g are generated in 1 gate delay

The result bit R is available 2 gate delays after we get Cin
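To make the look-ahead idea concrete, the Python sketch below computes every carry as a flat sum of products over g, p, and the initial Cin, so no carry depends on a neighbouring carry. The function name `cla_add` and the 4-bit default width are assumptions for this example; software evaluates the terms one after another, whereas hardware would evaluate them in parallel.

```python
def cla_add(a, b, cin=0, width=4):
    """Add two width-bit numbers using generate/propagate terms."""
    a_bits = [(a >> i) & 1 for i in range(width)]
    b_bits = [(b >> i) & 1 for i in range(width)]
    g = [x & y for x, y in zip(a_bits, b_bits)]  # g = AB
    p = [x | y for x, y in zip(a_bits, b_bits)]  # p = A + B

    def carry_into(i):
        # Cin propagated through bits 0..i-1, or a carry generated at
        # bit j and propagated through bits j+1..i-1.
        terms = [cin & int(all(p[0:i]))]
        for j in range(i):
            terms.append(g[j] & int(all(p[j + 1:i])))
        return int(any(terms))

    carries = [carry_into(i) for i in range(width + 1)]
    total = 0
    for i in range(width):
        total |= (a_bits[i] ^ b_bits[i] ^ carries[i]) << i
    return total, carries[width]


print(cla_add(0b0111, 0b0001))  # prints (8, 0)
print(cla_add(0b1111, 0b0001))  # prints (0, 1)
```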
















### Performance

Ripple carry adder vs. carry look-ahead adder for 8 bits

• 2 x 8 = 16 gate delays vs. 5 gate delays
