## Processor

Prof. Hakim Weatherspoon

CS 3410, Spring 2015

Computer Science, Cornell University

See P&H Chapter: 4.1-4.4, 1.6, Appendix B

## Announcements

Project Partner finding assignment on CMS

No official office hours over break

Lab1 due tomorrow

HW1 Help Sessions Wed, Feb 18 and Sun, Feb 21

### Announcements

- Make sure to go to <u>your</u> Lab Section this week
- Lab2 due in class this week (it is *not* homework)
- Lab1: Completed Lab1 due *tomorrow* Friday, Feb 13<sup>th</sup>, *before* winter break
- Note, a <u>Design Document</u> is due when you submit Lab1 final circuit
- Work **alone**

#### Save your work!

- *Save often*. Verify file is non-zero. Periodically save to Dropbox, email.
- Beware of MacOSX 10.5 (leopard) and 10.6 (snow-leopard)

#### Homework1 is out

- Due a week before prelim1, Monday, February 23rd
- Work on problems incrementally, as we cover them in lecture (i.e. part 1)
- Office Hours for help
- Work **alone**

Work alone, **BUT** use your resources

- Lab Section, Piazza.com, Office Hours
- Class notes, book, Sections, CSUGLab

### Announcements

Check online syllabus/schedule

- http://www.cs.cornell.edu/Courses/CS3410/2015sp/schedule.html
- Slides and Reading for lectures
- Office Hours
- Pictures of all TAs
- Homework and Programming Assignments
- Dates to keep in Mind
  - Prelims: Tue Mar 3rd and Thur April 30th
  - Lab 1: Due this Friday, Feb 13th before Winter break
  - Proj2: Due Thur Mar 26th before Spring break
  - Final Project: Due when final would be (not known until Feb 14t

### Schedule is subject to change

# Collaboration, Late, Re-grading Policies

"Black Board" Collaboration Policy

- Can discuss approach together on a "black board"
- Leave and write up solution independently
- Do not copy solutions

Late Policy

- Each person has a total of *four* "slip days"
- Max of *two* slip days for any individual assignment
- Slip days deducted first for *any* late assignment, cannot selectively apply slip days
- For projects, slip days are deducted from all partners
- <u>25%</u> deducted per day late after slip days are exhausted

**Regrade policy** 

- Submit written request to lead TA, and lead TA will pick a different grader
- Submit another written request, lead TA will regrade directly
- Submit yet another written request for professor to regrade.

## **Full Datapath**



# Iclicker

How many stages of a datapath are there in our single cycle MIPS design?



## Stages of datapath (1/5)



## Stages of datapath (2/5)



## Stages of datapath (3/5)



## Stages of datapath (4/5)



## Stages of datapath (5/5)



## Takeaway

The datapath for a MIPS processor has five stages:

1. Instruction Fetch
2. Instruction Decode
3. Execution (ALU)
4. Memory Access
5. Register Writeback

This five stage datapath is used to execute all MIPS instructions
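The five stages can be traced in a toy model. This is only a sketch under stated assumptions: it handles three instructions (ADD, LW, SW), uses Python dictionaries as stand-ins for instruction and data memory, and names the second register field `rt` as in the standard MIPS encoding. The function name `step` and the sample program are illustrative, not part of the course material.

```python
# Toy model of the five stages for ADD, LW and SW only.
# Assumptions: word-aligned accesses, no branches, no exceptions.

def sign_extend16(value):
    """Interpret a 16-bit field as a signed integer."""
    return value - 0x10000 if value & 0x8000 else value

def step(pc, regs, imem, dmem):
    # 1. Instruction Fetch: read the 32-bit word at PC.
    inst = imem[pc]
    # 2. Instruction Decode: split the fields and read the register file.
    op = inst >> 26
    rs = (inst >> 21) & 0x1F
    rt = (inst >> 16) & 0x1F
    rd = (inst >> 11) & 0x1F
    func = inst & 0x3F
    imm = sign_extend16(inst & 0xFFFF)
    a, b = regs[rs], regs[rt]
    # 3. Execution (ALU): add two registers, or a register and the offset.
    if op == 0x00 and func == 0x20:       # ADD rd, rs, rt
        alu, dest = (a + b) & 0xFFFFFFFF, rd
    elif op in (0x23, 0x2B):              # LW / SW: address = R[rs] + offset
        alu, dest = (a + imm) & 0xFFFFFFFF, rt
    else:
        raise NotImplementedError(hex(inst))
    # 4. Memory Access: only loads and stores touch data memory.
    result = alu
    if op == 0x23:
        result = dmem.get(alu, 0)
    elif op == 0x2B:
        dmem[alu] = b
        dest = None                       # stores write no register
    # 5. Register Writeback: skipped for stores and for r0.
    if dest:
        regs[dest] = result
    return pc + 4

regs = [0] * 32
regs[5] = 0x100
dmem = {0x104: 42}
imem = {
    0: 0x8CA10004,  # LW  r1, 4(r5)
    4: 0x00211020,  # ADD r2, r1, r1
    8: 0xACA20008,  # SW  r2, 8(r5)
}
pc = 0
for _ in range(3):
    pc = step(pc, regs, imem, dmem)
```

Every instruction passes through the same five steps; an instruction that does not need a stage (ADD has no memory access, SW has no writeback) simply does nothing there.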

# Iclicker

There are how many types of instructions in the MIPS ISA?



## **MIPS Instruction Functions**

### Arithmetic/Logical

- R-type: result and two source registers, shift amount
- I-type: 16-bit immediate with sign/zero extension

#### **Memory Access**

- load/store between registers and memory
- word, half-word and byte operations

### **Control flow**

- conditional branches: pc-relative addresses
- jumps: fixed offsets, register absolute

### **MIPS** instructions

All MIPS instructions are 32 bits long and use one of 3 formats (R-type, I-type, J-type)
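Because every instruction is one 32-bit word, the fields of all three formats can be extracted with shifts and masks. The sketch below assumes the standard MIPS field layout; the function name `decode` and the dictionary keys are illustrative, and `rt` is the standard name for the second register field.

```python
def decode(inst):
    """Split a 32-bit MIPS instruction word into the fields of all three formats."""
    return {
        "op":     (inst >> 26) & 0x3F,    # 6 bits, present in every format
        "rs":     (inst >> 21) & 0x1F,    # 5 bits, R-type and I-type
        "rt":     (inst >> 16) & 0x1F,    # 5 bits, R-type and I-type
        "rd":     (inst >> 11) & 0x1F,    # 5 bits, R-type only
        "shamt":  (inst >> 6) & 0x1F,     # 5 bits, R-type only
        "func":   inst & 0x3F,            # 6 bits, R-type only
        "imm16":  inst & 0xFFFF,          # 16 bits, I-type only
        "target": inst & 0x3FFFFFF,       # 26 bits, J-type only
    }

add_fields = decode(0x00211020)   # ADD r2, r1, r1 (R-type)
jump_fields = decode(0x08000001)  # J 0x1 (J-type)
```

Which fields are meaningful depends on the opcode: the decoder always extracts all of them and the control logic selects the ones that apply.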



# Goals for today

### **MIPS** Datapath

- Memory layout
- Control Instructions

### Performance

- How fast can we make it?
- CPI (Cycles Per Instruction)
- MIPS (Millions of Instructions Per Second)
- Clock Frequency

# **MIPS Instruction Types**

### Arithmetic/Logical

- R-type: result and two source registers, shift amount
- I-type: 16-bit immediate with sign/zero extension

#### **Memory Access**

- load/store between registers and memory
- word, half-word and byte operations

### **Control flow**

- conditional branches: pc-relative addresses
- jumps: fixed offsets, register absolute

# **Memory Instructions**



ex: Mem[4+r5] = r1 # SW r1, 4(r5)

# **Memory Operations**



# **Memory Instructions**

| op     | rs     | rd     | offset  |
|--------|--------|--------|---------|
| 6 bits | 5 bits | 5 bits | 16 bits |

| op   | mnemonic           | description                         |
|------|--------------------|-------------------------------------|
| 0x20 | LB rd, offset(rs)  | R[rd] = sign_ext(Mem[offset+R[rs]]) |
| 0x24 | LBU rd, offset(rs) | R[rd] = zero_ext(Mem[offset+R[rs]]) |
| 0x21 | LH rd, offset(rs)  | R[rd] = sign_ext(Mem[offset+R[rs]]) |
| 0x25 | LHU rd, offset(rs) | R[rd] = zero_ext(Mem[offset+R[rs]]) |
| 0x23 | LW rd, offset(rs)  | R[rd] = Mem[offset+R[rs]]           |
| 0x28 | SB rd, offset(rs)  | Mem[offset+R[rs]] = R[rd]           |
| 0x29 | SH rd, offset(rs)  | Mem[offset+R[rs]] = R[rd]           |
| 0x2b | SW rd, offset(rs)  | Mem[offset+R[rs]] = R[rd]           |
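The difference between the signed and unsigned loads is only in how a byte or halfword is widened to 32 bits. A minimal sketch, assuming 32-bit registers; the helper names mirror the `sign_ext` and `zero_ext` used in the table, and the sample values are made up for illustration.

```python
def zero_ext(value, bits):
    """Widen a `bits`-wide value to 32 bits by filling the upper bits with 0."""
    return value & ((1 << bits) - 1)

def sign_ext(value, bits):
    """Widen a `bits`-wide value to 32 bits by copying its sign bit upward."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return ((value ^ sign_bit) - sign_bit) & 0xFFFFFFFF

lb = sign_ext(0x80, 8)       # LB of the byte 0x80
lbu = zero_ext(0x80, 8)      # LBU of the same byte
lh = sign_ext(0x8001, 16)    # LH of the halfword 0x8001
lhu = zero_ext(0x8001, 16)   # LHU of the same halfword
```

The same memory byte 0x80 therefore reads as 0xFFFFFF80 with LB and as 0x00000080 with LBU.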

## Endianness

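Endianness: Ordering of bytes within a memory word

Little Endian = least significant part first (MIPS, x86)

Big Endian = most significant part first (MIPS, networks)

| address        | 1000       | 1001 | 1002   | 1003 |
|----------------|------------|------|--------|------|
| as 4 bytes     | 0x12       | 0x34 | 0x56   | 0x78 |
| as 2 halfwords | 0x1234     |      | 0x5678 |      |
| as 1 word      | 0x12345678 |      |        |      |

Examples (big/little endian). The values shown for r7 and r8 are for big endian; on a little-endian machine they swap.

- r5 contains 5 (0x00000005)
- SB r5, 2(r0)
- LB r6, 2(r0) # R[r6] = 0x05
- SW r5, 8(r0)
- LB r7, 8(r0) # R[r7] = 0x00
- LB r8, 11(r0) # R[r8] = 0x05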

| value | address    |
|-------|------------|
|       | 0x00000000 |
|       | 0x00000001 |
| 0x05  | 0x00000002 |
|       | 0x00000003 |
|       | 0x00000004 |
|       | 0x00000005 |
|       | 0x00000006 |
|       | 0x00000007 |
| 0x00  | 0x00000008 |
| 0x00  | 0x00000009 |
| 0x00  | 0x0000000a |
| 0x05  | 0x0000000b |
|       | ...        |
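Byte order can be checked with Python's built-in `int.to_bytes`. The sketch below models memory as a `bytearray`; the helper `store_word` is hypothetical and stands in for SW, while indexing the array stands in for a byte load of a non-negative value.

```python
word = 0x12345678
big = word.to_bytes(4, "big")        # byte at the lowest address is 0x12
little = word.to_bytes(4, "little")  # byte at the lowest address is 0x78

def store_word(mem, addr, value, byteorder):
    """Hypothetical SW: write a 32-bit value into memory in the given byte order."""
    mem[addr:addr + 4] = value.to_bytes(4, byteorder)

# SW r5, 8(r0) with r5 = 5, on a big-endian and on a little-endian machine.
big_mem = bytearray(16)
store_word(big_mem, 8, 5, "big")
little_mem = bytearray(16)
store_word(little_mem, 8, 5, "little")

# LB from address 8 and from address 11 on each machine.
big_r7, big_r8 = big_mem[8], big_mem[11]
little_r7, little_r8 = little_mem[8], little_mem[11]
```

On the big-endian memory the value 0x05 lands at address 11, matching the table above; on the little-endian memory it lands at address 8.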

# **MIPS Instruction Types**

### Arithmetic/Logical

- R-type: result and two source registers, shift amount
- I-type: 16-bit immediate with sign/zero extension

#### **Memory Access**

- load/store between registers and memory
- word, half-word and byte operations

### Control flow

- conditional branches: pc-relative addresses
- jumps: fixed offsets, register absolute

## **Control Flow: Absolute Jump**

| op     | immediate |
|--------|-----------|
| 6 bits | 26 bits   |

J-Type

| op  | Mnemonic | Description                                |
|-----|----------|--------------------------------------------|
| 0x2 | J target | PC = (PC+4)<sub>31..28</sub> • target • 00 |

| (PC+4)<sub>31..28</sub> | target                           | 00     |
|-------------------------|----------------------------------|--------|
| 4 bits                  | 26 bits                          | 2 bits |
| (PC+4)<sub>31..28</sub> | 01 0000 0000 0000 0000 0000 0001 | 00     |

ex: target = 0x1000001

PC = ((PC+4) & 0xf0000000) | 0x04000004
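The concatenation can be written with a mask, a shift and an OR. A minimal sketch; the function name `jump_target` and the sample PC values are illustrative.

```python
def jump_target(pc, target26):
    """New PC for `J target`: top 4 bits of PC+4, then the 26-bit target, then 00."""
    upper = (pc + 4) & 0xF0000000          # (PC+4) bits 31..28
    lower = (target26 & 0x3FFFFFF) << 2    # 26-bit target followed by two zero bits
    return upper | lower

low_region = jump_target(0x00400000, 0x1000001)
high_region = jump_target(0x30000000, 0x1000001)
```

The same 26-bit target lands at 0x04000004 when the jump sits in the lowest 256 MB region and at 0x34000004 when it sits in the region starting at 0x30000000.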

## **Control Flow: Absolute Jump**

| op     | immediate |
|--------|-----------|
| 6 bits | 26 bits   |

J-Type

| op  | Mnemonic | Description                                |
|-----|----------|--------------------------------------------|
| 0x2 | J target | PC = (PC+4)<sub>31..28</sub> • target • 00 |

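Absolute addressing for jumps: (PC+4)<sub>31..28</sub> will be the same

- Jump from 0x30000000 to 0x20000000? Not possible with J, because the top four address bits differ.
  - But: a jump at 0x2FFFFFFC to 0x3xxxxxxx is possible, because PC+4 is already 0x30000000; the reverse is not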
- Trade-off: out-of-region jumps vs. 32-bit instruction encoding

### MIPS Quirk:

- jump targets computed using *already incremented* PC
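Whether a destination is reachable with J follows directly from this rule: the destination must share its top four bits with PC+4. A minimal sketch; the function name `reachable_by_j` is illustrative.

```python
def reachable_by_j(pc, dest):
    """True if a J instruction located at `pc` can reach the address `dest`."""
    next_pc = (pc + 4) & 0xFFFFFFFF            # targets use the incremented PC
    same_region = (next_pc >> 28) == (dest >> 28)
    return dest % 4 == 0 and same_region

cross_region = reachable_by_j(0x30000000, 0x20000000)
last_word_forward = reachable_by_j(0x2FFFFFFC, 0x30000000)
first_word_backward = reachable_by_j(0x30000000, 0x2FFFFFFC)
```

The jump in the last word of a region can reach the next region because its incremented PC already lies there, while a jump in the first word of a region cannot go back.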

# Absolute Jump



### **Control Flow: Jump Register**

| op     | rs     | -      | -      | -      | func   |
|--------|--------|--------|--------|--------|--------|
| 6 bits | 5 bits | 5 bits | 5 bits | 5 bits | 6 bits |

R-Type

| op  | func | mnemonic | description |
|-----|------|----------|-------------|
| 0x0 | 0x08 | JR rs    | PC = R[rs]  |

ex: JR r3

# Jump Register



ex: JR r3

| ор  | func | mnemonic | description |
|-----|------|----------|-------------|
| 0x0 | 0x08 | JR rs    | PC = R[rs]  |

## Examples

E.g. Use Jump or Jump Register instruction to jump to 0xabcd1234

But, what about a jump based on a condition?

- assume 0 <= r3 <= 1
- if (r3 == 0) jump to 0xdecafe00
- else jump to 0xabcd1234

### Control Flow: Branches

000100 00101 00001 0000000000000011 (op, rs, rd, offset)



ex: BEQ r5, r1, 3

If (R[r5] == R[r1]) then PC = PC+4 + 12 (i.e. 12 == 3<<2)
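The encoding and the PC-relative target of this example can be reproduced in a few lines. A minimal sketch, assuming the standard BEQ opcode 0x04 and ignoring the branch delay slot; the function names and the sample PC are illustrative.

```python
def encode_beq(rs, rt, offset):
    """Assemble BEQ: op (6 bits), two register fields (5 bits each), 16-bit offset."""
    return (0x04 << 26) | (rs << 21) | (rt << 16) | (offset & 0xFFFF)

def branch_target(pc, offset16):
    """Taken-branch PC: PC+4 plus the sign-extended offset shifted left by 2."""
    signed = offset16 - 0x10000 if offset16 & 0x8000 else offset16
    return (pc + 4 + (signed << 2)) & 0xFFFFFFFF

word = encode_beq(5, 1, 3)              # BEQ r5, r1, 3
forward = branch_target(0x1000, 3)      # PC+4 + 12
backward = branch_target(0x1000, 0xFFFF)  # offset -1 branches to the BEQ itself
```

The offset counts instructions, not bytes, which is why it is shifted left by 2 before being added to PC+4.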

# **Control Flow: Branches**



## **Control Flow: More Branches**

Conditional Jumps (cont.)

000001 00101 00001 0000000000000010 (op, rs, subop, offset)

| op     | rs     | subop  | offset  |
|--------|--------|--------|---------|
| 6 bits | 5 bits | 5 bits | 16 bits |

Almost I-Type; offsets are signed.

| op  | subop | mnemonic        | description                               |
|-----|-------|-----------------|-------------------------------------------|
| 0x1 | 0x0   | BLTZ rs, offset | if R[rs] < 0 then PC = PC+4 + (offset<<2) |
| 0x1 | 0x1   | BGEZ rs, offset | if R[rs] ≥ 0 then PC = PC+4 + (offset<<2) |
| 0x6 | 0x0   | BLEZ rs, offset | if R[rs] ≤ 0 then PC = PC+4 + (offset<<2) |
| 0x7 | 0x0   | BGTZ rs, offset | if R[rs] > 0 then PC = PC+4 + (offset<<2) |

ex: BGEZ r5, 2

If (R[r5] ≥ 0) then PC = PC+4 + 8
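These comparisons treat the register as a signed two's-complement value, which matters when the top bit is set. A minimal sketch that ignores the branch delay slot; the names `to_signed32`, `CONDITIONS` and `next_pc` are illustrative.

```python
def to_signed32(value):
    """Reinterpret a 32-bit register value as a signed two's-complement integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value

# Condition tested by each of the four compare-with-zero branches.
CONDITIONS = {
    "BLTZ": lambda v: v < 0,
    "BGEZ": lambda v: v >= 0,
    "BLEZ": lambda v: v <= 0,
    "BGTZ": lambda v: v > 0,
}

def next_pc(mnemonic, pc, rs_value, offset):
    """Next PC; `offset` is the already sign-extended offset in instructions."""
    taken = CONDITIONS[mnemonic](to_signed32(rs_value))
    return (pc + 4 + (offset << 2)) & 0xFFFFFFFF if taken else (pc + 4) & 0xFFFFFFFF

bgez_taken = next_pc("BGEZ", 0x100, 0, 2)             # 0 >= 0, so PC+4+8
bltz_taken = next_pc("BLTZ", 0x100, 0xFFFFFFFF, 2)    # 0xFFFFFFFF is -1
bgtz_not_taken = next_pc("BGTZ", 0x100, 0, 2)         # 0 > 0 is false
```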

# **Control Flow: More Branches**



# **Control Flow: Jump and Link**

| op     | immediate |
|--------|-----------|
| 6 bits | 26 bits   |

J-Type

| op  | mnemonic   | description                                                                                                 |
|-----|------------|-------------------------------------------------------------------------------------------------------------|
| 0x3 | JAL target | r31 = PC+8 (+8 due to branch delay slot, discuss later)<br>PC = (PC+4)<sub>31..28</sub> • target • 00 |

ex: JAL 0x1000001

- r31 = PC+8
- PC = (PC+4)<sub>31..28</sub> • 0x4000004

For comparison:

| op  | mnemonic | description                                |
|-----|----------|--------------------------------------------|
| 0x2 | J target | PC = (PC+4)<sub>31..28</sub> • target • 00 |
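JAL differs from J only in saving a return address, which a later JR r31 can use to come back. A minimal sketch of that call and return pair; the function names, the register list and the sample PC are illustrative, and the delay slot itself is not modeled.

```python
def jal(pc, target26, regs):
    """JAL: save the return address in r31, then jump like J."""
    regs[31] = (pc + 8) & 0xFFFFFFFF       # skips the JAL and its delay slot
    return ((pc + 4) & 0xF0000000) | ((target26 & 0x3FFFFFF) << 2)

def jr(rs, regs):
    """JR rs: PC = R[rs]."""
    return regs[rs]

regs = [0] * 32
call_pc = 0x00400010
callee_pc = jal(call_pc, 0x1000001, regs)  # call
return_pc = jr(31, regs)                   # return with JR r31
```

After the return, execution resumes two instructions past the JAL, since the instruction in the delay slot has already executed.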

# Jump and Link



## Goals for today

#### **MIPS** Datapath

- Memory layout
- Control Instructions

#### Performance

- How to get it?
- CPI (Cycles Per Instruction)
- MIPS (Millions of Instructions Per Second)
- Clock Frequency

Pipelining

Latency vs throughput

### Questions

How do we measure performance? What is the performance of a single cycle CPU?

How do I get performance?

See: P&H 1.6
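The three quantities combine in the standard CPU time relation: execution time = instruction count x CPI / clock frequency. A minimal sketch; the function names and the sample numbers (one million instructions at 100 MHz) are made up for illustration.

```python
def cpu_time_seconds(instruction_count, cpi, clock_hz):
    """Execution time = instruction count x cycles per instruction x cycle time."""
    return instruction_count * cpi / clock_hz

def mips_rating(cpi, clock_hz):
    """Millions of instructions per second = clock rate / (CPI x 10^6)."""
    return clock_hz / (cpi * 1e6)

# A single-cycle CPU has CPI = 1 by construction.
single_cycle_time = cpu_time_seconds(1_000_000, 1.0, 100e6)
single_cycle_mips = mips_rating(1.0, 100e6)
```

For a single-cycle design the only lever is the clock period, which has to be long enough for the slowest instruction.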

### What instruction has the longest path

A) LW

B) SW

C) ADD/SUB/AND/OR/etc

D) BEQ

E) J

# **Memory Operations**

[Figure: single-cycle datapath executing a load (PC, program memory, register file, ALU, data memory, control)]

ex: r1 = Mem[4+r5] # LW r1, 4(r5)

## Performance

How do I get it?

- Parallelism
- Pipelining
- Both!

### **Performance: Aside**

Speed of a circuit is affected by the number of gates in series (on the *critical path* or the *deepest level of logic*)



# **4-bit Ripple Carry Adder**



Carry ripples from lsb to msb

- First full adder, 2 gate delay
- Second full adder, 2 gate delay
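The ripple can be seen in a bit-level model where each full adder has to wait for the carry from the one before it. A minimal sketch, assuming 2 gate delays per full adder as stated above; the helper names are illustrative.

```python
def full_adder(a, b, cin):
    """Add three 1-bit numbers; return (sum bit, carry out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

def to_bits(value, width):
    """Least significant bit first."""
    return [(value >> i) & 1 for i in range(width)]

def from_bits(bits):
    return sum(bit << i for i, bit in enumerate(bits))

def ripple_carry_add(a, b, width=4, cin=0):
    """Return (sum, carry out, gate delays) for a `width`-bit ripple carry adder."""
    carry = cin
    sum_bits = []
    for a_bit, b_bit in zip(to_bits(a, width), to_bits(b, width)):
        # Each stage cannot start until the previous carry is known.
        s, carry = full_adder(a_bit, b_bit, carry)
        sum_bits.append(s)
    return from_bits(sum_bits), carry, 2 * width

total, carry_out, delay = ripple_carry_add(0b1011, 0b0110)  # 11 + 6
```

The loop is inherently sequential: the delay grows linearly with the number of bits.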

# Adding

Main ALU, slows us down

Does it need to be this slow?

Observations

- Have to wait for C<sub>in</sub>
- Can we compute in parallel in some way?
- CLA carry look-ahead adder

### Carry Look Ahead Logic

Can we reason C<sub>out</sub> independent of C<sub>in</sub>?

- Just based on (A,B) only

When is C<sub>out</sub> == 1, irrespective of C<sub>in</sub>?

### If C<sub>in</sub> == 1, when is C<sub>out</sub> also == 1?



#### Full Adder

- Adds three 1-bit numbers
- Computes 1-bit result and 1-bit carry
- Can be cascaded








### 1-bit CLA adder



Create two terms: propagator, generator

- g = 1, generates C<sub>out</sub>: g = AB
  - Irrespective of C<sub>in</sub>
- p = 1, propagates C<sub>in</sub> to C<sub>out</sub>: p = A + B

p and g are generated in 1 gate delay.

S is 2 gate delays *after* we get C<sub>in</sub>.
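That g and p are enough to produce the carry can be checked exhaustively: C<sub>out</sub> = g + p·C<sub>in</sub> agrees with the full adder carry for all eight input combinations. A minimal sketch; the function names are illustrative.

```python
def cla_bit(a, b, cin):
    """1-bit CLA cell: return (sum, carry out, g, p)."""
    g = a & b              # generate: carry out is 1 irrespective of cin
    p = a | b              # propagate: carry out follows cin
    cout = g | (p & cin)
    s = a ^ b ^ cin
    return s, cout, g, p

def full_adder_carry(a, b, cin):
    """Reference carry: 1 when at least two of the three inputs are 1."""
    return 1 if a + b + cin >= 2 else 0

# Compare the two definitions over every input combination.
mismatches = [
    (a, b, cin)
    for a in (0, 1) for b in (0, 1) for cin in (0, 1)
    if cla_bit(a, b, cin)[1] != full_adder_carry(a, b, cin)
]
```

The point of the split is that g and p depend only on A and B, so they are available before any carry arrives.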

## 4-bit CLA



- CLA takes p, g from all 4 bits and C<sub>0</sub>, and generates all Cs: 2 gate delays

## 4-bit CLA



- Given the A's and B's, all p's and g's are generated in 1 gate delay, in parallel
- Given all p's and g's, all C's are generated in 2 gate delays, in parallel
- Given all C's, all S's are generated in 2 gate delays, in parallel

Sequential operation is made into parallel operation!!
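Expanding C<sub>i+1</sub> = g<sub>i</sub> + p<sub>i</sub>C<sub>i</sub> until only C<sub>0</sub> remains gives each carry as a two-level AND-OR expression of the p's, the g's and C<sub>0</sub>. A minimal sketch of the 4-bit case; the function name `cla4` is illustrative.

```python
def cla4(a, b, c0=0):
    """4-bit carry look-ahead add; return (4-bit sum, carry out)."""
    a_bits = [(a >> i) & 1 for i in range(4)]
    b_bits = [(b >> i) & 1 for i in range(4)]
    # 1 gate delay: every g and p, in parallel.
    g = [x & y for x, y in zip(a_bits, b_bits)]
    p = [x | y for x, y in zip(a_bits, b_bits)]
    # 2 gate delays: every carry straight from the g's, the p's and c0.
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & c0))
    # 2 gate delays: every sum bit, in parallel.
    carries = [c0, c1, c2, c3]
    s_bits = [x ^ y ^ c for x, y, c in zip(a_bits, b_bits, carries)]
    return sum(bit << i for i, bit in enumerate(s_bits)), c4

total, carry_out = cla4(0b1011, 0b0110)  # 11 + 6
```

No carry expression refers to another carry, so none of them has to wait for a neighbour.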

#### Performance

Ripple Carry vs. Carry Lookahead Adder for **8 bits**

- 2 x 8 vs. 5 gate delays = 16 vs. 5 gate delays

Ripple Carry vs. Carry Lookahead Adder for 32 bits

- 2 x 32 vs. 5 gate delays = 64 vs. 5 gate delays!
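The two delay counts come from a simple model: 2 gate delays per ripple stage, versus 1 + 2 + 2 for the look-ahead adder whatever its width. The sketch below assumes a flat look-ahead unit built from gates with as many inputs as needed, which is an idealization; the function names are illustrative.

```python
def ripple_carry_delay(bits):
    """Gate delays for a ripple carry adder: 2 per full adder, in series."""
    return 2 * bits

def carry_lookahead_delay():
    """Gate delays for a flat CLA: p,g (1) + carries (2) + sums (2)."""
    return 1 + 2 + 2

comparison = {
    bits: (ripple_carry_delay(bits), carry_lookahead_delay())
    for bits in (4, 8, 32)
}
```

Under this model the ripple delay grows linearly with width while the look-ahead delay stays constant; the price is carry logic whose gates get wider with every added bit.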

## Performance

How do I get it?

- Parallelism
- Pipelining
- Both!

### Goals for today

#### **MIPS** Datapath

- Memory layout
- Control Instructions

#### Performance

- How to get it? Parallelism and Pipeline!
- CPI (Cycles Per Instruction)
- MIPS (Millions of Instructions Per Second)
- Clock Frequency

Pipelining

Latency vs throughput

Next Time