## Processor Implementation

#### Overview

- Start with common datapath design concepts
- Simple processor implementation
- Move to more complex implementation
- Finally, pipelined implementation

#### We'll look at how modern hardware works:

- Clocking, combinational logic, etc
- Pipelining
- Dealing with complexity

a.k.a. how do they build those 10 million transistor chips??





# Designing Complex FSMs

How many states do we need for a MIPS CPU?

- MIPS has 32 32-bit registers
- Each register could be in one of  $2^{32}$  states
- We need at least  $31 \times 2^{32}$  states! (register 0 is 0)

... so we clearly don't want to draw the FSM or write the truth-table!

Idea: exploit FSM structure





## Datapath Concepts

#### Datapath:

- Part of an architecture that manipulates data
- Tends to be "regular"

Example: ALU, MUX, etc.

#### Control:

Tells datapath what to do

Example: alu opcode, MUX select input, etc.





## Datapath Design

Steps in designing a processor: (or any other piece of hardware)

- Start with high-level specification processor: the ISA
- Identify major storage elements and signals processor: registers, memory, ...
- Translate specification
   determine operations on storage elements
   (RTL, register transfer language)





### Datapath Design

#### More steps...

- Pick computation blocks processor: alu
- Determine connections
   data-dependencies among different blocks
   i.e., the datapath
- Determine control inputs to datapath blocks
- ... and then put everything together.
- Cross-cutting issue: clocking strategy when do storage elements get updated?





### Datapath Design Example

#### Example: (unsigned integers)

```
x = 0;
for (i=1; i <= N; i++)
  x = x + i;</pre>
```

Here N is an input and we should produce the result that is stored in  $\mathbf{x}$ .

First step: make the specification precise in terms of signals and clocks.





## Datapath Design Example

#### Better specification of I/O behavior:

- Input N arrives on a bus that is 4 bits wide and N is non-zero
- Output x is 7 bits wide
- Output xdone (1 bit) is set to 1 when the x output holds the correct data
- xdone stays high for 1 cycle, after which the next N input is read and the computation proceeds as before
- Signals change at the positive edge of the clock





### Pick Storage Elements

There is normally a choice here. For our example:

- i: 4 bit storage element
- N: 4 bit storage element
- x: 7 bit storage element
- xdone: 1 bit output signal

Another option: rewrite the loop and eliminate one of the storage elements!

```
x = 0;
for (i=N; i >= 1; i--)
x = x + i;
```





## Translate Specification

Pick states, and determine operations in each state.

A systematic way: translate program using gotos...

```
initial:
    xdone = 0; x = 0; i = N;
    goto loop;
loop:
    xdone = 0; x = x + i; i = i - 1;
    if (i == 0) goto done;
    else goto loop;
done:
    xdone = 1;
    goto initial
```





## Translate Specification

#### Idea:

- Each label is a state
- Advance from one state to the next every cycle
- End of program fragment at each label is a goto all paths lead to a goto!
   no intervening labels

RTL: normally uses "<-" for assignment





### What about Concurrency?

Everything happens in parallel in hardware...

```
loop:
    xdone = 0; x = x + i; i = i - 1;
    if (i == 0) goto done;
    else goto loop;
```

- variables are updated when the state changes
- ullet the state-holding element for i holds the value of i at the beginning of the state
- computation blocks have to handle this





## Computation Blocks

Operations on storage elements/signals:

i

$$- i <- N, i <- i - 1, i == 0$$

X

$$- x <- 0, x <- x + i$$

xdone

You can see why x and i are state-holding elements...

Blocks: "subtract 1," adder, compare to zero





### Computation Blocks



or...



Reduce, reuse, recycle...







For each storage element, find dependencies.

- i
  - set to N, output of sub1 block
  - used by sub1 block, adder
- X
  - set to 0, output of adder
  - used by adder

(In general case, determine how computation blocks are interconnected too.)

What is the data flow?







i can be set in two different ways...









blue: state-holding elements that are implemented with posflops.

red: control

Use MUXes...







Connections for i.







Connections for x.







Finally: control





```
initial:
   xdone <- 0;</pre>
   x < -0;
   i <- N;
   goto loop;
loop:
   xdone <- 0; x <- x + i;
   i <- i - 1;
   if (i == 0) goto done;
   else goto loop;
done:
   xdone <- 1;</pre>
   goto initial
```

Specify operations on data by using control signals!





```
initial:
   xdone <- 0;</pre>
   xwrite <- 1; xmux <- 1;
   iwrite <- 1; imux <- 1;
   goto loop;
loop:
   xdone <- 0; xwrite <- 1; xmux <- 0;
   iwrite <- 1; imux <- 0;</pre>
   if (izero == 1) goto done;
   else goto loop;
done:
   xdone <- 1; xwrite <- 0; iwrite <- 0;</pre>
   goto initial
```

Specify operations on data by using control signals!





#### Important points:

- What about xmux, imux in the done state?
   ⇒ iwrite, xwrite are 0, so don't cares.
- Note: each signal is set in each state!
   ⇒ we can generate them using combinational logic!
- If this is not the case, need a state-holding element to remember what the old value of the variable was...

(sometimes referred to as an implied flop/latch)





Format: izero/xdone xwrite xmux iwrite imux



Next step: state assignment/state table





Format: izero/xdone xwrite xmux iwrite imux

s0 s1



Next: state tables, logic design





#### After some logic synthesis...

- xdone = s0
- xwrite =  $\overline{s0}$
- imux =  $\overline{s0} \cdot \overline{s1}$
- $xmux = \overline{s0} \cdot \overline{s1}$
- iwrite = 1
- News0 =  $\overline{s0} \cdot s1 \cdot izero \cdot \overline{Reset}$
- News1 =  $(\overline{s0} \cdot \overline{s1} + s1 \cdot \overline{izero}) \cdot \overline{Reset}$



