Lec 28: Conclusions

Kavita Bala CS 3410, Fall 2008

Computer Science Cornell University

#### **Announcements**

- Pizza party was fun
  - Winners: Andrew Cameron and Ross Anderson
- Final project out tomorrow afternoon
  - Demos: Dec 16 (Tuesday)
- Prelim 2: Dec 4 Thursday
  - Hollister 110, 7:30-10:00

#### Prelim 2 Topics

- Cumulative, but newer stuff:
  - Physical and virtual memory, page tables, TLBs
  - Caches, cache-conscious programming, caching issues
  - Privilege levels, syscalls, traps, interrupts, exceptions
  - Busses, programmed I/O, memory-mapped I/O
  - DMA, disks
  - Synchronization
  - Multicore processors

© Kavita Bala, Computer Science, Cornell University

#### Goals

- Concurrency poses challenges for:
- Correctness
  - Threads accessing shared memory should not interfere with each other
- Liveness
  - Threads should not get stuck, should make forward progress
- Efficiency
  - Program should make good use of available computing resources (e.g., processors).
- Fairness
  - Resources apportioned fairly between threads

## Race conditions

- Def: timing-dependent error involving access to shared state
  - Whether it happens depends on how threads scheduled: who wins "races" to instruction that updates state vs. instruction that accesses state
  - Races are intermittent, may occur rarely
    - Timing dependent = small changes can hide bug
  - A program is correct *only* if *all possible* schedules are safe
    - Number of possible schedule permutations is huge
    - Need to imagine an adversary who switches contexts at the worst possible time


#### **Mutexes**

- Critical sections typically associated with mutual exclusion locks (mutexes)
- Only one thread can hold a given mutex at a time
- Acquire (lock) mutex on entry to critical section
  - Or block if another thread already holds it
- Release (unlock) mutex on exit
  - Allow one waiting thread (if any) to acquire & proceed
- POSIX (e.g., Linux): pthread\_mutex\_init



#### Using atomic hardware primitives

- Mutex implementations usually rely on special hardware instructions that atomically do a read and a write
  - Requires special memory system support on multiprocessors
- Mutex init: lock = false; (true: locked; false: unlocked)
- Acquire: while (test\_and\_set(&lock));
- Critical section
- Release: lock = false;
- test\_and\_set uses a special hardware instruction that sets the lock and returns the OLD value
- Alternative instructions: compare & swap, load linked/store conditional


# Using test-and-set for mutual exclusion

    boolean lock = false;

    while (test_and_set(&lock))
        skip;                      // spin until lock is acquired
    ... do critical section ...    // only one process can be in this section at a time
    lock = false;                  // release lock when finished with the critical section

    boolean test_and_set(boolean *lock) {
        boolean old = *lock;
        *lock = true;
        return old;
    }
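The same spin lock can be sketched in standard C11, assuming a compiler that provides `<stdatomic.h>`. Here `atomic_flag_test_and_set` stands in for the hardware test-and-set instruction; the names `lock_flag`, `spin_try_lock`, `spin_lock` and `spin_unlock` are illustrative and not from the lecture.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative names; a clear flag means unlocked, a set flag means locked. */
static atomic_flag lock_flag = ATOMIC_FLAG_INIT;

/* Returns true if the lock was free and is now held by the caller. */
static bool spin_try_lock(void) {
    /* test-and-set returns the OLD value: false means we got the lock */
    return !atomic_flag_test_and_set(&lock_flag);
}

static void spin_lock(void) {
    while (atomic_flag_test_and_set(&lock_flag)) {
        /* spin until the old value was false */
    }
}

static void spin_unlock(void) {
    atomic_flag_clear(&lock_flag);
}
```

A spin lock like this burns CPU while waiting, so it is only a reasonable choice when critical sections are very short.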

# Beyond mutexes

- Sometimes need to share resources in non-exclusive way
- Example: shared queue (multiple readers, multiple writers)
- How to let a reader wait for data without blocking a mutex?

    char get() {
      char c = buffer[first];
      first++;
      return c;
    }



#### A first broken cut

```
// invariant: data is in buffer[first..last-1].
mutex_t *m;
char buffer[n];
int first = 0, last = 0;

void put(char c) {
  lock(m);
  buffer[last] = c;        // Same issues here for full queue
  last = (last+1)%n;
  unlock(m);
}

char get() {
  char c;
  boolean done = false;
  while (!done) {          // Oops! Reader still spins on empty queue
    lock(m);
    if (first != last) {
      c = buffer[first];
      first = (first+1)%n;
      done = true;
    }
    unlock(m);
  }
  return c;
}
```

# Condition variables


- To let thread wait (not holding the mutex!) until a condition is true, use a condition variable [Hoare]
- wait(m, c): atomically release m and go to sleep waiting for condition c, wake up holding m
  - Must be atomic to avoid wake-up-waiting race
- signal(c): wake up one thread waiting on c
- broadcast(c): wake up all threads waiting on c
- POSIX (e.g., Linux): pthread\_cond\_wait, pthread\_cond\_signal, pthread\_cond\_broadcast

# Using a condition variable

- wait(m, c): release m, sleep waiting for c, wake up holding m
- signal(c): wake up one thread waiting on c

```
mutex_t *m;
cond_t *not_empty, *not_full;

void put(char c) {
  lock(m);
  while ((last+1)%n == first)   // queue is full
    wait(m, not_full);
  buffer[last] = c;
  last = (last+1)%n;
  unlock(m);
  signal(not_empty);
}

char get() {
  lock(m);
  while (first == last)         // queue is empty
    wait(m, not_empty);
  char c = buffer[first];
  first = (first+1)%n;
  unlock(m);
  signal(not_full);
  return c;
}
```
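The pseudocode maps fairly directly onto POSIX threads. The following is a minimal sketch, assuming static initializers and a fixed capacity `N` (an illustrative constant); it is one possible rendering, not the course's reference implementation.

```c
#include <pthread.h>

#define N 8  /* illustrative capacity; one slot is always left unused */

static char buffer[N];
static int first = 0, last = 0;
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;
static pthread_cond_t not_full = PTHREAD_COND_INITIALIZER;

void put(char c) {
    pthread_mutex_lock(&m);
    while ((last + 1) % N == first)          /* queue is full */
        pthread_cond_wait(&not_full, &m);    /* releases m while asleep */
    buffer[last] = c;
    last = (last + 1) % N;
    pthread_mutex_unlock(&m);
    pthread_cond_signal(&not_empty);
}

char get(void) {
    pthread_mutex_lock(&m);
    while (first == last)                    /* queue is empty */
        pthread_cond_wait(&not_empty, &m);
    char c = buffer[first];
    first = (first + 1) % N;
    pthread_mutex_unlock(&m);
    pthread_cond_signal(&not_full);
    return c;
}
```

The condition is rechecked in a `while` loop rather than an `if` because a woken thread may find that another thread has already consumed the slot, and POSIX permits spurious wakeups.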


#### **Monitors**

- A monitor is a shared concurrency-safe data structure
- Has one mutex
- Has some number of condition variables
- Operations acquire mutex so only one thread can be in the monitor at a time
- Our buffer implementation is a monitor
- Some languages (e.g. Java, C#) provide explicit support for monitors

#### Java concurrency

- Java object is a simple monitor
  - Acts as a mutex via synchronized { S } statement and synchronized methods
  - Has one (!) builtin condition variable tied to the mutex
    - o.wait() = wait(o, o)
    - o.notify() = signal(o)
    - o.notifyAll() = broadcast(o)
    - synchronized(o) {S} = lock(o); S; unlock(o)
  - Java wait() must be called while the object's mutex is held (otherwise it throws IllegalMonitorStateException); the mutex is reacquired before wait() returns


### More synchronization mechanisms

Implementable with mutexes and condition variables:

- Reader/writer locks
  - Any number of threads can hold a read lock
  - Only one thread can hold the writer lock
- Semaphores
  - Some number n of threads are allowed to hold the lock
  - n=1 => semaphore = mutex
- Message-passing, sockets
  - send()/recv() transfer data and synchronize

# Where are we going?


# Real-Time Hardware









#### Moore's Law

- 1965
  - number of transistors that can be integrated on a die would double every 18 to 24 months (i.e., grow exponentially with time).
- Amazingly visionary
  - 2300 transistors, 1 MHz clock (Intel 4004) 1971
  - 16 Million transistors (Ultra Sparc III)
  - 42 Million transistors, 2 GHz clock (Intel Xeon) 2001
  - 55 Million transistors, 3 GHz, 130nm technology, 250mm<sup>2</sup> die (Intel Pentium 4) – 2004
  - 290+ Million transistors, 3 GHz (Intel Core 2 Duo) 2007







# Why are GPUs so fast?



FIGURE A.3.1 Direct3D 10 graphics pipeline. Each logical pipeline stage maps to GPU hardware or to a GPU processor. Programmable shader stages are blue, fixed-function blocks are white, and memory objects are grey. Each stage processes a vertex, geometric primitive, or pixel in a streaming dataflow fashion. Copyright © 2009 Elsevier, Inc. All rights reserved.

- Pipelined and parallel
- Very, very parallel: 128 to 1000 cores




FIGURE A.2.5 Basic unified GPU architecture. Example GPU with 112 streaming processor (SP) cores organized in 14 streaming multiprocessors (SMs); the cores are highly multithreaded. It has the basic Tesla architecture of an NVIDIA GeForce 8800. The processors connect with four 64-bit-wide DRAM partitions via an interconnection network. Each SM has eight SP cores, two special function units (SFUs), instruction and constant caches, a multithreaded instruction unit, and a shared memory. Copyright © 2009 Elsevier, Inc. All rights reserved.

# General computing with GPUs

- Can we use these machines for general computation?
- Scientific Computing
  - MATLAB codes
- Convex hulls
- Molecular Dynamics
- Etc.
- CUDA: using it as a general purpose multicore processor


# AMD's Hybrid CPU/GPU



#### Cell

- IBM/Sony/Toshiba
- Sony Playstation 3
- PPE
- SPEs (synergistic)




# Classification of Parallelism

Flynn's taxonomy

| Instruction streams | Single data stream      | Multiple data streams                     |
|---------------------|-------------------------|-------------------------------------------|
| Single              | SISD: Intel Pentium 4   | SIMD: SSE instructions of x86, early GPUs |
| Multiple            | MISD: no examples today | MIMD: Intel Xeon e5345, Cell, Tesla       |

#### **Parallelism**

- Must exploit parallelism for performance
  - Lot of parallelism in graphics applications
- SIMD: single instruction, multiple data
  - Perform same operation in parallel on many data items
  - Data parallelism
- MIMD: multiple instruction, multiple data
  - Run separate programs in parallel (on different data)
  - Task parallelism


## Do you believe?



# Course Objective

- Bridge the gap between hardware and software
  - How a processor works
  - How a computer is organized
- Establish a foundation for building higher-level applications
  - How to understand program performance
  - How to understand where the world is going







- NOT:





- OR:



- NAND and NOR are universal
  - Can implement any function with just NAND or just NOR gates
  - Useful for manufacturing

# Logic Manipulation

- Can specify functions by describing gates, truth tables or logic equations
- Can manipulate logic equations algebraically
- Can also use a truth table to prove equivalence
- Example: (a+b)(a+c) = a + bc

$$\begin{aligned}
(a+b)(a+c) &= aa + ab + ac + bc \\
&= a + a(b+c) + bc \\
&= a(1 + (b+c)) + bc \\
&= a + bc
\end{aligned}$$

| a | b | c | a+b | a+c | LHS | bc | RHS |
|---|---|---|-----|-----|-----|----|-----|
| 0 | 0 | 0 | 0   | 0   | 0   | 0  | 0   |
| 0 | 0 | 1 | 0   | 1   | 0   | 0  | 0   |
| 0 | 1 | 0 | 1   | 0   | 0   | 0  | 0   |
| 0 | 1 | 1 | 1   | 1   | 1   | 1  | 1   |
| 1 | 0 | 0 | 1   | 1   | 1   | 0  | 1   |
| 1 | 0 | 1 | 1   | 1   | 1   | 0  | 1   |
| 1 | 1 | 0 | 1   | 1   | 1   | 0  | 1   |
| 1 | 1 | 1 | 1   | 1   | 1   | 1  | 1   |
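The truth-table proof can also be run mechanically. This small sketch enumerates all eight input combinations and compares both sides, using bitwise `|` for OR and `&` for AND; the function name `identity_holds` is made up for this example.

```c
#include <stdbool.h>

/* Returns true if (a+b)(a+c) == a + bc for all 8 input combinations. */
bool identity_holds(void) {
    for (int a = 0; a <= 1; a++)
        for (int b = 0; b <= 1; b++)
            for (int c = 0; c <= 1; c++) {
                int lhs = (a | b) & (a | c);   /* LHS column */
                int rhs = a | (b & c);         /* RHS column */
                if (lhs != rhs)
                    return false;
            }
    return true;
}
```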



# **Binary Representation**

37 = 32 + 4 + 1

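| Power | 2<sup>6</sup> | 2<sup>5</sup> | 2<sup>4</sup> | 2<sup>3</sup> | 2<sup>2</sup> | 2<sup>1</sup> | 2<sup>0</sup> |
|-------|----|----|----|---|---|---|---|
| Value | 64 | 32 | 16 | 8 | 4 | 2 | 1 |
| Bit   | 0  | 1  | 0  | 0 | 1 | 0 | 1 |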


# **Hexadecimal Representation**

- 37 decimal = $(25)_{16}$ = 2 × 16<sup>1</sup> + 5 × 16<sup>0</sup>
- Convention
  - Base 16 is written with a leading 0x
  - 37 = 0x25
- Need extra digits!
  - 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F
- Binary to hexadecimal is easy
  - Divide into groups of 4 bits (starting from the least-significant bit), translate groupwise into hex digits
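The groupwise translation can be sketched as follows. `binary_to_hex` is a hypothetical helper that assumes its input is a string of '0' and '1' characters whose length is already a multiple of 4 (pad with leading zeros otherwise).

```c
#include <stddef.h>
#include <string.h>

/* Converts a binary string such as "00100101" to uppercase hex ("25").
   Assumes strlen(bits) is a multiple of 4 and that out has room for
   strlen(bits)/4 + 1 characters. */
void binary_to_hex(const char *bits, char *out) {
    static const char digits[] = "0123456789ABCDEF";
    size_t n = strlen(bits);
    size_t o = 0;
    for (size_t i = 0; i + 4 <= n; i += 4) {
        int v = 0;
        for (size_t j = 0; j < 4; j++)          /* value of one 4-bit group */
            v = v * 2 + (bits[i + j] - '0');
        out[o++] = digits[v];
    }
    out[o] = '\0';
}
```

For 37, the 7-bit pattern 0100101 is padded to 0010 0101, which translates to the digits 2 and 5.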










# **Ballot Reading**



- Done! The 3410 voting machine


# **Stateful Components**

- Until now, everything has been combinatorial logic
  - Output is computed when inputs are present
  - System has no internal state
  - Nothing computed in the present can depend on what happened in the past!
- Need a way to record data
- Need a way to build stateful circuits
- Need a state-holding device



# Summary

- We can now build interesting devices with sensors
  - Using combinatorial logic
- We can also store data values
  - In state-holding elements
  - Coupled with clocks





# FSM: Serial Adder

- Add two input bit streams
  - streams are sent with least-significant-bit (lsb) first
- Two states: S0 (no carry), S1 (carry in hand)
- Inputs: a and b
- Output: z
  - Arcs labelled with input bits a and b, and output z
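A software simulation of the serial adder might look like the sketch below. The single state bit plays the role of S0/S1, and the helper names (`serial_adder`, `serial_adder_step`, `serial_add`) are invented for this example.

```c
/* State of the serial adder: 0 = S0 (no carry), 1 = S1 (carry in hand). */
typedef struct { int state; } serial_adder;

/* Consumes one pair of input bits, updates the state, returns output bit z. */
int serial_adder_step(serial_adder *fsm, int a, int b) {
    int sum = a + b + fsm->state;
    fsm->state = sum >> 1;   /* next state: is there a carry in hand? */
    return sum & 1;          /* output z */
}

/* Feeds two unsigned values through the FSM, lsb first, for nbits steps. */
unsigned serial_add(unsigned x, unsigned y, int nbits) {
    serial_adder fsm = { 0 };
    unsigned result = 0;
    for (int i = 0; i < nbits; i++) {
        int z = serial_adder_step(&fsm, (x >> i) & 1, (y >> i) & 1);
        result |= (unsigned)z << i;
    }
    return result;
}
```

Any carry still in hand after the last step is dropped, so the result is the sum modulo 2<sup>nbits</sup>.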





# **Binary Arithmetic**

12 + 25 = 37

- Arithmetic works the same way regardless of base
  - Add the digits in each position
  - Propagate the carry

001100 + 011010 = 100110 (12 + 26 = 38)

- Unsigned binary addition is pretty easy
  - Combine two bits at a time
  - Along with a carry

# 1-bit Adder with Carry



| C<sub>in</sub> | A<sub>i</sub> | B<sub>i</sub> | C<sub>out</sub> | R<sub>i</sub> |
|----------------|----------------|----------------|------------------|----------------|
| 0  | 0              | 0              | 0                | 0              |
| 0  | 0              | 1              | 0                | 1              |
| 0  | 1              | 0              | 0                | 1              |
| 0  | 1              | 1              | 1                | 0              |
| 1  | 0              | 0              | 0                | 1              |
| 1  | 0              | 1              | 1                | 0              |
| 1  | 1              | 0              | 1                | 0              |
| 1  | 1              | 1              | 1                | 1              |

- Adds two 1-bit numbers, along with carry-in, computes 1-bit result and carry-out
- Can be cascaded to add N-bit numbers
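To make the cascading concrete, here is a sketch of a 1-bit full adder and an N-bit ripple-carry adder built from it. The function names and the 32-bit limit are assumptions of this example.

```c
/* One full-adder cell, following the truth table above. */
void full_adder(int a, int b, int cin, int *r, int *cout) {
    *r = a ^ b ^ cin;
    *cout = (a & b) | (cin & (a ^ b));
}

/* Cascades n cells (n <= 32): the carry out of bit i feeds bit i+1. */
unsigned ripple_add(unsigned x, unsigned y, int n, int *carry_out) {
    unsigned result = 0;
    int carry = 0;
    for (int i = 0; i < n; i++) {
        int r;
        full_adder((x >> i) & 1, (y >> i) & 1, carry, &r, &carry);
        result |= (unsigned)r << i;
    }
    *carry_out = carry;
    return result;
}
```

Each cell has to wait for the carry from the cell below it, which is the sequential delay that carry-lookahead designs try to remove.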


# 4-bit CLA



- Given A,B's, all p,g's are generated in 1 gate delay in parallel.
- Given all p,g's, all C's are generated in 2 gate delays in parallel.
- Given all C's, all R's are generated in 2 gate delays in parallel.
- Sequential operation in the ripple-carry adder (RCA) is made into parallel operation!


# Two's Complement

- Nonnegative numbers are represented as usual
  - 0 = 0000
  - 1 = 0001
  - 3 = 0011
  - 7 = 0111
- To negate a number, flip all bits, add one
  - -1: 1  $\Rightarrow$  0001  $\Rightarrow$  1110  $\Rightarrow$  1111
  - -3: 3  $\Rightarrow$  0011  $\Rightarrow$  1100  $\Rightarrow$  1101
  - -7: 7  $\Rightarrow$  0111  $\Rightarrow$  1000  $\Rightarrow$  1001
  - -8: 8  $\Rightarrow$  1000  $\Rightarrow$  0111  $\Rightarrow$  1000
  - -0: 0  $\Rightarrow$  0000  $\Rightarrow$  1111  $\Rightarrow$  0000 (this is good, -0 = +0)


# Two's Complement Subtraction

- Subtraction is simply addition, where one of the operands has been negated
  - Negation is done by inverting all bits and adding one
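A short sketch of 4-bit two's complement negation and subtraction, assuming values are kept in the low 4 bits of an `unsigned`. The helper names are illustrative.

```c
/* Negates a 4-bit two's complement value: flip all bits, add one,
   keep the low 4 bits. */
unsigned negate4(unsigned x) {
    return (~x + 1u) & 0xFu;
}

/* Subtraction is addition of the negated operand (result modulo 16). */
unsigned sub4(unsigned a, unsigned b) {
    return (a + negate4(b)) & 0xFu;
}

/* Interprets a 4-bit pattern as a signed value in [-8, 7]. */
int to_signed4(unsigned x) {
    x &= 0xFu;
    return (x & 0x8u) ? (int)x - 16 : (int)x;
}
```

For example, `sub4(3, 7)` yields the pattern 1100, which `to_signed4` reads as -4.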





# Static RAM: SRAM

- Static-RAM
  - So called because once stored, data values are stable as long as electricity is supplied
  - Based on regular flip-flops with gates
- Signals: Address, Chip select, Output enable, Write enable, Din[15-0], Dout[15-0]






FIGURE B.9.4 Typical organization of a 4M x 8 SRAM as an array of 4K x 1024 arrays. The first decoder generates the addresses for eight 4K x 1024 arrays; then a set of multiplexors is used to select 1 bit from each 1024-bit-wide array. This is a much easier design than a single-level decode that would need either an enormous decoder or a gigantic multiplexor. In practice, a modern SRAM of this size would probably use an even larger number of blocks, each somewhat smaller.


# Dynamic RAM: DRAM

- Dynamic-RAM
  - Data values require constant refresh
  - Internal circuitry keeps capacitor charges



FIGURE B.9.5 A single-transistor DRAM cell contains a capacitor that stores the cell contents and a transistor used to access the cell.

# Instruction Usage

- Instructions are stored in memory, encoded in binary
- A basic processor, one instruction at a time:
  - fetches
  - decodes
  - executes





# **Arithmetic Instructions**



- if op == 0 && func == 0x21
  - R[rd] = R[rs] + R[rt] (unsigned)
- if op == 0 && func == 0x23
  - R[rd] = R[rs] - R[rt] (unsigned)
- if op == 0 && func == 0x25
  - R[rd] = R[rs] | R[rt]
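The decode rules above can be sketched as a tiny interpreter for these three R-type instructions, assuming the standard MIPS R-type field layout (op, rs, rt, rd, shamt, func). `exec_rtype` is a hypothetical name, and for brevity the sketch does not enforce that register $0 stays zero.

```c
#include <stdint.h>

/* Executes one R-type instruction against register file R.
   Returns 1 if the instruction was recognized, 0 otherwise.
   Simplification: writes to register 0 are not suppressed. */
int exec_rtype(uint32_t instr, uint32_t R[32]) {
    uint32_t op   = instr >> 26;
    uint32_t rs   = (instr >> 21) & 0x1F;
    uint32_t rt   = (instr >> 16) & 0x1F;
    uint32_t rd   = (instr >> 11) & 0x1F;
    uint32_t func = instr & 0x3F;

    if (op != 0)
        return 0;
    switch (func) {
    case 0x21: R[rd] = R[rs] + R[rt]; break;  /* ADDU */
    case 0x23: R[rd] = R[rs] - R[rt]; break;  /* SUBU */
    case 0x25: R[rd] = R[rs] | R[rt]; break;  /* OR   */
    default:   return 0;
    }
    return 1;
}
```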











#### MIPS Addressing Modes

- 3. Operand: Immediate addressing
  - Instruction fields: rs, rd, operand
- 4. Instruction: PC-relative addressing
  - Instruction fields: rs, rd, offset
  - Offset combined with the Program Counter (PC) selects the branch destination instruction in memory
- 5. Instruction: Pseudo-direct addressing
  - Jump address concatenated ($\|$) with Program Counter (PC) bits selects the jump destination instruction in memory

## **Assembly Language Instructions**

- Arithmetic
  - ADD, ADDU, SUB, SUBU, AND, OR, XOR, NOR, SLT, SLTU
  - ADDI, ADDIU, ANDI, ORI, XORI, LUI, SLL, SRL, SLLV, SRLV, SRAV, SLTI, SLTIU
  - MULT, DIV, MFLO, MTLO, MFHI, MTHI
- Control Flow
  - BEQ, BNE, BLEZ, BLTZ, BGEZ, BGTZ
  - J, JR, JAL, JALR, BLTZAL, BGEZAL
- Memory
  - LW, LH, LB, LHU, LBU
  - SW, SH, SB
- Special
  - LL, SC, SYSCALL, BREAK, SYNC, COPROC



# **Program Layout**

- Programs consist of segments used for different purposes
  - Text: holds instructions
  - Data: holds statically allocated program data such as variables, strings, etc.

Figure: example memory contents. The data segment holds "cornell cs", 13, 25; the text segment holds add r1,r2,r3 and ori r2, r4, 3, ...

## **Assembling Programs**

```
.text
.ent main
main: la $4, Larray
li $5, 15
...
li $4, 0
jal exit
.end main
.data
Larray:
.long 51, 491, 3991
```

- Programs consist of a mix of instructions, pseudo-ops and assembler directives
- Assembler lays out binary values in memory based on directives


#### Forward References

- Local labels can have forward references
- Two-pass assembly
  - Do a pass through the whole program, allocate instructions and lay out data, thus determining addresses
  - Do a second pass, emitting instructions and data, with the correct label offsets now determined

# Handling Forward References

- Example:
  - bne \$1, \$2, L
  - sll \$0, \$0, 0
  - L: addiu \$2, \$3, 0x2
- The assembler will change this to
  - bne \$1, \$2, +1
  - sll \$0, \$0, 0
  - addiu \$2, \$3, 0x2


## Frame Layout on Stack



```
blue() {
    pink(0,1,2,3,4,5);
}
pink() {
    orange(10,11,12,13,14);
}
```





#### Register Usage

- Callee-save
  - Save it if you modify it
  - Assumes caller needs it
  - Save the previous contents of the register on procedure entry, restore just before procedure return
  - E.g. \$31 (if you are a non-leaf... what is that?)
- Caller-save
  - Save it if you need it after the call
  - Assume callee can clobber any one of the registers
  - Save contents of the register before proc call
  - Restore after the call


#### **Buffer Overflows**

```
saved regs
arguments
return address
local variables
saved regs
arguments
return address
local variables
```

```
blue() {
    pink(0,1,2,3,4,5);
}
pink() {
    orange(10,11,12,13,14);
}
orange() {
    char buf[100];
    gets(buf); // read string, no check
}
```
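Because `gets` has no idea how large `buf` is, a long input line runs past the 100 bytes and into the saved registers and return address above it on the stack. A bounded read avoids this; the sketch below uses the standard `fgets`, wrapped in a hypothetical helper `read_line`.

```c
#include <stdio.h>
#include <string.h>

/* Reads at most size-1 characters of one line into buf and strips the
   trailing newline if present. Returns 0 on success, -1 on EOF or error. */
int read_line(FILE *in, char *buf, size_t size) {
    if (fgets(buf, (int)size, in) == NULL)
        return -1;
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}
```

Inside `orange()`, the call would become `read_line(stdin, buf, sizeof buf)`; input beyond 99 characters is left unread instead of being written past the end of the buffer.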

# **Pipelining**



- Latency: ?
- Throughput: Batch every 45 minutes


## Throughput is good



- What about latency?















#### Misses

- Three types of misses
  - Cold
    - The line is being referenced for the first time
  - Capacity
    - The line was evicted because the cache was not large enough
  - Conflict
    - The line was evicted because of another access whose index conflicted

## **Direct Mapped Cache**

- Simplest
- Block can only be in one line in the cache
- How to determine this location?
  - Use modulo arithmetic
  - (Block address) modulo (# cache blocks)
  - When the number of blocks is a power of 2, use the low-order log<sub>2</sub>(cache size in blocks) bits of the block address
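A sketch of the index computation for a direct-mapped cache. The geometry (64 lines of 16 bytes) is an arbitrary choice for illustration, and the function names are invented.

```c
#include <stdint.h>

/* Illustrative geometry: 64 lines of 16 bytes each (both powers of 2). */
enum { BLOCK_BYTES = 16, NUM_BLOCKS = 64 };

uint32_t block_address(uint32_t addr) { return addr / BLOCK_BYTES; }

/* (Block address) modulo (# cache blocks): the one line this block may use. */
uint32_t cache_index(uint32_t addr) {
    return block_address(addr) % NUM_BLOCKS;
}

/* Same result using only the low log2(NUM_BLOCKS) = 6 bits. */
uint32_t cache_index_bits(uint32_t addr) {
    return block_address(addr) & (NUM_BLOCKS - 1);
}

/* The remaining high-order bits identify which block occupies the line. */
uint32_t cache_tag(uint32_t addr) {
    return block_address(addr) / NUM_BLOCKS;
}
```

Two addresses that are `BLOCK_BYTES * NUM_BLOCKS` bytes apart share an index but differ in tag, so they evict each other in a direct-mapped cache.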





#### **Eviction**

- Which cache line should be evicted from the cache to make room for a new line?
  - Direct-mapped
    - no choice, must evict line selected by index
  - Associative caches
    - random: select one of the lines at random
    - round-robin: similar to random
    - FIFO: replace oldest line
    - LRU: replace line that has not been used in the longest time

#### **Cache Writes**



- No-Write
  - writes invalidate the cache and go to memory
- Write-Through
  - writes go to cache and to main memory
- Write-Back
  - write cache, write main memory only when block is evicted


# Tags and Offsets

- Tag: matching
- Offset: within block
- Valid bit: is the data valid?



#### **Short Performance Discussion**

- Complicated
  - Time from start-to-end (wall-clock time)
  - System time, user time
  - CPI (Cycles per instruction)
- Ideal CPI?


#### Cache Performance

- Consider hit ratio (H) and miss ratio (M)
- Average access time = H × AT<sub>cache</sub> + M × AT<sub>memory</sub>
- Hit rate = 1 - Miss rate
- Access time is given in cycles
- Ratio of access times, 1:50
- 90%: 0.90 + 0.10 × 50 = 5.9
- 95%: 0.95 + 0.05 × 50 = 0.95 + 2.5 = 3.45
- 99%: 0.99 + 0.01 × 50 = 1.49
- 99.9%: 0.999 + 0.001 × 50 = 0.999 + 0.05 = 1.049

#### Cache Hit/Miss Rate

- Consider a processor that is 2x faster
  - But memory is the same speed
- Since AT is the access time measured in cycles, the memory access time doubles
- H × AT<sub>cache</sub> + M × AT<sub>memory</sub>
- Ratio of access times, 1:100
- 99%: 0.99 + 0.01 × 100 = 1.99


#### Cache Hit/Miss Rate

- Original is 1GHz, 1ns is cycle time
- CPI (cycles per instruction): 1.49
- Therefore, 1.49 ns for each instruction
- New is 2GHz, 0.5 ns is cycle time.
- CPI: 1.99, so 1.99 × 0.5 ns = 0.995 ns for each instruction
- So it doesn't go to 0.745 ns for each instruction.
- Speedup is 1.5x (not 2x)
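The arithmetic on these slides can be reproduced with two small functions. This is a sketch under the slides' simplifying assumptions: a cache hit costs 1 cycle and the average access time is treated as the CPI. The function names are made up.

```c
/* Average access time in cycles: H * AT_cache + M * AT_memory,
   with AT_cache = 1 cycle and M = 1 - H. */
double avg_access_cycles(double hit_rate, double memory_cycles) {
    return hit_rate * 1.0 + (1.0 - hit_rate) * memory_cycles;
}

/* Time per instruction in ns, treating the average access time as the
   CPI (as the slides do), at the given clock rate in GHz. */
double ns_per_instruction(double hit_rate, double memory_cycles, double ghz) {
    return avg_access_cycles(hit_rate, memory_cycles) / ghz;
}
```

Dividing `ns_per_instruction(0.99, 50, 1.0)` by `ns_per_instruction(0.99, 100, 2.0)` gives 1.49 / 0.995, roughly 1.5, matching the speedup quoted above.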

# Cache Conscious Programming

int a[NCOL][NROW]; int sum = 0;



Every access is a cache miss!


## Cache Conscious Programming

int a[NCOL][NROW];

int sum = 0;

Figure: the array drawn as a grid, with cells numbered in the order they are accessed (1, 2, 3, ...); consecutive accesses now fall next to each other in the same row.

Same program, trivial transformation, 3 out of 4 accesses hit in the cache

#### Can answer the question.....

- A: for i = 0 to 99
  - for j = 0 to 999
    - A[i][j] = complexComputation ()
- B: for j = 0 to 999
  - for i = 0 to 99
    - A[i][j] = complexComputation ()
- Why is B 15 times slower than A?
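Loops A and B correspond to the two traversal orders sketched below. The array sizes follow the loop bounds in the question; the function names are illustrative, and a simple sum stands in for `complexComputation`. C stores two-dimensional arrays row by row, which is why the order matters.

```c
enum { NROW = 100, NCOL = 1000 };   /* sizes taken from the loop bounds */

static int a[NROW][NCOL];

/* Like A: the inner loop walks along a row, so consecutive accesses are
   adjacent in memory and mostly hit in the cache. */
long sum_row_major(void) {
    long sum = 0;
    for (int i = 0; i < NROW; i++)
        for (int j = 0; j < NCOL; j++)
            sum += a[i][j];
    return sum;
}

/* Like B: the inner loop walks down a column, so each access is NCOL ints
   away from the previous one and tends to touch a different cache line. */
long sum_col_major(void) {
    long sum = 0;
    for (int j = 0; j < NCOL; j++)
        for (int i = 0; i < NROW; i++)
            sum += a[i][j];
    return sum;
}
```

Both functions compute the same value; only the memory access pattern differs. The actual slowdown depends on the cache geometry and the cost of the loop body.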


## **Processor & Memory**

- Currently, the processor's address lines are directly routed via the system bus to the memory banks
  - Simple, fast
- What happens when the program issues a load or store to an invalid location?
  - e.g. 0x00000000 ?
  - uninitialized pointer





- What happens when another program is executed concurrently on another processor?
  - The addresses will conflict
- We could try to relocate the second program to another location
  - Assuming there is one
  - Introduces more problems!




#### **Address Space**

- Memory Management Unit (MMU)
  - Combination of hardware and software



## Virtual Memory Advantages

- Can relocate program while running
- Virtualization
  - In CPU: if process is not doing anything, switch
  - In memory: when not using it, somebody else can use it




#### How to make it work?

- Challenge: Virtual Memory can be slow!
- At run-time: virtual address must be translated to a physical address
- MMU (combination of hardware and software)




# Two Programs Sharing Physical Memory

- The starting location of each page (either in main memory or in secondary memory) is contained in the program's page table

Figure labels: Program 1 virtual address space, Program 2 virtual address space, Swap space.



#### Virtual Addressing with a Cache

 Thus it takes an extra memory access to translate a VA to a PA



 This makes memory (cache) accesses very expensive (if every access was really two accesses)


#### A TLB in the Memory Hierarchy



- A TLB miss:
  - If the page is not in main memory, then it's a true page fault
    - Takes 1,000,000's of cycles to service a page fault
- TLB misses are much more frequent than true page faults





## Hardware/Software Boundary

- Virtual to physical address translation is assisted by hardware?
  - Translation Lookaside Buffer (TLB) that caches the recent translations
    - TLB access time is part of the cache hit time
    - May allot an extra stage in the pipeline for TLB access
  - TLB miss
    - Can be in software (kernel handler) or hardware


#### Hardware/Software Boundary

- Virtual to physical address translation is assisted by hardware?
  - Page table storage, fault detection and updating
    - Page faults result in interrupts (precise) that are then handled by the OS
    - Hardware must support (i.e., update appropriately) Dirty and Reference bits (e.g., ~LRU) in the Page Tables







# **Exceptions**

- System calls are control transfers to the OS, performed under the control of the user program
- Sometimes, need to transfer control to the OS at a time when the user program least expects it
  - Division by zero,
  - Alert from power supply that electricity is going out,
  - Alert from network device that a packet just arrived,
  - Clock notifying the processor that clock just ticked
- Some of these causes for interruption of execution have nothing to do with the user application
- Need a (slightly) different mechanism, that allows resuming the user application

## **Terminology**

- Trap
  - Any kind of a control transfer to the OS
- Syscall
  - Synchronous, program-initiated control transfer from user to the OS to obtain service from the OS
  - e.g. SYSCALL
- Exception
  - Asynchronous, program-initiated control transfer from user to the OS in response to an exceptional event
  - e.g. Divide by zero
- Interrupt
  - Asynchronous, device-initiated control transfer from user to the OS
  - e.g. Clock tick, network packet











# Review: Performance Summary

$$CPU \ Time = \frac{Instructions}{Program} \times \frac{Clock \ cycles}{Instruction} \times \frac{Seconds}{Clock \ cycle}$$

- Performance depends on
  - Algorithm: affects IC, possibly CPI
  - Programming language: affects IC, CPI
  - Compiler: affects IC, CPI
  - Instruction set architecture: affects IC, CPI, T<sub>c</sub>









#### Why Multicore?

- Moore's law
  - A law about transistors
  - Smaller means faster transistors
- Power consumption growing with transistors
- The power wall
  - We can't reduce voltage further
  - We can't remove more heat
- How else can we improve performance?





#### Amdahl's Law

- Task: serial part, parallel part
- As number of processors increases,
  - time to execute parallel part goes to zero
  - time to execute serial part remains the same
- Serial part eventually dominates
- Must parallelize ALL parts of task

$$\mathsf{Speedup}(E) = \frac{\mathsf{Execution \ Time \ without } E}{\mathsf{Execution \ Time \ with } E}$$


#### Amdahl's Law

- Consider an improvement E
- A fraction F of the execution time is affected
- S is the speedup of the affected fraction

$$\mathsf{Execution \ Time \ with \ } E = \left((1-F) + \frac{F}{S}\right) \cdot \mathsf{Execution \ Time \ without \ } E$$

$$\mathsf{Speedup}(E) = \frac{1}{(1-F)+F/S}$$
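The formula is easy to evaluate directly. A one-function sketch, with `amdahl_speedup` as an invented name and F and S as defined above:

```c
/* Overall speedup when a fraction f of execution time is sped up by s. */
double amdahl_speedup(double f, double s) {
    return 1.0 / ((1.0 - f) + f / s);
}
```

For instance, if 90% of a task is parallelized across 10 processors (F = 0.9, S = 10), the overall speedup is 1 / (0.1 + 0.09), a little over 5x. As S grows without bound, the speedup approaches 1 / (1 - F) = 10x, which is the sense in which the serial part eventually dominates.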

## **Multithreaded Processes**



## **Shared counters**


- Usual result: works fine.
- Possible result: lost update!

hits = 0

| time | T1            | T2            |
|------|---------------|---------------|
| ↓    | read hits (0) | read hits (0) |
| ↓    | hits = 0 + 1  | hits = 0 + 1  |

Both threads write 1, so hits ends up as 1 instead of 2: one update is lost.

- Occasional timing-dependent failure ⇒ Difficult to debug
- Called a race condition
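One way to avoid the lost update is to make the read-modify-write a critical section. The sketch below uses a pthread mutex; `hits_lock`, `count_hits` and `run_two_counters` are illustrative names, and the two threads stand in for T1 and T2.

```c
#include <pthread.h>

static long hits = 0;
static pthread_mutex_t hits_lock = PTHREAD_MUTEX_INITIALIZER;

/* Each worker performs n increments; holding the lock around the
   read-modify-write keeps the two threads from interleaving inside it. */
static void *count_hits(void *arg) {
    long n = *(long *)arg;
    for (long i = 0; i < n; i++) {
        pthread_mutex_lock(&hits_lock);
        hits = hits + 1;
        pthread_mutex_unlock(&hits_lock);
    }
    return NULL;
}

/* Runs two threads (T1 and T2) and returns the final counter value. */
long run_two_counters(long n) {
    pthread_t t1, t2;
    hits = 0;
    pthread_create(&t1, NULL, count_hits, &n);
    pthread_create(&t2, NULL, count_hits, &n);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return hits;
}
```

Without the lock, the final value can come out below 2n on some runs and exactly 2n on others, which is what makes the bug hard to reproduce.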

#### Race conditions

- Def: a timing dependent error involving shared state
  - Whether it happens depends on how threads scheduled: who wins "races" to instructions that update state
  - Races are intermittent, may occur rarely
    - Timing dependent = small changes can hide bug
  - A program is correct only if all possible schedules are safe
    - Number of possible schedule permutations is huge
    - Need to imagine an adversary who switches contexts at the worst possible time


#### **Critical Sections**

- Basic way to eliminate races: use critical sections that only one thread can be in
  - Contending threads must wait to enter

#### Mutexes

- Critical sections typically associated with mutual exclusion locks (mutexes)
- Only one thread can hold a given mutex at a time
- Acquire (lock) mutex on entry to critical section
  - Or block if another thread already holds it
- Release (unlock) mutex on exit
  - Allow one waiting thread (if any) to acquire & proceed


## Protecting an invariant

```
// invariant: data is in buffer[first..last-1]. Protected by m.
pthread_mutex_t *m;
char buffer[1000];
int first = 0, last = 0;

void put(char c) {
  pthread_mutex_lock(m);
  buffer[last] = c;
  last++;
  pthread_mutex_unlock(m);
}

char get() {
  pthread_mutex_lock(m);
  char c = buffer[first];   // X what if first==last?
  first++;
  pthread_mutex_unlock(m);
  return c;
}
```

 Rule of thumb: all updates that can affect invariant become critical sections.







# Where to?

• Smart Dust....







## Where to?

- CS 3110: Better concurrent programming
- CS 4410: The Operating System!
- CS 4450: Networking
- CS 6620: Graphics
- And many more...


Thank you!