### Lec 28: Conclusions

### Kavita Bala CS 3410, Fall 2008

Computer Science Cornell University

### Goals

Concurrency poses challenges for:

- Correctness
  - Threads accessing shared memory should not interfere with each other
- Liveness
  - Threads should not get stuck, should make forward progress
- Efficiency
  - Program should make good use of available computing resources (e.g., processors)
- Fairness
  - Resources apportioned fairly between threads

© Kavita Bala, Computer Science, Cornell University

### **Announcements**

- Pizza party was fun
  - Winners: Andrew Cameron and Ross Anderson
- Final project out tomorrow afternoon
  - Demos: Dec 16 (Tuesday)
- Prelim 2: Dec 4 Thursday
  - Hollister 110, 7:30-10:00


### Race conditions

- Def: timing-dependent error involving access to shared state
  - Whether it happens depends on how threads are scheduled: who wins "races" to the instruction that updates state vs. the instruction that accesses state
  - Races are intermittent, may occur rarely
    - Timing dependent = small changes can hide the bug
  - A program is correct only if all possible schedules are safe
    - Number of possible schedule permutations is huge
    - Need to imagine an adversary who switches contexts at the worst possible time


### Prelim 2 Topics

- Cumulative, but newer stuff:
  - Physical and virtual memory, page tables, TLBs
  - Caches, cache-conscious programming, caching
  - Privilege levels, syscalls, traps, interrupts, exceptions
  - Busses, programmed I/O, memory-mapped I/O
  - DMA, disks
  - Synchronization
  - Multicore processors


### Mutexes

- Critical sections typically associated with mutual exclusion locks (*mutexes*)
- Only one thread can hold a given mutex at a time
- Acquire (lock) mutex on entry to critical section
  - Or block if another thread already holds it
- Release (unlock) mutex on exit
  - Allow one waiting thread (if any) to acquire & proceed

After a one-time `pthread_mutex_init(m, NULL);`, each of the two threads runs:

`pthread_mutex_lock(m); hits = hits+1; pthread_mutex_unlock(m);`
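
The locking pattern above can be sketched as a complete program fragment. This is a minimal sketch, assuming two threads and a statically initialized mutex; `count_hits` and `run_two_counters` are hypothetical names introduced only for illustration.

```c
#include <pthread.h>
#include <stddef.h>

static long hits = 0;                                  /* shared state */
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;  /* protects hits */

/* Thread body: increment the shared counter *arg times. */
static void *count_hits(void *arg) {
    long iters = *(long *)arg;
    for (long i = 0; i < iters; i++) {
        pthread_mutex_lock(&m);     /* enter critical section */
        hits = hits + 1;
        pthread_mutex_unlock(&m);   /* leave critical section */
    }
    return NULL;
}

/* Hypothetical helper: run two counting threads, return the final count. */
long run_two_counters(long iters) {
    pthread_t t1, t2;
    hits = 0;
    pthread_create(&t1, NULL, count_hits, &iters);
    pthread_create(&t2, NULL, count_hits, &iters);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return hits;
}
```

With the lock in place the result is always twice `iters`; removing the lock and unlock calls makes lost updates possible.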









### Beyond mutexes

- Sometimes need to share resources in non-exclusive way
- Example: shared queue (multiple readers, multiple writers)
- How to let a reader wait for data without blocking a mutex?

`char get() { char c = buffer[first]; first++; }`



### A first broken cut

    // invariant: data is in buffer[first..last-1].
    mutex_t *m;
    char buffer[n];
    int first = 0, last = 0;

    void put(char c) {
        lock(m);
        buffer[last] = c;
        last = (last+1)%n;
        unlock(m);
    }

    char get() {
        char c;
        bool done = false;
        while (!done) {
            lock(m);
            if (first != last) {
                c = buffer[first];
                first = (first+1)%n;
                done = true;
            }
            unlock(m);
        }
        return c;
    }

- Oops! Reader still spins on empty queue
- Same issues here for full queue

### **Monitors**

- A monitor is a shared concurrency-safe data structure
- Has one mutex
- Has some number of condition variables
- Operations acquire mutex so only one thread can be in the monitor at a time
- Our buffer implementation is a monitor
- Some languages (e.g. Java, C#) provide explicit support for monitors


### Condition variables

- To let thread wait (not holding the mutex!) until a condition is true, use a condition variable [Hoare]
- wait(m, c): atomically release m and go to sleep waiting for condition c, wake up holding m
  - Must be atomic to avoid wake-up-waiting race
- signal(c): wake up one thread waiting on c
- broadcast(c): wake up all threads waiting on c
- POSIX (e.g., Linux): pthread\_cond\_wait, pthread\_cond\_signal, pthread\_cond\_broadcast


### Java concurrency

- Java object is a simple monitor
  - Acts as a mutex via synchronized { S } statement and synchronized methods
  - Has one (!) builtin condition variable tied to the mutex
    - o.wait() = wait(o, o)
    - o.notify() = signal(o)
    - o.notifyAll() = broadcast(o)
    - synchronized(o) {S} = lock(o); S; unlock(o)
  - Java wait() must be called while the mutex is held (otherwise it throws IllegalMonitorStateException); the mutex is reacquired before wait() returns after a notify()


### Using a condition variable

- wait(m, c): release m, sleep waiting for c, wake up holding m
- signal(c): wake up one thread waiting on c

```
mutex_t *m;
cond_t *not_empty, *not_full;

void put(char c) {
    lock(m);
    while ((last+1)%n == first)    // buffer full
        wait(m, not_full);
    buffer[last] = c;
    last = (last+1)%n;
    unlock(m);
    signal(not_empty);
}

char get() {
    lock(m);
    while (first == last)          // buffer empty
        wait(m, not_empty);
    char c = buffer[first];
    first = (first+1)%n;
    unlock(m);
    signal(not_full);
    return c;
}
```
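
For comparison, here is one way the same buffer might look with the POSIX calls named earlier. It is a sketch under assumptions: the capacity `N` is an arbitrary choice, the mutex and condition variables are statically initialized, and one slot is left unused so that a full buffer can be told apart from an empty one.

```c
#include <pthread.h>

#define N 8   /* assumed size; the buffer holds at most N-1 characters */

static char buffer[N];
static int first = 0, last = 0;   /* data is in buffer[first..last-1] */
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;
static pthread_cond_t not_full  = PTHREAD_COND_INITIALIZER;

void put(char c) {
    pthread_mutex_lock(&m);
    while ((last + 1) % N == first)         /* full: wait for a get() */
        pthread_cond_wait(&not_full, &m);   /* releases m while asleep */
    buffer[last] = c;
    last = (last + 1) % N;
    pthread_mutex_unlock(&m);
    pthread_cond_signal(&not_empty);
}

char get(void) {
    pthread_mutex_lock(&m);
    while (first == last)                   /* empty: wait for a put() */
        pthread_cond_wait(&not_empty, &m);
    char c = buffer[first];
    first = (first + 1) % N;
    pthread_mutex_unlock(&m);
    pthread_cond_signal(&not_full);
    return c;
}
```

The `while` loops (rather than `if`) recheck the condition after every wake-up, since another thread may have changed the buffer before the woken thread reacquires the mutex.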


### More synchronization mechanisms

Implementable with mutexes and condition variables:

- Reader/writer locks
  - Any number of threads can hold a read lock
  - Only one thread can hold the writer lock
- Semaphores
  - Some number n of threads are allowed to hold the lock
  - n=1 => semaphore = mutex
- Message-passing, sockets
  - send()/recv() transfer data and synchronize

### Where are we going?











### Moore's Law

- 1965
  - Number of transistors that can be integrated on a die would double every 18 to 24 months (i.e., grow exponentially with time)
- Amazingly visionary
  - 2300 transistors, 1 MHz clock (Intel 4004) 1971
  - 16 Million transistors (Ultra Sparc III)
  - 42 Million transistors, 2 GHz clock (Intel Xeon) 2001
  - 55 Million transistors, 3 GHz, 130nm technology, 250mm² die (Intel Pentium 4) – 2004
  - 290+ Million transistors, 3 GHz (Intel Core 2 Duo) 2007











### General computing with GPUs

- Can we use these machines for general computation?
- Scientific Computing
  - MATLAB codes
- Convex hulls
- Molecular Dynamics
- Etc.
- CUDA: using it as a general purpose multicore processor


### Classification of Parallelism

Flynn's taxonomy

|                        |          | Data Streams               |                                                |
|------------------------|----------|----------------------------|------------------------------------------------|
|                        |          | Single                     | Multiple                                       |
| Instruction<br>Streams | Single   | SISD:<br>Intel Pentium 4   | SIMD: SSE<br>instructions of x86<br>Early GPUs |
|                        | Multiple | MISD:<br>No examples today | MIMD:<br>Intel Xeon e5345<br>Cell<br>Tesla     |


### AMD's Hybrid CPU/GPU




### Parallelism

- Must exploit parallelism for performance
  - Lot of parallelism in graphics applications
- SIMD: single instruction, multiple data
  - Perform same operation in parallel on many data items
  - Data parallelism
- MIMD: multiple instruction, multiple data
  - Run separate programs in parallel (on different data)
  - Task parallelism


### Cell

- IBM/Sony/Toshiba
- Sony Playstation 3
- PPE (Power Processor Element)
- SPEs (Synergistic Processing Elements)



### Do you believe?



### Course Objective

- Bridge the gap between hardware and software
  - How a processor works
  - How a computer is organized
- Establish a foundation for building higher-level applications
  - How to understand program performance
  - How to understand where the world is going










### Logic Manipulation

- Can specify functions by describing gates, truth tables or logic equations
- Can manipulate logic equations algebraically





### **Binary Representation**

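- 37 = 32 + 4 + 1 = 0100101 in binary

| Bit | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
|---|---|---|---|---|---|---|---|
| Power | 2<sup>6</sup> | 2<sup>5</sup> | 2<sup>4</sup> | 2<sup>3</sup> | 2<sup>2</sup> | 2<sup>1</sup> | 2<sup>0</sup> |
| Value | 64 | 32 | 16 | 8 | 4 | 2 | 1 |

- In base 16, the place values are 16<sup>1</sup> and 16<sup>0</sup>: 37 = 2 x 16<sup>1</sup> + 5 x 16<sup>0</sup>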




### **Hexadecimal Representation**

- 37 decimal = (25)<sub>16</sub>
- Convention
  - Base 16 is written with a leading 0x
  - 37 = 0x25
- Need extra digits!
  - 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F
- Binary to hexadecimal is easy
  - Divide into groups of 4 bits (starting from the right), translate groupwise into hex digits
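
The groupwise translation can be sketched in a few lines of C. This is an illustrative sketch, assuming a 32-bit value printed as exactly eight hex digits; `to_hex` is a hypothetical helper name.

```c
#include <stdint.h>

/* Write the 8 hex digits of x into out (out must hold 9 chars). */
void to_hex(uint32_t x, char out[9]) {
    static const char digits[] = "0123456789ABCDEF";
    for (int i = 0; i < 8; i++) {
        unsigned shift = 28 - 4 * i;           /* most significant group first */
        out[i] = digits[(x >> shift) & 0xF];   /* one 4-bit group -> one digit */
    }
    out[8] = '\0';
}
```

For 37 the two low groups are 0010 and 0101, which translate to the digits 2 and 5.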









### Summary

- We can now build interesting devices with sensors
  - Using combinational logic
- We can also store data values
  - In state-holding elements
  - Coupled with clocks


### Stateful Components

- Until now, everything has been combinational logic
  - Output is computed when inputs are present
  - System has no internal state
  - Nothing computed in the present can depend on what happened in the past!
- Need a way to record data
- Need a way to build stateful circuits
- Need a state-holding device


### FSM: State Diagram

- Two states: S0 (no carry), S1 (carry in hand)
- Inputs: a and b
- Output: z
  - Arcs labelled with input bits a and b, and output z
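
The state diagram itself did not survive here, so the following is a sketch under the assumption that the machine is a bit-serial adder: the state is the carry, and each step consumes one bit of each operand, least significant bit first. `adder_step` and `serial_add` are hypothetical names.

```c
/* State of the serial adder: S0 = no carry, S1 = carry in hand. */
typedef enum { S0 = 0, S1 = 1 } state_t;

/* One FSM step: consume input bits a and b, return output bit z,
   and move *s to the next state. */
int adder_step(state_t *s, int a, int b) {
    int carry = (int)*s;
    int z = a ^ b ^ carry;                               /* sum bit */
    int carry_out = (a & b) | (a & carry) | (b & carry); /* majority */
    *s = carry_out ? S1 : S0;
    return z;
}

/* Hypothetical driver: add two unsigned values bit-serially, LSB first. */
unsigned serial_add(unsigned x, unsigned y, int nbits) {
    state_t s = S0;
    unsigned sum = 0;
    for (int i = 0; i < nbits; i++) {
        int z = adder_step(&s, (int)((x >> i) & 1u), (int)((y >> i) & 1u));
        sum |= (unsigned)z << i;
    }
    return sum;
}
```

Each arc of the diagram corresponds to one call of `adder_step`: the inputs a and b select the arc, z is the arc's output label, and the new value of `*s` is the arc's destination.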









### **Binary Arithmetic**

- Arithmetic works the same way regardless of base
  - Add the digits in each position
  - Propagate the carry
  - Example (decimal): 12 + 25 = 37
- Unsigned binary addition is pretty easy
  - Combine two bits at a time
  - Along with a carry
  - Example (binary): 001100 + 011010 = 100110



















### **Arithmetic Instructions**

| op | rs | rt | rd | shamt | func |
|---|---|---|---|---|---|
| 6 bits | 5 bits | 5 bits | 5 bits | 5 bits | 6 bits |

- if op == 0 && func == 0x21
  - R[rd] = R[rs] + R[rt] (unsigned)
- if op == 0 && func == 0x23
  - R[rd] = R[rs] - R[rt] (unsigned)
- if op == 0 && func == 0x25
  - R[rd] = R[rs] | R[rt]
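
The decode rules above can be sketched as a small interpreter step. This is a minimal sketch, assuming the standard MIPS R-type field order (op, rs, rt, rd, shamt, func from the most significant bit down); `exec_rtype` is a hypothetical name, and only the three func codes listed on the slide are handled.

```c
#include <stdint.h>

/* Execute one R-type instruction on register file R (32 entries).
   Returns 1 if the instruction was recognized, 0 otherwise. */
int exec_rtype(uint32_t insn, uint32_t R[32]) {
    uint32_t op   = (insn >> 26) & 0x3F;
    uint32_t rs   = (insn >> 21) & 0x1F;
    uint32_t rt   = (insn >> 16) & 0x1F;
    uint32_t rd   = (insn >> 11) & 0x1F;
    uint32_t func = insn & 0x3F;

    if (op != 0)
        return 0;
    switch (func) {
    case 0x21: R[rd] = R[rs] + R[rt]; break;   /* unsigned add */
    case 0x23: R[rd] = R[rs] - R[rt]; break;   /* unsigned subtract */
    case 0x25: R[rd] = R[rs] | R[rt]; break;   /* bitwise or */
    default:   return 0;
    }
    R[0] = 0;   /* register 0 always reads as zero */
    return 1;
}
```

For example, the word 0x00221821 has op = 0, rs = 1, rt = 2, rd = 3, and func = 0x21, so it adds registers 1 and 2 into register 3.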













### **Program Layout**

- Programs consist of segments used for different purposes
  - Text: holds instructions
  - Data: holds statically allocated program data such as variables, strings, etc.

[Figure: the data segment holds "cornell cs" and 13; the text segment holds `add r1, r2, r3` and `ori r2, r4, 3`]

### Assembly Language Instructions

- Arithmetic
  - ADD, ADDU, SUB, SUBU, AND, OR, XOR, NOR, SLT, SLTU
  - ADDI, ADDIU, ANDI, ORI, XORI, LUI, SLL, SRL, SLLV, SRLV, SRAV, SLTI, SLTIU
  - MULT, DIV, MFLO, MTLO, MFHI, MTHI
- Control Flow
  - BEQ, BNE, BLEZ, BLTZ, BGEZ, BGTZ
  - J, JR, JAL, JALR, BLTZAL, BGEZAL
- Memory
  - LW, LH, LB, LHU, LBU
  - SW, SH, SB
  - LL, SC, SYSCALL, BREAK, SYNC, COPROC





### **Assembling Programs**

    .text
    .ent main
    main:   la $4, Larray
            li $5, 15
            li $4, 0
            jal exit
    .end main
    .data
    Larray: .long 51, 491, 3991

- Programs consist of a mix of instructions, pseudo-ops and assembler directives
- Assembler lays out binary values in memory based on directives

### Forward References

- Local labels can have forward references
- Two-pass assembly
  - Do a pass through the whole program, allocate instructions and lay out data, thus determining addresses
  - Do a second pass, emitting instructions and data, with the correct label offsets now determined

### Handling Forward References

- Example:
  - `bne $1, $2, L`
  - `sll $0, $0, 0`
  - `L: addiu $2, $3, 0x2`
- The assembler will change this to
  - `bne $1, $2, +1`
  - `sll $0, $0, 0`
  - `addiu $2, $3, 0x2`
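
The +1 in the example comes from measuring the distance to the label in instructions, starting from the instruction after the branch. A small sketch of that computation, assuming 4-byte instructions and byte addresses (`branch_offset` is a hypothetical helper):

```c
#include <stdint.h>

/* Offset, in instructions, stored in a MIPS branch at byte address
   branch_addr that targets byte address label_addr. The offset is
   relative to the instruction after the branch (branch_addr + 4). */
int32_t branch_offset(uint32_t branch_addr, uint32_t label_addr) {
    return ((int32_t)label_addr - (int32_t)(branch_addr + 4)) / 4;
}
```

With the `bne` at address 0, the `sll` sits at 4 and the label at 8, giving (8 - 4) / 4 = 1. A backward branch yields a negative offset. The second pass of the assembler can do this only because the first pass has already assigned an address to every label.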




### Frame Layout on Stack

`blue() { pink(0,1,2,3,4,5); }`

`pink() { orange(10,11,12,13,14); }`

[Figure: stack frames for the nested calls, each holding saved regs, return address, and local variables]

### Register Usage

- Callee-save
  - Save it if you modify it
  - Assumes caller needs it
  - Save the previous contents of the register on procedure entry, restore just before procedure return
  - E.g. \$31 (if you are a non-leaf... what is that?)
- Caller-save
  - Save it if you need it after the call
  - Assume callee can clobber any one of the registers
  - Save contents of the register before proc call
  - Restore after the call























### **Direct Mapped Cache**

- Simplest
- Block can only be in one line in the cache
- How to determine this location?
  - Use modulo arithmetic
  - (Block address) modulo (# cache blocks)
  - For a power-of-2 number of blocks, use the low-order log<sub>2</sub>(cache size in blocks) bits of the block address
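
As a sketch of the index computation, assuming the number of cache blocks is a power of 2 (`cache_index` is a hypothetical helper name):

```c
#include <stdint.h>

/* Cache line index for a block address; num_blocks must be a power of 2. */
uint32_t cache_index(uint32_t block_addr, uint32_t num_blocks) {
    /* Masking keeps the low-order log2(num_blocks) bits,
       which equals block_addr % num_blocks. */
    return block_addr & (num_blocks - 1);
}
```

With 8 blocks, block addresses 5, 13, and 37 all map to line 5, so they compete for the same cache line.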









### **Eviction**

- Which cache line should be evicted from the cache to make room for a new line?
  - Direct-mapped
    - no choice, must evict line selected by index
  - Associative caches
    - random: select one of the lines at random
    - round-robin: similar to random
    - FIFO: replace oldest line
    - LRU: replace line that has not been used in the longest time


### **Short Performance Discussion**

- Complicated
  - Time from start-to-end (wall-clock time)
  - System time, user time
  - CPI (Cycles per instruction)
- Ideal CPI?




### Cache Performance

- Consider hit (H) and miss ratio (M)
- Average access time = H x AT<sub>cache</sub> + M x AT<sub>memory</sub>
  - Hit rate = 1 - Miss rate
- Access Time is given in cycles
- Ratio of Access times, 1:50
  - 90%: .90 + .10 x 50 = 5.9
  - 95%: .95 + .05 x 50 = .95 + 2.5 = 3.45
  - 99%: .99 + .01 x 50 = 1.49
  - 99.9%: .999 + .001 x 50 = 0.999 + 0.05 = 1.049
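
The same arithmetic as a function, for trying other hit rates and access-time ratios. A minimal sketch; `avg_access_time` is a hypothetical name and the times are in cycles.

```c
/* Average access time in cycles: H * AT_cache + M * AT_memory,
   where the miss rate M is 1 - H. */
double avg_access_time(double hit_rate, double at_cache, double at_memory) {
    double miss_rate = 1.0 - hit_rate;
    return hit_rate * at_cache + miss_rate * at_memory;
}
```

Calling it with a hit rate of 0.99 and access times of 1 and 50 cycles reproduces the 1.49 figure above.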

### Cache Hit/Miss Rate

- Consider a processor that is 2x faster
  - But memory is same speed
- Since AT is access time in terms of cycle time, AT<sub>memory</sub> doubles
- H x AT<sub>cache</sub> + M x AT<sub>memory</sub>
- Ratio of Access times, 1:100
- 99% :  $.99 + .01 \times 100 = 1.99$


### Cache Conscious Programming

    int a[NCOL][NROW];
    int sum = 0;

    for(j = 0; j < NCOL; ++j)
        for(i = 0; i < NROW; ++i)
            sum += a[j][i];

- Same program, trivial transformation, 3 out of four accesses hit in the cache


### Cache Hit/Miss Rate

- Original is 1 GHz, 1 ns is cycle time
- CPI (cycles per instruction): 1.49
- Therefore, 1.49 ns for each instruction
- New is 2 GHz, 0.5 ns is cycle time
- CPI: 1.99 at 0.5 ns per cycle, so 0.995 ns for each instruction
- So it doesn't go to 0.745 ns for each instruction
- Speedup is 1.5x (not 2x)
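
Written out as a worked equation, using time per instruction = CPI x cycle time:

$$\begin{aligned} t_{\text{old}} &= 1.49 \times 1\,\text{ns} = 1.49\,\text{ns} \\ t_{\text{new}} &= 1.99 \times 0.5\,\text{ns} = 0.995\,\text{ns} \\ \text{Speedup} &= \frac{t_{\text{old}}}{t_{\text{new}}} = \frac{1.49}{0.995} \approx 1.5 \end{aligned}$$

The clock doubled, but the memory stall portion of each instruction did not shrink, so the overall gain stays below 2x.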


### Can answer the question.....

- A: for i = 0 to 99
  - for j = 0 to 999
    - A[i][j] = complexComputation ()
- B: for j = 0 to 999
  - for i = 0 to 99
    - A[i][j] = complexComputation ()
- Why is B 15 times slower than A?


### Cache Conscious Programming

    int a[NCOL][NROW];
    int sum = 0;

    for(i = 0; i < NROW; ++i)
        for(j = 0; j < NCOL; ++j)
            sum += a[j][i];

- Every access is a cache miss!
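
The two loop orders compute the same sum and differ only in the order they touch memory. The sketch below assumes a small 64 x 64 array filled with known values; `fill`, `sum_inner_i`, and `sum_inner_j` are hypothetical names. Because C stores `a[j][0..NROW-1]` contiguously, walking `i` in the inner loop touches consecutive addresses, while walking `j` in the inner loop jumps `NROW` ints on every access.

```c
#define NCOL 64
#define NROW 64

static int a[NCOL][NROW];

/* Fill the array with distinct values 0, 1, 2, ... */
void fill(void) {
    for (int j = 0; j < NCOL; ++j)
        for (int i = 0; i < NROW; ++i)
            a[j][i] = j * NROW + i;
}

/* Inner loop walks i: consecutive addresses, cache friendly. */
long sum_inner_i(void) {
    long sum = 0;
    for (int j = 0; j < NCOL; ++j)
        for (int i = 0; i < NROW; ++i)
            sum += a[j][i];
    return sum;
}

/* Inner loop walks j: each access is NROW ints away from the last. */
long sum_inner_j(void) {
    long sum = 0;
    for (int i = 0; i < NROW; ++i)
        for (int j = 0; j < NCOL; ++j)
            sum += a[j][i];
    return sum;
}
```

Timing the two functions on a much larger array is a simple way to observe the effect; how large the gap is depends on the cache line size and the array dimensions.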


### Processor & Memory

- Currently, the processor's address lines are directly routed via the system bus to the memory banks
  - Simple, fast
- What happens when the program issues a load or store to an invalid location?
  - e.g. 0x00000000 ?
  - uninitialized pointer



### **Physical Addressing Problems**

- What happens when another program is executed concurrently on another processor?
  - The addresses will conflict
- We could try to relocate the second program to another location
  - Assuming there is one
  - Introduces more problems!

[Figure: processors sharing one memory; each program has its own Stack, Heap, Data, and Text segments]











### Virtual Addressing with a Cache

- Thus it takes an extra memory access to translate a VA to a PA
- This makes memory (cache) accesses very expensive (if every access was really two accesses)

### Address Translation

- Translation is done through the page table
  - A virtual memory miss (i.e., when the page is not in physical memory) is called a page fault

[Figure: the CPU sends a virtual address to the translation box (MMU), which sends a physical address to physical memory; a fault goes to the store]
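
A page table lookup can be sketched as follows. The layout is an assumption made for illustration: 4 KB pages, a tiny single-level table, and a hypothetical `pte_t` entry with only a valid bit and a physical page number. Real page tables carry more bits and are often multi-level.

```c
#include <stdint.h>

#define PAGE_BITS 12u   /* assumed 4 KB pages */
#define NUM_PAGES 16u   /* assumed tiny virtual address space */

typedef struct {
    int valid;      /* 1 if the page is in physical memory */
    uint32_t ppn;   /* physical page number */
} pte_t;

/* Translate va using a single-level page table.
   Returns 0 and stores the physical address in *pa on success,
   or -1 to signal a page fault. */
int translate(const pte_t table[NUM_PAGES], uint32_t va, uint32_t *pa) {
    uint32_t vpn = va >> PAGE_BITS;                  /* virtual page number */
    uint32_t offset = va & ((1u << PAGE_BITS) - 1);  /* unchanged by translation */
    if (vpn >= NUM_PAGES || !table[vpn].valid)
        return -1;                                   /* page fault */
    *pa = (table[vpn].ppn << PAGE_BITS) | offset;
    return 0;
}
```

Only the page number is translated; the offset within the page is copied through. A TLB caches recent (virtual page, physical page) pairs so that most translations skip the table access.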

### A TLB in the Memory Hierarchy



- A TLB miss:
  - If the page is not in main memory, then it's a true page fault
    - Takes 1,000,000's of cycles to service a page fault
- TLB misses are much more frequent than true page faults

### Hardware/Software Boundary

- Virtual to physical address translation is assisted by hardware?
  - Translation Lookaside Buffer (TLB) that caches the recent translations
    - TLB access time is part of the cache hit time
    - May allot an extra stage in the pipeline for TLB access
  - TLB miss handling can be in software (kernel handler) or hardware

### Virtual vs. Physical Caches



- L1 (on-chip) caches are typically virtual
- L2 (off-chip) caches are typically physical

### Hardware/Software Boundary

- Virtual to physical address translation is assisted by hardware?
  - Page table storage, fault detection and updating
    - Page faults result in interrupts (precise) that are then handled by the OS
    - Hardware must support (i.e., update appropriately) Dirty and Reference bits (e.g., ~LRU) in the Page Tables



### **Exceptions**

- System calls are control transfers to the OS, performed under the control of the user program
- Sometimes, need to transfer control to the OS at a time when the user program least expects it
  - Division by zero,
  - Alert from power supply that electricity is going out,
  - Alert from network device that a packet just arrived,
  - Clock notifying the processor that clock just ticked
- Some of these causes for interruption of execution have nothing to do with the user application
- Need a (slightly) different mechanism, that allows resuming the user application




### Terminology

- Trap
  - Any kind of a control transfer to the OS
- Syscall
  - Synchronous, program-initiated control transfer from user to the OS to obtain service from the OS
  - e.g. SYSCALL
- Exception
  - Asynchronous, program-initiated control transfer from user to the OS in response to an exceptional event
  - e.g. Divide by zero
- Interrupt
  - Asynchronous, device-initiated control transfer from user to the OS
  - e.g. Clock tick, network packet





### DMA: Direct Memory Access

- Non-DMA transfer: I/O device ←→ CPU ←→ RAM
  - for (i = 1 .. n)
    - CPU sends transfer request to device
    - I/O writes data to bus, CPU reads into registers
    - CPU writes data from registers to memory
- DMA transfer: I/O device ←→ RAM
  - CPU sets up DMA request on device
  - for (i = 1 .. n)
    - I/O device writes data to bus, RAM reads data

Based on lecture from Kevin Walsh











### Why Multicore?

- Moore's law
  - A law about transistors
  - Smaller means faster transistors
- Power consumption growing with transistors
- The power wall
  - We can't reduce voltage further
  - We can't remove more heat
- How else can we improve performance?









### Amdahl's Law

- Task: serial part, parallel part
- As number of processors increases:
  - time to execute parallel part goes to zero
  - time to execute serial part remains the same
- Serial part eventually dominates
- Must parallelize ALL parts of task

 $\mathsf{Speedup}(E) = \frac{\mathsf{Execution Time\ without\ } E}{\mathsf{Execution\ Time\ with\ } E}$ 


### Shared counters

- Usual result: works fine.
- Possible result: lost update!



- Occasional timing-dependent failure ⇒ Difficult to debug
- Called a race condition


### Amdahl's Law

- Consider an improvement E
- F of the execution time is affected
- S is the speedup

$$\text{Execution time (with } E) = \left((1-F) + F/S\right) \cdot \text{Execution time (without } E)$$

$$\text{Speedup (with } E) = \frac{1}{(1-F) + F/S}$$
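
The formula is easy to explore numerically. A minimal sketch, with `amdahl_speedup` as a hypothetical helper name, where `F` is the affected fraction of execution time and `S` is the speedup of that fraction:

```c
/* Overall speedup when a fraction F of execution time is sped up by S. */
double amdahl_speedup(double F, double S) {
    return 1.0 / ((1.0 - F) + F / S);
}
```

Speeding up half of the program by 2x gives 1 / (0.5 + 0.25), about 1.33x overall. Even with an enormous S, a program that is 90% parallel cannot exceed 10x, because the serial 10% remains.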



### Race conditions

- Def: a timing dependent error involving shared state
  - Whether it happens depends on how threads are scheduled: who wins "races" to instructions that update state
  - Races are intermittent, may occur rarely
    - Timing dependent = small changes can hide bug
  - A program is correct only if all possible schedules are safe
    - Number of possible schedule permutations is huge
    - Need to imagine an adversary who switches contexts at the worst possible time


### Multithreaded Processes

[Figure: a single-threaded process (code, data, files, one set of registers, one stack) next to a multithreaded process (code, data, and files shared; each thread has its own registers and stack)]

### **Critical Sections**

- Basic way to eliminate races: use critical sections that only one thread can be in
  - Contending threads must wait to enter



### Mutexes

- Critical sections typically associated with mutual exclusion locks (mutexes)
- Only one thread can hold a given mutex at a time
- Acquire (lock) mutex on entry to critical section
  - Or block if another thread already holds it
- Release (unlock) mutex on exit
  - Allow one waiting thread (if any) to acquire & proceed





### Protecting an invariant

```
// invariant: data is in buffer[first..last-1]. Protected by m.
pthread_mutex_t *m;
char buffer[1000];
int first = 0, last = 0;

void put(char c) {
    pthread_mutex_lock(m);
    buffer[last] = c;
    last++;
    pthread_mutex_unlock(m);
}

char get() {
    pthread_mutex_lock(m);
    char c = buffer[first];   // Problem: what if first == last (buffer empty)?
    first++;
    pthread_mutex_unlock(m);
    return c;
}
```

• Rule of thumb: all updates that can affect invariant become critical sections.







### Where to?

- CS 3110: Better concurrent programming
- CS 4410: The Operating System!
- CS 4450: Networking
- CS 6620: Graphics
- And many more...


Thank you!