# CS 5220: Applications of Parallel Computers
## Instruction-level parallelism
## 01 Sep 2015
## Example 1: Laundry
- Three stages to laundry: wash, dry, fold
- Three loads: darks, lights, underwear
- How long will this take?
How long will it take?
Three loads of laundry to wash, dry, fold.
One hour per stage. What is the total time?
A: 9 hours
=: You spend too much time on laundry
A: 5 hours
=: That's what I had in mind!
A: 3 hours
=: Maybe at a laundromat; what if only one washer/drier?
## Setup
- Three *functional units*
  - Washer
  - Drier
  - Folding table
- Different cases
  - One load at a time
  - Three loads with one washer/drier
  - Three loads with friends at the laundromat
Serial execution (9 hours)

| Hour   | 1    | 2   | 3    | 4    | 5   | 6    | 7    | 8   | 9    |
|--------|------|-----|------|------|-----|------|------|-----|------|
| Load 1 | Wash | Dry | Fold |      |     |      |      |     |      |
| Load 2 |      |     |      | Wash | Dry | Fold |      |     |      |
| Load 3 |      |     |      |      |     |      | Wash | Dry | Fold |
Pipelined execution (5 hours)

| Hour   | 1    | 2    | 3    | 4    | 5    |
|--------|------|------|------|------|------|
| Load 1 | Wash | Dry  | Fold |      |      |
| Load 2 |      | Wash | Dry  | Fold |      |
| Load 3 |      |      | Wash | Dry  | Fold |
Parallel units (3 hours)

| Hour   | 1    | 2   | 3    |
|--------|------|-----|------|
| Load 1 | Wash | Dry | Fold |
| Load 2 | Wash | Dry | Fold |
| Load 3 | Wash | Dry | Fold |
## Example 2: Arithmetic
$$2 \times 2 + 3 \times 3$$
> A child of five would understand this. Send someone to fetch a child
> of five.
> -- [Groucho Marx](http://www.goodreads.com/quotes/98966-a-child-of-five-could-understand-this-send-someone-to)
How long will it take?
Suppose all children can do one add or multiply per second.
How long would it take to compute $2 \times 2 + 3 \times 3$?
A: 3 seconds
=: OK, three ops at one op/s; what if there are multiple kids?
A: 2 seconds
=: OK, if two kids do the multiplies in parallel
A: 1 second
=: Not without finding faster kids!
One child

| Second  | 1                | 2                | 3            |
|---------|------------------|------------------|--------------|
| Child 1 | $2 \times 2 = 4$ | $3 \times 3 = 9$ | $4 + 9 = 13$ |

Total time is 3 seconds
Two children

| Second  | 1                | 2            |
|---------|------------------|--------------|
| Child 1 | $2 \times 2 = 4$ | $4 + 9 = 13$ |
| Child 2 | $3 \times 3 = 9$ |              |

Total time is 2 seconds
Many children (the extra children sit idle)

| Second  | 1                | 2            |
|---------|------------------|--------------|
| Child 1 | $2 \times 2 = 4$ | $4 + 9 = 13$ |
| Child 2 | $3 \times 3 = 9$ |              |

Total time remains 2 seconds = sum of latencies for
two stages with a data dependency between them.
## Pipelining
- Improves *bandwidth*, but not *latency*
- Potential speedup = number of stages
- What if there's a branch?
- Different pipelines for different functional units
  - Front-end has a pipeline
  - Functional units (FP adder, multiplier) pipelined
  - Divider often not pipelined
## SIMD
- Single Instruction, Multiple Data
- Old idea with a resurgence in the 1990s (for graphics)
- Now short vectors are ubiquitous
  - 256-bit wide AVX on CPU
  - 512-bit wide on Xeon Phi!
- Alignment matters
## Example: [My laptop](http://www.everymac.com/systems/apple/macbook_pro/specs/macbook-pro-core-i5-2.6-13-late-2013-retina-display-specs.html)
MacBook Pro (Retina, 13 in, Late 2013)
- [Intel Core i5-4288U (Haswell arch)](http://ark.intel.com/products/75991/Intel-Core-i5-4288U-Processor-3M-Cache-up-to-3_10-GHz)
- Two cores / four HW threads
- Variable clock: 2.6 GHz / 3.1 GHz Turbo Boost
- Four-wide front-end (fetch+decode 4 ops/cycle/core)
- Operations internally broken down into "micro-ops"
  - Caches micro-ops -- like a hardware JIT?!
## My laptop: floating point
- [256-bit SIMD (AVX)](https://en.wikipedia.org/wiki/Advanced_Vector_Extensions)
- [Two fully-pipelined FP units](http://www.realworldtech.com/haswell-cpu/4/)
  - Two multiply or fused multiply-add (FMA) ops per cycle
  - Only one regular add per cycle
- FMA = Fused Multiply-Add: one operation, one rounding error
## Peak flop rate
- Result (double precision) $\approx 100$ GFlop/s
  - 2 flops/FMA
  - $\times 4$ FMA/vector FMA = 8 flops/vector FMA
  - $\times 2$ vector FMAs/cycle = 16 flops/cycle
  - $\times 2$ cores = 32 flops/cycle
  - $\times 3.1 \times 10^9$ cycles/s $\approx 100$ GFlop/s
- Single precision $\approx 200$ GFlop/s
## Reaching peak flop
- Need lots of *independent* vector work
- FMA latency = 5 cycles on Haswell
- Need $8 \times 5 = 40$ *independent* FMAs to reach peak
- Great for matrix multiply -- hard in general
- Still haven't [talked about memory!](/slides/2015-09-01-memory.html)
## Punchline
- Special features: SIMD, FMA
- Compiler understands how to use these *in principle*
  - Rearranges instructions to get a good mix
  - Tries to use FMAs, vector instructions
- *In practice*, the compiler needs your help
  - Set optimization flags, pragmas, etc.
  - Rearrange code to make it obvious and predictable
  - Use special intrinsics or library routines
  - Choose data layouts + algorithms to suit the machine
- Goal: you handle the high level, the compiler handles the low level