# CS 5220: Applications of Parallel Computers

## Instruction-level parallelism

01 Sep 2015

## Example 1: Laundry

- Three stages to laundry: wash, dry, fold
- Three loads: darks, lights, underwear
- How long will this take?

## How long will it take?

Three loads of laundry to wash, dry, fold. One hour per stage. What is the total time?

- A: 9 hours -- You spend too much time on laundry!
- A: 5 hours -- That's what I had in mind!
- A: 3 hours -- Maybe at a laundromat; what if there's only one washer/drier?

## Setup

- Three *functional units*
  - Washer
  - Drier
  - Folding table
- Different cases
  - One load at a time
  - Three loads with one washer/drier
  - Three loads with friends at the laundromat

## Serial execution (9 hours)

    Load 1: Wash Dry  Fold
    Load 2:                Wash Dry  Fold
    Load 3:                               Wash Dry  Fold

## Pipelined execution (5 hours)

    Load 1: Wash Dry  Fold
    Load 2:      Wash Dry  Fold
    Load 3:           Wash Dry  Fold

## Parallel units (3 hours)

    Load 1: Wash Dry  Fold
    Load 2: Wash Dry  Fold
    Load 3: Wash Dry  Fold
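The three timings above follow from a simple counting argument, sketched below as hypothetical helper functions (the names and the closed-form model are mine, not part of the slides):

```c
/* Toy timing model for the laundry example: every stage takes one hour.
 * These are just the closed-form times behind the three diagrams. */

/* One load at a time, no overlap: loads * stages hours. */
int serial_hours(int loads, int stages) {
    return loads * stages;
}

/* One unit per stage, loads overlap in a pipeline: fill the pipe
 * (stages hours), then one finished load drains out per hour. */
int pipelined_hours(int loads, int stages) {
    return stages + (loads - 1);
}

/* Enough units for every load at once: just the stage latency. */
int parallel_hours(int loads, int stages) {
    (void)loads;  /* extra loads run side by side for free */
    return stages;
}
```

With three loads and three stages these give 9, 5, and 3 hours, matching the diagrams.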
## Example 2: Arithmetic

$$2 \times 2 + 3 \times 3$$

> A child of five would understand this. Send someone to fetch a child
> of five.
>
> -- [Groucho Marx](http://www.goodreads.com/quotes/98966-a-child-of-five-could-understand-this-send-someone-to)

## How long will it take?

Suppose all children can do one add or multiply per second. How long would it take to compute $2 \times 2 + 3 \times 3$?

- A: 3 seconds -- OK, three ops at one op/s; what if there are multiple kids?
- A: 2 seconds -- OK, if two kids do the multiplies in parallel
- A: 1 second -- Not without finding faster kids!

## One child

- Second 1: $2 \times 2 = 4$
- Second 2: $3 \times 3 = 9$
- Second 3: $4 + 9 = 13$

Total time is 3 seconds

## Two children

- Second 1: $2 \times 2 = 4$ and $3 \times 3 = 9$ (in parallel)
- Second 2: $4 + 9 = 13$

Total time is 2 seconds

## Many children

- Second 1: $2 \times 2 = 4$ and $3 \times 3 = 9$ (extra children sit idle)
- Second 2: $4 + 9 = 13$

Total time remains 2 seconds no matter how many children help: the add depends on both products, so the critical path is two dependent stages, and its length is the sum of their latencies.
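The two-second floor can be captured in a toy model (the function name and the work/depth framing are mine): with $p$ children, the time is bounded below both by the total work divided by $p$ and by the depth of the dependency chain.

```c
/* Toy model: time to evaluate a*a + b*b with p children,
 * each doing one op per second.  There are 3 ops of work, but the
 * add depends on both multiplies, so the dependence depth is 2.
 * Time = max(ceil(work / p), depth). */
int eval_seconds(int nchildren) {
    int work  = 3;  /* two multiplies + one add          */
    int depth = 2;  /* multiply, then the dependent add  */
    int work_bound = (work + nchildren - 1) / nchildren; /* ceil */
    return work_bound > depth ? work_bound : depth;
}
```

One child takes 3 seconds, two children take 2, and any larger number still takes 2.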

## Pipelining

- Improves *bandwidth*, but not *latency*
- Potential speedup = number of stages
- What if there's a branch?
- Different pipelines for different functional units
  - Front-end has a pipeline
  - Functional units (FP adder, multiplier) pipelined
  - Divider often not pipelined
## SIMD

- Single Instruction Multiple Data
- Old idea with a resurgence in the 90s (for graphics)
- Now short vectors are ubiquitous
  - 256-bit wide AVX on CPU
  - 512-bit wide on Xeon Phi!
- Alignment matters
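A loop the compiler can map onto those short vectors looks like this minimal sketch (the `axpy` name is mine; `restrict` promises the arrays don't overlap, which the vectorizer needs):

```c
#include <stddef.h>

/* A SIMD-friendly loop: unit stride, no loop-carried dependence,
 * no branches in the body.  Compiled with, e.g., -O3 -mavx, each
 * group of 4 doubles can become one 256-bit vector FMA. */
void axpy(size_t n, double a, const double* restrict x,
          double* restrict y) {
    for (size_t i = 0; i < n; ++i)
        y[i] += a * x[i];  /* one multiply-add per element */
}
```

Whether the compiler actually vectorizes this depends on flags and alignment, which is part of why alignment matters.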
## Example: [My laptop](http://www.everymac.com/systems/apple/macbook_pro/specs/macbook-pro-core-i5-2.6-13-late-2013-retina-display-specs.html)

MacBook Pro (Retina, 13 in, Late 2013)

- [Intel Core i5-4288U (Haswell arch)](http://ark.intel.com/products/75991/Intel-Core-i5-4288U-Processor-3M-Cache-up-to-3_10-GHz)
- Two cores / four HW threads
- Variable clock: 2.6 GHz base / 3.1 GHz TurboBoost
- Four-wide front end (fetch+decode 4 ops/cycle/core)
- Operations internally broken down into "micro-ops"
  - Caches micro-ops -- like hardware JIT?!
## My laptop: floating point

- Two fully-pipelined multiplies or FMAs per cycle
  - FMA = Fused Multiply-Add: one op, one rounding error
- [256-bit SIMD (AVX)](https://en.wikipedia.org/wiki/Advanced_Vector_Extensions)
- [Two fully-pipelined FP units](http://www.realworldtech.com/haswell-cpu/4/)
  - Two multiplies or FMAs per cycle
  - Only one regular add per cycle
## Peak flop rate

Result (double precision) $\approx 100$ GFlop/s:

- 2 flops/FMA
- $\times 4$ FMAs/vector FMA = 8 flops/vector FMA
- $\times 2$ vector FMAs/cycle = 16 flops/cycle
- $\times 2$ cores = 32 flops/cycle
- $\times 3.1 \times 10^9$ cycles/s $\approx 100$ GFlop/s
- Single precision: $\approx 200$ GFlop/s
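The chain of factors above can be written out directly (a sketch of the slide's arithmetic; the function name is mine):

```c
/* Reproduce the slide's peak-flop arithmetic for this Haswell i5. */
double peak_gflops(void) {
    double flops_per_fma = 2.0;  /* multiply + add, fused         */
    double lanes         = 4.0;  /* 4 doubles in a 256-bit vector */
    double fma_units     = 2.0;  /* vector FMAs issued per cycle  */
    double cores         = 2.0;
    double ghz           = 3.1;  /* TurboBoost clock, cycles/ns   */
    return flops_per_fma * lanes * fma_units * cores * ghz;
}
```

This comes out to 99.2 GFlop/s, i.e. the "$\approx 100$" on the slide; halving the lane width for single precision doubles it.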
## Reaching peak flop

- Need lots of *independent* vector work
- FMA latency = 5 cycles on Haswell
- Need $8 \times 5 = 40$ *independent* FMAs to reach peak
- Great for matrix multiply -- hard in general
- Still haven't [talked about memory!](/slides/2015-09-01-memory.html)
## Punchline

- Special features: SIMD, FMA
- Compiler understands how to use these *in principle*
  - Rearranges instructions to get a good mix
  - Tries to use FMAs, vector instructions
- *In practice*, the compiler needs your help
  - Set optimization flags, pragmas, etc.
  - Rearrange code to make it obvious and predictable
  - Use special intrinsics or library routines
  - Choose data layouts + algorithms to suit the machine
- Goal: you handle the high-level, the compiler handles the low-level