CS 5220

Applications of Parallel Computers

Single-Core Architecture

Prof David Bindel

Just for fun

Is this fair?

See: “Should I port my code to a GPU?”

The real world

The idealized machine

  • Address space of named words

  • Basic ops: register read/write, logic, arithmetic

  • Everything runs in program order

  • High-level language \(\rightarrow\) “obvious” machine code

  • All operations take about the same amount of time

The real world

Memory operations are not all the same!

  • Speeds vary (registers and caches)

  • Memory layout dramatically affects performance

The real world

Instructions are non-obvious!

  • Pipelining allows instructions to overlap

  • Functional units run in parallel (and out of order)

  • Instructions take different amounts of time

  • Cost depends on order, instruction mix

The real world

Our goal:
enough understanding to help the compiler out.



We hold these truths to be self-evident:

  1. One should not sacrifice correctness for speed

  2. One should not re-invent (or re-tune) the wheel

  3. Your time matters more than computer time


Less obvious, but still true:

  1. Most of the time goes to a few bottlenecks

  2. The bottlenecks are hard to find without measuring

  3. Communication is expensive (often a bottleneck)


A little good hygiene will save your sanity

  1. Automate testing
  2. Time carefully
  3. Use version control

A sketch of reality

Today, a play in two acts:

  1. One core is not so serial

  2. Memory matters

Act 1

One core is not so serial.

Parallel processing at the laundromat

  • Three stages to laundry: wash, dry, fold.

  • Three loads: darks, lights, underwear

  • How long will this take?

Parallel processing at the laundromat

Serial version:

        1    2    3    4    5    6    7    8    9
darks   wash dry  fold
lights                 wash dry  fold
under                                 wash dry  fold

Parallel processing at the laundromat

Pipeline version:

        1    2    3    4    5
darks   wash dry  fold            Dinner?
lights       wash dry  fold       Cat videos?
under             wash dry  fold  Gym and tanning?


  • Pipelining improves bandwidth, but not latency

  • Potential speedup = number of stages

    • But what if there’s a branch?


Different pipelines for different functional units

  • Front-end has a pipeline

  • Functional units have own pipelines

    • Example: FP adder, FP multiplier
    • Divider often not pipelined


SIMD

  • Single Instruction, Multiple Data

  • Cray-1 (1976): 8 registers \(\times\) 64 words of 64 bits each

  • Resurgence in mid-late 90s (for graphics)

  • Now short vectors (256-512 bit) are ubiquitous

Wide front-end

Fetch/decode or retire multiple ops at once

  • Limited by instruction mix
    (Different ops use different ports)

  • NB: May dynamically translate to micro-ops


Hyperthreading

Support multiple HW threads per core

  • Independent registers, program counter
  • Shared functional units
  • Helps keep the core fed with independent work

Out-of-order execution

  • Internally reorder operations

  • Have to commit instructions in order

  • May throw away uncommitted results
    (speculative execution)

  • Limited by data dependencies

All together, now...

  • Front-end reads several ops at once
  • Ops may act on vectors (SIMD)
  • Break into mystery micro-ops (and cache them)
  • Out-of-order scheduling to functional units
  • Pipelining within functional units
  • In-order commit of finished ops
  • Can discard before commit
    (speculative execution)


The compiler understands the CPU in principle:

  • Rearranges instructions to get a good mix

  • Tries to make use of FMAs, SIMD instructions, etc.


Needs help in practice:

  • Set optimization flags, pragmas, etc.

  • Make code obvious and predictable

  • Expose local independent work

  • Use special intrinsics or library routines

  • Data layouts, algorithms to suit machine


The goal:

  • You handle high-level optimization
  • Compiler handles low-level stuff

Act 2

Memory matters.

Basic problem

  • Memory latency = how long to get a requested item

  • Memory bandwidth = steady-state rate

  • Bandwidth improving faster than latency

  • Inverse bandwidth (time per byte) remains much worse than time per flop

My machine

  • Theoretical peak flop rate: 51.2 GFlop/s (w/o turbo)

  • Peak memory bandwidth: 31.79 GB/s (2 banks)

  • Arithmetic intensity = flops / memory accesses

  • Example: Sum several million doubles (AI = 1)?

    • About 4 GFlop/s peak if BW limited

    • Much worse if latency limited

  • So what can we do?


Programs usually have locality

  • Spatial locality: things close to each other tend to be accessed consecutively

  • Temporal locality: use a “working set” of data repeatedly

Cache hierarchy built to use locality.

How caches help

  • Hide memory costs by reusing data

    • Exploit temporal locality
  • Use bandwidth to

    • Fetch by cache line (spatial locality)
    • Support multiple reads
    • Prefetch data

This is mostly automatic and implicit.

Cache basics

  • Store cache lines of several contiguous bytes (often 64 B)

  • Cache hit when copy of needed data in cache

  • Cache miss otherwise. Three basic types:

    • Compulsory miss: never used this data before

    • Capacity miss: filled the cache with other things since this was last used – working set too big

    • Conflict miss: insufficient associativity for access pattern

Cache associativity

Where can data go in cache?

  • Direct-mapped: each address can only go in one cache location (e.g. store address xxxx1101 only at cache location 1101)

  • \(n\)-way: each address can go into one of \(n\) possible cache locations (e.g. a 16-way cache can hold up to 16 lines with addresses xxxx1101 at cache location 1101).

Higher associativity is more expensive.


We have \(N = 10^6\) two-dimensional coordinates, and want their centroid. Which of these is faster and why?

  1. Store an array of \((x_i, y_i)\) coordinates. Loop \(i\) and simultaneously sum the \(x_i\) and the \(y_i\).

  2. Store an array of \((x_i, y_i)\) coordinates. Loop \(i\) and sum the \(x_i\), then sum the \(y_i\) in a separate loop.

  3. Store the \(x_i\) in one array, the \(y_i\) in a second array. Sum the \(x_i\), then sum the \(y_i\).

Caches on my laptop

  • 32 KB L1 data and instruction caches (per core),
    8-way associative

  • 256 KB L2 cache (per core),
    4-way associative

  • 2 MB L3 cache (per core),
    12-way associative

A memory benchmark (membench)

for array A of length L from 4 KB to 8 MB by 2x
  for stride s from 4 bytes to L/2 by 2x
    time the following loop:
      for i = 0 to L by s
        load A[i] from memory

membench on my laptop

[Figure: line graph of membench timing data; raw timings available as CSV]

membench on my laptop

[Figure: heat map of membench timing data; raw timings available as CSV]

membench on my laptop

  • Vertical: 64B line size (\(2^6\)), 4K page size (\(2^{12}\))

  • Horizontal: 32K L1 (\(2^{15}\)), 256K L2 (\(2^{18}\)), 2 MB L3 (\(2^{21}\))

  • Diagonal: 8-way cache associativity, 512-entry L2 TLB

The moral

Even for simple programs, performance is a complicated function of architecture!

  • Need to know a little to write fast programs

  • Want simple models to understand efficiency

  • Want tricks to help design fast codes

    • Example: blocking (also called tiling)