CS 4220: Numerical Analysis

BLAS and basic operations

David Bindel

2026-01-21

Matrix algebra and linear algebra

Linear algebra

Focus on:

  • Abstract vector spaces
  • Mappings on/between those spaces
  • Invariant properties

Matrix computations

Add concerns about:

  • Matrix shapes
  • Graph of a matrix
  • Efficient representations

… all of which are basis-dependent!

Dense matrix basics

The basics

Build on Basic Linear Algebra Subroutines (BLAS):

  • Level 1: Vector operations (e.g. \(y^T x\))
  • Level 2: Matrix-vector operations (e.g. \(Ax\))
  • Level 3: Matrix-matrix operations (e.g. \(AB\))

Arithmetic costs are \(O(n)\), \(O(n^2)\), and \(O(n^3)\), respectively.
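In Julia, for instance, all three levels are reachable through the `LinearAlgebra` standard library, which dispatches to a tuned BLAS; a minimal sketch:

```julia
using LinearAlgebra

n = 1000
x, y = randn(n), randn(n)
A, B = randn(n, n), randn(n, n)

alpha = dot(y, x)   # Level 1: y'x, O(n) flops
z = A * x           # Level 2: matrix-vector product, O(n^2) flops
C = A * B           # Level 3: matrix-matrix product, O(n^3) flops
```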

Computation and memory

Worry about two costs:

  • Storage used
  • Time used
    • For arithmetic…
    • and for data movement

Data movement time varies a lot, depending on

  • Data layout in memory
  • Access patterns in algorithms

Memory and meaning

Memory (linearly addressed) contains 1D floating point array:

1.0
2.0
3.0
4.0

Interpretation:

\[ x = \begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix} \mbox{ or } A = \begin{bmatrix} 1 & 3 \\ 2 & 4 \end{bmatrix} \mbox{ or ...} \]

Column major and row major

1.0
2.0
3.0
4.0

Column major

\[ A = \begin{bmatrix} 1 & 3 \\ 2 & 4 \end{bmatrix} \] (\(a_{ij}\) at data[i+j*m])

Fortran, MATLAB, Julia, R, NumPy / SciPy, Eigen and Armadillo (C++)

Row major

\[ A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} \]

(\(a_{ij}\) at data[j+i*n])

C and C++ (built in), Pascal

Column major in practice

let
    A = [1.0 3.0;
         2.0 4.0]
    A[:]  # View multidimensional array mem as vector
end
4-element Vector{Float64}:
 1.0
 2.0
 3.0
 4.0

So what?

Why do you need to know this?

Consider matvecs

Can be row-oriented (ij) or column-oriented (ji)

\[ \begin{bmatrix} \color{purple}{y_{1}} \\ \color{lightgray}{y_{2}} \\ \color{lightgray}{y_{3}} \end{bmatrix} = \begin{bmatrix} \color{blue}{a_{11}} & \color{blue}{a_{12}} & \color{blue}{a_{13}} \\ \color{lightgray}{a_{21}} & \color{lightgray}{a_{22}} & \color{lightgray}{a_{23}} \\ \color{lightgray}{a_{31}} & \color{lightgray}{a_{32}} & \color{lightgray}{a_{33}} \end{bmatrix} \begin{bmatrix} \color{red}{x_{1}} \\ \color{red}{x_{2}} \\ \color{red}{x_{3}} \end{bmatrix} \] \[ \begin{bmatrix} \color{purple}{y_{1}} \\ \color{purple}{y_{2}} \\ \color{purple}{y_{3}} \end{bmatrix} = \begin{bmatrix} \color{blue}{a_{11}} & \color{lightgray}{a_{12}} & \color{lightgray}{a_{13}} \\ \color{blue}{a_{21}} & \color{lightgray}{a_{22}} & \color{lightgray}{a_{23}} \\ \color{blue}{a_{31}} & \color{lightgray}{a_{32}} & \color{lightgray}{a_{33}} \end{bmatrix} \begin{bmatrix} \color{red}{x_{1}} \\ \color{lightgray}{x_{2}} \\ \color{lightgray}{x_{3}} \end{bmatrix} \]

Which is faster?

function matvec1_row(A, x)
  m, n = size(A)
  y = zeros(m)
  for i = 1:m
    for j = 1:n
      y[i] += A[i,j]*x[j]
    end
  end
  y
end
function matvec1_col(A, x)
  m, n = size(A)
  y = zeros(m)
  for j = 1:n
    for i = 1:m
      y[i] += A[i,j]*x[j]
    end
  end
  y
end
  • Consider \(m = n = 4000\) on my laptop (M1 Macbook Pro)
  • matvec1_row takes about 118 ms
  • matvec1_col takes about 14 ms
  • A*x takes about 3 ms
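To reproduce this comparison yourself (the exact numbers will differ across machines), you can time the two loop orders directly; a sketch, with a warm-up call to each routine so JIT compilation is not included in the measurement:

```julia
function matvec1_row(A, x)
    m, n = size(A)
    y = zeros(m)
    for i = 1:m, j = 1:n
        y[i] += A[i,j]*x[j]
    end
    y
end

function matvec1_col(A, x)
    m, n = size(A)
    y = zeros(m)
    for j = 1:n, i = 1:m
        y[i] += A[i,j]*x[j]
    end
    y
end

n = 4000
A, x = randn(n, n), randn(n)
matvec1_row(A, x); matvec1_col(A, x); A*x  # warm up (JIT compile)
t_row  = @elapsed matvec1_row(A, x)
t_col  = @elapsed matvec1_col(A, x)
t_blas = @elapsed A*x
println((t_row, t_col, t_blas))
```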

Why is it faster?

Caches take advantage of locality of reference

  • Temporal locality: Access the same items close in time
  • Spatial locality: Access data in close proximity together
  • Cache hierarchy: small/fast to big/slow
  • Get a line at a time (say 64 bytes) – spatial locality
  • Try to keep recently-used data – temporal locality

Cache on my laptop

Numbers from one core on my laptop:

  • Registers (on chip): use immediately
  • L1 cache (128 KB): 3-4 cycles latency
  • L2 cache (12 MB, shared): 18 cycles latency
  • L3 cache (8 MB, shared): 18 cycles latency
  • M1 uses 128-byte cache lines (64 bytes is more common)
  • Main memory: worst case about 350 cycles
  • Note: Can launch several vector ops per cycle!

Explanation

  • At 4000-by-4000 with 8-byte floats, A takes about 128 MB
    • \(10\times\) the biggest cache!
    • So we are mostly going to main memory
  • matvec1_col accesses A with unit stride
    • This is great for spatial locality!
    • One cache line is 16 doubles
    • So about an order of magnitude fewer cache misses
    • And arithmetic is negligible compared to data transfers

Matrix-matrix products

\[c_{ij} = \sum_k a_{ik} b_{kj}\]

  • Not two, but six possible orderings of \(i, j, k\)
  • These provide some useful perspectives…
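As a reference point, here is the ij(k) ordering written out in Julia; any permutation of the three loops computes the same \(C\), only the memory access pattern changes:

```julia
function matmul_ijk(A, B)
    m, p = size(A)
    p2, n = size(B)
    p == p2 || error("dimension mismatch")
    C = zeros(m, n)
    for i = 1:m, j = 1:n, k = 1:p
        C[i,j] += A[i,k]*B[k,j]
    end
    C
end

A, B = randn(5, 7), randn(7, 4)
C = matmul_ijk(A, B)
```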

Matrix-matrix: inner products

ij(k) or ji(k): Each entry of \(C\) is a dot product of a row of \(A\) and a column of \(B\).

\[ \begin{bmatrix} \color{purple}{c_{11}} & \color{lightgray}{c_{12}} & \color{lightgray}{c_{13}} \\ \color{lightgray}{c_{21}} & \color{lightgray}{c_{22}} & \color{lightgray}{c_{23}} \\ \color{lightgray}{c_{31}} & \color{lightgray}{c_{32}} & \color{lightgray}{c_{33}} \end{bmatrix} = \begin{bmatrix} \color{blue}{a_{11}} & \color{blue}{a_{12}} & \color{blue}{a_{13}} \\ \color{lightgray}{a_{21}} & \color{lightgray}{a_{22}} & \color{lightgray}{a_{23}} \\ \color{lightgray}{a_{31}} & \color{lightgray}{a_{32}} & \color{lightgray}{a_{33}} \end{bmatrix} \begin{bmatrix} \color{red}{b_{11}} & \color{lightgray}{b_{12}} & \color{lightgray}{b_{13}} \\ \color{red}{b_{21}} & \color{lightgray}{b_{22}} & \color{lightgray}{b_{23}} \\ \color{red}{b_{31}} & \color{lightgray}{b_{32}} & \color{lightgray}{b_{33}} \end{bmatrix} \]

Matrix-matrix: outer products

k(ij) or k(ji): \(C\) is a sum of outer products of column \(k\) of \(A\) and row \(k\) of \(B\).

\[ \begin{bmatrix} \color{purple}{c_{11}} & \color{purple}{c_{12}} & \color{purple}{c_{13}} \\ \color{purple}{c_{21}} & \color{purple}{c_{22}} & \color{purple}{c_{23}} \\ \color{purple}{c_{31}} & \color{purple}{c_{32}} & \color{purple}{c_{33}} \end{bmatrix} = \begin{bmatrix} \color{blue}{a_{11}} & \color{lightgray}{a_{12}} & \color{lightgray}{a_{13}} \\ \color{blue}{a_{21}} & \color{lightgray}{a_{22}} & \color{lightgray}{a_{23}} \\ \color{blue}{a_{31}} & \color{lightgray}{a_{32}} & \color{lightgray}{a_{33}} \end{bmatrix} \begin{bmatrix} \color{red}{b_{11}} & \color{red}{b_{12}} & \color{red}{b_{13}} \\ \color{lightgray}{b_{21}} & \color{lightgray}{b_{22}} & \color{lightgray}{b_{23}} \\ \color{lightgray}{b_{31}} & \color{lightgray}{b_{32}} & \color{lightgray}{b_{33}} \end{bmatrix} \]
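In the outer-product ordering, each pass of the \(k\) loop adds a rank-1 update to all of \(C\); a Julia sketch using views to avoid copies:

```julia
function matmul_outer(A, B)
    m, p = size(A)
    _, n = size(B)
    C = zeros(m, n)
    for k = 1:p
        # Rank-1 update: C += A[:,k] * B[k,:]'
        C .+= view(A, :, k) .* transpose(view(B, k, :))
    end
    C
end

A, B = randn(6, 3), randn(3, 5)
C = matmul_outer(A, B)
```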

Matrix-matrix: row matvecs

i(jk) or i(kj): Each row of \(C\) is a row of \(A\) multiplied by \(B\)

\[ \begin{bmatrix} \color{purple}{c_{11}} & \color{purple}{c_{12}} & \color{purple}{c_{13}} \\ \color{lightgray}{c_{21}} & \color{lightgray}{c_{22}} & \color{lightgray}{c_{23}} \\ \color{lightgray}{c_{31}} & \color{lightgray}{c_{32}} & \color{lightgray}{c_{33}} \end{bmatrix} = \begin{bmatrix} \color{blue}{a_{11}} & \color{blue}{a_{12}} & \color{blue}{a_{13}} \\ \color{lightgray}{a_{21}} & \color{lightgray}{a_{22}} & \color{lightgray}{a_{23}} \\ \color{lightgray}{a_{31}} & \color{lightgray}{a_{32}} & \color{lightgray}{a_{33}} \end{bmatrix} \begin{bmatrix} \color{red}{b_{11}} & \color{red}{b_{12}} & \color{red}{b_{13}} \\ \color{red}{b_{21}} & \color{red}{b_{22}} & \color{red}{b_{23}} \\ \color{red}{b_{31}} & \color{red}{b_{32}} & \color{red}{b_{33}} \end{bmatrix} \]

Matrix-matrix: col matvecs

j(ik) or j(ki): Each column of \(C\) is \(A\) times a column of \(B\)

\[ \begin{bmatrix} \color{purple}{c_{11}} & \color{lightgray}{c_{12}} & \color{lightgray}{c_{13}} \\ \color{purple}{c_{21}} & \color{lightgray}{c_{22}} & \color{lightgray}{c_{23}} \\ \color{purple}{c_{31}} & \color{lightgray}{c_{32}} & \color{lightgray}{c_{33}} \end{bmatrix} = \begin{bmatrix} \color{blue}{a_{11}} & \color{blue}{a_{12}} & \color{blue}{a_{13}} \\ \color{blue}{a_{21}} & \color{blue}{a_{22}} & \color{blue}{a_{23}} \\ \color{blue}{a_{31}} & \color{blue}{a_{32}} & \color{blue}{a_{33}} \end{bmatrix} \begin{bmatrix} \color{red}{b_{11}} & \color{lightgray}{b_{12}} & \color{lightgray}{b_{13}} \\ \color{red}{b_{21}} & \color{lightgray}{b_{22}} & \color{lightgray}{b_{23}} \\ \color{red}{b_{31}} & \color{lightgray}{b_{32}} & \color{lightgray}{b_{33}} \end{bmatrix} \]
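The column-matvec ordering is just a loop of matrix-vector products, `C[:,j] = A*B[:,j]`; a sketch using in-place `mul!` so each column of `C` is written directly:

```julia
using LinearAlgebra

function matmul_cols(A, B)
    m, p = size(A)
    _, n = size(B)
    C = zeros(m, n)
    for j = 1:n
        mul!(view(C, :, j), A, view(B, :, j))  # C[:,j] = A*B[:,j]
    end
    C
end

A, B = randn(6, 3), randn(3, 5)
C = matmul_cols(A, B)
```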

Matrix-matrix: blocking

  • Can do any traversal over \((i,j,k)\) index space
  • So organize around blocks: \(C_{ij} = \sum_k A_{ik} B_{kj}\)

\[ \begin{bmatrix} \color{purple}{C_{11}} & \color{blue}{C_{12}} \\ \color{red}{C_{21}} & \color{gray}{C_{22}} \end{bmatrix} = \begin{bmatrix} \color{purple}{c_{11}} & \color{purple}{c_{12}} & \color{blue}{c_{13}} & \color{blue}{c_{14}} \\ \color{purple}{c_{21}} & \color{purple}{c_{22}} & \color{blue}{c_{23}} & \color{blue}{c_{24}} \\ \color{red}{c_{31}} & \color{red}{c_{32}} & \color{gray}{c_{33}} & \color{gray}{c_{34}} \\ \color{red}{c_{41}} & \color{red}{c_{42}} & \color{gray}{c_{43}} & \color{gray}{c_{44}} \end{bmatrix} \]

Q: Why is this a useful idea?

Blocking and cache

  • For spatial locality, want unit stride access
  • For temporal locality, want lots of re-use
    • Requires re-use in computation and good access pattern!
    • Measure re-use via arithmetic intensity: flops/memory ops
  • Idea: use small blocks that fit in cache
    • Simple model: \(b\)-by-\(b\) blocks, explicit cache control
    • Memory transfers to multiply blocks: \(O(b^2)\)
    • Arithmetic costs to multiply blocks: \(O(b^3)\)
  • Block recursively for better use at multiple cache levels
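A simple one-level cache-blocking sketch (the block size `b` is a tuning parameter; a well-tuned BLAS does this far more carefully, with recursion, packing, and vectorization):

```julia
function matmul_blocked(A, B; b=64)
    m, p = size(A)
    _, n = size(B)
    C = zeros(m, n)
    for jj = 1:b:n, kk = 1:b:p, ii = 1:b:m
        # Multiply one pair of (at most) b-by-b blocks;
        # chosen small enough that all three blocks fit in cache
        for j = jj:min(jj+b-1, n), k = kk:min(kk+b-1, p), i = ii:min(ii+b-1, m)
            C[i,j] += A[i,k]*B[k,j]
        end
    end
    C
end

A, B = randn(200, 150), randn(150, 120)
C = matmul_blocked(A, B; b=32)
```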

Blocking and linear algebra

For \(\mathcal{A} \in L(\mathcal{V}_1 \oplus \mathcal{V}_2, \mathcal{W}_1 \oplus \mathcal{W}_2)\): \[ w = \mathcal{A} v \equiv \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} = \begin{bmatrix} \mathcal{A}_{11} & \mathcal{A}_{12} \\ \mathcal{A}_{21} & \mathcal{A}_{22} \end{bmatrix} \begin{bmatrix} v_1 \\ v_2 \end{bmatrix} \] where \[ v_i = \Pi_{\mathcal{V}_i} v, \quad w_j = \Pi_{\mathcal{W}_j} w, \quad \mathcal{A}_{ij} = \Pi_{\mathcal{W}_i} \mathcal{A} |_{\mathcal{V}_j} \] Given bases \(V_1\), \(V_2\), and \(V = \begin{bmatrix} V_1 & V_2 \end{bmatrix}\) for \(\mathcal{V}_1\), \(\mathcal{V}_2\) and \(\mathcal{V}= \mathcal{V}_1 \oplus \mathcal{V}_2\) (and similarly for \(\mathcal{W}\)), get matrix representation with same block structure as above.

The difference

Performance the lazy way

How LAPACK does it:

  • Organize around block matrices (subspace sums)
  • Do as much as possible with level 3 BLAS calls
    • Matrix-matrix multiply, rank \(k\) update, etc
  • Use someone else’s well-tuned BLAS library