CS 5220

Applications of Parallel Computers

Load balancing

Prof David Bindel

Inefficiencies in parallel code

Poor single processor performance

Typically in the memory system
Saw this in matrix multiply assignment

Inefficiencies in parallel code

Overhead for parallelism

Thread creation, synchronization, communication
Saw this in shallow water assignment

Inefficiencies in parallel code

Load imbalance

Different amounts of work across processors
Different speeds / available resources
Insufficient parallel work
All this can change over phases

Where does the time go?

Load imbalance looks like large sync cost
... maybe so does ordinary sync overhead!
And spin-locks may make sync look like useful work
And ordinary time sharing can confuse things more
Can get some help from profiling tools

Many independent tasks

Simplest strategy: partition by task index
- What if task costs are inhomogeneous?
- Worse: all expensive tasks on one thread?
Potential fixes
- Many small tasks, randomly assigned
- Dynamic task assignment
Issue: what about scheduling overhead?

Scheduling is easiest when we have independent tasks. So let’s consider this case first. A natural approach is to partition statically, say by index, giving each processor one pth of the original tasks. But… if task costs are inhomogeneous, we could end up with a bad situation where we lump all the expensive tasks on one thread.

If there are lots of small tasks, we could randomly assign them in the hopes that things will balance out. If you remember your law of large numbers, though, you’ll realize that for a fixed distribution of task costs, the load imbalance relative to the total time cost decays like one over the square root of the number of tasks. So for this to make sense, either you want fairly homogeneous task costs or you want a large number of tasks.

We could also dynamically assign tasks in a way that evens out the load across processors. But if we’re going to do that type of dynamic task assignment, what about the overhead of scheduling?
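To make the comparison concrete, here is a small sketch of the three assignment strategies just described. The function names and the example cost vector are made up for illustration, and the "dynamic" version simply hands each task to whichever processor is least loaded so far, ignoring the scheduling overhead we just worried about.

```python
import random

def imbalance(loads):
    """Max per-processor load divided by the mean load (1.0 is perfect balance)."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean

def block_partition(costs, p):
    """Static partition by index: processor k gets the k-th contiguous block."""
    n = len(costs)
    loads = [0.0] * p
    for i, c in enumerate(costs):
        loads[i * p // n] += c
    return loads

def random_partition(costs, p, rng):
    """Assign each task to a processor chosen uniformly at random."""
    loads = [0.0] * p
    for c in costs:
        loads[rng.randrange(p)] += c
    return loads

def dynamic_partition(costs, p):
    """Idealized dynamic assignment: next task goes to the least-loaded processor."""
    loads = [0.0] * p
    for c in costs:
        k = loads.index(min(loads))
        loads[k] += c
    return loads

# Hypothetical workload: all the expensive tasks sit at the end of the index range.
costs = [1] * 900 + [10] * 100
print(imbalance(block_partition(costs, 4)))
print(imbalance(random_partition(costs, 4, random.Random(0))))
print(imbalance(dynamic_partition(costs, 4)))
```

With this workload the block partition lumps every expensive task on the last processor, while the other two strategies spread them out.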

Variations on a theme

How to avoid overhead? Chunks!
(Think OpenMP loops)

Small chunks: good balance, large overhead
Large chunks: poor balance, low overhead

Variations on a theme

Fixed chunk size (requires good cost estimates)
Guided self-scheduling (take \(\lceil (\mbox{tasks left})/p \rceil\) work)
Tapering (size chunks based on variance)
Weighted factoring (GSS with heterogeneity)

So how do we resolve the tension between large chunks good for overhead and small chunks good for load balancing?

The simplest approach is to statically partition the work into fixed size chunks, but this requires good cost estimates a priori. The other extreme, sometimes called self-scheduling, involves querying a work queue for every new task; this is good for load balance under uncertain costs, but the overhead is pretty high.

An alternate approach that has some of the advantages of self-scheduling without quite so much scheduling overhead is called guided self scheduling (GSS). In GSS, the scheduler decreases the chunk size over time: each time a processor requests work, it gets one pth of the remaining work (or one work item, whichever is larger).
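As a quick illustration of the GSS rule, the following sketch lists the chunk sizes handed out in order. The function name is my own, and it assumes processors request work one at a time.

```python
def gss_chunks(n_tasks, p):
    """Chunk sizes handed out by guided self-scheduling, in request order."""
    remaining = n_tasks
    chunks = []
    while remaining > 0:
        # ceil(remaining / p) in integer arithmetic, never less than one task
        size = max(1, (remaining + p - 1) // p)
        chunks.append(size)
        remaining -= size
    return chunks

print(gss_chunks(100, 4))
# [25, 19, 14, 11, 8, 6, 5, 3, 3, 2, 1, 1, 1, 1]
```

The early chunks are big (low overhead), and the late chunks are small (room to even things out at the end).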

Tapering is based on GSS, but it takes the mean and standard deviation of the work costs into account. If the standard deviation is zero, tapering is just GSS. Otherwise, tapering takes somewhat smaller chunks than GSS according to a somewhat complicated formula.

Weighted factoring is a little like GSS, but it can take into account heterogeneity in processor speeds as well.

These are not the only scheduling protocols out there! This stuff has been studied since the 1980s, and there are still new ideas proposed every year. But our goal is not to consider only independent tasks, so let’s move on.

Static dependency

Graph \(G = (V,E)\) with vertex and edge weights
Goal: even partition, small cut (comm volume)
Optimal partitioning is NP complete – use heuristics
Tradeoff quality vs speed
Good software exists (e.g. METIS)

The limits of graph partitioning

What if

We don’t know task costs?
We don’t know the comm/dependency pattern?
These things change over time?

May want dynamic load balancing?

Even in regular case: not every problem looks like an undirected graph!

Dependency graphs

So far: Graphs for dependencies between unknowns.

For dependency between tasks or computations:

Arrow from \(A\) to \(B\) means that \(B\) depends on \(A\)
Result is a directed acyclic graph (DAG)

Longest Common Subsequence

Goal: Longest sequence of (not necessarily contiguous) characters common to strings \(S\) and \(T\).

Recursive formulation: \[\begin{aligned} & \mathrm{LCS}[i,j] = \\ & \begin{cases} \max(\mathrm{LCS}[i-1,j], \mathrm{LCS}[i,j-1]), & S[i] \neq T[j] \\ 1 + \mathrm{LCS}[i-1,j-1], & S[i] = T[j] \end{cases} \end{aligned}\] Dynamic programming: Form a table of \(\mathrm{LCS}[i,j]\)

The longest common subsequence problem is one of those classic CS problems that comes up in algorithms classes where people talk about dynamic programming (though it also comes up in some other situations, like in genomics studies). The goal is just to find the longest subsequence of characters that two strings S and T have in common.

We can write down the length of the longest common subsequence via a recursion. Let LCS[i,j] represent the length of the longest common subsequence of characters 1 through i in string S and 1 through j in string T. When i or j is zero, the LCS is zero. That’s our base case. Otherwise, the last characters in the prefixes of S and T could be the same, or they could be different. If they’re the same, the LCS is going to be one longer than the LCS where we leave off the ith character of S and the jth of T. Otherwise, we take the max of the LCS where we either leave the ith character off S or the jth off T.

It’s OK to stare at this if you need a moment.

The usual dynamic programming approach to solving this problem involves computing a table of LCS[i,j] for every i and j in the range from zero to the string lengths.
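For reference, a minimal serial version of the table computation might look like the sketch below. Python strings are zero-indexed, so S[i-1] here plays the role of the ith character in the recurrence; the function name is my own.

```python
def lcs_length(S, T):
    """Length of the longest common subsequence of S and T (bottom-up table)."""
    m, n = len(S), len(T)
    # table[i][j] = LCS length for the first i chars of S and first j chars of T.
    # Row 0 and column 0 hold the base case (zero).
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if S[i - 1] == T[j - 1]:
                table[i][j] = 1 + table[i - 1][j - 1]
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[m][n]

print(lcs_length("ABCBDAB", "BDCABA"))  # 4
```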

Dependency graphs

Process in any order consistent with dependencies.
Limits to available parallel work early on or late!

Here’s a plot of the dependencies in the longest common subsequence recurrence. Each entry depends on the entries below it, to the left of it, and on the diagonal to the left and below. We can process the entries in this graph in any order consistent with the dependencies, and there are various ways to do this.

The coloring denotes one order, a sweep starting at the lower left corner (position 1,1) and moving to the top right corner. If we tilt our heads so that the diagonals of constant color run side-to-side, we might notice that the pattern of arrows is very similar to the one that we’ve seen in wave problems, and this suggests that we can think about what’s happening in similar ways (e.g. we can think of the analog of “batching steps”, doing some redundant computation in order to reduce communication between neighboring processors).

But one of the things that is different between this problem and our wave problems is that the diagonals are not all the same size! So there isn’t much work (or parallelism) available in the early phases, nor at the very end.
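To see how the available parallelism grows and shrinks, we can group the table entries by anti-diagonal. This is just a sketch with a made-up function name: all entries with the same i + j are independent of each other, so each group could be processed in parallel.

```python
def wavefronts(m, n):
    """Group entries (i, j), 1 <= i <= m, 1 <= j <= n, by anti-diagonal i + j."""
    fronts = []
    for d in range(2, m + n + 1):
        lo = max(1, d - n)   # smallest i with j = d - i still <= n
        hi = min(m, d - 1)   # largest i with j = d - i still >= 1
        fronts.append([(i, d - i) for i in range(lo, hi + 1)])
    return fronts

# Number of independent entries per sweep step for a 3-by-5 table
print([len(front) for front in wavefronts(3, 5)])
# [1, 2, 3, 3, 3, 2, 1]
```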

Dependency graphs

Partition into coarser-grain tasks for locality?

Dependency graphs

Dependence between coarse tasks limits parallelism.

Alternate perspective

Two approaches to LCS:

Solve subproblems from bottom up
Solve top down, memoize common subproblems

Parallel question: shared memoization (and synchronize) or independent memoization (and redundant computation)?
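For comparison with the bottom-up table, here is a sketch of the top-down approach in serial form, using the standard library cache as the memo table. In a parallel setting, the cache is exactly the thing we would have to decide whether to share (with synchronization) or duplicate (with redundant work); this sketch does not attempt either.

```python
from functools import lru_cache

def lcs_top_down(S, T):
    """LCS length via the recurrence, solved top down with memoization."""
    @lru_cache(maxsize=None)
    def lcs(i, j):
        if i == 0 or j == 0:
            return 0                      # base case: an empty prefix
        if S[i - 1] == T[j - 1]:
            return 1 + lcs(i - 1, j - 1)
        return max(lcs(i - 1, j), lcs(i, j - 1))
    return lcs(len(S), len(T))

print(lcs_top_down("ABCBDAB", "BDCABA"))  # 4
```

Note that the recursion depth grows with the string lengths, so this form is only suitable for short strings unless it is rewritten with an explicit stack.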

Load balancing and task-based parallelism

Task DAG captures data dependencies
May be known at outset or dynamically generated
Topological sort reveals parallelism opportunities

Going back to the general picture: a task graph is a directed acyclic graph that captures the data dependencies between different tasks in our computation. The task graph might be known from the start, as in our dynamic programming example, or it might be something that we generate on the fly. Either way, any DAG admits a “topological sort” of the nodes: that is, we can always come up with a linear ordering of the tasks so that all dependencies between tasks are satisfied. In fact, there are many such orderings – if there weren’t, we would have no room for parallelism! A variant of one of the earliest topological sort algorithms (Kahn’s algorithm – no relation to the Star Trek villain!) decomposes a task graph into layers as shown here, where all the tasks in each layer are independent, and can be computed when the tasks in each previous layer are done. Hence, topological sort – or at least this layered variant of topological sort – shows us the opportunities for parallelism that exist in the computation.
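A layered variant of Kahn's algorithm can be sketched as follows. The input format (a dict mapping each task to the set of tasks it depends on, with every task present as a key) is an assumption made for this example.

```python
def topological_layers(deps):
    """Split a task DAG into layers of mutually independent tasks.

    deps maps each task to the set of tasks it depends on.
    """
    unmet = {t: len(d) for t, d in deps.items()}   # unmet dependency counts
    dependents = {t: [] for t in deps}             # reverse edges
    for t, d in deps.items():
        for u in d:
            dependents[u].append(t)

    layer = sorted(t for t, k in unmet.items() if k == 0)
    layers = []
    while layer:
        layers.append(layer)
        ready = []
        for u in layer:
            for t in dependents[u]:
                unmet[t] -= 1
                if unmet[t] == 0:      # all of t's inputs are now available
                    ready.append(t)
        layer = sorted(ready)
    if sum(len(l) for l in layers) != len(deps):
        raise ValueError("dependency graph has a cycle")
    return layers

deps = {"a": set(), "b": set(), "c": {"a"}, "d": {"a", "b"}, "e": {"c", "d"}}
print(topological_layers(deps))
# [['a', 'b'], ['c', 'd'], ['e']]
```

The width of each layer bounds how many processors can be kept busy at that stage, and the number of layers is the length of the critical path.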

Basic parameters

Task costs
- Do all tasks have equal costs?
- Known statically, at creation, at completion?
Task dependencies
- Can tasks be run in any order?
- If not, when are dependencies known?
Locality
- Tasks co-located to reduce communication?
- When is this information known?

We now have a handle on the main parameters we need to consider when thinking about dynamic load balancing and parallelism. Somehow, we have to start with decomposing our problem into tasks, and look for parallelism between those tasks. In the easiest case, the tasks have equal costs, known statically; but there are certainly problems where we don’t know how much it will cost to execute a task until we start working on it (or even until we finish working on it!). We also may have dependencies between tasks that keep us from executing them in arbitrary order and with arbitrary parallelism; the fewer the dependencies and the earlier we know them, the easier the task of scheduling for parallelism. Finally, we always look for ways to keep locality of reference, and in a task-based problem decomposition, that often means co-locating the execution of tasks that depend on each other as much as possible.

Task costs

Easy: equal unit cost tasks (branch-free loops)

Harder: different, known times (sparse MVM)

Hardest: costs unknown until completed (search)

Breaking it down a bit more: in terms of task costs, the easiest case is lots of tasks that cost the same amount.

An example of a harder case might be partitioning the row-times-vector products in a sparse matrix-vector product. The cost of handling each row is proportional to the number of nonzeros in that row; this varies from row to row, but we know the counts in advance. If we want to partition the matrix into sets of rows so that each processor is doing the same amount of work in a matrix-vector product, we have to take this heterogeneity into account. But it’s something that we know at the start, and it doesn’t change over time.
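As a sketch of what "taking the heterogeneity into account" could mean, the following heuristic cuts the rows into contiguous blocks at the points where the running nonzero count reaches each multiple of total/p. The function name and the example counts are invented, and this simple rule is not guaranteed to give the best possible contiguous partition.

```python
from bisect import bisect_left
from itertools import accumulate

def partition_rows(nnz_per_row, p):
    """Boundaries b so that block k is rows b[k] .. b[k+1]-1, balancing nonzeros."""
    n = len(nnz_per_row)
    prefix = list(accumulate(nnz_per_row))   # prefix[i] = nonzeros in rows 0..i
    total = prefix[-1] if prefix else 0
    bounds = [0]
    for k in range(1, p):
        target = total * k / p
        # cut just after the first row where the running count reaches the target
        cut = bisect_left(prefix, target) + 1
        bounds.append(min(n, max(cut, bounds[-1])))
    bounds.append(n)
    return bounds

nnz = [4, 4, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2]
b = partition_rows(nnz, 3)
print(b)                                              # [0, 2, 10, 14]
print([sum(nnz[b[k]:b[k + 1]]) for k in range(3)])    # [8, 8, 8]
```

Splitting the same 14 rows into equal row counts would give blocks with quite different numbers of nonzeros.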

The hardest case, which is common in search, is when we don’t know how much time it will take to complete a given task until the task is actually done!

Dependencies

Easy: dependency-free loop (Jacobi sweep)

Harder: tasks have predictable structure (some DAG)

Hardest: structure is dynamic (search, sparse LU)

Locality/communication

When do you communicate?

Easy: Only at start/end (embarrassingly parallel)
Harder: In a predictable pattern (PDE solver)
Hardest: Unpredictable (discrete event simulation)

A spectrum of solutions

Depending on cost, dependency, locality:

Static scheduling
Semi-static scheduling
Dynamic scheduling

Static scheduling

Everything known in advance
Can schedule offline (e.g. graph partitioning)
Example: Shallow water solver

When we know the whole shape of the computation in advance, we can plan things out in advance and just execute our plan, with no need for communication to update the plans as we go. Often this involves some form of graph partitioning. An example of a completely static schedule is what we did with the shallow water solver (or at least our version of the solver). We know exactly what depends on what, and we can decide in advance how many steps we should take in a batch.

Of course, you might recall that the stable step size depends on the water height, so there is some room to do something dynamic here, advancing some parts of the domain with longer time steps and other parts with shorter steps. Load balancing gets trickier if we want to do something like that, since we would get a bad load imbalance if the regions with long time steps were the same size as the regions requiring short time steps.

Semi-static scheduling

Everything known at start of step (for example)
Use offline ideas (e.g. Kernighan-Lin refinement)
Example: Particle-based methods

In other problems, dependencies or task costs might change over time, but slowly enough that we can create a static plan that is useful over several steps of the algorithm. We might then re-compute the plan from scratch, or we might do something to refine the plan in the face of changes, like applying a few sweeps of Kernighan-Lin to update a partition after changing some edges around. An example where this comes in handy is in particle simulations where particles interact with all other particles in some local neighborhood. As the particles move around, who they interact with slowly changes. But these changes are slow enough relative to the time step size that we can re-use the same (slightly conservative) interaction graph for scheduling interaction computations over several consecutive time steps.

Dynamic scheduling

Don’t know what we’re doing until we’ve started
Have to use online algorithms
Example: most search problems

Search problems

Different set of strategies from physics sims!
Usually require dynamic load balance
Example:
- Optimal VLSI layout
- Robot motion planning
- Game playing
- Speech processing
- Reconstructing phylogeny
- ...

Example: Tree search

Tree unfolds dynamically during search
Common problems on different paths (graph)?
Graph may or may not be explicit in advance

Search algorithms

Generic search:

Put root in stack/queue
while stack/queue has work
- remove node \(n\) from queue
- if \(n\) satisfies goal, return
- mark \(n\) as searched
- queue viable unsearched children
  (Can branch-and-bound)

DFS (stack), BFS (queue), A\(^*\) (priority queue), ...
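The generic loop might be sketched in serial Python as below. The callback names (children, is_goal) are placeholders of my own, and this version marks nodes when they are queued rather than when they are removed, which is a small variation that avoids queueing the same node twice. A priority queue would give the A* flavor, which is not shown.

```python
from collections import deque

def generic_search(root, children, is_goal, depth_first=False):
    """Return the first goal node found, or None if the search space is exhausted."""
    frontier = deque([root])
    searched = {root}
    while frontier:
        # Stack discipline gives DFS; queue discipline gives BFS.
        n = frontier.pop() if depth_first else frontier.popleft()
        if is_goal(n):
            return n
        for c in children(n):
            if c not in searched:
                searched.add(c)
                frontier.append(c)
    return None

# Hypothetical search space: an implicit binary tree on the integers 1..15
kids = lambda n: [c for c in (2 * n, 2 * n + 1) if c < 16]
print(generic_search(1, kids, lambda n: n == 6))                    # 6
print(generic_search(1, kids, lambda n: n == 6, depth_first=True))  # 6
```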

Simple parallel search

Static load balancing:

Each new task on a proc until all have a subtree
Ineffective without work estimates for subtrees!
How can we do better?

Centralized scheduling

Idea: obvious parallelization of standard search

Locks on shared data structure (stack, queue, etc)
Or might be a manager task

Centralized scheduling

Teaser: What could go wrong with this parallel BFS?

Queue root and fork
- obtain queue lock
- while queue has work
  - remove node \(n\) from queue
  - release queue lock
  - process \(n\), mark as searched
  - obtain queue lock
  - enqueue unsearched children
- release queue lock
join

Centralized scheduling

Put root in queue; workers active = 0; fork
- obtain queue lock
- while queue has work or workers active > 0
  - remove node \(n\) from queue; workers active ++
  - release queue lock
  - process \(n\), mark as searched
  - obtain queue lock
  - enqueue unsearched children; workers active --
- release queue lock
join
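Here is one way the active-worker idea could look with Python threads. This is a sketch under several assumptions: the callbacks (children, process) are placeholders, a condition variable stands in for the queue lock so that idle workers sleep instead of spinning, and process is assumed not to raise (an exception would leave the active count stuck). It exhausts the search space rather than stopping at a goal.

```python
import threading
from collections import deque

def parallel_search(root, children, process, n_workers=4):
    """Process every node reachable from root once; results in completion order."""
    queue = deque([root])
    seen = {root}                  # nodes already queued
    results = []
    active = 0                     # workers currently processing a node
    cond = threading.Condition()   # plays the role of the queue lock

    def worker():
        nonlocal active
        while True:
            with cond:
                # An empty queue is not the end if an active worker may add children.
                while not queue and active > 0:
                    cond.wait()
                if not queue:
                    return         # empty queue and nobody active: finished
                n = queue.popleft()
                active += 1
            value = process(n)     # the real work happens outside the lock
            kids = list(children(n))
            with cond:
                results.append(value)
                for c in kids:
                    if c not in seen:
                        seen.add(c)
                        queue.append(c)
                active -= 1
                cond.notify_all()  # wake waiters: new work, or time to quit

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Without the active count, a worker that sees an empty queue would quit even though a busy peer is about to enqueue more children.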

Centralized task queue

Called self-scheduling when applied to loops
- Tasks might be range of loop indices
- Assume independent iterations
- Loop body has unpredictable time (or do it statically)
Pro: dynamic, online scheduling
Con: centralized, so doesn’t scale
Con: high overhead if tasks are small

Beyond centralized task queue

Basic distributed task queue idea:

Each processor works on part of a tree
When done, get work from a peer
Or if busy, push work to a peer
Asynch communication useful

Also goes by work stealing, work crews...

Picking a donor

Could use:

Asynchronous round-robin
Global round-robin (current donor ptr at P0)
Randomized – optimal with high probability!

Let’s consider the work-stealing variant. My queue is empty, and I want to take work from a donor. How do I decide who to go to?

One approach would be for me to ask each of the processors in turn. The problem with this is that if I use the same ordering as everyone else, we’ll all end up swamping the first few nodes with requests for work (and probably running them completely dry so that they then have to beg). This is not great for spreading around the wealth.

A second approach, much more equitable in how it steals work, would involve a global round-robin ordering. But to do a global ordering of work stealing, we would need to keep a synchronized pointer to the next in the list, maybe at processor 0. That pointer is now a source of contention and communication overhead.

It turns out that a somewhat stupid-sounding strategy is nearly as good as global round robin, but involves no communication. This strategy is to choose a donor at random.
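The random strategy needs nothing more than a local random number generator. A minimal sketch (the function name is my own) that picks uniformly among the other processors:

```python
import random

def pick_donor(me, p, rng=random):
    """Pick a donor rank uniformly at random from the p - 1 ranks other than me."""
    d = rng.randrange(p - 1)
    # Skip over our own rank so that we never try to steal from ourselves.
    return d if d < me else d + 1

rng = random.Random(0)
requests = [0] * 8
for _ in range(7000):
    requests[pick_donor(3, 8, rng)] += 1
print(requests)   # roughly equal counts everywhere except rank 3, which gets none
```

No shared pointer is needed, so there is no contention in choosing whom to ask.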

Diffusion-based balancing

Problem with random polling: communication cost!
- But not all connections are equal
- Idea: prefer to poll more local neighbors
Average out load with neighbors \(\implies\) diffusion!
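To illustrate the diffusion analogy, here is a toy sketch on a ring of processors, where each processor repeatedly moves a fraction alpha of the load difference to or from each of its two neighbors. The ring topology, the value of alpha, and the treatment of load as a continuous quantity are all simplifying assumptions; it expects at least three processors.

```python
def diffuse(loads, alpha=0.25, steps=1):
    """Apply diffusion steps to loads on a ring; returns the new load vector."""
    p = len(loads)
    loads = list(loads)
    for _ in range(steps):
        # Each processor exchanges alpha * (difference) with its left and right neighbor.
        loads = [
            loads[i] + alpha * (loads[i - 1] - 2 * loads[i] + loads[(i + 1) % p])
            for i in range(p)
        ]
    return loads

print(diffuse([16, 0, 0, 0], steps=1))   # [8.0, 4.0, 0.0, 4.0]
print(diffuse([16, 0, 0, 0], steps=2))   # [6.0, 4.0, 2.0, 4.0]
```

The total load is conserved at every step, and the loads relax toward the average using only neighbor-to-neighbor communication.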

Mixed parallelism

Today: mostly coarse-grain task parallelism
Other times: fine-grain data parallelism
Why not do both? Switched parallelism.

Takeaway

Lots of ideas, not one size fits all!
Axes: task size, task dependence, communication
Dynamic tree search is a particularly hard case!
Fundamental tradeoffs
- Overdecompose (load balance) vs
  keep tasks big (overhead, locality)
- Steal work globally (balance) vs
  steal from neighbors (comm. overhead)
Sometimes hard to know when code should stop!

Wrapping up for today: there is no uber-algorithm for finding parallelism and balancing load across processors. While it’s often useful to think about these problems in terms of tasks and their interdependencies, the exact nature of the task costs and dependencies (and when we find out about those costs and dependencies) has a huge impact on what’s most appropriate. The right solution in any given situation represents a particular answer to how we should make some fundamental tradeoffs. For example, should we overdecompose a project, getting lots of small tasks that are easier to spread across processors in a balanced way? Or should we keep the tasks big in order to amortize scheduling overhead and potentially improve locality of reference? And if we’re in distributed memory environments, should we do work stealing with donors chosen uniformly at random, which is better for evening out load quickly in very imbalanced settings? Or should we use a diffusion scheme that favors nearby donors, which may not spread out load as quickly but will tend to do less expensive communication when we stay around an equilibrium?

As a final theme that has come up a few times: as cool as some of the more dynamic load balancing mechanisms are, they often make it really hard to decide when a program should be done.

Fortunately, I have no such difficulty with deciding when to end this slide deck! Until next time…