Fundamental ideas

There are some points here that seem to be fundamental:

CPUs evolved to be good at managing a few complex instruction streams. They predict and speculate, pipeline, reorder, and do all sorts of tricks to squeeze parallelism out of those streams. GPUs evolved to use lots of much simpler cores for problems with simple enough structure (and enough easy-to-see parallelism) so that the usual CPU trickery was superfluous. Graphics has this flavor, and so do some kernel operations in scientific computing (and in machine learning, which has a lot in common with scientific computing). Hence, GPUs can be quite effective for these workloads, but they are highly sensitive to details that we might be able to ignore on a CPU: issues with memory alignment, branching, and lack of available parallelism can all have a huge impact on GPU performance.
The basic picture of performance tuning for a GPU is the same as for a CPU: understand where the code has bottlenecks (and why), arrange the data structures and access patterns to be performance friendly, and expose opportunities for the compiler to parallelize to the greatest extent possible.

A lot of the rest is window dressing.
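
As a small illustration of what "performance friendly" data structures can mean, the sketch below contrasts an array-of-structures layout with a structure-of-arrays layout for the same update. The type and function names (particle_aos_t, particles_soa_t, shift_x_aos, shift_x_soa) are hypothetical and chosen only for this example. Which layout wins on a given machine is something to measure, but unit-stride access is usually the friendlier pattern for both vector units and GPUs.

```c
#include <stddef.h>

/* Array-of-structures: the fields of one particle are adjacent in memory,
 * so a loop that touches only x strides through memory (4 doubles apart). */
typedef struct {
    double x, y, z, mass;
} particle_aos_t;

/* Structure-of-arrays: each field is stored contiguously, so a loop that
 * touches only x reads and writes consecutive memory locations. */
typedef struct {
    double *x, *y, *z, *mass;
} particles_soa_t;

/* Shift every x coordinate by dx using the strided AoS layout. */
void shift_x_aos(particle_aos_t *p, size_t n, double dx)
{
    for (size_t i = 0; i < n; ++i)
        p[i].x += dx;
}

/* The same update using the unit-stride SoA layout. */
void shift_x_soa(particles_soa_t *p, size_t n, double dx)
{
    double *x = p->x;
    for (size_t i = 0; i < n; ++i)
        x[i] += dx;
}
```

Both routines compute the same result; only the memory access pattern differs.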

Thinking about choosing a system

Accelerator programming is complicated, and there is no shortage of attempts to address that complexity. Before talking about the different programming models, it is worth spending a moment on the factors that might go into choosing between them. For me, the big factors are effectiveness, community, lack of lock-in, longevity, and maturity.

Effectiveness: This is deceptive, as the difficulty of writing “hello world” may be very different from the difficulty of writing something that gets good performance. Nonetheless, we would ideally like codes that hide accidental complexity away from us, rather than rubbing our noses in it.
Community: Face it, you’re going to get stuck. Having a lab-mate who knows what is what helps immensely. Alternately, it’s nice to be able to ask on StackOverflow and get a response.
Lack of lock-in: If I have to write my code around a given framework, that makes me nervous. I like narrow interfaces. It means that if I want to change to something completely different later, I have that much less to rewrite.
Longevity: Some scientific codes live a very short life (essentially the length of time it takes a PhD student to get the results she needs and graduate). Others live a very long life. If a code is going to live forever, it makes sense to be conservative about the libraries and programming languages on which it depends. But a quick-and-dirty code? Perhaps there it doesn’t matter as much. Of course, if we are going to write codes to actually do science that others can replicate, it still makes sense to choose tools that will stick around a while! Assuming I would like my code to last, I would also like it to build on systems I believe will last. A library or language that has been around for a while is more likely to be around for a while longer. In other words, Fortran will never die, nor will C++; but I have no such confidence in Rust! Company support can also help indicate longevity, but not always.
Maturity: Bluntly put, I make enough mistakes of my own without finding the mistakes that still have not been debugged in some support library on which I depend. The more the code has been tested in the field by others, the more confidence I have that I can use it safely.

Mixed paradigm programming

Mixing MPI, OpenMP, and CUDA (for instance) is not conceptually difficult, and may be quite sensible. There are just a few things to remember:

The different systems should not be assumed to interact neatly. For example, while CUDA calls are thread-safe, MPI calls need not be. And even for CUDA calls, there is context (e.g. the current CUDA device) that needs to be thought through carefully.
Things are likely to be simplest if different types of parallelism are compiled under different modules (program files). This makes it easy to determine which compiler front-end should be invoked.
Support libraries are needed for MPI, for OpenMP, and for CUDA. These are automatically included when linking with mpicc or nvcc, or when passing -fopenmp as a flag to gcc or clang (or -qopenmp for icc). But they need to be brought in explicitly when code built with more than one of these systems is linked together. Names vary from platform to platform; tools like autoconf or CMake can help with the necessary configuration.
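
One way to keep the different kinds of parallelism in separate modules is to have each module export a small plain-C interface. The sketch below is hypothetical: the function accel_axpy and the file names in the comments are made up for illustration, and a serial loop stands in for what would be a kernel launch in a file compiled by nvcc.

```c
#include <stddef.h>

/* --- accel.h (hypothetical): all that the rest of the code sees --- */
/* Compute y = a*x + y for vectors of length n; returns 0 on success. */
int accel_axpy(size_t n, double a, const double *x, double *y);

/* --- accel_cpu.c (hypothetical): plain-C stand-in implementation ---
 * In a mixed build, a file such as accel_cuda.cu (compiled by nvcc and
 * exporting the same symbol with extern "C") could replace this one
 * without any change to the callers. */
int accel_axpy(size_t n, double a, const double *x, double *y)
{
    if (n > 0 && (x == NULL || y == NULL))
        return -1;              /* reject obviously invalid arguments */
    for (size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
    return 0;
}
```

Callers compiled by mpicc or by an ordinary host compiler need only the prototype, so the choice of compiler front-end stays a per-file decision.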

The accelerator landscape

This is almost surely an incomplete survey, but it seems reasonable at the time I am writing it.

CUDA and OpenCL

CUDA: NVidia only – the most popular kid on the block
OpenCL: Developed shortly after CUDA (by Apple), widely supported

In the beginning (or at least in 2006 or 2007), there was CUDA. Not long after that, others started to get on board. Because CUDA was aggressively proprietary, Apple developed OpenCL, which it then submitted to the Khronos Group as an open standard for device programming. The first public release of the spec was in Q4 of 2008. Perhaps because it has been around for a long time (and it is open), there are probably more devices programmable by OpenCL than by any other single mechanism. However, OpenCL also seems universally regarded as a bit clunky – it takes a lot of typing to get to “Hello World” in OpenCL.

Making CUDA and OpenCL play nice

CU2CL: Translate CUDA to OpenCL
Ocelot: Allows recompilation of CUDA to NVidia/AMD GPUs and to CPUs. Academic project, no longer maintained.

Because programming in CUDA is more popular than programming in OpenCL, despite the larger number of devices that support the latter, there have been a few efforts to translate CUDA to OpenCL. The CU2CL project seems to be ongoing, while Ocelot is no longer maintained now that the students involved have moved on.

AMD gets in the game

HIP: CUDA-like, open, plays with NVidia (NVCC) or AMD (HCC)
HCC: AMD’s answer to nvcc, supports multiple APIs
HSA: Supports a virtual instruction set other things compile to

NVidia is certainly the most popular vendor for GPGPU, but AMD also makes high-end graphics cards. While AMD devices have been programmable via OpenCL for some time, AMD currently supports several other programming models as well via the HCC compiler (a C++ compiler with support for several different parallel programming extensions). AMD also supports HIP, which essentially looks like CUDA after doing a search-and-replace through the code to change all instances of cuda to hip. HIP can compile down either to CUDA or to the HSA Intermediate Language (HSAIL), which is supposed to be a common intermediate language for accelerator code.

Directive-based parallelism

OpenACC: OpenMP-like directive-based programming
OpenMP: Version 4.x supports GPU offload; see Jeff Larkin talk from GPU Tech 2016

OpenMP has been quite successful as a shared memory programming paradigm for multicore, though getting good performance from it is not so simple as it might at first appear. It’s natural enough to want to use the same approach for accelerators, and this is now done in OpenMP (as of version 4.x) and in OpenACC. Neither of these is as widely or as well supported as CUDA and OpenCL.
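
To give a flavor of the directive-based style, here is a minimal sketch of an OpenMP 4.x offload loop. The function name saxpy_offload is hypothetical. Whether the loop actually runs on a device depends on the compiler, its flags, and the runtime; when OpenMP is not enabled, the pragma is ignored and the loop simply runs serially on the host.

```c
#include <stddef.h>

/* y = a*x + y, offloaded to an accelerator if the compiler and runtime
 * support it; otherwise the same loop runs on the host. */
void saxpy_offload(size_t n, float a, const float *x, float *y)
{
    /* map(to: ...) copies x to the device; map(tofrom: ...) copies y to
     * the device before the loop and back to the host afterward. */
    #pragma omp target teams distribute parallel for \
        map(to: x[0:n]) map(tofrom: y[0:n])
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

The OpenACC version of the same loop would look similar, with a directive along the lines of #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n]) in place of the OpenMP one.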

C++ libraries for “performance portable” GPU algorithms

Kokkos: Part of Sandia’s Trilinos, meant to provide performance-portable programming of array-based computations
RAJA: An effort out of LLNL, C++ templates, also targets performance portability
Thrust: Like STL for CUDA programming
Boost.Compute: Like STL for OpenCL programming
VexCL: C++ vector expression template library for OpenCL and CUDA

One of the difficulties with programming at the level of CUDA and OpenMP is that decisions affecting performance often occur at a very low level. Wouldn’t it be nice to have a portable performance layer, so that codes will not need to be significantly rewritten for performance every year or two? This is the goal behind several projects, including Kokkos and RAJA from the national labs and Thrust from NVidia. Boost.Compute seems close in spirit to Thrust, but targeting OpenCL rather than CUDA. VexCL operates at a somewhat lower level.

Python GPU programming

Numba: Bringing CUDA to Python via a JIT
Pyculib: CUDA libraries from Python
PyCUDA: A bit lower level
PyOpenCL: Sister to PyCUDA

If we are going to try to raise the level of abstraction for GPU programming, why not think bigger than raising it up to the C++ level? The Numba project provides JIT compilation for Python functions, and it can target kernels. The older (and somewhat lower level) PyCUDA and PyOpenCL systems actually invoke an ordinary compiler behind the scenes to generate CUDA or OpenCL kernels, rather than using LLVM infrastructure to handle it.

Academic efforts with new languages

Occa: JIT-compiled kernels for multiple back-ends (including CPUs, GPUs, Phis, FPGAs) – David Medina (PhD student who worked on it) and Tim Warburton at Virginia Tech
Simit: Complete special-purpose language out of MIT

Of course, C++ and Python are ordinary languages, though both get extended in providing support for CUDA and OpenCL. What if we were willing to use a different language altogether? The Occa project introduces a very C-like language for writing kernels (not unlike what is done in CUDA/OpenCL); these kernels are then translated to multiple possible backends. The Simit project introduces a whole different language that can be compiled down to either CPU code or GPU code; in either case, it tends to get rather high performance, partly because it is pretty good at expressing opportunities for parallelism.

The Microsoft entry

C++AMP: Microsoft’s C++ Accelerated Massive Parallelism; unclear whether this is still going anywhere at MS, but looks like it is being picked up elsewhere? Another set of language extensions for data parallelism.

The C++AMP system was Microsoft’s entry into the battle for the hearts and minds of accelerator programmers. Like CUDA and OpenCL, C++AMP is essentially an extension of C++ to express parallelism for the purpose of GPU programming. Microsoft appears to have mostly moved on from it, but others have taken it up, sort of. In particular, AMD’s HCC system supports C++AMP as one of the possible parallelization methods.

The compilers

GCC

GCC supports HSAIL, OpenMP 4.5, and OpenACC. OpenCL is really library-based, and so can of course be handled by GCC.

PGI

The PGI compilers support OpenMP 4.5 and OpenACC, as well as things like CUDA Fortran.

NVidia

The NVCC compiler relies on an underlying host C/C++ compiler to do the heavy lifting of compiling ordinary code.

Intel

The Intel compilers support recent OpenMP and OpenACC versions.

LLVM-based compilers

The Clang compiler supports OpenMP (since 3.8.0), but does not completely support OpenMP 4.0 or 4.5. The support is not part of the Apple Xcode version of Clang. As far as I can tell, there is also as yet no support for OpenACC.

At the same time, LLVM has direct support at the back-end for producing PTX code, and there are some hints that someone has written code to produce HSAIL code for AMD devices.

Optimization and analysis tools

Intel’s compilers

The Intel compiler can generate reports that tell how successful it was at vectorizing your code. For example, compiling with the flags

-qopt-report=5 -qopt-report-phase=vec

will give a detailed report (*.optrpt) that says what is and is not successfully vectorized. Note that the report may talk about several different types of dependencies; there is a reasonable Wikipedia article covering the terminology that is used.

Note that there are several references on Intel’s web pages that describe how to annotate functions for vectorization and how to vectorize step-by-step, as well as a more thorough guide to vectorization. Once you address memory bottlenecks, you will probably see a significant speedup for this assignment by following the vectorization guidelines.
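
To make the dependency terminology concrete, the sketch below shows one loop that a compiler is generally free to vectorize and one with a loop-carried dependence that it cannot vectorize as written. The function names add_arrays and prefix_sum are hypothetical; what a particular compiler actually reports for these loops is best checked in its own vectorization report.

```c
#include <stddef.h>

/* Independent iterations. The restrict qualifiers promise the compiler
 * that out, a, and b do not overlap, which removes the assumed
 * dependence that could otherwise block vectorization. */
void add_arrays(size_t n, double *restrict out,
                const double *restrict a, const double *restrict b)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}

/* Loop-carried flow (read-after-write) dependence: iteration i reads the
 * value that iteration i-1 just wrote, so the iterations cannot simply
 * be executed side by side in vector lanes. */
void prefix_sum(size_t n, double *x)
{
    for (size_t i = 1; i < n; ++i)
        x[i] += x[i - 1];
}
```

A report for the first loop would typically mention successful vectorization, while the second is the kind of loop for which a flow dependence is reported.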

VTune Amplifier

We can’t run the VTune GUI from our head node, but we can use the command line interface amplxe-cl. This is available whenever you have the cs5220 module loaded. The reference guide for the command-line tool is hard to find by a Google search, but it is up on the Intel web site. You can also get an overview of the options by running

amplxe-cl -help

Before collecting data with Amplifier, you will want to compile your code with debugging symbols (-g flag). From there, profiling proceeds in two phases: data collection and reporting.

Data collection

The collection phase will run on the compute nodes in the cluster. To run an advanced hotspot diagnostic, for instance, execute (from a PBS script)

amplxe-cl -collect advanced-hotspots ./shallow

It is possible to collect a wide variety of data, including not only hotspots but also counter information, memory bandwidth utilization, etc. The data will be deposited in a subdirectory of the directory where the code was run; for example, the first time the above collection is run, it will create a directory r001ah.

Reporting

Once the data has been collected, reports can be generated on the front-end node. For example, to get an overview of the hotspots, try

amplxe-cl -report hotspots -r r001ah/

To see the time spent in a particular function (compute_forces in this example), try something like

amplxe-cl -report hotspots -source-object function="compute_forces"

Notes for 2017-10-05