Accelerators
2024-10-08
Different fundamental goals
Graphics (and ML) are high throughput!
Five-stage RISC basically has fetch, decode, execute, memory, and writeback stages
Modern machines have pipelining and much more: superscalar issue, out-of-order execution, branch prediction, deep cache hierarchies
Mostly to find parallelism and reduce latency.
Simplify, simplify, simplify!
Simplify: Fetch/decode
Simplify: Branches
Simplify: Latency tolerance
No caches, but still:
Therefore:
(From PMPP book)
Much is not so dissimilar to a multicore CPU!
CPU
GPU
GPUs are great for throughput-oriented, data-parallel work…
… but you still want a CPU, too!
GPU nodes have: a host CPU and host DRAM, plus one or more GPUs, each with its own global (device) memory and many streaming multiprocessors (SMs)
Each SM has many simple cores, a large register file, shared memory, and schedulers that keep many threads in flight to hide latency
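The exact counts vary from GPU to GPU; you can query them at runtime. A minimal sketch using cudaGetDeviceProperties (the choice of fields to print is mine, not from the slides):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0
    std::printf("Device: %s\n", prop.name);
    std::printf("SMs: %d\n", prop.multiProcessorCount);
    std::printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    std::printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    std::printf("Global memory: %zu bytes\n", prop.totalGlobalMem);
    return 0;
}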
CPU code calls GPU kernels
“Hello world”: vector addition in CUDA
Serial version of add:
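The serial listing from the slide is not reproduced here; a minimal sketch of what it presumably looks like (same semantics as the GPU kernel below, y[i] += x[i]):

#include <vector>

// Serial CPU version: accumulate x into y elementwise
void add(const std::vector<float>& x, std::vector<float>& y)
{
    int n = x.size();
    for (int i = 0; i < n; ++i)
        y[i] += x[i];
}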
CPU code calls GPU kernels
__global__ for a host-callable kernel that runs on the GPU
__device__ or __host__ for ordinary functions compiled for the device or the host
First, allocate memory on GPU:
int size = n * sizeof(float);
float *d_x, *d_y;
cudaMalloc((void**)& d_x, size);
cudaMalloc((void**)& d_y, size);
Signature: cudaMalloc(void** p, size_t size);
p refers to a device (global) memory location (not directly usable from the host like a std::vector, for example)
cudaFree(void* p) to free
Should probably be more careful:
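For example, by checking the return codes of the runtime calls; a minimal sketch (the CUDA_CHECK macro name is mine, not from the slides):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails
#define CUDA_CHECK(call) do { \
    cudaError_t err = (call); \
    if (err != cudaSuccess) { \
        std::fprintf(stderr, "CUDA error %s at %s:%d\n", \
                     cudaGetErrorString(err), __FILE__, __LINE__); \
        std::exit(EXIT_FAILURE); \
    } \
} while (0)

// Usage:
CUDA_CHECK(cudaMalloc((void**)&d_x, size));
CUDA_CHECK(cudaMalloc((void**)&d_y, size));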
// cudaMemcpy(void* dest, void* src, size_t size, int direction);
cudaMemcpy(d_x, x.data(), size, cudaMemcpyHostToDevice);
cudaMemcpy(d_y, y.data(), size, cudaMemcpyHostToDevice);
data() gives a C pointer to the vector storage
Copy y back at the end
No need to copy x back (only y is updated)
void add(const std::vector<float>& x, std::vector<float>& y)
{
int n = x.size();
// Allocate GPU buffers and transfer data in
int size = n * sizeof(float);
float *d_x, *d_y;
cudaMalloc((void**)& d_x, size); cudaMalloc((void**)& d_y, size);
cudaMemcpy(d_x, x.data(), size, cudaMemcpyHostToDevice);
cudaMemcpy(d_y, y.data(), size, cudaMemcpyHostToDevice);
// Call kernel on the GPU (1 block, 1 thread)
gpu_add<<<1,1>>>(n, d_x, d_y);
// Copy data back and free GPU memory
cudaMemcpy(y.data(), d_y, size, cudaMemcpyDeviceToHost);
cudaFree(d_x); cudaFree(d_y);
}
The malloc/free/memcpy thing is not very C++! (a sketch using thrust::device_vector follows after the kernel-launch example below)
__global__
void gpu_add(int n, float* x, float* y)
{
int i = threadIdx.x + blockDim.x * blockIdx.x;
if (i < n)
y[i] += x[i];
}
// Call looks like
gpu_add<<<n_blocks,block_size>>>(n, x, y);
Each block has blockDim.x (= block_size) threads
Need at least n total threads across all the blocks
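Putting the launch together inside add(): the block count is usually rounded up so n_blocks * block_size >= n. A sketch (block_size of 256 is an arbitrary but typical choice, not from the slides):

// Inside add(), replacing the <<<1,1>>> launch above
int block_size = 256;
int n_blocks = (n + block_size - 1) / block_size;  // ceiling division
gpu_add<<<n_blocks, block_size>>>(n, d_x, d_y);
cudaDeviceSynchronize();  // make sure the kernel finishes before copying y back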
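As noted earlier, the explicit malloc/free/memcpy pattern is not very C++. A sketch of the same routine using Thrust containers, assuming Thrust is available (this rewrite is mine, not from the slides):

#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <vector>

// Assumes the gpu_add kernel defined above is visible in this file
void add_thrust(const std::vector<float>& x, std::vector<float>& y)
{
    int n = x.size();
    // Device allocation and host-to-device copy happen in the constructors
    thrust::device_vector<float> d_x(x.begin(), x.end());
    thrust::device_vector<float> d_y(y.begin(), y.end());
    int block_size = 256;
    int n_blocks = (n + block_size - 1) / block_size;
    gpu_add<<<n_blocks, block_size>>>(n,
        thrust::raw_pointer_cast(d_x.data()),
        thrust::raw_pointer_cast(d_y.data()));
    // Device-to-host copy; device memory is freed automatically when d_x, d_y go out of scope
    thrust::copy(d_y.begin(), d_y.end(), y.begin());
}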