Can write a lot of MPI code with 6 operations we’ve seen:
MPI_Init, MPI_Finalize, MPI_Comm_size, MPI_Comm_rank, MPI_Send, MPI_Recv
... but there are sometimes better ways. Decide on communication style using simple performance models.
MPI_Send(buf, count, datatype,
dest, tag, comm);
MPI_Recv(buf, count, datatype,
source, tag, comm, status);
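A minimal sketch of these calls in action, assuming a run with at least two ranks (the value doubling is purely illustrative):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char** argv) {
    int rank, val = 42;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        MPI_Send(&val, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(&val, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Rank 0 got %d back\n", val);
    } else if (rank == 1) {
        MPI_Recv(&val, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        val *= 2;  /* modify before sending back */
        MPI_Send(&val, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}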
MPI_Send and MPI_Recv are blocking
Block until data in system
— maybe in a buffer?
Alternative: don’t copy, block until done.
Deadlock danger: both processors wait for their sends to finish before they can receive! May not happen in practice if there is enough buffering on both sides.
Could alternate who sends and who receives.
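A sketch of the alternation idea, assuming partners have opposite rank parity (as in a 1D neighbor exchange); exchange_with is a made-up helper name:

#include <mpi.h>

/* Avoid deadlock by ordering sends/receives by rank parity. */
void exchange_with(double* sendbuf, double* recvbuf, int partner,
                   int rank, MPI_Comm comm) {
    if (rank % 2 == 0) {   /* even ranks send first, then receive */
        MPI_Send(sendbuf, 1, MPI_DOUBLE, partner, 0, comm);
        MPI_Recv(recvbuf, 1, MPI_DOUBLE, partner, 0, comm, MPI_STATUS_IGNORE);
    } else {               /* odd ranks receive first, then send */
        MPI_Recv(recvbuf, 1, MPI_DOUBLE, partner, 0, comm, MPI_STATUS_IGNORE);
        MPI_Send(sendbuf, 1, MPI_DOUBLE, partner, 0, comm);
    }
}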
Common operations deserve explicit support!
MPI_Sendrecv(sendbuf, sendcount, sendtype,
dest, sendtag,
recvbuf, recvcount, recvtype,
source, recvtag,
comm, status);
Blocking operation, combines send and recv to avoid deadlock.
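For instance, a sketch of a ring shift in which every rank sends to its right neighbor while receiving from its left; MPI_Sendrecv pairs the two operations so no manual ordering is needed:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char** argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int right = (rank+1) % size;       /* destination */
    int left  = (rank+size-1) % size;  /* source */
    int sendval = rank, recvval = -1;
    /* Send right and receive from the left in one call -- no deadlock */
    MPI_Sendrecv(&sendval, 1, MPI_INT, right, 0,
                 &recvval, 1, MPI_INT, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("Rank %d received %d\n", rank, recvval);
    MPI_Finalize();
    return 0;
}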
Partial solution: nonblocking communication
MPI_Send and MPI_Recv are blocking; the nonblocking variants return immediately.
Initiate message:
MPI_Isend(start, count, datatype, dest,
tag, comm, request);
MPI_Irecv(start, count, datatype, source,
tag, comm, request);
Wait for message completion:
MPI_Wait(request, status);
Test for message completion:
MPI_Test(request, status);
Sometimes useful to have multiple outstanding messages:
MPI_Waitall(count, requests, statuses);
MPI_Waitany(count, requests, index, status);
MPI_Waitsome(count, requests, indices, statuses);
Multiple versions of test as well.
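A sketch putting these together for a hypothetical 1D neighbor exchange (a negative neighbor rank means "no neighbor"): post all receives and sends, then complete everything with one MPI_Waitall:

#include <mpi.h>

void exchange(double* sendl, double* sendr, double* recvl, double* recvr,
              int left, int right, MPI_Comm comm) {
    MPI_Request reqs[4];
    int n = 0;
    if (left  >= 0) MPI_Irecv(recvl, 1, MPI_DOUBLE, left,  0, comm, &reqs[n++]);
    if (right >= 0) MPI_Irecv(recvr, 1, MPI_DOUBLE, right, 0, comm, &reqs[n++]);
    if (left  >= 0) MPI_Isend(sendl, 1, MPI_DOUBLE, left,  0, comm, &reqs[n++]);
    if (right >= 0) MPI_Isend(sendr, 1, MPI_DOUBLE, right, 0, comm, &reqs[n++]);
    /* Could overlap local computation here before waiting */
    MPI_Waitall(n, reqs, MPI_STATUSES_IGNORE);
}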
Other variants of MPI_Send
MPI_Ssend (synchronous) – complete after receive begun
MPI_Bsend (buffered) – user provides buffer via MPI_Buffer_attach
MPI_Rsend (ready) – must have receive already posted
Can combine with nonblocking (e.g. MPI_Issend)
MPI_Recv receives anything.
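A sketch of buffered mode (the message size of 100 doubles is illustrative): attach a user buffer so MPI_Bsend can return as soon as the data is copied out:

#include <stdlib.h>
#include <mpi.h>

/* Send 100 doubles in buffered mode; returns once data is copied. */
void bsend_block(const double* msg, int dest, MPI_Comm comm) {
    int bufsize = 100*sizeof(double) + MPI_BSEND_OVERHEAD;
    void* buf = malloc(bufsize);
    MPI_Buffer_attach(buf, bufsize);   /* MPI copies outgoing messages here */
    MPI_Bsend(msg, 100, MPI_DOUBLE, dest, 0, comm);
    MPI_Buffer_detach(&buf, &bufsize); /* blocks until buffered sends drain */
    free(buf);
}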
MPI_Bcast(buffer, count, datatype,
root, comm);
MPI_Reduce(sendbuf, recvbuf, count, datatype,
op, root, comm);
buffer is copied from root to others
recvbuf receives result only at root
op is MPI_MAX, MPI_SUM, etc.

#include <stdio.h>
#include <mpi.h>
int main(int argc, char** argv) {
    int nproc, myid, ntrials;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    if (myid == 0) {
        printf("Trials per CPU:\n");
        scanf("%d", &ntrials);
    }
    MPI_Bcast(&ntrials, 1, MPI_INT,
              0, MPI_COMM_WORLD);
    run_trials(myid, nproc, ntrials);
    MPI_Finalize();
    return 0;
}
Let sums[0] = ∑ᵢ Xᵢ and sums[1] = ∑ᵢ Xᵢ².
#include <math.h>

void run_trials(int myid, int nproc, int ntrials) {
    double sums[2] = {0,0};
    double my_sums[2] = {0,0};
    /* ... run ntrials local experiments, accumulating my_sums ... */
    MPI_Reduce(my_sums, sums, 2, MPI_DOUBLE,
               MPI_SUM, 0, MPI_COMM_WORLD);
    if (myid == 0) {
        int N = nproc*ntrials;
        double EX  = sums[0]/N;
        double EX2 = sums[1]/N;
        printf("Mean: %g; err: %g\n",
               EX, sqrt((EX2 - EX*EX)/N));
    }
}
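Why this error estimate: Var(mean) = Var(X)/N, and Var(X) ≈ E[X²] − (E[X])², so the standard error of the mean is sqrt((EX2 − EX*EX)/N).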
MPI_Barrier(comm);
Not much more to say. Not needed that often.
Many other collectives: reduction to all processors (MPI_Allreduce),
MPI_Reduce_scatter, ...
Summary: Init/Finalize; Comm_rank, Comm_size; Send/Recv variants and Wait; collectives like Allreduce, Allgather, Bcast.