Can write a lot of MPI code with 6 operations we’ve seen:
MPI_Init
MPI_Finalize
MPI_Comm_size
MPI_Comm_rank
MPI_Send
MPI_Recv
... but there are sometimes better ways. Decide on communication style using simple performance models.
MPI_Send(buf, count, datatype,
dest, tag, comm);
MPI_Recv(buf, count, datatype,
source, tag, comm, status);
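The six operations in one runnable sketch: a two-rank ping-pong (illustrative; assumes the job is run with at least two processes):

#include <stdio.h>
#include <mpi.h>
int main(int argc, char** argv) {
    int nproc, myid, msg = 42;
    MPI_Status status;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    if (myid == 0) {
        /* Send to rank 1, then wait for the echo */
        MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &status);
        printf("Rank 0 got %d back\n", msg);
    } else if (myid == 1) {
        /* Receive from rank 0, then echo it back */
        MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        MPI_Send(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}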
MPI_Send and MPI_Recv are blocking:
block until the data is "in the system" (maybe in a buffer?).
Alternative: don’t copy, block until done.
If both processors send before receiving, each waits for its send to finish before it can receive: deadlock! (It may not deadlock if there is enough buffering on both sides, but don't count on it.)
Could alternate who sends and who receives, as sketched below.
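A sketch of the alternating pattern; rank, partner, sbuf, rbuf, n, and status are assumed declared (names are illustrative):

/* Unsafe: every rank sends first; deadlocks if sends don't buffer. */
MPI_Send(sbuf, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD);
MPI_Recv(rbuf, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &status);

/* Safe: even ranks send then receive; odd ranks do the reverse. */
if (rank % 2 == 0) {
    MPI_Send(sbuf, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD);
    MPI_Recv(rbuf, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &status);
} else {
    MPI_Recv(rbuf, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &status);
    MPI_Send(sbuf, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD);
}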
Common operations deserve explicit support!
MPI_Sendrecv(sendbuf, sendcount, sendtype,
dest, sendtag,
recvbuf, recvcount, recvtype,
source, recvtag,
comm, status);
Blocking operation, combines send and recv to avoid deadlock.
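For example, a cyclic shift around a ring (each rank sends to rank+1, receives from rank-1; my_val and recv_val are assumed declared, names illustrative):

int next = (myid + 1) % nproc;
int prev = (myid + nproc - 1) % nproc;
MPI_Status status;
/* Send my value around the ring and receive my neighbor's in one call */
MPI_Sendrecv(&my_val, 1, MPI_DOUBLE, next, 0,
             &recv_val, 1, MPI_DOUBLE, prev, 0,
             MPI_COMM_WORLD, &status);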
Partial solution: nonblocking communication
Initiate message:
MPI_Isend(start, count, datatype, dest,
tag, comm, request);
MPI_Irecv(start, count, datatype, source,
tag, comm, request);
Wait for message completion:
MPI_Wait(request, status);
Test for message completion:
MPI_Test(request, status);
Sometimes useful to have multiple outstanding messages:
MPI_Waitall(count, requests, statuses);
MPI_Waitany(count, requests, index, status);
MPI_Waitsome(count, requests, indices, statuses);
Multiple versions of test as well.
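A sketch of the nonblocking exchange pattern, overlapping communication with computation (sbuf, rbuf, n, partner assumed declared; names illustrative). Posting the receive first means correctness doesn't depend on buffering:

MPI_Request reqs[2];
MPI_Status stats[2];
MPI_Irecv(rbuf, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &reqs[0]);
MPI_Isend(sbuf, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &reqs[1]);
/* ... do work that touches neither sbuf nor rbuf ... */
MPI_Waitall(2, reqs, stats);  /* both buffers safe to reuse after this */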
Other variants of MPI_Send:
MPI_Ssend (synchronous): completes only after the receive has begun
MPI_Bsend (buffered): user provides buffer via MPI_Buffer_attach
MPI_Rsend (ready): matching receive must already be posted
Nonblocking versions exist too (e.g. MPI_Issend).
MPI_Recv receives anything.
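A minimal buffered-send sketch (assumes <stdlib.h> is included; sends n doubles to dest, names illustrative):

int bufsize = (int)(n*sizeof(double)) + MPI_BSEND_OVERHEAD;
char* buf = malloc(bufsize);
MPI_Buffer_attach(buf, bufsize);     /* MPI copies buffered sends here */
MPI_Bsend(sbuf, n, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
MPI_Buffer_detach(&buf, &bufsize);   /* blocks until buffered sends drain */
free(buf);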
MPI_Bcast(buffer, count, datatype,
root, comm);
MPI_Reduce(sendbuf, recvbuf, count, datatype,
op, root, comm);
buffer is copied from root to others
recvbuf receives the result only at root
op is MPI_MAX, MPI_SUM, etc.
#include <stdio.h>
#include <mpi.h>
int main(int argc, char** argv) {
int nproc, myid, ntrials;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &nproc);
MPI_Comm_rank(MPI_COMM_WORLD, &myid);
if (myid == 0) {
printf("Trials per CPU:\n");
scanf("%d", &ntrials);
}
MPI_Bcast(&ntrials, 1, MPI_INT,
0, MPI_COMM_WORLD);
run_mc(myid, nproc, ntrials);
MPI_Finalize();
return 0;
}
Let sums[0] = ∑ᵢ Xᵢ and sums[1] = ∑ᵢ Xᵢ².
void run_mc(int myid, int nproc, int ntrials) {
double sums[2] = {0,0};
double my_sums[2] = {0,0};
/* ... run ntrials local experiments ... */
MPI_Reduce(my_sums, sums, 2, MPI_DOUBLE,
MPI_SUM, 0, MPI_COMM_WORLD);
if (myid == 0) {
int N = nproc*ntrials;
double EX = sums[0]/N;
double EX2 = sums[1]/N;
printf("Mean: %g; err: %g\n",
EX, sqrt((EX*EX-EX2)/N));
}
}
MPI_Barrier(comm);
Blocks until all processes in comm reach it. Not much more to say; not needed that often.
Variants deliver the result everywhere, not just at root: Allreduce, Reduce_scatter, ...

Summary: a small subset of MPI goes a long way:
Init/Finalize
Comm_rank, Comm_size
Send/Recv variants and Wait
Allreduce, Allgather, Bcast
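For instance, if every rank needed the Monte Carlo statistics, the MPI_Reduce call in the earlier run_mc could become an MPI_Allreduce (a hypothetical variant; same buffers as above):

/* Every rank gets the summed statistics, not just rank 0 */
MPI_Allreduce(my_sums, sums, 2, MPI_DOUBLE,
              MPI_SUM, MPI_COMM_WORLD);
/* The mean/error computation can now run on all ranks,
   with no "if (myid == 0)" guard needed. */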