For this project, we will extend our MIPS simulator to simulate a multicore architecture. We will then take a simple matrix multiply program and modify it to take advantage of multicore programming.
The system you are simulating has multiple CPU cores that execute code in parallel but share a single memory address space and cache. As such, you will need to have a separate copy of the CPU registers, program counter, and any other CPU-internal registers (such as HI and LO) for each processor core. There is a single on-chip cache that is shared between all the cores, and main memory will be shared as well. Your simulator should be capable of simulating N cores, where N is chosen at compile time. For simplicity, start out by setting N to 2.
Use your simulator with cache from project 6 as a starting point. You will need to make extensive modifications in multiple files, and will turn in a tarball when finished. You're free to modify anything you wish, as long as you conform to the API and requirements below.
When you execute a program on your multicore simulator, it will
initially start out running in a single context on one core. This is
no different from the previous projects. To run on multiple cores, the
application will use a system call to launch a specified function in a
new context on an unused core. As an example of how this looks,
download the multicore test programs and look at context.c. When
__start() calls context_create(parallel, 55), the function
void parallel(int arg) { } is invoked on the second core and passed
the (arbitrarily chosen) number 55. Meanwhile, the __start() function
continues to execute on the first core, so the two functions are
running in parallel.
Looking in test-include.h, you will see two new system calls, void context_create(void*, int) and int context_get_num(void). The first creates a new context and begins executing it; the second provides a way for a program to determine which core it is currently running on. You will need to implement both of these system calls in syscalls.c in the simulator, using system call numbers 5 and 6 as given in test-include.h.
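The two handlers could be sketched roughly as follows. This is only an illustration under stated assumptions: the names sys_context_create, sys_context_get_num, cores, and the active field are hypothetical, and it assumes the framework passes system call arguments in $a0/$a1 and returns results in $v0, which you should check against the existing handlers in syscalls.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_CORES 2                 /* N, chosen at compile time */

/* Hypothetical per-core state (trimmed to what this sketch needs). */
struct context {
    int      core_num;
    int      active;                /* nonzero once something runs here */
    uint32_t pc;
    uint32_t R[32];
};

struct context cores[NUM_CORES];

/* context_create: start func_addr on the first unused core.
 * Returns the core chosen, or -1 if every core is busy.
 * Assumption: the new context receives its argument in $a0 (R[4]).
 * The real handler must also point $sp (R[29]) at that core's stack. */
int sys_context_create(uint32_t func_addr, uint32_t arg)
{
    for (int i = 0; i < NUM_CORES; i++) {
        struct context *c = &cores[i];
        if (c->active)
            continue;
        c->core_num = i;
        c->pc = func_addr;
        c->R[4] = arg;              /* $a0 */
        c->active = 1;
        return i;
    }
    return -1;
}

/* context_get_num: report the calling core's number.
 * Assumption: results go back to the program in $v0 (R[2]). */
void sys_context_get_num(struct context *caller)
{
    assert(caller != NULL);
    assert(caller->core_num >= 0 && caller->core_num < NUM_CORES);
    caller->R[2] = (uint32_t)caller->core_num;
}
```

What should happen when the launched function returns is not covered here; that is left to your simulator's design.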
Your multicore simulator should consist of a single-threaded
program that simulates multiple contexts. It can do so by executing
some number of instructions from an application in one context,
switching to another, and executing some other number of instructions
from a different application in the second context. Your simulator
itself may not be multi-threaded, as that would defeat the point; the
point of this exercise is to create a deterministic, single-threaded
simulator that can repeatably simulate multiple cores and collect
measurements. To do this, you'll need to execute some number n
of instructions from one context on one core, and then execute the
same number of instructions from the other context on the second
core. If n=1, this gives a fine-grained appearance that things
are happening in parallel. For larger values of n, the
simulation is coarser grained, but may run a bit more
efficiently. Because of this tradeoff, your simulator should take a
command line argument -i n to specify how many
instructions should be run on each core in any simulation step. (Use
the already-implemented -s command line option as an
example of how to read arguments.) If n=0, you should choose a
random number between 1 and 100 on every step of the
simulation. Random numbers can be obtained from the rand()
function; see man rand. You'll need to change the range
of the random number to be 1 to 100; a modulo operation may help with
this. Introducing some randomness will help you catch bugs in your
simulated application code because a different number of instructions
will be executed on every step.
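As a rough illustration of the -i handling described above, the step size could be chosen like this. The helper names parse_quantum and pick_quantum are hypothetical, and the real simulator should follow whatever argument-parsing style the existing -s option uses; the fallback value used when -i is absent is also an assumption.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: scan argv for "-i n" and return n,
 * or the given fallback when the option is absent. */
int parse_quantum(int argc, char **argv, int fallback)
{
    for (int i = 1; i + 1 < argc; i++)
        if (strcmp(argv[i], "-i") == 0)
            return atoi(argv[i + 1]);
    return fallback;
}

/* Number of instructions to run per core in this simulation step:
 * n itself when n > 0, otherwise a fresh value in the range 1..100. */
int pick_quantum(int n)
{
    assert(n >= 0);
    if (n > 0)
        return n;
    return rand() % 100 + 1;        /* rand() % 100 is 0..99, so add 1 */
}
```

The main loop would call pick_quantum once per simulation step and hand the result to run for each core in turn. Because rand() is never seeded here, the sequence is the same on every execution, which keeps the simulation repeatable.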
As a metric for comparing single core execution to multicore execution, print out the total time taken to run a program. The units should be number of instructions. For example, if your program takes 50 instructions to set things up before calling context_create(), and then each context executes 100 instructions, you would print 150. You should also print the cache statistics as in the last project.
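One way to arrive at the 150 in that example is to charge each simulation step the instruction count of its busiest core, since the cores are meant to run in parallel. The sketch below is only one possible accounting, with a hypothetical helper named step_time; the example solution may count time differently.

```c
#include <assert.h>

/* Simulated time consumed by one step: the cores run side by side,
 * so the step costs as many instructions as the busiest core executed. */
unsigned long step_time(const unsigned long executed[], int ncores)
{
    assert(ncores > 0);
    unsigned long longest = executed[0];
    for (int i = 1; i < ncores; i++)
        if (executed[i] > longest)
            longest = executed[i];
    return longest;
}
```

With two cores, a setup phase of {50, 0} followed by a parallel phase of {100, 100} adds up to 50 + 100 = 150.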
As always, an example solution is provided. You can run this by typing /usr/local/cs316/p7/multicore on any of the CSUG Linux hosts.
Your run function should be modified to take a parameter for the number of instructions to run. After running that many instructions it will return, and will be called again for the second core, and then again for the first core, and so on, alternating between the cores. This gives rise to the question, how do you store the PC and register contents separately for each core, while using the same run function? Simply creating a second copy of all of the global variables is not good enough, since that approach will not support more than two cores. One solution is to create a structure for storing all of the information associated with a context. This could be implemented by placing something like the following in main.h:
    struct context {
        int core_num;
        unsigned int pc;
        unsigned int R[32];
        ...
    };
It is up to you to decide what exactly needs to be stored for each context. For each CPU core you can then create an instance of this structure, and then pass the run function a pointer to one of these instances. You'll want to get rid of the global R[] array and other global variables and instead use the ones in the structure, passing a pointer to the structure to any function which needs to access or modify the context.
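A minimal sketch of how run might look once it takes a context pointer and an instruction count. The fields active, hi, lo, and executed, and the step function, are assumptions added for illustration; step stands in for your real fetch/decode/execute code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-core state; extend it with whatever your simulator needs. */
struct context {
    int           core_num;   /* index of the core this context runs on */
    int           active;     /* nonzero while a context is running here */
    uint32_t      pc;
    uint32_t      R[32];
    uint32_t      hi, lo;
    unsigned long executed;   /* instructions executed on this core */
};

/* Placeholder for the real fetch/decode/execute of one instruction. */
void step(struct context *ctx)
{
    ctx->pc += 4;
    ctx->executed++;
}

/* Run up to n instructions in the given context, then return so the
 * caller can move on to the next core. */
void run(struct context *ctx, int n)
{
    assert(ctx != NULL && n > 0);
    for (int i = 0; i < n && ctx->active; i++)
        step(ctx);
}
```

Because every piece of per-core state is reached through ctx, the same run function works for any number of cores.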
You will need to modify the function that allocates memory for the stack so that multiple stacks are allocated, one for each core. This could be done by allocating consecutive 4MB chunks of memory at address 0x40000000, one for each stack. There is nothing magic about these numbers, so you can choose different values if you wish. (As a side note, you can get rid of all of the NetID based stack placement code, since that was only used for the buffer overflow homework.)
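The address arithmetic for consecutive stacks could be sketched as below. It assumes the chunks start at 0x40000000 and are laid out upward, one after another, with each core's stack growing downward from the top of its own chunk; the helper names are hypothetical, and the exact initial $sp offset depends on your framework.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_CORES  2
#define STACK_BASE 0x40000000u
#define STACK_SIZE (4u * 1024u * 1024u)    /* 4MB per core */

/* Lowest address of the chunk reserved for this core's stack. */
uint32_t stack_bottom(int core)
{
    assert(core >= 0 && core < NUM_CORES);
    return STACK_BASE + (uint32_t)core * STACK_SIZE;
}

/* One past the highest address of the chunk; a downward-growing
 * stack would start just below this address. */
uint32_t stack_top(int core)
{
    return stack_bottom(core) + STACK_SIZE;
}
```

With these values, core 0 owns 0x40000000 up to (but not including) 0x40400000, and core 1 owns 0x40400000 up to 0x40800000.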
In the multicore test programs package you will find a matrix multiplying program, matrix.c. This is designed to run on a single core processor.
Copy the program and rename it fastmatrix.c. You'll need to update the TARGETS line in the makefile as well. Modify this program to take advantage of your new multicore simulator. How much faster (in terms of the total time printed by your simulator) can you make this program run?
Finally, report on the cache behavior of your fastmatrix application. What is the cache hit rate of the single-threaded, slow matrix multiply program? What is the cache hit rate of your fast multicore matrix multiply function?
It will be necessary to synchronize the two contexts before printing the resulting matrix. That is, the context which prints the matrix needs to be sure that the other context has finished its computation. This can be done by barrier synchronization, where the main context goes into a loop, checking a flag to see if the other context has finished, and only continuing once this flag has been set.
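The flag-based wait, together with one possible way of splitting the work by rows, might look like the sketch below. Everything here is illustrative: the matrix size, the array names, and the row split are assumptions rather than the contents of matrix.c. So that the sketch compiles on an ordinary host, context_create is replaced by a stand-in that takes a typed function pointer and simply calls the function; on the simulator you would use the real system call declared in test-include.h instead.

```c
#include <assert.h>

#define DIM 4                       /* hypothetical matrix size */

int A[DIM][DIM], B[DIM][DIM], C[DIM][DIM];

/* Set by the worker when its half of C is complete. volatile keeps
 * the compiler from caching the flag in a register inside the loop. */
volatile int worker_done = 0;

/* Host-side stand-in for the real system call: runs func immediately. */
void context_create(void (*func)(int), int arg)
{
    func(arg);
}

/* Compute rows first..last-1 of C = A * B. */
void multiply_rows(int first, int last)
{
    assert(first >= 0 && last <= DIM);
    for (int i = first; i < last; i++)
        for (int j = 0; j < DIM; j++) {
            int sum = 0;
            for (int k = 0; k < DIM; k++)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }
}

/* Runs in the second context: bottom half, then raise the flag. */
void worker(int first_row)
{
    multiply_rows(first_row, DIM);
    worker_done = 1;                /* set only after all rows are written */
}

void fast_multiply(void)
{
    worker_done = 0;
    context_create(worker, DIM / 2);
    multiply_rows(0, DIM / 2);      /* main context: top half */
    while (!worker_done)
        ;                           /* spin until the other context finishes */
}
```

Only after fast_multiply returns is it safe for the main context to print C.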
Extract the project from the tarball: tar -xvzf cache.tar.gz
Compile the project: make
Delete the executable and object files, leaving your source code intact: make clobber (this is normally not necessary, as make will selectively rebuild the portions of your project that have changed)
Run the project: ./simulate <filename>
Turn in a tarball of your multicore simulator. Inside of it there should be a tests directory containing your fastmatrix.c (and possibly other test programs) and a Makefile to build it.
If your simulator is in a directory /home/user/multicore, and you are in your home directory /home/user, you can create a tarball by typing tar -cvzf multicore.tar.gz multicore. If you want to copy this file to a Windows system for submission, you can use FileZilla or any other SFTP client to download it from the CSUG Linux host.
Note: These suggestions for an extra challenge will be examined (and commented on, if your project works well) but not graded. They will have no impact on the class grades. They are here to provide some direction to those who finish their assignments early and are looking for a way to impress friends and family.
A context_destroy(int core_num) system call. This call forcibly terminates the context running on the specified core.
Ask the TAs for help. We expect to see most students in office hours during the course of the project. Extra hours will be scheduled as needed.
If you suspect a bug in the provided framework... ask Michael for help.