COMP 303 - Computer Architecture, Fall 2007

CS316 Project 7

MIPS Multicore Simulator

For this project, we will extend our MIPS simulator to simulate a multicore architecture. We will then take a simple matrix multiply program and modify it to take advantage of multicore programming.

Part 1: Multicore Simulator

The system you are simulating has multiple CPU cores that execute code in parallel but share a single memory address space and cache. As such, you will need to have a separate copy of the CPU registers, program counter, and any other CPU-internal registers (such as HI and LO) for each processor core. There is a single on-chip cache that is shared between all the cores, and main memory will be shared as well. Your simulator should be capable of simulating N cores, where N is chosen at compile time. For simplicity, start out by setting N to 2.

Use your simulator with cache from project 6 as a starting point. You will need to make extensive modifications in multiple files, and will turn in a tarball when finished. You're free to modify anything you wish, as long as you conform to the API and requirements below.

API and Requirements

When you execute a program on your multicore simulator, it will initially start out running in a single context on one core. This is no different from the previous projects. To run on multiple cores, the application will use a system call to launch a specified function in a new context on an unused core. As an example of how this looks, download the multicore test programs and look at context.c. When __start() calls context_create(parallel, 55), the function void parallel(int arg) { } is invoked on the second core and passed the (arbitrarily chosen) number 55. Meanwhile, the __start() function continues to execute on the first core, so the two functions are running in parallel.

Looking in test-include.h, you will see two new system calls, void context_create(void*, int) and int context_get_num(void). The first creates a new context and begins executing it; the second provides a way for a program to determine which core it is currently running on. You will need to implement both of these system calls in syscalls.c in the simulator, using system call numbers 5 and 6 as given in test-include.h.
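As a rough illustration of what the two handlers have to do, here is a minimal sketch. It rests on several assumptions that are not confirmed by the handout: that context_create is number 5 and context_get_num is number 6 (check test-include.h), that the system call number arrives in $v0 (R[2]), and that the context structure has an `active` field. The function name `do_context_syscall` is hypothetical. Arguments in $a0/$a1 and the return value in $v0 follow the usual MIPS calling convention. Setting up the new core's stack pointer is left out.

```c
#include <assert.h>

#define SYS_CONTEXT_CREATE  5   /* assumed mapping; check test-include.h */
#define SYS_CONTEXT_GET_NUM 6   /* assumed mapping; check test-include.h */

struct context {
    int core_num;
    int active;                 /* hypothetical: nonzero if the core is in use */
    unsigned int pc;
    unsigned int R[32];
};

/* Hypothetical handler. Returns the core affected, or -1 on failure. */
int do_context_syscall(struct context *cur, struct context cores[], int ncores)
{
    assert(cur != 0 && cores != 0);
    switch (cur->R[2]) {                    /* $v0: syscall number (assumption) */
    case SYS_CONTEXT_CREATE:
        for (int i = 0; i < ncores; i++) {
            if (!cores[i].active) {
                cores[i].active = 1;
                cores[i].core_num = i;
                cores[i].pc = cur->R[4];    /* $a0: address of the function */
                cores[i].R[4] = cur->R[5];  /* $a1 becomes the function's $a0 */
                return i;
            }
        }
        return -1;                          /* no unused core available */
    case SYS_CONTEXT_GET_NUM:
        cur->R[2] = (unsigned int)cur->core_num;  /* result goes back in $v0 */
        return cur->core_num;
    default:
        return -1;
    }
}
```

In this sketch, a call such as context_create(parallel, 55) on core 0 would mark core 1 as active, point its PC at parallel, and place 55 in its $a0.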

Your multicore simulator should be a single-threaded program that simulates multiple contexts. The simulator itself may not be multi-threaded, as that would defeat the point of the exercise, which is to create a deterministic, single-threaded simulator that can repeatably simulate multiple cores and collect measurements. To do this, execute some number n of instructions from one context on one core, then switch and execute the same number of instructions from the other context on the second core. If n=1, this gives a fine-grained appearance that things are happening in parallel. For larger values of n, the simulation is coarser grained, but may run a bit more efficiently. Because of this tradeoff, your simulator should take a command line argument -i n to specify how many instructions should be run on each core in any simulation step. (Use the already-implemented -s command line option as an example of how to read arguments.) If n=0, you should choose a random number between 1 and 100 on every step of the simulation. Random numbers can be obtained from the rand() function; see man rand. You'll need to reduce the range of the random number to 1 to 100; a modulo operation may help with this. Introducing some randomness will help you catch bugs in your simulated application code, because a different number of instructions will be executed on every step.
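The -i handling described above might be sketched as follows. The helper names `parse_quantum` and `pick_quantum` and the `default_n` parameter are hypothetical; the simulator's existing -s parsing may be organized differently, so treat this as one possible shape and not as the framework's actual code.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: scan argv for "-i n" and return n, or default_n
   if the option is absent. */
int parse_quantum(int argc, char *argv[], int default_n)
{
    for (int i = 1; i + 1 < argc; i++) {
        if (strcmp(argv[i], "-i") == 0)
            return atoi(argv[i + 1]);
    }
    return default_n;
}

/* Hypothetical helper: instructions to run on each core in this step.
   n is the value given to -i; 0 means "pick randomly in 1..100". */
int pick_quantum(int n)
{
    assert(n >= 0);
    if (n == 0)
        return rand() % 100 + 1;  /* rand() % 100 is 0..99; +1 gives 1..100 */
    return n;
}
```

Calling pick_quantum once per simulation step, and using the result for every core in that step, matches the requirement that each core runs the same number of instructions within a step. If rand() is never seeded with srand(), it produces the same sequence on every run, which keeps the simulation repeatable.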

As a metric for comparing single core execution to multicore execution, print out the total time taken to run a program. The units should be number of instructions. For example, if your program takes 50 instructions to set things up before calling context_create(), and then each context executes 100 instructions, you would print 150. You should also print the cache statistics as in the last project.
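One possible way to arrive at that number is to record, for each core, the simulated time at which it started and how many instructions it executed, and then report the latest finish time. The structure `core_time` and the function `total_time` below are hypothetical names used only for this sketch.

```c
#include <assert.h>

struct core_time {
    unsigned long start;     /* simulated time at which the core began running */
    unsigned long executed;  /* instructions executed on this core */
};

/* Total time is the latest finish time over all cores. */
unsigned long total_time(const struct core_time t[], int ncores)
{
    unsigned long latest = 0;
    assert(ncores > 0);
    for (int i = 0; i < ncores; i++) {
        unsigned long finish = t[i].start + t[i].executed;
        if (finish > latest)
            latest = finish;
    }
    return latest;
}
```

For the example above, core 0 starts at time 0 and executes 50 + 100 = 150 instructions, while core 1 starts at time 50 and executes 100, so both finish at time 150 and the total is 150.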

As always, an example solution is provided. You can run this by typing /usr/local/cs316/p7/multicore on any of the CSUG Linux hosts.

Implementation Guidelines

Your run function should be modified to take a parameter for the number of instructions to run. After running that many instructions it will return, and will be called again for the second core, and then again for the first core, and so on, alternating between the cores. This gives rise to the question, how do you store the PC and register contents separately for each core, while using the same run function? Simply creating a second copy of all of the global variables is not good enough, since that approach will not support more than two cores. One solution is to create a structure for storing all of the information associated with a context. This could be implemented by placing something like the following in main.h:

struct context {
  int core_num;
  unsigned int pc;
  unsigned int R[32];
  ...
};

It is up to you to decide what exactly needs to be stored for each context. For each CPU core you can then create an instance of this structure, and then pass the run function a pointer to one of these instances. You'll want to get rid of the global R[] array and other global variables and instead use the ones in the structure, passing a pointer to the structure to any function which needs to access or modify the context.
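To make the idea concrete, here is a toy sketch of a run function that works only through the context pointer it is given. The `hi`, `lo`, and `executed` fields and the `run_steps` helper are assumptions added for illustration, and each "instruction" here merely advances the PC by 4 instead of decoding anything.

```c
#include <assert.h>

struct context {
    int core_num;
    unsigned int pc;
    unsigned int R[32];
    unsigned int hi, lo;      /* assumed: HI and LO are kept per core */
    unsigned long executed;   /* hypothetical instruction counter */
};

/* Toy stand-in for the real run function: executes n "instructions"
   by advancing the PC, touching only the context passed in. */
void run(struct context *ctx, int n)
{
    assert(ctx != 0 && n >= 0);
    for (int i = 0; i < n; i++) {
        ctx->pc += 4;
        ctx->executed++;
    }
}

/* Alternate between the cores, n instructions each, for a number of steps. */
void run_steps(struct context cores[], int ncores, int n, int steps)
{
    for (int s = 0; s < steps; s++)
        for (int c = 0; c < ncores; c++)
            run(&cores[c], n);
}
```

Because run never touches a global, the same function serves any number of cores; supporting more than two is a matter of making the array larger.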

You will need to modify the function that allocates memory for the stack so that multiple stacks are allocated, one for each core. This could be done by allocating consecutive 4MB chunks of memory at address 0x40000000, one for each stack. There is nothing magic about these numbers, so you can choose different values if you wish. (As a side note, you can get rid of all of the NetID based stack placement code, since that was only used for the buffer overflow homework.)
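The address arithmetic for this layout can be sketched as below. The macro and function names are hypothetical, and starting the stack pointer 8 bytes below the top of each chunk is an assumption made to keep it inside the chunk and 8-byte aligned; MIPS stacks grow downward, so the initial stack pointer sits near the high end of the chunk.

```c
#include <assert.h>

#define STACK_REGION_BASE 0x40000000u           /* from the text above */
#define STACK_SIZE        (4u * 1024u * 1024u)  /* 4MB per core */

/* Lowest address of the stack chunk belonging to the given core. */
unsigned int stack_base(int core)
{
    assert(core >= 0);
    return STACK_REGION_BASE + (unsigned int)core * STACK_SIZE;
}

/* Initial stack pointer for the given core: near the top of its chunk. */
unsigned int initial_sp(int core)
{
    return stack_base(core) + STACK_SIZE - 8u;
}
```

With these values, core 0's stack occupies 0x40000000 to 0x403FFFFF and core 1's begins at 0x40400000, so the chunks do not overlap.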

Part 2: Matrix Multiply Application

In the multicore test programs package you will find a matrix multiplying program, matrix.c. This is designed to run on a single core processor.

Copy the program and rename it fastmatrix.c. You'll need to update the TARGETS line in the makefile as well. Modify this program to take advantage of your new multicore simulator. How much faster (in terms of the total time printed by your simulator) can you make this program run?

Finally, report on the cache behavior of your fastmatrix application. What is the cache hit rate of the single-threaded, slow matrix multiply program? What is the cache hit rate of your fast multicore matrix multiply function?

It will be necessary to synchronize the two contexts before printing the resulting matrix. That is, the context which prints the matrix needs to be sure that the other context has finished its computation. This can be done by barrier synchronization, where the main context goes into a loop, checking a flag to see if the other context has finished, and only continuing once this flag has been set.
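A minimal sketch of one way to split the work and wait on a flag is shown below. It is not the contents of matrix.c: the matrix size N, the array names, and the row-wise split are all assumptions. So that the sketch compiles outside the simulator, it uses a hypothetical stand-in, `context_create_stub`, which simply calls the function immediately; in your test program you would call the real context_create from test-include.h instead.

```c
#include <assert.h>

#define N 4   /* hypothetical size; matrix.c may use a different one */

int A[N][N], B[N][N], C[N][N];
volatile int done = 0;   /* set by the second context when it has finished */

/* Compute rows lo..hi-1 of C = A * B. */
void multiply_rows(int lo, int hi)
{
    assert(0 <= lo && lo <= hi && hi <= N);
    for (int i = lo; i < hi; i++)
        for (int j = 0; j < N; j++) {
            int sum = 0;
            for (int k = 0; k < N; k++)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }
}

/* Runs in the second context: computes rows arg..N-1, then raises the flag. */
void worker(int arg)
{
    multiply_rows(arg, N);
    done = 1;
}

/* Stand-in for context_create so this compiles on a host machine.
   It runs fn right away instead of starting it on another core. */
void context_create_stub(void (*fn)(int), int arg)
{
    fn(arg);
}

void fast_multiply(void)
{
    done = 0;
    context_create_stub(worker, N / 2);  /* other core: rows N/2..N-1 */
    multiply_rows(0, N / 2);             /* this core: rows 0..N/2-1 */
    while (!done)
        ;                                /* wait until the flag has been set */
}
```

The flag is declared volatile so that the compiler rereads it from memory on every pass through the loop. Each context writes a disjoint set of rows of C, so the two never write the same element.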

How to Get Started

Extract the project from the tarball: tar -xvzf cache.tar.gz
Compile the project: make
Delete the executable and object files, leaving your source code intact: make clobber (this is normally not necessary, as make will selectively rebuild the portions of your project that have changed)
Run the project: ./simulate <filename>

What to Turn In

Turn in a tarball of your multicore simulator. Inside of it there should be a tests directory containing your fastmatrix.c (and possibly other test programs) and a Makefile to build it.

If your simulator is in a directory /home/user/multicore, and you are in your home directory /home/user, you can create a tarball by typing tar -cvzf multicore.tar.gz multicore. If you want to copy this file to a Windows system for submission, you can use FileZilla or any other SFTP client to download it from the CSUG Linux host.

For the Adventurous

Note: These suggestions for an extra challenge will be examined (and commented on, if your project works well) but not graded. They will have no impact on the class grades. They are here to provide some direction to those who finish their assignments early and are looking for a way to impress friends and family.

Add a context_destroy(int core_num) system call. This call forcibly terminates the context running on the specified core.
Add needed functionality to allow synchronization without busy looping.

Help and Hints

Ask the TAs for help. We expect to see most students in office hours during the course of the project. Extra hours will be scheduled as needed.

If you suspect a bug in the provided framework... ask Michael for help.
