Functions in Assembly

Instructions: Remember, all assignments in CS 3410 are individual. You must submit work that is 100% your own. Remember to ask for help from the CS 3410 staff in office hours or on Ed! If you discuss the assignment with anyone else, be careful not to share your actual work, and include an acknowledgment of the discussion in a collaboration.txt file along with your submission.

The assignment is due via Gradescope at 11:59pm on the due date indicated on the schedule.

Submission Requirements

Please submit the following files.

From the in-lab work:

  • addone.s: your first assembly function, which just does i+1
  • recsum.s: a recursive summation function

From Part 1:

  • recursive.s: a recursive Fibonacci implementation
  • memoization.s: a memoized Fibonacci implementation
  • tail_recursive.s: a tail-recursive Fibonacci implementation
  • opt_tail.s: a tail-call-optimized Fibonacci implementation

Provided Files

We have provided you:

  • a Makefile
  • recursive.c, memoization.c, tail_recursive.c, and opt_tail.c as the skeleton code
  • compare.c to run the timing to evaluate different methods

Overview

This assignment will expand your understanding of RISC-V assembly programming with a focus on managing the call stack. You will get direct experience with defining and calling functions, and you will learn about optimizing code for performance.

Part 0: Lab Section

During this lab section, you’ll get some initial experience with writing functions in assembly. The key challenge is in following the RISC-V standard calling convention. The calling convention describes the rules that functions and the code that calls them follow so that they can be compatible.

To adhere to the calling convention, all RISC-V functions are composed of a prologue, the function body, and an epilogue. You can start by writing the body, and then note which callee- and caller-saved registers you use in the body. Then, you can write the prologue and epilogue to properly save and restore those registers.

Warm Up: addOne

Let’s start simple by implementing a function that just adds 1 to its argument. You can imagine this C function:

int addOne(int i) {
  return i + 1;
}

Start by writing the body. If you refer to the RISC-V calling convention, you’ll notice that the first argument and the return value go in register a0. So the body of this function is pretty simple:

addi a0, a0, 1

Next, write the prologue. We need to decide how big the stack frame must be; we’ll call that number SIZE. It must be big enough to hold the return address, any callee-saved registers, and any local variables that don’t fit in registers. Here’s a compact “to-do” list for what the prologue must do:

  1. Move the stack pointer, with addi sp, sp, -SIZE.
  2. Save the return address to the stack.
  3. Save any callee-saved registers that our body uses to the stack.

Then, the epilogue does the opposite:

  1. Restore any callee-saved registers from the stack.
  2. Retrieve the return address, ra, from the stack.
  3. Move the stack pointer back to its original position, with addi sp, sp, SIZE.
  4. Use ret (a.k.a. jr ra) to return to the caller.

Our function body, as we’ve written it, doesn’t use any callee-saved registers (s0 through s11) or need any stack space for locals. So the only thing we need to store in the stack frame is the return address. Pointers in our architecture are 64 bits, so that’s 8 bytes in our stack frame.

Putting it all together, here’s an implementation of addOne:

addOne:
  # Prologue.
  addi sp, sp, -8  # Push the stack frame.
  sd   ra, 0(sp)   # Save return address.

  # Body.
  addi a0, a0, 1

  # Epilogue.
  ld   ra, 0(sp)   # Restore return address.
  addi sp, sp, 8   # Pop the stack frame.
  ret

A key part of writing the prologue and epilogue is deciding where in the stack frame to put all your temporary values. In other words, which offsets on sp will you use to load and store them? In this function, there’s only one temporary, the return address, so that goes at 0(sp). But in general, it’s up to you to decide how to lay out the values within your stack frame.

Put your addOne implementation in a file called addone.s.

Trying It Out: Calling Your Function From C

We can’t run this assembly program directly because there is no main function yet. It also doesn’t print anything out, which would make it hard to tell what it’s doing. One way to try out your assembly functions is to write some C code that calls them.

Make sure that your addone.s implementation has an addOne: label at the top. At the top of the file, add this line:

.global addOne

This directive tells the assembler that the addOne label is a global symbol, so it’s accessible to other code.

Then, write (in a separate file) a short C program like this:

#include <stdio.h>

int addOne(int i);

int main() {
  int res = addOne(42);
  printf("%d\n", res);
}

That addOne declaration is a prototype, which means it doesn’t have a function body. It just tells the C compiler that the function is implemented elsewhere—in your case, in an assembly file.

Now, let’s compile and link these two files together. Use a command like this:

$ rv gcc your_assembly_file.s your_wrapper_code.c -o your_test

Then use rv qemu your_test, as usual, to run the linked program.

All this works because of the magic of calling conventions. You and GCC are both “assembly writers,” and because you agree on the standard way to invoke functions, the assembly code you both write can interoperate.

Recursive Sum

Next, we’ll write a recursive function that sums the integers from \(1\) through \(n\). The function we want to implement would look something like this in C:

int sum(int n) {
  if (n == 0)
    return n;
  return n + sum(n - 1);
}

In assembly, recursive function calls work exactly the same way as any other function call—the caller and callee just happen to be the same function. We’ll follow the RISC-V calling convention in both roles.

Start by writing the function body. The interesting part is implementing the function call. Which caller-saved registers do you need to save before the jal instruction and restore after the jal?

Next, write the prologue and epilogue. You’ll want to start by making a list of all the values this function will ever need to store in its stack frame, including the return address and any local-variable slots. Decide on a stack frame layout, i.e., which offset you’ll use for each value. Then, follow the recipe from the addOne step above to implement the prologue and epilogue.

Try your function out by writing a main wrapper in C, as we did above. You’ll want to try calling your sum function on several different inputs.

Put your assembly implementation of sum in a file called recsum.s.

Part 1: Optimizing Fibonacci

In this part of the assignment, you will implement several different versions of a function that calculates numbers in the Fibonacci sequence. We’ll start with a straightforward recursive implementation and then explore some performance optimizations.

Recursive Fibonacci

Here’s a straightforward recursive implementation of a Fibonacci function in C:

unsigned long r_fibonacci(int n) {
  if (n == 0)
    return 0;
  else if (n == 1)
    return 1;
  else
    return r_fibonacci(n - 2) + r_fibonacci(n - 1);
}

Your task is to translate this code into RISC-V assembly.

Put your implementation in a file called recursive.s. We have provided a main function you can use to test your code in recursive.c. To test your code, type:

$ rv make recursive     # Build the `recursive` executable.
$ rv qemu recursive 10  # Run it.

The recursive executable takes a command-line argument: the index of the Fibonacci number to calculate. So rv qemu recursive 10 should print the 10th Fibonacci number, which is 55.

Memoization

The recursive implementation works, but it is very slow. Try timing the execution of a few Fibonacci calculations:

$ time rv qemu recursive 35
$ time rv qemu recursive 40
$ time rv qemu recursive 42

On my machine, calculating the 40th Fibonacci number took 4 seconds, and calculating the 42nd took 11 seconds. That suggests that the asymptotic complexity is pretty bad: the running time grows exponentially with \(n\).

Part of the problem is that the recursive version recomputes the same answer many times. For example, if you call r_fibonacci(4), it will eventually call r_fibonacci(2) twice: once directly, and once indirectly via the recursive call to r_fibonacci(3). This redundancy can waste a lot of work.

A popular way to avoid wasteful recomputation is memoization. The idea is to maintain a memo table of previously-computed answers and to reuse them whenever possible. For our function, the memo table can just be an array, where the \(i\)th index holds the \(i\)th Fibonacci number. Here’s some Python code that illustrates the idea:

def m_fibonacci(n, memo_table, size):
    # Check the memo table. A nonzero value means we've already computed this.
    if n < size and memo_table[n] != 0:
        return memo_table[n]

    # We haven't computed this, so do the actual recursive computation.
    if n == 0:
        return 0
    elif n == 1:
        return 1
    answer = (m_fibonacci(n - 2, memo_table, size) + 
        m_fibonacci(n - 1, memo_table, size))

    # Save the answer in the memo table before returning.
    if n < size:
        memo_table[n] = answer

    return answer

In C, the type of memo_table will be unsigned long*, i.e., an array of unsigned numbers. size is the length of that array. Here’s the function signature for our new function:

unsigned long m_fibonacci(int n, unsigned long* memo_table, int size);

Implement this m_fibonacci function in RISC-V assembly. Put your code in memoization.s.

We have provided a memoization.c wrapper that you can use to test your code. You can use the same procedure as above to try your implementation: rv make memoization followed by rv qemu memoization <number>.

Notice how much faster the new implementation is! Take some number that was especially slow in the recursive implementation and time it using your memoized version:

$ time rv qemu memoization 42

On my machine, that takes just 0.5 seconds. That’s 22× faster!

Tail-Recursive Version

While the new version is a lot faster, it still makes a lot of function calls. Some of those function calls turn out to be fast, because they just look up the answer in the memo table. But we can do better by changing the algorithm to need only one recursive call.

Again using Python syntax, here’s the algorithm for a faster recursive version:

def tail_r_fibonacci(n, a, b):
    if n == 0:
        return a
    if n == 1:
        return b
    return tail_r_fibonacci(n - 1, b, a + b)

This version is called tail-recursive because the recursive call is the very last thing the function does before returning. Marvel at the fact that this version makes only \(n\) recursive calls to calculate the \(n\)th Fibonacci number!

Here’s the function signature for this version:

unsigned long tail_r_fibonacci(int n, unsigned long a, unsigned long b);

Implement this tail_r_fibonacci function in tail_recursive.s. As usual, we have provided a C wrapper so you can test your implementation: rv make tail_recursive followed by rv qemu tail_recursive <number>.

Tail-Call Optimization

Making \(n\) recursive calls is pretty good, but is it possible to optimize this code to do no recursion at all? That would mean that the algorithm uses \(O(1)\) stack space instead of \(O(n)\).

That’s the idea in tail-call optimization. The plan is to exploit the fact that, once the recursive call to tail_r_fibonacci is done, the caller has nothing more to do. The callee puts its return value in a0, and that is exactly what the caller wants to return too. Because there is no more work to do after the tail call, we don’t need to waste time maintaining the stack frame for the caller. We can just reuse the same stack frame for the recursive call!

Implement an optimized version of the tail-recursive Fibonacci algorithm in opt_tail.s. Instead of using a jal (or call) instruction for the recursive call, you can just use a plain unconditional jump (j in RISC-V). Be sure to carefully think through when and where you need to save and restore the return address to make this work.

Your function should be named opt_tail_fibonacci, and it should have the same function signature as the previous version. As usual, opt_tail.c can help you test your implementation: rv make opt_tail followed by rv qemu opt_tail <number>.

Compare Performance

We have provided a program, in compare.c, that can compare the performance of these various optimizations more precisely than the time command. (The time command also measures how long it takes to start the executable, which can be slow, especially when it entails launching a Docker container.) Build the tool and invoke it like this:

$ rv make compare
$ rv qemu compare <method> <n>

You can give it the name of a method (recursive, memoization, tail_recursive, or opt_tail) and a number \(n\) to measure the time taken to compute the \(n\)th Fibonacci number. Or use the all method to compare all the implementations.

When I ran this once on my machine with \(n=20\), it reported that the recursive implementation took about 2.6 seconds, memoization brought this down to just 7 milliseconds, tail recursion was even faster at 3 ms, and the optimized tail call version was blazingly fast at only half a millisecond. Every computer is different, so your numbers will vary, but see if you observe the same overall performance trend.

There is nothing to turn in for this part—it’s just cool!

Rubric

  1. Part 0
    • addone.s: 5 points
    • recsum.s: 5 points
  2. Part 1
    • recursive.s: 10 points
    • memoization.s: 15 points
    • tail_recursive.s: 10 points
    • opt_tail.s: 15 points