Project 4 > Buffer Overflow

CS 3410 Spring 2019


Due: 11:59PM, Tuesday, April 16, 2019

Late Policy: Up to 2 slip days can be used for this project. If you are out of slip days, submissions after the due date will incur a 25% deduction per day late.

Grace Period Policy: Do not rely on the grace period to submit on time. Everything should be uploaded BEFORE the due date above.

Reminder: You must work alone for this project.

Warning: Read the ENTIRE writeup before you begin. Regrades will not be honored for submissions that do not follow the writeup.


Setting up your Environment

For this project you should either be SSH'd into a UGCLINUX machine or be using the course VM found on the course webpage.

Files: The files you will need for this assignment will be in your personal github repository.

Before you can run the course simulator, you need to make sure that your toolchain and environmental variables are set up correctly. We have provided a handy setup script to handle this for you.

On the course server or VM, navigate to where your P4 release files are and run:

$ python setup.py

The script automatically detects whether you have the RISC-V toolchain installed and attempts to determine your netid from your system environment. If everything goes well, you should see something like

  RISC-V toolchain detected at /usr/local/riscv, skipping PATH modification
    Setting up environment variables...
    Please confirm your NETID (enter nothing if default is correct), autodetected as "<your netid>":

If the displayed netid is correct, simply hit enter to finish setup. If not, type the correct netid and hit enter. If no local copy of the RISC-V toolchain is installed, setup.py will automatically download and install it from the course server. Follow the provided prompts and proceed.

To invoke the simulator, run:

$ python simulate.py browser

Overview

The goal of this project is to get intimately familiar with the layout and use of call stacks, as well as RISCV machine language, assembly and disassembly, debugging, and reverse engineering. As a side benefit, we hope to raise your awareness of computer security issues. To this end, you will write a buffer overrun exploit to break a program that we provide to you.

WARNING: These kinds of friendly hacking challenges have a long history, and hacking skills are priceless, as they reflect a deep understanding of the operation of a computer system. But you must be responsible and use your skills wisely. Taking over machines or hacking the Internet carries stiff penalties, is a sure-fire way to get expelled from Cornell, interferes with other people's lives, and is a waste of your talent. It is also plain wrong.

What to Submit

Submit your raw binary exploit file containing the specially crafted input. We will try it out on our own copy of browser to see if it successfully breaks it.

Also submit a text document/README that explains the exploit file. This should include a text listing from xxd of the bytes in your exploit file, annotated with comments to explain what it is doing (or trying to do). Your documentation should explain how your exploit tries to subvert the program's check that the input string matches the expected string, and why this works. In addition, it should explain how your exploit is able to take control of the program and what steps the exploit takes to force the program to print out the desired string.

The Story

In this project, you will "0wn" a binary program called browser that we will provide to you. We will not be providing the source code for this program. All that you know about this program is what is documented here, and what you can figure out for yourself by running or examining the binary. The browser is a simplified web browser. The normal operation of browser is very simple. When executed, it prompts you for a URL, and then prints a simple message (the '$' shown here is the linux shell prompt):

$ python simulate.py browser
Where to connect? www.google.com
Connected to www.google.com!

I can also send input to browser from another program using the linux shell '|' operator, with the same results:

$ echo "www.google.com" | python simulate.py browser
Where to connect?
Connected to www.google.com!

However, this browser only lets you connect to www.google.com. All other URLs will be rejected — try it and see!

The rumor is that browser suffers from a buffer overflow vulnerability. Since the program only takes one input, it's not difficult to guess where the problem might lie. Thus, you would like to get this browser to let you connect to facebook.ru, even though browser was originally designed to only allow access to www.google.com.

0wning browser: Your job is to craft some input to browser that will cause it to print out a different message, specifically: "LOL 0wn3d! <netid> is on facebook.ru!" (substitute your own NetID). The fact that the normal "Only www.google.com is allowed" message is missing constitutes proof that you have completely subverted the browser, and have gotten it to do something that it could not do before.

$ cat exploit | python simulate.py browser
Where to connect?
LOL 0wn3d! hw342 is on facebook.ru!

To do this, you will need to inject new code into the browser program as it is running. You are not allowed to modify or replace the browser program on disk. The only way you get to interact with browser is to feed it some carefully crafted input.

The simulator: The browser program is compiled to run on a RISCV CPU. Since most of you don't have access to a real RISCV CPU (neither do we), you will not be able to natively execute the program. Instead, you can run a program which takes browser and simulates the execution of the code.

To figure out how to attack browser, you'll need to step through its code as it is executing and reverse engineer the parts that matter, namely, where (i.e., at which memory location) the input buffer is stored, which values lie near it in memory, and what precise instruction sequence is vulnerable to a buffer overflow attack. Since you have the RISCV binary, you can use various tools to disassemble the browser binary and learn about its layout and code.

You can also use the -d option to the simulator, which starts an interactive debugger for the simulated program execution. This lets you step through the execution one instruction at a time, examine memory and the stack contents, and so on. See the README file in your repo for help using the simulator and its built-in debugger.

Stack Randomization: Note that in a feeble effort to thwart just such attacks, the simulator, like many real machines, implements stack randomization, a limited kind of program layout randomization. When the simulator starts, it initializes the stack to a variable address, rather than the standard 0x7FFFFFFC. The starting location of the stack is derived from the $NETID environment variable.

Executing the Attack: Once you have figured out the program and stack layout, you need to come up with a carefully crafted input that will take over browser. This input will likely contain some binary data (the attack payload) that corresponds to RISCV instructions you want to have executed. There are several tools you might want to use to create the payload and inject it into the running browser: a RISCV assembler, to convert from RISCV assembly into RISCV machine language; xxd for converting text files containing hex digits to (or from) raw binary files; and cat for sending raw binary input to browser.

Once your attack causes browser to print the "LOL 0wn3d! <netid> is on facebook.ru!" message, the browser program should exit gracefully (this means, exit with status 0). It is trivial to make it loop forever. A clean exit only takes a few extra instructions to invoke the normal exit() routine.
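
A quick way to check both requirements (the printed message and the exit status) is to run the program from a small script. The following is only a sketch: run_with_input is a hypothetical helper name, and the demo uses a stand-in child process rather than the simulator so that it runs anywhere. It also assumes that simulate.py passes the simulated program's exit status through to the shell, which you should confirm for yourself.

```python
import subprocess
import sys

def run_with_input(command, payload):
    """Run command, feed payload (bytes) on stdin, return (stdout bytes, exit status)."""
    result = subprocess.run(command, input=payload, stdout=subprocess.PIPE)
    return result.stdout, result.returncode

# Stand-in command: a tiny Python child that copies stdin to stdout and exits normally.
demo_command = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read())"]
output, status = run_with_input(demo_command, b"hello\n")
print(repr(output), status)
```

For the project itself you would pass something like ["python", "simulate.py", "browser"] as the command and the contents of your exploit file (opened in binary mode) as the payload. Printing repr(output) makes stray or missing newlines easy to spot.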

Tools

Here are a few tools you might find useful for this homework.

xxd

xxd is a tool for converting back and forth between raw binary files and text representations of the binary data. For example, if I create a file exploit.txt (using a regular text editor) specifying twenty-eight consecutive "bytes" in hex:

68 77 33 34 32 20
00 00 00 00 00 00 00 00 00
00
01 02 03 04
aa bb cc dd
11 22 33 44

then I can convert this into raw binary using xxd in "reverse plain" mode:

$ xxd -r -p exploit.txt > exploit
$ ls -l exploit*
-rw-r--r-- 1 hw342 hw342 28 2011-02-25 12:06 exploit
-rw-r--r-- 1 hw342 hw342 84 2011-02-25 12:06 exploit.txt

You can see that the text version is 84 bytes (it includes spaces and 2 digits of text per "byte"), and the raw version is 28 bytes. xxd is picky about the format of the input file in "reverse" mode (spaces at the ends of lines silently mess things up, for example). So you may want to convert the raw file back to text and compare it to your desired bytes to make sure nothing went wrong:

$ xxd exploit
0000000: 6877 3334 3220 0000 0000 0000 0000 0000  hw342 ..........
0000010: 0102 0304 aabb ccdd 1122 3344            ........."3D
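
If you would rather do the same conversion in a script, the sketch below mirrors what xxd -r -p does for the example above. The function name hex_text_to_bytes is made up for illustration, and the function is deliberately stricter than xxd: it raises an error on anything that is not a pair of hex digits instead of silently producing different bytes.

```python
def hex_text_to_bytes(text):
    """Convert whitespace-separated hex digit pairs to raw bytes."""
    digits = "".join(text.split())  # drop all spaces and newlines, including trailing ones
    if len(digits) % 2 != 0:
        raise ValueError("odd number of hex digits")
    return bytes.fromhex(digits)    # raises ValueError on non-hex characters

EXPLOIT_TXT = """\
68 77 33 34 32 20
00 00 00 00 00 00 00 00 00
00
01 02 03 04
aa bb cc dd
11 22 33 44
"""

raw = hex_text_to_bytes(EXPLOIT_TXT)
print(len(raw))   # 28
print(raw[:6])    # b'hw342 '
```

Writing raw to a file opened with mode "wb" gives the same 28-byte file that xxd produces for this input.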

Using the standard library

riscv32-objdump can give you a listing of the assembly code for browser:

$ riscv32-objdump -xdl browser 

This becomes very helpful as it includes the disassembly of the standard library, which has functions you need to call.

Note: all functions without underscores follow calling conventions. If you want to see information about function calls (such as get(), print(), etc.) that you see in the object dumps, refer to the Linux man pages.

Example: to use the stdlib function malloc (which is not relevant to this project and is only used here as an example), search the assembly code of browser output by riscv32-objdump for the function's label:

00010cb8 <malloc>:
malloc():
   10cb8: 85aa                  mv  a1,a0
   10cba: 1cc1a503            lw  a0,460(gp) # 1da1c <_impure_ptr>
   10cbe: a029                  j 10cc8 <_malloc_r>

This enables you to call the function malloc by jumping to address 00010cb8 and using standard calling conventions to invoke the call by saving arguments in appropriate registers.
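
Keep in mind that objdump prints addresses and 32-bit instruction words as single hex numbers, while memory (and your exploit file) holds them byte by byte. Assuming the usual little-endian RISC-V configuration, the least significant byte comes first. The sketch below shows the conversion in both directions; word_to_bytes and bytes_to_word are illustrative names, not part of any provided tool.

```python
import struct

def word_to_bytes(value):
    """Pack a 32-bit value least significant byte first (little-endian)."""
    return struct.pack("<I", value)

def bytes_to_word(raw):
    """Inverse of word_to_bytes: read 4 little-endian bytes as one 32-bit value."""
    return struct.unpack("<I", raw)[0]

def as_hex_pairs(raw):
    """Format bytes as space-separated hex pairs, the layout used in exploit.txt."""
    return " ".join("{:02x}".format(b) for b in raw)

print(as_hex_pairs(word_to_bytes(0x00010cb8)))  # address of malloc: b8 0c 01 00
print(as_hex_pairs(word_to_bytes(0x1cc1a503)))  # the lw instruction: 03 a5 c1 1c
```

This is also why the words you see with x/4x in gdb look byte-reversed compared to an xxd dump of the same data.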

Pipelines and Redirections

Pipes and redirection, you may recall, are shell command line operators that let you connect the output of one program (say cat or xxd) to the input of another program or to a file. So you can, for example, concatenate two text files using cat, send the resulting text as input to xxd -r -p, send the resulting raw binary to the simulated browser, then send the resulting output to a file output.txt, all using a single command:

$ cat exploit_part1.txt exploit_part2.txt | xxd -r -p | python simulate.py browser > output.txt

Debugging

To start an interactive debugger for the simulated program execution, run

$ python simulate.py -d browser

For help, use

$ python simulate.py -h browser

When run with the -d flag, the simulator reports that it is listening on a given port number. You will need this port number in the next step.

Once the simulator is waiting, open gdb using the following command in another terminal:

$ riscv32-gdb browser

And in the GDB terminal, using the port number you saw earlier,

(gdb) target remote localhost:[enter port number here]

You can now debug the program remotely.

  • To debug with gdb and see the assembly of the program, you can use:
    (gdb) layout asm
  • To step to the next assembly instruction, you can use:
    (gdb) si
  • To examine the stack, you can use the sp register. This shows the first 4 words of memory, starting from sp:
    (gdb) x/4x $sp
    This shows 10 words of memory, starting from sp - 20:
    (gdb) x/10x $sp-20
  • To list all the registers, you can use:
    (gdb) i r
  • To print the contents of a specific register, use the following where # is the register number.
    (gdb) p $#
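
To see which addresses an x command covers, the arithmetic can be sketched as follows. This assumes gdb's unit size is the 4-byte word (x/10xw makes that explicit); the stack pointer value is hypothetical and examined_range is an illustrative name.

```python
def examined_range(start, count, word_size=4):
    """Return (first byte address, last byte address) shown by x/<count>x <start>."""
    return start, start + count * word_size - 1

sp = 0x7fffff00  # hypothetical stack pointer value, not the real one
first, last = examined_range(sp - 20, 10)
print(hex(first), hex(last))  # 0x7ffffeec 0x7fffff13
```

In other words, x/10x $sp-20 shows the 5 words just below sp followed by the 5 words starting at sp.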

You may find the GDB lab (in your course repo) useful as a refresher. For more information on the x GDB command, refer to: https://sourceware.org/gdb/onlinedocs/gdb/Memory.html.

Epilogue

We're here to help. Take advantage of our office hours if you are stuck.

For an entertaining (and a somewhat dated) read on buffer overflow attacks, check out:

Aleph One. Smashing the Stack for Fun and Profit. Phrack Magazine, 7(49), November 1996.
http://www.phrack.org/issues.html?issue=49&id=14

And finally, to reiterate: a friendly hacking challenge can be fun, and hacking skills are invaluable for working with real systems. But you must be responsible for your own behavior. We are not giving you free rein to launch attacks on CMS, fellow students' machines, or anything else. Such behavior is unethical and most likely illegal as well.

FAQ

ECALLS and Other Instructions

You may see ECALL and other RISC-V instructions in the object dump. ECALL is an assembly instruction used to make a system call to the OS. You can refer to the RISC-V manual for further explanation of instructions, but don't worry too much about understanding every instruction.

You need the newlines!

Yes, you need the newlines both before and after the "LOL 0wn3d!" message. Of course getting the message in the first place is worth the most points, but the newlines will get you those final few points.

So, an exploit that looks like this:

$ python simulate.py browser < pht24-soln
Where to connect?
LOL 0wn3d! pht24 is on facebook.ru!

... is preferable to an exploit that looks like this:

$ python simulate.py browser < pht24-bad
Where to connect?  LOL 0wn3d! pht24 is on facebook.ru!

As you may have discovered, you can't simply embed a newline or carriage return in the message, because the browser stops reading when it encounters these characters. Something more clever is called for.
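
Because of this, it is worth scanning your payload for these bytes before feeding it to browser. The sketch below only checks for the two characters named above; whether any other byte values cause trouble is something you would need to determine from the binary. find_stop_bytes is an illustrative name.

```python
STOP_BYTES = {0x0a: "newline", 0x0d: "carriage return"}

def find_stop_bytes(payload):
    """Return (offset, name) for every byte that would end the browser's read early."""
    return [(i, STOP_BYTES[b]) for i, b in enumerate(payload) if b in STOP_BYTES]

sample = bytes.fromhex("6877333432200a00010d")  # made-up bytes, for demonstration only
print(find_stop_bytes(sample))  # [(6, 'newline'), (9, 'carriage return')]
```

Remember that such bytes can also hide inside encoded instructions or addresses, not just in the message text.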

Aha! I found this handy vertical tab (0x0b) character! I can just use that instead of a newline, right?

No, a vertical tab is not a newline. You must embed a newline into the message.

Why does calling printf in my exploit print garbage?

Because of the nature of the exploit, we may end up ruining the value of the stack pointer. We need to set our sp and fp to be valid stack values so that function calls still work nicely.

Why does it fail to connect to my program when I run gdb and try to connect to localhost?

This might be because you are not using tmux to open different sessions/screens. Without tmux, if you ssh twice, you might be placed on a different machine the second time, in which case connecting to localhost won't work. You might also land on the same machine, but that's just luck.

Why are some instructions only 16 bits wide?

Some instructions in the browser instruction set are compressed. This shouldn't affect your solution.
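
If you are curious how the two sizes are told apart: in the standard RISC-V encoding, the two lowest bits of an instruction are 11 for a 32-bit instruction and anything else for a 16-bit compressed one. The sketch below applies that rule to the three encodings from the malloc listing; instruction_length is an illustrative name, and the rarely used formats longer than 32 bits are ignored.

```python
def instruction_length(first_halfword):
    """Return the instruction size in bytes, given its lowest 16 bits."""
    return 4 if (first_halfword & 0b11) == 0b11 else 2

for encoding in (0x85aa, 0x1cc1a503, 0xa029):
    print(hex(encoding), instruction_length(encoding & 0xffff))
```

This agrees with the listing: mv and j appear with 4 hex digits, lw with 8.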

Extra Credit

There are ways to make your program resistant to changes in stack layout. These clever exploits work when the stack starts in some small region, instead of only working for one fixed location. If you implement such an exploit, feel free to brag about it in your documentation for extra credit!

Finally, there is a way to make your program work with any arbitrary stack layout. We'll leave this one for the adventurous. If you find this exploit, again, specify clearly in your documentation what we need to do to see this awesome exploit in action, and you will be awarded more extra credit.