CS3410 Spring 2014 Group Project 4

Reminder: you must work in a group of two for this project. You don't need to have the same partner as the earlier projects.

Note: There are exactly enough demo slots so that every pair of students has a slot. You may not work alone on this project without express permission of Profs. Bala and Weatherspoon.

In this project you will write a network honeypot -- a system that receives packets from of a network device, analyzes and classifies those packets, and tracks various statistics over time¹. Your system will have a simple interactive user interface to allow the statistics to be displayed to the user, and way to control the system over the network.

Your software will run on a simulated MIPS multi-core machine with simulated I/O devices, including a simulated network card, keyboard, and display console. Your code will run in kernel-mode at all times (indeed, user mode is not even simulated), but you will still need to deal with many of the complexities of a complete operating system: writing a device driver, handling interrupts from devices or polling those devices for events, scheduling and synchronizing work across multiple cores or multiple threads, etc. We have modified the MIPS simulator to implement kernel-mode, traps (both interrupts and exceptions), and the few new instructions that you will need for these tasks. We also provide you with a very basic kernel with some of the mundane housekeeping tasks completed: simple boot-time initialization, a keyboard input driver, a console output driver, simple memory management including malloc and free, and a printf implementation. The drivers all run on core 0, since the simulator routes all device interrupts to this core; and other cores currently do nothing except print greetings. You are free to modify this code in any way you see fit.

Primary Objective

This assignment is meant to test your understanding of the hardware / software interface, including some of the real-world intricacies of writing robust concurrent programs. In concrete terms, there are three main tasks for you to complete:

Write a device driver for the simulated network card;
Design and implement one or more concurrency-safe datastructures; and
Implement the packet analysis, statistics gathering, and user interface parts of the honeypot.

However, none of these three tasks is meant to be challenging in isolation. Instead, it is the combination of these parts that will determine how your system will perform. In the overall design of your system you have the opportunity to be creative, and this is where we want you to spend the bulk of your time. In other words, the primary objective of this assignment is for you to experiment with various different ways of organizing a concurrent system to maximize overall performance.

Trivial approach: Whenever a packet arrives, core 0 receives an interrupt. A very simple and correct implementation could do all of the packet analysis and statistics gathering inside core 0's network interrupt handler: each time a network interrupt happens, the interrupt handler analyzes the packet, records statistics in some datastructure, then returns. This approach is unlikely to perform well. First, it prevents you from taking advantage of multiple cores. And second, it will cause you to miss packets during bursts of network activity, since the network device can only buffer a few packets at a time. Some variants of this trivial approach include: handling multiple packets per interrupt, to eliminate most of the overhead of interrupt handling; or using polling instead of interrupts.

Buffering approach: One might instead have the network interrupt handler simply queue received packets for later analysis. This allows large bursts of packets to be buffered in RAM. The main (non-interrupt) thread can process the packets whenever there is time to do so, i.e. in between the network interrupt handler's activity during traffic lulls. Of course, if the main thread can't keep up with the load, eventually either it or the network driver will need to "shed" the excess load by dropping packets. The queue datastructure will need to be protected by synchronization primitives so that the main thread and the interrupt handler can access it concurrently.

Pipelined approach: If the work of the main thread is pipelined into two or more stages, then different pipeline stages can be assigned to different cores. For instance, core 0 might be dedicated to receiving packets, core 1 to performing packet analysis, and core 2 to recording statistics and handling user interface requests. Packets would then move along a pipeline from core 0, to core 1, to core 2. The various queues or "pipeline register" datastructures between these stages will need to be protected by synchronization primitives so that the different cores can access them concurrently.

Parallel approach: If the work of the main thread can be partitioned into independent parallel tasks, then different cores can work in parallel. For instance, we might assign each core some fraction of the received packets, with all cores performing the complete packet analysis on its assigned packets. Some additional synchronization would be needed to combine the statistics results from all of these parallel cores into a single report for the user.

Your approach: There are many variants on and combinations of the above ideas, and certainly more ideas that we haven't mentioned or thought of. Some require only simple synchronization primitives -- mutexes, for example -- while others might need a variety of other primitives -- reader/writer locks, atomic increment, atomic compare and exchange. Moreover, it will take some analysis, estimation, and experimentation to come up with a reasonable design. Here is a small sampling of the kinds of questions you should be asking yourself (and you should answer for us in your demo):

How many cycles does an interrupt cost?
How many cycles to move packets from network device to an in-RAM queue, or from one core to another?
How many cycles of analysis are needed per packet, or per received byte?
How many cycles to update the statistics after the analysis is finished?
How long are the important locks going to be held?
How do we partition the work?
Will adding more cores help or hurt?
Can some locks be eliminated?
Can lock contention be reduced by restructuring things, or using different locking primitives?
Are there common cases that can be optimized?

Grading

Performance: The overall performance of your system will be one factor in your grade: how much load can your system can handle without dropping packets (dropping packets occasionally is unavoidable; but typically the packet drops rate will rise dramatically as soon as the system becomes overloaded).

Design and Rationale: An equally important factor in your grade will be the design of your ystem, and more importantly, the process you used to arrive at this design. Did you try out different designs? How did you decide between various tradeoffs? Can you explain the performance you achieve, or is it just dumb luck? Can you explain where the bottlenecks in your system are (even if you don't have the time or knowledge to try and reduce those bottlenecks)?

Experimental Results: Finally, we expect to see quantitative arguments to support your claimed performance and your design choices. For instance, to support your claimed performance, you might produce a graph of packet drops versus offered load. To justify your choice of number of cores, you could produce a graph of achieved performance while varying the number of cores. And to justify a particular division of work among the cores, you should provide estimates (or measurements) of how many cycles per packet (or per byte) each core needs to do, and then use this data to show that the work is evenly divided or that there is some particular bottleneck preventing you from doing better.

The Network Device Driver

You will need to implement a device driver for the simulated network card. Access to the network card is done via memory-mapped I/O (for commands and status) and DMA (for packet data). The memory-mapped I/O region matches the following layout:

    struct network_dev {
      unsigned int cmd; // (offset 0) command register
      unsigned int data; // (offset 4) data register
      unsigned int rx_base; // (offset 8) receive ring physical address register
      unsigned int rx_capacity; // (offset 12) receive ring buffer capacity register
      unsigned int rx_head; // (offset 16) receive ring head index register
      unsigned int rx_tail; // (offset 20) receive ring tail index register 
    }

Commands. The cmd and data registers are used together to configure the device (e.g. to turn on the card, or to enable interrupts or DMA), and to check the status of the device (e.g. to find out how many packets have been dropped by the network device). A driver first writes a command into the cmd register, than either reads or writes the data register, depending on the command. For instance, to get a count of packets dropped so far, a driver first writes NET_GET_DROPCOUNT into cmd, the reads the result from data. And to turn on or off DMA packet reception, a driver writes NET_SET_RECEIVE into the command register, then writes a 1 or 0 into the data register. The complete list of commands is given in the header files we provide.

Receive Ring. Packet reception uses a dedicated "receive ring buffer". Before turning on DMA for packet reception, the driver must allocate and initialize this ring, and let the network device know where the ring lives in memory. The driver allocates the ring in RAM, and configures the device by writing to the rx_base (to specify the physical address of the ring in RAM) and rx_capacity (to specify how many slots are in the ring).

Ring Slots. Each ring slot is 8 bytes, and contains a physical address of an empty memory buffer, followed by the length of that buffer. The maximum packet length for the network device is 4000 bytes, well under a single page, so buffers should be at least this large. The ring should be initialized so that every slot initially contains a maximum sized but empty buffer. The network card puts network packet data into the buffer at index rx_head, and the driver takes network packet data from the buffer at rx_tail. So the rx_head and rx_tail indexes specify the range of slots that contain network data, with the card adding newly received packets to the head, and the driver removing received packets from the tail. The header files we give you detail exactly how the rx_head and rx_tail indexes are used. Please note that tx_tail reflects a hardware counter and should only be incremented if the associated packet is to be processed without being dropped.

Interrupts. Core 0 will normally trap into the interrupt handler whenever the ring is non-empty, that is, whenever rx_head != rx_tail. Several conditions have to hold for such interrupts to be received by core 0:

The network card must be powered on (see the NET_SET_POWER command);
The network card must have interrupts turned on (see the NET_SET_INTERRUPTS command);
The network card must have reception turned on (see the NET_SET_RECEIVE command);
Core 0 must have interrupts enabled (see the IE bit of the Status register);
Core 0 must not currently be running the interrupt handler (see the EXL bit of the Status register);
Core 0 must be configured to receive network device interrupts (see the INTR_NETWORK and IM fields of the Status register).

If all of these conditions hold, your driver's interrupt handler will run every time there are packets in the receive ring. Your driver should, at such times, remove one (or more) received packets from the tail of the receive ring, replace them with empty buffers, then increment the rx_tail index. The empty buffers used by the driver can be recycled from previously allocated packets that are no longer needed, or can be allocated as needed (so long as the allocation is concurrency-safe, of course).

Buffering. The receive ring allows for a small amount of buffering to be done by the network card hardware. The ring can have up to 16 slots, so the hardware can receive up to 16 back-to-back packets without dropping, even if the driver is busy or ignoring interrupts. Any more buffering than this will need to be done in software by your driver or some other code you write. Ideally, the interrupt handler for the network card does the absolute minimal amount of work possible: remove one (or more) full packets from the receive ring, put them on some other larger list, and replace the vacated slots on the ring with full-sized empty buffers.

Concurrency-safe Datastructures

Typically, a driver's interrupt handler puts packets onto a shared list for a non-interrupt thread to later process. Similarly, packets that are finished need to be recycled in some way: either passing them back to the driver via some shared list, or freeing them so the driver can allocate them, etc. At minimum, these shared datastructures needs to be safe to access from both within and outside of interrupt handlers. This can be as simple as disabling interrupts whenever the datastructure state is accessed. But for multi-core operation, you will need to ensure that multiple cores can safely access the datastructure by using atomic locking primitives, possibly in conjunction with interrupt disabling to deal with the interrupt handler. Alternatively, you can use multiple different datastructures for different purposes, e.g. one datastructure that is interrupt-safe for access only within core 0, and one or more separate data structures that are safe for access by other cores.

Besides the interrupt-disable and LL/SC-based MIPS mutexes discussed above and in lecture, you may find it useful to explore some other concurrency primitives: reader/writer locks, atomic test and set, atomic increment, atomic compare and exchange, semaphores, etc. The usefulness of these varies with the purpose: e.g. protecting a single shared integer counter can be done with atomic increment, but mutexes may be more appropriate for protecting a shared queue.

Honeypot Analysis and Statistics

The honeypot's real functionality consists of (a) analyzing packets received and updating statistics about the results, and (b) listening for and acting on user interface commands. There are four statistics of interest your honeypot should track. They are all very similar, and none require a lot of code to implement, but the cost of calculating each statistic varies. The packets you analyze have a very simple format²: 28 bytes of header information (a 4 byte "source address", a 2 byte "source port", a 2 byte "destination port", a "length" field, etc.), followed by zero or more payload bytes. Mostly, your software can ignore the meaning of the packet headers and payloads.

Global Stats. Keep track of how many packets arrive, how many bytes arrive, and the rate of these over time (i.e. packets per second, bits per second, etc.).

Known Spammer Stats. Given a list of known spammer "source addresses", keep track of how many packets arrive from each (i.e. when the "source address" in the packet matches an address on the list).

Known Vulnerable Ports. Given a list of known "destination ports", keep track of how many packets arrive for each (i.e. when the "destination port" in the packet matches an port number on the list).

Known Evil Packets. Given a list of hash "fingerprint" values, keep track of how many matching packets arrive with that hash "fingerprint". We will use the "djb2" hash function, which is trivial to implement in about 3 lines of C code. The djb2 function, when run over all the bytes in a packet, produces a 4 byte "fingerprint", which can then be compared with the numbers on the list.

For all the above cases, you can average over the whole time your system runs, or over a smaller window (e.g. the last 10 seconds).

User interface. Your system should have some kind of interactive user interface for use in the demo. It can be as simple as just printing the current statistics to the console whenever a certain key is pressed on the keyboard.

Network command interface. Certain packets that arrive will be command packets. For instance, one type of command packet contains a new spammer "source address", which you should add to your list of known spammers, and another type of command packet contains a spammer "source address" to be deleted from your list of known spammers. Similar add and remove command packets exist for the other lists as well. This is the mechanism we will use to populate the lists. Lastly, there will be a command packet that causes your honeypot to print out its statistics. The format of the command packets is given in the simulator documentation.

Notes

Getting Started. In the directory /courses/cs3410/pa4 you will find sources for a skeleton kernel, and documentation about the simulated MIPS environment. You should copy this directory to your home directory so that you can modify these files.

The modified MIPS simulator is available at /courses/cs3410/pa4-sim/ksimulate on the CSUG servers. To use this simulator on the CSUG servers, add /courses/cs3410/pa4-sim to your PATH as you have done in previous assignments. As a reminder, this typically means adding the following line to your ~/.bashrc:

export PATH=${PATH}:/courses/cs3410/pa4-sim

Remember, after adding this line, you need to restart bash for it to take effect, so either log out and log back in, or run the command exec bash. After you have done this, you will be able to simply type ksimulate to invoke the simulator. Instructions on how to set up your own machine can be found on the environment page.

Simulator options. Running ksimulate without any arguments will print out a list of command-line arguments that the simulator accepts. We list the most important of these here:

-t <n> will set the network traffic rate to n megabits per second.
-c <n> will run the kernel with a n-core machine.
-m <n> will allocate n MB of physical memory. You are constrained to use no more than 12 MB of memory for your honeypot. Read up on Bufferbloat to understand why having larger buffers is not a solution to all your problems.
-s <n> will print out status messages every n cycles.

There are other useful options as well; simply run ksimulate and take a look.

Simulator source code.For those interested, the source for the simulator will also be available in /courses/cs3410/pa4-sim/src. It is not necessary to do anything with the pa4-sim/src directory, but it is provided for those who wish to compile the simulator themselves (e.g. to run the simulator on a local machine).

Borrowing Code. It is not necessary to write your own implementation of every function you will need. There are numerous free, public domain and/or open source implementations that can be easily adapted and/or used directly for this project, e.g. for the DJB2 hash function. Remember to cite your sources. Similarly, you can find plenty acceptable MIPS implementations of spinlocks, reader-writer locks, atomic test-and-set, and other concurrency-safe examples. You may adapt these for your own use (being sure to cite your sources).

There are four caveats you must be aware of when borrowing other code:

You may not borrow from other students in this class, or ask others to do your work for you. Borrowing from existing Linux code is fine. Asking Linus to write your project for you is not.
You must cite your sources. Anything else will be considered plagiarism.
Ultimately, you are responsible for choosing the right primitives, making sure the implementations are correct (either by writing it yourself or auditing code you find elsewhere), and applying them correctly. As such, you can expect us to ask questions about what you did, why you did it, and how it works. If you borrow some primitive from Linux, be prepared to explain to us exactly how it works, from the MIPS assembly on up.
Writing your own code from scratch is often not the best use of your time, especially for subtle concurrency-related or hashing code.

Design Document

Submit your design document via CMS by Wednesday, May 7.
Schedule and meet with your TA to present your intended design.

Live Demonstration

You and your project parter will have 10 minutes of time with some of the course staff in which to show off, explain, and answer questions about your project. The demos will be held on Tuesday and Wednesday, May 13 and 14, in the CSUG lab (we have reserved several machines for this purpose, or you can bring laptops). Earlier time slots can be created by appointment. Arrive early to get set up; you have very little time in which to impress the staff with your work.

Use CMS to schedule a Demo time slot on Tuesday or Wednesday, May 13 and 14.

Run your code: In your 10 minute demo, you should show us your system in action, in whatever way you think makes it look best. For instance, you might simply show it running under one or two different configurations, or show off your user interface. Note: there is no advantage in trying to hide flaws, bugs, or incomplete work. Honesty will be rewarded here: if you have not completed part of the project, or if your system crashes under some conditions, then it is best to simply say so. We expect minor bugs or other such problems to be more common than not. You are better off exploring more of the design space than fixing every last minor incompleteness or bug.

Explain your project: You should bring to the demo a small number of slides (say 4 to 6 PowerPoint slides, or equivalent). Use these to explain and illustrate your project to the course staff during the demo. The slides should show, at minimum:

The overall organization (how you used multiple cores, etc.) of your system;
Where and what kinds of synchronization primitives were used;
Quantitative data showing the performance of your system and justifying your design choices, e.g. graphs of simulated throughput versus handled throughput, number of cores versus maximum throughput.

Have the code available: It is likely that the course staff will ask you about how you implemented specific datastructures or synchronization primitives, and may ask to see your code for some of them. Please have your code available at the demo.

What to submit

By 11:59 pm, May 3, 2014

Schedule a design doc meeting on CMS.

By 11:59 pm, May 7, 2014

Submit your design document.
Meet with a TA to present your intended design and receive feedback.

By 11:59 pm, May 10, 2014

Schedule a final presentation on CMS.

By 6:30 pm, May 14, 2014

Meet with the course staff to present your implementation.

By 4:30 pm, May 15, 2014

Submit all of your code.
Submit an improvements_since_demo.txt text file, if any. Also, if you did make improvements, mark your submission as a regrade request.
Submit your presentation slides. Your slides serve as documentation, so they should contain enough extra text to explain what is being shown in each slide. This might be a few extra slides explaining what you said, or should have said, aloud during the demo. Alternatively, you can use the "notes" section of the slides for the same purpose.

Footnotes

(1) A network firewall typically receives packets from one network device, classifies those packets as "good" or "bad", and forwards only the "good" ones to a second network device. A network intrusion detection system typically receives and classifies packets, but raises an alarm whenever a "bad" packet is detected. A honeypot is similar, but typically only gathers statistics for later examination. Real-world honeypots might have a lure of some sort to invite attacks. So perhaps more accurate term for the completely passive approach we are taking in this project is a network telescope or, better yet, an Internet Black Hole.

(2) The packets are actually formatted as UDP datagrams.

Final Group Project 4 - Multi-Core Network Honeypot