CS4411 Programming Assignment 2: Sparse Petabyte File System

Due March 31.

This is the first half of the semester project, in which you will develop a basic but complete implementation of your Sparse Petabyte File System (SPBFS). This phase has several steps, for which I provide nominal due dates—you should consider these due dates genuine, because they keep you on track. March 31, however, is the date when you are expected to submit your working implementation up to the requirements described below.

When submission is closed, I will release a working reference implementation. You may then proceed with the rest of the semester project based on your own code or the reference implementation, whichever you find easiest. Obviously, you will understand your own code the best, but might find it easier not to have to worry about your own bugs. You must turn in a complete original submission, no matter how buggy, to be allowed to switch to the reference implementation.

Regarding threading: FUSE can operate in two modes. In multi-threaded mode, the default, FUSE will spawn a pool of worker threads to handle incoming requests. It will do this totally transparently, on your behalf; the only thing you need to know is that your functions must be re-entrant. In order to avoid this concern for the first half of the project, be sure to run your server in the other, single-threaded mode. You do this by passing your server the -s option. For instance, instead of mounting your server with ./server mnt, run ./server -s mnt.

Announcements

March 31.: The due date has been "extended" to allow late submission until April 7. On-time submissions will receive extra credit.
February 10.: As mentioned in class, the original due date was in the middle of Spring Break. In order to fix that, and also due to reactions in class and the need for lectures on C memory allocation, debugging, index structures, et cetera, I have corrected (extended) the deadline and internal checkpoints. Take note of the final values.; I will provide a test program soon that will exercise the features of the SPBFS. You should look into the testing program, play with it, and possibly write your own, but your responsibility is first to the filesystem itself.; The project has been released; get started as soon as possible so you will be sure to be on track!

Requirements

Your server must:

Provide complete filesystem semantics, including multiple, hierarchical directories, creation and deletion of files and directories, and accurate statistics such as file sizes and hard link counts. You may ignore creating hard links, keeping track of modification times, managing permissions (directories can all be mode 0755 and files can all be mode 0644), et cetera.
Use some sort of tree-like in-memory indexing structure internally to support sparse petabyte-sized files. You need not back the structure on disk anywhere, it can be entirely in-memory for this portion of the project. You need not worry about synchronization or possible race conditions, because you will be running your server in single-threaded mode.

Recommendations

Although the system is due completely on March 31, I suggest breaking it down into the following checkpoints. I will check up informally in class how many people are on track for this plan.

February 24.: Create a filesystem with only one file (called, for example, hello) that can be written at arbitrary positions (which might change the size of the file) as well as read at arbitrary positions (remember that uninitialized parts of a file contain zero bytes). Keep the file in memory as a simple buffer, which you resize using realloc, initialize empty portions eagerly with memset, and do writes and reads with memcpy.
March 10.: Change your malloc-based code to use a tree-like indexing structure, still in memory, to support extremely large but sparse files. Test that your basic filesystem semantics have not been broken, including file sizes, and that your system does indeed support very large files so long as they contain relatively few non-zero initialized bytes.
March 24.: Add multiple files (file creation and deletion). Ensure that their contents are kept straight and their statistics such as file size are correct. Then, add directories, allowing them to be created, removed, and correctly listed. Ensure that the resulting file system behaves like a correct but simple system, with correct hard link counts, the ability to move files around, et cetera.
March 31.: The actual deadline. March 24 is the day after you come back from Spring Break; this is a week's buffer to finish up whatever is left over, so that you do not need to work yourself to death during break.

You might want to take these in a different order. Specifically, you should start with the goals that are likeliest to give you trouble or fail, so that you can get help early. Thus, if you are most worried about the indexing structure (which is, indeed, the logically interesting portion of the assignment), you might want to get to it before worrying about adding directories; if dealing with FUSE and the complexities of filesystem structure worry you more, you should start there. Order them in such a way that the greatest risk is resolved earliest, while leaving a system such that the rest can be appended easily.

Do not concern yourself with efficiency. The point of this phase of the semester-long project is to get a complete, correct, stable implementation. The entire second half of the project will be devoted to tweaking this project with multithreading, synchronization, more advanced indexing, profiling, and optimization. Do not bother with any clever tree rebalancing, resource reclamation, or performance tricks. If you do, not only will this portion of the project be harder for you, you might start down a wrong path that will hinder you in the second half. Correctness comes first, then profiling to understand bottlenecks, and only then comes optimization. Only the first part concerns you so far.

Remember to run your code single-threaded, with the -s option when running your server. Otherwise, you might encounter race conditions when manipulating your indexing structure that will confuse and derail you indefinitely.

Although this project frequently makes reference to petabytes, you do not actually need to worry about an upper limit on the size of the files. The offsets you are passed in to work with are of type off_t, which is the largest size the system itself can handle; make sure that you can support this entire range. On a 64-bit machine, you will indeed support petabytes, but you do not need to get caught up in the details of the actual upper limit on any particular system.