Lecture 18: Files

Filesystem API


Files are an abstraction that the OS provides to applications for persistent storage of large amounts of data (typically on a disk).

The most common interface presents a file as an unstructured array of bytes, optimized for sequential reading or writing. However, some filesystems provide a sequence-of-records style of interface so that applications can store, retrieve, and query structured data.

Most filesystems also store metadata for each file, such as the length of the file, the file permissions (which users may read, write, or execute the file?), various timestamps (when was the file created, read, or modified?), and tags (user-specified hints that make it easier to organize and retrieve files).
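As a small illustration of this metadata, the sketch below asks the OS for the length, permissions, and modification time of a scratch file using Python's os.stat. The scratch file and its contents are made up for the example, and the exact permission string depends on the platform.

```python
import os
import stat
import tempfile
import time

# Create a scratch file so there is something to inspect (contents are arbitrary).
fd, path = tempfile.mkstemp()
os.write(fd, b"hello")
os.close(fd)

info = os.stat(path)                   # ask the filesystem for the file's metadata
size = info.st_size                    # length of the file in bytes
mode = stat.filemode(info.st_mode)     # permissions in "ls -l" style
modified = time.ctime(info.st_mtime)   # last-modified timestamp, human readable

print(size, mode, modified)
os.remove(path)
```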

Naming and directories

Filesystems also need some kind of naming scheme, so that persistent data can be located after it is stored. Some filesystems (especially distributed file systems like the Google File System or Amazon's S3) provide a flat naming system, in which each file has a unique name.

However, most filesystems provide a directory structure: a directory is a collection of files and other directories, so that the filesystem has a tree structure. Names of files and directories need only be unique within a directory. Files can be identified using an absolute path: for example, the UNIX path "/home/mdgeorge/4410/grades.txt" refers to a file called "grades.txt" which is contained in the "4410" directory, which is itself contained in the "mdgeorge" directory, which in turn is contained in the "home" directory, which is in turn contained in the global "root directory" (simply called "/").

File names in a directory structure can also be specified using a relative path if a particular parent directory is clear from context. For example, if my "working directory" is "/home/mdgeorge", then I can refer to the file "4410/grades.txt" instead of the more verbose "/home/mdgeorge/4410/grades.txt".
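The relationship between relative and absolute paths can be sketched in a few lines. The helper name resolve is hypothetical; it relies on Python's posixpath module, which implements UNIX-style path rules regardless of the host OS.

```python
import posixpath

def resolve(working_dir, path):
    """Return the absolute path that `path` names, given a working directory."""
    # posixpath.join ignores working_dir when `path` is already absolute.
    return posixpath.normpath(posixpath.join(working_dir, path))

relative = resolve("/home/mdgeorge", "4410/grades.txt")
absolute = resolve("/home/mdgeorge", "/home/mdgeorge/4410/grades.txt")
print(relative)
print(absolute)
```

Both calls name the same file: the relative path is interpreted starting from the working directory, while the absolute path is interpreted starting from the root.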

Some filesystems provide the ability to give the same file two different paths, effectively turning the tree-structured filesystem into a graph-structured filesystem. There are two techniques for this:

Virtual filesystems

Many operating systems provide support for multiple distinct filesystems using the same file API. In UNIX, all filesystems are accessed using a single global directory structure. New filesystems are "mounted" at particular paths, and the operating system maintains a table mapping paths to filesystem drivers.

When an application accesses a file within a mounted path, the request is handled by the appropriate driver. For example, I may have a Windows filesystem stored on a USB key, containing the files "/happycat.jpg" and "/grumpycat.jpg". I can mount the USB key filesystem to the directory "/media/usb_key" of my laptop's filesystem. When I try to display "/media/usb_key/happycat.jpg", the OS will realize that the folder "/media/usb_key" should be handled by the USB and Windows file system drivers, and forward the request as appropriate.
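The mount table lookup described above can be sketched as a longest-prefix match. This is only a toy model: the function find_mount, the dictionary layout, and the driver names are all hypothetical, and a real kernel resolves mount points during path traversal rather than by string comparison.

```python
def find_mount(mount_table, path):
    """Return (mount_point, driver, path_inside_that_filesystem)."""
    best = None
    for mount_point in mount_table:
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            # Prefer the most specific (longest) mount point that matches.
            if best is None or len(mount_point) > len(best):
                best = mount_point
    if best is None:
        raise FileNotFoundError(path)
    # Strip the mount point to get the path as the mounted filesystem sees it.
    inner = "/" + path[len(best):].lstrip("/")
    return best, mount_table[best], inner

# Hypothetical mount table: paths mapped to made-up driver names.
mount_table = {
    "/": "laptop_disk_driver",
    "/media/usb_key": "usb_windows_driver",
    "/proc": "procfs_driver",
}

usb = find_mount(mount_table, "/media/usb_key/happycat.jpg")
home = find_mount(mount_table, "/home/mdgeorge/4410/grades.txt")
print(usb)
print(home)
```

In this model the request for "/media/usb_key/happycat.jpg" is forwarded to the USB driver as "/happycat.jpg", which matches the name the file has on the USB key's own filesystem.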

This mechanism makes it easy to implement virtual filesystems: filesystems that are not backed by a disk at all. We have already seen the "/proc" filesystem, which provides an interface for traversing the set of process control blocks (PCBs). There are no actual files on a disk anywhere; the virtual files in /proc are generated on the fly by the procfs driver.

Virtual filesystems are also useful for handling network drives, temporary files stored in RAM, and even interfaces to some applications.

On UNIX, you can see the table of mounted file systems using the "mount" command with no arguments. The mount command can also be used to mount additional filesystems.

File API

Since files are conceptually large arrays of bytes, we might imagine an API that takes a file name, a position in the file, and some data, and writes the data into the file (or reads the data out of the file).

This would be inefficient for two reasons:

For this reason, most file APIs require you to open a file before you read or write to it, and maintain a position within the file from which reads or writes proceed.

The open system call typically takes in a file name, and a "mode": whether to open the file for reading, writing, appending, etc. It then

Subsequent calls to the read or write system calls only pass in the file descriptor. These operations will access the file starting at the current position in the file, and will advance the current position.

Typically file system APIs also provide a seek operation to move the reading/writing position to a different part of the file.
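The open, read, and seek operations can be tried directly through Python's os module, which exposes thin wrappers around the corresponding system calls. The scratch file and its contents below are made up for the example.

```python
import os
import tempfile

# Create a small scratch file to work with.
fd, path = tempfile.mkstemp()
os.write(fd, b"hello, filesystem")
os.close(fd)

fd = os.open(path, os.O_RDONLY)    # open for reading; returns a file descriptor
first = os.read(fd, 5)             # reads bytes 0-4 and advances the position to 5
second = os.read(fd, 2)            # continues from the current position (byte 5)
os.lseek(fd, 7, os.SEEK_SET)       # seek: move the position to byte 7
third = os.read(fd, 10)            # reads starting at byte 7
os.close(fd)
os.remove(path)

print(first, second, third)
```

Note that none of the read calls names the file or a position: the file descriptor identifies the open file, and the OS tracks the current position on the application's behalf.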

On UNIX systems that provide /proc (such as Linux), you can see the set of open file descriptors for a process in the fd directory of that process's /proc entry.

Implementing files

Files can have arbitrary sizes, but are stored in the sectors of disks, which have a fixed size (such as 512 bytes). For this reason, large files are broken into many blocks of data, each the same size as a sector. As an analogy, blocks are to sectors as pages are to frames (but these are very different concepts, don't confuse them!).

There are various strategies for connecting multiple blocks together into a file, described below:

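Contiguous allocation

Store an entire file in a run of contiguous blocks. This scheme suffers from external fragmentation. It can also prevent a file from growing once it has been allocated, because the blocks that follow it may already be in use (and appending to a file is a common operation).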

Linked list allocation

In addition to containing data, each block contains a pointer to the next block in the file. This makes seek very expensive, because all previous blocks of the file must be read to find any given block.

Storing pointers inside the blocks also prevents applications from using entire blocks to store data. For example, with 512-byte sectors, a 1 kB file would span three blocks instead of two, because the pointers in the first two blocks take up a few bytes each and push the last few bytes of data into a third block.
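The arithmetic behind both costs of linked list allocation can be checked with a short sketch. The 4-byte pointer size is an assumption made for the example (the notes only say "a few bytes"), and the function names are hypothetical.

```python
import math

SECTOR_SIZE = 512    # bytes per sector, as in the example above
POINTER_SIZE = 4     # assumed size of a next-block pointer

def blocks_needed(file_size, pointer_size):
    """Number of blocks a file occupies when each block reserves room for a pointer."""
    payload = SECTOR_SIZE - pointer_size    # bytes of file data per block
    return math.ceil(file_size / payload)

def reads_to_reach(offset, pointer_size):
    """Disk reads needed to reach a byte offset by following the chain from the start."""
    payload = SECTOR_SIZE - pointer_size
    return offset // payload + 1

without_pointers = blocks_needed(1024, 0)
with_pointers = blocks_needed(1024, POINTER_SIZE)
reads_for_1mb = reads_to_reach(1_000_000, POINTER_SIZE)
print(without_pointers, with_pointers, reads_for_1mb)
```

With these assumptions the 1 kB file grows from two blocks to three, and reaching a byte about 1 MB into a file takes roughly two thousand disk reads.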

File allocation table (FAT)

The file allocation table (FAT) design comes from the older MS-DOS filesystems. FAT is still often used as a "lowest common denominator" filesystem for interoperation between different operating systems.

A FAT filesystem takes the pointers from a linked list structure and puts them all into a single, large table (the FAT). The file allocation table contains one entry for each sector of the entire disk; the entry for a block contains the location of the next block in the same file.

Because the FAT holds only a few bytes per sector, it is reasonable to cache the FAT for a small disk entirely in memory. This makes seek cheap, because the entire linked list of a file can be traversed without any disk reads.
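A seek over an in-memory FAT can be sketched as follows. This is a toy model, not the on-disk FAT format: the table is a plain Python list, the end-of-file marker is a made-up value, and the tiny eight-block disk is invented for the example.

```python
END_OF_FILE = -1    # hypothetical marker for "no next block"
BLOCK_SIZE = 512

def block_for_offset(fat, first_block, offset):
    """Follow the chain in the in-memory table to the block holding `offset`."""
    block = first_block
    for _ in range(offset // BLOCK_SIZE):
        block = fat[block]    # a memory access, not a disk read
        if block == END_OF_FILE:
            raise ValueError("offset is past the end of the file")
    return block

# A made-up disk with 8 blocks; one file occupies blocks 2 -> 5 -> 3.
fat = [END_OF_FILE] * 8
fat[2] = 5
fat[5] = 3
fat[3] = END_OF_FILE

start = block_for_offset(fat, 2, 0)
middle = block_for_offset(fat, 2, 600)
last = block_for_offset(fat, 2, 1100)
print(start, middle, last)
```

The traversal is still linear in the number of blocks skipped, but every step touches only the cached table, so a single disk read (for the target block) is enough.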

This is a common pattern when working with disks: by making a data structure (a linked list in this case) more compact, we can cache it entirely in memory. On the timescales involved in disk accesses, operations that do not access the disk (such as traversing a large in-memory linked list) are basically free.

The primary downside of this design is that the size of the FAT scales with the size of the disk; a large disk (which all disks are these days) will have a prohibitively large FAT.
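To get a feel for how the table grows, the sketch below computes the FAT size for two disk sizes. The 4-byte entry size is an assumption for the example, and one entry per 512-byte sector follows the description above; real FAT variants differ in both respects.

```python
def fat_size_bytes(disk_bytes, sector_bytes=512, entry_bytes=4):
    """Size of a table holding one entry per sector (entry size is an assumption)."""
    return (disk_bytes // sector_bytes) * entry_bytes

small_disk = fat_size_bytes(32 * 2**20)    # a 32 MiB disk
large_disk = fat_size_bytes(2**40)         # a 1 TiB disk

print(small_disk // 2**10, "KiB")
print(large_disk // 2**30, "GiB")
```

Under these assumptions the table for a 32 MiB disk fits comfortably in memory, while the table for a 1 TiB disk would occupy several gigabytes of RAM.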

Indexed files (UNIX file system, inodes)

The UNIX File System (UFS) uses a data structure called an index node (inode), and is similar to the schemes used by most modern filesystems.

In UFS, each file has a single block called an inode that contains:

This scheme allows the beginnings of files to be accessed using only a small number of disk reads (one for the inode, one for the data), while also supporting large files using indirect blocks.
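The difference in cost between the beginning of a file and the rest of it can be sketched with a toy inode. Everything about the layout here is an assumption made for illustration: 12 direct block pointers, a single indirect block, 4-byte block addresses, 512-byte blocks, and the dictionary representation and block numbers themselves.

```python
BLOCK_SIZE = 512
NUM_DIRECT = 12                          # assumed number of direct pointers
POINTERS_PER_BLOCK = BLOCK_SIZE // 4     # assumes 4-byte block addresses

# A made-up disk: block 100 is an indirect block full of block addresses.
disk = {100: list(range(200, 200 + POINTERS_PER_BLOCK))}
inode = {"direct": list(range(10, 10 + NUM_DIRECT)), "indirect": 100}

def locate(inode, offset):
    """Return (data_block_address, extra_disk_reads) for a byte offset."""
    index = offset // BLOCK_SIZE
    if index < NUM_DIRECT:
        # The address is stored in the inode itself: no extra reads.
        return inode["direct"][index], 0
    index -= NUM_DIRECT
    if index < POINTERS_PER_BLOCK:
        # The address lives in the indirect block: one extra disk read.
        indirect_block = disk[inode["indirect"]]
        return indirect_block[index], 1
    raise ValueError("offset would need further levels of indirection (not modeled)")

near_start = locate(inode, 0)
past_direct = locate(inode, NUM_DIRECT * BLOCK_SIZE)
print(near_start, past_direct)
```

In this model, offsets within the first 12 blocks cost one read for the inode and one for the data, while later offsets cost one additional read for the indirect block.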

Directories in UFS

In the UNIX file system, directories are files: each directory has an inode and some number of data blocks.

The "data" of a directory contains a mapping from file names to inode addresses. This mapping could be stored as a hashtable, an array of fixed-sized entries (if the filesystem only supports fixed-size file names), a list of entries, or a sorted array or list. Because the data can span many blocks, UFS supports directories containing many files or files with very long names.

The downside of this design choice is that locating a file in a directory requires at least two disk reads, and potentially more if the directory contains many files.
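The cost of resolving a path through directories stored this way can be sketched with a toy inode table. The inode numbers, the dictionary representation, and the assumption that each directory's entries fit in exactly one data block are all made up for the example.

```python
ROOT_INODE = 0    # hypothetical inode number of the root directory "/"

# A made-up inode table: directories map names to inode numbers.
inodes = {
    0: {"type": "dir", "entries": {"home": 1}},
    1: {"type": "dir", "entries": {"mdgeorge": 2}},
    2: {"type": "dir", "entries": {"4410": 3}},
    3: {"type": "dir", "entries": {"grades.txt": 4}},
    4: {"type": "file"},
}

def lookup(path):
    """Return (inode_number, disk_reads) for an absolute path."""
    inode_number = ROOT_INODE
    disk_reads = 0
    for name in [part for part in path.split("/") if part]:
        directory = inodes[inode_number]
        if directory["type"] != "dir":
            raise NotADirectoryError(name)
        # One read for the directory's inode, one for its (single) data block.
        disk_reads += 2
        if name not in directory["entries"]:
            raise FileNotFoundError(path)
        inode_number = directory["entries"][name]
    return inode_number, disk_reads

result = lookup("/home/mdgeorge/4410/grades.txt")
print(result)
```

Under these assumptions, each directory on the path contributes two disk reads, so the cost grows with the depth of the path; caching recently used directories in memory avoids most of these reads in practice.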