A log-structured file system (LFS) takes a completely different approach to managing a filesystem.
The central idea behind LFS is that blocks are never modified in place; whenever an operation conceptually modifies a file, the operation instead appends a new block to the end of the log. Writes therefore always go to the current end of the log on disk.
For example, suppose I wished to change the first byte of a file. I would create a new copy of the direct block containing that byte, and place it at the end of the log. Since the address of that block has now changed, I would also create a copy of the inode for the file.
This might seem to require me to create a new copy of any directory containing that file (since the address of the inode has changed). However, this is difficult and expensive (consider hard links), so instead we add an extra layer of indirection. Directory entries in an LFS contain inode numbers for the files they contain, instead of the disk addresses of the inodes themselves.
These inode numbers are then looked up in a global inode map, which maps inode numbers to the current location of the inode.
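The copy-on-write path described above can be sketched with a toy model (all names here are hypothetical, not from any real LFS implementation). Note that changing one byte appends a new data block, a new inode, and a new inode-map entry, while the directory entry, which holds only the inode number, is untouched:

```python
log = []            # the on-disk log, modeled as an append-only list
inode_map = {}      # inode number -> log address of the current inode

def append(block):
    """Append a block to the log and return its 'disk address'."""
    log.append(block)
    return len(log) - 1

def create_file(inum, data):
    addr = append({"kind": "data", "bytes": bytearray(data)})
    inode_map[inum] = append({"kind": "inode", "direct": [addr]})

def write_byte(inum, offset, value):
    # Copy-on-write: build a new data block from the old one...
    inode = log[inode_map[inum]]
    new_data = bytearray(log[inode["direct"][0]]["bytes"])
    new_data[offset] = value
    new_addr = append({"kind": "data", "bytes": new_data})
    # ...then a new inode pointing at it. Only the inode map is
    # redirected; the old blocks remain intact in the log.
    inode_map[inum] = append({"kind": "inode", "direct": [new_addr]})

create_file(7, b"hello")
write_byte(7, 0, ord("H"))
```

After the write, the log holds both the old and the new versions of the data block and inode; the inode map is what makes the new version current.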
The disk is divided up into large segments. Each segment contains a large number (~1000) of data blocks and inodes, as well as the most recent copy of the inode map. To keep track of the current segment of the filesystem (i.e. the end of the log), a designated "superblock" at the beginning of the filesystem contains a reference to the most recently written segment.
Periodically (or when the current segment is full) the current segment is written to disk, and the head segment number in the superblock is updated to move to the next segment. This is referred to as a checkpoint; once the superblock has been written, the filesystem now reflects everything that happened before the checkpoint.
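A minimal sketch of the checkpoint/recovery cycle, under the simplifying assumption that a segment carries only a snapshot of the inode map (names are illustrative):

```python
import copy

segments = []                   # the log, as a list of on-disk segments
superblock = {"head": None}     # the one fixed-location structure on disk
inode_map = {}                  # in-memory inode number -> block address

def checkpoint():
    # Write out the current segment (here, just a snapshot of the inode
    # map), and only then update the superblock to point at it.
    segments.append({"inode_map": copy.deepcopy(inode_map)})
    superblock["head"] = len(segments) - 1

def recover():
    # After a crash, read the superblock, then the inode map from the
    # segment it names; everything before that checkpoint is recovered.
    return copy.deepcopy(segments[superblock["head"]]["inode_map"])

inode_map[1] = 100
checkpoint()
inode_map[1] = 200    # updated after the checkpoint, so not yet durable
```

Because the superblock is updated only after the segment is safely on disk, a crash at any point leaves the filesystem in a consistent pre- or post-checkpoint state.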
Note that without garbage collection / compaction (described below) the entire state of the file system at any checkpoint in the past can be recovered by simply changing the head reference to that checkpoint. A log-structured file system preserves history, which is a nice feature.
The downside of storing the entire history is that it can easily fill up the disk with old versions. To clean up the unused segments, the filesystem can periodically run compaction on the tail of the log.
To compact a segment, you examine each block (data block or inode) in the segment. By consulting the current inode map you can determine whether those blocks are the latest versions. If they are, you copy them to the head of the log (along with their inodes and the inode map entries), exactly as you would if you were overwriting them with new data. Once you have done this, you can safely reuse the segment, since all of the blocks stored in it are now obsolete.
In order to determine whether a block is stale, you need to know its identity. This is stored in an additional part of the segment called the segment table. The segment table contains an entry for each block in the segment, which identifies which file (inode number) the block is part of, and which part it is (e.g. "direct block 37", or "the inode").
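A toy sketch of segment cleaning using such a table (the layout and names here are hypothetical). A block is live only if following the current inode map still leads to that exact disk address:

```python
log = {}          # address -> block contents (the "disk")
inode_map = {}    # inode number -> address of the current inode
next_addr = [0]

def append(block):
    addr = next_addr[0]
    log[addr] = block
    next_addr[0] += 1
    return addr

def is_live(addr, entry):
    """Check one segment-table entry against the current inode map."""
    inum, role = entry                # role: "inode" or ("direct", k)
    if inum not in inode_map:
        return False                  # file was deleted
    inode_addr = inode_map[inum]
    if role == "inode":
        return inode_addr == addr
    return log[inode_addr]["direct"][role[1]] == addr

def compact(segment_table):
    """segment_table: {address: (inode_number, role)} for one segment."""
    for addr, entry in sorted(segment_table.items()):
        if is_live(addr, entry):
            inum, role = entry
            new_addr = append(log[addr])        # copy to head of log
            if role == "inode":
                inode_map[inum] = new_addr
            else:
                # moving a data block forces a fresh inode as well
                inode = dict(log[inode_map[inum]])
                inode["direct"] = list(inode["direct"])
                inode["direct"][role[1]] = new_addr
                inode_map[inum] = append(inode)
    for addr in segment_table:
        del log[addr]                 # the whole segment is free to reuse
```

Stale blocks are simply skipped: a data block whose file now points at a newer copy, or an inode superseded in the inode map, fails the `is_live` check and is discarded when the segment is freed.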
One drawback is that data can be lost if it has been written but not yet checkpointed. This can be mitigated by decreasing the time between checkpoints, or by allowing applications to wait until the next checkpoint before proceeding.
On the performance side, most reads are absorbed by the cache, and writes always append to the log, so they are sequential and very fast.
Blocks are located on disk in exactly (or almost exactly) the order in which they were last written. Even reads that miss the cache will have good locality if the order in which files are read mimics the order in which they were written.
LFS is also a good fit for flash memory (solid-state disks, or SSDs): each flash cell degrades with repeated writes, and LFS naturally levels out writes evenly across all segments.
SSDs also erase and rewrite flash in very large blocks; writing whole segments at a time fits these characteristics very well.