next up previous
Next: Supporting volume management Up: Who wants another filesystem? Previous: Introduction

A primer on Log Structured Filesystems

The concept of a Log Structured File System was first developed by Mendel Rosenblum 1and was extended by Margo Seltzer et.al. for the BSD operating system. logfs has been available on some flavours of BSD, but does not appear to be well supported.

A couple of effort have been reported which aimed to create a log structured filesystem for Linux, possibly the most notable being LinLogFS, previously known as dtfs2.

LinLogFS was hampered somewhat by being developed on a 2.2 kernel which would have made much of the implementation much more awkward than in a 2.6 kernel. It also embodied some design decisions that do not fit with my goal. Thirdly, development seems to have stopped nearly 3 years ago. For these reasons, LinLogFS was not considered as a starting point for LaFS.

To understand how a log structured filesystem works two basic ideas are needed. The first involves storing the entire filesystem as a tree, providing a single starting point for all addressing. The second involves dividing the storage device into large segments to which new data is written.

Representing a filesystem as a single tree is not an idea unique to log structure filesystems. reiserfs works this way as does TUX2. Indeed any Unix filesystem is largely a tree already, as there is a 'root' directory and all files and directories are attached to that single root by way of one or more intermediate levels in the tree.

In a traditional Unix filesystem, the parts of the filesystem that are not addressed as part of a tree are the inode table and the map of free blocks. Rather than having locations on the device described by higher levels in the tree, these have fixed locations on the device.

Taking such a filesystem, and storing the inode table and the free-block map in special files is a simple way to create a filesystem that is fully based on a tree and this is exactly what TUX2 does.

The result of having all the metadata in files is that the only datum that has a fixed location on the device is a pointer to the root of the tree. This root then points, possibly via indirect blocks, to the file containing all inodes. One of these inodes addresses a file that contains the free block bitmap. Another inode contains the root directory. Following this pattern, the entire filesystem can be found.

Once we have all addressing for the filesystem in a single tree we have a great deal of flexibility about what data gets store on which part of the device. As there are no fixed locations, anything can be stored anywhere on the device. This leads us to the second important aspect of a log structured filesystem -- the segments.

The device is conceptually divided into a number of segments, each a few megabytes in size. When data needs to be written to the filesystem, an unused segment is found, and all dirty data is written to that segment together with any indexing information needed to find the data. This could involve:

All this data is written contiguously into one or more segments, and finally the address of the root inode is written to some fixed location on the device. From that inode, the entire new state of the filesystem can be found.

This approach seems simple and elegant, until you run out of unused segments. At that point is suddenly seems horribly flawed. To get us out of this sticky situation, we have a service called a cleaner.

As files are created, updated, and deleted, some segments that at one time were full of useful data, will eventually become full of uninteresting, unreachable data, and maybe just a few blocks of live data. It is the task of the cleaner to identify those segments which have just a few blocks of interesting data, and to gather the interesting blocks from a number of segments together and relocate them to a new, previously clean, segment. Once this is done, all those old segments can be marked as 'clean' and can be re-used.

To help the cleaner in its task we store some extra metadata in the filesystem. These include:

The summary information is not really per-segment, but is per-write-cluster. A write-cluster is a header containing the summary plus a number of data and metadata blocks that are all written at once. A write cluster may span an entire segment, but in the face of synchronous writes there could be many write clusters per segment.

From the above description, it would appear that to update a single block, and to force that change safely to disk, would require writing the block, possibly an indirect block, the inode, an indirect block to locate the inode, and the root inode, and then recording the new location of the root inode in its fixed home. Performing lots of synchronous writes, as an NFS server typically does, would cause a substantial waste in write throughput.

However because of the summary information in each segment we do not need to do this. To flush a single block to storage, all that is needed is to write out the block together with some summary information that describes it. Each write cluster contains this summary information and also the address of the next write cluster, and some information for internal validation. This allows recently written write clusters to be found at filesystem mount time, and to be incorporated into the filesystem through a process called roll-forward. Thus committing a single write cluster is all that is needed to commit a block of data.

Typically several blocks will be written together even when doing synchronous writes, so the overhead will only be one block for the summary information, rather than several blocks for all indexing and inode information.

Some useful characteristics of a log structured filesystem as outlined above are:

Some other aspects of lafs that are not general to all Log structure filesystem, are not specific to any of the needs which will be discussed in the following sections, but are none-the-less interesting are:


next up previous
Next: Supporting volume management Up: Who wants another filesystem? Previous: Introduction
Neil Brown 2003-02-06