%%%
%deffont "standard" xfont "helvetica-medium-r", tfont "arial.ttf"
%deffont "thick" xfont "helvetica-bold-r", tfont "arialbd.ttf"
%deffont "typewriter" xfont "courier-medium-r", tfont "courbd.ttf"
%%
%default 1 prefix "NeilBrown - LCA - LaFS", vgap 100, right, size 2.5, fore "white", back "blue", font "thick"
%default 2 size 7, vgap 100, center, prefix "", fore "orange"
%default 3 leftfill, size 2, bar "yellow", vgap 10
%default 4 prefix 0, size 5, fore "yellow", vgap 30, prefix " ", font "standard"
%tab 1 font "standard", size 6, vgap 60, prefix " ", icon box "red" 50, left
%tab 4 font "standard", size 6, vgap 60, prefix " ", icon box "blue" 0, left
%tab xx size 7, vgap 60, prefix " ", icon box "red" 50, left
%tab email cont, font "typewriter"
%tab 2 font "standard", size 5, vgap 60, prefix " ", icon arc "orange" 50
%tab 3 font "standard", size 4, vgap 60, prefix " ", icon delta3 "white" 40
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
%pcache 1 1 2 15
%nodefault, back "blue", font "thick", center, size 7
Who wants another filesystem?
%size 4
%mark
%right
%newimage -xscrzoom 24 "images/cse.ppm"
%again
%mark
%left
%prefix 3
%newimage -xscrzoom 18 "images/crest.ppm"
%prefix 0
%again
%center
or
%size 5
%left
%center
An introduction to LaFS
%prefix 0
%center
%fore "cyan"
%size 7
Neil Brown
neilb@cse.unsw.edu.au
%size 5
linux.conf.au - Perth 2003
%page

What is this all about?

	Different Filesystem Priorities
	My Priorities for a File Server's file system
	Log Structuring for Filesystems.
	Volume management
	Supporting RAID
	Supporting NFS
	Supporting Backups
	Quotas
	Planning for Reliability
%page

Different Filesystems for Different Needs

	Ext3
		ext2 compatible layout
		e2fsck works
		Well understood technology
	Reiserfs
		The power of trees
		Efficient for small files
		Good at large directories
		Semantic Innovation
%page

Different Filesystems for Different Needs - 2

	XFS
		High throughput for multimedia
	JFS
		General all-round good performer.
	Others
		FAT, cramfs, jffs, efs, vxfs, hfs, isofs, ntfs ....
		Compatibility, or special purpose.
	LaFS
		Departmental File Server
%page

What makes a Departmental Fileserver?

	A few hundred gigabytes of storage
	Commodity hardware
	Good random performance
	Good failure modes
	Easy management
%page

How to Build a Departmental Fileserver

	Use a log structured filesystem
	Incorporate volume management
	Optimise for RAID and NFS
	Implement efficient backups
	Make quotas work well
%page

Log Structuring

	Idea from 10 years ago
		Goal of faster write throughput
		Implemented in Sprite OS
		LFS in BSD
		Speed results unconvincing.
	Two deviations from conventional FFS design
		Flexible layout - one big tree
		Free space management - a reusable log
%page

Tree Based Layout

	No fixed locations
		Well, maybe one for the root
	Inodes
		Stored in a file (the ifile)
		Not in a fixed table
	Free block table (or equivalent)
		Also stored in a file
%page

Tree Based Layout - 2

	Inode of ifile can be easily found.
	Store a pointer to that root inode in superblock.
	Store several copies of superblock with version numbers
	This allows any part of the filesystem to be stored
%prefix 10
anywhere on disc
%prefix 0
	Just this change is essentially what Daniel Phillips
%prefix 10
did for TUX2
%page

Free Space treated like a log

	Divide device into Segments (a few megabytes each)
	Each segment is either in-use, or free.
	We write linearly to the 'current' segment
		data, indirect, inode blocks, whatever.
	Write a 'cluster' at a time.
	When current segment is full, find a free segment.
	If time-to-write dwarfs time-to-seek,
%prefix 10
get good throughput.
%prefix 0
	What if we run out of clean segments?
%page

Cleaning the Log

	As files are deleted or overwritten
%prefix 10
blocks in active segments die.
%prefix 0
	If all blocks in a segment die, it is clean,
%prefix 10
and it can be reused.
%prefix 0
	If not all die, we need to clean it
		Identify live blocks
		Relocate them
%page

Cleaning the Log - 2

	Store descriptors when writing each cluster
		identity of each block (inode/offset)
		linkage information between clusters
	Store segment usage table
		Stored in a file
		Stores count of live blocks
		Also stores age information
	Use these to guide
		Selection of cluster for cleaning.
		Location of live blocks to be relocated.
%page

Summary of Log structuring

	Data written in large contiguous chunks (clusters).
	Data written to unused portions of device.
	Recently written data easily found.
	Old data is not overwritten immediately
		This allows 'snapshots' to be taken easily.
%page

Volume Management

	Common approach is an LVM layer
		Divide each device into portions
		Assemble these into 1 or more devices
	Mount a filesystem on the virtual device
		It divides the device into blocks
		... and assembles those into files
	This is simple and flexible but
		Involves double handling
		Hides device details from the filesystem.
%page

Volume Management - 2

	Better approach is to use filesystem
		Filesystem 'knows' about multiple devices
%% Filesystem can present multiple trees
		File system allows devices to be added (or removed?)
	Not a new idea
		Digital/Compaq/HP's Advanced Filesystem does this
		... but not common
%page

Volume Management - 3

	Volume management and LaFS
		Allow device removal
			Cleaner empties a device
			Device gets removed
		Different priority devices
			New writes to fast, small device.
			Data cleaned to slower larger device.
			Similar to external journal with EXT3.
		Can 'see' RAID geometry of devices.
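The cleaning steps from the "Cleaning the Log" slides (pick a segment, identify its live blocks, relocate them) can be sketched roughly as below. This is only an illustration, not the LaFS implementation: the `Segment` class, the `pick_segment_to_clean` and `clean` helpers, and the "fewest live blocks, oldest first" policy are all assumptions made for the example. The slides say only that live counts and age information guide the choice.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """Hypothetical in-memory stand-in for one entry of the segment usage table."""
    number: int
    age: int  # assumed unit: larger means written longer ago
    # Live blocks only, keyed by (inode, offset) as in the cluster descriptors.
    blocks: dict = field(default_factory=dict)

    @property
    def live(self):
        return len(self.blocks)


def pick_segment_to_clean(segments):
    """Assumed policy: fewest live blocks first, oldest segment on ties."""
    candidates = [s for s in segments if s.live > 0]
    if not candidates:
        return None  # every segment is already clean
    return min(candidates, key=lambda s: (s.live, -s.age))


def clean(segment, current):
    """Relocate all live blocks into the current segment; `segment` ends up clean."""
    moved = segment.live
    current.blocks.update(segment.blocks)
    segment.blocks.clear()
    return moved


segments = [
    Segment(0, age=5, blocks={(1, 0): b"a", (1, 1): b"b"}),
    Segment(1, age=9, blocks={(2, 0): b"c"}),
    Segment(2, age=1),  # already clean
]
current = Segment(3, age=0)
victim = pick_segment_to_clean(segments)
moved = clean(victim, current)
```

Emptying a device for removal would, under the same assumptions, amount to running `clean` over every segment on that device with `current` located on another device.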
%page

Supporting RAID

	Understand striping to avoid head contention
		Lay out smallish files on single device
	Understand RAID4/5 parity issues
		Pad write clusters to fill whole stripes
		Writing over dead data avoids unclean shutdown problems
	Preferred arrangement is RAID4 with linear addressing
		Stripe-wide writes mean RAID5 has no advantage
		Rectangular addressing is easier than with RAID5
		Linear addressing allows RAID growth
%page

Supporting NFS

	Need low-latency commits
		i.e. Write data and index information with minimum seeks
	Have low-latency commits!
		A single write cluster can contain data and indexing
		It commits with two multiblock writes (one write on some NVRAM devices)
		Can have NVRAM device for new data
%page

Supporting Backups - Wants

	Want to write all data to e.g. tape
	Want to minimise seeks when collecting data
	Want to avoid backing up dead blocks
	Want to be able to back up only new data
	Want to be able to run on live filesystem
%page

Supporting Backups - How

	We back up whole segments in reverse chronological order
		Don't back up clean segments
		Optionally ignore old (already dumped) segments
	Safe on a live filesystem
		Cleaning will need to be partially suspended
	Restoring will be interesting, but not frequent.
%page

Quotas

	User based quotas don't work
		e.g. group accounts
	Group based quotas don't work
		e.g. when groups used for access control
	Tree based quotas work well
		Need extra field in inode
		Slightly unusual semantics between trees
%page

Reliability

	Document
		Exhaustive technical documentation before implementation
	Prototype
		Write a prototype, then throw it away
	Test
		build test battery
		build test tools
%page

Status

	Documentation well under way
	Prototype has minimal functionality
	Stay tuned....
%font "typewriter", size 4, prefix 10
http://www.cse.unsw.edu.au/~neilb/projects/lafs/
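As a closing illustration of the "Supporting RAID" point about padding write clusters to fill whole stripes, the arithmetic might look like the sketch below. The function name and its parameters (`data_devices`, `chunk_blocks`) are assumptions for the example, not LaFS terms. The benefit of a full-stripe write is that parity can be computed from the new data alone, with no read-modify-write of old blocks.

```python
def padded_cluster_blocks(cluster_blocks, data_devices, chunk_blocks):
    """Round a write cluster up to a whole number of RAID stripes.

    cluster_blocks: blocks of real data in the cluster
    data_devices:   devices holding data (parity device excluded)
    chunk_blocks:   blocks written to each device per stripe
    """
    stripe_blocks = data_devices * chunk_blocks
    stripes = -(-cluster_blocks // stripe_blocks)  # ceiling division
    return stripes * stripe_blocks


# Hypothetical geometry: 4 data devices, 16-block chunks, so 64-block stripes.
total = padded_cluster_blocks(100, data_devices=4, chunk_blocks=16)
padding = total - 100
```

With this assumed geometry a 100-block cluster is padded out to two full stripes.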