An NFS filehandle stores information to identify a filesystem (or directory in the filesystem that is the export point) and a file within that filesystem.
Identifying the file within the filesystem is now handled fairly well, but there are problems with identifying the filesystem.
The problem with identifying a filesystem is in finding a small, fixed length value that uniquely identifies a particular filesystem.
Currently the default identifier is the major/minor number of the device that the filesystem is on. This works reasonably well for many cases, but it is not always reliable. Partly this is because some devices can change device number depending on connection geometries (this is common with SCSI). Partly this is because a device sometimes has to be changed, particularly when moving a filesystem transparently between machines during failover.
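As a small illustration of where this default identifier comes from, the device number of the filesystem holding a given path can be read from user space. This is only a sketch: the helper name `device_id` is made up for this note, and it says nothing about how the kernel packs the number into a filehandle.

```python
import os


def device_id(path):
    """Return (major, minor) of the device holding the filesystem at `path`."""
    st = os.stat(path)
    # st_dev is the device the file lives on; split it into its two halves.
    return os.major(st.st_dev), os.minor(st.st_dev)


if __name__ == "__main__":
    # The pair printed here is what can change after re-cabling or failover.
    print(device_id("/"))
```

If the same filesystem shows a different pair after a reboot or after moving to another host, filehandles built from the old pair no longer match.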
The current kernel code allows user space to specify a number to identify each filesystem. This is known as the 'fsid'. It should be 32 bits, though some code/interface problems make it effectively 16 bits.
This is enough bits to identify every filesystem that a site might have, but there is not a lot of spare space. This means that a site needs to allocate small numbers to each filesystem uniquely. This is not particularly hard. The difficulty comes in storing the information safely.
Experience with losing 'rmtab' in 2.4 and earlier kernels shows that it is important to keep multiple copies of important state such as filesystem to fsid mappings. Ideally there should be a per-system copy, and a copy on every exported filesystem (that is writable). Thus after any sort of failure, as long as you have a filesystem to export, you should have the mapping.
The other difficulty is keeping the mapping common across a site if fail-over is likely to be used. This would require a server on each host, and for these servers to regularly communicate. They should probably negotiate ranges of the address space that they are responsible for so that allocations can happen quickly.
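One very simple way to picture the range negotiation is a deterministic split that every partner can compute for itself from the same host list. The sketch below assumes the effective 16-bit space, leaves fsid 0 unused (an assumption of this sketch, not a requirement stated above), and uses a made-up function name `split_ranges`. A real negotiation would also have to cope with hosts joining later.

```python
FSID_BITS = 16  # effective width of the fsid, as noted above


def split_ranges(hosts, bits=FSID_BITS):
    """Divide 1 .. 2**bits - 1 into contiguous ranges, one per host."""
    # Sort and de-duplicate so every partner derives the same assignment.
    hosts = sorted(set(hosts))
    if not hosts:
        raise ValueError("need at least one host")
    total = (1 << bits) - 1          # number of usable values, 0 excluded
    size, extra = divmod(total, len(hosts))
    ranges = {}
    start = 1
    for i, host in enumerate(hosts):
        # The first `extra` hosts get one more value each.
        length = size + (1 if i < extra else 0)
        ranges[host] = range(start, start + length)
        start += length
    return ranges
```

Each host can then allocate from its own range without asking anyone, which is what makes allocation quick.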
So, the proposal seems to be to have a daemon which maintains a mapping from a large, reliably unique filesystem identifier, such as a device type and UUID, to a small local number. This daemon would store the mapping in several local files. It would communicate with a known list of fail-over partners. They would negotiate subsets of the number space that they each 'own'. When a new daemon was introduced, a re-negotiation could happen provided all known daemons were accessible.
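The core of such a daemon might look roughly like the sketch below. Everything here is an assumption made for illustration: the class name `FsidMap`, the JSON file format, and the "type:uuid" key. It keeps the table in several files, merges whatever copies it can read at start-up, and allocates new numbers only from the range this host owns.

```python
import json
import os


class FsidMap:
    """Map (fstype, uuid) to a small fsid, stored redundantly in several files."""

    def __init__(self, owned, paths):
        self.owned = owned            # range of fsids this host may allocate
        self.paths = list(paths)      # every file that holds a copy
        self.table = {}
        for p in self.paths:
            try:
                with open(p) as f:
                    loaded = json.load(f)
            except (OSError, ValueError):
                continue              # missing or damaged copy: try the next
            if isinstance(loaded, dict):
                # Entries never change once assigned, so a plain union is
                # enough here; conflicting copies are not handled.
                self.table.update(loaded)

    def lookup(self, fstype, uuid):
        key = f"{fstype}:{uuid}"
        if key in self.table:
            return self.table[key]
        used = set(self.table.values())
        for n in self.owned:
            if n not in used:
                self.table[key] = n
                self._save()
                return n
        raise RuntimeError("owned fsid range exhausted")

    def _save(self):
        for p in self.paths:
            tmp = p + ".tmp"
            try:
                with open(tmp, "w") as f:
                    json.dump(self.table, f)
                    f.flush()
                    os.fsync(f.fileno())
                os.replace(tmp, p)    # atomic swap, so a crash leaves a whole file
            except OSError:
                continue              # one unwritable copy must not block the rest
```

With one path on the system disk and one on each writable exported filesystem, losing any single copy still leaves the mapping recoverable.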
Local communication with the daemon would be via a UNIX-domain socket. Inter-host communication would need a TCP port to be allocated and would need some level of security. I would prefer the port to be allocated locally and recorded in a config file (possibly /etc/exports). Security doesn't need to be very strong; possibly a simple password stored in the config file would do.
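The local side could be as small as a one-line request and a one-line reply. The sketch below invents a text protocol ("LOOKUP type uuid" answered by "OK number") purely for illustration; nothing above fixes a wire format, and the function names are hypothetical. It works on any connected stream socket, so it can be tried with `socket.socketpair()` before binding a real UNIX-domain socket path.

```python
import socket


def handle_request(conn, lookup):
    """Daemon side: answer one 'LOOKUP <fstype> <uuid>' line on `conn`."""
    with conn.makefile("rw", encoding="utf-8", newline="\n") as f:
        parts = f.readline().split()
        if len(parts) == 3 and parts[0] == "LOOKUP":
            f.write(f"OK {lookup(parts[1], parts[2])}\n")
        else:
            f.write("ERR bad request\n")
        f.flush()


def ask(conn, fstype, uuid):
    """Client side: send one lookup and return the fsid as an int."""
    with conn.makefile("rw", encoding="utf-8", newline="\n") as f:
        f.write(f"LOOKUP {fstype} {uuid}\n")
        f.flush()
        reply = f.readline().split()
    if len(reply) == 2 and reply[0] == "OK":
        return int(reply[1])
    raise RuntimeError("lookup failed: " + " ".join(reply))
```

The `lookup` argument is any callable taking a type and a UUID and returning a number, so the mapping logic stays separate from the socket handling.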
(thoughts added 2004june08)
Alternately, rather than identifying the filesystem, we could satisfy ourselves with just identifying the export point as a point in the filesystem. Things will still only work properly if the correct filesystem is mounted there (as inode numbers etc. are stored in the filehandle) but it is conceptually much simpler.
This means that if a single device is used to mount various different filesystems (e.g. a cdrom changer) then each one must be mounted at a different location (/mnt/cdrom/rom-name or similar). I think this is acceptable as the task of differentiating different CDroms needs to be accomplished somewhere, and doing it when mounting (which is specific to the media) is easier than doing it when exporting.
This will mean that instead of device type and UUID or similar, we just use path names, which is lots easier.
Possibly a line in the /etc/exports file something like

    /mnt/cdrom/* @hostgroup(mountpoint,ro,fsid=/var/run/fsidsock)

could suggest that a mount request for anything under /mnt/cdrom should cause a request to be passed to whoever is listening on /var/run/fsidsock. That listener can mount the filesystem (if appropriate) and return an fsid.
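To make the idea of an fsid option that names a socket a little more concrete, here is a sketch of how an option list might be interpreted. The rule used (a value starting with "/" is a socket path, anything else is a fixed number) and the function name `parse_fsid_option` are assumptions of this sketch; the syntax itself is only a suggestion.

```python
def parse_fsid_option(options):
    """Interpret the fsid= entry of a comma-separated export option list.

    Returns ("socket", path), ("static", number), or None if no fsid is given.
    """
    for opt in options.split(","):
        name, sep, value = opt.partition("=")
        if name.strip() == "fsid" and sep:
            value = value.strip()
            if value.startswith("/"):
                # A path: ask whoever listens there for the fsid.
                return ("socket", value)
            # Otherwise treat it as a fixed number chosen by the admin.
            return ("static", int(value))
    return None
```

A fixed number and a socket path can then share the one option name without ambiguity.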