The Vnode Layer
WSTAT_MODE WSTAT_UID
WSTAT_GID WSTAT_SIZE WSTAT_ATIME WSTAT_MTIME WSTAT_CRTIME
The wstat() function subsumes numerous user-level functions (chown, chmod, ftruncate, utimes, etc.). Being able to modify multiple stat fields atomically with wstat() is useful. Further, this design avoids having seven different functions in the vnode layer API that each perform a very narrow task. The file system should modify only the fields of the node specified by the mask argument (if a bit is set, use the corresponding field to modify the node).
The final function in this group of routines is fsync(). The vnode layer expects this call to flush any cached data for this node through to disk. This call cannot return until the data is guaranteed to be on disk. This may involve iterating over all of the blocks of a file.
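As a rough illustration of the mask handling, the sketch below copies only the selected fields into a node. The bit values and the node_stat structure are assumptions made for this example; the real definitions come from the BeOS headers and the file system's own i-node layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit values, chosen only for this sketch. */
enum {
    WSTAT_MODE   = 0x01,
    WSTAT_UID    = 0x02,
    WSTAT_GID    = 0x04,
    WSTAT_SIZE   = 0x08,
    WSTAT_ATIME  = 0x10,
    WSTAT_MTIME  = 0x20,
    WSTAT_CRTIME = 0x40
};

/* Hypothetical stand-in for the stat fields kept in an i-node. */
struct node_stat {
    uint32_t mode, uid, gid;
    int64_t  size, atime, mtime, crtime;
};

/* Copy from src into node only those fields whose bit is set in mask. */
static void apply_wstat(struct node_stat *node, const struct node_stat *src,
                        uint32_t mask)
{
    assert(node != NULL && src != NULL);
    if (mask & WSTAT_MODE)   node->mode   = src->mode;
    if (mask & WSTAT_UID)    node->uid    = src->uid;
    if (mask & WSTAT_GID)    node->gid    = src->gid;
    if (mask & WSTAT_SIZE)   node->size   = src->size;
    if (mask & WSTAT_ATIME)  node->atime  = src->atime;
    if (mask & WSTAT_MTIME)  node->mtime  = src->mtime;
    if (mask & WSTAT_CRTIME) node->crtime = src->crtime;
}
```

A real file system would perform these updates under whatever lock or transaction protects the i-node, so that the whole set of changes appears atomic, and a size change would also have to grow or truncate the file's data.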
Create, Delete, and Rename
The create, delete, and rename functions are the core functionality provided by a file system. The vnode layer API to these operations closely resembles the user-level POSIX functions of the same name.
1 0 . 4 H O W I T R E A L LY W O R K S
create()
Creating files is perhaps the most important function of a file system; without it, the file system would always be empty. The two primary arguments of create() are the directory in which to create the file, and the name of the file to create. The vnode layer also passes the mode in which the file is being opened, the initial permissions for the file, and pointers to a vnid and a cookie that the file system should fill in.
The create() function should create an empty file that has the name given and that lives in the specified directory. If the file name already exists in the directory, the file system should call get_vnode() to load the vnode associated with the file. Once the vnode is loaded, the mode bits specified may affect the behavior of the open. If O_EXCL is specified in the mode bits, then create() should fail with EEXIST. If the name exists but is a directory, create() should return EISDIR. If the name exists and O_TRUNC is set, then the file must be truncated. If the name exists and all the other criteria are met, the file system can fill in the vnid and allocate the cookie for the existing file and return to the vnode layer.
In the normal case, the name does not exist in the directory, and the file system must do whatever is necessary to create the file. Usually this entails allocating an i-node, initializing the fields of the i-node, and inserting the name and i-node number pair into the directory. Further, if the file system supports indexing, the name should be entered into a name index if one exists.
File systems such as BFS must be careful when inserting the new file name into any indices. This action may cause updates to live queries, which in turn may cause programs to open the new file even before it is completely created. Care must be taken to ensure that the file is not accessed until it is completely created. The method of protection that BFS uses involves marking the i-node as being in a virgin state and blocking in read_vnode() until the virgin bit is clear (the virgin bit is cleared by create() when the file is fully created). The virgin bit is also set and then cleared by the mkdir() and symlink() operations.
The next step in the process of creating a file is for the file system to call new_vnode() to inform the vnode layer of the new vnid and its associated data pointer. The file system should also fill in the vnid pointer passed as an argument to create() as well as allocating a cookie for the file. The final step in the process of creating a file is to inform any interested parties of the new file by calling notify_listener(). Once these steps are complete, the new file is considered complete, and the vnode layer associates the new vnode with a file descriptor for the calling thread.
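The checks for a name that already exists can be sketched as a small decision function. The existing_node structure and the function name are hypothetical; only the O_EXCL, O_TRUNC, EEXIST, and EISDIR behavior comes from the description above.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>

/* Hypothetical minimal view of a node that already exists in the directory. */
struct existing_node {
    int  is_dir;   /* nonzero if the name refers to a directory */
    long size;     /* current length of the file in bytes */
};

/* Returns 0 if create() may reuse the existing node, else an errno value. */
static int create_on_existing(struct existing_node *node, int omode)
{
    assert(node != NULL);
    if (omode & O_EXCL)
        return EEXIST;        /* caller demanded a brand-new file */
    if (node->is_dir)
        return EISDIR;        /* create() cannot open a directory */
    if (omode & O_TRUNC)
        node->size = 0;       /* a real file system would also free blocks */
    return 0;                 /* fill in vnid and cookie, then return */
}
```

On a zero return, the file system would go on to fill in the vnid and allocate the cookie for the existing file.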
mkdir()
Similar to create(), the mkdir() operation creates a new directory. The difference at the user level is that creating a directory does not return a file handle; it simply creates the directory. The semantics from the point of view of the vnode layer are quite similar for creating files or directories (such as returning EEXIST if the name already exists in the directory). Unlike a file, mkdir() must ensure that the directory contains entries for “.” and “..” if necessary. (The “.” and “..” entries refer to the current directory and the parent directory, respectively.)
Unlike create(), the mkdir() function need not call new_vnode() when the directory creation is complete. The vnode layer will load the vnode separately when an opendir() is performed on the directory or when a path name refers to something inside the directory.
Once a directory is successfully created, mkdir() should call notify_listener() to inform any interested parties about the new directory. After calling notify_listener(), mkdir() is complete.
symlink()
The creation of symbolic links shares much in common with creating directories. The setup for creating a symbolic link proceeds in the same manner as creating a directory. If the name of a symbolic link already exists, the symlink() function should return EEXIST (there is no notion of O_TRUNC or O_EXCL for symbolic links). Once the file system creates the i-node and stores the path name being linked to, the symbolic link is effectively complete. As with directories and files, the last action taken by symlink() should be to call notify_listener().
readlink()
Turning away from creating file system entities for a moment, let’s consider the readlink() function. The POSIX API defines the readlink() function to read the contents of a symbolic link instead of the item it refers to. The readlink() function accepts a pointer to a node, a buffer, and a length. The path name contained in the link should be copied into the user buffer. It is expected that the file system will avoid overrunning the user’s buffer if it is too small to hold the contents of the symbolic link.
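One way to honor the buffer limit is to clamp the copy length, as in the sketch below. The function name is hypothetical, and silently truncating without a terminating NUL is an assumption borrowed from POSIX readlink(); the text above requires only that the buffer not be overrun.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy at most bufsize bytes of the stored link target into buf and
 * return the number of bytes copied. No terminating NUL is written.
 */
static size_t copy_link_target(const char *target, size_t target_len,
                               char *buf, size_t bufsize)
{
    assert(target != NULL && buf != NULL);
    size_t n = target_len < bufsize ? target_len : bufsize;
    memcpy(buf, target, n);
    return n;
}
```

A caller that needs a C string must reserve one extra byte and add the terminator itself.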
link()
The vnode layer API also has support for creating hard links via the link() function. The vnode layer passes a directory, a name, and an existing vnode to the file system. The file system should add the name to the directory and associate the vnid of the existing vnode with the name.
The link() function is not implemented by BFS or any of the other file systems that currently exist on the BeOS. The primary reason for not implementing hard links is that at the time BFS was being written, the C++ user-level file API was not prepared to deal with them. There was no time to modify the C++ API to offer support for them, and so we felt that it would be better not to implement them in the file system (to avoid confusion for programmers). The case is not closed, however, and should the need arise,
we can extend the C++ API to better support hard links and modify BFS to implement them.
unlink() and rmdir()
A file system also needs to be able to delete files and directories. The vnode layer API breaks this into three functions. The first two, unlink() and rmdir(), are almost identical except that unlink() only operates on files and rmdir() only operates on directories. Both unlink() and rmdir() accept a directory node pointer and a name to delete. First the name must be found in the directory and the corresponding vnid loaded. The unlink() function must check that the node being removed is a file (or symbolic link). The rmdir() function must ensure that the node being removed is a directory and that the directory is empty. If the criteria are met, the file system should call the vnode layer support routine remove_vnode() on the vnid of the entity being deleted. The next order of business for either routine is to delete the named entry from the directory passed in by the vnode layer. This ensures that no further access will be made to the file other than through already open file descriptors. BFS also sets a flag in the node structure to indicate that the file is deleted so that queries (which load the vnid directly instead of going through path name translation) will not touch the file.
remove_vnode()
The vnode layer support routine remove_vnode() marks a vnode for deletion. When the reference count on the marked vnode reaches zero, the vnode layer calls the file system remove_vnode() function. The file system remove_vnode() function is guaranteed to be single threaded and is only called once for any vnid. The remove_vnode() function takes the place of a call to write_vnode(). The vnode layer expects the file system remove_vnode() function to free up any of the permanent resources associated with the node as well as any in-memory resources. For a disk-based file system such as BFS, the permanent resources associated with a file are the allocated data blocks of the file and extra attributes belonging to the file. The remove_vnode() function of a file system is the last call ever made on a vnid.
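The type checks that distinguish the two routines might look like the following sketch. The structure, the function names, and the specific error codes (EISDIR, ENOTDIR, ENOTEMPTY) are assumptions borrowed from POSIX conventions; the text above states only which conditions must hold.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum node_type { NODE_FILE, NODE_SYMLINK, NODE_DIR };

/* Hypothetical summary of the node that the name resolved to. */
struct node_info {
    enum node_type type;
    int entry_count;   /* entries other than "." and "..", directories only */
};

/* unlink() accepts files and symbolic links, never directories. */
static int check_unlink(const struct node_info *n)
{
    assert(n != NULL);
    return (n->type == NODE_DIR) ? EISDIR : 0;
}

/* rmdir() accepts only directories, and only empty ones. */
static int check_rmdir(const struct node_info *n)
{
    assert(n != NULL);
    if (n->type != NODE_DIR)
        return ENOTDIR;
    if (n->entry_count > 0)
        return ENOTEMPTY;
    return 0;
}
```

Only after the relevant check returns 0 would the file system call the remove_vnode() support routine and delete the name from the directory.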
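The deferred nature of deletion can be modeled with a reference count and a flag. This is a toy model with hypothetical names, not the vnode layer's actual data structures: the two counters stand in for calls to the file system's remove_vnode() and write_vnode() hooks.

```c
#include <assert.h>
#include <stddef.h>

/* Toy vnode: counters record which file system hook would have run. */
struct vnode {
    int ref_count;
    int marked_for_removal;
    int fs_remove_calls;   /* stands in for the fs remove_vnode() hook */
    int fs_write_calls;    /* stands in for the fs write_vnode() hook */
};

/* Support routine: only marks the vnode; nothing is freed yet. */
static void mark_remove_vnode(struct vnode *v)
{
    assert(v != NULL);
    v->marked_for_removal = 1;
}

/* Drop one reference; the last reference triggers exactly one hook. */
static void put_vnode_ref(struct vnode *v)
{
    assert(v != NULL && v->ref_count > 0);
    if (--v->ref_count > 0)
        return;                  /* still in use through an open descriptor */
    if (v->marked_for_removal)
        v->fs_remove_calls++;    /* free blocks, attributes, memory */
    else
        v->fs_write_calls++;     /* normal flush and release */
}
```

The model shows why a deleted file remains usable through descriptors that are already open: the resources are released only when the last reference goes away.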
rename()
The most difficult of all vnode operations is rename(). The complexity of the rename() function derives from its guarantee of atomicity for a multistep operation. The vnode layer passes four arguments to rename(): the old directory node pointer, the old name, the new directory pointer, and the new name. The vnode layer expects the file system to look up the old name and new name and call get_vnode() for each node.
The simplest and most common rename() case is when the new name does not exist. In this situation the old name is deleted from the old directory and the new name inserted into the new directory. This involves two directory operations but little more (aside from a call to notify_listener()).
The situation becomes more difficult if the new name is already a file (or directory). In that case the new name must be deleted (in the same way that unlink() or rmdir() does). Deleting the entity referred to by the new name is a key feature of the rename() function because it guarantees an atomic swap with an old name and a new name whether or not the new name exists. This is useful for situations when a file must always exist for clients, but a new version must be dropped in place atomically.
After dealing with the new name, the old name should be deleted from the old directory and the new name inserted into the new directory so that it refers to the vnid that was associated with the old name.
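The sequence can be sketched with toy in-memory directories. Everything here (the fixed-size directory, the helper names, the error codes) is hypothetical, and locking, journaling, and index updates are omitted; the sketch repoints an existing target entry instead of deleting and reinserting it, which has the same visible result.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define MAX_ENTRIES 8

struct entry { char name[32]; long vnid; int used; };
struct directory { struct entry e[MAX_ENTRIES]; };

static struct entry *dir_find(struct directory *d, const char *name)
{
    for (int i = 0; i < MAX_ENTRIES; i++)
        if (d->e[i].used && strcmp(d->e[i].name, name) == 0)
            return &d->e[i];
    return NULL;
}

static int dir_insert(struct directory *d, const char *name, long vnid)
{
    if (strlen(name) >= sizeof(d->e[0].name))
        return ENAMETOOLONG;
    for (int i = 0; i < MAX_ENTRIES; i++)
        if (!d->e[i].used) {
            strcpy(d->e[i].name, name);
            d->e[i].vnid = vnid;
            d->e[i].used = 1;
            return 0;
        }
    return ENOSPC;
}

/* Returns 0 or an errno value; *replaced gets the displaced vnid or -1. */
static int toy_rename(struct directory *olddir, const char *oldname,
                      struct directory *newdir, const char *newname,
                      long *replaced)
{
    assert(replaced != NULL);
    *replaced = -1;
    struct entry *src = dir_find(olddir, oldname);
    if (src == NULL)
        return ENOENT;
    struct entry *dst = dir_find(newdir, newname);
    if (dst == src)
        return 0;                /* renaming a name to itself */
    if (dst != NULL) {
        *replaced = dst->vnid;   /* this node must now be deleted */
        dst->vnid = src->vnid;   /* new name refers to the old node */
        src->used = 0;           /* old name disappears */
        return 0;
    }
    long vnid = src->vnid;
    src->used = 0;               /* delete the old name ... */
    int err = dir_insert(newdir, newname, vnid);
    if (err != 0)
        src->used = 1;           /* ... and undo that if the insert fails */
    return err;
}
```

The vnid reported through replaced is the one a real file system would hand to the remove_vnode() support routine.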
The vnode layer expects that the file system will prevent unusual situations such as renaming a parent of the current directory to be a subdirectory of itself (which would effectively break off a branch of the file hierarchy and make it unreachable). Further, should an error occur at any point during the operation, all the other operations must be undone. For a file system such as BFS, this is very difficult.
File systems that support indexing must also update any file name indices that exist to reflect that the old name no longer exists and that the new name exists (or at least has a new vnid). Once all of these steps are complete, the rename() operation can call notify_listener() to update any programs monitoring for changes.
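A common way to detect the orphaning case is to walk from the destination directory up to the root and check whether the directory being moved appears on that path. The sketch below assumes each in-memory directory node keeps a parent pointer, which is an assumption of this example and not something the text specifies.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical in-memory directory node with a link to its parent. */
struct dir_node {
    long vnid;
    struct dir_node *parent;   /* NULL at the root */
};

/*
 * Returns 1 if moving `moved` under `new_parent` would place it inside
 * itself, that is, if `moved` is new_parent or one of its ancestors.
 */
static int rename_would_orphan(const struct dir_node *moved,
                               const struct dir_node *new_parent)
{
    assert(moved != NULL);
    for (const struct dir_node *d = new_parent; d != NULL; d = d->parent)
        if (d->vnid == moved->vnid)
            return 1;
    return 0;
}
```

A disk-based file system would follow the stored ".." entries instead of pointers and would need to hold a lock that keeps the hierarchy stable during the walk.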
Attributes and Index Operations
The BeOS vnode layer contains attribute and index operations that most existing file systems do not support. A file system may choose not to implement these features, and the vnode layer will accommodate that choice. If a file system does not implement extended functionality, then the vnode layer returns an error when a user program requests an extended operation. The vnode layer makes no attempt to automatically remap extended features in terms of lower-level functionality. Trying to automatically map from an extended operation to a more primitive operation would introduce too much complexity and too much policy into the vnode layer. For this reason the BeOS vnode layer takes a laissez-faire attitude toward unimplemented features and simply returns an error code to user programs that try to use an extended feature on a file system that does not support it.
An application program has two choices when faced with the situation that a user wants to operate on a file that exists on a file system that does not have attributes or indices. The first choice is to simply fail outright, inform the user of the error, and not allow file operations on that volume. A more sophisticated approach is to degrade functionality of the application gracefully. Even though attributes may not be available on a particular volume, an
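One plausible way to implement this policy is to leave the function pointer for an unsupported operation NULL and have the dispatch code check it. The structure, the function names, and the choice of ENOSYS are all assumptions made for this sketch; the text above does not say which error code the BeOS returns.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical signature for an attribute read hook. */
typedef int (*op_read_attr_fn)(void *node, const char *name,
                               void *buf, size_t *len);

/* Hypothetical, heavily reduced table of file system operations. */
struct toy_vnode_ops {
    op_read_attr_fn read_attr;   /* NULL if the file system lacks attributes */
};

/* Dispatch: no fallback and no emulation, just an error if unsupported. */
static int vn_read_attr(const struct toy_vnode_ops *ops, void *node,
                        const char *name, void *buf, size_t *len)
{
    assert(ops != NULL);
    if (ops->read_attr == NULL)
        return ENOSYS;           /* assumed error code */
    return ops->read_attr(node, name, buf, len);
}

/* Trivial hook used to show the supported path; reports an empty value. */
static int demo_read_attr(void *node, const char *name,
                          void *buf, size_t *len)
{
    (void)node; (void)name; (void)buf;
    *len = 0;
    return 0;
}
```

Because the dispatcher never tries to emulate the missing operation, no policy about how attributes might be simulated leaks into the vnode layer.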
application could still allow file operations but would not support the extra features provided by attributes.
Transferring files between different types of file systems raises the same issue. A file on a BFS volume that has many attributes will lose information if a user copies it to a non-BFS volume. This loss of information is unavoidable but may not be catastrophic. For example, if a user creates a graphical image on the BeOS, that file may have several attributes. If the file is copied to an MS-DOS FAT file system so that a service bureau could print it, the loss of attribute information is irrelevant because the destination system has no knowledge of attributes.
The situation in which a user needs to transfer data between two BeOS machines but must use an intermediate file system that is not attribute- or index-aware is more problematic. We expect that this case is not common. If preserving the attributes is a requirement, then the files needing to be transferred can be archived using an archive format that supports attributes (such as zip).
A file system implementor can alleviate some of these difficulties and also make a file system more Be-like by implementing limited support for attributes and indices. For example, the Macintosh HFS implementation for the BeOS maps HFS type and creator codes to the BeOS file type attribute. The resource fork of files on the HFS volume is also exposed as an attribute, and other information such as the icon of a file and its location in a window are mapped to the corresponding attributes used by the BeOS file manager. Having the file system map attribute or even index operations to features of the underlying file system format enables a more seamless integration of that file system type with the rest of the BeOS.
Attribute Directories
The BeOS vnode layer allows files to have a list of associated attributes. Of course this requires that programs have a way to iterate over the attributes that a particular file may have. The vnode operations to operate on file attributes bear a striking resemblance to the directory operations:
    op_open_attrdir     (*open_attrdir);
    op_close_attrdir    (*close_attrdir);
    op_free_cookie      (*free_attrdircookie);
    op_rewind_attrdir   (*rewind_attrdir);
    op_read_attrdir     (*read_attrdir);
The semantics of each of these functions are identical to those of the normal directory operations. The open_attrdir function initiates access and allocates any necessary cookies. The read_attrdir function returns information about each attribute (primarily a name). The rewind_attrdir function resets the state in the cookie so that the next read_attrdir call will return the first entry. The close_attrdir and free_cookie routines should behave as the corresponding directory routines do. The key difference between these routines and the normal directory routines is that these operate on the list of attributes of a file.
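The cookie-based iteration can be illustrated with a toy model in which a file's attribute names live in an array and the cookie is just an index. All names and signatures below are hypothetical simplifications; the real operations take additional arguments and fill in directory-entry structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Toy file whose attribute names are held in a simple array. */
struct toy_file {
    const char **attr_names;
    size_t attr_count;
};

/* Per-open iteration state, the role the cookie plays in the vnode layer. */
struct attrdir_cookie { size_t next; };

static int toy_open_attrdir(struct toy_file *f, struct attrdir_cookie **cookie)
{
    assert(f != NULL && cookie != NULL);
    *cookie = calloc(1, sizeof(**cookie));   /* iteration starts at entry 0 */
    return *cookie != NULL ? 0 : -1;
}

/* Returns the next attribute name, or NULL when the list is exhausted. */
static const char *toy_read_attrdir(struct toy_file *f,
                                    struct attrdir_cookie *c)
{
    if (c->next >= f->attr_count)
        return NULL;
    return f->attr_names[c->next++];
}

static void toy_rewind_attrdir(struct attrdir_cookie *c) { c->next = 0; }

static void toy_free_attrdircookie(struct attrdir_cookie *c) { free(c); }
```

Keeping the position in the cookie, and not in the file, lets several programs iterate over the same file's attributes independently.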
Working with Attributes
Supporting attributes associated with files requires a way to create, read, write, and delete them, and to obtain information about them. The vnode layer supports the following operations on file attributes:
    op_write_attr    (*write_attr);
    op_read_attr     (*read_attr);
    op_remove_attr   (*remove_attr);
    op_rename_attr   (*rename_attr);
    op_stat_attr     (*stat_attr);
Notably absent from the list of functions are create_attr() and open_attr(). This absence reflects a decision made during the design of the vnode layer. We decided that attributes should not be treated by the vnode layer in the same way as files. This means that attributes are not entitled to their own file descriptor in the way that files and directories are. There were several reasons for this decision. The most important reason is that making attributes full-fledged file descriptors would make it very difficult to manage regular files. For example, if attributes were file descriptors, it would be possible for a file descriptor to refer to an attribute of a file that has no other open file descriptors. If the file underlying the attribute were to be erased, it would become very difficult for the vnode layer to know when it is safe to call the remove_vnode function for the file because it would require checking not only the reference count of the file’s vnode but also all the attribute vnodes associated with the file. This sort of checking would be extremely complex at the vnode layer, which is why we chose not to implement attributes as file descriptors. Further, naming conventions and identification of attributes complicate matters even more. These issues sealed our decision after several aborted attempts to make attributes work as file descriptors.
This decision dictated that all attribute I/O and informational routines would have to accept two arguments to specify which attribute to operate on. The first argument is an open file descriptor (at the user level), and the second argument is the name of the attribute. In the kernel, the file descriptor argument is replaced with the vnode of the file. All attribute operations must specify these two arguments. Further, the operations that read or write data must also specify the offset to perform the I/O at. Normally a file descriptor encapsulates the file position, but because attributes have no file descriptor, all the information necessary must be specified on each call. Although it may