uses AFS for metadata operations but allows high-performance data transfer over a SAN path, and dCache [Fuh07], a replication-based storage system used in high-energy physics grid applications. In dCache, an NFS-based layer is used for metadata operations, but other transport protocols, such as HTTP or GridFTP, must be used for data transfers. Efforts are already underway to standardize this kind of hybrid solution: NFS version 4.1 will allow multiple alternatives to data transfer via RPC, such as object protocols comparable to those of parallel file systems or direct block device access as in SAN file systems.
The existence of hybrid storage systems is important: at least for some systems, it may be possible to separate the access protocol (e.g., NFS) from the storage infrastructure. This means that the scalability, performance, and semantic properties of the protocol and of the protocol implementation can also be considered separately.
2.6 Semantics of data and metadata access
To correctly compare different file and storage systems, one must keep in mind the behaviour guaranteed by the file systems in question, termed the data access semantics. Two important considerations are the manner by which concurrent and possibly conflicting modifications are handled and the manner by which data persistence can be guaranteed. Both considerations heavily influence the performance of distributed file systems.
2.6.1 Concurrent access to file data
To reduce the amount of synchronisation and locking, distributed file systems often relax consistency guarantees as compared to local file systems, which usually adhere to rules known as ‘POSIX semantics’.
POSIX semantics
The POSIX specification requires that, once a write call issued by a process has returned, all subsequent read operations on the same file region, whether from the same or from other processes, return the data previously written [Gro04]:
After a write() to a regular file has successfully returned:
• Any successful read() from each byte position in the file that was modified by that write shall return the data specified by the write() for that position until such byte positions are again modified.
• Any subsequent successful write() to the same byte position in the file shall overwrite that file data.
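The read-after-write guarantee quoted above can be observed on a local file system. The following is a minimal sketch in Python, assuming a POSIX platform; the function name read_after_write and the file name are hypothetical, and os.pwrite()/os.pread() are used as positioned variants of write()/read().

```python
import os
import tempfile


def read_after_write(payload: bytes, offset: int = 0) -> bytes:
    """Write payload through one descriptor and read it back through another."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "region.dat")  # hypothetical file name
        writer = os.open(path, os.O_CREAT | os.O_WRONLY, 0o600)
        reader = os.open(path, os.O_RDONLY)  # independent open file description
        try:
            written = os.pwrite(writer, payload, offset)
            # No close() or fsync() happens in between: the write has returned,
            # so a POSIX-compliant file system must return the new bytes here.
            return os.pread(reader, written, offset)
        finally:
            os.close(writer)
            os.close(reader)


if __name__ == "__main__":
    print(read_after_write(b"hello", offset=4096))
```

On a distributed file system with relaxed semantics, the same sequence issued from two different clients may return stale data unless the cache is refreshed explicitly.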
Writes to files opened with the O_APPEND flag are a special case: each write() call is guaranteed to set the file offset to the end of the file before the data is written.
Efficiently implementing this behaviour in a distributed file system can be difficult because operations must be synchronised and/or (in the case of O_APPEND writes) serialized. Therefore alternative, less strict semantics have been proposed, some of which require the execution of explicit steps to refresh any cached data.
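The append behaviour described above can be illustrated with two descriptors that refer to the same file. This is a sketch for a local POSIX file system only; the function name append_records is hypothetical. Without the append flag, the second descriptor would start writing at its own offset 0 and overwrite the first record.

```python
import os
import tempfile


def append_records(records):
    """Alternate writes between two O_APPEND descriptors on the same file."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "log.txt")  # hypothetical file name
        flags = os.O_CREAT | os.O_WRONLY | os.O_APPEND
        fds = [os.open(path, flags, 0o600) for _ in range(2)]
        try:
            for i, record in enumerate(records):
                # Each write() first moves the offset to the current end of file.
                os.write(fds[i % 2], record)
        finally:
            for fd in fds:
                os.close(fd)
        with open(path, "rb") as f:
            return f.read()


if __name__ == "__main__":
    print(append_records([b"a\n", b"b\n", b"c\n"]))
```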
Nonconflicting write semantics
The term nonconflicting write was introduced by Rob Ross in the specification for PVFS2 [LRT05]. While write operations to disjoint file regions behave similarly to POSIX, the results of simultaneous writes to overlapping file regions are undefined. Nonconflicting write semantics can be implemented by caching data only on servers and disabling client caching.
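The access pattern that stays well defined under nonconflicting write semantics is one in which every writer owns a disjoint region. The sketch below runs on a local file system with threads, not on PVFS2; BLOCK and write_disjoint are hypothetical names chosen for illustration.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

BLOCK = 4  # assumed size of the region owned by each writer, in bytes


def write_disjoint(blocks):
    """Let each writer fill its own region; no two regions overlap."""
    if not blocks or any(len(b) != BLOCK for b in blocks):
        raise ValueError("expected a non-empty list of BLOCK-sized payloads")
    with tempfile.TemporaryDirectory() as tmp:
        fd = os.open(os.path.join(tmp, "shared.dat"), os.O_CREAT | os.O_RDWR, 0o600)
        try:
            def writer(index):
                # Writer i touches bytes [i * BLOCK, (i + 1) * BLOCK) only.
                os.pwrite(fd, blocks[index], index * BLOCK)

            with ThreadPoolExecutor(max_workers=len(blocks)) as pool:
                list(pool.map(writer, range(len(blocks))))
            return os.pread(fd, BLOCK * len(blocks), 0)
        finally:
            os.close(fd)


if __name__ == "__main__":
    print(write_disjoint([b"AAAA", b"BBBB", b"CCCC"]))
```

If two writers were given overlapping offsets, a file system with nonconflicting write semantics would make no promise about the resulting content.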
Close-to-open semantics
Close-to-open semantics guarantees that when one process closes an open file, another process that subsequently opens the file will be able to read the changes. A particularly important consideration here is that opening the file is a necessary precondition for this guarantee. The most important file system using close-to-open semantics is NFS [CPS95].
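For applications, close-to-open semantics suggests a simple usage pattern: the writer closes the file to publish its changes, and the reader opens the file anew instead of reusing an old handle. The following sketch shows only this pattern; the names publish and read_fresh are hypothetical, and on a local file system the result is the same with or without the re-open.

```python
import os
import tempfile


def publish(path, data):
    """Writer side: the close() at the end of the with-block publishes the data."""
    with open(path, "wb") as f:
        f.write(data)


def read_fresh(path):
    """Reader side: open the file on every call so that the guarantee applies."""
    with open(path, "rb") as f:
        return f.read()


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "shared.txt")  # hypothetical file name
        publish(path, b"v1")
        first = read_fresh(path)
        publish(path, b"v2")
        second = read_fresh(path)
        return first, second


if __name__ == "__main__":
    print(demo())
```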
Open-to-close semantics
Open-to-close semantics, also known as transaction semantics, guarantees that after a close or fsync, all other processes can read the changes. In contrast to the close-to-open semantics described above, re-opening the file is not necessary. Open-to-close semantics is implemented in AFS.
Immutable semantics
Immutable (read-only) semantics completely forbids write access. While this is not useful for active file systems, most real and virtual copies of active file systems (e.g., snapshots or replicas) exhibit immutable semantics [Cam98, HLM02]. A desirable property of immutability is that it is trivial to reason about freshness of cached or replicated data, because data never changes once written.
Version semantics
File systems using version semantics can maintain several versions of a file [SFHV99, SGSG03]. Changes are applied using transaction semantics (e.g., between open() and close()). During a transaction every process works on its own version of the file. If changes are made, new versions are created. Version semantics differs from immutable semantics in its ability to limit the number of versions by merging changes and resolving conflicts [RHR+94]. This process, unfortunately, requires knowledge about the internal structure of file data and thus cannot be implemented by a file system alone but also needs application support, and perhaps even user interaction. Still, version semantics is used in file systems for mobile applications (e.g., disconnected mode in CIFS).
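The version model described above can be made concrete with a toy in-memory model. It is not taken from any of the cited systems: VersionedFile and Transaction are hypothetical classes, and merging is deliberately left out because it depends on the structure of the file data.

```python
class VersionedFile:
    """Toy model: an append-only list of immutable file versions."""

    def __init__(self, initial=b""):
        self.versions = [bytes(initial)]

    def open(self):
        # A transaction starts from the newest version known at open() time.
        return Transaction(self, base=len(self.versions) - 1)


class Transaction:
    """Private working copy that becomes a new version on close()."""

    def __init__(self, file, base):
        self.file = file
        self.base = base
        self.data = bytearray(file.versions[base])
        self.dirty = False

    def write(self, offset, payload):
        if len(self.data) < offset:
            # Pad with zero bytes when writing past the current end.
            self.data.extend(bytes(offset - len(self.data)))
        self.data[offset:offset + len(payload)] = payload
        self.dirty = True

    def close(self):
        # Only modified working copies create a new version.
        if self.dirty:
            self.file.versions.append(bytes(self.data))
        return len(self.file.versions) - 1


if __name__ == "__main__":
    f = VersionedFile(b"hello")
    t1, t2 = f.open(), f.open()
    t1.write(0, b"J")
    t2.write(4, b"!")
    t1.close()
    t2.close()
    print(f.versions)
```

Both transactions start from the same base version, so the two resulting versions diverge. Reconciling them into one file is the merge step that needs application knowledge.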
2.6.2 Internal metadata semantics
User-level applications do not interact with internal metadata directly. Because allocation of storage is conducted together with file data modifications, the rules for data operations apply.
2.6.3 External metadata semantics
Metadata semantics defines how the file system behaves and what it guarantees during conflicting and concurrent operations on external metadata.
Uniqueness of file names
File systems guarantee the uniqueness of file names in a directory. New directory entries emerge by opening a file with the O_CREAT flag, generating a link or symlink, renaming a file, or creating a new directory. In these cases, an error status code is returned if an entry with the same name already exists in the directory; the exceptions are open() without O_EXCL, which opens the existing file, and rename(), which replaces an existing destination entry. This uniqueness requirement effectively determines the performance of directory operations.
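A minimal sketch of how name uniqueness appears to an application, assuming a local POSIX file system and Python's os module. Note that open() reports an existing name as an error only when O_EXCL accompanies O_CREAT; create_unique and demo are hypothetical helper names.

```python
import os
import tempfile


def create_unique(directory, name):
    """Create a new entry; return False if the name is already taken."""
    path = os.path.join(directory, name)
    try:
        # O_EXCL makes the existence check and the creation one atomic step.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
    except FileExistsError:  # errno EEXIST
        return False
    os.close(fd)
    return True


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        first = create_unique(tmp, "entry")
        second = create_unique(tmp, "entry")
        try:
            os.mkdir(os.path.join(tmp, "entry"))  # same name, different type
            mkdir_ok = True
        except FileExistsError:
            mkdir_ok = False
        return first, second, mkdir_ok


if __name__ == "__main__":
    print(demo())
```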
Atomic rename
POSIX states that the rename() function that changes the file path must be atomic. This property is often used by applications to make high-level operations atomic by creating and writing files using a temporary name or location and then moving them into the final path [Ber00]. Atomicity is only guaranteed when the source and destination are inside the same local file system: otherwise, an EXDEV error status code is returned because the action would involve file data movement.
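The temporary-name pattern mentioned above can be sketched as follows. The function name atomic_write is hypothetical; os.replace() is used because it maps to rename() on POSIX platforms, and the temporary file is created in the destination directory so that source and destination share one file system.

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace the content of path so that readers see the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # make the data durable before the rename
        os.replace(tmp_path, path)  # atomic within a single file system
    except BaseException:
        try:
            os.unlink(tmp_path)  # do not leave the temporary file behind
        except FileNotFoundError:
            pass
        raise


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "config.txt")  # hypothetical file name
        atomic_write(target, b"old")
        atomic_write(target, b"new")
        with open(target, "rb") as f:
            return f.read(), os.listdir(tmp)


if __name__ == "__main__":
    print(demo())
```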
In the case of distributed file systems, the behavior of rename() is similar to the local case: when the namespace is assembled from separately managed subvolumes it is not possible to atomically move a file between two separate sub-namespaces. For example, NFS implements an NFS3ERR_XDEV status [CPS95]. In spite of a single mountpoint on the client, a server will respond with an error to requests that try to move a file between separate server file systems.
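Applications that move files across such boundaries typically have to handle the cross-device error themselves. The sketch below shows one possible fallback, assuming Python on a POSIX platform: copy the data next to the destination and then rename within the destination file system. The names move and simulate_exdev are hypothetical, and the rename parameter exists only so that the error path can be exercised without a second file system.

```python
import errno
import os
import shutil
import tempfile


def move(src, dst, rename=os.rename):
    """Rename if possible; otherwise copy, rename locally, and remove the source."""
    try:
        rename(src, dst)
        return "renamed"
    except OSError as exc:
        if exc.errno != errno.EXDEV:
            raise
    directory = os.path.dirname(os.path.abspath(dst))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".move-")
    os.close(fd)
    shutil.copy2(src, tmp_path)  # data movement: this part is not atomic
    os.replace(tmp_path, dst)    # the destination appears atomically
    os.unlink(src)               # source and destination coexist until here
    return "copied"


def simulate_exdev(src, dst):
    """Hypothetical stand-in for a rename() across file systems."""
    raise OSError(errno.EXDEV, "cross-device link")


def demo(rename=os.rename):
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "a.txt")
        dst = os.path.join(tmp, "b.txt")
        with open(src, "wb") as f:
            f.write(b"payload")
        how = move(src, dst, rename=rename)
        with open(dst, "rb") as f:
            return how, f.read(), sorted(os.listdir(tmp))


if __name__ == "__main__":
    print(demo())
    print(demo(rename=simulate_exdev))
```

The fallback gives up the atomicity of the whole move: for a short time both paths exist, and a crash in between can leave either or both behind.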
Race conditions on directory listings
From an application perspective, reading the contents of a directory is an iterative operation based upon calling opendir() and then calling readdir() repeatedly until all entries are fetched. The process is not atomic, which means that if the content of the directory changes while an application iterates through the contents, the results will be undefined. This is true both for local and distributed file systems.
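Because the listing is not atomic, code that processes directory entries should tolerate entries that disappear between the listing and the subsequent access. A minimal sketch, assuming Python's os.listdir() as a wrapper around the opendir()/readdir() loop; snapshot_sizes and the after_listing hook are hypothetical, the hook serving only to simulate a concurrent change.

```python
import os
import tempfile


def snapshot_sizes(directory, after_listing=None):
    """Return {name: size} for entries that still exist when they are examined."""
    names = os.listdir(directory)
    if after_listing is not None:
        after_listing()  # simulated concurrent modification of the directory
    sizes = {}
    for name in names:
        try:
            sizes[name] = os.stat(os.path.join(directory, name)).st_size
        except FileNotFoundError:
            continue  # the entry vanished after it was listed
    return sizes


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        for name, size in (("a", 1), ("b", 2)):
            with open(os.path.join(tmp, name), "wb") as f:
                f.write(b"x" * size)
        return snapshot_sizes(
            tmp, after_listing=lambda: os.unlink(os.path.join(tmp, "b"))
        )


if __name__ == "__main__":
    print(demo())
```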
Visibility of changes
A side effect of the race condition is that the visibility of changes to metadata, such as the creation of new files, can be delayed. Some protocols (e.g., NFS) allow time-based caching of directory entries. Because the changes might not be visible until the cache expires, the only way to check the existence of a file is to try to open() it; simply listing and iterating over directory contents is not sufficient.
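The existence check by open() recommended above can be sketched as follows. The function name exists_by_open is hypothetical; errors other than a missing file (for example, missing permissions) are deliberately passed on to the caller.

```python
import os
import tempfile


def exists_by_open(path):
    """Check for a file by opening it rather than by scanning a directory listing."""
    try:
        fd = os.open(path, os.O_RDONLY)
    except FileNotFoundError:  # errno ENOENT
        return False
    os.close(fd)
    return True


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "new-file")  # hypothetical file name
        before = exists_by_open(path)
        with open(path, "wb"):
            pass
        return before, exists_by_open(path)


if __name__ == "__main__":
    print(demo())
```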
2.6.4 Persistence semantics
Concurrent access semantics defines what state processes can see when they interact with a file system. This does not necessarily correspond to the state of the file system in stable storage. If an interruption (crash) occurs, data in volatile memory caches could be lost, and therefore applications need a way to ensure that their modifications reach persistent storage. This process can have a great influence on data and metadata performance. Depending upon the storage subsystem, persistence may mean writing not only to disk but also to non-volatile memory (NVRAM).
Normal operations
During normal operations, both data and metadata are stored in a cache in RAM and flushed to disk when the cache becomes full or when memory pressure arises because memory is needed for other applications. Additionally, cache modifications are written to disk at regular intervals. However, just waiting for a certain period does not guarantee a successful modification, as there is still a possibility of I/O errors while flushing the cache. The operating system reports such errors to applications as the result of fsync() or close() calls.
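Since flush errors surface only at fsync() or close(), an application that cares about persistence has to check both calls. A minimal sketch in Python, where these errors appear as OSError exceptions; the function name write_checked is hypothetical.

```python
import os
import tempfile


def write_checked(path, data):
    """Write data and let I/O errors from write, fsync, or close reach the caller."""
    fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_TRUNC, 0o600)
    try:
        offset = 0
        while offset < len(data):
            # write() may accept fewer bytes than requested, so loop.
            offset += os.write(fd, data[offset:])
        os.fsync(fd)  # raises OSError (e.g., EIO) if flushing the cache fails
    except OSError:
        os.close(fd)
        raise
    os.close(fd)  # close() can also report an error
    return offset


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "data.bin")  # hypothetical file name
        count = write_checked(path, b"important")
        with open(path, "rb") as f:
            return count, f.read()


if __name__ == "__main__":
    print(demo())
```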
File data persistence
To ensure that data is stored persistently, both implicit and explicit techniques can be applied at the level of a single file or an entire file system and, in some systems, also to all files in a particular directory.
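As an illustration only, the following sketch contrasts one implicit and two explicit techniques available on POSIX platforms. That these are the techniques meant here is an assumption: O_SYNC makes every write() synchronous, fsync() flushes a single file on request, and sync() asks the operating system to flush all file systems. The function names are hypothetical, and short writes are ignored for brevity.

```python
import os
import tempfile

BASE_FLAGS = os.O_CREAT | os.O_WRONLY | os.O_TRUNC


def write_implicit(path, data):
    """Implicit, per file: O_SYNC makes each write() wait for stable storage."""
    fd = os.open(path, BASE_FLAGS | os.O_SYNC, 0o600)
    try:
        os.write(fd, data)
    finally:
        os.close(fd)


def write_explicit(path, data):
    """Explicit, per file: the application decides when to call fsync()."""
    fd = os.open(path, BASE_FLAGS, 0o600)
    try:
        os.write(fd, data)
        os.fsync(fd)
    finally:
        os.close(fd)


def flush_file_systems():
    """Explicit, system-wide: request a flush of all cached modifications."""
    os.sync()


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        a = os.path.join(tmp, "implicit.dat")
        b = os.path.join(tmp, "explicit.dat")
        write_implicit(a, b"implicit")
        write_explicit(b, b"explicit")
        with open(a, "rb") as fa, open(b, "rb") as fb:
            return fa.read(), fb.read()


if __name__ == "__main__":
    print(demo())
```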