Storage Grids A Technology Survey


Issue August 2006


Pages 19

Preliminary Note

Storage technology is characterized by rapid change in its underlying technologies, concepts, and solution offerings. A relatively new approach for scale-out storage applications is the storage grid. A common understanding of such a new technology naturally needs time to develop, and the topic has so far seen only limited adoption in the storage mainstream.

To get a better understanding of storage grids, Fujitsu Siemens Computers and the Heinz Nixdorf Institute of the University of Paderborn analyzed the state of the art and future directions of storage grids. This white paper includes a survey of the architectural concepts and describes the basic technologies.

Authors of this White Paper are:

Dr. André Brinkmann, Heinz Nixdorf Institute Paderborn and Sascha Effert, Heinz Nixdorf Institute Paderborn

Please note that this white paper is intended as a technology survey. Because it is a forward-looking paper, some of the products, concepts, and solutions described here are not part of the Fujitsu Siemens Computers portfolio and may not become part of it in the future.

Grid Computing and Storage Grids

Grid computing as an enabling technology has become the driving force of scientific computing in recent years. The concept is based on the pioneering work of I. Foster and C. Kesselman in the mid-1990s [4]. The motivation of grid computing is to offer computing resources as transparently as a power grid provides electrical power. The combined resources of large parallel computing centers enable scientists to solve classical scale-out problems like weather forecasting or the search for the smallest subatomic particles. Today the term grid computing is used in an even broader sense. A grid is defined as a set of distributed, heterogeneous resources that are connected by a local or wide area network and that can be used by a user or application like a single, virtual system. The grid itself forms the infrastructure, which a middleware offers to the user.

Storage grids try to transfer the idea of grid computing into the storage domain. The administration software of a storage grid can be seen as a middleware that manages the storage systems inside the grid under a unified management console. The storage systems of a storage grid can either be distributed over a wide area network or be used inside a single computing center. The term middleware here denotes a software system that manages distributed state information. The state information is generated by the creation and registration of digital entities inside the logical namespace of the storage grid (see e.g. [24], [14] and [13]).

Figure 1: Connections between clients and storage grid. The clients reach the storage bricks of the storage grid through a Gigabit Ethernet switch.

From the storage vendor perspective, there is still no common usage of the term storage grid. In any case, all commercial definitions of storage grids include the provisioning of a global, logical name space and a unified management of all elements of the storage grid environment. The key software component of a storage grid is either a distributed file system (in case of a NAS appliance) or a block-level virtualization technology (in case of scalable RAID systems), allowing the adaptation of the grid to the storage environment. The meaning of storage grids varies from scalable RAID arrays to the joint management of resources of a storage area network and up to scalable NAS-appliances. Distributed storage grids that span more than a single server room are still uncommon outside academic research. The connection between a client and a storage brick in today’s commercial storage grids is mostly based on the iSCSI, NFS, or CIFS protocol and an Ethernet network (see Figure 1).

The basic components of a storage grid are storage bricks. The term storage brick was introduced by Chaitan Baru and Bala Iyer and characterizes a storage appliance that includes not only storage capacity but also computing and communication power. The computing capabilities enable a storage brick to contain a software management stack and to act as a storage appliance. The software stack inside a storage brick is responsible for the seamless integration of the brick into the grid environment. Integrating a new storage brick therefore only involves assigning it to a storage resource pool; all other administration tasks, like authentication or rights management, are handled by the grid middleware. This seamless integration of new bricks is seen as the key to simplified management, where the management burden does not increase with the storage capacity. A characteristic property of storage grids is that adding new storage bricks not only increases storage capacity; the computing and communication resources of the new bricks also improve the performance of the entire storage grid.

A subclass of storage grids arises from the introduction of dedicated computing resources that are used not only for management and administration but also for processing the stored data. Pioneering work was done by the Parallel Data Lab of Carnegie Mellon University (see [18] and [15]). The fundamental idea of the proposed active disk architecture is that disk controllers are becoming more and more sophisticated and can be used to move data-intensive work from server systems into the storage devices. The tasks range from the optimization of disk accesses up to applications running inside the disk controllers, like the search for data elements inside a database. For huge amounts of data, this concept offers a simple approach to parallelizing complex tasks and therefore promises large performance gains.
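The active disk idea can be illustrated with a minimal sketch. The class `Brick` and the function `grid_search` are hypothetical names chosen for this illustration and are not taken from the cited work; the sketch only assumes that each brick can evaluate a predicate next to its own data and return the matching records.

```python
from concurrent.futures import ThreadPoolExecutor


class Brick:
    """Hypothetical storage brick that can run a filter next to its data."""

    def __init__(self, records):
        self.records = list(records)

    def scan(self, predicate):
        # Runs "inside" the brick: only matching records leave the device.
        return [r for r in self.records if predicate(r)]


def grid_search(bricks, predicate):
    """Send the same predicate to every brick and merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(bricks)) as pool:
        parts = list(pool.map(lambda brick: brick.scan(predicate), bricks))
    return [record for part in parts for record in part]


bricks = [Brick(range(0, 100)), Brick(range(100, 200)), Brick(range(200, 300))]
hits = grid_search(bricks, lambda r: r % 97 == 0)
```

Each brick scans only its own share of the data, so adding bricks adds scan capacity, and the server receives the few matching records instead of the full data set.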

Storage Grid Architectures

Two important aspects of storage grid hardware are the interconnection network between the nodes of the storage grid and the connections between the nodes and the persistent storage. From the software perspective, the most important performance challenge is to ensure an even balancing of the data requests among the nodes. Otherwise, peak demands can (and will) endanger system scalability.

In the following, we discuss the basic principles of the hardware architecture and present methods and solutions for the management of storage grids. The section Storage Interconnects introduces current standard technologies for direct and distributed storage interconnects. The management of storage systems is addressed in the section Storage Management, which focuses on the basic technologies storage virtualization and distributed file systems. The section Cluster Interconnects covers interconnection technologies between the grid nodes, and the section Node Management deals with the management of cluster architectures.

Storage Interconnects

DAS and SAN

The technologies for interconnecting server nodes and persistent storage fall into two categories. Direct Attached Storage (DAS) describes storage systems that are directly connected to a single server node (or very few server nodes). In contrast, the storage systems inside a Storage Area Network (SAN) are connected to the servers via a dedicated physical network. In a SAN, a large number of servers can be connected to the same physical storage systems. As we will see in subsection SAS and SAS-expander, the new serial SCSI technology builds a bridge between direct attached storage and storage networks and tries to combine the best properties of the two approaches.

ATA and SATA

The Advanced Technology Attachment (ATA) standard defines a set of interfaces for the direct interconnection of storage systems, like hard disks or CD-ROMs, and computer systems. Besides the standard name ATA, the technology has also been called Integrated Drive Electronics (IDE) or Enhanced IDE (EIDE). The names IDE and EIDE refer to the controllers, which are integrated into the storage systems (see also Figure 2).

With the market introduction of Serial ATA, the ATA standard was renamed Parallel ATA (PATA). The prefix parallel characterizes the 40/80-pin ATA cable that transports data in 16-bit chunks. The 80-pin cables were introduced with the Ultra DMA 66 standard and contain, besides the 40 pins of former ATA standards, 40 additional ground pins. These ground pins became necessary to reduce mutual capacitive coupling between the wires, which can lead to crosstalk and therefore to signal failures. The need for additional ground pins already indicates the problem of parallel interconnects. On the one hand, the mutual capacitive coupling increases with higher frequencies; on the other hand, it becomes more and more difficult to keep the parallel data in phase with increasing cable lengths. PATA is therefore only defined for cable lengths of up to 46 cm and is unsuitable as a SAN interconnect.

To overcome the drawbacks of PATA at higher frequencies and cable lengths and to enable higher data throughput, Serial ATA (SATA) was introduced in 2003. The first generation of SATA interfaces supports a signalling rate of 1.5 GBit/s and uses a Low Voltage Differential Signalling (LVDS) scheme. With differential signalling, each bit is transmitted over a pair of wires that carry the same signal with opposite polarity, so that a signal change shifts the voltage level of both wires at the same time but in opposite directions. The receiver evaluates the voltage difference, which substantially reduces the susceptibility to interference and allows cable lengths of up to 1 m while simultaneously boosting the data transfer rate. Besides the 1.5 GBit/s standard, the 3 GBit/s standard and corresponding storage devices are already available.
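The noise rejection of differential signalling can be made concrete with a small numerical sketch. The voltage swing of 0.2 and the noise values below are arbitrary illustrative assumptions, not figures from the SATA specification, and all function names are hypothetical.

```python
def drive_pair(bits, swing=0.2):
    """Encode each bit as opposite voltages on the two wires of a pair."""
    return [(swing, -swing) if bit else (-swing, swing) for bit in bits]


def couple_noise(pairs, noise):
    """Add the same disturbance to both wires of each pair (common-mode noise)."""
    return [(p + n, q + n) for (p, q), n in zip(pairs, noise)]


def receive_differential(pairs):
    # The receiver looks only at the difference, so common-mode noise cancels.
    return [1 if p - q > 0 else 0 for p, q in pairs]


def receive_single_ended(pairs):
    # For comparison: a receiver that looks at one wire against ground.
    return [1 if p > 0 else 0 for p, _ in pairs]


bits = [1, 0, 0, 1]
noise = [0.9, -0.8, 0.5, -0.6]  # assumed disturbances, larger than the swing
line = couple_noise(drive_pair(bits), noise)
```

With these values the differential receiver recovers all four bits, while the single-ended receiver misreads the last two, because the disturbance exceeds the signal swing on a single wire.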

Important technical innovations of the standard are hot swapping of storage devices and the introduction of Native Command Queuing (NCQ). NCQ enables the reordering of requests inside the storage systems to increase data throughput [7]. Furthermore, current SATA hard disks offer better reliability and a higher mean time between failures than former ATA disks. The integration of advanced error detection schemes and queuing mechanisms and the elimination of jumpers enable the usage of SATA hard disks in web servers or in RAID environments [10].
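The effect of request reordering can be sketched as follows. This is a simplified model that orders requests by block address in one upward sweep; real NCQ firmware also takes the rotational position into account, and the function names are hypothetical. The queue depth of 32 corresponds to the number of outstanding commands that SATA NCQ supports.

```python
def reorder_requests(head_lba, pending, queue_depth=32):
    """Order queued requests as one upward sweep from the head, then wrap around."""
    window = pending[:queue_depth]
    ahead = sorted(lba for lba in window if lba >= head_lba)
    behind = sorted(lba for lba in window if lba < head_lba)
    return ahead + behind


def seek_distance(head_lba, order):
    """Total head travel (in blocks) when serving requests in the given order."""
    total, position = 0, head_lba
    for lba in order:
        total += abs(lba - position)
        position = lba
    return total


pending = [900, 120, 880, 150, 870, 100]
fifo = seek_distance(500, pending)
ncq = seek_distance(500, reorder_requests(500, pending))
```

In this example the head travel drops from 4160 blocks in arrival order to 1250 blocks after reordering.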

Figure 2: ATA standards.

| Interconnect | Standard | Year | Performance | Features |
|---|---|---|---|---|
| IDE | | 1986 | | Pre-standard |
| | ATA | 1994 | | PIO modes 0-2, multiword DMA 0 |
| EIDE | ATA-2 | 1996 | 16 MB/s | PIO modes 3-4, multiword DMA modes 1-2, LBAs |
| | ATA-3 | 1997 | 16 MB/s | SMART |
| | ATA/ATAPI-4 | 1998 | 33 MB/s | Ultra DMA modes 0-2, CRC, overlap, queuing, 80-wire |
| Ultra DMA 66 | ATA/ATAPI-5 | 2000 | 66 MB/s | Ultra DMA modes 3-4 |
| Ultra DMA 100 | ATA/ATAPI-6 | 2002 | 100 MB/s | Ultra DMA mode 5, 48 Bit LBA |
| Ultra DMA 133 | ATA/ATAPI-6 | 2003 | 133 MB/s | Ultra DMA mode 6 |
| Serial ATA | ATA/ATAPI-7 | 2002 | 150 MB/s | |
| Serial ATA II | ATA/ATAPI-8? | 2004 | 300 MB/s | Enhanced queuing |
| Serial ATA III | ATA/ATAPI-9? | 2007 | 600 MB/s | |

SCSI

The Small Computer System Interface (SCSI) is the oldest of the storage interconnects addressed here. Classical SCSI hard disks use a parallel interconnect for data transfer and distinguish between dedicated data and control wires. The SCSI standard is able to address up to 8 (later 16) devices, which are connected in a daisy chain. The daisy chain connects the host bus adapter, which is typically located inside a computer system, with the SCSI controllers of the connected storage devices. Every device is connected to its predecessor and its successor. The ends of the daisy chain have to be terminated by a resistor network, and every element of the daisy chain must have a unique ID.

There are a number of versions of the parallel SCSI standard, and their maximum throughput lies between 5 MByte/s and 320 MByte/s (see Figure 3). The design of a parallel SCSI standard with a transfer rate of 640 MByte/s has been abandoned in favor of a serial SCSI architecture. The move from a parallel SCSI architecture towards serial SCSI systems is meant to overcome the drawbacks of parallel interconnects, which have already been described in section ATA and SATA, and to ensure the technological scalability of future SCSI standards.

Parallel SCSI systems have been designed as direct attached storage for attachment to a single server, and the scalability of SCSI environments is very restricted. Operating multiple servers on a single SCSI bus works poorly and is only used in some fail-over environments with two servers. Besides their usage inside server systems, parallel SCSI disks are used as reasonably priced, high-quality alternatives to Fibre Channel hard disks inside RAID environments.

Figure 3: SCSI standards.

| Interconnect | Standard | Year | Performance | Features |
|---|---|---|---|---|
| SASI | | 1979 | | Shugart Associates |
| SCSI-1 | SCSI-1 | 1986 | 2 MB/s | Asynchronous, narrow |
| SCSI-2 | SCSI-2 | 1989 | 10 MB/s | Synchronous, wide |
| SCSI-3 | | | | Split command sets, transport protocols, and physical interfaces into separate protocols |
| Fast-Wide | SPI/SIP | 1992 | 20 MB/s | |
| Ultra | Fast-20 annex | 1995 | 40 MB/s | |
| Ultra 2 | SPI-2 | 1997 | 80 MB/s | LVD |
| Ultra 3 | SPI-3 | 1999 | 160 MB/s | DT, CRC |
| Ultra 320 | SPI-4 | 2001 | 320 MB/s | Paced, Packetized, QAS |

Figure 4: SCSI architecture [8]. The figure shows three layers. Device-type specific command sets: SCSI Block Commands (SBC, e.g. disk drive), Reduced Block Commands (RBC, e.g. disk drive), SCSI Stream Commands (SSC, e.g. tape drive), SCSI Media Changer Commands (SMC, e.g. juke box), Multi Media Commands (MMC, e.g. DVD), SCSI Controller Commands (SCC, e.g. RAID), SCSI Enclosure Services (SES), Object Based Storage Devices (OSD), Bridge Controller Commands (BCC), and Automation Drive Interface Commands (ADC). Common to all devices: Primary Commands (SPC) and the Architecture Model (SAM). Transport protocols and interconnects: SCSI Parallel Interface (SPI), Serial Attached SCSI (SAS), Fibre Channel Protocol (FCP) over Fibre Channel (FC), Serial Bus Protocol (SBP) over IEEE 1394, SSA SCSI-3 Protocol (SSA-S3P) over SSA, SCSI RDMA Protocol (SRP) over InfiniBand, iSCSI over the Internet, and the Automation Drive Interface Transport Protocol (ADT).

SCSI has become much more than a standard for host bus adapters that interconnect hard disks and computer systems. Figure 4 depicts the different levels of the SCSI standard. The upper level defines a set of commands to transport data between device types and over interconnect technologies. The interconnects range from classical, parallel SCSI to serial data transfer technologies, like Fibre Channel, FireWire, and Serial Attached SCSI, and even to the iSCSI protocol, which enables the transport of block-oriented SCSI commands over the Internet.

SAS and SAS-Expander

Serial Attached SCSI (SAS) is starting to replace the present parallel SCSI interfaces on the physical transmission layer. Parallel SCSI reached its physical limits with the current Ultra-320 SCSI standard and faces problems similar to those of parallel ATA interfaces. The changeover to serial transmission technologies is meant to overcome the problems of different signal transmission times on long wires and of crosstalk at high transmission frequencies [21].

In contrast to parallel SCSI interfaces, SAS defines point-to-point connections between devices. An interesting feature of SAS controllers is that they support the direct attachment of native SAS hard disks as well as the integration of SATA disks. The reverse, attaching SAS hard disks to a SATA controller, is not possible. The Serial Management Protocol (SMP) enables the management of SAS networks, which can be built from expander devices.

Current SAS devices enable a throughput of up to 3 GBit/s (net performance 300 MByte/s) in each direction and for each physical interface, and the copper cables used allow lengths of up to 8 meters. Data can be exchanged in full duplex mode. Upcoming versions of the SAS standard should support throughput rates of 6 GBit/s and 12 GBit/s. The interconnection between SAS devices is handled by the concept of ports, where each port contains one physical interface. A set of ports of a single SAS device can be combined into one logical group that is reached under the same SAS address.
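The relation between line rate and net performance can be reproduced with a short calculation. The sketch assumes 8b/10b line coding (10 bits on the wire for every 8 bits of payload), which first-generation SATA and SAS links use; the text above only states the resulting figures, and the function name is hypothetical.

```python
def net_mbyte_per_s(line_rate_gbit, data_bits=8, coded_bits=10):
    """Net payload rate in MByte/s for a serial link with block line coding."""
    payload_bits_per_s = line_rate_gbit * 1e9 * data_bits / coded_bits
    return payload_bits_per_s / 8 / 1e6


sata_1 = net_mbyte_per_s(1.5)  # first-generation SATA
sas_1 = net_mbyte_per_s(3.0)   # current SAS, per direction and physical interface
```

Under this assumption, 1.5 GBit/s corresponds to 150 MByte/s and 3 GBit/s to 300 MByte/s, which matches the figures quoted in the text and in the ATA standards table.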

To enable redundant architectures, SAS provides for dual porting inside SAS hard disks, where each port has a different address. This means that each SAS hard disk has two ports and each port is assigned to a different host bus adapter. Besides improving reliability, this concept doubles the theoretical transfer rate to the hard disk, which, however, exceeds the capabilities of current hard disks.

Figure 5: SAS expander topology. One fanout expander device connects up to 128 edge expander device sets. Each edge expander device set consists of edge expander devices with attached end devices and holds a maximum of 128 SAS addresses. Any of the physical links could be wide.

To overcome the limitations of a single host bus adapter, SAS has introduced the concept of fanout and edge expanders (see [16]). Each fanout and edge expander has its own SAS address. Edge expanders can be grouped into sets of edge expanders, where each group contains 128 SAS addresses for up to 127 external SAS devices. A fanout expander is able to connect up to 128 groups of edge expanders, enabling up to 16256 SAS devices in a single environment. At most one fanout expander is allowed in an environment. Combining edge expanders forms tree-like networks, which are not allowed to contain cycles [15].
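The device limit and the tree constraint can be checked with a few lines of code. The constants follow the numbers given above; the function names and the device labels in the example are hypothetical, and the cycle check uses a generic union-find structure rather than any procedure defined by the SAS standard.

```python
ADDRESSES_PER_EDGE_SET = 128
EDGE_SETS_PER_FANOUT = 128


def max_end_devices(edge_sets=EDGE_SETS_PER_FANOUT,
                    addresses_per_set=ADDRESSES_PER_EDGE_SET):
    # Per the text, 128 addresses per set leave room for 127 external devices.
    return edge_sets * (addresses_per_set - 1)


def is_cycle_free(links):
    """Return True if the undirected links form a forest (no cycles)."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            return False  # a and b are already connected: this link closes a cycle
        parent[root_a] = root_b
    return True


tree = [("fanout", "edge1"), ("fanout", "edge2"), ("edge1", "disk1")]
```

Here `max_end_devices()` reproduces the 16256 devices stated above, and adding a link between `edge1` and `edge2` to `tree` would be rejected as a cycle.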

Because of its simple interconnection topology and the fact that each SAS port may only be connected to a single SAS address at any point in time, SAS is not classified as a storage area network. Nevertheless, the strict distinction between SAS environments and SANs is becoming blurred. Prototypes of the first SAS switches are already available. By interconnecting several SAS switches, customers are able to increase the size of their SAS fabrics, and by means of zoning it becomes possible to configure different storage domains.

Fibre Channel

Fibre Channel (FC) technology was developed as a backbone technology for general networks. While Ethernet has been successful for local area networks, Fibre Channel has become the most widespread technology for storage area networks. Like most modern network technologies, Fibre Channel delivers data via serial interconnects. Fibre Channel can be used in full duplex mode with up to 4 GBit/s (net throughput of 400 MByte/s in each direction). In most cases data is sent over fiber optics, although the standard is also specified for copper interconnects. An advantage of Fibre Channel is its simple, block-oriented protocol stack, which can easily be implemented in hardware inside a host bus adapter. The protocol therefore does not put a high load on the CPU.

Fibre Channel switched fabrics (FC-SW) are able to form arbitrary network topologies via point-to-point interconnections, where each connection can be used with full wire speed. Furthermore, it is possible to put multiple HBAs inside a device to increase network reliability through multipathing.

iSCSI

Ethernet in combination with the TCP/IP protocol suite has become the dominant network technology inside enterprises. The wide adoption of the technology results in low costs for the corresponding components; nearly all computer systems already integrate Ethernet network interfaces. In contrast to Fibre Channel technology, most enterprises already possess the know-how for cabling and managing Ethernet-based networks.

Internet SCSI (iSCSI) has been developed as an extension of the SCSI protocol environment for IP-based networks, so that Ethernet technology can become the foundation of storage area networks. Based on the separation of SCSI commands from the transport layer for SCSI commands (see also Figure 4), a protocol has been designed that transports block-based commands via the stream-oriented TCP/IP protocol [20].
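Why block commands need framing on a stream-oriented transport can be shown with a small sketch. The 10-byte READ(10) command descriptor block follows the usual SCSI layout, but the 4-byte length prefix is a deliberately simplified, hypothetical framing: real iSCSI PDUs carry a 48-byte basic header segment instead. The names `frame` and `StreamParser` are assumptions made for this illustration.

```python
import struct


def read10_cdb(lba, blocks):
    """Build a 10-byte SCSI READ(10) command descriptor block."""
    # opcode 0x28, flags, 32-bit LBA, group number, 16-bit length, control
    return struct.pack(">BBIBHB", 0x28, 0, lba, 0, blocks, 0)


def frame(cdb):
    # Hypothetical framing: 4-byte big-endian length prefix (not the iSCSI PDU format).
    return struct.pack(">I", len(cdb)) + cdb


class StreamParser:
    """Reassemble complete commands from a byte stream with arbitrary chunking."""

    def __init__(self):
        self.buffer = b""

    def feed(self, chunk):
        self.buffer += chunk
        commands = []
        while len(self.buffer) >= 4:
            (length,) = struct.unpack(">I", self.buffer[:4])
            if len(self.buffer) < 4 + length:
                break  # command not complete yet, wait for more bytes
            commands.append(self.buffer[4:4 + length])
            self.buffer = self.buffer[4 + length:]
        return commands


stream = frame(read10_cdb(2048, 8)) + frame(read10_cdb(4096, 16))
parser = StreamParser()
received = []
for i in range(0, len(stream), 5):  # TCP may deliver the bytes in any chunk size
    received += parser.feed(stream[i:i + 5])
```

The receiver recovers both commands intact although the chunk boundaries do not coincide with the command boundaries, which is the task the transport layer has to solve for SCSI over TCP.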

iSCSI connects computer systems as iSCSI initiators with iSCSI targets, which can be e.g. hard disks or tape devices. From the application point of view, the iSCSI target behaves similar to a locally attached SCSI device or a remotely attached Fibre Channel disk. Therefore there are no required changes to the application interfaces.

iSCSI is an end-to-end protocol that is transparent for the underlying IP network. New demands only arise for clients, servers, and storage systems, not for switches or routers. The protocol simply builds on existing TCP/IP stacks. Nevertheless, iSCSI and TCP/IP impose higher demands on the CPU than Fibre Channel networks. To reduce the CPU demand of the iSCSI and TCP/IP protocols, dedicated host bus adapters have been developed that are able to offload the iSCSI and TCP/IP processing from the server into the HBA. Currently, it is debated whether software protocol stacks or hardware offload engines are better suited to reduce the CPU load. The offload engine is able to shift the whole TCP/IP protocol stack (and sometimes even the iSCSI stack) from the CPU into the HBA, but produces a huge number of interrupts. On the other hand, modern computer systems and their memory subsystems seem to be fast enough to process the protocols completely in software and at wire speed, even for upcoming network standards like 10 Gigabit Ethernet [19].


InfiniBand

InfiniBand has been developed as a system bus that is also designed to act as an interface to storage subsystems (see also Figure 4). Up to now, InfiniBand is most widespread in high-performance computing, where its low latencies and high throughput rates significantly improve network scalability. To enable simpler system designs, the first switches that translate between InfiniBand and Fibre Channel are already available, as are the first storage systems that can be directly attached to an InfiniBand network. Because there are no native InfiniBand hard disks, these storage devices still have to build their backend network from Fibre Channel or SAS devices. From an abstract point of view, InfiniBand storage systems are still a combination of a switch that translates the InfiniBand transport protocol into other SCSI transport protocols and a standard storage system.

Besides its usage as a frontend network to the accessing hosts, InfiniBand can also be used as a backend network inside the storage systems. For example, there are NAS cluster environments in which the nodes of the cluster communicate with each other via InfiniBand to reduce network latencies and to improve throughput. Furthermore, they are able to improve performance by using Remote DMA (RDMA) protocols. Nevertheless, the interface from these NAS clusters to the clients is still based on standard Ethernet.

Storage Management

Storage Virtualization

It has been shown that the total cost of ownership (TCO) of storage systems is dominated by maintenance costs. For every dollar invested in storage hardware, an additional 4 to 8 dollars have to be invested in the administration of the storage system. Storage management solutions therefore offer a high potential for cost reductions. To fully exploit the potential of a storage solution and to reduce storage administration costs, storage management solutions in general and storage virtualization in particular appear to be essential.

Storage virtualization builds an abstraction layer between the physical implementation of storage and its logical representation seen by users and applications. Storage systems are grouped and managed in sets called storage pools. The administrator defines logical volumes built from storage pools, which can be used in the same manner as locally attached volumes. The logical properties of a virtual volume do not have to be directly related to the physical storage systems inside the corresponding storage pool. It is possible to increase the performance and reliability of the physical storage systems inside a pool by using software RAID technologies inside the storage virtualization. Furthermore, the capacity of a virtual volume can be bigger than the accumulated capacity of the underlying storage systems. However, once the used capacity of the virtual volume reaches the physical limit, additional capacity has to be added to the storage pool.
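A virtual volume that is larger than its storage pool can be sketched as follows. The classes `StoragePool` and `ThinVolume` and the extent-based accounting are hypothetical simplifications for illustration; they do not describe a particular product.

```python
class StoragePool:
    """Pool of physical extents shared by the volumes built on top of it."""

    def __init__(self, physical_extents):
        self.free = physical_extents

    def allocate(self):
        if self.free == 0:
            raise RuntimeError("pool exhausted: add capacity to the storage pool")
        self.free -= 1

    def add_capacity(self, extents):
        self.free += extents


class ThinVolume:
    """Logical volume whose extents are backed by the pool on first write only."""

    def __init__(self, pool, logical_extents):
        self.pool = pool
        self.logical_extents = logical_extents
        self.mapped = set()

    def write(self, extent):
        if not 0 <= extent < self.logical_extents:
            raise IndexError("extent outside the logical volume")
        if extent not in self.mapped:
            self.pool.allocate()  # physical space is consumed here
            self.mapped.add(extent)


pool = StoragePool(physical_extents=4)
volume = ThinVolume(pool, logical_extents=100)  # logical size exceeds physical size
for extent in (0, 1, 2, 3):
    volume.write(extent)
try:
    volume.write(4)
    exhausted = False
except RuntimeError:
    exhausted = True
pool.add_capacity(4)
volume.write(4)  # succeeds after capacity has been added
```

The fifth write fails only because the pool is physically full; after capacity has been added to the pool, the same write succeeds without any change to the logical volume.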

Storage virtualization is a basic technology and builds the foundation of high performance and cost efficient storage management. Especially for storage grid environments, storage virtualization is the key technology that ensures flexibility, ease of management, and scalability. Advantages on this level of storage management can be directly transformed into advantages on higher levels of the storage hierarchy. In the following section, we will discuss the suitability of different storage virtualization approaches in the context of storage grid environments.

Storage Virtualization for Storage Area Networks

Storage virtualization solutions can be implemented at nearly any position inside a storage environment (see Figure 6). The functionality of a virtualization solution is not restricted by its position inside the storage environment, but its usage can be restricted to dedicated elements inside the environment.

In the case of host-based volume managers, for example, the virtualization is performed directly inside the server systems. The volume manager does not include any coordination between different servers attached to the same physical storage systems. The configuration of the individual servers therefore becomes complex and error-prone, especially in large environments. Furthermore, it is not possible to implement distributed file systems inside dynamically growing environments without centralized or distributed coordination mechanisms, which disqualifies host-based volume managers as a foundation of storage grid environments.

The most widespread form of storage virtualization is its direct implementation inside mid-range and high-end storage systems. These systems are only able to apply virtualization functionality inside a storage system or on top of certified storage systems, which limits the ability to use standard off-the-shelf components for storage grid environments.

In-band storage virtualization solutions are based on a dedicated SAN appliance that is placed between the accessing servers and the physical storage systems. The advantage of an in-band solution is its independence from the operating systems and servers used. Besides the cabling of the in-band appliance, no additional changes to the servers or storage systems have to be made. The virtualization is performed directly inside the SAN appliance. The disadvantage results directly from this central position inside the SAN. Every control and data access to the storage systems has to be routed through the SAN appliance, which leads to poor scalability of the environment. Adding new servers and storage systems often also requires an upgrade of the SAN appliance. Furthermore, it becomes difficult to keep the caches of different SAN appliances consistent, which leads to poor performance, poor manageability, or data inconsistencies. In-band solutions are therefore not suited for scale-out environments like storage grids.

In the case of out-of-band storage virtualization, the virtualization of the storage environment is performed by an agent inside the accessing hosts. Inside the server, requests to a logical volume are mapped to a tuple consisting of a physical storage system and a block number on that storage system. The information needed to translate a logical into a physical address is stored inside a metadata appliance that resides outside the data path, ensuring simple management and good scalability.
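The host-side address translation can be sketched with an extent table. The table layout, the class name `ExtentMap`, and the array names are assumptions for this illustration; the text does not prescribe how the mapping is represented or how the agent obtains it from the metadata appliance.

```python
from bisect import bisect_right


class ExtentMap:
    """Host-side copy of the mapping that a metadata appliance would hand out."""

    def __init__(self, extents):
        # Each extent: (logical_start, length, storage_system, physical_start)
        self.extents = sorted(extents)
        self.starts = [extent[0] for extent in self.extents]

    def resolve(self, logical_block):
        """Map a logical block to a (storage system, physical block) tuple."""
        index = bisect_right(self.starts, logical_block) - 1
        if index < 0:
            raise KeyError(logical_block)
        start, length, system, physical_start = self.extents[index]
        if logical_block >= start + length:
            raise KeyError(logical_block)
        return system, physical_start + (logical_block - start)


vmap = ExtentMap([(0, 1000, "array-A", 5000), (1000, 500, "array-B", 0)])
target = vmap.resolve(1200)
```

After the lookup, the host sends the request directly to the resolved storage system, so the metadata appliance is not involved in the data transfer itself.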

The split-path technology inside the line cards (ASICs) of a storage switch is closely related to out-of-band virtualization. In both cases, the performance of the system scales with the number of ports inside the storage environment, and the management of metadata is performed outside the data path. The scalability of both solutions is linear in the computing performance, the number of servers, and the number of storage systems.

A survey of storage virtualization technologies is also available as the Fujitsu Siemens Computers white paper "Storage Virtualization", based on last year's technology reviews.

Figure 6: Positions of storage virtualization inside a SAN environment. The figure shows hosts with LVM, an in-band appliance, an out-of-band SAN appliance, a storage switch, and storage arrays, and marks the control path, the data path, and the combined control and data path.

Object Storage Devices

This section highlights the influence of object-based storage interfaces, which have been developed at the Parallel Data Lab (PDL) of Carnegie Mellon University, on the development of distributed file systems and shows how these interfaces can be used inside active storage grid architectures. Traditionally, a client accesses its data on a storage system via a block interface and has to manage the layout of its file system on the disk. The layout of the files over the storage systems and the corresponding management of inode structures as well as the data transport are seen as the main tasks of a file system. If these tasks can be handled inside the storage systems, it would be possible to relieve the file system load of a server by more than 90% and to scale performance with the number of attached storage systems [9].

Object storage devices enable exactly this transfer of the inode management into the storage systems. The basic principle of this approach is shown in Figure 7. In the traditional model, shown in the left part of the figure, the file system receives requests via the system call interface. The user component of the file system checks the access rights of the user and manages statistics about the number of accesses and access times. Based on inode structures, which are stored on the storage devices, the file system manages the assignment between files and their location on the storage systems. Furthermore, the file system is responsible for sending user data to the storage systems and reading user data from the storage systems via the block interface. The storage system only has to execute requests received via this block interface. In the object-based storage model, the storage component of the file system is moved into the storage systems themselves. Accesses to storage devices are not performed by a standard block-level SCSI command, but by a command to an object, where each file can be handled as a single object. An object is a logical entity that contains a contiguously addressable data region. Besides the user data, the object contains metadata, like the physical position of the object on the storage device, and user attributes, like Quality of Service (QoS) requirements or requirements concerning the replication of the object. The object-based storage model contains exactly one root object for each storage device, user objects to store data, and group objects to manage the set of user objects.

Access to an object is performed similarly to a file access by using an object ID and an offset inside the object. The command set is defined in the T10 standard Object-Based Storage Device Commands, which passed the T10 committee in 2004 [27]. In contrast to file accesses, a client has to obtain a time-limited, encrypted access grant for the object from a metadata server before accessing the object on a storage system. After the grant expires, the object cannot be accessed again until the grant is renewed. The grants ensure that every computer system can in principle access every object, but that no computer system can manipulate objects for which it is not authorized. Managing object rights inside a metadata server requires much less effort than managing all inode information, so object-based file systems become easier to scale than traditional file systems.
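The interplay of metadata server, grant, and storage device can be sketched as follows. The sketch is not the T10 capability format: it swaps in an HMAC-signed token where the text speaks of an encrypted grant, assumes a key shared between metadata server and storage device, and uses hypothetical function names and a hypothetical rights string.

```python
import hashlib
import hmac
import time

# Assumption: metadata server and storage device share this secret key.
SECRET = b"shared-between-metadata-server-and-osd"


def _tag(object_id, rights, expires):
    message = f"{object_id}:{rights}:{expires}".encode()
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


def issue_grant(object_id, rights, ttl_s, now=None):
    """Metadata server side: hand out a time-limited grant for one object."""
    now = time.time() if now is None else now
    expires = int(now + ttl_s)
    return {"object_id": object_id, "rights": rights,
            "expires": expires, "tag": _tag(object_id, rights, expires)}


def osd_check(grant, object_id, operation, now=None):
    """Storage device side: verify the grant without asking the metadata server."""
    now = time.time() if now is None else now
    expected = _tag(grant["object_id"], grant["rights"], grant["expires"])
    return (hmac.compare_digest(expected, grant["tag"])
            and grant["object_id"] == object_id
            and operation in grant["rights"]
            and now < grant["expires"])


grant = issue_grant(object_id=42, rights="r", ttl_s=60, now=1000)
forged = dict(grant, rights="rw")  # client tries to extend its own rights
```

The storage device accepts a read on object 42 while the grant is valid and rejects writes, expired grants, and the forged grant, all without contacting the metadata server on the data path.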

[Figure 7: Comparison of (a) the traditional model, in which the user component and the storage component of the file system both reside above the block interface, and (b) the OSD model, in which the storage component of the file system resides behind the object interface inside the storage device.]


The concept of object storage devices required only 10 years from its introduction at Carnegie Mellon University (see e.g. [5] and [6]) until its ratification as a SNIA/T10 standard. Very interesting new developments based on this concept already appeared during the standardization, with the Lustre file system perhaps being the most important one. Further developments show an interesting property of object-based storage devices for storage grid architectures. Interfaces are able to work directly on objects and do not have to consider the internal representation of the object by a file system. Based on this concept, it becomes much simpler to enable data access from heterogeneous operating systems.

XAM

Today’s object-based storage systems are able to reduce the server load by coupling storage and file system functionality inside the storage systems and by using their computing resources. Based on this concept, the SNIA is currently defining the eXtensible Access Method (XAM) standard. According to SNIA, XAM “is an application to storage interface which empowers metadata to instrument the automation of information-based management”. XAM will give applications a standard interface to access object storage devices to achieve interoperability, storage transparency, and automation for ILM-based applications [1].

As outlined by SNIA, XAM is especially targeted at the archiving of information. An important aspect in this process is the metadata, which is stored together with files, images, or audio information. The metadata makes XAM well suited for gathering statistical information. Besides the search for data, the metadata allows applications to search inside an archive based on dedicated rules, e.g. to convert between different file formats if these formats grow old and will probably be unreadable in the near future. Furthermore, it can be assumed that the storage systems will often be exchanged long before the stored data becomes obsolete. XAM integrates interfaces to transparently move data between storage systems without erasing the metadata. The standardization process of XAM started at the end of 2004. Version 1.2 of the proposed standard is currently handled by the SNIA FCAS-TWG (Fixed Content Aware Storage Technical Work Group).

File Systems for Storage Grids

A major distinction between storage grid architectures and standard storage environments is the parallel data access to a storage grid. Each brick of a storage grid is able to deliver a consistent interface to the data. From a file server perspective, storage grids are intended to overcome performance bottlenecks, which arise from the ever-growing data volume and the growing number of accessing clients.

To enable parallel access to a single, distributed file system, a number of different approaches have been introduced. These approaches range from network file systems, over the combined management of different file servers under a single management console, handled by a file system virtualization layer, up to the provisioning of parallel and cluster file systems. The terminology for the different distributed file system approaches is not consistent throughout the literature. Inside this document, we try to use a consistent nomenclature for distributed file systems that corresponds as far as possible with its general usage.

[Figure: Timeline of the OSD development from 1995 to 2006, showing CMU NASD, NSIC NASD, SNIA/T10 OSD, the OSD V1 standard, OSD V2, the IBM / Seagate / Emulex OSD V1 prototype, Lustre, and Panasas.]


Distributed File Systems

Inside a distributed file system (DFS), the clients as well as the servers and storage systems are distributed over the nodes of a distributed system. In contrast to a standard, single-user file system, the data does not have to be stored inside a central repository, but can be spread over multiple, mutually independent storage nodes. The term distributed file system can be seen as a superordinate concept for file systems which enable access from a set of clients; it does not impose a certain implementation. In some implementations of a distributed file system, the servers have to run on dedicated computer systems, whereas other approaches allow the joint usage of client and server software inside a single machine. A characteristic of a distributed file system is that the different servers and clients are independent of each other as far as possible.

In contrast to the definition used inside this document, the file system protocols NFS and CIFS, which enable a parallel access from multiple clients to a file server, are also often referred to as distributed file systems. In this document, we will name these file system protocols network file systems (see next section).

From the user perspective of a distributed file system, the DFS should behave similarly or even identically to a conventional, locally centralized file system. The number of servers inside the system and the data distribution should be transparent to the applications running on client systems. The mapping from the files to the systems storing the data should be completely handled by the DFS.

In a conventional file system, the time for a data access is mainly determined by the accesses to the storage systems. In a DFS, there are additional costs for handling the access inside the server process and for the data transport over the network. The performance aim of a distributed file system is, from the perspective of an accessing client, to be as fast as a conventional, centralized file system. The aggregated bandwidth of all clients should, of course, exceed the performance of a single-user file system by several orders of magnitude.

It is desirable for a distributed file system that the clients are not only able to concurrently access the same file from different computer systems, but that these files can also be changed in parallel without causing data inconsistencies between the different clients. The control mechanisms can either be directly incorporated into the distributed file system or be provided by a separate protocol.

Network File System

Network file systems have been designed for concurrent access to files and resources by different users over a local or wide area network. The first file servers based on network file systems were deployed throughout the seventies; the most widespread network file system, the Network File System (NFS), whose name has become a synonym for this class of distributed file systems, was introduced in 1985. Further important network file systems are the Andrew File System (AFS) and the Common Internet File System (CIFS). A network file system must not be confused with an actual file system implementation. From the user perspective, network file systems offer file server functionality to their clients. The interface is based on primitive operations, like opening or closing a file. The file management is performed by a local file system inside the file server, and the file itself is either stored on a directly attached secondary storage system or accessed through a SAN.

NFS

The success of the original version of NFS is based on its simple architecture and the free access to the protocol specification, which has been published by Sun as a free internet standard [26] [1]. NFS enables a client to mount a directory from a server and to integrate the directory into its device tree. The access to the files is transparent for applications running on the client system; files can be accessed in the same manner as locally stored files. The administration of an NFS server is quite simple; the administrator only has to specify the exported directories and to define a set of client systems and their access rights. The communication between clients and servers is defined by the machine-independent eXternal Data Representation (XDR), which is independent of the operating systems used and of the network infrastructure [24].

The aim of the NFS development has been simple error handling and a resource-efficient architecture that is also well suited for servers with modest memory and computing resources and that is able to support a large number of clients. This aim has been reached by avoiding internal state inside the NFS protocol. All information required for an access is contained in a single request or its response. After the failure of a client or a server, no state information is lost.


The performance of NFS environments can be improved by caching on the client and the server side. On the server side, the number of accesses to storage devices is reduced by the buffer cache of the local file system. On the client side, a request is only sent to a server if its answer is not already available inside the local cache of the client. To overcome consistency problems between different clients, NFS has introduced cache aging. If a data block stays inside the client cache longer than defined by a given timeout value, the request has to be sent to the NFS server. Opening a file inside the cache also triggers a request to the server to get the last modification time of the file. If the last modification occurred after the file was placed into the cache, the file is cleared from the cache and has to be re-requested from the server. Furthermore, all metadata has to be synchronized with the server every 30 seconds. Despite these mechanisms inside the NFS protocol, it is possible that data inconsistencies occur.
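The cache aging described above can be sketched in a few lines of Python. This is a simplified model and not the NFS client implementation: the class AgingCache, the dictionary that stands in for the server, and the assumption that client and server share one clock are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class CacheEntry:
    data: bytes
    cached_at: float  # time at which the entry was placed into the client cache


class AgingCache:
    """Client cache whose entries are only trusted for `timeout` seconds."""

    def __init__(self, server: dict, timeout: float):
        # `server` maps a file name to a (data, mtime) tuple and stands in for the NFS server.
        self.server = server
        self.timeout = timeout
        self.entries = {}          # file name -> CacheEntry
        self.server_requests = 0   # counts every request that reaches the server

    def _fetch(self, name: str, now: float) -> bytes:
        self.server_requests += 1
        data, _mtime = self.server[name]
        self.entries[name] = CacheEntry(data, now)
        return data

    def read(self, name: str, now: float) -> bytes:
        entry = self.entries.get(name)
        if entry is None or now - entry.cached_at > self.timeout:
            return self._fetch(name, now)  # missing or aged out: ask the server
        return entry.data                  # fresh enough: served from the cache

    def open(self, name: str, now: float) -> bytes:
        # Opening a file always asks the server for the last modification time.
        self.server_requests += 1
        _data, mtime = self.server[name]
        entry = self.entries.get(name)
        if entry is None or mtime > entry.cached_at:
            return self._fetch(name, now)  # modified after caching: reload
        return entry.data
```

A read that falls inside the timeout window returns the cached copy even if another client has changed the file on the server in the meantime, which is exactly the kind of inconsistency that remains possible.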

The original NFS protocol follows a centralized concept and is intended for single file server systems. This single server approach would limit the scalability of the server environment. Nevertheless, it is relatively simple to build parallel NFS servers based on NFS version 3. If NFS version 4 is used for the communication between client and server computers, additional aspects have to be considered (see [22]). NFS version 4 introduces a lock mechanism that is able to ensure consistency of a file on the byte level. Due to this lock mechanism, the protocol becomes stateful. The lock management is performed inside the NFS version 4 server and not, as in previous versions, inside an external Network Lock Manager (NLM).

NFS version 4 introduces a number of new and interesting features. A set of requests can be joined into a single combined request, which is executed inside the NFS server and whose result is returned in a single reply. The protocol has been enhanced to enable efficient usage in wide area networks. For example, encryption of data has become part of the specification. In former versions of the NFS protocol, ciphering was possible through Secure RPC, but this mechanism was seldom used, because it was not available in all installations.

Furthermore, it is planned to integrate object-based storage elements into the next upcoming NFSv4 release [23]. The object-based approach should help to distribute the load between clients and servers by integrating a dedicated metadata management. Under the term parallel NFS (pNFS), a client gets a grant to a file from a metadata server and accesses the corresponding storage system directly. Therefore, it becomes possible to scale the performance of NFS servers with the number of attached client systems. Nevertheless, the corresponding internet draft explicitly points out that the transfer of additional responsibilities to the clients can lead to security vulnerabilities if the storage systems do not support fine-grained security.

CIFS

The term Common Internet File System (CIFS) has been introduced by Microsoft and denotes an enhanced version of the Server Message Block (SMB) protocol. SMB is a communication protocol for file, print, and other services in network environments and is especially widespread in Microsoft Windows based environments. It is also used inside the freely available Samba and Samba-TNG software, which enable the use of Microsoft Windows based resources inside UNIX environments. SMB is not a file system in the proper sense, but it can be compared with NFS services.

SMB was introduced in 1983 by Barry Feigenbaum from IBM. Over time, the protocol has been heavily extended by different companies and groups like Microsoft, SCO, Thursby, IBM, Apple, and the Samba team. Most extensions are delivered by Microsoft and are not freely available to the public. Originally, SMB used the NetBIOS port 139 and the name resolution was performed by NBT name services or via DNS, but it can also run directly in TCP/IP networks.

File System Virtualization

If files are delivered by single file servers inside each department of a company, then the addition of new storage capacity or the movement of files between different file servers can involve heavy administration overhead. Furthermore, data cannot be accessed during data movements, and productivity losses can arise. Comparable to storage virtualization on the block level, virtualization solutions for file servers have been developed, which are able to reduce the administration complexity of distributed file servers. Similar to block level virtualization, file system virtualization groups a set of file servers into a single entity that is managed by a central instance. File system virtualization solutions can be based on hardware or software concepts. The transparent movement of data between different servers, which is enabled by file system virtualization, is seen as a key technology for Information Lifecycle Management (ILM).


An example of a software-based approach is the Microsoft Distributed File System, often abbreviated as DFS. It consists of a number of services for client and server systems, which enable an efficient administration of a set of file servers by using a single distributed file system. As defined for standard distributed file systems, DFS enables the user of the file system to abstract from the location, in this context from the server, of a file. File systems of different servers can be managed and used under a single root directory. When the client accesses a file, it receives a link to the target server that enables the client to transparently read and write data on the target server.
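The idea of a single root directory whose folders point to different servers can be sketched as follows. This is a generic illustration and not the Microsoft DFS protocol: the LINKS table, the server and share names, and the function resolve are hypothetical.

```python
# Hypothetical namespace: folders below the single root and the (server, share) they point to.
LINKS = {
    "/projects": ("server1", "/projects"),
    "/projects/archive": ("server3", "/archive"),
    "/home": ("server2", "/home"),
}


def resolve(path: str) -> tuple:
    """Return (server, path on that server) for a path below the common root."""
    best = None
    for prefix in LINKS:
        if path == prefix or path.startswith(prefix + "/"):
            if best is None or len(prefix) > len(best):
                best = prefix  # the longest matching link wins
    if best is None:
        raise KeyError(f"no link covers {path}")
    server, share = LINKS[best]
    # Keep the part of the path below the link and append it to the share.
    return server, share + path[len(best):]
```

Moving a folder to another file server then only requires changing one entry of the table; the paths used by the clients stay the same.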

Distributed DFS-based systems are integrated into the Active Directory service inside Microsoft Windows environments. To increase the reliability of the environment, it is possible to store information about the DFS root on multiple servers. Information about files and root directories can be replicated by the Microsoft File Replication Service (FRS). Advanced services, like an improved DFS Replication Service, have been added to Windows Server 2003 R2.

Cluster File Systems

A cluster file system is a symmetrical, out-of-band file system. Inside a cluster file system, each node can be client, server, and metadata server at the same time, and the responsibility for the storage capacity can be distributed over all clients inside the cluster.

A drawback of this approach results from the frequent misuse of the cluster file system concept, when nodes run computing-intensive applications and are used as file servers at the same time. Furthermore, it is possible that the failure of a single node leads to the failure of the whole cluster if the internal storage of the nodes is used as cluster storage. Therefore, it is fundamental to include proper replication and failover mechanisms in the cluster file system. Besides the use of internal hard disks, it is possible to attach the cluster nodes to a SAN and thereby to overcome problems related to the usage of internal storage.

Another possibility to overcome reliability and management issues inside cluster file systems is the introduction of a storage virtualization layer below the file system layer. The virtualization solution has to pool the available storage resources, which can be the hard disks inside the cluster nodes or SAN storage, and to present them as logical disks to the cluster nodes. Besides a simplified dynamic management of storage capacity, the virtualization layer becomes responsible for adding redundancy information, which enables a failover after a node failure. Inside large cluster environments, it is important to ensure that the recovery after a node or hard disk failure can be performed as fast as possible. This can be achieved, e.g., by a parallel recovery process in which each node of the system takes part.

An important aspect of cluster file systems is the metadata management that has to be performed by every participating server. By default, the metadata is managed by the node that accesses the corresponding file. If multiple servers access the same file, the management of the metadata has to be delegated dynamically between the accessing servers. Access to the metadata has to be managed by a higher-ranking manager that is not responsible for the actual metadata handling and that can be implemented in a distributed fashion.

Parallel File Systems

The distinction between a parallel file system and a cluster file system is that the parallel file system contains a dedicated metadata server, respectively a dedicated metadata cluster. Before a client can access a file, it has to get the corresponding grant from the metadata server. Based on this grant, the client can read from or write to a storage system.

The purpose of a parallel file system is to provide a more structured file system by separating the tasks into metadata management and data accesses. The metadata server has a single purpose, so no conflicts with other tasks inside the server can arise and block the metadata server. If the metadata server is not built as a scalable cluster, it is in principle possible that bottlenecks occur in the metadata handling, which lead to a decreased scalability of the storage environment. Nevertheless, Lustre, as one of the most scalable file systems, only supports a single active metadata server without really restricting scalability.

Object-based parallel file systems: Parallel file systems can be grouped into object-based parallel file systems and SAN file systems. In an object-based parallel file system, a file is in general handled as an object. This object is managed by an object storage server (OSS). Before accessing this object, the accessing computer system has to get a grant for this object. As described in section “Object Storage Devices”, this grant is encrypted and has to be sent to the object-based storage devices with every data access. After the grant expires, it is not possible to access the object again without its renewal. If a server inside a cluster fails, it is not necessary to perform expensive fencing protocols to ensure the consistency between the servers and to prevent the failed server from accessing data. Instead, it is only necessary to wait for the grant to expire to ensure that access from the failed server to the data is impossible.

This concept increases security inside SAN environments. A server is only able to access a file system that is explicitly assigned to it. Even in bigger organizations, this mechanism ensures that data cannot be accessed from a different department unless the corresponding permissions have been assigned to that department. Furthermore, scaling an object-based parallel file system beyond the boundaries of a single server room becomes much simpler. The security mechanisms even enable the usage of parallel file systems over a wide area network.

An object storage server can access its objects either through directly attached storage devices or through a SAN. It is important to ensure that after the failure of an OSS its tasks can be seamlessly transferred to another OSS. Therefore, nearly the same precautions have to be taken as for cluster file systems. To fail over from one OSS to another OSS, it is necessary that both storage servers are able to access the same storage devices. This can be ensured by an attachment to a storage area network, by a storage virtualization solution on the block level, or by an integrated block level virtualization inside the parallel file system.

At the moment, most parallel file systems do not yet contain much storage management functionality like data replication, snapshots, data migration, or hierarchical storage management. Nevertheless, most of them are able to export CIFS, NFS, or even pNFS directly.

Block-based parallel file systems: The architecture of a block-based parallel file system with its partitioning into metadata server and storage systems is similar to the architecture of an object-based parallel file system. However, the access of a client to the data is sent directly over a storage area network to a hard disk or RAID system, not to an object-based storage device. Therefore, block-based parallel file systems are also called SAN file systems.

The advantage of a SAN file system compared to an object-based file system is its simpler architecture: no dedicated OSS has to be integrated between the clients and the storage devices. On the other hand, the absence of object storage servers can be seen as a big drawback of SAN file systems, because it is nearly impossible to restrict a client from accessing data. A badly configured client or a client that deliberately tries to damage the system is able to corrupt all data stored on storage systems which are accessible from this client. If a SAN file system is deployed enterprise-wide, it is in principle possible that all departments are able to access all data in a misconfigured environment. If access to data is granted across the boundaries of a server room, further security issues occur.

SAN file systems are currently deployed by storage system vendors and vendors of scalable server architectures.

Summary File Systems

The previous sections have shown that cluster file systems and the different forms of parallel file systems are able to scale from a few storage nodes up to hundreds of nodes and that these file systems are able to build the foundation for additional services.

The usage of file system virtualization is well suited for the consolidation of distributed NAS servers. In principle, it is also possible to tune the performance of these NAS servers based on the mapping between files and servers. Nevertheless, the fixed mapping between files and servers can easily lead to bottlenecks inside scale-out architectures or to a huge administration overhead and therefore restricts the performance and manageability in large environments.

Regarding concrete implementations of the different file system types, it can often be seen that there is no support for the replication of data or that this support is only available as one item on the file system's roadmap. Many file systems assume that the servers are connected over a SAN with a set of RAID systems or that at least the internal storage of the servers is protected by a RAID scheme. Nevertheless, it has been shown that even the usage of RAID 5 is not sufficient to permanently secure data against node failures in large environments [2]. It follows that scale-out architectures require the usage of a RAID 6 scheme or comparable approaches to cope with the failure of at least two hard disks or storage nodes.
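Why tolerating two failures matters in large environments can be made plausible with a deliberately simple model. The following sketch is not the analysis from [2]: it assumes that all disks of one redundancy group fail independently with the same probability within a given time window (for example the time needed to rebuild a failed disk), and the function name and all numbers are assumptions chosen for illustration.

```python
from math import comb


def p_data_loss(disks: int, p_fail: float, tolerated: int) -> float:
    """Probability that more than `tolerated` of `disks` fail in the same window."""
    # Sum the binomial probabilities of all failure counts that the scheme survives.
    p_survive = sum(
        comb(disks, k) * p_fail**k * (1 - p_fail) ** (disks - k)
        for k in range(tolerated + 1)
    )
    return 1.0 - p_survive


# A scheme that tolerates one failure (as RAID 5 does) versus two (as RAID 6 does).
for disks in (8, 64, 512):
    single = p_data_loss(disks, 0.001, tolerated=1)
    double = p_data_loss(disks, 0.001, tolerated=2)
    print(f"{disks:4d} disks: one failure tolerated {single:.2e}, two tolerated {double:.2e}")
```

Under these assumptions the loss probability grows quickly with the number of disks, and tolerating a second failure always lowers it for the same group size; real failure behaviour is more complex, e.g. because failures are correlated.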


Cluster and parallel file systems in particular have often been designed for usage in high performance computing (HPC) environments, which make high demands on the computing performance, but lower demands on availability. Therefore, reconfiguring these file systems often requires the environment to be taken offline, leading to productivity losses. Furthermore, many distributed file systems do not support standard management interfaces, e.g. for taking snapshots via Microsoft VSS.

When selecting a distributed file system for storage grid environments, it is also important to consider the scalability limitations of the file system concerning the size of a single file system and the maximum number of supported nodes inside the storage grid. Since the scalability to hundreds of nodes should not be prevented by the limitations of the file system, it is important that the file system supports file system sizes of at least 256 TByte. Otherwise, the storage grid architecture can become fragmented very soon, which makes it difficult to manage.

Cluster Interconnects

The nodes of a storage grid, the storage bricks, can be connected with each other via different network technologies. These interconnects between the nodes differ in terms of bandwidth, latency, and cost. Nevertheless, the scalability properties of a storage grid are significantly influenced by the physical properties of the interconnection technology between the nodes.

The bandwidth of an interconnection technology is defined as the maximum amount of data that can be transferred per second over a link. The term bandwidth does not distinguish between user data and protocol data, like the header or tail of an Ethernet frame. Nevertheless, information that is required for the regeneration of data, like the 25 % overhead of the 8B/10B encoding, is not considered as being part of the bandwidth.
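The relation between the raw signalling rate of a link and its bandwidth in the sense used here can be written down in a short Python sketch. The function names are chosen for illustration only; the example uses the 1.25 GBaud signalling rate of fibre-based Gigabit Ethernet, which carries 10 code bits for every 8 data bits.

```python
def payload_bandwidth(line_rate_bit_s: float, data_bits: int = 8, code_bits: int = 10) -> float:
    """Bandwidth that remains after removing the line-code bits from the raw signalling rate."""
    return line_rate_bit_s * data_bits / code_bits


def coding_overhead(data_bits: int = 8, code_bits: int = 10) -> float:
    """Extra bits on the wire relative to the data bits (0.25 for 8B/10B)."""
    return (code_bits - data_bits) / data_bits


# 1.25 GBaud on the wire correspond to a bandwidth of 1 GBit/s.
print(payload_bandwidth(1.25e9) / 1e9, "GBit/s")
print(coding_overhead() * 100, "% overhead")
```

The 25 % figure relates the two additional code bits to the eight data bits; measured against the ten bits on the wire, the same encoding occupies 20 % of the signalling rate.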

The latency of a network can either be defined as the time that a packet needs for the transfer between two nodes or as roundtrip time. The time to send a packet from one node to another node starts with the transfer of the first bit from the source node and ends with the arrival of the first bit at the destination node. The time from the arrival of the first bit at the destination node until the arrival of the last bit at the destination node is defined as transmission delay.

The roundtrip time of a packet is defined as the time that a packet needs from the source node to the destination node and back from the destination node to the source node. The start and the end of the roundtrip time are defined similarly to the start and end of the latency between two nodes. An advantage of the roundtrip time is that it can easily be measured by software tools, like the ping service, without synchronizing the clocks between the two nodes.
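The three quantities defined above can be related to each other in a small Python sketch. The function names and the example values are assumptions for illustration; in particular, estimating the one-way latency as half of the roundtrip time is only valid if both directions behave symmetrically.

```python
def transmission_delay(packet_bits: int, bandwidth_bit_s: float) -> float:
    """Time between the arrival of the first and the last bit of a packet."""
    return packet_bits / bandwidth_bit_s


def time_until_last_bit(latency_s: float, packet_bits: int, bandwidth_bit_s: float) -> float:
    """Latency (first bit sent to first bit received) plus transmission delay."""
    return latency_s + transmission_delay(packet_bits, bandwidth_bit_s)


def one_way_estimate(roundtrip_s: float) -> float:
    """Rough latency estimate from a measured roundtrip time, assuming a symmetric path."""
    return roundtrip_s / 2.0


# A 1500 byte packet on a 1 GBit/s link needs 12 microseconds to arrive completely.
print(transmission_delay(1500 * 8, 1e9) * 1e6, "microseconds")
```

For small packets the latency dominates the total transfer time, whereas for large transfers the transmission delay, and therefore the bandwidth, becomes the limiting factor.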

Measuring the latency between two nodes does not only include the transport of data over a link, but also the stack that is required to send the data. In case of an Ethernet interconnection between two nodes, this stack is usually based on the TCP/IP protocol and is most often implemented in software. This software stack includes the movement of data inside the computer system and the latency therefore also measures the performance of the computer hardware and the performance of the software implementation. In more complex environments, where the source node and the destination node are connected via a switch, the latency of the switch also influences the measured latency.

From the perspective of the physical implementation of the interconnection technologies, the different approaches for cluster environments become more and more similar. All technologies offer, or will offer in the near future, a bandwidth of 10+ GBit/s. Differences can be seen on the protocol layer. Ethernet is closely related to the TCP/IP protocol stack that is used in local area networks and in wide area networks. Due to the protocol and software overhead of the TCP/IP protocol, latencies of Ethernet interconnects are still an order of magnitude higher than latencies of dedicated cluster interconnects.

Node Management

An important aspect, especially for active storage grids, is the configuration and administration of the storage bricks. Active storage grids will become more and more similar to standard cluster architectures, enabling the user to send (in principle) arbitrary tasks to the storage grid. Therefore, in an active storage grid, the administration of the nodes does not only include storage tasks, but also the scheduling of dedicated processes, e.g. for transforming files between different formats or for the encryption of data, or even the scheduling of previously unknown user processes, which are sent to the active storage grid. Currently, there are a number of different approaches for the scheduling of tasks inside parallel computing environments, with a diversity of functionalities.

The different approaches for the management of tasks inside a cluster environment can be distinguished based on their scheduling approach. One approach allocates a set of nodes for a process for a defined time period, enabling well-specified resource planning. Another approach is based on queue-based task management. A new task has to wait until all other tasks in front of it inside the queue have finished. Therefore, it is only possible to send tasks to the system which will be performed at an arbitrary point in time, and the results will be sent back after the tasks have finished.
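The difference between the two scheduling approaches can be illustrated with a short Python sketch. It does not describe a particular scheduler product; the function fifo_schedule, the class ReservationTable, and the node names are hypothetical, and the queue-based variant assumes that the tasks run one after another on a single resource.

```python
def fifo_schedule(durations):
    """Queue-based management: each task starts when all tasks before it have finished."""
    start, plan = 0.0, []
    for d in durations:
        plan.append((start, start + d))  # (start time, finish time) of this task
        start += d
    return plan


class ReservationTable:
    """Reservation-based management: nodes are booked for fixed time windows."""

    def __init__(self, nodes):
        self.bookings = {n: [] for n in nodes}  # node -> list of (begin, end)

    def reserve(self, nodes, begin, end):
        # Reject the request if any requested node is already booked in an overlapping window.
        for n in nodes:
            if any(begin < e and b < end for b, e in self.bookings[n]):
                return False
        for n in nodes:
            self.bookings[n].append((begin, end))
        return True
```

With the reservation table, a user knows in advance when the task will run; with the queue, the start time of a task depends on the durations of all tasks in front of it.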

Trends for Storage Grids

In the field of interconnection technologies for storage systems, there is a clear trend towards serial interconnects between the computer systems and the storage systems. After parallel interconnects reached their physical limits with the introduction of the SCSI Ultra-320 technology, both low-cost hard disks based on SATA technology and enterprise hard disks based on the SAS standard use serial interconnects. The attractiveness of these technologies lies not only in their higher performance or the simpler cabling schemes, but also in the interoperability between SATA hard disks and SAS controllers and in the possibility to use the SAS technology to build (still simple) SAN environments.

The resulting opportunities are only used in a rudimentary way at the moment. Currently, there are only a few JBOD or RAID systems available which are based on the SAS technology and which can be used by a set of servers or which can use both SAS and SATA hard disks. Furthermore, SAS switches are only available as prototypes or as small edge expanders. Nevertheless, based on a Gartner study, SAS hard disks will already have the biggest market share in multi-user hard disks in 2007, and their market share will grow up to 40 % in 2009.

Based on an increase of the market share and the resulting competition, the prices for SAS products are expected to fall. Furthermore, SATA hard disks with their high capacities will become an interesting alternative, e.g. in the archiving market. The increase in performance and in life span of SATA hard disks compared to parallel ATA disks makes them attractive, especially in price-sensitive environments. Furthermore, it is possible to overcome the performance drawbacks of SATA hard disks in scale-out architectures by an efficient scheduling of the available spindles.

Examining interconnection technologies for HPC cluster environments, (of course) the same trend towards higher bandwidth, lower latencies, and serial interconnects can be observed. The biggest market momentum can be seen for the InfiniBand technology, which is supported by a broad set of leading computing manufacturers. It will be interesting to observe the development from 1 GBit/s Ethernet to 10 GBit/s Ethernet. The widespread use of Ethernet in enterprises and the resulting cost advantages can make 10 GBit/s Ethernet an attractive alternative in HPC and cluster environments, even if the latencies are higher than the latencies of dedicated cluster interconnects. Here, it will be interesting to evaluate next generation TCP/IP offload engines in terms of performance and reduction of the CPU load. Furthermore, the broad introduction of 10 GBit/s Ethernet is eagerly awaited by iSCSI vendors in order to become competitive with the FC protocol, which already offers a bandwidth of 4 GBit/s.

For the management of distributed storage environments, both block-based storage virtualization and distributed file systems have proven their usability. Based on these technologies, it becomes possible to dramatically simplify data management in cluster environments. Here, new object-based file systems in particular will become interesting. These file systems are based on object storage devices (OSD), which offer an object-based extension of the SCSI command set. OSDs enable the file system to abstract from the inode management and to transfer it into the storage systems. This abstraction leads to higher-performance file systems by offloading the most work-intensive part of the file system into the (parallel) storage environment. Object-based file systems are especially well suited for archiving data, because individual files can be stored as objects and assigned to a dedicated OSD node. This node can be made responsible for managing the object, e.g. in terms of an ILM strategy.
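The division of labour described above can be illustrated with a small in-memory model in Python. This is a toy sketch under stated assumptions, not the T10 OSD command set and not the design of any particular file system: the classes `ToyOSD` and `ToyFileSystem`, the 4096-byte block size, the CRC-based placement, and the `retention_days` attribute are all hypothetical. The point is that the file system only keeps a mapping from path to (OSD, object id), while block allocation, the inode-like part, stays inside the device.

```python
import zlib
from dataclasses import dataclass, field

BLOCK_SIZE = 4096  # assumed block size of the toy device


@dataclass
class ToyOSD:
    """Hypothetical object storage device that allocates its own blocks."""
    blocks: dict = field(default_factory=dict)      # block number -> bytes
    layout: dict = field(default_factory=dict)      # object id -> block numbers
    attributes: dict = field(default_factory=dict)  # object id -> attributes
    next_block: int = 0
    next_object: int = 0

    def create(self, **attrs):
        oid = self.next_object
        self.next_object += 1
        self.layout[oid] = []
        self.attributes[oid] = dict(attrs, length=0)
        return oid

    def write(self, oid, data):
        # Replace the whole object; allocation is invisible to the caller.
        for block in self.layout[oid]:
            del self.blocks[block]
        self.layout[oid] = []
        for offset in range(0, len(data), BLOCK_SIZE):
            self.blocks[self.next_block] = data[offset:offset + BLOCK_SIZE]
            self.layout[oid].append(self.next_block)
            self.next_block += 1
        self.attributes[oid]["length"] = len(data)

    def read(self, oid):
        return b"".join(self.blocks[block] for block in self.layout[oid])


class ToyFileSystem:
    """Keeps only the namespace; never sees block numbers."""

    def __init__(self, osds):
        self.osds = osds
        self.namespace = {}  # path -> (OSD index, object id)

    def write_file(self, path, data, **attrs):
        if path not in self.namespace:
            index = zlib.crc32(path.encode()) % len(self.osds)
            self.namespace[path] = (index, self.osds[index].create(**attrs))
        index, oid = self.namespace[path]
        self.osds[index].write(oid, data)

    def read_file(self, path):
        index, oid = self.namespace[path]
        return self.osds[index].read(oid)
```

Because per-object attributes such as the hypothetical `retention_days` live on the device next to the data, the OSD node itself could enforce a policy like an ILM rule without involving the file system.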

Target Markets for Storage Grids

The added value of storage grid architectures is their scalability in terms of capacity and performance, combined with simple, centralized management. Storage grids aim to overcome existing infrastructure problems in the interaction between server and cluster architectures and the storage systems. Today, NAS architectures often become bottlenecks when they are attached to high performance computing clusters and have to generate thousands of files per minute. Here, the parallel generation of files can lead to a substantial increase in performance.

Many industry analysts do not yet see storage grids as a mature technology that already addresses a broad market of enterprise customers. Established storage vendors are only now starting to develop storage grid products and regard them as long-term options. Applications of storage grid environments are therefore still restricted to niche markets and special market segments.

Delivery subject to availability. We reserve the right to make technical changes without notification, and to correct mistakes or omissions.

All terms and conditions (TCs) stated are recommended introductory prices, given in euro without VAT (unless otherwise indicated). All hardware and software names used are trade designations and/or registered trademarks of their corresponding manufacturer.

Issued by: Simon Kastenmüller Telephone: ++49 (0)89 3222 1934 Simon.Kastenmueller@fujitsu-siemens.com http://www.fujitsu-siemens.de/ Extranet: http://extranet.fujitsu-siemens.de/

Literature

1. B. Callaghan, B. Pawlowski, P. Staubach, and SUN Microsystems. NFS Version 3 Protocol Specification. Internet-Standard RFC 1813, June 1995.

2. ClusterFS. Lustre 1.4.6 Operations Manual. User documentation, 2006.

3. R. Elliott. Serial Attached SCSI - General overview. Presentation, September 2003.

4. I. Foster and C. Kesselman. The Grid - Blueprint for a New Computing Infrastructure. Morgan Kaufmann Publishers, 2nd edition, 2003.

5. G. Gibson, D. Nagle, and K. Amiri. File server scaling with network-attached secure disks. In Proceedings of ACM International Workshop on Measurement and Modeling of Computer Systems (Sigmetrics 97), pages 272-284, June 1997.

6. G. A. Gibson and R. Van Meter. Network attached storage architecture. Comm. of the ACM, 43(11):37-45, November 2000.

7. A. Huffman and J. Clark. Native Command Queuing (NCQ). Joint Whitepaper from Intel and Seagate, July 2003.

8. International Committee for Information Technology Standards (incits) Technical Committee T10. SCSI Standards Architecture. T10.org Web Page, 2006.

9. A. Krimkevich. Object Based Storage. Presentation at NAS Industry Conference, 2005.

10. H. Mason, S. Peiffer, and H. Smith. SAS and SATA team up for enterprise. SNS Europe, 6(3):28-30, May 2006.

11. M. Mesnier, G.R. Ganger, and E. Riedel. Object based storage. IEEE Communications Magazine, pages 84-90, August 2003.

12. D. McAdam. Recovery Management. The New Business Data Protection Requirement. Technical report, Data Mobility Group, May 2005.

13. R. Moore and A. Jagatheesan. Data grid management systems. In Proceedings of the 21st IEEE/NASA Conference on Mass Storage Systems and Technologies (MSST), April 2004.

14. A. Rajasekar, M. Wan, R. Moore, and T. Guptil. Data grids, collections and grid bricks. In Proceedings of the Twentieth IEEE/NASA Conference on Mass Storage Systems and Technologies (MSST), San Diego, USA, April 2003.

15. Rancho Technology. 3 Gb 12 Port SAS Edge Expander. Press Release, November 2004.

16. Rancho Technology. RTSASR-12X 12-Port Serial Attached SCSI (SAS) Edge Expander. White Paper, June 2005.

17. E. Riedel, C. Faloutsos, G. R. Ganger, and D. F. Nagle. Data Mining on an OLTP System (Nearly) For Free. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, USA, May 2000.

18. E. Riedel, G. A. Gibson, and C. Faloutsos. Active Storage for Large-scale Data Mining and Multimedia Applications. In Proceedings of the 1998 Very Large Data Bases conference (VLDB), August 1998.

19. P. Sarkar, S. Uttamchandani, and K. Voruganti. Storage Over IP: When Does Hardware Support Help? In Proceedings of the 2nd Usenix Conference on File and Storage Technologies (FAST 2003), March 2003.

20. J. Satran, K. Meth, C. Sapuntzakis, M. Chadalapaka, and E. Zeidner. Internet Small Computer Systems Interface (iSCSI). Internet-Standard RFC 3720, April 2004.

21. Seagate. SCSI Inflection Point: The New Era of Serial Attached SCSI. White Paper TP-528, 2004.

22. S. Shepler, B. Callaghan, D. Robinson, R. Thurlow, C. Beame, M. Eisler, and D. Noveck. Network File System (NFS) version 4 Protocol. Internet-Standard RFC 3530, April 2003.

23. S. Shepler, M. Eisler, and D. Noveck. NFSv4 Minor Version 1. Internet Draft, August 2006.

24. R. Srinivasan. XDR: External Data Representation Standard. Internet Standard RFC 1832, 1995.

25. H. Stockinger, O. Rana, R. Moore, and A. Merzky. Data management for grid environments. In Proceedings of the European High Performance Computing and Networks Conference, Amsterdam, Holland, 2001.

26. SUN Microsystems. NFS: Network File System. Internet-Standard RFC 1094, March 1989.

27. R. Weber. Information technology - SCSI Object-Based Storage Device Commands (OSD). T10 Working Draft Project T10/1355-D, July 2004.

Contents

Preliminary Note

Grid Computing and Storage Grids

Storage Grid Architectures

Storage Interconnects

Storage Management

Storage Virtualization

Object Storage Devices

Block Interface

Object Interface

Applications

Block I/O Manager

XAM

File Systems for Storage Grids

Cluster Interconnects

Node Management

Trends for Storage Grids

Target Markets for Storage Grids

Literature