Storage Failure - Openstack Cloud Application Development 1st Edition 2015 {PRG} pdf

An OpenStack instance makes use of either ephemeral storage or persistent storage, or even a combination of both. Ephemeral storage is defined as storage that may not be permanent. For example, the storage associated with the instance could be deleted if the instance itself is terminated. Persistent storage is defined as storage that is permanent. If an instance is terminated, the persistent storage associated with the instance is typically not deleted, but may be detached and made available to be attached to another instance.

Persistent storage is typically implemented as object storage or block storage. Object storage is often implemented using Swift or some other product that implements the Swift API, such as Ceph. When using object storage, containers are created and binary objects are stored inside the containers. Instances can retrieve the stored objects using the API implemented by the object storage system. Block storage shows up to instances as block devices in the operating system, which can then be mounted on a directory or used as a raw device.

Ephemeral storage is similar to block storage in that it appears to the instance as a block device. This means that instances can mount the block device on a directory or use it as a raw device. By default, OpenStack configures ephemeral storage to use the disks in the compute nodes. It is possible to configure ephemeral storage using other architectures too, but using compute node disks is the most common arrangement. Unless an instance is launched using a “boot from volume” method, the instance will be created using the compute node ephemeral storage.

One of the most common hardware failures encountered in an OpenStack environment is disk failure. Disk failures can have a wide effect on instances, depending on how OpenStack is configured and how the disk failure affects the device it is installed in. A disk failure in ephemeral storage will likely have a greater effect on instances than one in persistent storage. With ephemeral storage, data is very unlikely to be replicated, so the chance of data loss is higher. For persistent storage, data is often replicated and can be accessed in multiple ways. If a single disk fails in a persistent storage cluster, the instance may not even notice, since the data remains available and consistent.

Let’s look at the case where the instance is running on ephemeral storage. The operating system is on an ephemeral root disk that lives on a compute node. The compute node could be configured with some kind of RAID that replicates data behind the scenes. A single disk failure may not affect the instance at all, much like a single disk failure in persistent storage. However, it is not uncommon to see a RAID device experience a disk failure that results in the block device going into read-only mode inside the instance. Even though the data is still available and can be read by the instance, writes are blocked. Read-only mode usually occurs at the compute node level, which affects all the instances running on that node. Rebooting the compute node is often needed to fix the issue.

If the compute node is not configured with RAID or some kind of data replication for ephemeral storage, then a loss of a disk is usually catastrophic to the instance. If the failed disk is where that instance’s storage was located, then that data is permanently lost. The instance will need to be terminated and rebuilt.

For block storage, instances may experience problems with the mounted volume. If there are serious issues in the persistent storage cluster, a volume may become unavailable to the instance. If the volume is mounted, any reads and writes may hang, waiting for a response. If the volume is not mounted, it may refuse to mount or fail to be detected as a valid disk. This is seen more often when an instance is being launched or rebooted. If the volume is unavailable for some reason, the instance may fail to launch or may need administrator intervention to finish booting.
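One way an application could notice a hung volume is to probe it with a read that is given a deadline. The following is a minimal sketch, not a definitive health check: the function name `probe_read`, the default timeout, and the choice of a small file read are all assumptions made for illustration. Note that a read may be served from the page cache and succeed even when the underlying device is stalled, so a real probe would need to account for that.

```python
import threading


def probe_read(path, timeout_seconds=5.0, nbytes=4096):
    """Read a few bytes from path and return 'ok', 'timeout', or 'error: ...'.

    The read runs in a daemon thread so that a read stuck on an
    unresponsive volume cannot block the caller or interpreter exit.
    """
    outcome = {}

    def _read():
        try:
            with open(path, "rb") as handle:
                handle.read(nbytes)
            outcome["status"] = "ok"
        except OSError as exc:
            outcome["status"] = "error: %s" % exc

    worker = threading.Thread(target=_read, daemon=True)
    worker.start()
    worker.join(timeout_seconds)  # wait at most timeout_seconds
    if worker.is_alive():
        # The read has not returned; treat the volume as hung.
        return "timeout"
    return outcome.get("status", "error: unknown")
```

A monitoring loop could call `probe_read` on a file that lives on the attached volume and raise an alert on anything other than "ok".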

Most volume issues occur because of problems within the OpenStack environment and not necessarily with the persistent storage cluster. There may be an issue with Cinder or with the communication between Nova and Cinder. Many of these issues may not affect a volume that is already mounted in an instance and currently being used. However, they will likely affect the launching or rebooting of instances, as well as the ability to detach volumes from one instance so that they can be attached to another. In most cases, this only affects the ability to access the data and does not result in data loss.

Object storage is accessed in a different manner than block storage. Objects are pushed into storage or fetched from storage. If there are any issues in the object storage system, they usually manifest as objects being unavailable or operations timing out.
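Because object storage problems tend to surface as timeouts or unavailable objects, an application can wrap its fetches in a bounded retry. The sketch below is illustrative only: `fetch_with_retry` and its parameters are hypothetical names, and `fetch` stands in for whatever call the application actually uses to retrieve an object. It assumes that call raises `TimeoutError` or `ConnectionError` on failure, which may not match a given client library.

```python
import time


def fetch_with_retry(fetch, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call fetch() until it succeeds or the attempts are used up.

    fetch is any zero-argument callable that returns the object data
    and raises TimeoutError or ConnectionError on failure (assumed).
    The delay doubles after each failed attempt.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch()
        except (TimeoutError, ConnectionError) as exc:
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    # Every attempt failed; surface the last error to the caller.
    raise last_error
```

Bounding the number of attempts matters here: if the object store is down for an extended period, the application should fail and report the problem rather than retry forever.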

Persistent storage is most often configured with some kind of replication. Replication factors of 2 or 3 are common, but there may be cases where replication is disabled for some reason. It is important to ask the administrators how replication is configured in order to better understand how failures could affect access to data and the potential for data loss.
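To see why the replication factor matters, a rough back-of-the-envelope model can help. The sketch below assumes that each replica fails independently with the same probability, which is a simplification: real failures are often correlated (shared rack, power, or controller), and storage clusters usually re-replicate after a failure. The function names are illustrative.

```python
def tolerated_failures(replication_factor):
    """Number of replicas that can be lost while one copy still survives."""
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    return replication_factor - 1


def loss_probability(replica_failure_probability, replication_factor):
    """Chance that every replica is lost, assuming independent failures."""
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    return replica_failure_probability ** replication_factor
```

Under these assumptions, a replication factor of 1 (replication disabled) tolerates no failures at all, which is why it is worth confirming the configured factor with the administrators.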

Instances have a difficult time dealing with storage devices becoming unavailable. If the application depends on data always being available, it is important that monitoring is configured to check storage availability and integrity. For ephemeral storage, most failures result in the instance going down as well. However, instances should also watch for filesystems going into a read-only state. Instances may operate fine with a read-only filesystem, especially if the application only reads data and doesn’t write data. Existing monitoring may not see the issue either, nor will it necessarily appear in any of the logs. Since a read-only filesystem is an indication that there is an underlying problem with the compute node, catching it early so the application can adapt around it is a good idea.
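A check for read-only filesystems can be fairly small. The following sketch assumes a Linux guest where mount information is available in the /proc/mounts format; the function names `read_only_mounts` and `is_read_only` are illustrative, not part of any OpenStack tooling. Some filesystems are mounted read-only on purpose, so a real check would compare the result against the mount points the application expects to be writable.

```python
import os


def read_only_mounts(mounts_text):
    """Return the mount points whose option list contains 'ro'.

    mounts_text uses the /proc/mounts layout:
    device mount_point fs_type options dump pass
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point = fields[1]
        options = fields[3].split(",")
        if "ro" in options:
            result.append(mount_point)
    return result


def is_read_only(path):
    """Report whether the filesystem holding path is mounted read-only (Unix)."""
    return bool(os.statvfs(path).f_flag & os.ST_RDONLY)
```

On a Linux instance, the application could periodically pass the contents of /proc/mounts to `read_only_mounts`, or call `is_read_only` on its data directory, and switch to a degraded mode or raise an alert when a normally writable filesystem shows up as read-only.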
