6.4 Performance and Scalability
6.4.1 Scalability
The scalability of RADOS is potentially limited by three primary factors: the qual- ity of the data distribution generated by CRUSH, interaction with the monitor cluster, and the effective distribution of cluster map updates. Other elements of the system are trivially par- allelizable. In particular, failure detection, failure recovery, and replication are all bounded on each OSD by the number of peers, regardless of the cluster size. For our purposes, I assert that a sufficiently large and fast network can be constructed to service many thousands of OSDs [42]. CRUSH includes some provisions for segregating replication traffic (by keeping all replicas in- side a suitably large domain), but I otherwise consider the network to be outside the scope of this work.
OSD Cluster Size 2 6 10 14 18 22 26 Per−OSD Throughput (MB/sec) 30 40 50 60 crush (32k PGs) crush (4k PGs) hash (32k PGs) hash (4k PGs) linear
Figure 6.11: OSD write performance scales linearly with the size of the OSD cluster until the
switch is saturated at 24 OSDs. CRUSH and hash performance improves when more PGs lower variance in OSD utilization.
6.4.1.1 Data Placement
As seen in Chapter 5, CRUSH produces a distribution of data that closely matches the mean and variance of a binomial or normal distribution [101], meaning it appears random, even though it is a deterministic and constrained mapping. The primary consequence is that the variance σ2 in the number PGs per OSD—and subsequently, in OSD storage utilizations and workloads—is related to the average number of PGs per OSD (µ), whereσ2≈µ. With an average of 100 PGs per OSD, the standard deviationσis 10%; with 1000 per OSD,σ drops to 3%. This behavior holds even for large clusters composed of heterogeneous (i. e. non-uniformly weighted) devices.
Figure 6.11 shows per-OSD write throughput as the cluster scales using CRUSH, a simple hash function, and a linear striping strategy to distribute data in 4096 or 32768 PGs among available OSDs. Linear striping balances load perfectly for maximum throughput to provide a benchmark for comparison, but like a simple hash function, it fails to cope with
device failures or other OSD cluster changes. Because data placement with CRUSH or a hash is stochastic, throughputs are lower with fewer PGs: greater variance in OSD utilizations causes request queue lengths to drift apart under this entangled client workload.
Although a probabilistic data distribution means that some devices may become over- loaded (i. e. handle many more than µ PGs) with small probability, PGs can be explicitly diverted away from specific devices using the overload mechanism in CRUSH. Unlike the hash and linear strategies, CRUSH also minimizes data migration under cluster expansion while maintaining a balanced distribution.
Most importantly, the computational cost of calculating a CRUSH mapping is O(log n)
for a cluster of size n, allowing mappings in the tens of microseconds for even extremely large clusters.
6.4.1.2 Monitor Interaction
The monitor cluster is designed both for extreme reliability and for high availability. In the general case, monitors do very little work—they process small messages in response to failures, but are otherwise idle. A worst case load for the monitor cluster occurs when large numbers of OSDs appear to fail in a short period. If each OSD storesµ PGs and f OSDs fail, then an upper bound on the number of failure reports generated is on the order ofµf , which
could be very large if a large OSD cluster experiences a network partition. To prevent such a deluge of messages, OSDs throttle and batch failure reports, imposing an upper bound on monitor load proportional to the cluster size.
the first few reports of a given failure will be forwarded to the elected lead monitor. Map up- dates are quickly propagated among monitors such that subsequent reports of the same failures will be reflected by the current map and result in an immediate response to the OSD. For this reason, OSDs send heartbeats to peers at semi-random intervals to stagger detection of failures, dispersing reports for a given failure over time. Furthermore, map updates returned to a report- ing OSDs will also reflect all other failures processed to date, preventing some future failure reports from being sent.
6.4.1.3 Map Propagation
The RADOS map distribution algorithm (Section 6.2.5) ensures that updates reach all OSDs after only log n hops. However, as the size of the storage cluster scales, the frequency of device failures and related cluster updates increases. Because map updates are only exchanged between OSDs who share PGs, the hard upper bound on the number of copies of a single update an OSD can receive is proportional toµ.
In simulations under near-worst case propagation circumstances with regular map updates, I found that update duplicates approach a steady state even with exponential cluster scaling. In this experiment, the monitors share each map update with a single random OSD, who then shares it with its peers. In Figure 6.12 I vary the cluster size x and the number of PGs on each OSD (which corresponds to the number of peers it has) and measure the number of duplicate map updates received for every new one (y). Update duplication approaches a constant level—less than 20% of µ—even as the cluster size scales exponentially, implying a fixed map distribution overhead. I consider a worst-case scenario in which the only OSD
Cluster size (OSDs)
64 128 256 512 1024
Map duplication (dups/actual)
0 10 20 30 40 50 80 PGs per OSD 160 PGs per OSD 320 PGs per OSD
Figure 6.12: Duplication of map updates received by individual OSDs as the size of the cluster
grows. The number of placement groups on each OSD effects number of peers it has who may share map updates.
chatter consists of ping messages for failure detection, which means that, generally speaking, OSDs learn about map updates (and the changes known by their peers) as slowly as possible. Limiting map distribution overhead thus relies only on throttling the map update frequency, which the monitor cluster already does as a matter of course.
6.4.2 Failure Recovery
Figure 6.13 shows write throughput over time as a cluster of 20 (real) OSDs recovers from two (simulated) failures at time 30. 30 clients are writing data with 2×replication and
saturating the cluster. As the failed OSDs are initially marked down, replication throughput drops because affected PGs are temporarily unreplicated, while effective client write perfor- mance correspondingly increases. At time 50 the OSDs are marked out and recovery is initi- ated. Performance drops while active and then inactive objects are re-replicated to other OSDs,
time (seconds)
0 30 60 90 120
Cluster Throughput (MB/sec) 0 100 200 300 400 500 write replication recovery
Figure 6.13: Write throughput over time as a saturated 20 OSD cluster recovers from two OSD
failures at time 30. Data re-replication begins at time 50 and completes at time 80.
time (seconds)
0 30 60 90 120 150 Cluster Throughput (MB/sec) 0
50 100 150 200 250 write replication recovery
Figure 6.14: Write throughput over time as an unsaturated 20 OSD cluster recovers from one
but throughput eventually returns to a level slightly below baseline (due to the two missing OSDs). The throughput penalty associated with recovery is exaggerated in this experiment for two reasons. First, in such a small cluster, all OSDs either share data with the failed devices, or are selected as replacements in one or more PGs. In a large system, each OSD failure will initiate recovery for only a single PG (of a hundred or more) on each of its peers. Second, the relatively naive implementation currently makes almost no attempt to balance recovery with regular workload. Nevertheless, recovery proceeds quickly in parallel and the cluster resumes performing at only a slightly degraded level after about 30 seconds.
This is evident in Figure 6.14, which shows first one and then two OSDs failures on an only partially loaded cluster. The first recovery minimally effects throughput both because more disk bandwidth is available and because fewer PGs contain data to be replicated. The second recovery involves more PGs and more data, with a greater impact on performance.
6.5
Future Work
Although RADOS was developed for use in the Ceph distributed file system [100], the reliable and scalable object storage service it provides is well-suited for a variety of other storage abstractions. In particular, the current interface based on reading and writing byte ranges is primarily an artifact of the intended usage for file data storage. Objects might have any query or update interface or resemble any number of fundamental data structures. Potential services include distributed B-link trees that map ordered keys to data values (as in Boxwood [63]), high-performance distributed hash tables [90], or FIFO queues (as in GFS [30]).
Although RADOS manages scalability in terms of total aggregate storage and ca- pacity, this dissertation does not address the issue of many clients accessing a single popular object. I have implemented a read shedding mechanism which allows a busy OSD to shed reads to object replicas for servicing, when the replica’s OSD has a lower load and when consistency allows (i. e. there are no conflicting in-progress updates). Heartbeat messages exchange infor- mation about current load in terms of recent average read latency, such that OSDs can determine if a read is likely to be service more quickly by a peer. This facilitates fine-grained balancing in the presence of transient load imbalance, much like D-SPTF [62]. Notably, this read shedding is only possible with primary-copy replication, as the OSD servicing reads must be aware of any in-progress writes in order to preserve consistency. Although preliminary experiments are promising, a comprehensive evaluation has not yet been conducted.
More generally, the distribution of workload in RADOS is currently dependent on the quality of the data distribution generated by object layout into PGs and the mapping of PGs to OSDs by CRUSH. Although I have considered the statistical properties of such a distribution and demonstrated the effect of load variance on performance for certain workloads, the interac- tion of workload, PG distribution, and replication can be complex. For example, write access to a PG will generally be limited by the slowest device storing replicas, while workloads may be highly skewed toward possibly disjoint sets of heavily read or written objects. I have conducted only minimal analysis of the effects of such workloads on efficiency in a cluster utilizing declus- tered replication, or the potential for techniques like read shedding to improve performance in such scenarios.
failures, like corrupted disk blocks. Checksums or preemptive mechanisms like disk scrubbing would dramatically improve data safety.
The integration of intelligent disk scheduling, including the prioritization of repli- cation versus workload and quality of service guarantees, is an ongoing area of investigation within the research group [109].