
(1)

Managing the performance of large, distributed storage systems

Scott A. Brandt

Carlos Maltzahn, Anna Povzner, Roberto Pineiro, Andrew Shewmaker, and Tim Kaldewey

Computer Science Department University of California Santa Cruz

Richard Golding and Ted Wong, IBM Almaden Research Center

LANL ISSDM Talk — July 10, 2008

(2)

Outline

• Problem: managing the performance of large, distributed storage systems

• Approach: end-to-end performance management

• General model: RAD

• Applying RAD to other resources: disk, network, and buffer cache

• Moving forward: data center performance management

(3)

Distributed systems need performance guarantees

Many distributed systems and applications need (or want) I/O performance guarantees

Multimedia, high-performance simulation, transaction processing, virtual machines, service level agreements, real-time data capture, sensor networks, ...

Systems tasks like backup and recovery

Even so-called best-effort applications

Providing such guarantees is difficult because it involves:

Multiple interacting resources

Dynamic workloads

Interference among workloads

Non-commensurable metrics: CPU utilization, network throughput, cache space, disk bandwidth

(4)

End-to-end I/O performance guarantees

Goal: Improve end-to-end performance management in large distributed systems

Manage performance

Isolate traffic

Provide high performance

Targets: High-performance storage (LLNL), data centers (LANL), satellite communications (IBM), virtual machines (VMware), sensor networks, ...

Approach:

1. Develop a uniform model for managing performance

2. Apply it to each resource

3. Integrate the solutions

(5)

Our current target

• High-performance I/O

From client, across network, through server, to disk

Up to hundreds of thousands of processing nodes

Up to tens of thousands of I/O nodes

Big, fat network interconnect

Up to thousands of storage nodes with cache and disk

• Challenges

Interference between I/O streams, variability of workloads, variety of resources, variety of applications, legacy code, system management tasks, scale

(6)

Stages in the I/O path

[Figure: the I/O path from app through client cache, network transport, server cache, and disk storage. Annotations: flow control with one client; connection management between clients; I/O selection and head scheduling; prefetch and writeback based on utilization and QoS; integration between client and server caches; an I/O scheduler at the disk.]

1. Disk I/O

2. Server cache

3. Flow control across the network

• Within one client’s session and between clients

4. Client cache

(7)

System architecture

[Figure: system architecture. A client sends a reservation request to the QoS broker, which places utilization reservations on the network, server caches, and disks across the storage servers.]

Client: Task, host, distributed application, VM, file, ...

Reservations made via broker

• Specify workload: throughput, read/write ratio, burstiness, etc.

Broker does admission control

• Requirements + workload are translated to utilization

• Utilizations are summed to see if they are feasible

• Once admitted, I/O streams are guaranteed (subject to workload adherence)

Disk, cache, and network controllers maintain the I/O guarantees (a sketch of the admission test follows)
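To make the broker's feasibility test concrete, here is a minimal Python sketch, assuming requirements and workloads have already been translated into per-resource utilizations; the function name and data layout are illustrative, not from the talk.

```python
def admit(reservations, request, capacity=1.0):
    """reservations: {resource: [utilization, ...]} already admitted;
    request: {resource: utilization} for the new I/O stream."""
    for resource, u in request.items():
        if sum(reservations.get(resource, [])) + u > capacity:
            return False  # infeasible on this resource: reject
    for resource, u in request.items():
        reservations.setdefault(resource, []).append(u)
    return True  # admitted: the stream's guarantee is now in force

# Example: a stream asking for 20% of the disk and 10% of the network.
existing = {"disk": [0.5, 0.2], "network": [0.3]}
print(admit(existing, {"disk": 0.2, "network": 0.1}))  # True
print(admit(existing, {"disk": 0.2}))  # False: disk now at 90%
```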

(8)

Achieving robust guaranteeable resources

• Goal: Unified resource management algorithms capable of providing

• Good performance

• Arbitrarily hard or soft performance guarantees with

Arbitrary resource allocations

Arbitrary timing / granularity

• Complete isolation between workloads

• All resources: CPU, disk, network, server cache, client cache

➡ Virtual resources indistinguishable from “real” resources with fractional performance

(9)

Isolation is key

• CPU

• 20% of a 3 GHz CPU should be indistinguishable from a 600 MHz CPU

• Running: compiler, editor, audio, video

• Disk

• 20% of a disk with 100 MB/second bandwidth should be indistinguishable from a disk with 20 MB/second bandwidth

• Serving: 1 stream, n streams, sequential, random

(10)

Epistemology of virtualization

• Virtual Machines and LUNs provide good HW virtualization

• Question: Given perfect HW virtualization, how can a process tell the difference between a virtual resource and a real resource?

• Answer: By not getting its share of the resource when it needs it

(11)

Observation

• Resource management consists of two distinct decisions

• Resource Allocation: How much of each resource to allocate?

• Dispatching: When to provide the allocated resources?

• Most resource managers conflate them

• Best-effort, proportional-share, real-time

(12)

Separating them is powerful!

• Separately managing resource allocation and dispatching gives direct control over the delivery of resources to tasks

• Enables direct, integrated support of all types of timeliness needs

[Figure: a two-axis classification of timeliness needs, resource allocation vs. dispatching, each constrained or unconstrained. Best-effort, CPU-bound, and I/O-bound tasks constrain neither; rate-based tasks constrain allocation; soft real-time constrains both but tolerates an occasional missed deadline; hard real-time constrains both strictly.]

(13)

The resource allocation/dispatching (RAD) scheduling model

[Figure: the RAD model. A process's rate (its share of resources) and deadlines (times at which allocation must equal share) define a series of jobs with budgets and deadlines, which are handed to the dispatcher.]

(14)

Supporting different timeliness requirements with RAD

[Figure: different timeliness requirements map onto RAD through a scheduling policy: hard real-time from period and WCET, soft real-time from period and ACET, rate-based from rate and bounds, best-effort from priorities P_i. Each yields a set of jobs with budgets and deadlines, dispatched by a common scheduling mechanism in the runtime system.]

(15)

Rate-Based Earliest Deadline (RBED) CPU scheduler

[Figure: RBED instantiates RAD for the CPU. The scheduling policy turns rates and deadlines into a set of jobs with budgets and deadlines; the scheduling mechanism dispatches them with EDF plus timers in the runtime system.]

• Processes have rate & period

∑ rates ≤ 100%

Periods based on processing characteristics, latency needs, etc.

• Jobs have budget & deadline

budget = rate × period

Deadlines based on period or other characteristics

• Jobs dispatched via Earliest Deadline First (EDF)

Budgets enforced with timers

• Guaranteeing all budgets & deadlines guarantees all rates & periods
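As a rough illustration (not the actual RBED implementation, which runs inside the Linux scheduler and enforces budgets with timer interrupts), a toy EDF dispatch over jobs with budget = rate × period might look like this in Python:

```python
import heapq

def make_job(now, rate, period):
    # budget = rate * period; deadline at the end of the current period
    return {"budget": rate * period, "deadline": now + period}

def edf_dispatch(jobs):
    """Run each job to exhaustion in earliest-deadline-first order.
    (A real scheduler interleaves jobs and uses timers; this toy
    version just checks that every budget fits before its deadline.)"""
    heap = [(job["deadline"], i) for i, job in enumerate(jobs)]
    heapq.heapify(heap)
    now = 0.0
    while heap:
        deadline, i = heapq.heappop(heap)
        now += jobs[i]["budget"]  # consume the job's full budget
        assert now <= deadline, "infeasible: rates must sum to <= 100%"
    return now

# Three processes whose rates sum to 100%: all deadlines are met.
jobs = [make_job(0, rate=0.2, period=10),
        make_job(0, rate=0.5, period=20),
        make_job(0, rate=0.3, period=40)]
print(edf_dispatch(jobs))  # 24.0 time units of work, no misses
```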

(16)

Adapting RAD to disk, network, and buffer cache

• Guaranteed disk request scheduling (Anna Povzner)

• Guaranteeing storage network performance (Andrew Shewmaker, UCSC and LANL)

• Buffer management for I/O guarantees (Roberto Pineiro)

(17)

Guaranteed disk request scheduling

• Goals

• Hard and soft performance guarantees

• Isolation between I/O streams

• Good I/O performance

• Challenging because disk I/O is:

• Stateful

• Non-deterministic

• Non-preemptable, and

• Best- and worst-case times vary by 3–4 orders of magnitude

(18)

Fahrrad

• Manages disk time instead of disk throughput

• Adapts RAD/RBED to disk I/O

• Reorders aggressively to provide good performance without violating guarantees

[Figure: I/O streams A, B, C, and best-effort (BE) pass through Fahrrad, which schedules them onto the disk.]

(19)

A bit more detail

Reservations in terms of disk time utilization and period (granularity)

• All I/Os feasible before the earliest deadline in the system are moved to a Disk Scheduling Set (DSS)

• I/Os in the DSS are issued in the most efficient way

• I/O charging model is critical

• Overhead reservation ensures exact utilization

2 WCRTs per period for “context switches”

1 WCRT per period to ensure last I/Os

1 WCRT for the process with the shortest period due to non-preemptability
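A small sketch of that overhead accounting, taking the slide's three charges literally and assuming a single worst-case request time (WCRT); the function and the example numbers are illustrative only:

```python
def overhead_utilization(periods, wcrt):
    """Disk-time utilization reserved for overhead.
    periods: reservation periods (s); wcrt: worst-case request time (s)."""
    # 2 WCRTs per period for "context switches" between streams,
    # plus 1 WCRT per period to ensure the period's last I/Os finish.
    u = sum(3 * wcrt / p for p in periods)
    # Plus 1 WCRT charged against the shortest period, since a
    # non-preemptable request in flight can straddle a deadline.
    return u + wcrt / min(periods)

# E.g., three streams at 2 s periods and one at 125 ms, 5 ms WCRT:
print(overhead_utilization([2.0, 2.0, 2.0, 0.125], wcrt=0.005))  # ~0.1825
```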

(20)

Fahrrad guarantees utilization, isolation, throughput

• 4 SRT processes, sequential I/Os, reservations = 20%

• 3 w/period = 2 seconds

• 1 with period 125 ms to 2 seconds

• Result: perfect isolation between streams

[Graphs: disk time utilization and throughput for the four streams.]

(21)

Fahrrad also bounds latency

• Period bounds latency

[Graph: fraction of I/Os vs. latency (ms), showing latency bounded by the reservation period.]

(22)

Fahrrad outperforms Linux

Workload

Media 1: 400 sequential I/Os per second (20%)

Media 2: 800 sequential I/Os per second (40%)

Transaction: short bursts of random I/Os at random times (30%)

Background: random (10%)

Result: Better isolation AND better throughput

[Graphs: throughput for each workload under standard Linux and under Linux with Fahrrad.]

(23)

Guaranteeing storage network performance

• Goals

• Hard and soft performance guarantees

• Isolation between I/O streams

• Good I/O performance

• Challenging because network I/O is:

• Distributed

• Non-deterministic (due to collisions or switch queue overflows)

• Non-preemptable

• Assumption: closed network

(24)

What we want

[Figure: three clients share the network to three servers, with streams guaranteed 30%, 50%, and 20% of the bandwidth.]

(25)

What we have

• Switched fat tree w/full bisection bandwidth

• Issue 1: Capacity of shared links

• Issue 2: Switch queue contention

(26)

Radon—RAD on the Network

• RAD-based network reservations ~ disk reservations

Network utilization and period

• Fully decentralized

• Two controls:

Window size w = # of network packets per transmission window

Offset = delay before transmitting window

• Two pieces of information:

Dropped packets (TCP)

Forward delay (TCP Santa Cruz)
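As an illustration of how a utilization-and-period reservation could translate into these two controls, here is a sketch; the packet arithmetic and the fixed number of windows per period are assumptions, not part of Radon:

```python
def radon_controls(utilization, period_s, link_bps, packet_bits,
                   windows_per_period=4):
    """Return (window size in packets, max offset in seconds)."""
    # Packets this stream may send per period under its reservation.
    budget_packets = utilization * period_s * link_bps / packet_bits
    w = max(1, round(budget_packets / windows_per_period))
    # Sending w packets back-to-back takes this long at line rate; the
    # remainder of the window's slot is laxity, usable as a random offset.
    tx_time = w * packet_bits / link_bps
    laxity = period_s / windows_per_period - tx_time
    return w, max(0.0, laxity)

# Example: 30% of a 1 Gb/s link, 10 ms period, 1500-byte (12 kbit) packets.
print(radon_controls(0.30, 0.010, 1e9, 12_000))  # ~62 packets, ~1.8 ms laxity
```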

(27)

Flow control and congestion control

• Flow control determines how much data is transmitted

• Fixed window sizes (TCP)

• Adjust window size based on congestion (TCP Santa Cruz)

• Congestion control determines what to do when congestion is detected

• Packet loss ➔ backoff then retransmit (TCP)

• ∆ Forward delay ➔ ∆ window size (TCP Santa Cruz)
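The contrast can be sketched in a few lines of Python (heavily simplified; real TCP and TCP Santa Cruz add slow start, retransmission logic, and careful delay estimation):

```python
def tcp_loss_response(window, packet_lost):
    # TCP style: packet loss triggers multiplicative backoff, then
    # the lost data is retransmitted; otherwise the window grows.
    return max(1, window // 2) if packet_lost else window + 1

def santa_cruz_response(window, forward_delay, prev_forward_delay, step=1):
    # TCP Santa Cruz style: a change in forward delay drives a change
    # in window size. Rising delay means the bottleneck queue is
    # growing, so shrink the window before any packet is lost.
    if forward_delay > prev_forward_delay:
        return max(1, window - step)
    if forward_delay < prev_forward_delay:
        return window + step
    return window  # delay stable: the queue sits at its operating point
```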

(28)

Packet-loss-based congestion control (TCP)

Christina Parsa and J.J. Garcia-Luna-Aceves, “Improving TCP Congestion Control Over Internets with Heterogeneous Transmission Media,” Proceedings of the 7th IEEE ICNP, 1999.

[Figure: the bottleneck switch queue overflows and the client observes packet loss.]

(29)

Performance vs. offered load (w/nuttcp)

• The switch performs well at offered loads below 800 Mbps

• Regardless of the # of clients

• Each client is limited to 600 Mbps due to SW/NIC issues

[Graphs: achieved load vs. offered load and packet loss vs. offered load.]

(30)

Delay-based congestion control (TCP Santa Cruz)

[Figure: the bottleneck queue settles at its operating point; the client observes increased forward delay instead of packet loss.]

(31)

The incast problem

• Elie Krevat, Vijay Vasudevan, Amar Phanishayee, David G. Andersen, Gregory R. Ganger, Garth A. Gibson, and Srinivasan Seshan, “On Application-level Approaches to Avoiding TCP Throughput Collapse in Cluster-based Storage Systems,” Proceedings of PDSW, 2007.

(32)

Three approaches

1. Fixed window size w/out offsets

Clients send data as it is available, up to the per-period budget

Expected result: lots of packet loss due to incast/queue overflow

2. Fixed window size w/per-window offsets

Clients transmit each window with a random offset in [0, laxity / num_windows]

3. Less Laxity More (sketched below)

Distributed approximation to Least Laxity First

No congestion ➔ increase window size

Congestion ➔ move toward wopt = (1 − %laxity) × wmax
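A sketch of the third approach's window update; halving the distance to wopt is an assumed control law standing in for "move toward":

```python
def less_laxity_more(w, w_max, laxity_fraction, congested, step=1):
    """One window-size update for a stream with the given laxity."""
    if not congested:
        return min(w_max, w + step)        # no congestion: grow the window
    w_opt = (1 - laxity_fraction) * w_max  # less laxity -> larger target
    return w + (w_opt - w) / 2             # move halfway toward the target

# Under congestion, an urgent stream (10% laxity) stays near w_max,
# while a relaxed stream (90% laxity) backs off much further.
print(less_laxity_more(50, 64, 0.10, congested=True))  # 53.8
print(less_laxity_more(50, 64, 0.90, congested=True))  # 28.2
```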

(33)

Early results

• Congestion detection

• Congestion response

(34)

Buffer management for I/O guarantees

• Goals

• Hard and soft performance guarantees

• Isolation between I/O streams

• Improved I/O performance

• Challenging because:

• Buffer is space-shared rather than time-shared

Space limits time guarantees

• Best- and worst-case are opposite of disk

• Buffering affects performance in non-obvious ways

(35)

Buffering in storage servers

• Staging and de-staging data

• Decouples sender and receiver

• Speed matching

• Allows slower and faster devices to communicate

• Traffic shaping

• Shapes traffic to optimize performance of interfacing devices

• Assumption: reuse primarily occurs at the client

(36)

Approach

• Partition buffer cache based on worst-case needs

1–3 periods’ worth

• Use slack to prefetch reads and delay writes

Period transformation: period into cache may be shorter than from cache to disk

Rate transformation: rate into cache may be higher than disk can support
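For instance, a stream's worst-case partition might be sized as follows (a sketch: the up-to-three-periods factor is from the slide, the rest is assumed):

```python
def partition_bytes(rate_bytes_per_s, period_s, periods_buffered=3):
    """Buffer space reserved for one stream: enough to stage or
    de-stage up to `periods_buffered` periods of its reserved rate."""
    return rate_bytes_per_s * period_s * periods_buffered

# A 40 MB/s stream with a 500 ms period reserves up to 60 MB, leaving
# any unused space as slack for prefetching reads and delaying writes.
print(partition_bytes(40e6, 0.5))  # 60000000.0 bytes
```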

(37)

Data center performance management

• Data center perf. mgmt. is currently relatively ad hoc and reactive

• RAD gives us both tools and metrics for managing the performance of large distributed systems

• CPU, client cache, network, server cache, disk

• Predict, provision, guarantee, measure, monitor, mine

• By job, node, resource, I/O stream, etc.

(38)

Goals

1. A first-principles model for data center perf. mgmt.

2. Full-system metrics for measuring performance in client processing nodes, buffer cache, network, server buffer cache, and disk

3. Performance visualization by application, client node, reservation, or device

4. Application workload profiling and modeling

5. Full-system performance provisioning and management based on all of the above

6. Online machine-learning based performance monitoring for real-time diagnostics

(39)

Conclusion

• Distributed I/O performance management requires management of many separate components

• An integrated approach is needed

• RAD provides the basis for a solution

• RAD has been successfully applied to several resources: CPU, disk, network, and buffer cache

• We are on our way to an integrated solution

• There are many useful applications: data centers, TorMesh, virtualization, ...
