• No results found

[NORMAL] Ensuring Collective Availability in Volatile Resource Pools via Forecasting

N/A
N/A
Protected

Academic year: 2021

Share "Ensuring Collective Availability in Volatile Resource Pools via Forecasting"

Copied!
30
0
0

Loading.... (view fulltext now)

Full text

(1)

Artur Andrzejak Zuse Institute Berlin (ZIB)

andrzejak[at]zib.de

Derrick Kondo INRIA

David P. Anderson UC Berkeley

Ensuring Collective Availability in

Volatile Resource Pools via

(2)

• In 2006 there were

over 1 billion PCs-in-use

• Many of them using Google / Yahoo services

• … provided by (maybe) 0.5 – 1 million PCs

• PCs-in-use

• > 1 billion (109)

• marginally utilized • Goggle / Yahoo servers

• 0.5 – 1 million (?)

• essentially commodity hardware • well utilized

(3)

• Goal: can we deploy serious services / apps

over

unreliable resources

?

• How

unreliable

?

– mostly non-dedicated PC's (used for other purposes) – e.g. volunteer computing Grids such as SETI@home – no control over availability, frequent churn

• What are "serious" services / apps?

– examples: Google Apps, MS SkyDrive, Email, .. – high availability: around 99.999

(4)

Why non-dedicated resources?

• Abundance: > 1 billion PC's-in-use since 2006

• They are cheap

– only incremental cost of electricity

– most office PC's are underutilized: unexploited assets of corporations

• Newer PC's can have many users simultaneously

– multi-core processors – 2 GB main memory

(5)

How to do this?

• Difficult to get (many) hosts with high avail

• Instead, we strive for collective availability:

– def.: guarantee that with high probability, in a group of R N hosts, at least N remain available over time T

R = 6, N = 3 4 N survived, col. availability achieved Time T

(6)

Method and Deployment

• We use statistical

and prediction methods

to

answer the question:

Given a pool of non-dedicated hosts and a request for N hosts, how to select R N of them such that the

collective availability is maximized?

• i.e. at least N among R hosts "survive" interval T

• Deployment example

Initial

group over TUsage failed?Which of next phasePrediction Replacementfrom pool over TUsage

...

x

x

x

Initial prediction

(7)

Data for Our Study

• Availability traces for over 48'000 hosts

participating in SETI@home

• Active in Dec 1st, 2007 to Feb 12th, 2008

• Availability recorded by a BOINC client

– depends whether the machine was idle

– The definition of idle depends on user settings

• Quantized to 1 hour intervals

– regarded as available only if uninterrupted avail for the whole hour – quite conservative

(8)
(9)

Filtering Hosts By Predictability

• We want to find out, for each host, whether its

availability predictions are likely to be accurate

• I.e. we want hosts with high predictability:

– def.: expected accuracy of predictions from a model build on historical data

• To estimate it, we use

indicators of predictability

– fast to compute (at least faster than a prediction model) – use only training data

A. Filter hosts by predictability => create a pool of

"good" hosts

B. Make predictions only for hosts from

(10)

A: Predictability Indicators

We have tested, among others:

– Ave. length of an uninterrupted availability segment – Size of the compressed availability trace

• traces with predictable patterns are likely to compress better – Prediction error tested on a part of the training data

(as a control indicator)

– Number of availability state changes per week (aveSwitches)

week 1

week 2

avail avail available

"switch"

aveSwitches =

4 / 2 = 2

(11)

B: Prediction

• We have used a simple and fast classifier

– naive Bayes

• The classifier takes examples i.e. vectors of

measured avail + preprocessed data over 30

days

• Predicts for each hour over two weeks

– starting now, will the host be available in the next k hours

• this is prediction interval length, pil 1 h 1 h 1 h 1 h 1 h

(12)

What drives accuracy?

• Dependence upon

– prediction interval length, pil – training interval length

(13)
(14)

Simulating Collective Availability

• A realistic deployment scenario:

• a "user" asks for a group of N hosts over time T • we give him R > N hosts and an assurance of

collective availability = probability, that at least N hosts survive T

• The major question is:

• what is the necessary redundancy (R-N)/N to achieve a given N and

• i.e. how much overprovisioning is needed?

R hosts given to "user"

N survivals

N of them survived after time T

(15)

Simulating Collective Availability

• The major question is:

• what is the necessary redundancy (R-N)/N to achieve

a given N and ?

• Our simulations answer this:

• for a fixed R, we select randomly R hosts predicted to be available over time T

• then we compute the success rate

• ratio (# experiments with at least N hosts alive after time T) / (all experiments)

(16)

The Role of Predictability

• A little twist - we select R hosts from two groups:

– low predictability, with aveSwitches 7.47

(17)

Necessary Redundancy

(18)

Necessary Redundancy

(19)

Is this Redundancy too high?

• In high predictability group, we have required

redundancy of 35%

– One of the reviewers (DSOM 2008) called this factor disappointingly high

• However, we consider this dramatically low

– In comparison, SETI@home has 200% redundancy (also used for result validation)

– In terms of absolute savings, that equates to 165 TeraFLOPS saved in a 1 PetaFLOPS system (such as FOLDING@home) => significant power savings

(20)

Migration Overhead

• We also evaluated the overhead due to host

migration, service restart between slices of len T

• Threshold

= a multiple of pil which describes the

total time (many T's) of running an app / service

• Turnover rate TR:

– let S be a set of hosts predicted to be available at t0 – for those we predict which ones become not available

after time pil, i.e. second prediction at t0+T

– TR is the fraction of hosts which change from avail to non-avail

(21)

Turnover Rates

• about 2.5% for high predictability group

• about 12% for low predictability group

(22)

Summary

• Given that host redundancy is not an issue

("cheap" resources), high collective availability is

achievable

– even with low migration costs

• Predictability assessment

and filtering

is essential

– improves accuracy

– avoids many "wasted" predictions

• Future work:

– a new "application architecture" /programming model for collective availability

– masking failures by virtualization and VM migration – a real deployment scenario – possibly at CERN?

(23)

Thank you.

Paper can be found at:

http://www.zib.de/andrzejak

/

Artur Andrzejak, Derrick Kondo, David P. Anderson

Ensuring Collective Availability in Volatile Resource Pools via Forecasting

19th IFIP/IEEE Distributed Systems: Operations and Management (DSOM 2008) Samos Island, Greece, September 22-26, 2008.

(24)
(25)

Availability Prediction

• The core of the work are efficient

and

domain-adjusted

predictions of availability for individual

hosts

– efficient:

• fast pre-selection of predictable hosts

• use simple and fast classification algorithm

– domain-adjusted

• analyzed the "causes" of predictability and adjusted our methods to them

• Then

we use these individual predictions to

(26)

Prediction Process

Observe Host Availability

Filter by High Predictability

Predict Individual Availability

Select a High-Availability Group

(27)

Why are AveSwitches good?

• The best predictability indicator is:

• Number of availability state changes per week (aveSwitches)

• There are some "reasons" for data regularity

high prediction accuracy

1. Periodic behavior, e.g. daily periodicities

2. Long runs of availability / non-availability 3. …

• We have studied which "reasons" are dominant:

– by using data preprocessing which "helps" either 1 or 2

• results show that

"reason" 2 is dominant

(28)

Predictability Computation

• To assess the accuracy of predictability indicators, we have to compute for each host the true accuracy of model-based predictions

• To this end, we train a prediction model on the historical availability data (4 weeks @ 1 hour), and then compute the prediction error on the subsequent 2 weeks (1 hour =>

2*7*24 predictions)

– This is only the "laboratory" scenario, not to be done in real deployment

• In real deployment, the predictability indicators should tell us, for which hosts it is not worth to build model / do predictions

Training Data – building model Test Data – evaluating model

(29)

And the winner is..

• Number of availability state changes per week: aveSwitches

Spearman's rank correlation coefficient indicator vs. prediction error Scatter plot aveSwitches vs. prediction error

(30)

Open Questions

• Collective Availability needs another application

architecture

– services or apps must be able to run replicated on R nodes, and tolerate failures of at most R-N nodes

• similar to e.g. Google Map-Reduce infrastructure or SETI replication for reliability schema

– if some nodes fails, they are substituted by other nodes without harm to the execution

References

Related documents

Some progress has been made towards addres- sing equity challenge, including reducing user fees at primary health care facilities and developing a health financing strategy,

The purpose of this study was to inquire into the knowledge of medical students on the Middle East respiratory syndrome (MERS) and evaluate whether infection prevention

In contrast, trade costs, factor intensities, and the real exchange rate appear to play a minor role in explaining entry and exit in export

- Mobile networks, post offices and retailers provide the most attractive opportunities for distribution of financial services products, both on reach and feasibility. - Gas

trained spend more time in research than those prac- ticing general pediatrics exdusively, or who practice a subspecialty area consistent with training. Pedia- tricians not certified

However, while insurance is an important risk management tool for companies with complex supply chains, taking out insurance should not be seen as a panacea, as AGCS’s Paul

When both husband and wife (or registered domestic partners) are covered employees, and one employee’s coverage terminates for reasons outlined in the Termination of Insurance

Now, one works on the idea of a uniformly probable response spectrum, i.e., its ordinates represent a barrier with equal probability not to be exceeded by the