Predictive Analytics. Omer Mimran, Spring Challenges in Modern Data Centers Management, Spring


Academic year: 2021


(1)

Predictive Analytics

Omer Mimran, Spring 2015

(2)

Information provided in these slides is for educational purposes only

(3)

Agenda

Motivation

• Predicting the jobs' resource requirements

Background and challenges

• Predictive analytics, data-stream mining (DSM)

• System overview

DSM algorithms

• Regression tree, Hoeffding tree, Multiple sliding windows (MSW)

Summary & conclusions

(4)
(5)

Reminder: RM lectures I–III

Each job comes with resource requirements, e.g., 2 cores × 8GB

• Specified by the user submitting the job, based on his experience, etc.

Scheduler picks the job (RM-I) and matches it with a server (RM-II)

• Best fit, worst fit, etc.

(6)

Why do we need predictive analytics?

What if the jobs (users) request too many resources?

• 8GB while in practice the job only uses 4GB of memory?

Very common problem resulting in huge waste of resources ($$ loss)

• Even if resource matching was done optimally (RM-II lecture)

Our goal (predictive analytics)

Provide prediction for the actual resource usage of the jobs (focusing on memory)

Forward this information to the scheduler to do the matching

(7)

Background and challenges

(8)

Predictive analytics

Predictive analytics: “encompasses a variety of techniques from statistics, modeling, machine learning, and data mining that analyze current and historical facts to make predictions about future, or otherwise unknown, events.” (Nyce, Charles, 2007)

Machine Learning: “Field of study that gives computers the ability to learn without being explicitly programmed.” (Arthur Samuel, 1959)

An Introduction to Data Mining/Machine Learning

General methodology (CRISP-DM):

1. Divide the data into 3 sets (training, testing, validation)

2. Use training set to create models and testing set to measure performance

3. Use validation set to select best model & test model generalization
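The three-way division in step 1 can be sketched as follows. This is a minimal illustration, not the course's code: the 60/20/20 proportions, the fixed seed and the function name `three_way_split` are assumptions made for the example.

```python
import random

def three_way_split(records, train_frac=0.6, test_frac=0.2, seed=0):
    """Shuffle records and cut them into training, testing and validation sets.

    The fractions are illustrative assumptions; the slides do not specify them.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_test = round(len(shuffled) * test_frac)
    training = shuffled[:n_train]
    testing = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]   # whatever is left
    return training, testing, validation

training, testing, validation = three_way_split(range(100))
```

Models would be built on `training`, measured on `testing`, and the best one chosen on `validation`, following the roles the slide assigns to each set.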

(9)

Data-stream mining (DSM)

Data-stream: continuous (endless) and rapid incoming data

Idea: apply machine-learning techniques on-line, on the data stream

Key challenges:

1. Performance: infeasible to store/train all data, each sample is processed once

2. Quality: expected to perform at least as well as “no-stream” models

3. Adaptability: non-stationary stream, the underlying model must be altered accordingly

4. Availability: must be available for prediction at all times
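The first challenge (each sample is processed once and nothing is stored) can be illustrated with a one-pass mean and variance. The sketch below uses Welford's update, which is a standard technique chosen here for illustration; it is not taken from the slides, and the class name `RunningStats` is hypothetical.

```python
class RunningStats:
    """One-pass mean and variance: each sample is seen once and then dropped."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Population variance of everything seen so far.
        return self._m2 / self.n if self.n else 0.0

stats = RunningStats()
for memory_gb in [1, 4, 1, 5, 2]:   # hypothetical stream of memory readings
    stats.update(memory_gb)
```

The statistics are available after every sample, which also matches the availability requirement: a prediction can be served at any point in the stream.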

(Bifet, et al., 2010; Domingos & Hulten, 2001; Aggarwal, 2007; Gama & Rodrigues, 2007; Gaber, et al., 2005; Babcock, et al., 2002)

(10)

Adaptivity challenge: concept drift

• Concept drift: scenarios in which the distribution of a certain population changes over time; hence, statistical inference is affected

(Kelly et al., 1999)

• Concept-drift types:

1. Sudden: easier to detect, with fewer examples

2. Gradual: harder to detect, often mistaken for random noise

3. Incremental: occurs over a long period of time

4. Recurring contexts: appear in a cyclic manner

(Tsymbal, 2004; Gama & Castillo, 2006; Zliobaite, 2009)

• Possible treatments:

1. Resetting the training data (Klinkenberg, 2004; Cohen, et al., 2008)

2. Training a shadow model (Domingos & Hulten, 2000; Ikonomovska & Gama, 2008; Bifet & Gavaldà, 2009)

(11)

Concept drift in reality

Bursts in jobs’ core and memory requirements

• Ohad Shai, Edi Shmueli, and Dror G. Feitelson, “Heuristics for resource matching in Intel's compute farm”. In Job Scheduling Strategies for Parallel Processing, Walfredo Cirne and Narayan Desai, (ed.), Springer-Verlag, 2013

(12)

Performance challenge: sliding windows

Using time windows

• A common technique in stream mining

• Better performance

• Also addresses concept drift

Time-window types

1. Landmark window: maintaining data, starting from identified relevant point

2. Tilted window: maintain all data within a window in different aggregate scales

3. Sliding window: only recent examples are stored in the window

(13)

Performance challenge: sliding windows

Example:

• The accuracy of protein-structure prediction, using KNN with sliding windows of varying length (Chen, Kurgan, & Ruan, 2006)

Challenges in Modern Data Centers Management, Spring 2015 13

The problem: how to set the window size?

Too short – lower statistical validity and stability

(14)

System overview

(15)

System overview – input from the users

Job characteristics

• User, project, priority, command-line, resource requirements, etc.

Data only known at submission time

Categorical variables with many possible values


(16)

System overview

[Figure: system overview diagram, step 1]

(17)

System overview – output of the model

Prediction example

• If command = “A” and project = “Tablet” then memory=4GB

• If command = “B” and project = “Mobile” then memory=6GB

• If priority = “1” and user team = “uncore” then memory=2GB

• If project = “ServerX” then memory=16GB

• …


(18)

System overview

[Figure: system overview diagram, steps 1 to 3]

(19)

System overview – output of the scheduler

Scheduler matches the jobs with machines/servers (RM-II lecture)

• Using the predicted values (not the original values specified by the user)

• More jobs fit in → higher throughput → $$ saving
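A small worked example makes the throughput argument concrete. It reuses the 8GB requested versus 4GB actually used figures from the motivation slide; the 32GB server, the ten identical jobs and the simple in-order packing are assumptions for illustration only, not the scheduler's matching policy.

```python
def jobs_that_fit(server_memory_gb, job_memory_gb):
    """Count how many jobs, taken in order, fit on one server."""
    used = 0
    count = 0
    for need in job_memory_gb:
        if used + need > server_memory_gb:
            break
        used += need
        count += 1
    return count

requested = [8] * 10   # memory the users asked for, in GB
predicted = [4] * 10   # memory the model predicts, in GB

by_request = jobs_that_fit(32, requested)      # packing by user request
by_prediction = jobs_that_fit(32, predicted)   # packing by prediction
```

With these numbers the same server hosts twice as many jobs when the scheduler packs by prediction.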


(20)

System overview

[Figure: system overview diagram, steps 1 to 3]

(21)

System overview – input to the model

Job characteristics

• User, project, priority, command-line, etc.

Actual resources consumed by the jobs

• e.g., memory


(22)

Performance measurements & objective

Measurements calculated per job once completed & available in DB

• Calculating actual runtime/memory consumption vs. prediction

Objective: maximum savings + minimum of 95% accuracy

• i.e., minimize resource waste, while ensuring that 95% of the jobs will not be under-estimated (otherwise they might be killed by the scheduler)

Measurements (calculated for all jobs which got memory prediction):

• \[ \text{Accuracy} = \frac{\text{number of jobs with memory consumed} < \text{memory prediction}}{\text{number of jobs}} \]

• \[ \text{Saving} = \text{job runtime} \times \text{memory prediction} \]
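The accuracy measurement can be sketched as below. The record layout (`consumed_gb`, `predicted_gb`) and the sample jobs are hypothetical; only the definition itself (jobs whose consumed memory is below the prediction, divided by the jobs that received a prediction) comes from the slide.

```python
def accuracy(jobs):
    """Fraction of predicted jobs whose consumption stayed below the prediction."""
    predicted = [j for j in jobs if j["predicted_gb"] is not None]
    if not predicted:
        return None
    covered = sum(1 for j in predicted if j["consumed_gb"] < j["predicted_gb"])
    return covered / len(predicted)

# Hypothetical completed jobs; the last one never received a prediction.
completed = [
    {"consumed_gb": 3.5, "predicted_gb": 4.0},
    {"consumed_gb": 1.8, "predicted_gb": 2.0},
    {"consumed_gb": 6.5, "predicted_gb": 6.0},   # under-estimated job
    {"consumed_gb": 9.0, "predicted_gb": 16.0},
    {"consumed_gb": 2.0, "predicted_gb": None},
]

acc = accuracy(completed)
meets_objective = acc >= 0.95   # the 95% target stated in the objective
```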

(23)

DSM algorithms

(24)

Challenge

Data available for learning

1. Jobs characteristics: user, project, command-line, etc.

2. Actual resources consumed by the jobs, e.g., memory

Output

(25)

DSM algorithms: regression tree – idea

[Figure: regression tree. The root splits on Priority (> 5 / <= 5), a lower level splits on NumOfLoops (> 0 / = 0), and the leaves hold the values 3, 4 and 2.]

New job: Priority=1, NumOfLoops=2

New Job Memory Prediction = 4GB

Fast Incremental Regression Tree with Drift Detection (FIRT-DD) (Ikonomovska et al., 2009)

(26)

DSM algorithms: regression tree – steps

1. Construct a tree using the Chernoff bound, comparing the standard deviation reduction (SDR) of all possible values as the split criterion

[Figure: all candidate variable values are tested; Priority value 5 is found to reduce the standard deviation (STDEV) the most, so the node is split into Priority <= 5 and Priority > 5.]
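The split search can be sketched with the commonly used definition of standard deviation reduction, SDR = sd(S) - Σ (|S_i| / |S|) · sd(S_i), which is assumed here rather than quoted from the slides. The sample rows and the names `sdr` and `best_split` are hypothetical, and the Chernoff-bound check that decides whether enough examples have been seen is left out.

```python
from statistics import pstdev

def sdr(rows, variable, threshold, target="memory_gb"):
    """Standard deviation reduction of splitting rows at variable <= threshold."""
    left = [r[target] for r in rows if r[variable] <= threshold]
    right = [r[target] for r in rows if r[variable] > threshold]
    if not left or not right:
        return 0.0
    everything = left + right
    n = len(everything)
    return (pstdev(everything)
            - len(left) / n * pstdev(left)
            - len(right) / n * pstdev(right))

def best_split(rows, variable, target="memory_gb"):
    """Try every observed value (except the largest) as a threshold."""
    thresholds = sorted({r[variable] for r in rows})[:-1]
    return max(thresholds, key=lambda t: sdr(rows, variable, t, target), default=None)

# Hypothetical jobs: low-priority jobs use 4GB, high-priority jobs use 2GB.
rows = [{"priority": p, "memory_gb": m}
        for p, m in [(1, 4), (3, 4), (5, 4), (6, 2), (8, 2), (9, 2)]]

threshold = best_split(rows, "priority")
```

On this toy data the best threshold is 5, mirroring the Priority <= 5 / Priority > 5 split shown on the slide.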

(27)

DSM algorithms: regression tree – steps

2. Sliding window size is a predefined parameter


[Figure: tree splitting on Priority (> 5 / <= 5) and NumOfLoops (<> 0 / = 0), with one node labeled 2.]

Sliding window size = 5

• Window before: jobs 1,4,1,5,2

• New job value 3 added, job value 1 discarded

• Window after: jobs 4,1,5,2,3

• Re-calculate prediction: median = 3
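The window update in this example can be reproduced with a bounded queue. This is a minimal sketch using the numbers from the slide; using `collections.deque` is an implementation choice made for illustration.

```python
from collections import deque
from statistics import median

# A fixed-size window of the five most recent job values in one leaf.
window = deque([1, 4, 1, 5, 2], maxlen=5)
before = median(window)

window.append(3)          # the oldest value (1) is discarded automatically
after = median(window)    # re-calculated prediction for the leaf
```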

(28)

DSM algorithms: regression tree – steps

3. Adaptivity

• Track the error rate using the statistical Page-Hinkley (PH) test

• Grow a shadow sub-tree and replace the original once its accuracy is better

[Figure: tree with splits on Priority (<= 5 / > 5) and NumOfLoops (> 0 / = 0). A node marked "High Error" gets a shadow sub-tree that splits on CommandNum (> 10 / <= 10), and the error rates of the two are compared.]
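The error tracking can be sketched with the textbook form of the Page-Hinkley test, which raises an alarm when the cumulative deviation of the error from its running mean rises far enough above its historical minimum. The parameter values (`delta`, `threshold`) and the synthetic error stream are assumptions for illustration, not the settings used by FIRT-DD.

```python
class PageHinkley:
    """Page-Hinkley test for detecting an increase in the mean of a stream."""

    def __init__(self, delta=0.005, threshold=5.0):
        self.delta = delta            # tolerated magnitude of change
        self.threshold = threshold    # alarm level (often written as lambda)
        self.n = 0
        self.mean = 0.0
        self.cumulative = 0.0
        self.minimum = 0.0

    def update(self, error):
        """Feed one error value; return True when drift is signalled."""
        self.n += 1
        self.mean += (error - self.mean) / self.n
        self.cumulative += error - self.mean - self.delta
        self.minimum = min(self.minimum, self.cumulative)
        return self.cumulative - self.minimum > self.threshold

# Synthetic errors: small for 50 jobs, then large after a sudden drift.
errors = [0.1] * 50 + [1.0] * 50
detector = PageHinkley()
first_alarm = next((i for i, e in enumerate(errors) if detector.update(e)), None)
```

In a tree learner, an alarm at a node would be the trigger for growing the shadow sub-tree described above.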

(29)

DSM algorithms: Hoeffding tree – idea


[Figure: classification tree. The root splits on Project (B / A), a lower level splits on Command Type (X / Y), and the leaves are labeled fail, fail and pass.]

New job: Project=A, CommandType=X

New job predicted to fail

Hoeffding Adaptive Tree (HAT)

(30)

Entropy & information gain

Go to the beach    Weather
Yes                Sunny
Yes                Sunny
Yes                Sunny
No                 Sunny
Yes                Overcast
Yes                Overcast
No                 Overcast
No                 Overcast
No                 Rain
No                 Rain
No                 Rain
No                 Rain

P(Beach = Yes) = 5/12, P(Beach = No) = 7/12

\[ \mathrm{Entropy}(Beach) = -\tfrac{5}{12}\log_2\tfrac{5}{12} - \tfrac{7}{12}\log_2\tfrac{7}{12} = 0.98 \]

P(Beach = Yes | Weather = Sunny) = 3/4, P(Beach = No | Weather = Sunny) = 1/4

\[ \mathrm{Entropy}(S_{sunny}) = -\tfrac{3}{4}\log_2\tfrac{3}{4} - \tfrac{1}{4}\log_2\tfrac{1}{4} = 0.81 \]

\[ \mathrm{Entropy}(S_{overcast}) = 1, \qquad \mathrm{Entropy}(S_{rain}) = 0 \]

In general:

\[ \mathrm{Entropy}(S) = -\sum_{i=1}^{n} p_i \log_2 p_i \]

(31)

Entropy & information gain

Entropy(Beach) = 0.98, Entropy(S_sunny) = 0.81, Entropy(S_overcast) = 1, Entropy(S_rain) = 0

P(sunny) = P(overcast) = P(rain) = 4/12

Entropy(Beach | Weather) = P(sunny)*Entropy(sunny) + P(overcast)*Entropy(overcast) + P(rain)*Entropy(rain) = 4/12(0.81) + 4/12(1) + 4/12(0) = 0.6


By knowing the weather, how much information have I gained?

Gain = Entropy(X) - Entropy(X|Y)
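The beach example can be checked numerically. The sketch below recomputes the entropy, the conditional entropy and the resulting gain from the twelve observations on the slide; the function and variable names are chosen for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

# (weather, went to the beach) pairs, as in the slide's table.
observations = ([("Sunny", "Yes")] * 3 + [("Sunny", "No")]
                + [("Overcast", "Yes")] * 2 + [("Overcast", "No")] * 2
                + [("Rain", "No")] * 4)

beach = [b for _, b in observations]
h_beach = entropy(beach)

# Entropy(Beach | Weather): per-weather entropies weighted by their share.
h_conditional = 0.0
for weather in {w for w, _ in observations}:
    subset = [b for w, b in observations if w == weather]
    h_conditional += len(subset) / len(observations) * entropy(subset)

gain = h_beach - h_conditional
```

Rounded to two decimals this gives an entropy of 0.98, a conditional entropy of 0.60 and a gain of 0.38.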

(32)

DSM algorithms: Hoeffding tree – steps

1. Construct a tree using information gain as the split criterion and the Hoeffding bound statistical test as a stopping condition

• Information gain is calculated for all candidate variables

• If G(best attr.) − G(2nd best) > ε*, split the leaf on the best attribute

* ε = Hoeffding bound statistic

[Figure: node split using the Project variable, with branches Project = B and Project = A.]
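The split condition can be sketched as a comparison between the gain difference and the Hoeffding bound ε = sqrt(R² ln(1/δ) / (2n)). The gains, the confidence parameter `delta` and the example counts below are hypothetical; the range of 1 assumes a two-class problem, where information gain cannot exceed log2(2).

```python
from math import log, sqrt

def hoeffding_bound(value_range, delta, n):
    """Epsilon such that the true mean is within epsilon with confidence 1 - delta."""
    return sqrt(value_range ** 2 * log(1 / delta) / (2 * n))

def should_split(best_gain, second_gain, value_range, delta, n):
    """Split only when the best attribute beats the runner-up by more than epsilon."""
    return best_gain - second_gain > hoeffding_bound(value_range, delta, n)

# Same gain difference (0.28), evaluated after 50 and after 200 examples.
too_early = should_split(0.38, 0.10, value_range=1.0, delta=1e-7, n=50)
enough_data = should_split(0.38, 0.10, value_range=1.0, delta=1e-7, n=200)
```

Because ε shrinks as n grows, the same gain difference that is inconclusive after 50 examples justifies a split after 200.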

(33)

DSM algorithms: Hoeffding tree – steps

2. Sliding window size is dynamic (discussed later...)


[Figure: tree splitting on Project (B / A) and Command Type (X / Y), with a leaf labeled pass.]

Sliding window size = 5

• Window before: jobs +,+,+,-,- (prediction: pass)

• New job - added, job + discarded

• Window after: jobs +,+,-,-,-

• Re-calculate prediction: fail

(34)

DSM algorithms: Hoeffding tree – steps

3. Adaptivity –

A. Window size change similar to MSW (discussed later...)

B. Alternate tree:

• After a concept drift in the data stream, followed by a stable period, a new alternate tree is generated

• Track error rate on new concept

(35)

DSM algorithms: MSW – idea


[Figure: profiles arranged as a tree. The root splits on Project (B / A), each branch splits on Command Type (X / Y), and the leaves hold the predictions 4GB, 2GB, 10GB and 4GB.]

New job: Project=A, CommandType=X

New job memory prediction is 4GB

Multiple Sliding Windows (MSW) (Mimran & Even, 2014)

(36)

DSM algorithms: MSW

1. Before training the model, find the set of variables that impact the memory consumption

• The method used: forward selection, minimizing variance and number of profiles:

Selected Variable Set | Variable Rank | Candidate Variables
D | 1,2,3,4,3,2,1 | A, B, C, D, E, F, G
D, A | 6,5,4,3,2,1 | A, B, C, E, F, G
D, A, G | 4,6,8,10,20 | B, C, E, F, G
D, A, G, B | 3,1,1,2 | B, C, E, F
D, A, G, B | -1, -2, 0 | C, E, F
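The selection loop behind the table can be sketched as a greedy forward selection: at each step the highest-ranked candidate is added, and the loop stops when no candidate has a positive rank. The stopping rule is inferred from the last table row, and the `rank` callback is a stand-in; here it simply replays the ranks listed in the table rather than computing the actual criterion.

```python
def forward_select(candidates, rank):
    """Greedy forward selection driven by a rank(selected, candidate) function."""
    selected = []
    remaining = list(candidates)
    while remaining:
        scores = {v: rank(selected, v) for v in remaining}
        best = max(remaining, key=scores.get)
        if scores[best] <= 0:      # assumed stopping rule: no positive rank left
            break
        selected.append(best)
        remaining.remove(best)
    return selected

# Ranks copied from the table, one dictionary per selection step.
TABLE_RANKS = [
    dict(zip("ABCDEFG", [1, 2, 3, 4, 3, 2, 1])),
    dict(zip("ABCEFG", [6, 5, 4, 3, 2, 1])),
    dict(zip("BCEFG", [4, 6, 8, 10, 20])),
    dict(zip("BCEF", [3, 1, 1, 2])),
    dict(zip("CEF", [-1, -2, 0])),
]

def table_rank(selected, candidate):
    """Hypothetical rank function that replays the table step by step."""
    return TABLE_RANKS[len(selected)][candidate]

chosen = forward_select("ABCDEFG", table_rank)
```

Replaying the table this way yields the variable set D, A, G, B, after which no candidate improves the rank.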

(37)

DSM algorithms: MSW

[Figure: variable set selection illustration]

(38)

DSM algorithms: MSW

2. Set a sliding window per profile

Predict Label

(39)

DSM algorithms: MSW

3. Use any given prediction function within the window

Objective: maximum saving + minimum 95% accuracy

Chosen strategy: linear prediction function (φ = 0.95, C=0.1)
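Steps 2 and 3 can be sketched as one window per profile with a pluggable prediction function. This is an illustration under stated assumptions: the class name `ProfileWindows`, the fixed window size and the median default are hypothetical choices, and the median is only a placeholder, not the linear prediction function mentioned on the slide.

```python
from collections import defaultdict, deque
from statistics import median

class ProfileWindows:
    """Keeps a separate sliding window of observed memory for every profile."""

    def __init__(self, variables, window_size, predict_fn=median):
        self.variables = variables      # variables that define a profile
        self.predict_fn = predict_fn    # any function over the window values
        self.windows = defaultdict(lambda: deque(maxlen=window_size))

    def _profile(self, job):
        return tuple(job[v] for v in self.variables)

    def update(self, job, memory_gb):
        """Record the memory a completed job actually consumed."""
        self.windows[self._profile(job)].append(memory_gb)

    def predict(self, job):
        """Predict memory for a new job, or None for an unseen profile."""
        window = self.windows.get(self._profile(job))
        return self.predict_fn(window) if window else None

msw = ProfileWindows(["project", "command"], window_size=3)
for used in [4, 4, 5, 6]:
    msw.update({"project": "A", "command": "X"}, used)

known = msw.predict({"project": "A", "command": "X"})
unknown = msw.predict({"project": "B", "command": "Y"})
```

Passing `predict_fn=max`, for example, would give a more conservative estimate that trades savings for fewer under-estimated jobs.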

(40)

DSM algorithms: MSW

4. Set the window size dynamically, using change detector

• Example for concept drift management of window with 850 jobs.

• Sub-window size parameter is 200 and confidence levels are: 97.5%, 95%, 90%, 90%

[Figure: the 850-job window, with an "Older observations" label, divided into sub-windows of 200, 200, 200 and 250 jobs.]

• 1st change detection comparison, 97.5% confidence level: 200 vs. 650

• 2nd change detection comparison, 95% confidence level: 400 vs. 450

• 3rd change detection comparison, 90% confidence level: 600 vs. 250

Flow in case of the 2nd comparison being statistically significant:

• 1st comparison (200 vs. 650) is not statistically significant → go to next sub-windows

• 2nd comparison (400 vs. 450) is statistically significant → prune window → 400
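The pruning flow can be sketched as below. Several details are assumptions made for the illustration rather than facts from the slides: the window is stored newest-first, the two sides of each cut are compared by their means, ε is computed from the smaller side, and the newer side is kept after a significant change. The memory values and the value range of 10 are hypothetical, chosen so that the first comparison passes and the second one triggers pruning, as in the slide's flow.

```python
from math import log, sqrt
from statistics import mean

def hoeffding_bound(value_range, delta, n):
    return sqrt(value_range ** 2 * log(1 / delta) / (2 * n))

def split_points(window_len, sub_size):
    """Cut positions that leave at least one full sub-window on the far side."""
    return [k for k in range(sub_size, window_len, sub_size)
            if window_len - k >= sub_size]

def prune(window, sub_size, confidences, value_range):
    """Return the window, shortened at the first statistically significant cut.

    Assumes window[0] is the newest observation.
    """
    for i, cut in enumerate(split_points(len(window), sub_size)):
        confidence = confidences[min(i, len(confidences) - 1)]
        recent, older = window[:cut], window[cut:]
        eps = hoeffding_bound(value_range, 1 - confidence,
                              min(len(recent), len(older)))
        if abs(mean(recent) - mean(older)) > eps:
            return recent            # drop the older observations
    return window

levels = [0.975, 0.95, 0.90, 0.90]
drifting = [5.0] * 200 + [7.0] * 200 + [5.0] * 450   # 850 hypothetical values
stable = [5.0] * 850

pruned = prune(drifting, 200, levels, value_range=10.0)
untouched = prune(stable, 200, levels, value_range=10.0)
```

For an 850-job window and a sub-window size of 200, `split_points` produces the cuts 200, 400 and 600, which correspond to the three comparisons listed on the slide.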

(41)

DSM algorithms: MSW

4. Set the window size dynamically, using change detector

• Change detector function: Hoeffding bound

• Alternate function: Kolmogorov–Smirnov test

[Figure: Kolmogorov–Smirnov test example. Courtesy of Wikipedia http://en.wikipedia.org/wiki/File:KS_Example.png]

The Hoeffding bound (Hoeffding, 1963), also known as additive Chernoff bound:

\[ \varepsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}} \]

R - The range of the variable

(1-δ) - The statistical confidence

n - The number of examples

(42)

Summary & conclusions

Model | Model Type | Sliding Windows | Window Size | Adaptivity | Change Detector
Incremental Online Info-fuzzy Network (IOLIN) (Cohen et al., 2008) | classification | 1 window | heuristic | update network | Accuracy Degradation
Fast Incremental Regression Tree with Drift Detection (FIRT-DD) (Ikonomovska et al., 2009) | regression | multiple windows | user-defined | shadow model | Error PH test
Concept adapting Very Fast Decision Trees (CVFDT) (Hulten et al., 2001) | classification | 1 window | user-defined | shadow model | Hoeffding Bound
Hoeffding Adaptive Tree (HAT) (Bifet et al., 2009) | classification | multiple windows | dynamic | modify window | Hoeffding Bound
Multiple Sliding Windows (MSW) (Mimran & Even, 2014) | classification, regression | multiple windows | dynamic | modify window | Hoeffding Bound

(43)

MSW in production: predicting jobs memory usage

Deploying the model improved throughput by 10%

• By allowing the scheduler to fit more jobs on available resources

(44)

Can we do the same for the jobs' runtime?

Job runtime behavior is more “chaotic” than memory usage

• Some jobs get killed upon startup, e.g., due to configuration issues

• Jobs sharing the same CPU create contention impacting runtime

• Environmental issues impact runtime, e.g., file system slowness

• Non-uniform server configurations having different CPU speeds

• Hyper-threading, etc.

(45)

Proposed approach – predict the extremes

• MAX (improved throughput by 5%)

    • 0.5% of jobs consume ~10% of resources with high failure rate

    • Predict the outliers using large windows and kill them

• MIN (not implemented yet)

    • ~50% of jobs run less than 5 minutes

    • Predict if job’s run time is short for better scheduling use cases


(46)

References

• Cohen, L., Avrahami, G., Last, M., & Kandel, A. (2008, September). Info-fuzzy algorithms for mining dynamic data streams. Applied Soft Computing, 8(4), 1283-1294.

• Ikonomovska, E., Gama, J., Sebastião, R., & Gjorgjevik, D. (2009). Regression Trees from Data Streams with Drift Detection. Discovery Science - Lecture Notes in Computer Science (pp. 121-135). Springer Berlin / Heidelberg.

• Hulten, G., Spencer, L., & Domingos, P. (2001). Mining time-changing data streams. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining (KDD '01). (pp. 97-106). New York: ACM.

• Bifet, A., & Gavaldà, R. (2009). Adaptive Learning from Evolving Data Streams. In N. Adams, C. Robardet, A. Siebes, & J.-F. Boulicaut (Eds.), Advances in Intelligent Data Analysis VIII / Lecture Notes in Computer Science (Vol. 5772, pp. 249-260). Berlin / Heidelberg: Springer.

• Mimran, O. & Even, A. (2014). Data Stream Mining With Multiple Sliding Windows For Continuous Prediction. Proceedings of the European Conference on Information Systems (ECIS). AISeL.

(47)

Thank You

(48)
(49)

MSW feature selection

• Data Dictionary Selection (DDS) criterion:

\[ DDS = \sum_{j=1}^{J} \sigma_j^2 N_j / N - P \]

N - The total number of observations

J - The number of profiles considered

σ_j - The standard deviation of profile j

N_j - The number of observations in profile j

P - The number of profiles generated (i.e. distinct value combinations)

• Normalized DDS criterion:

\[ V_0 = \sigma^2; \quad P_0 = 1; \quad V_i = \sum_{j=1}^{J} \sigma_j^2 N_j / N; \quad DDS_i = \frac{V_i}{V_{i-1}} - \frac{P_i}{P_{i-1}} \]

σ_j - The standard deviation of profile j

N_j - The number of observations in profile j

P_i - The number of profiles in step i

i - The step number

N - The total number of observations

J - The number of profiles considered

• Normalized DDS criterion with minimum support α:

\[ V_0 = \sigma^2; \quad P_0 = 1; \quad V_i = \sum_{j=1}^{n} \begin{cases} \sigma_j^2 N_j / N & N_j \ge \alpha \\ 0 & \text{otherwise} \end{cases}; \quad DDS_i = \frac{V_i}{V_{i-1}} - \frac{P_i}{P_{i-1}} \sum_{j=1}^{n} \begin{cases} N_j & N_j \ge \alpha \\ 0 & \text{otherwise} \end{cases} / N \]
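The two ingredients of the criterion, the weighted within-profile variance V = Σ σ_j² N_j / N and the profile count P, can be computed as sketched below. The job records and the name `profile_stats` are hypothetical, and treating σ_j as the population standard deviation is an assumption; the sketch stops short of combining V and P into a DDS score.

```python
from collections import defaultdict
from statistics import pvariance

def profile_stats(jobs, variables, target="memory_gb"):
    """Return (V, P): weighted within-profile variance and number of profiles."""
    groups = defaultdict(list)
    for job in jobs:
        groups[tuple(job[v] for v in variables)].append(job[target])
    n = len(jobs)
    v = sum(pvariance(values) * len(values) / n for values in groups.values())
    return v, len(groups)

# Hypothetical jobs: memory depends entirely on the project.
jobs = [
    {"project": "A", "memory_gb": 4},
    {"project": "A", "memory_gb": 4},
    {"project": "B", "memory_gb": 10},
    {"project": "B", "memory_gb": 10},
]

v0, p0 = profile_stats(jobs, [])            # no variables: one profile
v1, p1 = profile_stats(jobs, ["project"])   # split by project
```

With no variables selected the result is the overall variance and a single profile, matching V_0 = σ² and P_0 = 1; adding the project variable removes all within-profile variance at the cost of a second profile.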

(50)

DSM algorithms: MSW

Multiple Sliding Windows (MSW) (Mimran & Even, 2014)

MSW strategy:

• Find a variable set, which divides the data into a minimal set of profiles (clusters) with minimal variance (done once)

• Set a sliding window per profile

• Use any given prediction function within the windows
