Predictive Analytics. Omer Mimran, Challenges in Modern Data Centers Management, Spring 2015

(1)

Predictive Analytics

Omer Mimran, Spring 2015

(2)

Information provided in these slides is

for educational purposes only

(3)

Agenda

• Motivation

• Predicting the jobs resource requirements

• Background and challenges

• Predictive analytics, data-stream mining (DSM)

• System overview

• DSM algorithms

• Regression tree, Hoeffding tree, Multiple sliding windows (MSW)

• Summary & conclusions

(4)

(5)

Reminder: RM lectures I–III

• Each job comes with resource requirements, e.g., 2 cores × 8GB

• Specified by the user submitting the job, based on their experience, etc.

• Scheduler picks the job (RM-I) and matches it with a server (RM-II)

• Best fit, worst fit, etc.

(6)

Why do we need predictive analytics?

• What if the jobs (users) request too many resources?

• 8GB while in practice the job only uses 4GB of memory?

• Very common problem resulting in huge waste of resources ($$ loss)

• Even if resource matching was done optimally (RM-II lecture)

• Our goal (predictive analytics)

• Provide prediction for the actual resource usage of the jobs (focusing on memory)

• Forward this information to the scheduler to do the matching

(7)

Background and challenges

(8)

Predictive analytics

• Predictive analytics: “encompasses a variety of techniques from statistics, modeling, machine learning, and data mining that analyze current and historical facts to make predictions about future, or otherwise unknown, events.” (Nyce, Charles, 2007)

• Machine Learning: “Field of study that gives computers the ability to learn without being explicitly programmed.” (Arthur Samuel, 1959)

• An Introduction to Data Mining/Machine Learning

• General methodology (CRISP-DM):

1. Divide the data into 3 sets (training, testing, validation)

2. Use the training set to create models and the testing set to measure performance

3. Use the validation set to select the best model & test model generalization
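The three-way split in the methodology above can be sketched as follows. This is a minimal illustration only: the 60/20/20 proportions, the fixed seed, and the function name `three_way_split` are assumptions, not values from the lecture.

```python
import random


def three_way_split(records, train=0.6, test=0.2, seed=0):
    """Shuffle the records and cut them into training, testing and validation sets.

    The proportions are hypothetical; whatever is left after the training and
    testing parts becomes the validation set.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_test = round(len(shuffled) * test)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_test],
        shuffled[n_train + n_test:],
    )


# 100 dummy job records identified by their index
train_set, test_set, validation_set = three_way_split(range(100))
```

Note that a random shuffle suits static data; for a data stream the split would normally respect time order.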

(9)

Data-stream mining (DSM)

• Data-stream: continuous (endless) and rapid incoming data

• Idea: apply machine-learning techniques on-line, on the data stream

• Key challenges:

1. Performance: infeasible to store/train all data, each sample is processed once

2. Quality: expected to perform at least as well as “no-stream” models

3. Adaptability: non-stationary stream, the underlying model must be altered accordingly

4. Availability: must be available for prediction at all times

(Bifet, et al., 2010; Domingos & Hulten, 2001; Aggarwal, 2007; Gama & Rodrigues, 2007; Gaber, et al., 2005; Babcock, et al., 2002)

(10)

Adaptivity challenge: concept drift

• Concept drift: scenarios in which the distribution of a certain population changes over time; hence, statistical inference is affected

(Kelly et al., 1999)

• Concept-drift types:

1. Sudden: easier to detect, with fewer examples

2. Gradual: harder to detect, often mistaken for random noise

3. Incremental: occurs over a long period of time

4. Recurring contexts: appear in a cyclic manner

(Tsymbal, 2004; Gama & Castillo, 2006; Zliobaite, 2009)

• Possible treatments:

1. Resetting the training data (Klinkenberg, 2004; Cohen, et al., 2008)

2. Training a shadow model (Domingos & Hulten, 2000; Ikonomovska & Gama, 2008; Bifet & Gavaldà, 2009)

(11)

Concept drift in reality

• Bursts in jobs’ core and memory requirements

• Ohad Shai, Edi Shmueli, and Dror G. Feitelson, “Heuristics for resource matching in Intel's compute farm”. In Job Scheduling Strategies for Parallel Processing, Walfredo Cirne and Narayan Desai, (ed.), Springer-Verlag, 2013

(12)

Performance challenge: sliding windows

• Using time windows

• A common technique in stream mining

• Better performance

• Also addresses concept drift

• Time-window types

1. Landmark window: maintains data starting from an identified relevant point

2. Tilted window: maintains all data within a window in different aggregate scales

3. Sliding window: only recent examples are stored in the window

(13)

Performance challenge: sliding windows

• Example:

• The accuracy of protein-structure prediction, using KNN with sliding windows of varying length (Chen, Kurgan, & Ruan, 2006)


• The problem: how to set the window size?

• Too short: lower statistical validity and stability

(14)

System overview

(15)

System overview – input from the users

• Job characteristics

• User, project, priority, command-line, resource requirements, etc.

• Data only known at submission time

• Categorical variables with many possible values

(16)

System overview


(17)

System overview – output of the model

• Prediction example

• If command = “A” and project = “Tablet” then memory=4GB

• If command = “B” and project = “Mobile” then memory=6GB

• If priority = “1” and user team = “uncore” then memory=2GB

• If project = “ServerX” then memory=16GB

• …

(18)

System overview


(19)

System overview – output of the scheduler

• Scheduler matches the jobs with machines/servers (RM-II lecture)

• Using the predicted values (not the original values specified by the user)

• More jobs fit in → higher throughput → $$ saving

(20)

System overview


(21)

System overview – input to the model

• Job characteristics

• User, project, priority, command-line, etc.

• Actual resources consumed by the jobs

• e.g., memory

(22)

Performance measurements & objective

• Measurements calculated per job once completed & available in DB

• Calculating actual runtime/memory consumption vs. prediction

• Objective: maximum savings + minimum of 95% accuracy

• i.e., minimize resource waste, while ensuring that 95% of the jobs will not be under-estimated (otherwise they might be killed by the scheduler)

• Measurements (calculated for all jobs which got memory prediction):

• Accuracy = \( \dfrac{\text{number of jobs with memory consumed} < \text{memory prediction}}{\text{number of jobs}} \)

• Saving = \( \text{job runtime} \times \text{memory prediction} \)
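As a small worked illustration of the accuracy measurement, the sketch below counts the jobs whose consumed memory stayed below the prediction. The function name `accuracy` and the four sample jobs are made up for the example and do not come from the lecture.

```python
def accuracy(consumed_gb, predicted_gb):
    """Fraction of jobs whose consumed memory was below the predicted memory."""
    pairs = list(zip(consumed_gb, predicted_gb))
    covered = sum(1 for used, predicted in pairs if used < predicted)
    return covered / len(pairs)


# Hypothetical completed jobs: actual consumption vs. the prediction they got
consumed = [3.5, 4.2, 1.0, 7.9]
predicted = [4.0, 4.0, 2.0, 8.0]

acc = accuracy(consumed, predicted)   # the second job was under-estimated
meets_objective = acc >= 0.95         # the 95% accuracy objective
```

Here three of the four jobs are covered, so the accuracy is 0.75 and the 95% objective is not met.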

(23)

DSM algorithms

(24)

Challenge

• Data available for learning

1. Job characteristics: user, project, command-line, etc.

2. Actual resources consumed by the jobs, e.g., memory

• Output

(25)

DSM algorithms: regression tree – idea

[Figure: regression tree with split nodes Priority (> 5 / <= 5) and NumOfLoops (> 0 / = 0) and leaf values 3, 4, 2. Example: a new job with Priority=1 and NumOfLoops=2 gets a memory prediction of 4GB.]

Fast Incremental Regression Tree with Drift Detection (FIRT-DD) (Ikonomovska et al., 2009)

(26)

DSM algorithms: regression tree – steps

1. Construct a tree using the Chernoff bound, comparing the standard deviation reduction (SDR) of all possible values as the split criterion

[Figure: all candidate variable values are tested; Priority value 5 is found to best reduce the standard deviation, so the node is split into Priority <= 5 and Priority > 5.]

(27)

DSM algorithms: regression tree – steps

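2. Sliding window size is a pre-defined parameter

[Figure: tree with nodes Priority (> 5 / <= 5) and NumOfLoops (<> 0 / = 0); one leaf, with value 2, holds a sliding window.]

• Sliding window size = 5

• The window holds job values 1,4,1,5,2

• New job value 3 is added and the oldest job value 1 is discarded, giving 4,1,5,2,3

• Re-calculate prediction: median = 3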
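The window update in this step can be sketched with a fixed-length queue. This is only an illustration of the slide's numbers; the class name `LeafWindow` is hypothetical and is not part of FIRT-DD.

```python
from collections import deque
from statistics import median


class LeafWindow:
    """Fixed-size sliding window kept at a tree leaf; predicts the window median."""

    def __init__(self, size):
        # deque(maxlen=...) silently drops the oldest value when a new one arrives
        self.values = deque(maxlen=size)

    def add(self, value):
        self.values.append(value)

    def predict(self):
        return median(self.values)


leaf = LeafWindow(size=5)
for job_value in [1, 4, 1, 5, 2]:
    leaf.add(job_value)
before = leaf.predict()   # median of 1,4,1,5,2

leaf.add(3)               # new job value 3 added, oldest value 1 discarded
after = leaf.predict()    # median of 4,1,5,2,3
```

With the slide's values, the prediction moves from 2 to 3 once the new job enters the window.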

(28)

DSM algorithms: regression tree – steps

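3. Adaptivity:

• Track the error rate using the statistical Page-Hinkley (PH) test

• Grow a shadow sub-tree and replace the original once its accuracy is better

[Figure: tree with branches Priority <= 5 / Priority > 5 and NumOfLoops > 0 / NumOfLoops = 0, where one sub-tree shows a high error; a shadow sub-tree under Priority <= 5 splits on CommandNum (> 10 / <= 10), and the error rates are compared.]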
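A minimal sketch of a Page-Hinkley test that watches a stream of prediction errors for an upward shift. The parameter values (`delta`, `threshold`) and the synthetic error stream are assumptions chosen for illustration, and this is the textbook form of the test rather than the exact variant used inside FIRT-DD.

```python
class PageHinkley:
    """Page-Hinkley test for detecting an increase in the mean of a stream."""

    def __init__(self, delta=0.005, threshold=5.0):
        self.delta = delta          # tolerated magnitude of change (assumed value)
        self.threshold = threshold  # alarm threshold, often called lambda (assumed value)
        self.n = 0
        self.mean = 0.0             # running mean of the observed errors
        self.cum = 0.0              # cumulative deviation m_t
        self.min_cum = 0.0          # smallest m_t seen so far

    def update(self, error):
        """Feed one error value; return True when a change is signalled."""
        self.n += 1
        self.mean += (error - self.mean) / self.n
        self.cum += error - self.mean - self.delta
        self.min_cum = min(self.min_cum, self.cum)
        return self.cum - self.min_cum > self.threshold


# Synthetic stream: low error for 50 jobs, then a sudden jump (concept drift)
errors = [0.1] * 50 + [1.0] * 50
detector = PageHinkley()
first_alarm = next(
    (i for i, e in enumerate(errors) if detector.update(e)), None
)
```

In this synthetic run the alarm fires a few observations after the jump at index 50, which is the point where a shadow sub-tree would start to be worth growing.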

(29)

DSM algorithms: Hoeffding tree – idea

[Figure: classification tree with split nodes Project (= A / = B) and Command Type (= X / = Y) and leaves labeled fail, fail, pass. Example: a new job with Project=A and CommandType=X is predicted to fail.]

Hoeffding Adaptive Tree (HAT)

(30)

Entropy & information gain

Weather  | Go to the beach
Sunny    | Yes
Sunny    | Yes
Sunny    | Yes
Sunny    | No
Overcast | Yes
Overcast | Yes
Overcast | No
Overcast | No
Rain     | No
Rain     | No
Rain     | No
Rain     | No

Entropy(S) = \( -\sum_{i=1}^{n} p_i \log_2 p_i \)

P(Beach = Yes) = 5/12

P(Beach = No) = 7/12

Entropy(Beach) = \( -\tfrac{5}{12}\log_2\tfrac{5}{12} - \tfrac{7}{12}\log_2\tfrac{7}{12} = 0.98 \)

P(Beach = Yes | Weather = Sunny) = 3/4

P(Beach = No | Weather = Sunny) = 1/4

Entropy(\(S_{sunny}\)) = \( -\tfrac{3}{4}\log_2\tfrac{3}{4} - \tfrac{1}{4}\log_2\tfrac{1}{4} = 0.81 \)

Entropy(\(S_{overcast}\)) = 1

Entropy(\(S_{rain}\)) = 0

(31)

Entropy & information gain

Entropy(Beach) = 0.98, Entropy(\(S_{sunny}\)) = 0.81, Entropy(\(S_{overcast}\)) = 1, Entropy(\(S_{rain}\)) = 0

P(sunny) = P(overcast) = P(rain) = 4/12

Entropy(Beach | Weather) = P(sunny)*Entropy(\(S_{sunny}\)) + P(overcast)*Entropy(\(S_{overcast}\)) + P(rain)*Entropy(\(S_{rain}\)) = 4/12(0.81) + 4/12(1) + 4/12(0) = 0.6

By knowing the weather, how much information have I gained?

Gain = Entropy(X) - Entropy(X|Y)

For this example: Gain = Entropy(Beach) - Entropy(Beach | Weather) = 0.98 - 0.6 = 0.38
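The beach example can be checked numerically with a short sketch. The helper names `entropy` and `information_gain` are illustrative; the twelve rows are the ones from the slide's table.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(rows):
    """Gain = Entropy(label) - Entropy(label | attribute) for (attribute, label) rows."""
    labels = [label for _, label in rows]
    groups = {}
    for attribute, label in rows:
        groups.setdefault(attribute, []).append(label)
    conditional = sum(
        len(group) / len(rows) * entropy(group) for group in groups.values()
    )
    return entropy(labels) - conditional


# (Weather, Go to the beach) rows from the slide
rows = (
    [("Sunny", "Yes")] * 3 + [("Sunny", "No")]
    + [("Overcast", "Yes")] * 2 + [("Overcast", "No")] * 2
    + [("Rain", "No")] * 4
)

beach_entropy = entropy([label for _, label in rows])
gain = information_gain(rows)
```

Without intermediate rounding the gain comes out at roughly 0.376, which agrees with the rounded hand calculation 0.98 - 0.6.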

(32)

DSM algorithms: Hoeffding tree – steps

1. Construct a tree using information gain as the split criterion and the Hoeffding bound statistical test as a stopping condition

• Information gain G is calculated for all candidate variables

• If G(best attribute) − G(2nd best) > ε*, split the leaf on the best attribute

* ε = Hoeffding bound statistic

[Figure: node split using the Project variable into Project = A and Project = B.]
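The split decision above can be sketched as follows, using the usual form of the Hoeffding bound, ε = sqrt(R² ln(1/δ) / (2n)). The gain values, the range R = 1 (information gain over two classes), and δ = 1e-7 are hypothetical inputs chosen only to show how the decision changes with the number of examples.

```python
from math import log, sqrt


def hoeffding_bound(value_range, delta, n):
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n))."""
    return sqrt(value_range ** 2 * log(1.0 / delta) / (2.0 * n))


def should_split(best_gain, second_gain, value_range, delta, n):
    """Split the leaf only when the best attribute leads by more than epsilon."""
    return best_gain - second_gain > hoeffding_bound(value_range, delta, n)


# Same gain difference, evaluated after 100 and after 1000 examples at the leaf
early = should_split(0.38, 0.20, value_range=1.0, delta=1e-7, n=100)
later = should_split(0.38, 0.20, value_range=1.0, delta=1e-7, n=1000)
```

With few examples the bound is wide and the leaf is left alone; as more examples arrive, ε shrinks and the same gain difference becomes sufficient to split.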

(33)

DSM algorithms: Hoeffding tree – steps

2. Sliding window size is dynamic (discussed later...)

[Figure: tree with nodes Project (= A / = B) and Command Type (= X / = Y); one leaf, labeled pass, holds a sliding window.]

• Sliding window size = 5

• The window holds jobs +,+,+,-,-

• New job - is added and the oldest job + is discarded, giving +,+,-,-,-

• Re-calculate prediction: fail

(34)

DSM algorithms: Hoeffding tree – steps

3. Adaptivity –

A. Window size change similar to MSW (discussed later...)

B. Alternate tree:

• After a concept drift in the data stream, followed by a stable period, a new alternate tree is generated

• Track error rate on new concept

(35)

DSM algorithms: MSW – idea

[Figure: MSW profiles defined by Project (= A / = B) and Command Type (= X / = Y), with per-profile memory values 4GB, 2GB, 10GB and 4GB. Example: a new job with Project=A and CommandType=X gets a memory prediction of 4GB.]

Multiple Sliding Windows (MSW) (Mimran & Even, 2014)

(36)

DSM algorithms: MSW

1. Before training the model, find the set of variables that impact the memory consumption

• The method used: forward selection minimizing variance and number of profiles:

Selected Variable Set | Variable Rank | Candidate Variables
D                     | 1,2,3,4,3,2,1 | A, B, C, D, E, F, G
D, A                  | 6,5,4,3,2,1   | A, B, C, E, F, G
D, A, G               | 4,6,8,10,20   | B, C, E, F, G
D, A, G, B            | 3,1,1,2       | B, C, E, F
D, A, G, B            | -1, -2, 0     | C, E, F
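The forward-selection loop behind the table can be sketched as below. The ranking function is replaced by a lookup of the rank values shown in the table, and the stopping rule (stop when no candidate has a positive rank) is an assumed reading of the last row, where the selected set no longer grows. In a real run the rank would be computed from the data.

```python
def forward_select(candidates, rank):
    """Greedy forward selection: repeatedly add the best-ranked candidate.

    `rank(selected, variable)` scores adding `variable` to the current set.
    Stopping when the best score is not positive is an assumption.
    """
    selected = []
    remaining = list(candidates)
    while remaining:
        scores = {v: rank(selected, v) for v in remaining}
        best = max(scores, key=scores.get)
        if scores[best] <= 0:
            break
        selected.append(best)
        remaining.remove(best)
    return selected


# Rank values copied from the slide's table, keyed by the already-selected set
SLIDE_RANKS = {
    (): dict(zip("ABCDEFG", [1, 2, 3, 4, 3, 2, 1])),
    ("D",): dict(zip("ABCEFG", [6, 5, 4, 3, 2, 1])),
    ("D", "A"): dict(zip("BCEFG", [4, 6, 8, 10, 20])),
    ("D", "A", "G"): dict(zip("BCEF", [3, 1, 1, 2])),
    ("D", "A", "G", "B"): dict(zip("CEF", [-1, -2, 0])),
}


def slide_rank(selected, variable):
    return SLIDE_RANKS[tuple(selected)][variable]


selected_set = forward_select("ABCDEFG", slide_rank)
```

Run on the table's values, the loop picks D, A, G and B in that order and then stops, matching the final row.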

(37)

Variable set selection illustration

DSM algorithms: MSW

(38)

DSM algorithms: MSW

2. Set a sliding window per profile

Predict Label

(39)

Objective: maximum saving + minimum 95% accuracy

Chosen strategy: linear prediction function (φ = 0.95, C=0.1)

DSM algorithms: MSW

3. Use any given prediction function within the window

(40)

DSM algorithms: MSW

4. Set the window size dynamically, using change detector

• Example of concept-drift management for a window with 850 jobs

• Sub-window size parameter is 200 and confidence levels are: 97.5%, 95%, 90%, 90%

[Figure: the window of 850 jobs, with the older observations marked, is divided into sub-windows of 200, 200, 200 and 250 jobs.]

• 1st change detection comparison (200 vs. 650 jobs), 97.5% confidence level

• 2nd change detection comparison (400 vs. 450 jobs), 95% confidence level

• 3rd change detection comparison (600 vs. 250 jobs), 90% confidence level

Flow in case of the 2nd comparison being statistically significant:

1. The 1st comparison (200 vs. 650) is not statistically significant: go to the next sub-windows

2. The 2nd comparison (400 vs. 450) is statistically significant: prune the window to 400 jobs

(41)

DSM algorithms: MSW

4. Set the window size dynamically, using change detector

• Change detector function: Hoeffding bound

• Alternate function: Kolmogorov–Smirnov test

The Hoeffding bound (Hoeffding, 1963), also known as the additive Chernoff bound:

\[ \varepsilon = \sqrt{\frac{R^{2}\,\ln(1/\delta)}{2n}} \]

• R: the range of the variable

• (1-δ): the statistical confidence

• n: the number of examples

[Figure: Kolmogorov–Smirnov test illustration. Courtesy of Wikipedia http://en.wikipedia.org/wiki/File:KS_Example.png]
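A rough sketch of how a Hoeffding-bound change detector could drive the window pruning in the 850-job example. Several details are assumptions rather than the published MSW procedure: the window is ordered oldest to newest, each comparison tests the mean of the most recent jobs against the mean of the older remainder, n is taken as the smaller of the two group sizes, and the range R = 16 and the memory values are invented for the example.

```python
from math import log, sqrt
from statistics import mean


def hoeffding_bound(value_range, confidence, n):
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n)), with delta = 1 - confidence."""
    delta = 1.0 - confidence
    return sqrt(value_range ** 2 * log(1.0 / delta) / (2.0 * n))


def prune_window(window, sub_size, confidences, value_range):
    """Return the window, shortened to its recent part if a change is detected.

    `window` is ordered oldest -> newest (assumed). Cut points grow by
    `sub_size`, and the last sub-window absorbs the remainder.
    """
    cuts = range(sub_size, len(window) - sub_size + 1, sub_size)
    for cut, confidence in zip(cuts, confidences):
        recent, older = window[-cut:], window[:-cut]
        n = min(len(recent), len(older))  # conservative choice (assumed)
        eps = hoeffding_bound(value_range, confidence, n)
        if abs(mean(recent) - mean(older)) > eps:
            return recent                 # drop the older observations
    return window


# 850 jobs: memory usage drifted from 4.0GB to 5.5GB for the latest 400 jobs
drifting = [4.0] * 450 + [5.5] * 400
stable = [4.0] * 850
levels = [0.975, 0.95, 0.90, 0.90]

pruned = prune_window(drifting, sub_size=200, confidences=levels, value_range=16.0)
kept = prune_window(stable, sub_size=200, confidences=levels, value_range=16.0)
```

With these invented numbers the first comparison (200 vs. 650) stays inside the bound, the second (400 vs. 450) exceeds it, and the window shrinks to the 400 most recent jobs; the stable window is left untouched.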

(42)

Summary & conclusions

Model | Model Type | Sliding Windows | Window Size | Adaptivity | Change Detector
Incremental Online Info-fuzzy Network (IOLIN) (Cohen et al., 2008) | classification | 1 window | heuristic | update network | Accuracy Degradation
Fast Incremental Regression Tree with Drift Detection (FIRT-DD) (Ikonomovska et al., 2009) | regression | multiple windows | user-defined | shadow model | Error PH test
Concept adapting Very Fast Decision Trees (CVFDT) (Hulten et al., 2001) | classification | 1 window | user-defined | shadow model | Hoeffding Bound
Hoeffding Adaptive Tree (HAT) (Bifet et al., 2009) | classification | multiple windows | dynamic | modify window | Hoeffding Bound
Multiple Sliding Windows (MSW) (Mimran & Even, 2014) | classification, regression | multiple windows | dynamic | modify window | Hoeffding Bound

(43)

MSW in production: predicting jobs memory usage

• Deploying the model improved throughput by 10%

• By allowing the scheduler to fit more jobs on available resources

(44)

Can we do the same for the jobs runtime?

• Jobs runtime behavior is more “chaotic” compared to memory

• Some jobs get killed upon startup, e.g., due to configuration issues

• Jobs sharing the same CPU create contention impacting runtime

• Environmental issues impact runtime, e.g., file system slowness

• Non-uniform server configurations having different CPU speeds

• Hyper-threading, etc.

(45)

• Proposed approach – predict the extremes

• MAX (improved throughput by 5%)

• 0.5% of jobs consume ~10% of resources with high failure rate

• Predict the outliers using large windows and kill them

• MIN (not implemented yet)

• ~50% of jobs run less than 5 minutes

• Predict if job’s run time is short for better scheduling use cases

(46)

References

• Cohen, L., Avrahami, G., Last, M., & Kandel, A. (2008, September). Info-fuzzy algorithms for mining dynamic data streams. Applied Soft Computing, 8(4), 1283-1294.

• Ikonomovska, E., Gama, J., Sebastião, R., & Gjorgjevik, D. (2009). Regression Trees from Data Streams with Drift Detection. Discovery Science - Lecture Notes in Computer Science (pp. 121-135). Springer Berlin / Heidelberg.

• Hulten, G., Spencer, L., & Domingos, P. (2001). Mining time-changing data streams. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining (KDD '01) (pp. 97-106). New York: ACM.

• Bifet, A., & Gavaldà, R. (2009). Adaptive Learning from Evolving Data Streams. In N. Adams, C. Robardet, A. Siebes, & J.-F. Boulicaut (Eds.), Advances in Intelligent Data Analysis VIII / Lecture Notes in Computer Science (Vol. 5772, pp. 249-260). Berlin / Heidelberg: Springer.

• Mimran, O. & Even, A. (2014). Data Stream Mining With Multiple Sliding Windows For Continuous Prediction. Proceedings of the European Conference on Information Systems (ECIS). AISeL.

(47)

Thank You

(48)

(49)

MSW feature selection

• Data Dictionary Selection (DDS) criterion:

\[ DDS = \sum_{j=1}^{J} \sigma_j^{2} N_j / N \; - \; P \]

N - The total number of observations

J - The number of profiles considered

σ_j - The standard deviation of profile j

N_j - The number of observations in profile j

P - The number of profiles generated (i.e. distinct value combinations)

• Normalized DDS criterion:

\[ V_0 = \sigma^{2}; \quad P_0 = 1; \quad V_i = \sum_{j=1}^{J} \sigma_j^{2} N_j / N; \quad DDS_i = \frac{V_i}{V_{i-1}} - \frac{P_i}{P_{i-1}} \]

P_i - The number of profiles in step i

i - The step number

σ_j, N_j, N, J - As defined above

• Normalized DDS criterion with minimum support α:

\[ V_0 = \sigma^{2}; \quad P_0 = 1; \quad V_i = \sum_{j=1}^{n} \begin{cases} \sigma_j^{2} N_j / N & N_j \ge \alpha \\ 0 & \text{otherwise} \end{cases} \]

\[ DDS_i = \frac{V_i}{V_{i-1}} - \frac{P_i}{P_{i-1}} \; \left( \sum_{j=1}^{n} \begin{cases} N_j & N_j \ge \alpha \\ 0 & \text{otherwise} \end{cases} \right) / N \]

(50)

DSM algorithms: MSW

• Multiple Sliding Windows (MSW) (Mimran & Even, 2014)

• MSW strategy:

• Find a variable set which divides the data into a minimal set of profiles (clusters) with minimal variance (done once)

• Set a sliding window per profile

• Use any given prediction function within the windows
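The strategy above can be sketched as one sliding window per profile with a pluggable prediction function. This is a simplified illustration, not the published MSW implementation: the class and method names are hypothetical, the window size is fixed (dynamic sizing by a change detector is omitted), and `max` stands in for "any given prediction function".

```python
from collections import defaultdict, deque


class MultipleSlidingWindows:
    """One sliding window of observed values per profile (variable-value combination)."""

    def __init__(self, profile_vars, window_size, predict_fn):
        self.profile_vars = profile_vars   # variable set found once, up front
        self.predict_fn = predict_fn       # any function of the window's values
        self.windows = defaultdict(lambda: deque(maxlen=window_size))

    def _profile(self, job):
        return tuple(job[v] for v in self.profile_vars)

    def observe(self, job, memory_gb):
        """Record the actual memory consumed by a completed job."""
        self.windows[self._profile(job)].append(memory_gb)

    def predict(self, job, default):
        """Predict from the job's profile window; fall back to `default` if unseen."""
        window = self.windows.get(self._profile(job))
        return self.predict_fn(window) if window else default


msw = MultipleSlidingWindows(("project", "command_type"), window_size=3, predict_fn=max)
for used in [3.0, 4.0, 3.5]:
    msw.observe({"project": "A", "command_type": "X"}, used)

known = msw.predict({"project": "A", "command_type": "X"}, default=8.0)
unseen = msw.predict({"project": "B", "command_type": "Y"}, default=8.0)
```

In this sketch a job from a profile that has not been seen yet falls back to a default, for example the memory the user requested, which is a design choice of the example rather than something stated in the slides.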