Predictive Analytics
Omer Mimran, Spring 2015
Information provided in these slides is
for educational purposes only
Agenda
• Motivation
  • Predicting jobs' resource requirements
• Background and challenges
  • Predictive analytics, data-stream mining (DSM)
  • System overview
• DSM algorithms
  • Regression tree, Hoeffding tree, multiple sliding windows (MSW)
• Summary & conclusions
Reminder: RM lectures I–III
• Each job comes with resource requirements, e.g., 2 cores × 8GB
  • Specified by the user submitting the job, based on their experience, etc.
• Scheduler picks the job (RM-I) and matches it with a server (RM-II)
  • Best fit, worst fit, etc.
Why do we need predictive analytics?
• What if the jobs (users) request too many resources?
  • e.g., 8GB, while in practice the job only uses 4GB of memory
• A very common problem, resulting in a huge waste of resources ($$ loss)
  • Even if resource matching is done optimally (RM-II lecture)
• Our goal (predictive analytics)
  • Predict the actual resource usage of the jobs (focusing on memory)
• Forward this information to the scheduler to do the matching
Background and challenges
Predictive analytics
• Predictive analytics: “encompasses a variety of techniques from statistics, modeling, machine learning, and data mining that analyze current and historical facts to make predictions about future, or otherwise unknown, events.” (Nyce, Charles, 2007)
• Machine learning: “Field of study that gives computers the ability to learn without being explicitly programmed.” (Arthur Samuel, 1959)
• An introduction to data mining/machine learning
• General methodology (CRISP-DM):
  1. Divide the data into 3 sets (training, testing, validation)
  2. Use the training set to create models and the testing set to measure performance
  3. Use the validation set to select the best model and test model generalization
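A minimal Python sketch of step 1 of the methodology above. The 60/20/20 proportions, the fixed random seed, and the function name `split_three_ways` are assumptions for illustration; the slides do not specify them.

```python
import random


def split_three_ways(records, train_frac=0.6, test_frac=0.2, seed=0):
    """Shuffle the records and cut them into training, testing and validation sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)   # reproducible shuffle
    n_train = round(len(shuffled) * train_frac)
    n_test = round(len(shuffled) * test_frac)
    training = shuffled[:n_train]
    testing = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]  # whatever is left
    return training, testing, validation


training, testing, validation = split_three_ways(range(100))
print(len(training), len(testing), len(validation))
```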
Data-stream mining (DSM)
• Data stream: continuous (endless) and rapid incoming data
• Idea: apply machine-learning techniques on-line, on the data stream
• Key challenges:
  1. Performance: infeasible to store or train on all the data; each sample is processed once
  2. Quality: expected to perform at least as well as “no-stream” models
  3. Adaptability: the stream is non-stationary, so the underlying model must be altered accordingly
  4. Availability: must be available for prediction at all times
(Bifet, et al., 2010; Domingos & Hulten, 2001; Aggarwal, 2007; Gama & Rodrigues, 2007; Gaber, et al., 2005; Babcock, et al., 2002)
Adaptivity challenge: concept drift
• Concept drift: scenarios in which the distribution of a certain population changes over time; hence, statistical inference is affected
(Kelly et al., 1999)
• Concept-drift types:
1. Sudden: easier to detect, with fewer examples
2. Gradual: harder to detect, often mistaken for random noise
3. Incremental: occurs over a long period of time
4. Recurring contexts: appear in a cyclic manner
(Tsymbal, 2004; Gama & Castillo, 2006; Zliobaite, 2009)
• Possible treatments:
1. Resetting the training data (Klinkenberg, 2004; Cohen, et al., 2008)
2. Training a shadow model (Domingos & Hulten, 2000; Ikonomovska & Gama, 2008; Bifet & Gavaldà, 2009)
Concept drift in reality
• Bursts in jobs’ core and memory requirements
  • Ohad Shai, Edi Shmueli, and Dror G. Feitelson, “Heuristics for resource matching in Intel's compute farm”. In Job Scheduling Strategies for Parallel Processing, Walfredo Cirne and Narayan Desai (eds.), Springer-Verlag, 2013
Performance challenge: sliding windows
• Using time windows
  • A common technique in stream mining
  • Better performance
  • Also addresses concept drift
• Time-window types
  1. Landmark window: maintains all data starting from an identified relevant point
  2. Tilted window: maintains all data within a window at different aggregation scales
  3. Sliding window: only recent examples are stored in the window
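A minimal Python sketch contrasting two of the window types above, the landmark window and the sliding window. The function names and the sample stream are hypothetical, and the tilted window is left out for brevity.

```python
from collections import deque


def sliding_window(stream, size):
    """Keep only the `size` most recent samples of the stream."""
    window = deque(maxlen=size)      # the oldest sample is dropped automatically
    for sample in stream:
        window.append(sample)
    return list(window)


def landmark_window(stream, landmark_index):
    """Keep every sample from the landmark position onwards."""
    return list(stream)[landmark_index:]


stream = [10, 20, 30, 40, 50, 60]
recent = sliding_window(stream, size=3)                 # [40, 50, 60]
since_landmark = landmark_window(stream, landmark_index=2)  # [30, 40, 50, 60]
```

The sliding window has bounded memory, while the landmark window keeps growing until a new landmark is chosen.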
Performance challenge: sliding windows
• Example:
  • The accuracy of protein-structure prediction, using KNN with sliding windows of varying length (Chen, Kurgan, & Ruan, 2006)
Challenges in Modern Data Centers Management, Spring 2015 13
• The problem: how to set the window size?
  • Too short – lower statistical validity and stability
System overview
System overview – input from the users
• Job characteristics
  • User, project, priority, command-line, resource requirements, etc.
• Data only known at submission time
• Categorical variables with many possible values
System overview – output of the model
• Prediction example
  • If command = “A” and project = “Tablet” then memory = 4GB
  • If command = “B” and project = “Mobile” then memory = 6GB
  • If priority = “1” and user team = “uncore” then memory = 2GB
  • If project = “ServerX” then memory = 16GB
  • …
System overview – output of the scheduler
• Scheduler matches the jobs with machines/servers (RM-II lecture)
  • Using the predicted values (not the original values specified by the user)
  • More jobs fit in → higher throughput → $$ saving
System overview – input to the model
• Job characteristics
  • User, project, priority, command-line, etc.
• Actual resources consumed by the jobs
  • e.g., memory
Performance measurements & objective
• Measurements are calculated per job, once it has completed and its data is available in the DB
  • Comparing actual runtime/memory consumption vs. the prediction
• Objective: maximum savings + minimum of 95% accuracy
  • i.e., minimize resource waste, while ensuring that at least 95% of the jobs are not under-estimated (otherwise they might be killed by the scheduler)
• Measurements (calculated for all jobs that received a memory prediction):
  • \[ \text{Accuracy} = \frac{\text{number of jobs with memory consumed} < \text{memory prediction}}{\text{number of jobs}} \]
  • \[ \text{Saving} = \text{job runtime} \times \text{memory prediction} \]
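A minimal Python sketch of the accuracy measurement above, checked against the 95% objective. The job tuples are made-up values for illustration, and the function name `accuracy` is an assumption.

```python
def accuracy(jobs):
    """Fraction of jobs whose consumed memory stayed below the prediction.

    `jobs` is a list of (memory_consumed_gb, memory_predicted_gb) pairs.
    """
    covered = sum(1 for consumed, predicted in jobs if consumed < predicted)
    return covered / len(jobs)


# Hypothetical completed jobs: (consumed GB, predicted GB)
jobs = [(3.5, 4.0), (1.0, 2.0), (6.5, 6.0), (2.0, 4.0)]
score = accuracy(jobs)              # 3 of the 4 jobs were not under-estimated
meets_objective = score >= 0.95     # the objective requires at least 95%
print(score, meets_objective)
```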
DSM algorithms
Challenge
• Data available for learning
  1. Job characteristics: user, project, command-line, etc.
  2. Actual resources consumed by the jobs, e.g., memory
• Output
DSM algorithms: regression tree – idea
[Figure: regression tree with splits on Priority (> 5 / <= 5) and NumOfLoops (> 0 / = 0); the leaves hold the values 3, 4 and 2. A new job with Priority=1 and NumOfLoops=2 gets the memory prediction 4GB.]
Fast Incremental Regression Tree with Drift Detection (FIRT-DD) (Ikonomovska et al., 2009)
DSM algorithms: regression tree – steps
1. Construct a tree using the Chernoff bound, comparing the standard deviation reduction (SDR) of all possible values as the split criterion
• All candidate variable values are tested
• Priority value 5 is found to give the best reduction in standard deviation
• The node is split on Priority (<= 5 / > 5)
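A minimal Python sketch of the SDR split criterion named above, assuming the common definition SDR = sd(S) − Σ (|S_i| / |S|) · sd(S_i). It only covers choosing the best threshold on one variable; the Chernoff-bound check on whether enough samples were seen is not shown, and the sample data is hypothetical.

```python
from statistics import pstdev


def sdr(all_values, left, right):
    """Standard deviation reduction achieved by splitting into left/right."""
    n = len(all_values)
    return (pstdev(all_values)
            - len(left) / n * pstdev(left)
            - len(right) / n * pstdev(right))


def best_split(samples):
    """samples: list of (priority, memory_gb). Returns (threshold, sdr)."""
    targets = [memory for _, memory in samples]
    best = None
    # Try every observed value except the largest, so both sides are non-empty.
    for threshold in sorted({priority for priority, _ in samples})[:-1]:
        left = [m for p, m in samples if p <= threshold]
        right = [m for p, m in samples if p > threshold]
        gain = sdr(targets, left, right)
        if best is None or gain > best[1]:
            best = (threshold, gain)
    return best


# Hypothetical jobs: low priorities use ~2GB, high priorities use ~8GB.
samples = [(1, 2.0), (3, 2.5), (5, 2.0), (6, 8.0), (8, 8.5), (9, 8.0)]
threshold, gain = best_split(samples)
print(threshold)  # the split "Priority <= 5" separates the two groups best
```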
DSM algorithms: regression tree – steps
2. Sliding window size is a pre-defined parameter
[Figure: tree with splits on Priority (> 5 / <= 5) and NumOfLoops (<> 0 / = 0); the highlighted leaf holds the value 2.]
• Sliding window size = 5
• Window before the update: jobs 1, 4, 1, 5, 2
• New job value 3 is added and job value 1 is discarded, giving jobs 4, 1, 5, 2, 3
• Re-calculate the prediction: median = 3
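The window update above can be replayed in a few lines of Python. This is only a sketch of a single leaf; using `collections.deque` for the window is an implementation choice, not something the slides prescribe.

```python
from collections import deque
from statistics import median

# Leaf window of size 5, oldest job on the left.
leaf_window = deque([1, 4, 1, 5, 2], maxlen=5)
before = median(leaf_window)   # median of 1, 4, 1, 5, 2

leaf_window.append(3)          # new job value 3 arrives, the oldest value 1 is discarded
after = median(leaf_window)    # median of 4, 1, 5, 2, 3

print(before, after)           # 2 3
```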
DSM algorithms: regression tree – steps
3. Adaptivity
• Track the error rate using the statistical Page-Hinkley (PH) test
• Grow a shadow sub-tree and replace the original once its accuracy is better
[Figure: tree with splits on Priority (<= 5 / > 5) and NumOfLoops (> 0 / = 0); a node with high error gets a shadow sub-tree splitting on CommandNum (> 10 / <= 10), and the error rates are compared.]
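A minimal Python sketch of a Page-Hinkley test for detecting an increase in the error stream, in its textbook form. The class name, the parameter values (`delta`, `threshold`) and the synthetic error stream are assumptions; this is not the exact FIRT-DD implementation.

```python
class PageHinkley:
    """Signals a change when the cumulative deviation of the error
    from its running mean rises too far above its historical minimum."""

    def __init__(self, delta=0.005, threshold=5.0):
        self.delta = delta          # tolerated magnitude of change
        self.threshold = threshold  # alarm threshold (lambda)
        self.count = 0
        self.mean = 0.0
        self.cumulative = 0.0
        self.minimum = 0.0

    def update(self, error):
        """Feed one error value; returns True if a drift is signalled."""
        self.count += 1
        self.mean += (error - self.mean) / self.count
        self.cumulative += error - self.mean - self.delta
        self.minimum = min(self.minimum, self.cumulative)
        return self.cumulative - self.minimum > self.threshold


# Synthetic error stream: stable at 0.1, then a sudden jump to 2.0.
errors = [0.1] * 50 + [2.0] * 50
detector = PageHinkley()
first_alarm = None
for index, error in enumerate(errors):
    if detector.update(error) and first_alarm is None:
        first_alarm = index
print(first_alarm)
```

In a tree learner, an alarm at a node would be the trigger for growing the shadow sub-tree.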
DSM algorithms: Hoeffding tree – idea
[Figure: classification tree with splits on Project (A / B) and Command Type (X / Y); the leaves are labelled fail, fail and pass. A new job with Project=A and CommandType=X is predicted to fail.]
Hoeffding Adaptive Tree (HAT)
Entropy & information gain
Go to the beach   Weather
Yes               Sunny
Yes               Sunny
Yes               Sunny
No                Sunny
Yes               Overcast
Yes               Overcast
No                Overcast
No                Overcast
No                Rain
No                Rain
No                Rain
No                Rain
\[ \text{Entropy}(S) = -\sum_{i=1}^{n} p_i \log_2 p_i \]
P(Beach = Yes) = 5/12, P(Beach = No) = 7/12
\[ \text{Entropy}(\text{Beach}) = -\tfrac{5}{12}\log_2\tfrac{5}{12} - \tfrac{7}{12}\log_2\tfrac{7}{12} = 0.98 \]
P(Beach = Yes | Weather = Sunny) = 3/4, P(Beach = No | Weather = Sunny) = 1/4
\[ \text{Entropy}(S_{sunny}) = -\tfrac{3}{4}\log_2\tfrac{3}{4} - \tfrac{1}{4}\log_2\tfrac{1}{4} = 0.81 \]
\[ \text{Entropy}(S_{overcast}) = 1, \quad \text{Entropy}(S_{rain}) = 0 \]
Entropy & information gain
Entropy(Beach) = 0.98, Entropy(S_sunny) = 0.81, Entropy(S_overcast) = 1, Entropy(S_rain) = 0
P(sunny) = P(overcast) = P(rain) = 4/12
Entropy(Beach | Weather) = P(sunny)*Entropy(S_sunny) + P(overcast)*Entropy(S_overcast) + P(rain)*Entropy(S_rain) = 4/12(0.81) + 4/12(1) + 4/12(0) = 0.6
By knowing the weather, how much information have I gained?
Gain = Entropy(X) - Entropy(X|Y)
For this example: Gain = 0.98 - 0.60 = 0.38
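The beach example can be reproduced with a short Python sketch. The helper names `entropy` and `conditional_entropy` are hypothetical; the data is the 12-row table from the slide.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    """Shannon entropy (base 2) of a list of labels."""
    total = len(labels)
    return -sum((count / total) * log2(count / total)
                for count in Counter(labels).values())


def conditional_entropy(pairs):
    """Entropy of the label given the condition, for (condition, label) pairs."""
    groups = defaultdict(list)
    for condition, label in pairs:
        groups[condition].append(label)
    return sum(len(group) / len(pairs) * entropy(group)
               for group in groups.values())


# (weather, go to the beach) rows from the slide
observations = ([("Sunny", "Yes")] * 3 + [("Sunny", "No")]
                + [("Overcast", "Yes")] * 2 + [("Overcast", "No")] * 2
                + [("Rain", "No")] * 4)

h_beach = entropy([label for _, label in observations])
h_beach_given_weather = conditional_entropy(observations)
gain = h_beach - h_beach_given_weather
print(round(h_beach, 2), round(h_beach_given_weather, 2), round(gain, 2))
# 0.98 0.6 0.38
```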
DSM algorithms: Hoeffding tree – steps
1. Construct a tree using information gain as the split criterion and the Hoeffding bound statistical test as a stopping condition
• Information gain G is calculated for all candidate variables
• If G(best attribute) − G(2nd best attribute) > ε, split the leaf on the best attribute
• ε = Hoeffding bound statistic
[Figure: the node is split using the Project variable (Project = A / Project = B).]
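A minimal Python sketch of the split decision above, assuming the usual Hoeffding bound ε = sqrt(R² ln(1/δ) / (2n)) and taking R = log2(number of classes) as the range of information gain. The gain values, `delta`, and the function names are illustrative assumptions.

```python
from math import log, log2, sqrt


def hoeffding_bound(value_range, delta, n):
    """Epsilon such that the true mean is within epsilon of the observed
    mean with confidence 1 - delta after n examples."""
    return sqrt(value_range ** 2 * log(1 / delta) / (2 * n))


def should_split(best_gain, second_gain, n, num_classes=2, delta=1e-7):
    """Split only if the best attribute beats the runner-up by more than epsilon."""
    epsilon = hoeffding_bound(log2(num_classes), delta, n)
    return best_gain - second_gain > epsilon


# Hypothetical gains: the same difference is not convincing after 50
# examples but is after 500, because epsilon shrinks as n grows.
early = should_split(0.38, 0.10, n=50)
later = should_split(0.38, 0.10, n=500)
print(early, later)  # False True
```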
DSM algorithms: Hoeffding tree – steps
2. Sliding window size is dynamic (discussed later...)
[Figure: tree with splits on Project (A / B) and Command Type (X / Y); the highlighted leaf is labelled pass.]
• Sliding window size = 5
• Window before the update: jobs +, +, +, -, -
• New job - is added and job + is discarded, giving jobs +, +, -, -, -
• Re-calculate the prediction: fail
DSM algorithms: Hoeffding tree – steps
3. Adaptivity
A. Window size change, similar to MSW (discussed later...)
B. Alternate tree:
• After a concept drift in the data stream, followed by a stable period, a new alternate tree is generated
• Track the error rate on the new concept
DSM algorithms: MSW – idea
[Figure: profiles arranged as a tree with splits on Project (A / B) and Command Type (X / Y); the leaf values are 4GB, 2GB, 10GB and 4GB. A new job with Project=A and CommandType=X gets the memory prediction 4GB.]
Multiple Sliding Windows (MSW) (Mimran & Even, 2014)
DSM algorithms: MSW
1. Before training the model, find the set of variables that impact the memory consumption
• The method used: forward selection, minimizing variance and the number of profiles

Variable set selection illustration:
Selected Variable Set   Variable Rank    Candidate Variables
D                       1,2,3,4,3,2,1    A, B, C, D, E, F, G
D, A                    6,5,4,3,2,1      A, B, C, E, F, G
D, A, G                 4,6,8,10,20      B, C, E, F, G
D, A, G, B              3,1,1,2          B, C, E, F
D, A, G, B              -1,-2,0          C, E, F
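A minimal Python sketch of greedy forward selection that replays the illustration above. Two points are read off the table rather than stated in the slides, so treat them as assumptions: the candidate with the highest rank is picked at each step, and selection stops when no candidate has a positive rank. The rank values are simply looked up from the table (`RANKS`, `table_rank` are hypothetical names); in practice they would come from the selection criterion.

```python
def forward_select(candidates, rank):
    """Greedily add the best-ranked variable until no rank is positive."""
    selected = []
    remaining = list(candidates)
    while remaining:
        scores = {variable: rank(selected, variable) for variable in remaining}
        best = max(scores, key=scores.get)
        if scores[best] <= 0:       # assumed stopping rule
            break
        selected.append(best)
        remaining.remove(best)
    return selected


# Ranks copied from the illustration, keyed by the already-selected set.
RANKS = {
    (): dict(zip("ABCDEFG", [1, 2, 3, 4, 3, 2, 1])),
    ("D",): dict(zip("ABCEFG", [6, 5, 4, 3, 2, 1])),
    ("D", "A"): dict(zip("BCEFG", [4, 6, 8, 10, 20])),
    ("D", "A", "G"): dict(zip("BCEF", [3, 1, 1, 2])),
    ("D", "A", "G", "B"): dict(zip("CEF", [-1, -2, 0])),
}


def table_rank(selected, candidate):
    return RANKS[tuple(selected)][candidate]


chosen = forward_select("ABCDEFG", table_rank)
print(chosen)  # ['D', 'A', 'G', 'B']
```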
DSM algorithms: MSW
2. Set a sliding window per profile
Predict Label
Objective: maximum saving + minimum 95% accuracy
Chosen strategy: linear prediction function (φ = 0.95, C = 0.1)
DSM algorithms: MSW
3. Use any given prediction function within the window
DSM algorithms: MSW
4. Set the window size dynamically, using change detector
• Example of concept-drift management for a window with 850 jobs
• Sub-window size parameter is 200 and the confidence levels are 97.5%, 95%, 90%, 90%
[Figure: division into sub-windows. The 850-job window is divided into sub-windows of 200, 200, 200 and 250 jobs, ordered toward the older observations, and is compared at three split points.]
• 1st change detection comparison, 97.5% confidence level: 200 vs. 650 jobs
• 2nd change detection comparison, 95% confidence level: 400 vs. 450 jobs
• 3rd change detection comparison, 90% confidence level: 600 vs. 250 jobs
Flow in case the 2nd comparison is statistically significant:
• 1st comparison (200 vs. 650) is not statistically significant → go to the next sub-windows
• 2nd comparison (400 vs. 450) is statistically significant → prune the window, leaving 400 jobs
4. Set the window size dynamically, using change detector
• Change detector function: Hoeffding bound
• Alternate function:
• Kolmogorov–Smirnov test
The Hoeffding bound (Hoeffding, 1963), also known as the additive Chernoff bound:
\[ \epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}} \]
R - The range of the variable
(1-δ) - The statistical confidence
n - The number of examples
[Figure: Kolmogorov–Smirnov test illustration. Courtesy of Wikipedia, http://en.wikipedia.org/wiki/File:KS_Example.png]
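A minimal Python sketch of a Hoeffding-bound change detector over one window, loosely following the 850-job example. Several details are assumptions, not taken from the slides: the window is ordered oldest to newest, the means of the recent and older parts are compared, n is taken as the size of the smaller part, the recent part is kept on a detected change, and the value range R = 10 and the synthetic data are made up.

```python
from math import log, sqrt
from statistics import mean


def hoeffding_bound(value_range, delta, n):
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n))"""
    return sqrt(value_range ** 2 * log(1 / delta) / (2 * n))


def prune_window(window, sub_size, confidences, value_range):
    """Compare the most recent k*sub_size samples against the older rest,
    one confidence level per comparison; keep only the recent part when
    the difference in means exceeds the bound."""
    for step, confidence in enumerate(confidences, start=1):
        recent_len = step * sub_size
        if recent_len >= len(window):
            break
        older, recent = window[:-recent_len], window[-recent_len:]
        n = min(len(older), len(recent))          # assumed choice of n
        epsilon = hoeffding_bound(value_range, 1 - confidence, n)
        if abs(mean(recent) - mean(older)) > epsilon:
            return recent                         # change detected: prune
    return window                                 # no change: keep everything


# Synthetic 850-job window: memory shifts from 4GB to 5GB for the last 400 jobs.
window = [4.0] * 450 + [5.0] * 400
levels = [0.975, 0.95, 0.90, 0.90]
pruned = prune_window(window, sub_size=200, confidences=levels, value_range=10.0)
print(len(pruned))  # 400
```

With this data the first comparison (200 vs. 650) stays below its bound and the second (400 vs. 450) exceeds it, so 400 jobs remain.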
DSM algorithms: MSW
Summary & conclusions
Model | Model Type | Sliding Windows | Window Size | Adaptivity | Change Detector
Incremental Online Info-fuzzy Network (IOLIN) (Cohen et al., 2008) | classification | 1 window | heuristic | update network | accuracy degradation
Fast Incremental Regression Tree with Drift Detection (FIRT-DD) (Ikonomovska et al., 2009) | regression | multiple windows | user-defined | shadow model | error PH test
Concept-adapting Very Fast Decision Trees (CVFDT) (Hulten et al., 2001) | classification | 1 window | user-defined | shadow model | Hoeffding bound
Hoeffding Adaptive Tree (HAT) (Bifet et al., 2009) | classification | multiple windows | dynamic | modify window | Hoeffding bound
Multiple Sliding Windows (MSW) (Mimran & Even, 2014) | classification, regression | multiple windows | dynamic | modify window | Hoeffding bound
MSW in production: predicting jobs memory usage
• Deploying the model improved throughput by 10%
  • By allowing the scheduler to fit more jobs on the available resources
Can we do the same for the jobs' runtime?
• Jobs' runtime behavior is more “chaotic” compared to memory
  • Some jobs get killed upon startup, e.g., due to configuration issues
  • Jobs sharing the same CPU create contention, impacting runtime
  • Environmental issues impact runtime, e.g., file system slowness
  • Non-uniform server configurations with different CPU speeds
  • Hyper-threading, etc.
• Proposed approach – predict the extremes
  • MAX (improved throughput by 5%)
    • 0.5% of jobs consume ~10% of resources, with a high failure rate
    • Predict the outliers using large windows and kill them
  • MIN (not implemented yet)
    • ~50% of jobs run less than 5 minutes
    • Predict whether a job's runtime is short, for better scheduling use cases
References
• Cohen, L., Avrahami, G., Last, M., & Kandel, A. (2008, September). Info-fuzzy
algorithms for mining dynamic data streams. Applied Soft Computing, 8(4), 1283-1294.
• Ikonomovska, E., Gama, J., Sebastião, R., & Gjorgjevik, D. (2009). Regression Trees
from Data Streams with Drift Detection. Discovery Science - Lecture Notes in Computer Science (pp. 121-135). Springer Berlin / Heidelberg.
• Hulten, G., Spencer, L., & Domingos, P. (2001). Mining time-changing data
streams. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining (KDD '01). (pp. 97-106). New York: ACM.
• Bifet, A., & Gavaldà, R. (2009). Adaptive Learning from Evolving Data Streams. In
N. Adams, C. Robardet, A. Siebes, & J.-F. Boulicaut (Eds.), Advances in Intelligent Data Analysis VIII / Lecture Notes in Computer Science (Vol. 5772, pp. 249 - 260). Berlin / Heidelberg: Springer.
• Mimran, O. & Even, A. (2014). Data Stream Mining With Multiple Sliding Windows
For Continuous Prediction. Proceedings of the European Conference on Information Systems (ECIS). AISeL.
Thank You
MSW feature selection
• Data Dictionary Selection (DDS) criterion:
\[ DDS = \sum_{j=1}^{J} \sigma_j^2 N_j / N - P \]
N - The total number of observations
J - The number of profiles considered
σ_j - The standard deviation of profile j
N_j - The number of observations in profile j
P - The number of profiles generated (i.e., distinct value combinations)
• Normalized DDS criterion:
\[ V_0 = \sigma^2; \quad P_0 = 1; \quad V_i = \sum_{j=1}^{J} \sigma_j^2 N_j / N; \quad DDS_i = \frac{V_i}{V_{i-1}} - \frac{P_i}{P_{i-1}} \]
P_i - The number of profiles in step i
i - The step number
• Normalized DDS criterion with minimum support α:
\[ V_0 = \sigma^2; \quad P_0 = 1; \quad V_i = \sum_{j=1}^{n} \begin{cases} \sigma_j^2 N_j / N & N_j \ge \alpha \\ 0 & \text{otherwise} \end{cases} \]
\[ DDS_i = \frac{V_i}{V_{i-1}} - \frac{P_i}{P_{i-1}} \; \sum_{j=1}^{n} \begin{cases} N_j & N_j \ge \alpha \\ 0 & \text{otherwise} \end{cases} \Big/ N \]
DSM algorithms: MSW
• Multiple Sliding Windows (MSW) (Mimran & Even, 2014)
• MSW strategy:
  • Find a variable set that divides the data into a minimal set of profiles (clusters) with minimal variance (done once)
  • Set a sliding window per profile
  • Use any given prediction function within the windows
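The strategy above can be sketched in Python as one sliding window per profile with a pluggable prediction function. This is a simplified illustration, not the authors' implementation: the class and method names are hypothetical, the window size is fixed rather than set by a change detector, and `max` stands in for the prediction function.

```python
from collections import defaultdict, deque


class MultipleSlidingWindows:
    def __init__(self, profile_variables, window_size, predict):
        self.profile_variables = profile_variables   # variable set found once, up front
        self.predict = predict                       # any function of the window values
        self.windows = defaultdict(lambda: deque(maxlen=window_size))

    def profile(self, job):
        """A profile is one distinct combination of the selected variables."""
        return tuple(job[variable] for variable in self.profile_variables)

    def observe(self, job, memory_gb):
        """Record the actual memory of a completed job in its profile's window."""
        self.windows[self.profile(job)].append(memory_gb)

    def predict_memory(self, job):
        """Predict from the job's profile window, or None for an unseen profile."""
        window = self.windows.get(self.profile(job))
        if not window:
            return None
        return self.predict(list(window))


msw = MultipleSlidingWindows(["project", "command"], window_size=3, predict=max)
job = {"project": "A", "command": "X", "user": "u1"}
for memory in [5.0, 3.0, 4.0, 3.5]:          # 5.0 slides out of the size-3 window
    msw.observe(job, memory)

known = msw.predict_memory(job)                                    # 4.0
unknown = msw.predict_memory({"project": "B", "command": "Y"})     # None
```

A conservative function such as `max` or an upper quantile favors the accuracy objective, at the cost of smaller savings.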