Data-Driven Performance Management
in Practice for Online Services
Dongmei Zhang
Principal Researcher/Research Manager
Software Analytics group, Microsoft Research Asia
Using data-driven approaches to help create high-performing, user-friendly, and efficiently developed and operated services.
Landscape of Online Service Analytics
Service Usage, Service Development, Service System
Areas of Service Quality
Timely detection of service anomalies
• Active probing
• Efficient notification
Fast restoration of services
• Classification
• Categorization
• Fault localization
• Recovery recommendation
Proactive fixing of service problems
• Root cause analysis
• Hidden problem mining
• Classification
• Categorization
Example Online Service
• Worldwide deployment
• Millions of users
• Multi-tier architecture
• Various server roles
– Content
– Component service
– Database
– Telemetry
Challenges
• Problem complexity
– Multi-tier, highly complex system
• Influential factors
– Code defects, operation issues, engineering practice
• Data scale & completeness
– Large scale of data, various types of data
– High-dimensional with complex relationships
– Lack of end-to-end instrumentation
• Knowledge
– Incomplete, not well organized
Our Work – From Research to Impact
Monitoring & alerting
• Anomaly detection
Incident management
• Correlation analysis
• Execution pattern mining
• Knowledge base
Problem management
• Categorization
• Fault localization
• Researching techniques for comprehensive analysis of monitoring data
• Developing integrated solutions targeting real scenarios
• Making impact via close collaboration with product teams
Performance Issue Diagnosis
Given a performance problem of an online
system and a large amount of system monitoring
data, how to automatically identify a small
subset of metrics that are most relevant to the
given performance problem?
Qiang FU, Jian-Guang LOU, Qingwei LIN, Justin DING, Dongmei ZHANG, Zihao YE, and Tao XIE, Performance Issue Diagnosis for Online Service Systems, the 31st International Symposium on Reliable Distributed Systems (SRDS), short paper, 2012.
Rule-based Approaches
• Use pre-defined rules to select metrics of
interest
• Advantage: High accuracy for known problems
• Limitation
– Substantial domain knowledge required
– Only suitable for known problems
– Significant effort required to keep rules
updated and consistent
Statistical Learning Approaches
• Using models learned from data to select relevant
metrics
• Classifier-based approaches as typical examples
– P. Bodik, EuroSys’10
– I. Cohen, OSDI’04, SOSP’05
• Advantage: No or little domain knowledge required
• Limitations
– Difficult to deal with co-occurring issues
– Difficult to deal with short-duration issues
– Tend to identify only general symptoms
Challenges to Classifier-based
Approaches
• Overfitting due to short issue duration (false positives)
Approach Overview
“Global mining” + “Local ranking” for accurate
identification of relevant metrics
• Discretize real values of metrics into two states, i.e.,
normal and abnormal
• Mine associations between metric anomalies and
KPI violations from all historical data
• Rank relevant metrics based on data from the time period
in which the issue occurred
Detecting Anomalies in Metrics
• Build a histogram of metric values from historical data
• Detect anomalies using a p-percentile threshold
[Figure: Value distribution of an event]
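The two steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the nearest-rank percentile method, and the one-sided rule (values above the threshold are abnormal) are all assumptions made for the example.

```python
import math
from typing import List, Sequence


def percentile_threshold(history: Sequence[float], p: float) -> float:
    """Return the p-th percentile of historical values (nearest-rank method)."""
    if not history:
        raise ValueError("history must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(history)
    # Nearest rank: smallest value with at least p% of the data at or below it.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def discretize(values: Sequence[float], threshold: float) -> List[int]:
    """Map each value to 1 (abnormal) if it exceeds the threshold, else 0 (normal).

    Treating only high values as abnormal is an assumption of this sketch.
    """
    return [1 if v > threshold else 0 for v in values]
```

For example, with historical values 1 through 10 and p = 90, the threshold is 9, so only a new value of 10 would be marked abnormal.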
Class Association Rule (CAR) Mining
Format of class association rule
{M1=1, M2=1, …} => SLA violation
(antecedent => consequence)
Mining results
        M1     M2     M3
Sup     7      3      3
Conf    88%    100%   38%
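To make the rule format concrete, here is a brute-force sketch of mining class association rules from discretized epochs. It assumes that support is the number of epochs in which the antecedent holds together with a violation, and that confidence is that count divided by the number of epochs in which the antecedent holds. The function name, the thresholds, and the exhaustive enumeration are illustrative choices; a real miner would typically prune candidates rather than enumerate them all.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set, Tuple


def mine_cars(
    epochs: List[Set[str]],
    violations: List[int],
    min_sup: int = 1,
    min_conf: float = 0.5,
    max_len: int = 2,
) -> Dict[FrozenSet[str], Tuple[int, float]]:
    """Return {antecedent: (support, confidence)} for rules 'antecedent => violation'.

    epochs[i] is the set of metrics that are abnormal (=1) in epoch i;
    violations[i] is 1 if the SLA/KPI was violated in epoch i, else 0.
    """
    metrics = sorted(set().union(*epochs))
    rules: Dict[FrozenSet[str], Tuple[int, float]] = {}
    for size in range(1, max_len + 1):
        for combo in combinations(metrics, size):
            antecedent = frozenset(combo)
            # Violation labels of the epochs where every metric in the antecedent is abnormal.
            covered = [v for e, v in zip(epochs, violations) if antecedent <= e]
            support = sum(covered)
            if not covered or support < min_sup:
                continue
            confidence = support / len(covered)
            if confidence >= min_conf:
                rules[antecedent] = (support, confidence)
    return rules
```

Under these definitions, a metric that is abnormal in 8 epochs, 7 of which are violations, would have support 7 and confidence 7/8.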
Ranking of Relevant Metric Set
• Each violation epoch has a set of satisfied rules based on its metric values
• We assume that adjacent KPI violations share the same cause
• We aggregate the satisfied rules of adjacent violation epochs to find the best rules
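The grouping and aggregation described above might look like the sketch below. It assumes that "adjacent" means consecutive epochs and that aggregation means counting how often each rule is satisfied within a group. Both are assumptions made for illustration, and the function names are hypothetical.

```python
from collections import Counter
from typing import FrozenSet, Iterable, List, Set, Tuple


def group_adjacent_violations(violations: List[int]) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) index ranges of consecutive violation epochs."""
    groups: List[Tuple[int, int]] = []
    start = None
    for i, v in enumerate(violations):
        if v and start is None:
            start = i
        elif not v and start is not None:
            groups.append((start, i - 1))
            start = None
    if start is not None:
        groups.append((start, len(violations) - 1))
    return groups


def satisfied_rule_counts(
    epochs: List[Set[str]],
    group: Tuple[int, int],
    antecedents: Iterable[FrozenSet[str]],
) -> Counter:
    """Count, per rule antecedent, the epochs in the group that satisfy it."""
    start, end = group
    counts: Counter = Counter()
    for abnormal in epochs[start:end + 1]:
        for antecedent in antecedents:
            if antecedent <= abnormal:
                counts[antecedent] += 1
    return counts
```

The per-group counts can then feed a ranking function over the investigated period.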
Ranking Function
For the investigated period, a rule’s score is
\[
MI(S; M) = \sum_{S,\, M=1} P(S, M) \log \frac{P(S \mid M)}{P(S)}
= \frac{K_1}{K} \log \frac{c(M)}{P(S=1)} + \frac{K_2}{K} \log \frac{1 - c(M)}{P(S=0)}
\]
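M: metrics (antecedent of a class association rule)
c(M): confidence of CAR(M)
K1: number of epochs with M=1 & S=1 in the investigated period
K2: number of epochs with M=1 & S=0 in the investigated period (implied by the formula)
K: total number of epochs in the investigated period (implied by the formula)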
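A small sketch of evaluating this score for one rule is shown below. It reads the formula as: K1 and K2 are the counts of epochs in the investigated period with M=1 and S=1 or S=0 respectively, K is the total number of epochs in that period, and P(S=1) is a prior violation probability supplied by the caller (for instance estimated from history, which is an assumption). The natural logarithm and the convention that a zero-weight term contributes 0 are also assumptions of this sketch.

```python
import math


def _term(weight: float, numerator: float, denominator: float) -> float:
    """Compute weight * log(numerator / denominator), treating 0 * log(.) as 0."""
    if weight == 0:
        return 0.0
    if numerator == 0:
        return float("-inf")
    return weight * math.log(numerator / denominator)


def rule_score(k1: int, k2: int, k: int, conf: float, p_violation: float) -> float:
    """Score a rule as K1/K * log(c/P(S=1)) + K2/K * log((1-c)/P(S=0))."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not 0.0 <= conf <= 1.0:
        raise ValueError("conf must be in [0, 1]")
    if not 0.0 < p_violation < 1.0:
        raise ValueError("p_violation must be strictly between 0 and 1")
    return _term(k1 / k, conf, p_violation) + _term(k2 / k, 1.0 - conf, 1.0 - p_violation)
```

With k1=3, k2=1, k=10, conf=0.75 and p_violation=0.25, the two terms are 0.3·log 3 and 0.1·log(1/3), so the score is 0.2·log 3. Rules whose antecedent is abnormal mostly during violations, and whose confidence is well above the prior, receive higher scores.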
Evaluation
• Service setup
– Internal online service, customer-facing, geographically distributed
– 13 deployed web front-end servers
– 1129 metrics: 549 performance counters + 580 service events
– 36 performance issues within 2 months
Healing Suggestion Recommendation
Given a newly occurred issue x on a service and a
set of historical issues X, how to automatically
identify the proper healing action for x in order to
restore the service?
Issue
– An unplanned interruption or degradation in the quality of an online service
Healing action
Approach
• Signature extraction for issues
• Issue comparison using similarity of signatures
• Adaptation of healing actions
Rui Ding, Qiang Fu, Jian-Guang Lou, Qingwei Lin, Dongmei
Zhang, Jiajun Shen, Tao Xie, Healing Online Service Systems
via Mining Historical Issue Repositories, short paper, ASE
2012
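The issue-comparison step can be illustrated with a simple nearest-neighbor lookup. This sketch assumes an issue signature is a set of string tokens and uses Jaccard similarity, which is a stand-in chosen for the example rather than the similarity measure from the cited paper. The class, field, and function names are hypothetical, and the adaptation of the retrieved healing action to the new issue is not covered.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Tuple


@dataclass(frozen=True)
class HistoricalIssue:
    issue_id: str
    signature: FrozenSet[str]   # hypothetical: tokens extracted from the issue's data
    healing_action: str


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def suggest_healing_action(
    new_signature: FrozenSet[str],
    history: List[HistoricalIssue],
    min_similarity: float = 0.5,
) -> Optional[Tuple[HistoricalIssue, float]]:
    """Return the most similar historical issue and its similarity, or None
    if no historical issue reaches min_similarity."""
    best: Optional[Tuple[HistoricalIssue, float]] = None
    for issue in history:
        sim = jaccard(new_signature, issue.signature)
        if best is None or sim > best[1]:
            best = (issue, sim)
    if best is None or best[1] < min_similarity:
        return None
    return best
```

Returning None when nothing is similar enough keeps the tool from recommending an action for an issue unlike anything in the repository.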
Lessons Learned from Practice
• Problem driven vs. technology driven
• Investment in building systems/tools
– Small wins helping build trust
– Bringing feedback from practice
– Creating more research opportunities
Landscape of Online Service Analytics
Development, Deployment, Migration, Monitoring & Alerting, Incident Management, Problem Management, Engagement & Usage, Customer Support, …
Search Services, Social Services, Email Services, Image Services, Portal Services, Storage & Computing Services