Data-Driven Performance Management in Practice for Online Services

(1)

Dongmei Zhang

Principal Researcher/Research Manager

Software Analytics group, Microsoft Research Asia

(2)

Using data-driven approaches to help create high-performing, user-friendly, and efficiently developed and operated services.

Landscape of Online Service Analytics

• Service usage

• Service development

• Service system

(3)

Areas of Service Quality

• Timely detection of service anomalies

– Active probing

– Efficient notification

• Fast restoration of services

– Classification

– Categorization

– Fault localization

– Recovery recommendation

• Proactive fixing of service problems

– Root cause analysis

– Hidden problem mining

– Classification

– Categorization

(4)

Example Online Service

• Worldwide deployment

• Millions of users

• Multi-tier architecture

• Various server roles

– Content

– Component service

– Database

– Telemetry

(5)

Challenges

• Problem complexity

– Multi-tier, highly complex system

• Influential factors

– Code defects, operation issues, engineering practice

• Data scale & completeness

– Large scale of data, various types of data

– High-dimensional, with complex relationships

– Lack of end-to-end instrumentation

• Knowledge

– Incomplete, not well organized

(6)

Our Work – From Research to Impact

Monitoring & alerting

• Anomaly detection

Incident management

• Correlation analysis

• Execution pattern mining

• Knowledge base

Problem management

• Categorization

• Fault localization

• Researching techniques for comprehensive analysis of monitoring data

• Developing integrated solutions that target real scenarios

• Making impact via close collaboration with product teams

(7)

Performance Issue Diagnosis

Given a performance problem in an online system and a large amount of system monitoring data, how can we automatically identify a small subset of metrics that are most relevant to the given performance problem?

Qiang Fu, Jian-Guang Lou, Qingwei Lin, Justin Ding, Dongmei Zhang, Zihao Ye, and Tao Xie, Performance Issue Diagnosis for Online Service Systems, short paper, the 31st International Symposium on Reliable Distributed Systems (SRDS), 2012.

(8)

Rule-based Approaches

• Use pre-defined rules to select metrics of interest

• Advantage: High accuracy for known problems

• Limitations

– Good domain knowledge required

– Only suitable for known problems

– Significant effort required to keep rules updated and consistent

(9)

Statistical Learning Approaches

• Use models learned from data to select relevant metrics

• Classifier-based approaches are typical examples

– P. Bodik, EuroSys’10

– I. Cohen, OSDI’04, SOSP’05

• Advantage: No or little domain knowledge required

• Limitations

– Difficult to deal with co-occurring issues

– Difficult to deal with short-lived issues

– Tend to identify only general symptoms

(10)

Challenges to Classifier-based Approaches (1)

• Overfitting due to short issue duration (false positives)

(11)

Challenges to Classifier-based Approaches (2)

(12)

Challenges to Classifier-based Approaches (3)

(13)

Approach Overview

“Global mining” + “Local ranking” for accurate identification of relevant metrics

• Discretize the real values of metrics into two states: normal and abnormal

• Mine associations between metric anomalies and KPI violations from all historical data

• Rank relevant metrics based on data from the time period in which the issue occurred

(14)

Detecting Anomalies in Metrics

• Get a histogram of metric values from historical data

• Detect anomalies using a p-percentile threshold

[Figure: Value distribution of an event (histogram; x-axis ticks from 0 to 80)]
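The two steps above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function names `percentile_threshold` and `discretize` are hypothetical, the nearest-rank percentile definition is one of several in common use, and treating only values above the threshold as abnormal is an assumption (the slide does not say which tail is used).

```python
import math


def percentile_threshold(history, p):
    """Return the p-th percentile of historical values (nearest-rank method)."""
    if not history:
        raise ValueError("history must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(history)
    # Nearest-rank: smallest value with at least p% of the data at or below it.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def discretize(values, threshold):
    """Map each value to 1 (abnormal) if it exceeds the threshold, else 0 (normal).

    Assumption: only the upper tail is treated as abnormal.
    """
    return [1 if v > threshold else 0 for v in values]


# Toy usage with made-up numbers: learn the threshold from history,
# then label new observations of the same metric.
history = list(range(1, 101))
threshold = percentile_threshold(history, 95)
states = discretize([10, 95, 96, 200], threshold)
```

A metric for which low values signal trouble (for example, available memory) would need the comparison flipped or a second, lower threshold.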

(15)

Class Association Rule (CAR) Mining

Format of class association rule

{M₁=1, M₂=1, …} => SLA violation

Antecedent: {M₁=1, M₂=1, …}; consequent: SLA violation

Mining results

        M1     M2     M3
Sup     7      3      3
Conf    88%    100%   38%
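To make support and confidence concrete, here is a minimal brute-force sketch. It assumes support is the absolute number of epochs where the antecedent holds and the SLA is violated (in line with the integer Sup values above), and confidence is that count divided by the number of epochs where the antecedent holds. The function name, the metric names, and the toy data are all hypothetical; a real miner would use a pruning algorithm such as Apriori rather than enumerating every combination.

```python
from itertools import combinations


def mine_class_association_rules(epochs, violations,
                                 min_support=1, min_confidence=0.0, max_len=2):
    """Enumerate rules {metrics all abnormal} => violation.

    epochs:     list of dicts mapping metric name -> 0 (normal) or 1 (abnormal)
    violations: list of 0/1 flags, one per epoch (1 = SLA violation)
    Returns a list of (antecedent, support, confidence) tuples.
    """
    metrics = sorted({m for epoch in epochs for m in epoch})
    rules = []
    for size in range(1, max_len + 1):
        for antecedent in combinations(metrics, size):
            # Violation flags of the epochs in which every antecedent metric is abnormal.
            covered = [s for epoch, s in zip(epochs, violations)
                       if all(epoch.get(m, 0) == 1 for m in antecedent)]
            if not covered:
                continue
            support = sum(covered)
            confidence = support / len(covered)
            if support >= min_support and confidence >= min_confidence:
                rules.append((antecedent, support, confidence))
    return rules


# Toy data (made up for illustration): five epochs, two metrics.
epochs = [
    {"cpu": 1, "queue": 1},
    {"cpu": 1, "queue": 0},
    {"cpu": 1, "queue": 0},
    {"cpu": 0, "queue": 1},
    {"cpu": 0, "queue": 0},
]
violations = [1, 1, 0, 1, 0]
rules = mine_class_association_rules(epochs, violations,
                                     min_support=2, min_confidence=0.6)
```

With these thresholds the two single-metric rules survive, while the pair rule is dropped because it is supported by only one epoch.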

(16)

Ranking of Relevant Metric Set

• Each violation epoch has a set of satisfied rules based on its metric values

• We assume that adjacent KPI violations share the same cause

• We aggregate the satisfied rules of adjacent violation epochs to find the optimal rules
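The grouping and aggregation described above might look like the following sketch. It assumes "adjacent" means consecutive epochs in the time series and that aggregation simply counts how many epochs in a group satisfy each rule; both are interpretations, and every name here is hypothetical. Rules are represented only by their antecedents, as tuples of metric names.

```python
from collections import Counter


def group_adjacent_violations(violations):
    """Return (start, end) index pairs, inclusive, for runs of consecutive violation epochs."""
    groups = []
    start = None
    for i, flag in enumerate(violations):
        if flag == 1 and start is None:
            start = i
        elif flag != 1 and start is not None:
            groups.append((start, i - 1))
            start = None
    if start is not None:
        groups.append((start, len(violations) - 1))
    return groups


def satisfied_rules(epoch, rules):
    """Rules whose antecedent metrics are all abnormal (== 1) in this epoch."""
    return [rule for rule in rules if all(epoch.get(m, 0) == 1 for m in rule)]


def aggregate_rules(epochs, violations, rules):
    """For each run of adjacent violations, count how many epochs satisfy each rule."""
    result = []
    for start, end in group_adjacent_violations(violations):
        counts = Counter()
        for epoch in epochs[start:end + 1]:
            counts.update(satisfied_rules(epoch, rules))
        result.append(((start, end), counts))
    return result


# Toy data (made up): two separate incidents in five epochs.
epochs = [{}, {"cpu": 1}, {"cpu": 1, "queue": 1}, {}, {"queue": 1}]
violations = [0, 1, 1, 0, 1]
rules = [("cpu",), ("queue",), ("cpu", "queue")]
incidents = aggregate_rules(epochs, violations, rules)
```

The per-incident counts are the kind of local evidence that a ranking function can then combine with the globally mined confidence of each rule.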

(17)

Ranking Function

For the investigated period, a rule’s score is

$$
\mathrm{MI}(S; M) = \sum_{S,\, M=1} P(S, M)\,\log\frac{P(S \mid M)}{P(S)} = \frac{K_1}{K}\,\log\frac{c(M)}{P(S=1)} + \frac{K_2}{K}\,\log\frac{1 - c(M)}{P(S=0)}
$$

M: metrics (antecedent of a class association rule)

c(M): confidence of CAR(M)

K₁: number of epochs with M=1 & S=1 in the investigated period
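A small sketch of how this score could be computed follows. Several points are assumptions, because the slide text defines only K₁: K₂ is taken to be the number of epochs with M=1 and S=0 in the investigated period, K the total number of epochs in that period, and P(S=1) the overall violation rate estimated from historical data, with P(S=0) = 1 − P(S=1). The natural logarithm and the convention that a term with a zero count contributes 0 are also choices made here, and the function names are hypothetical.

```python
import math


def _term(count, total, numerator, denominator):
    """One summand (count/total) * log(numerator/denominator), with 0 * log(.) = 0."""
    if count == 0:
        return 0.0
    if numerator == 0:
        # The rule's confidence says this outcome should never happen, yet it was observed.
        return float("-inf")
    return (count / total) * math.log(numerator / denominator)


def rule_score(k1, k2, k, confidence, p_violation):
    """Score a rule for the investigated period.

    k1: epochs with M=1 and S=1 (from the slide)
    k2: epochs with M=1 and S=0 (assumed definition)
    k:  total epochs in the investigated period (assumed definition)
    confidence:  c(M), the rule's confidence mined from all historical data
    p_violation: P(S=1), the overall violation rate (assumed to come from history)
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if not 0 < p_violation < 1:
        raise ValueError("p_violation must be strictly between 0 and 1")
    return (_term(k1, k, confidence, p_violation)
            + _term(k2, k, 1 - confidence, 1 - p_violation))


# Toy comparison (made-up numbers): both rules fire in all 4 violating epochs,
# but the rule with higher global confidence receives the higher score.
strong = rule_score(k1=4, k2=0, k=4, confidence=0.9, p_violation=0.1)
weak = rule_score(k1=4, k2=0, k=4, confidence=0.5, p_violation=0.1)
```

The local counts K₁ and K₂ weight the score toward rules that actually fire during the issue, while c(M) carries the evidence mined globally.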

(18)

Evaluation

• Service setup

– Internal online service, customer-facing, geographically distributed

– 13 deployed web front-end servers

– 1129 metrics: 549 performance counters + 580 service events

– 36 performance issues within 2 months

(19)

Healing Suggestion Recommendation

Given a newly occurred issue x on a service and a set of historical issues X, how can we automatically identify the proper healing action for x in order to restore the service?

Issue

– An unplanned interruption or degradation in the quality of an online service

Healing action

(20)

Approach

• Signature extraction for issues

• Issue comparison using similarity of signatures

• Adaptation of healing actions

Rui Ding, Qiang Fu, Jian-Guang Lou, Qingwei Lin, Dongmei Zhang, Jiajun Shen, Tao Xie, Healing Online Service Systems via Mining Historical Issue Repositories, short paper, ASE 2012
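The retrieval part of this approach can be illustrated with a deliberately simple sketch. It is not the method from the paper: a signature is modeled here as a plain set of symptom tokens, similarity is Jaccard similarity (a stand-in chosen for brevity), and the function names, tokens, and action strings are hypothetical. The adaptation step is not covered.

```python
def jaccard(a, b):
    """Jaccard similarity of two token collections (1.0 when both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def recommend_healing_action(new_signature, history, min_similarity=0.5):
    """Return (action, similarity) of the most similar historical issue.

    history: list of (signature, healing_action) pairs
    Returns None when no historical issue reaches min_similarity.
    """
    best = None
    for signature, action in history:
        similarity = jaccard(new_signature, signature)
        if best is None or similarity > best[1]:
            best = (action, similarity)
    if best is None or best[1] < min_similarity:
        return None
    return best


# Toy repository (made up) of past issues and the actions that resolved them.
history = [
    ({"db_timeout", "queue_full"}, "restart database role"),
    ({"cpu_high"}, "recycle application pool"),
]
suggestion = recommend_healing_action({"db_timeout", "queue_full", "cpu_high"}, history)
no_match = recommend_healing_action({"disk_full"}, history)
```

Returning nothing below a similarity floor keeps the tool from suggesting an action when the new issue resembles nothing in the repository.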

(21)

Lessons Learned from Practice

• Problem driven vs. technology driven

• Investment in building systems/tools

– Small wins helping build trust

– Bringing feedback from practice

– Creating more research opportunities

(22)

Landscape of Online Service Analytics

Development, Deployment, Migration, Monitoring & Alerting, Incident Management, Problem Management, Engagement & Usage, Customer Support, …

Search Services, Social Services, Email Services, Image Services, Portal Services, Storage & Computing Services

(23)