
Managing Availability and Failure Avoidance


PKE Consulting LLC

Innovate Integrate Transform

Interaction Information Networks


Phil Edholm

President and Principal, PKE Consulting LLC


Most organizations have some level of requirement to understand and document Business Continuity

• System availability
• Disaster recovery
• Operational risk

Availability of a VoIP/UC system is a critical part of Business Continuity planning and documentation

Many organizations EXPECT high availability of their communications solutions

Often availability is defined by Five Nines


Understanding Nines

Five “Nines” is 99.999% Availability

Or

0.001% Unavailability

Or

about 5.26 minutes of unavailability per year


What the Nines Mean

Availability (%)   Number of "Nines"   Average Unavailability (Minutes per Year)   Average Unavailability (Hours per Year)
99.9999            6                   0.5256                                      0.00876
99.999             5                   5.256                                       0.0876
99.99              4                   52.56                                       0.876
99.9               3                   525.6                                       8.76
99.0               2                   5,256                                       87.6
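The rows of this table follow from one piece of arithmetic: unavailability as a fraction, multiplied by the 525,600 minutes in a 365-day year. The sketch below reproduces it; the function names are illustrative and not part of the original deck.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def availability_for_nines(nines: int) -> float:
    """Availability percentage for a whole number of nines, e.g. 3 -> 99.9."""
    return 100.0 * (1.0 - 10.0 ** -nines)


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Average unavailable minutes per year for a given availability percentage."""
    unavailability = 1.0 - availability_percent / 100.0
    return unavailability * MINUTES_PER_YEAR


# Print one line per row of the table, from 6 nines down to 2 nines.
for nines in range(6, 1, -1):
    pct = availability_for_nines(nines)
    minutes = downtime_minutes_per_year(pct)
    print(f"{nines} nines: {pct:.4f}% -> {minutes:.4f} min/yr, {minutes / 60:.5f} h/yr")
```

For two nines (99.0%), this gives 5,256 minutes, or 87.6 hours, per year.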


VoIP/UC is Major Change from TDM

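[Diagram: what a TDM system depended on compared with what a VoIP/UC system depends on, labelled "10 Years".
TDM: Power (UPS/Generator), Communications System, TDM Trunks.
VoIP/UC: Power (Data Center UPS/Generator, Wiring Closet UPS/Generator), Core Data Network, Data Center Distribution, Wiring Closet (WC) Distribution, IP Access (WAN, ISP, MPLS, etc), Servers/VMs, Communications System Core, Devices, SBCs and Gateways, SIP Trunks, TDM Trunks.]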


Achieving Availability

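[Chart: availability of the Data Network (99.999, 99.99, 99.9, 99) set against the Communications Application and Equipment (99, 99.9, 99.99), with combined values 99.98, 99.8 and 98. Downtime labels: 5 Mins, 53 Mins, 9 Hours, 92 Hrs for the individual levels; 2 Hrs, 20 Hrs, 200 Hrs for the combined values. Highlighted range: 99.9989 – 99.985, 4-8 Hours Per Year.]

• Availability is the sum of the parts
• Multiple items can cause a failure
• Redundancy avoids the impact of a failure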
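When every part must work for service to be delivered, the availabilities multiply. For small values this is nearly the same as adding the unavailabilities. The sketch below assumes independent failures, which the deck does not state, and uses illustrative names.

```python
from math import prod

HOURS_PER_YEAR = 365 * 24  # 8,760


def series_availability(availabilities):
    """Availability (as a fraction) of a chain in which every part must be up.

    Assumes the parts fail independently of each other.
    """
    return prod(availabilities)


def downtime_hours_per_year(availability):
    """Average unavailable hours per year for an availability fraction."""
    return (1.0 - availability) * HOURS_PER_YEAR


data_network = 0.9999   # four nines
comms_system = 0.9999   # four nines
combined = series_availability([data_network, comms_system])
print(f"combined: {combined:.4%}, {downtime_hours_per_year(combined):.2f} h/yr")
```

Two four-nines parts in series come out at roughly 99.98%, or about 1.75 hours per year, in line with the 99.98 and 2 Hrs labels on the chart.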


Achieving Availability

[Chart: the Data Network versus Communications Application and Equipment availability chart, with highlighted range 99.99 – 99.85, 1-4 Hours Per Year.]

• Even if the VoIP core system exceeds 5 nines, the data network will define the availability


Availability is Built In

[Diagram: Availability, built from System Design, Reliability & Redundancy, and Operations]


Impact of Redundancy

[Diagram: three redundant Elements: Element 1, Element 2, Element 3]

Example of the impact of redundancy, shown with 75% Availability and 25% Unavailability per Element for illustration.

25% probability that Element 2 fails while Element 1 has failed

25% probability that Element 3 fails while Element 1 and 2 have failed

The resulting Unavailability is a failure probability of 25% x 25% x 25%, or 1.56%. The resulting Availability is 98.44%, which is still short of 2 Nines.
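The redundancy arithmetic on this slide can be checked with a short sketch. It assumes the Elements fail independently and that service survives as long as any one of them is up; the function name is illustrative.

```python
from math import prod


def redundant_availability(availabilities):
    """Availability of a redundant group that is up while at least one member is up.

    The group is down only when every member is down at the same time,
    so the unavailabilities multiply (independent failures assumed).
    """
    unavailability = prod(1.0 - a for a in availabilities)
    return 1.0 - unavailability


for count in (1, 2, 3):
    result = redundant_availability([0.75] * count)
    print(f"{count} Element(s) at 75%: {result:.2%}")
```

With three Elements at 75% each, the group is unavailable 0.25 x 0.25 x 0.25 = 1.5625% of the time, so its availability is 98.4375%.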


Example of Process

• Mid-sized System with 4,000 users
– 3,600 in remote sites
– 300 Contact Center Agents
• Dual Data Centers
– VM Clusters
– UPS, no generators
• Comms System
– Redundant Comm Mgrs in DC1, backup (BU) in DC2
– Redundant Session Manager in DC1 and DC2
– Gateways in both DCs
– Trunking is SIP and TDM BU
• Data Network has dual paths between sites (MPLS and fiber)
– Remote sites on MPLS and cable BU
– Single Ethernet switches and remote site routers

[Chart: Availability by System Area (Power, Core Comms, Data Network, Voice Trunking, Overall); series: Average System Failure, Contact Center]

Calculating Availability in a Complex System

Step 1 – Define Elements of the System

Define all of the Elements that are required to deliver services

Define all of the redundancy Elements that are built into the architecture

Define potential inter-Element interactions (for example, power fails in DC1 and a server fails in DC2)

• Power
• Data Center
– Servers/VMs
– Network
• Communications System
– Core
– Branches
– Devices
– Trunking
• Data Network
– Core
– Edges
– Building
– WAN/MPLS
– Internet Access


Calculating Availability in a Complex System

Step 2 – Define Element MTBF/MTTR

• Use MTBF data to define the expected failure rate for each Element class
– Vendor data
– Industry data
– Operational reports (good for power)
– Add in a factor for operator error to hardware and software MTBFs (typically in Comms Systems 30% of outages are caused by operator error)
• Use MTTR data to define repair timing
– Maximum and average
– Operational plans and commitments
– Experience
– Industry data
– Operational reports (good for power)
• Combine to generate Availability and Unavailability data for each Element
– Unavailability is Minutes per Year Unavailable: the average number of minutes per year the Element will be out of service


Calculating Availability in a Complex System

Step 3 – Calculating Element Unavailability and Availability

Add in the Operator factor for Elements whose MTBF does not already account for operator error (typically hardware, software, or other Elements such as servers, Ethernet switches, or apps).

Calculated MTBF = Base MTBF x (1 - Operational factor), where the Operational factor is generally 30%

Unavailable minutes per year = (Average MTTR (Hours) x 60) / Calculated MTBF (Years)

Unavailability (%) = (Unavailable minutes per year / Total minutes per year) x 100, where Total minutes per year = 365 x 24 x 60 = 525,600

Calculated Availability = 100% - Unavailability (%)
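The Step 3 arithmetic can be sketched as follows. The function names and the example switch figures (10-year base MTBF, 4-hour average MTTR) are hypothetical and chosen only to show the calculation; the 30% operator factor is the typical value quoted in the deck.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def calculated_mtbf_years(base_mtbf_years, operator_factor=0.30):
    """Shorten the base MTBF to allow for outages caused by operator error."""
    return base_mtbf_years * (1.0 - operator_factor)


def unavailable_minutes_per_year(mtbf_years, mttr_hours):
    """Average minutes per year out of service.

    One failure is expected every mtbf_years, each costing mttr_hours of
    repair time. This approximation holds when MTTR is much shorter than MTBF.
    """
    return mttr_hours * 60.0 / mtbf_years


def unavailability_percent(minutes_per_year):
    """Unavailable minutes per year expressed as a percentage of the year."""
    return 100.0 * minutes_per_year / MINUTES_PER_YEAR


# Hypothetical Ethernet switch: base MTBF 10 years, average MTTR 4 hours.
mtbf = calculated_mtbf_years(10.0)
minutes = unavailable_minutes_per_year(mtbf, 4.0)
unavailability = unavailability_percent(minutes)
availability = 100.0 - unavailability
print(f"MTBF {mtbf:.1f} years, {minutes:.1f} min/yr, availability {availability:.4f}%")
```

For this hypothetical switch the calculated MTBF drops from 10 to 7 years, giving about 34 unavailable minutes per year.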


Calculating Availability in a Complex System

Step 4 – Define Failure “Sequences”

• For each Element area, define the potential failure sequences and calculate the unavailability and availability for each
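One way to put a number on a failure sequence, such as a primary and its backup being down at the same time, is to multiply the per-Element unavailabilities and convert back to minutes per year. This is a sketch under the assumption that the failures are independent; the deck does not give its exact method, and the function name is illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def sequence_unavailable_minutes(element_minutes_per_year):
    """Expected minutes per year during which ALL listed Elements are down at once.

    Each input is one Element's unavailable minutes per year. The inputs are
    converted to fractions of the year, multiplied (independent failures
    assumed), and the result is converted back to minutes.
    """
    joint_fraction = 1.0
    for minutes in element_minutes_per_year:
        joint_fraction *= minutes / MINUTES_PER_YEAR
    return joint_fraction * MINUTES_PER_YEAR


# Hypothetical sequence: a primary and a backup, each down 525.6 min/yr (three nines).
both_down = sequence_unavailable_minutes([525.6, 525.6])
print(f"both down: {both_down:.4f} min/yr")
```

Two three-nines Elements backing each other up are both down for about 0.53 minutes per year, which is six-nines territory for that sequence.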


Calculating Availability in a Complex System

Step 5 – Define Impacts

• For each failure type, define the impacts of the failure based on the redundancy
– System Failure - RED
– Reduced Capacity - ORANGE
– No Impact - GREEN


Calculating Availability in a Complex System

Step 6 – Total Impacts for Area

• Sum up the minutes of impact per year for each type of failure and generate availability for that system area


Calculating Availability in a Complex System

Step 7 – Sum the Impacts for all areas

• Sum up the minutes of impact per year across all system areas and generate the overall system availability
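Steps 5 through 7 amount to tagging each failure type with an impact, totalling the minutes per area, and then totalling across areas. The sketch below makes two assumptions that the deck does not spell out: only System Failure (RED) minutes count toward unavailability, and outages in different areas do not overlap, so their minutes can simply be added. The failure entries and their minutes are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

RED, ORANGE, GREEN = "System Failure", "Reduced Capacity", "No Impact"


@dataclass
class FailureType:
    area: str                 # system area, e.g. "Power"
    name: str                 # description of the failure sequence
    minutes_per_year: float   # expected minutes per year in this state
    impact: str               # RED, ORANGE or GREEN


def outage_minutes_by_area(failures):
    """Total the RED minutes per area; areas with no RED entries total 0."""
    totals = defaultdict(float)
    for failure in failures:
        totals[failure.area] += failure.minutes_per_year if failure.impact == RED else 0.0
    return dict(totals)


def availability_percent(outage_minutes):
    """Availability percentage for a given number of outage minutes per year."""
    return 100.0 * (1.0 - outage_minutes / MINUTES_PER_YEAR)


# Hypothetical failure types, for illustration only.
failures = [
    FailureType("Power", "UPS exhausted in both data centers", 30.0, RED),
    FailureType("Data Network", "MPLS path down, fiber path up", 400.0, GREEN),
    FailureType("Data Network", "Single remote site router fails", 90.0, RED),
    FailureType("Core Comms", "One Session Manager down", 120.0, ORANGE),
]

per_area = outage_minutes_by_area(failures)   # Step 6: total per area
total = sum(per_area.values())                # Step 7: total across areas
for area, minutes in per_area.items():
    print(f"{area}: {minutes:.1f} min/yr, {availability_percent(minutes):.4f}%")
print(f"Overall: {total:.1f} min/yr, {availability_percent(total):.4f}%")
```

Reduced Capacity minutes could be totalled the same way and reported alongside the outage figure if partial service matters to the business case.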


Calculating Availability in a Complex System

Step 8 – Show Impact and Percentages


Calculating Availability in a Complex System

Step 9 – Analyze and Recommend

• Evaluate key failure areas
• Recommend changes
– Architectural
– Structural
– Operational
• Mitigation Steps for Network
– Cellular Redundancy
– Wireless PC redundancy
– Wireless Multiplicity
– Multiple data channels
• Create Survivability Tool


Enables Support Organizations to rapidly understand what to do in a failure situation

Shows what steps to take

Red tag critical operational elements

Steps to take for survivability

Easy to understand

Avoid Cascading Failures


Cascading choices

• Top level location
– Data Center(s)
– Core Network
– Building
• System
– Communications Core
– Video Core
– Network
• Failed Element
– Switch
– Server
– App
– Gateway
– …..


Structured Choices

[Selector labels: DC 1, DC 2, Net Core, Building; Power, Network, Video, VoIP; Server, Comm App, SM App, Gateway, ….]

• Normal Primary
– Normal operational Element
– Current redundancy Element
– Failure Protection
– Protection Actions (to keep running)
– Restoration actions (to get operational)
– User Impact
• First Redundancy
– Current redundancy Element
– Failure Protection
– Protection Actions (to keep running)
– Restoration actions (to get back to the primary)
– User Impact
• Second Redundancy
– Current redundancy Element
– Failure Protection
– Protection Actions (to keep running)
– Restoration actions (to get back to the primary)
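The structured choices could be held as a lookup keyed by location, system and failed Element, returning the entries for each redundancy level. The deck's tool is a spreadsheet; the sketch below is only an illustration of the structure, and every name and playbook entry in it is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RedundancyLevel:
    level: str                 # "Normal Primary", "First Redundancy", ...
    current_element: str       # Element carrying the service at this level
    protection_actions: tuple  # actions to keep running
    restoration_actions: tuple # actions to get back to the primary
    user_impact: str


# Hypothetical playbook entry, keyed by (location, system, failed Element).
PLAYBOOK = {
    ("DC 1", "VoIP", "Gateway"): [
        RedundancyLevel(
            level="Normal Primary",
            current_element="Gateway in DC 1",
            protection_actions=("Confirm calls are routing via the DC 2 gateway",),
            restoration_actions=("Repair or replace the DC 1 gateway",),
            user_impact="None expected while the DC 2 gateway is up",
        ),
        RedundancyLevel(
            level="First Redundancy",
            current_element="Gateway in DC 2",
            protection_actions=("Red tag the DC 2 gateway: no maintenance until DC 1 is restored",),
            restoration_actions=("Return trunk routing to DC 1 once repaired",),
            user_impact="Loss of gateway trunking if DC 2 also fails",
        ),
    ],
}


def lookup(location, system, failed_element):
    """Return the redundancy levels for a failure, or an empty list if unknown."""
    return PLAYBOOK.get((location, system, failed_element), [])


for entry in lookup("DC 1", "VoIP", "Gateway"):
    print(entry.level, "->", entry.user_impact)
```

Keying on the same three choices a support engineer makes in the tool keeps each lookup to a single step during an outage.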


There is a major disconnect between expectations and reality in Communications System Availability

Achieving five nines of availability is hard

Analyzing your customers' networks for availability will lead to design and operational choices

An availability audit will reduce the chance of blame if issues occur


Send me an email and I will send either or both of the spreadsheets

Contact me to partner on a Business Continuity Analysis for any of your clients


Thank You and Questions
