PKE Consulting LLC
Innovate Integrate Transform
Interaction Information Networks
Managing Availability and Failure Avoidance
Phil Edholm
President and Principal PKE Consulting LLC
• Most organizations have some level of requirement to understand and document Business Continuity
– System availability
– Disaster recovery
– Operational risk
• Availability of a VoIP/UC system is a critical part of Business Continuity planning and documentation
• Many organizations EXPECT high availability of their communications solutions
• Availability is often defined as Five Nines
Understanding Nines
Five “Nines” is 99.999% Availability
Or
0.001% Unavailability
Or
about 5.26 minutes of Unavailability per year
What the Nines Mean
Availability (%)   Number of "Nines"   Average Unavailability (Minutes per Year)   Average Unavailability (Hours per Year)
99.9999            6                   0.5256                                      0.00876
99.999             5                   5.256                                       0.0876
99.99              4                   52.56                                       0.876
99.9               3                   525.6                                       8.76
99.0               2                   5,256                                       87.6
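The figures in the table can be reproduced with a few lines of arithmetic. The sketch below is illustrative only; the function names are made up for this example, and it assumes a 365-day year (525,600 minutes).

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, assuming a 365-day year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Average unavailable minutes per year for a given availability percentage."""
    return (100.0 - availability_percent) / 100.0 * MINUTES_PER_YEAR


def nines_to_availability(nines: int) -> float:
    """Availability percentage for a whole number of nines (5 -> 99.999)."""
    return 100.0 * (1.0 - 10.0 ** -nines)


# Print one row per "nines" level, from 2 nines to 6 nines.
for n in range(2, 7):
    pct = nines_to_availability(n)
    minutes = downtime_minutes_per_year(pct)
    print(f"{n} nines: {minutes:10.4f} min/year ({minutes / 60:.5f} hours/year)")
```

Each additional nine divides the allowed downtime by ten, which is why the step from four nines to five is so much harder than the step from two to three.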
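VoIP/UC Is a Major Change from TDM
[Figure, labeled "10 Years": elements a TDM system depends on compared with those a VoIP/UC system depends on.
TDM: Power (UPS/Generator), Communications System, TDM Trunks.
VoIP/UC: Power (Data Center UPS/Generator, Wiring Closet UPS/Generator), Core Data Network, Data Center Distribution, Wiring Closet (WC) Distribution, IP Access (WAN, ISP, MPLS, etc.), Servers/VMs, Communications System Core, Devices, SBCs and Gateways, SIP Trunks, TDM Trunks.]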
Achieving Availability
[Figure: Data Network availability (99.999, 99.99, 99.9, 99) plotted against Communications Application and Equipment availability (99, 99.9, 99.99), with chart values 99.98, 99.8, and 98. Downtime labels: 5 Mins, 53 Mins, 9 Hours, 88 Hrs; 2 Hrs, 20 Hrs, 200 Hrs. Highlighted range: 99.9989 – 99.985, 4-8 Hours Per Year.]
• Availability is the sum of the parts: the unavailability of every Element in the service path adds up
• Multiple items can cause a failure
• Redundancy avoids the impact of a failure
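To make "the sum of the parts" concrete, the sketch below combines elements that must all be working for service to be delivered. It assumes failures are independent, and the function name is hypothetical. For small unavailabilities the result is very close to simply adding the unavailability of each part.

```python
from math import prod


def series_availability(availabilities_percent):
    """Availability (%) of a chain in which every element must be working.

    Assumes independent failures; inputs and output are percentages.
    """
    return 100.0 * prod(a / 100.0 for a in availabilities_percent)


# A data network and a communications system, each at four nines.
combined = series_availability([99.99, 99.99])
print(f"Combined availability: {combined:.4f}%")

# Approximation: add the unavailabilities (0.01% + 0.01% = 0.02%).
approx = 100.0 - (0.01 + 0.01)
print(f"Sum-of-unavailability approximation: {approx:.4f}%")
```

Two four-nines elements in series land at roughly 99.98%, so the combined result is always worse than the weakest part.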
Achieving Availability
[Figure: same chart as the previous slide, with highlighted range 99.99 – 99.85, 1-4 Hours Per Year.]
• Even if the VoIP core system exceeds 5 nines, the data network will define the availability
Availability is Built In
[Figure: Availability shown together with System Design, Reliability & Redundancy, and Operations]
Impact of Redundancy
[Figure: Element 1, Element 2, Element 3]
Example of the impact of redundancy, shown with 75% Availability and 25% Unavailability per Element for illustration.
25% probability that Element 2 fails while Element 1 has failed
25% probability that Element 3 fails while Elements 1 and 2 have failed
The resulting Unavailability is a failure probability of 25% x 25% x 25%, or 1.56%. The resulting Availability is 98.44%, just short of 2 Nines.
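The same arithmetic can be written as a small function. This is a sketch under the assumption that the redundant Elements fail independently and that any one working Element is enough to deliver service; the function name is illustrative.

```python
def redundant_availability(element_availabilities):
    """Availability (as a fraction) when any single element can carry the service.

    The group is unavailable only if every element is down at the same time,
    so the unavailabilities multiply (independent failures assumed).
    """
    unavailability = 1.0
    for availability in element_availabilities:
        unavailability *= (1.0 - availability)
    return 1.0 - unavailability


three_elements = redundant_availability([0.75, 0.75, 0.75])
print(f"Three redundant elements at 75% each: {three_elements:.4%}")
```

Real Elements are far more reliable than 75%, so the gain from each added layer of redundancy is correspondingly larger.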
Example of Process
• Mid-sized System with 4,000 users
– 3,600 in remote sites
– 300 Contact Center Agents
• Dual Data Centers
– VM Clusters
– UPS, no generators
• Comms System
– Redundant Comm Mgrs in DC1, backup in DC2
– Redundant Session Manager in DC1 and DC2
– Gateways in both DCs
– Trunking is SIP with TDM backup
• Data Network has dual paths between sites (MPLS and fiber)
– Remote sites on MPLS with cable backup
– Single Ethernet switches and remote site routers
[Chart: Availability by System Area (Power, Core Comms, Data Network, Voice Trunking, Overall), shown for Average System Failure and for Contact Center]
Calculating Availability in a Complex System
Step 1 - Define Elements of the System
Define all of the Elements that are required to deliver services
Define all of the redundancy Elements that are built into the architecture
Define potential inter-Element interactions (for example, power fails in DC1 and a server fails in DC2)
• Power
• Data Center
– Servers/VMs
– Network
• Communications System
– Core
– Branches
– Devices
– Trunking
• Data Network
– Core
– Edges
– Building
– WAN/MPLS
– Internet Access
Calculating Availability in a Complex System
Step 2 – Define Element MTBF/MTTR
• Use MTBF Data to define the expected failure rate for each Element class
– Vendor data
– Industry data
– Operational reports (good for power)
– Add in a factor for Operator error to hardware and software MTBFs
  • Typically in Comms Systems 30% of outages are caused by operator error
• Use MTTR Data to define repair timing
– Maximum and average
– Operational plans and commitments
– Experience
– Industry data
– Operational reports (good for power)
• Combine to generate Availability and Unavailability data for each Element
– Unavailability is Minutes per Year Unavailable: the average number of minutes per year the Element will be out of service
Calculating Availability in a Complex System
Step 3 – Calculating Element Unavailability and Availability
Add in the Operator factor for Elements whose MTBF does not already account for operator error (typically hardware, software, or other Elements such as servers, Ethernet switches, or apps).
Calculated MTBF = Base MTBF x (1 - Operator factor) (generally 30%)
Unavailable minutes per year = (Average MTTR (Hours) x 60) / Calculated MTBF (Years)
Total minutes per year = 365 x 24 x 60 = 525,600
Unavailability (%) = (Unavailable minutes per year / Total minutes per year) x 100
Calculated Availability = 100% - Unavailability (%)
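The Step 3 formulas can be sketched in code as follows. The function name and the example Element (a switch with a 10-year base MTBF and a 4-hour average MTTR) are hypothetical values chosen for illustration, not vendor data; the 30% operator factor is the typical figure quoted above.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def element_unavailability(base_mtbf_years, avg_mttr_hours, operator_factor=0.30):
    """Return (unavailable minutes/year, unavailability %, availability %).

    operator_factor shortens the base MTBF to account for operator error;
    pass 0.0 when the MTBF data already includes it.
    """
    calculated_mtbf_years = base_mtbf_years * (1.0 - operator_factor)
    # One failure every calculated_mtbf_years, each costing the MTTR in minutes.
    unavailable_minutes = (avg_mttr_hours * 60.0) / calculated_mtbf_years
    unavailability_pct = unavailable_minutes / MINUTES_PER_YEAR * 100.0
    availability_pct = 100.0 - unavailability_pct
    return unavailable_minutes, unavailability_pct, availability_pct


# Hypothetical Element: 10-year base MTBF, 4-hour average repair time.
minutes, unavail_pct, avail_pct = element_unavailability(10.0, 4.0)
print(f"{minutes:.1f} min/year, {unavail_pct:.5f}% unavailable, {avail_pct:.5f}% available")
```

With these inputs the operator factor cuts the effective MTBF from 10 years to 7, raising the expected outage from 24 to about 34 minutes per year.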
Calculating Availability in a Complex System
Step 4 – Define Failure “Sequences”
• For each Element area, define the potential failure sequences and calculate the unavailability and availability of each
Calculating Availability in a Complex System
Step 5 – Define Impacts
• For each failure type, define the impacts of the failure based on the redundancy
– System Failure - RED
– Reduced Capacity - ORANGE
– No Impact - GREEN
Calculating Availability in a Complex System
Step 6 – Total Impacts for Area
• Sum up the minutes of impact per year for each type of failure and generate availability for that system area
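Steps 4 through 6 can be sketched as a small tally. Everything below is illustrative: the failure sequences and their minutes per year are invented, and the choice to count only RED (System Failure) minutes against availability is an assumption of this sketch rather than something the slides state.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

# Hypothetical failure sequences for one system area (minutes are invented).
sequences = [
    {"name": "Primary server fails, backup takes over", "minutes": 40.0, "impact": "GREEN"},
    {"name": "One gateway fails, trunk capacity reduced", "minutes": 25.0, "impact": "ORANGE"},
    {"name": "Primary and backup servers both fail", "minutes": 3.0, "impact": "RED"},
]


def minutes_by_impact(seqs):
    """Total the minutes per year for each impact colour."""
    totals = {"RED": 0.0, "ORANGE": 0.0, "GREEN": 0.0}
    for seq in seqs:
        totals[seq["impact"]] += seq["minutes"]
    return totals


def area_availability(seqs):
    """Availability (%) for the area, counting only RED minutes as outage (assumption)."""
    red_minutes = minutes_by_impact(seqs)["RED"]
    return 100.0 * (1.0 - red_minutes / MINUTES_PER_YEAR)


totals = minutes_by_impact(sequences)
print(totals)
print(f"Area availability: {area_availability(sequences):.5f}%")
```

Keeping ORANGE and GREEN minutes in the tally is still useful: they show how often the system is running on its redundancy, even when users see no outage.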
Calculating Availability in a Complex System
Step 7 – Sum the Impacts for all areas
• Sum up the minutes of impact per year across all system areas and generate the overall availability
Calculating Availability in a Complex System
Step 8 – Show Impact and Percentages
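A minimal sketch of Steps 7 and 8 follows. The area names are taken from the example chart earlier in the deck, but the minutes per year are invented for illustration. Adding the minutes across areas assumes that outages in different areas do not overlap in time.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

# Hypothetical system-failure minutes per year for each area (invented values).
area_minutes = {
    "Power": 20.0,
    "Core Comms": 5.0,
    "Data Network": 60.0,
    "Voice Trunking": 15.0,
}


def overall_availability(minutes_by_area):
    """Overall availability (%) from the summed outage minutes of all areas."""
    total = sum(minutes_by_area.values())
    return 100.0 * (1.0 - total / MINUTES_PER_YEAR)


def contribution_percentages(minutes_by_area):
    """Share of the total outage minutes contributed by each area (%)."""
    total = sum(minutes_by_area.values())
    return {area: 100.0 * m / total for area, m in minutes_by_area.items()}


overall = overall_availability(area_minutes)
shares = contribution_percentages(area_minutes)
print(f"Overall availability: {overall:.4f}%")
for area, share in shares.items():
    print(f"{area}: {area_minutes[area]:.0f} min/year ({share:.0f}% of total)")
```

Showing each area's share of the total makes it obvious where changes will pay off first; in this invented example the data network dominates.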
Calculating Availability in a Complex System
Step 9 – Analyze and Recommend
• Evaluate key failure areas
• Recommend changes
– Architectural
– Structural
– Operational
– Mitigation Steps for Network
  • Cellular Redundancy
  • Wireless PC redundancy
  • Wireless Multiplicity
  • Multiple data channels
• Create Survivability Tool
• Enables Support Organizations to rapidly understand what to do in a failure situation
• Shows what steps to take
– Red tag critical operational elements
– Steps to take for survivability
• Easy to understand
• Avoid Cascading Failures
Cascading choices
• Top level location
– Data Center(s)
– Core Network
– Building
• System
– Communications Core
– Video Core
– Network
• Failed Element
– Switch
– Server
– App
– Gateway
– …..
Structured Choices
[Figure: selection tree. Locations: DC 1, DC 2, Net Core, Building. Systems: Power, Network, Video, VoIP. Elements: Server, Comm App, SM App, Gateway, ….]
• Normal Primary
– Normal operational Element
– Current redundancy Element
– Failure Protection
– Protection Actions (to keep running)
– Restoration actions (to get operational)
– User Impact
• First Redundancy
– Current redundancy Element
– Failure Protection
– Protection Actions (to keep running)
– Restoration actions (to get back to the primary)
– User Impact
• Second Redundancy
– Current redundancy Element
– Failure Protection
– Protection Actions (to keep running)
– Restoration actions (to get back to the primary)
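One possible way to hold these structured choices is a small record per failed Element, keyed by location, system, and Element. This is only a sketch of a data layout, not a description of an actual tool; every class name, field name, and example value below is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class RedundancyState:
    """One operating state: Normal Primary, First Redundancy, or Second Redundancy."""
    name: str
    current_element: str
    failure_protection: str
    protection_actions: list = field(default_factory=list)   # to keep running
    restoration_actions: list = field(default_factory=list)  # to get back to the primary
    user_impact: str = ""


@dataclass
class SurvivabilityEntry:
    """What support staff should do when a given Element fails."""
    location: str
    system: str
    failed_element: str
    states: list = field(default_factory=list)


def lookup(entries, location, system, failed_element):
    """Follow the cascading choices: location, then system, then failed Element."""
    for entry in entries:
        if (entry.location, entry.system, entry.failed_element) == (location, system, failed_element):
            return entry
    return None


# Hypothetical example entry; the actions are placeholders.
entries = [
    SurvivabilityEntry(
        location="DC 1",
        system="VoIP",
        failed_element="Server",
        states=[
            RedundancyState(
                name="First Redundancy",
                current_element="Backup server in DC 2",
                failure_protection="None remaining for this Element",
                protection_actions=["Red tag the backup server"],
                restoration_actions=["Repair the DC 1 server", "Fail back to the primary"],
                user_impact="No impact",
            )
        ],
    )
]

entry = lookup(entries, "DC 1", "VoIP", "Server")
print(entry.states[0].protection_actions)
```

Keeping each state's protection actions separate from its restoration actions mirrors the slide: the first list is what keeps service running now, the second is what returns the system to its primary.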
• There is a major disconnect between expectations and reality in Communications System Availability
• Achieving five nines of availability is hard
• Analyzing your customers' networks for availability will lead to design and operational choices
• An availability audit will reduce the chance of blame if issues occur