Stop Reacting; Start Anticipating Disasters
BEFORE They Occur Using Predictive
Analytics
Richard Cocchiara: IBM Distinguished Engineer; CTO – IBM Business Continuity & Resiliency Services (BCRS);
Managing Partner – IBM Resiliency Consulting Services; Member of the IBM Academy of Technology Leadership Team
CIOs are struggling to address board-level requirements that are vastly different
from what they were responsible for just 5 years ago.
How do you…?
• Increase regulatory compliance without increasing capital expenses
• Block potential incoming threats without inhibiting traffic flow, data availability and uptime
• Prepare for the unexpected outage or disaster
“I need to manage complexity of compliance across my organization and silos -- and be audit-ready all the time.”
“Lack of resources, expertise and tools to cost-effectively manage multi-vendor environments”
“I need to protect against threats – even the ones I’m not prepared for.”
“I need to provide access to and recoverability of data at any time.”
Companies face a growing number of risks to their IT that continuously
stress their ability to deliver service to their customers.
[Chart: risks plotted by frequency of occurrences per year (1,000 down to 1/100,000; frequent to infrequent) against consequences (single occurrence loss) in dollars per occurrence (US$1,000 to US$100,000,000; low to high). Risks are classed as data driven, event driven or business driven.]
Risks shown: Viruses, Worms, Disk failures, System availability failures, Pandemics, Natural disasters, Application outages, Data corruption, Network problems, Building fires, Terrorism/civil unrest, Regulatory compliance, Workplace inaccessibility, Failure to meet industry standards, Regional power failures, Governance, Data growth, Long term preservation, Mergers and acquisitions, New products, Marketing campaigns, Audits
Source: IBM
AC failure, Acid leak, Asbestos, Bomb threat, Bomb blast, Brown out, Burst pipe, Cable cut, Capacity failure, Chemical spill, CO fire, Coffee machine, Condensation, Construction, Coolant leak, Cooling tower leak, Corrupted data, Denial of Service Attack, Diesel generator failure, Earthquake, Electrical short, Epidemic, Evacuation, Explosion, Fire, Flood, Fraud, Frozen pipes, Hacker, Hail storm, Halon discharge, Human error, Humidity, Hurricane, HVAC failure, Hardware failure, Ice storm, Insects, Lightning, Logic bomb, Lost data, Low voltage, Microwave fade, Network failure, Pandemic, PCB contamination, Plane crash, Power grid outage, Power outage, Power spike, Power surge, Programmer error, Raw sewage, Relocation delay, Rodents, Roof cave-in, Sabotage, Sprinkler static, Shotgun blast, Shredded data, Sick building, Smoke damage, Smoke from restaurant, Regulatory Compliance, Snow storm, Software error, Electricity, Strike action, Swimming pool leak, S/W ransom, Terrorism, Theft, Tornado, Train derailment, Transformer fire, UPS failure, Vandalism, Vehicle crash, Virus, Water (various), Wind storm, Volcano or volcano ash
Source: Contingency Planning Research, Inc. and IBM
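One common way to compare risks placed on a frequency-versus-consequence chart like the one above is to multiply the two axes into an expected annual loss. The sketch below is illustrative only: the `Risk` class is a hypothetical helper, and the frequencies and dollar amounts are made-up placements, not values read from the IBM chart.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    name: str
    frequency_per_year: float   # expected occurrences per year
    loss_per_occurrence: float  # US$ lost per single occurrence

    @property
    def expected_annual_loss(self) -> float:
        # Frequency multiplied by single-occurrence consequence.
        return self.frequency_per_year * self.loss_per_occurrence


# Hypothetical placements for illustration; NOT taken from the chart.
risks = [
    Risk("Virus outbreak", 100, 1_000),
    Risk("Disk failure", 10, 10_000),
    Risk("Regional power failure", 0.1, 1_000_000),
    Risk("Natural disaster", 0.01, 100_000_000),
]

# Rank risks from largest to smallest expected annual loss.
ranked = sorted(risks, key=lambda r: r.expected_annual_loss, reverse=True)
for r in ranked:
    print(f"{r.name:<24} US${r.expected_annual_loss:>12,.0f} per year")
```

With these assumed numbers, a frequent low-cost event and a rare high-cost event can carry a similar expected annual loss, which is one reason to look at both ends of the chart.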
As budgets shrink and service level requirements increase, a company’s
business becomes even more vulnerable to IT outages.
The impact of lost data or unplanned downtime can be catastrophic, leading
to lost revenue, reputation, and competitive position.
• Finances
– Lost deals
– Disruption of cash flow
– Lost discounts
– Missed payments
– Drop in stock price
• Loss of reputation
– Company reputation
– Damaged relationships with: a. Customers b. Suppliers c. Partners d. Lenders e. Investors
• Revenue
– Loss of direct revenue
– Loss of future revenues
– Losses due to invoices that cannot be completed
– Losses due to investments not made
• Miscellaneous costs
– Temporary staff needed
– Travel expenses incurred
– Equipment rental costs incurred
• Productivity
– Employees who cannot perform their jobs
– Missed deadlines
• Regulatory
– Inability to meet compliance requirements
At the same time the cost of downtime increases, companies are inundated with
disjointed information.
The amount of information managed by enterprise data centers is expected to increase by at least 50 times over the next decade1
[Chart: 2010 to 2020, 50x growth, 40 zettabytes*]
The average cost per hour of system downtime is increasing as more business operations become automated2
[Chart: average cost of one hour of downtime, $110K in 2010 and $182K in 2012]
*One zettabyte equals 1 trillion gigabytes
1. Source: IDC Digital Universe Study, June 2011
2. Source: Aberdeen Group: “Datacenter Downtime: How Much Does it Really Cost?,” March 2012
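The figures above can be turned into rates with simple arithmetic. The sketch below uses only the numbers quoted on the slide ($110K, $182K, 50x over a decade); the function names are hypothetical, and the smooth compound-growth model is an assumption rather than something the cited studies state.

```python
def percent_increase(old: float, new: float) -> float:
    """Relative change from old to new, as a percentage."""
    return (new - old) / old * 100.0


def compound_annual_growth(multiple: float, years: int) -> float:
    """Constant yearly growth rate that yields `multiple` after `years` years."""
    return multiple ** (1.0 / years) - 1.0


# Hourly downtime cost quoted on the slide: $110K rising to $182K.
downtime_increase = percent_increase(110_000, 182_000)

# Data volume quoted on the slide: 50x over one decade.
data_growth_rate = compound_annual_growth(50, 10)

print(f"Downtime cost per hour rose {downtime_increase:.1f}%")
print(f"50x in 10 years is about {data_growth_rate:.1%} growth per year")
```

Under these assumptions the hourly downtime cost rose by roughly 65 percent, and 50x growth in a decade corresponds to data volumes growing by roughly 48 percent every year.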
The changing risk landscape will require a shift to a new paradigm that
anticipates and integrates daily operations, emergency management &
business continuity
• Handle both emergency and non-emergency events, tests and alerts.
• Organize response teams, enabling fast and clear communications between team members.
• Define and provide standard operating procedures for varying situations, with proper assignments, based on legal requirements or historical experience.
• Track the progress and performance of procedures, including the results of the actions for rehearsals and events.
• Locate and manage resources with the required capabilities and skills to handle the events.
• Enable the continuous improvement of the organization’s services and responses.
[Diagram: Recover, Manage, Prevent]
IBM has seen this shift coming for a while and understands that this is part of the
evolution from disaster recovery, business continuity and business resilience into
the era of Intelligent IT Risk.
• Disaster recovery (IBM BCRS founded in 1989)
– Syndicated hardware; shared recovery model
– IT: reactive; Business: none
– Recovery time: days or weeks
• Business continuity
– Dedicated hardware; dedicated recovery model
– IT: proactive; Business: reactive
– Recovery time: minutes or hours
• Business resiliency
– Cloud computing; virtualized model
– IT: proactive; Business: proactive
– Recovery time: minutes or seconds
• Intelligent IT Risk Management (IBM BCRS in the future)
– Predictive Analytics
– IT: anticipative; Business: anticipative
– Recovery time: seconds or always up
The definition of IT risk is drawn from several synergistic points.
1. Business results are inextricably reliant on IT service, so IT must support critical business processes and key initiatives by being:
– reliable,
– predictable,
– available and
– secure.
2. “IT” is much broader than “infrastructure” (boxes and network) and includes process, people, data and applications, facilities, and business and IT strategies.
3. “IT” is under intense pressure to execute and thus must be:
– flexible and appropriate,
– available and recoverable,
– scalable and ready to perform,
– secure and protected,
– accurate and timely.
• Thus, “IT Risk” is “The business risk associated with the use, ownership, operation, involvement, influence and adoption of IT within an enterprise.”
“The priority now is to connect the top-down and bottom-up views so that our risk management framework will be a truly holistic business resilience strategy.”
– Jean-Pierre Bourbonnais, CIO and Vice President of Information Technologies, Bombardier Aerospace
• Leveraging information to make better decisions
• Anticipating problems to resolve them proactively
• Coordinating resources and processes to operate effectively
Intelligent IT Risk will use predictive risk analytics tied to response capabilities
designed to ensure continuous operations.
Predictive risk analysis integrated within the business’ daily operations helps filter business-critical information, so that the business can anticipate problems and opportunities and make the right response faster.
• Large scale situational awareness
• Mitigate risk across wider risk spectrum
• Respond to risk and opportunities
• Monitor multiple, diverse inputs
• Manage key risk indicators
• Prepare earlier to cut response time
• Intelligent response
• Integrated with the fabric of the business
• Continual involvement versus one-time training
IBM has created a framework for identifying the risks associated with the
use of IT that takes a broad and integrated view starting with an
understanding of the core business requirements.
IT risk management requires the analysis of a broadly linked IT Risk
Spectrum that goes beyond the traditional view of business continuity.
Availability & Recoverability
keep systems running and, if necessary, recover from interruptions in line with business expectations.
Security & Data Protection
provide the appropriate access controls while protecting the business’ information and resources
Agility & Appropriateness
respond in a timely manner with the correct new or modified IT Service in support of changes in business requirements
Scalability & Performance
maintain acceptable performance based on business needs and appropriately accommodate changes in business service volume
Accuracy & Timeliness
provide accurate data, to the right people, at the right time to make informed business decisions.
IT Risk Spectrum™
The IBM Risk Spectrum is applied against the company’s business resilience
delivery framework and can be “decomposed” for both dependency and
parallel analysis.
People
Human resources with assigned responsibilities within the company and the processes to maintain
Infrastructure
Components under company control that enable operations
“Exo-Structure”
Ecosystem components outside company control (power, water, food, roads, communications and governance)
Suppliers
Businesses and government agencies that provide the critical materials, services and information
Process
How the company conducts its core business through business process modeling and IT governance
Technology
Equipment and tools that support the company’s business processes
Our methodology helps a company understand its strategic business goals and
risks to create a real-time IT risk management system.
1. ASCERTAIN and align strategic business goals with value of IT services: “Exploit IT Services to Support Organizational Goals”
   1. Identify Strategic Initiatives against which to manage and exploit IT capabilities
   2. Map strategic initiatives to Organization and IT support processes and services with measurable indicators and estimated impact to initiatives
   3. Categorize IT performance metrics against the IT Risk Spectrum.
2. ASSESS IT risks and capabilities: “Improve response to IT Risk”
   1. Quantify IT risk to organization as the gap between required vs. actual business performance metrics
   2. Conduct an IT service “all capabilities” analysis to identify measurable IT risk and performance metrics
   3. Define and prioritize the appropriate IT service risk treatment and roadmap
3. ACT to create an ongoing IT risk management governance system: “Create a Risk Aware Organization”
   1. Define or integrate IT Risk management principles into an ongoing IT risk management program
   2. Recommend organization structure, roles & responsibilities, and policies to help you continuously monitor and respond to changes in IT risk
   3. Define communication and awareness programs
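The ASSESS step describes IT risk as the gap between required and actual performance metrics. A minimal sketch of that idea follows. The `Metric` class, the metric names, the sample values and the relative-gap ranking are all assumptions made for illustration and are not part of the IBM methodology as presented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    name: str
    required: float                 # level the business requires (assumed non-zero)
    actual: float                   # level IT currently delivers
    higher_is_better: bool = True   # False for metrics such as recovery time

    def gap(self) -> float:
        """Shortfall against the requirement; 0.0 when the requirement is met."""
        if self.higher_is_better:
            shortfall = self.required - self.actual
        else:
            shortfall = self.actual - self.required
        return max(0.0, shortfall)

    def relative_gap(self) -> float:
        """Gap as a fraction of the requirement, so unlike units can be compared."""
        return self.gap() / self.required


# Hypothetical sample metrics; values are invented for illustration.
metrics = [
    Metric("availability_pct", required=99.9, actual=99.5),
    Metric("recovery_time_hours", required=4.0, actual=12.0, higher_is_better=False),
    Metric("transactions_per_sec", required=500.0, actual=650.0),
]

# Largest relative shortfall first, as a starting point for prioritizing treatment.
prioritized = sorted(metrics, key=lambda m: m.relative_gap(), reverse=True)
for m in prioritized:
    print(f"{m.name:<22} gap={m.gap():.2f} relative={m.relative_gap():.3f}")
```

Ranking by relative gap is only one possible prioritization; in practice the estimated business impact of each metric would also weigh in.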
An effective IT Risk strategy includes defining and measuring Key Risk Indicators
(KRI) customized to each company’s unique requirements.
[Matrix: the IT Risk Spectrum categories (Scalability and performance; Agility and appropriateness; Security and data protection; Availability and recoverability; Accuracy and timeliness) crossed with People, Processes, Technology, Infrastructure, Suppliers and Exo-structure. IT KRIs are defined at each intersection.]
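The matrix of KRIs defined at each intersection can be represented as a simple lookup keyed by (risk category, framework layer). The sketch below is one possible data structure; the `define_kri` helper and the two example indicators are hypothetical and not taken from the presentation.

```python
from itertools import product

# Row and column labels as they appear in the matrix.
SPECTRUM = [
    "Scalability and performance",
    "Agility and appropriateness",
    "Security and data protection",
    "Availability and recoverability",
    "Accuracy and timeliness",
]
LAYERS = ["People", "Processes", "Technology", "Infrastructure", "Suppliers", "Exo-structure"]

# One (initially empty) list of KRIs per intersection: 5 x 6 = 30 cells.
kri_matrix = {cell: [] for cell in product(SPECTRUM, LAYERS)}


def define_kri(spectrum: str, layer: str, indicator: str) -> None:
    """Attach an indicator to one intersection; unknown labels raise KeyError."""
    kri_matrix[(spectrum, layer)].append(indicator)


# Hypothetical indicators, for illustration only.
define_kri("Availability and recoverability", "Technology",
           "Unplanned downtime minutes per month")
define_kri("Security and data protection", "People",
           "Share of staff with overdue security training")

# Intersections that still have no KRI defined.
uncovered = [cell for cell, kris in kri_matrix.items() if not kris]
print(f"{len(kri_matrix)} intersections, {len(uncovered)} without a KRI")
```

Listing the uncovered intersections gives a quick view of where indicators still need to be tailored to the company's requirements.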
But KRIs are only useful if you can combine real-time monitoring and predictive
analytics with robust response capabilities.
[Architecture diagram. Section labels: Analytics; Response Capabilities. Component labels: Expert System; Predictive Systems; Modeling & Simulation; Archives; Portal Access; Incident Management; Directives; KPIs; Alerts; Event Rules; Workflows; Standards Based Interfaces; Domain Specific Interfaces; Gateway; Security; Monitoring; Reporting; Rapid Recovery; Resources; Semantic Models; Service Bus]
Gateways: IT Operations Monitoring; Compliance Monitoring; Event Monitoring
Data Integration Feeds for:
• Weather
• Geological
• Traffic
• Employees
• Health
• Financial
What, When, Where, Why and How
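To make the combination of monitoring and prediction concrete, the sketch below projects when a monitored KRI will cross its threshold, using an ordinary least-squares trend line. This is a deliberately simple stand-in for the predictive systems named in the diagram, not a description of them; the function names, the disk-usage readings and the 90 percent threshold are all assumptions.

```python
from typing import Optional, Sequence, Tuple


def linear_fit(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares line through the points; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def hours_until_breach(hours: Sequence[float], readings: Sequence[float],
                       threshold: float) -> Optional[float]:
    """Projected hours after the last reading until the trend reaches the threshold.

    Returns None when the trend is flat or falling (no breach projected).
    """
    slope, intercept = linear_fit(hours, readings)
    if slope <= 0:
        return None
    crossing = (threshold - intercept) / slope
    return max(0.0, crossing - hours[-1])


# Hypothetical hourly readings of a KRI (disk usage in percent).
hours = [0, 1, 2, 3, 4]
disk_used_pct = [70.0, 72.0, 74.0, 76.0, 78.0]

eta = hours_until_breach(hours, disk_used_pct, threshold=90.0)
print(f"Projected breach of the 90% threshold in {eta:.1f} hours")
```

A projected time to breach, rather than an alert after the breach, is what gives response capabilities the lead time to act.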
Companies must have access to flexible and dynamic information readily
available that can be used to assess the current situation and take appropriate
corrective action.
[Command center screens. Labels: Active Workflows; Command Center; Communication Management; Integrated System Monitoring; Role based views; Data drill down; 3d Modeling; Video Analytics; Event correlation detection; Click to Action; Social interaction; Executive Dashboard; Progress Reporting]
The use of predictive analytics and a robust command center allows for
improved efficiency, management, collaboration and response to events.
• Operational Efficiency
– GOAL: Ensure that the managed environments maintain operational efficiencies.
• Incident / Event Management
– GOAL: Effectively manage events and return to a steady state.
• Collaboration
– GOAL: Get the right information to the right people at the right time for rapid problem resolution
• Intelligent Response
– GOAL: Anticipate to provide real-time response using best practice SOPs, workflows, and resources
[Diagram labels: Global Operations Work Area; Mega Centers; Intelligent Operations; Plans; Workflows; Business Rules; Available Resources]
[Diagram labels: Daily Operations; Predictive analytics; Incident Identification / Warning; Emergency and Crisis Management; Business Continuity]
Rapidly respond to emergencies
Example scenario:
Heavy rains are predicted to cause large scale flooding in the city where the business’ main processing center is located. The center monitors sources that predict the magnitude of the storm and possible outcomes. This allows the center to start the SOPs that are needed for extreme weather preparation.
As the weather incident continues to affect the city, additional SOPs can be activated to send people home, begin critical backup, move operations, or mobilize additional resources. As these predetermined SOPs execute, constant situational awareness events from the center can be used to ensure the most appropriate response is delivered.
1. Predicted extreme weather
2. Situational awareness engines monitor weather feeds in the center
3. Rules engines will start automated responses via standard procedures (SOPs)
4. The center will manage the most appropriate response based on the situational awareness information and the incident at hand
Standard Operating Procedures (SOP)
• Extreme Weather Event Preparation
• Flash Flood Preparation
• Flash Flood
• Evacuation
Data feeds shown: news feeds, traffic flow, weather prediction (Deep Thunder)
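The scenario's rules engine, which starts SOPs as monitored feeds change, can be sketched in a few lines. The SOP names below come from the slide; everything else (the `Rule` class, the event fields `forecast_rain_mm` and `river_level_m`, and every threshold) is a hypothetical assumption for illustration, not how any IBM product is configured.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set

Event = Dict[str, float]  # one situational-awareness reading from the feeds


@dataclass(frozen=True)
class Rule:
    sop: str                             # SOP to start when the condition holds
    condition: Callable[[Event], bool]


# Thresholds are invented for illustration only.
RULES = [
    Rule("Extreme Weather Event Preparation", lambda e: e.get("forecast_rain_mm", 0) >= 50),
    Rule("Flash Flood Preparation", lambda e: e.get("forecast_rain_mm", 0) >= 100),
    Rule("Flash Flood", lambda e: e.get("river_level_m", 0) >= 3.0),
    Rule("Evacuation", lambda e: e.get("river_level_m", 0) >= 4.0),
]


def sops_to_start(event: Event, active: Set[str]) -> List[str]:
    """SOPs whose rule fires on this event and that are not already running."""
    return [r.sop for r in RULES if r.sop not in active and r.condition(event)]


# Simulated feed: the forecast worsens, then the river rises.
feed = [
    {"forecast_rain_mm": 60},
    {"forecast_rain_mm": 120, "river_level_m": 2.0},
    {"forecast_rain_mm": 120, "river_level_m": 4.2},
]

active: Set[str] = set()
log: List[List[str]] = []
for event in feed:
    started = sops_to_start(event, active)
    active.update(started)
    log.append(started)
    print(event, "->", started)
```

Tracking the set of active SOPs keeps a procedure from being restarted on every new reading, while later readings can still escalate the response.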
Combining predictive analytics and business continuity capabilities into an
intelligent command center provides near and long term cost efficiencies.
The right business resilience strategy can help you:
• Mitigate risk
– Avoid the costs of downtime, brand damage and market share lost to competitors, and reduce the financial impact from business disruptions
• Protect brand and revenue
– Properly assessing the dynamic threats to your IT infrastructure, their potential business impact and your tolerance for risk can help you plan a realistic strategy
• Protect capital
– Analyzing cost tradeoffs can help you avoid unnecessary investments
• Reduce costs
– Creating proactive SOPs with tested response capabilities can help protect you from costs associated with failed recovery and lost data
• Improve service
– You can better align a resilient infrastructure to the needs of your business to maintain service level agreements based on your tolerance for risk
Thank you
ibm.com/services/continuity
Richard Cocchiara
IBM Distinguished Engineer
[email protected]
Copyright information
© Copyright IBM Corporation 2014 IBM Global Services Route 100 Somers, NY 10589 U.S.A. Produced in the United States of America February, 2014
All Rights Reserved
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of
IBM trademarks is available on the Web at “Copyright and trademark information”.
Intel, Intel logo, Intel Inside, Intel Inside logo, Intel Centrino, Intel Centrino logo, Celeron, Intel Xeon, Intel SpeedStep, Itanium, and Pentium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.
Other company, product and service names may be trademarks or service marks of others.