Lean MSP Operations Needs Lean Machine Learning


A White Paper About Real-time War Room Situation Management

For information about Moogsoft and ServiceNow, visit www.moogsoft.com.


This paper considers three types of Managed Service Provider (MSP):

1. Cloud-enabled managed hosting service providers with roots in the traditional managed hosting market within managed services. Examples: Rackspace, HCL.

2. Managed Service Providers that are part of a Telco Service Provider. Examples: AT&T, Verizon.

3. Traditional resellers and system integrators of technology vendor products, with significant experience in break/fix and remote management. Example: CSC.

Executive Summary

All MSPs today must maintain their traditional hosting revenue, grow cloud business, and contain OpEx. But MSPs also need to invest in IT operations, not cut back, to sustain faster service delivery without sacrificing service quality.

Where exactly do you cut back and invest at the same time?

The answer is to think strategically and go after the most significant inefficiency that is holding back your business.

As described in Figure 1, every MSP operations owner must address the same challenge: how to reduce the workload and time spent handling an explosive volume of events and alerts, while improving service quality?

FIGURE 1: BYOD, Cloud and DevOps amplify MSP operational inefficiency

This process chronically requires way too many manual steps that can take hours or days to complete. The process starts with tons of noise, and ends with fewer incidents/situations… but in the middle is a big gap with no automation, or situational awareness among the war room staff.

This white paper looks at a strategic solution that can automate the “noise to situation” workflow: Extend existing monitoring and event management systems with Real-time Situation Management software. No need to rip and replace – although you can.

“MSP success is highly dependent upon increased margins by virtue of reduced operating expenditures. An IT specialist full time equivalent (FTE) costs $150k/year, on average, while an IT generalist costs $75k/year, on average.”


Why Is Incident Detection and Remediation So Reactive and Lengthy?

Very simply put, most NOCs today still rely on a “noise in, noise out” event and incident management approach to detect and remediate outages (Figure 2).

FIGURE 2: Noise in, noise out alert management

In modern IT environments, outages are more often the result of simultaneous, cascading, and transient events and faults across multiple technology domains – exacerbated by virtualization, mobility and cloud. True culprits of outages are often buried deep among millions of events and thousands of alarms – generated daily and without context.

Yet, the increasing pace of IT complexity and change instantly leaves any infrastructure-to-services mapping inaccurate, most notably the Configuration Management Database (CMDB).

This renders ineffective an event management system that depends on static rules, requiring a 100% accurate topology model.

When relying on these outdated models and rules to triangulate outages that are full of noise, you get noise in, noise out.

“Our enterprise command center handled 9 million events in the past 30 days.”

– Kalyan Kumar, Senior Vice President and Chief Technologist for IT Operations at HCL Technologies

Note: HCL is a global MSP brand, recognized as a Gartner Magic Quadrant Leader in Data Center Outsourcing and Infrastructure Utility Services, along with CSC, IBM and others.


Figure 3 depicts the resulting workflow, spanning from event collection and processing, to incident management and problem remediation, and ultimately to service restoration and root cause analysis (RCA) - all while the Service Desk team and customers wait.

FIGURE 3: Your Current Workflow

The sheer volume of events often obscures the problem source. Therefore, IT ops and traditional event management systems process only priority 1 alerts based on SLAs. Or they use aggressive filtering to make event volume manageable. But this often hides important events, including severity 2+ events that contain early warnings.

NOC generalists escalate still voluminous alerts to experts operating in different silos, without context. Multiple experts are often troubleshooting separately, but not collaborating to solve the same problem.

There is no way of seeing how alerts are related. This leads to multiple tickets raised off multiple critical alerts. Multiple tickets all point to the same problem.

After an outage has occurred, tickets are often merged into a master ticket, a manual time-consuming process, and a poor use of any domain expert’s time.

Once an incident is being worked on by operations and domain experts are called in, the Service Desk lacks visibility into what’s going on. Finally, after an incident has been resolved, there is no easy and automatic way to update a knowledge article and make it available for correlation with future incidents.

“74% of end user problems are not detected by IT.”

– Forrester Research

“80% of the mean time to resolve is wasted on trying to locate the issue.”


Pros and Cons of Various Approaches

The industry is aware of the “lack of context” problem, and there have been various attempts to solve it.

Traditional event management tools do offer some automatic filtering and correlation capabilities. This might work if your environments are static, with an always up-to-date Configuration Management Database (CMDB) and topology models. See Table 1.

TABLE 1: Pros and Cons of Traditional Event Management Tools

Traditional Event Management Tools

Examples: IBM Tivoli Netcool; BMC TrueSight; CA Spectrum; EMC Smarts

Pros: Cover all technology domains; Proven to work in static and stable application architectures

Cons: No longer effective in handling complex and dynamic IT; Require extensive rules and models; Filter out important early warnings; Allow too much noise; Proprietary programming

Another approach is to “unify or modernize monitoring”. This is an important step for collecting key metrics and graphing them in real time to identify spikes in resource utilization. It can be a substitute for finding trouble spots if your NOC operator knows exactly what to look for.

But the operator still has to manually define the metrics and thresholds, and create the right charts and dashboards to spot unusual spikes. It is virtually impossible to troubleshoot in real time, as events and alerts are still coming in. See Table 2.

TABLE 2: Pros and Cons of Monitoring Tools

Traditional Monitoring

Examples: IBM, HP, CA, BMC

Pros: You need it to collect metrics, events and logs; Cover all technology domains; Dashboard and reports; All-in-one; Proven to work in static and stable application architectures

Cons: Fallen behind modern monitoring; Monolithic, not best-of-breed; Expensive

Modern Monitoring (Composable Monitoring)

Examples: AppDynamics, Zenoss, SolarWinds, Nagios

Pros: You need it to collect metrics, events and logs; Cover all technology domains; Real-time metrics and charts; Based on many open source technologies; Faster, simpler and cheaper than traditional monitoring

Cons: Lack of real-time, cross-domain correlation and contextualization


Log analyzers are a more recent technology that provides the important “Google for Logs” approach. They are great for forensic analysis, as no one can predict which logs need to be kept or thrown away at any moment. So it’s better to keep most of them and meet audit requirements.

But there are two issues here. (a) You need large log files to arrive at a powerful server cluster first, which often takes hours; and (b) what “search phrases” should your NOC staff use? How do they know which technology silo to start digging into? What if there are cascading issues across multiple domains? And what are their causal and collateral relationships? How did causes and impacts look over the timelines?

TABLE 3: Pros and Cons of Log Analyzers

Log Analyzers

Examples: Splunk, Sumo Logic, Elasticsearch, Loggly

Pros: Cover all technology domains; Great if your NOC operators know what search phrase and which domain silo to dig into

Cons: Limited real-time, cross-domain analytics; Mainly historic search; Delay in getting logs; High licensing cost

Some newer tools claim to support the correlation between a change management event and a potential service disruption. But again, what if your topology model and CMDB are not up-to-date? The correlation will fail, because an inaccurate model linking “changes” to “services” produces “noise data in, noise analysis out”.


What An Ideal Solution Looks Like

If you had it your way, an ideal solution would include all these capabilities:

See not only events across your entire stack (apps, private and public clouds, and deep into all infrastructure domains), but also the dynamically formed correlation among them - no need to wait for a perfectly up-to-date Configuration Management Database (CMDB) or any topology model.

See real, incident-triggering warnings earlier – much earlier: Like seeing how and where a hurricane will land in 24-48 hours. Your L1 staff should be push-notified with severity 2-4 warnings well before they trigger severity 1 incidents.

Triage faster – much faster: In minutes, your L1 staff should already know where and how multiple issues are cascading - there is no time to stitch together a dizzying array of charts and log analysis. They should have already notified the right L2-L4 experts – not spammed every expert.

Work on a few “situations”, not 1000s of “alerts”: Your L2-L4 experts should immediately see fewer, higher level incidents – related alerts already grouped together, with a narrative of causes and impacts. “Situations”, not individual alerts, can drastically cut Mean-Time-to-Restoration.

Know instantly which past remediation can restore services: Have a machine-calculated situation score reflecting its similarity with past situations. Give your teams automated access to past situations with successful RCA & remediation.

Close the loop automatically: Automate remediation and knowledge recycling for future situations.
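As a rough illustration of the "situation score" idea in the list above, the sketch below ranks past situations by how closely their alert wording matches a current situation. The data layout, the function names, and the choice of cosine similarity over word counts are all assumptions made for illustration; this paper does not describe how such a score is actually calculated.

```python
import math
from collections import Counter


def vectorize(alert_messages):
    """Count how often each word appears across a situation's alerts."""
    counts = Counter()
    for message in alert_messages:
        counts.update(message.lower().split())
    return counts


def cosine(a, b):
    """Cosine similarity between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_past_situations(current_alerts, past_situations):
    """Return (score, remediation) pairs, most similar past situation first.

    past_situations: list of dicts with hypothetical keys "alerts" and
    "remediation".
    """
    current = vectorize(current_alerts)
    scored = [
        (cosine(current, vectorize(past["alerts"])), past["remediation"])
        for past in past_situations
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical history of resolved situations.
past = [
    {"alerts": ["san01 latency high", "db01 slow queries"],
     "remediation": "fail over san01 controller"},
    {"alerts": ["vpn tunnel down"],
     "remediation": "restart vpn gateway"},
]
ranked = rank_past_situations(["san01 latency high", "db02 slow queries"], past)
best_score, best_remediation = ranked[0]
print(f"{best_score:.2f} -> {best_remediation}")
```

A real system would likely weigh more than wording (timing, affected services, topology hints), but even this simple score shows how a past remediation can be surfaced automatically rather than searched for by hand.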

Essentially, you need a solution that can transform your “Serial Alert Workflow” to “Parallel, Real-time Situation Management”, described next.

“Situational-aware machine learning and socialized workflows is the future of service assurance. We need these essential innovations to support more customers and ensure our service quality, while keeping operational cost low and efficiency high.”

– Kalyan Kumar, Senior Vice President and Chief Technologist for IT Operations at HCL Technologies.


“Real-time Situation Management” Workflow Detects Incidents 24 Hours Earlier

You need to capture the entire narrative of an incident and present it as a higher-level situation. Because this data-driven approach analyzes a situation from multiple angles, IT teams have fewer, more actionable alerts to deal with, and hence can handle many more incidents.

A - Clean: Remove noise, de-duplicate, and blacklist events, despite a partial and inaccurate CMDB underneath. Real-time machine learning and natural language processing algorithms replace hard-coded rules and models. They examine loosely defined, text-rich events across domain silos, from applications and clouds to infrastructure. This is well suited to dynamic IT environments with software infrastructures, virtualization and cloud, and migration toward continuous application delivery (i.e. DevOps).

B - Contextualize: Eliminate troubleshooting triage by showing the resulting alerts in context - clustering related alerts into situations, which are then decorated with service-specific details. This is a data-driven approach: change- and error-tolerant algorithms automatically identify clusters of related alerts, rather than always assuming a singular root cause.

C - Collaborate: Use social collaboration technology to orchestrate push notifications to relevant domain experts, getting them together in a virtual war room – known as a “Situation Room”. Here, the experts can log communications, query other tools, and capture the remediation process, automating knowledge recycling and keeping in sync with the Service Desk team.
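To make steps A and B above more concrete, here is a minimal sketch in Python. It is not the algorithm of any product: it folds repeated events into alerts within a time window, drops blacklisted noise, and then greedily groups alerts that are close in time and share wording (Jaccard word overlap). The `Alert` fields, the function names, the window sizes, and the similarity threshold are all hypothetical choices for illustration. Note that nothing in it consults a CMDB or topology model.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    source: str
    message: str
    first_seen: int  # epoch seconds
    last_seen: int
    count: int = 1


def clean(events, window_seconds=300, blacklist=()):
    """Step A: drop blacklisted noise and fold repeats into one alert.

    events: (timestamp, source, message) tuples sorted by timestamp.
    """
    latest = {}  # (source, message) -> most recent Alert for that key
    alerts = []
    for ts, source, message in events:
        if any(term in message for term in blacklist):
            continue  # known noise, never becomes an alert
        alert = latest.get((source, message))
        if alert is not None and ts - alert.last_seen <= window_seconds:
            alert.last_seen = ts
            alert.count += 1
        else:
            alert = Alert(source, message, ts, ts)
            latest[(source, message)] = alert
            alerts.append(alert)
    return alerts


def jaccard(a, b):
    """Word-overlap similarity between two messages, from 0.0 to 1.0."""
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def contextualize(alerts, window_seconds=600, min_similarity=0.3):
    """Step B: greedily group alerts that are close in time and share wording."""
    situations = []
    for alert in alerts:
        for situation in situations:
            recent = alert.first_seen - situation[-1].first_seen <= window_seconds
            if recent and any(
                jaccard(alert.message, other.message) >= min_similarity
                for other in situation
            ):
                situation.append(alert)
                break
        else:  # no existing situation matched
            situations.append([alert])
    return situations


# Hypothetical event stream: a storage fault cascading to a database and an app.
events = [
    (0, "san01", "storage array latency high"),
    (20, "san01", "storage array latency high"),
    (30, "db01", "db01 slow queries storage latency high"),
    (40, "san01", "storage array latency high"),
    (45, "app01", "checkout timeout db01 slow queries"),
    (50, "mon01", "heartbeat ok"),
    (55, "prn03", "printer out of paper"),
]
alerts = clean(events, blacklist=("heartbeat",))
situations = contextualize(alerts)
print(len(events), "events ->", len(alerts), "alerts ->", len(situations), "situations")
```

In this sample the application alert shares no words with the storage alert, yet it lands in the same situation because it overlaps with the database alert in between. That chaining is what lets a cascade across domains surface as one situation instead of three tickets.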


A Real-Time Situation Management Solution: Incident.MOOG

Incident.MOOG is Real-time Situation Management software. Using machine-learning and social collaboration technologies, it detects warnings earlier, provides situational awareness to IT operations teams, and enables faster cross-domain remediation.

Despite its breakthrough accuracy and speed in detecting anomalies, Incident.MOOG is remarkably simple to use. When installed on top of existing monitoring tools, or traditional event management software, Incident.MOOG gives your operations team an immediate, 360° context view across all IT domains (e.g. app, DB, server, storage, network, and private and public clouds).

The software uses open standards, such as JavaScript, to specify how events are ingested into the JSON data format, so there is virtually no learning curve. Your NOC staff doesn’t need to learn any proprietary, vendor-supplied programming language.
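The idea of mapping a raw event into a JSON record can be sketched in a few lines. The example below is written in Python purely for illustration; it is not the product's JavaScript ingestion syntax. The raw line format, the field names, and the severity mapping are all assumptions.

```python
import json

# Hypothetical mapping from text labels to numeric severity (1 = most severe).
SEVERITY_LEVELS = {"CRITICAL": 1, "MAJOR": 2, "MINOR": 3, "WARNING": 4}


def normalize(raw_line):
    """Turn 'timestamp host severity free text...' into a dict, or None."""
    parts = raw_line.strip().split(maxsplit=3)
    if len(parts) < 4:
        return None  # malformed line, nothing to ingest
    timestamp, host, severity, description = parts
    return {
        "timestamp": timestamp,
        "source": host,
        "severity": SEVERITY_LEVELS.get(severity.upper(), 5),
        "description": description,
    }


event = normalize("2015-03-01T12:00:05Z db01 CRITICAL disk latency high")
print(json.dumps(event))
```

Once every source is reduced to the same small set of fields, downstream de-duplication and clustering can treat events from any monitoring tool alike.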

After reducing event noise and simplifying alerts into fewer situations, Incident.MOOG sends 99% fewer tickets to your Service Desk software (ServiceNow, BMC Remedy, any others).

The entire Proof-of-Concept (POC) typically takes about 15 days, including installation, data ingestion, virtually automatic tuning, and results presentation. The screenshot on the right shows that a global MSP was able to reduce more than 9 million raw events down to around 1,900 situations, over a 30-day production period.
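To put those figures in perspective, the short calculation below works out the implied ratios. It uses the rounded numbers quoted above (9 million events, 1,900 situations, 30 days), so the results are approximations rather than measured values.

```python
raw_events = 9_000_000   # "more than 9 million", rounded down
situations = 1_900       # "around 1,900"
days = 30

events_per_situation = raw_events / situations
reduction_pct = 100 * (1 - situations / raw_events)
events_per_day = raw_events / days
situations_per_day = situations / days

print(f"{events_per_situation:,.0f} raw events per situation")
print(f"{reduction_pct:.2f}% fewer items to look at")
print(f"{events_per_day:,.0f} events/day vs {situations_per_day:.0f} situations/day")
```

On these numbers, operators would face roughly 63 situations a day instead of about 300,000 raw events.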


TABLE 6: Evaluation Table for ITOA Technologies


The following table helps you understand capabilities and features needed to achieve situation management. Use it to guide your decision process.


Conclusions

Every operations leader at a cloud-aspiring MSP must innovate her monitoring and service assurance workflow, or risk being marginalized by public cloud service providers.

Real-time Situation Management, a type of IT Operations Analytics (ITOA) tool, can detect warnings for incidents earlier, automatically, in real-time, so your team can be more proactive and remediate faster. Make sure your entire environment is restored with 360° visibility with context. Make sure your blended apps, hybrid clouds and heterogeneous infrastructures are all covered.

Make sure your one, resource-limited NOC can scale to support more customers. Transform your Enterprise Command Center today.

To find out more, read this most recent MSP case study -> Link to HCL case study

To try it yourself, visit www.moogsoft.com and contact us (http://moogsoft.com/contact-us).

For more information, visit www.moogsoft.com.

U.S.: 140 Geary Street, Office 1000, San Francisco, CA 94108

U.K.: The Sanctuary, 23 Oakhill Grove, Surbiton KT6 6DU

NY: +1 646 843 0455

Singapore: +65 3158 4393