Ticket Management & Best Practices
Trouble Ticketing System Objectives
Assumption: Network Operations Centers are governed by two principles
1.
Pursuit of excellence in customer service
2.
Operating the most cost effective NOC possible
Common elements of these two principles:
−
Have as few outages as possible, with the shortest MTTR achievable
−
Have systems perform work for you so that you can act quickly & effectively and provide regular updates to customers
In support of these principles, what do we need from a trouble ticketing
system?
−
Efficiency – Easy to use with simplicity & automation
−
Customer service – updates, interaction and self-service
−
Analytical capabilities – Ability to reduce trouble ticket volume through analytics; Repository of
data for reporting on drivers of troubles and high TTR with focus on fault and TTR reduction
CURRENT STATE
An effective trouble ticketing system is the foundation of operational
success
Trouble Ticketing System Requirements
•
Efficient
−
Intuitive / easy to train & use
−
Automation & Auto-population of data- have system perform as much of the work as possible
•
Flexible & Extensible
−
Easily customizable (preferably by in-house personnel)
−
Extensible with add-ins and additional functionality
−
Capable of integration (input/output) with other systems
• Inventory systems for circuit details (single system for NCC technicians)
• Alarm systems for ticket / alarm automation
• APIs for integration with external vendors/customers
•
Extensive data capturing and reporting capability
−
Association of each case with customer circuit ID, account, locations
−
Root cause codes (4 or 5 levels)- down to circuit pack part number for equipment failures
−
Responsible party (customer or provider)
−
MTTR (auto calculated)
−
Type II vendor performance (where involved)
−
Site / City / State / Region / Legacy Network (as applicable)
The trouble ticketing system should be designed to accommodate current
operational needs, and extensible to support future requirements
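As a minimal sketch of the data capturing requirements above, a ticket record could look like the following. The class name, field names, and types are assumptions made for illustration, not the schema of any particular ticketing product; the restore time is derived from timestamps rather than typed in by hand.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class TroubleTicket:
    """Illustrative ticket record; every field name here is an assumption."""
    ticket_id: str
    circuit_id: str                          # associates the case with the customer circuit
    account: str
    site: str
    city: str
    state: str
    region: str
    legacy_network: Optional[str] = None     # only where applicable
    responsible_party: Optional[str] = None  # "customer" or "provider"
    root_cause: tuple = ()                   # 4 or 5 levels, most general first
    type2_vendor: Optional[str] = None       # only where a Type II vendor is involved
    created_at: Optional[datetime] = None
    restored_at: Optional[datetime] = None

    @property
    def time_to_restore(self) -> Optional[timedelta]:
        """Auto-calculated restore time; None while the case is still open."""
        if self.created_at is None or self.restored_at is None:
            return None
        return self.restored_at - self.created_at
```

Keeping the restore time as a derived property means reports always agree with the recorded timestamps.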
Trouble Ticketing System Architecture
Trouble Ticket Lifecycle
1.
Create
2.
Update
3.
Close
4.
Analyze
5.
Act
Capturing, analyzing, and acting on accurate trouble ticket data is critical to
operational improvement
Best-in-class trouble ticket handling requires all 5 phases:
Phases 1-3 (create, update, close): the traditional life cycle of a trouble ticket
Phases 4-5 (analyze, act): the value-add phases of the trouble ticket lifecycle; analyze and act on the data
Trouble Ticket Lifecycle – Phase 1 (Case Creation)
1.
Create Ticket
−
Primary Objectives for Creating Case:
• Associate case to service & validate contacts on file
• Capture all customer information in first conversation (pursue one call resolution)
• Establish case framework with all information needed to successfully work and close ticket
−
Populate required trouble ticket fields
• Use a standardized case subject e.g. “Customer XYZ, DS-3 HIYX/123456//ZYO Down Hard”
• Associate to customer circuit ID (and thereby account)
−
Capture all available information
• Customer description of issue, including circuit status (Is the circuit currently hard down or degraded? Is intrusive testing permitted?)
• Troubleshooting steps customer has taken- Have they checked CPE for power (for hard down issues)?
• Customer information to include trouble ticket number, contact name, phone number and email address(es)
• Access process to customer location (if applicable)
−
Update customer with details on next steps towards resolution
• Provide customer with service provider ticket number
• Automatically send email to customer with ticket details, link to portal (if applicable)
• Suggest additional troubleshooting steps customer might take (bounce ports, restart router, enable lasers, etc.)
• Articulate next steps in process- e.g. dispatching a field technician to site, engaging our Tier II technicians, etc.- and set expectations for next communication to customer
−
Immediately initiate troubleshooting and repairs
• Queues cost customers- the outage may be significantly impacting the customer’s business- act like it
• Immediately route to appropriate Tier II organization or other fix agent
The case creation stage is critical: it ensures correct association to circuit and
account for SLA reporting and customer updates
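The case creation steps above can be sketched as two small helpers: one that builds the standardized case subject and one that reports which required fields are still missing. The function names and the list of required fields are assumptions chosen to mirror the bullets on this slide.

```python
# Required fields are an assumption derived from the creation checklist above.
REQUIRED_FIELDS = (
    "circuit_id",
    "customer",
    "contact_name",
    "contact_phone",
    "contact_email",
    "description",
)


def build_case_subject(customer, service_type, circuit_id, condition):
    """Build a subject in the form 'Customer, Service CircuitID Condition'."""
    return f"{customer}, {service_type} {circuit_id} {condition}"


def missing_fields(case):
    """Return the required field names that are absent or blank in a case dict."""
    return [
        name for name in REQUIRED_FIELDS
        if not str(case.get(name) or "").strip()
    ]
```

For example, `build_case_subject("Customer XYZ", "DS-3", "HIYX/123456//ZYO", "Down Hard")` returns the subject shown in the slide. A creation form could refuse to save a case while `missing_fields` returns a non-empty list.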
Single Customer vs. Network (Multi-customer) Cases
[Figure: a parent Network Case (case comments, case status, closure codes, case closure) cascades to child Customer Cases #1 through #N]
Network Case Handling (multiple customer circuits affected)
-
For issues affecting multiple customers, create a single parent / network case that reflects the
overall event and a child / customer case for each service impacted
-
All information entered into the Parent case cascades down to child cases
-
Proactively create the cases (ideally with system automation) and email customers upon case
opening; Attempt to inform customer of event BEFORE they contact the NOC to report service
issue
-
Enter “public” comments frequently into the parent case to send emails to each affected
customer; over-communicate updates to significantly reduce call volume into NOC (and increase
customer satisfaction over event handling)
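The parent / child handling described above could be modeled roughly as follows. This is a sketch under stated assumptions: the class names are hypothetical, and `outbox` is a plain list standing in for whatever email gateway a real system would use.

```python
from dataclasses import dataclass, field


@dataclass
class CustomerCase:
    """Child case, one per impacted service."""
    circuit_id: str
    contact_email: str
    status: str = "Open"
    comments: list = field(default_factory=list)


@dataclass
class NetworkCase:
    """Parent case reflecting the overall event."""
    summary: str
    status: str = "Open"
    children: list = field(default_factory=list)
    outbox: list = field(default_factory=list)  # stand-in for an email gateway

    def add_child(self, child):
        child.status = self.status  # a new child inherits the parent status
        self.children.append(child)

    def add_comment(self, text, public=False):
        """Cascade a comment to every child; public comments are also emailed."""
        for child in self.children:
            child.comments.append(text)
            if public:
                self.outbox.append((child.contact_email, text))

    def set_status(self, status):
        """Cascade a status change, including closure, to every child."""
        self.status = status
        for child in self.children:
            child.status = status
```

One public comment on the parent produces one queued email per affected customer, which is what allows a single technician to keep every customer updated during a large event.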
Trouble Ticket Lifecycle – Phase 2 (Case Updates)
2.
Update Ticket
−
Primary objectives for working case:
• Resolve issue as quickly as possible (i.e. Work with a great sense of urgency)
• Over-communicate with affected customers until issue fully resolved
• Thoroughly document event and actions taken
−
Tactical Approach
• Enter thorough, detailed case comments- include names, phone numbers, IP addresses, location details, equipment alarm logs, etc. The more detail, the better.
• Document every action taken, every conversation held- “if it isn’t in the ticket, it didn’t happen”
• Case comments should have automatic timestamps for reconstruction of event
• Case status changes should automatically drive MTTR logs
> Case Created (starts MTTR clock)
> Repair in Process
> Technician dispatched
> Technician arrived
> Service restored (stops clock)
• Enter “Public” comments as frequently as possible, never less than once per hour for long duration events
−
Escalate as needed
• Engage higher-level resources (Tier III, Engineering, vendor resources) as required- don’t get stuck
• Update management on critical issues- don’t let management team be caught by surprise
• When required resources are not reachable (e.g. field technicians), escalate up their management chain immediately- “once around and up”
Outages will occur; acting with urgency and providing frequent updates to
customers improves customer satisfaction and reduces attrition
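To illustrate how case status changes can drive the MTTR logs, the sketch below derives restore time from timestamped status entries. The status strings follow the list above; the function names, and the choice to use the earliest "Case Created" and the latest "Service Restored" entry, are assumptions.

```python
from datetime import datetime, timedelta

START_STATUS = "Case Created"      # starts the clock
STOP_STATUS = "Service Restored"   # stops the clock


def time_to_restore(status_log):
    """Restore time for one case from a list of (timestamp, status) tuples.

    Returns None if the case has not been restored yet.
    """
    start = min((ts for ts, s in status_log if s == START_STATUS), default=None)
    stop = max((ts for ts, s in status_log if s == STOP_STATUS), default=None)
    if start is None or stop is None or stop < start:
        return None
    return stop - start


def mean_time_to_restore(logs):
    """Average restore time across cases, ignoring cases that are still open."""
    durations = [d for d in map(time_to_restore, logs) if d is not None]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)
```

A log with "Case Created" at 10:00 and "Service Restored" at 12:30 on the same day yields a restore time of 2 hours 30 minutes; intermediate statuses such as "Technician dispatched" do not affect the result.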
Trouble Ticket Lifecycle – Phase 3 (Case Closure)
3.
Close Ticket
−
Primary Objectives for Closing Case:
• Close-out communications with customer- “wrap it up”
• Capture closure code details for subsequent reporting
• Summarize the event in 2 to 3 sentences for future internal and external consumption
−
Communicate with Customer that case is being closed
• Summarize case details, provide preliminary RFO, let customer know that case is being closed or placed into monitor status
−
Create succinct, descriptive closing summary
• “Customer reported DS-3 down hard, dispatched technician and isolated to failed DSX-3 module, replaced DSX-3 module to restore”
−
Capture closure codes with accurate detail
• Level 1: Zayo owned equipment or fiber
• Level 2: Equipment Failure
• Level 3: Telect
• Level 4: DSX-3 Module
• Specific part number captured by equipment replacement request
−
Review MTTR logs for accuracy, correct if needed
−
Close case or set to monitor status with auto-close (i.e. try not to touch it again)
Accurate case closure codes are required for reporting on drivers of trouble
volumes and high TTR; customer consumable closure summaries reduce
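One way to keep closure codes accurate is to let technicians pick only from a predefined tree. The sketch below uses the four-level example from this slide; the nested-dict representation and the function name are assumptions, and a real tree would hold many more branches.

```python
# Closure code tree holding only the example from the slide; an empty dict marks a leaf.
CLOSURE_TREE = {
    "Zayo owned equipment or fiber": {
        "Equipment Failure": {
            "Telect": {
                "DSX-3 Module": {},
            },
        },
    },
}


def is_valid_closure(path, tree=CLOSURE_TREE):
    """Return True only if path walks the tree from Level 1 down to a leaf."""
    node = tree
    for level in path:
        if level not in node:
            return False
        node = node[level]
    return len(path) > 0 and not node
```

Requiring the path to end at a leaf prevents cases from being closed with only a Level 1 or Level 2 code, which would weaken later reporting.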
Trouble Ticket Lifecycle – Phase 4 (Analyze Ticket Data)
4.
Analyze Ticket Data
−
Primary Objectives for Analyzing Trouble Ticket Data:
• Determine most frequent causes of trouble tickets
• Determine drivers of high TTR
• Identify chronic issues (before the customer does)
−
Analyze trouble by closure codes
• Create pareto charts to determine top drivers of trouble volumes
• Determine fault frequency rate of equipment issues; expect <2.5% failures per annum
• Review cases from different perspectives
> Troubles by vendor, equipment make/model, circuit pack (part number), software load
> Troubles by service type
> Troubles by site
> Troubles by legacy network
−
Identify and review chronic issues
• Identify repeat/recurring troubles on specific circuits
• Repeat events at site (high temp, low temp, power loss, card failures, circuit errors, etc.)- may be indicative of power/grounding/lighting/cabling issues
• Specific routes subject to failure- fiber cuts, power outages, intermittent errors (e.g. PMD identification)
−
Analyze Root Cause of Faults and drivers of high TTR
• Analyze events with high MTTR to determine drivers
> Regional, state, city (sparing, technician locations, tools, training, OSP repair processes and capabilities, local management)
> Equipment type (NOC technician training & capabilities, OSS systems, software issues, vendor support)
> OSP repair processes (cut isolation and repair approach, 3rd party performance, OSP restoration contractors and capabilities)
−
Customer responsible troubles
• Drivers of customer responsible troubles
• Specific circuits with high volumes of customer responsible issues
• Specific customers with high volumes of customer responsible issues
−
Weekly, monthly, quarterly, and annual analyses provide different perspectives
Invest the time required to thoroughly analyze trouble ticket metrics to
determine root cause drivers of outages and high MTTR
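Two of the analyses above reduce to simple arithmetic and can be sketched as follows. The function names are assumptions, and the default repeat threshold of 3 tickets is an illustrative choice rather than a figure from this deck.

```python
from collections import Counter


def annual_fault_rate(failures_in_year, installed_units):
    """Fraction of installed units that failed during the year."""
    if installed_units <= 0:
        raise ValueError("installed_units must be positive")
    return failures_in_year / installed_units


def chronic_circuits(circuit_ids, min_repeats=3):
    """Circuits with at least min_repeats tickets in the period under review.

    circuit_ids holds one entry per ticket.
    """
    counts = Counter(circuit_ids)
    return sorted(cid for cid, n in counts.items() if n >= min_repeats)
```

A rate from `annual_fault_rate` can be compared directly against the 2.5% per annum expectation, and running `chronic_circuits` on a weekly or monthly ticket export helps surface repeat troubles before the customer raises them.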
Pareto Chart Analysis of Equipment Failures
•
Top 3 levels of closure codes provide view down to
equipment manufacturer
•
In this example, fault frequency rate of Force10
equipment determined to be >8% across ~500 network
elements (vs. several thousand Accedian and Westell
Devices)
•
Data used to create business case for removal of
equipment as part of network modernization; resulted in
significant improvement in trouble ticket volumes res
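The tabulation behind a pareto chart like this one can be sketched in a few lines. The function name and the row layout are assumptions; the input is one closure code (for example, the Level 3 manufacturer) per ticket.

```python
from collections import Counter


def pareto(closure_codes):
    """Return (code, count, cumulative_share) rows, largest driver first."""
    counts = Counter(closure_codes)
    total = sum(counts.values())
    rows = []
    running = 0
    for code, count in counts.most_common():
        running += count
        rows.append((code, count, running / total))
    return rows
```

The cumulative share column shows how few codes account for most of the ticket volume, which is the basis for prioritizing corrective actions.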
Trouble Ticket Lifecycle – Phase 5 (Improvement Activity)
5.
Act on Trouble Ticket Data Analysis
−
Primary Objectives:
• Reduce trouble ticket volumes
• Reduce mean time to restore
−
Identify Opportunities to eliminate outages and reduce MTTR
• Determine actions that can be taken to eliminate outages (Software upgrades, equipment replacements, process improvements, training, systems, power audits, etc.)
• Engage technology vendors and demand product improvement as appropriate (e.g. >2.5% annual fault frequency rate); don’t accept subpar technologies
• Hold type II providers to high standards; report to them on their performance and request corrective action plans as appropriate. Ensure that vendor performance influences buying decisions
• Identify potential to reduce MTTR (troubleshooting processes & training, field technician locations, tools & equipment, sparing, restoration and power contractors, etc.)
• If it is worth doing, put it on an Action Item register and commit to completing it. Create impactful corrective actions and assign each one an accountable individual and a due date. Don’t create trivial corrective actions, as this diminishes the importance of urgent action items
• Have a system for following each action item to completion; for larger organizations consider dedicating an employee just to this function- it’s that important
−
Determine methodology to reduce customer responsible troubles
• Inform customers of potential chronic issues on their side; suggest potential improvement initiatives, noting that some customers may not have the ability to identify chronic issues or capability to reduce issues
• Enable customer self-service where possible (DNS updates, routing updates, equipment PMs, circuit status, etc.)- pursue advanced portal capabilities
• Bill customers for repetitive abuse of the system- i.e. using service provider to troubleshoot customer equipment or isolate among multiple providers
• In rare cases, consider “firing your customer”
Acting on the trouble ticket data dramatically improves network
performance & customer service while reducing operational costs
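An Action Item register with one accountable owner and a due date per item could be tracked with something as small as the sketch below. The class, field, and function names are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    description: str
    owner: str          # the single individual accountable for the item
    due: date
    done: bool = False


def overdue(items, today):
    """Open items past their due date, oldest due date first."""
    late = [item for item in items if not item.done and item.due < today]
    return sorted(late, key=lambda item: item.due)
```

Reviewing the output of `overdue` at a regular cadence is one simple way to follow each action item through to completion.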
Trouble Ticket Lifecycle – General Commentary
−
Always attempt to contact the customer before they contact you
• Proactively notify of outages
• Over-communicate with frequent case updates; don’t make customer ask
• Internal escalation- if appropriate, escalate to management before customer escalates and have upper management engage
−
Attempt to interact with customers in the method(s) that they prefer
• Carrier customers prefer to interact via phone
• Enterprise customers (particularly IP) generally prefer to interact via email or portal
−
Enable self-service; allow customers to service themselves for non-outage requests
• DNS Updates
• Routing Updates
• Bandwidth utilization
• Contact updates
−
Build a NOC model focused on continuous improvement
• Continuous reduction in fault frequency rate and trouble ticket volumes
• Improvement in MTTR until consistently meeting target objectives
−
Pursuit of these initiatives results in delivering on the most critical NOC objectives:
• Delivering the best in customer service