Users are Complaining that the
System is Slow – What Should I
Do Now? Part 1
Jeffry A. Schwartz
July 15, 2014 SQLRx Seminar
SQLRx.com
Overview
Most of you have had to deal with vague user
complaints quoting “slowness” or “hangs”
Obviously, this information is not very helpful with
getting to the source of the problem
Remember that erratic response times
• Proven to be more frustrating and counterproductive than consistently
slow response times
• People usually remember the 90th or 95th percentile response times
as “average”
Database is often the first place to be blamed
Dilemmas
• Many times analysts do not know where to start
- Process especially time-consuming when analysts inexperienced in performance
Overview
DBAs need techniques for determining
• Which hardware or software components, including servers, are in trouble?
• Causes of poor performance - What is to blame?
- Is the software itself the problem?
- Are the hardware’s problems caused by • Inefficient software?
• Simply too much work?
• For SQL Server servers, which queries are most troublesome? • How to develop appropriate solutions
SQLRx.com
Today’s Session
First of several 30-minute presentations
Discusses methodology for pinpointing the sources
of problems using PerfMon data alone
Subsequent presentations will delve into more SQL
Server-centric metrics and tools
Discusses high-level hardware-related and
preliminary SQL Server performance analysis
Describes information and techniques applicable to
2005+ SQL Server and all versions of Windows
Analysis Objectives
Use measurements to corroborate or discredit user
perceptions of performance
• Capture performance data on ALL servers involved in user experience
- IIS, Application, Database, etc.
- If SAN is shared, then collect on ALL servers that use it
• If perceptions are valid
- When?
- How bad?
- Duration?
- How do users know it is bad from 11:00 AM – noon?
If hardware in trouble, prove whether application is
the cause or just the amount of work
• Too many database trips per transaction?
SQLRx.com
Analysis Methodology
Use Windows Performance Monitor, a.k.a.
PerfMon, to determine
• When problems occur and their duration
• Which servers and hardware components could be involved • Are the physical servers or SQL Server short of memory? • Use graphs to visually correlate problem periods on servers
Using PerfMon to focus analyses is highly
recommended by Microsoft
Analysis Techniques Covered
Interpretation and usage of informative
performance counters
- Processor - Memory
- Physical I/O - SQL Server
• Expand upon and clarify PerfMon explanations
Provide
• Useful graphical techniques
• Insights acquired from over 1,000 customer engagements • Possible courses of action
SQLRx.com
PerfMon Hardware and SQL Server
Metrics
Invaluable hardware and SQL Server metrics
• % User Time (Processor)• % Privileged Time (Processor) • % Interrupt Time (Processor) • % DPC Time (Processor) • % Idle Time (each Disk)
• Avg. Disk sec/Transfer (each Disk)
• Avg. Disk sec/Write (each Disk)
• Available Bytes (Memory)
• Page Life Expectancy (SQLServer:Buffer Manager)
Assessing Your System
Potential problems exist if
consistently
…
Counter Criterion
% Processor Time > 70%
% Privileged Time > 30% (Processor) % Interrupt Time > 20% (Processor) % DPC Time > 25% (Processor)
% Idle Time < 40% for any Disk LUN and especially SQL LUNs Avg. Disk sec/Transfer > 0.040 seconds (40 ms)
Avg. Disk sec/Write > 0.040 seconds (40 ms) Available Bytes < 1 GB (Memory)
SQLRx.com
Useful Processor Performance
Counters
Processor Object
• Interrupt counters are isolated to specific processors and frequently ignored (often single-threaded through interrupt processor)
• Windows only reports % Privileged Time (contains kernel mode and interrupt times) and interrupt times
- Actual kernel (Windows) time must be computed manually - REAL kernel mode time is difference between % Privileged
Time and the sum of the interrupt times (%Interrupt and %DPC)
System Object
• Processor Queue Length (waiting list length ONLY) - Can be affected by application design
Processor Usage Overview
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% Fri Jul 21 6:00 AM 7:30 AM 9:00 AM 10:30 AM 12:00 PM 1:30 PM 3:00 PM 4:30 PM 6:00 PM Mon Jul 24 6:30 AM 8:00 AM 9:30 AM 11:00 AM 12:30 PM 2:00 PM 3:30 PM 5:00 PM 6:30 PM Tue Jul 25 7:00 AM 8:30 AM 10:00 AM 11:30 AM 1:00 PM 2:30 PM 4:00 PM 5:30 PM -900 1,800 2,700 3,600 4,500 5,400 6,300 7,200 8,100 9,000DPC Interrupt Kernel User Deferred Procedure Calls Queued Hardware Interrupts
SQLRx.com
Physical I/O Measurements
Critical for SQL Server systems because they
are often I/O constrained
I/O time measured
directly
by disk driver,
which provides transfer times to Windows
I/O time = service time
+ queue time
due to
driver’s location in I/O path
• Disk response time
Queuing
not always
the cause of large I/O
times
• May not be possible to improve large service (working) times because of physical or financial constraints
Physical I/O Measurements
Most frequently (and badly) misunderstood
Windows metrics
Data reported for
• Each physical disk or LUN (PhysicalDisk)
- When RAID implemented in hardware, many actual disk drives can appear as one “physical” disk or LUN
- Use disk idle and disk transfer times to detect contention
within RAID itself (see I/O Performance Counters Incomplete slide)
SQLRx.com
Useful Disk Performance Counters
Physical Disk
• Avg. Disk sec/Transfer
- Should be 0.020 seconds (20 ms) at most unless I/O size huge
• % Idle Time
- Surprising how often this is ignored!
- Once this reaches zero, no more I/Os can be processed
- Performance usually degrades as it approaches zero
- Easier to represent as utilization, i.e., 100 - % Idle Time
• Disk Transfers/sec, Disk Bytes/sec
- Beware of disk specs because they usually cite very large I/Os
- Beware of “cliffs,” even on SSDs
• Read and Write-specific counters also valuable, especially when a
read/write performance disparity exists or using RAID 5
Logical Disk
• Same counters available plus space-related ones
Interpreting Performance Counters
Disk Queue lengths
• Unlike processor queue length, this INCLUDES I/Os in progress
• By far, most commonly quoted and used disk performance measurement
- Assumes relationships among transfer times, utilizations, and queue lengths
- Actually least useful, except when outrageously high or from a VM guest on a busy host (> 50% processor utilization)
- Use Transfer times and % Idle instead
• Interpretation very difficult because # of physical disks in a LUN is usually unknown unless the values are obscenely high
SQLRx.com
Misunderstood PerfMon Counters
Many PerfMon counters misunderstood, e.g., % Disk Time
• Many people continue in 2014 to believe that this metric is a
utilization metric! It most definitely is NOT!
PerfMon explanation does not help!
• “% Disk Time is the percentage of elapsed time that the selected
disk drive was busy servicing read or write requests.”
% Disk Time actually = 100 * Avg. Disk Queue Length
• Artificially constrained to 100% by PerfMon
• Actually useless, but frequently referenced and interpreted as disk
“busy” times
Actual busy = 100 - % Idle Time
• Can indicate a capacity constraint even when performance is
SQLRx.com
Disk LUN Driver Activity
Mon Jul 31 6:00 AM 8:00 AM 10:00 AM 12:00 PM 2:00 PM 4:00 PM 6:00 PM Tue Aug 1 7:00 AM 9:00 AM 11:00 AM 1:00 PM 3:00 PM 5:00 PM Wed Aug 2 6:00 AM 8:00 AM 10:00 AM 12:00 PM 2:00 PM 4:00 PM 6:00 PM Disk 03 E Disk 04 E Disk 05 D Disk 00 C 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
Disk 03 E Disk 01 Disk 04 E Disk 03 F Disk 05 D Disk 02 F Disk 00 C Disk 06 D
Disk LUN Driver Activity
Disk 00 C D Disk 02 T 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% Mon Mar 1 2:49 PM 2:52 PM 2:55PM 2:58PM 3:19PM 3:22PM 3:25PM 3:28PM 3:31PM 3:34PM 3:37PM 3:54PM 3:57PM 4:00PM 4:03PM 4:06PM 4:09PM 4:12PMDisk 00 C D Disk 01 R S V Disk 02 T
SQLRx.com
Disk LUN I/O Times
Disk 00 C D Disk 01 R… 0 50 100 150 200 250 300 Mon Mar 1 2:49 PM 2:52 PM 2:55PM 2:58PM 3:19PM 3:22PM 3:25PM 3:28PM 3:31PM 3:34PM 3:37PM 3:54PM 3:57PM 4:00PM 4:03PM 4:06PM 4:09PM 4:12PM
Disk 00 C D Disk 02 T Disk 01 R S V
Disk I/O Byte Traffic Overview
Mon Jul 17 5:00 PM Tue Jul 18 6:00 AM 8:00 AM 10:00 AM 12:00 PM 2:00 PM 4:00 PM 6:00 PM Wed Jul 19 7:00 AM 9:00 AM 11:00 AM 1:00 PM 3:00 PM 5:00 PM Thu Jul 20 6:00 AM 8:00 AM 10:00 AM 12:00 PM 2:00 PM 4:00 PM 6:00 PM Disk 13 Disk 12 Disk 07 0 5,000,000 10,000,000 15,000,000 20,000,000 25,000,000 30,000,000 35,000,000 40,000,000 Disk Bytes/secondSQLRx.com
I/O Performance Counters Incomplete
Some important metrics not measured
directly
• Avg. Disk Service Time per Transfer • Avg. Disk Queuing Time per Transfer
Missing values can be computed using the
Utilization Law
Utilization Law
U = X * S
• U => utilization of a resource • X => completion rate
• S => average service time required of a resource
System assumed to be in steady state
Not perfect, but an excellent tool
SQLRx.com
Using Utilization Law to Compute
Missing I/O-Related Times
Restate the Utilization Law
S = U / XAll calculations use PhysicalDisk counters
• LogicalDisk counters can be used, if necessaryItalicized
entities are PerfMon counters
Disk Utilization
= (100 -
% Idle Time
) / 100
Disk service time (sec)
=
Disk Utilization
/
Disk Transfers/sec
Disk queue time (sec) =
Avg. Disk
RAID Example Calculations #1 and #2
LUN #1 LUN #2 Disk Utilization 36.57% Disk Utilization 77.67% Disk Transfers/sec 0.65 Disk Transfers/sec 30.89Avg. Disk sec/Transfer
2.0095 seconds!
Avg. Disk sec/Transfer
2.4424 seconds!
Disk service time
0.3657 / 0.65 = 0.563 seconds or
563 milliseconds
Disk service time
0.7767 / 30.89 = 0.025 seconds or 25 milliseconds
Disk queue time
2.0095 – 0.563 = 1.447 seconds or 1,447 milliseconds
Disk queue time
2.4424 – 0.025 = 2.4174 seconds or 2,417 milliseconds Bytes/Transfer 1,307 Bytes/Transfer 22,437
SQLRx.com
RAID Example #1 vs. #2
I/O times (
2.0095
vs.
2.4424
) outrageously
high
Queuing occurred on both disks
Low I/O rate of Disk #1 appears to contribute
to high service times
• 1,307 bytes should not require 563 milliseconds
How could this happen?
RAID Example #1 vs. #2
Problems began when faster processor complex
attached to an existing disk subsystem
• Customer blamed new processor for poor performance
• Customer wanted vendor to take it back because architecture was supposedly defective and slower than original
• In reality, it was MUCH faster and it was swamping the disk subsystem!
Solution was to reconfigure disk drives
• Customer refused to state exactly what they changed • Probably multiple LUNs shared same physical drives
- Becoming a more common problem as many LUNs are spread across the same physical disks
SQLRx.com
Database I/O Counters
Page reads/sec and Page writes/sec counters
Measures
physical
I/Os, not logical I/Os
• SQL Trace measures logical I/OsMay indicate
• Insufficient database memory
• Applications improperly accessing database • Improper database table implementation
Plot reads and writes on same graph
• Highlights changes in workload behavior• Heavy write activity may coincide with periods of poor performance, especially when RAID 5 disks involved
I/O Activity vs. Scans and Forwarded
Records Graph
SQLRx.com
Detecting Insufficient SQL Memory
Use Page Life Expectancy
• Measures time unlocked buffers allowed to remain in buffer pool • Aged number that tends to drop quickly and increase slowly
If Page Life Expectancy too low (< 300)
• Allocate more memory to SQL Server or optimize queries• Malformed queries that read inappropriate amounts of data can cause low Page Life Expectancy because of data churn
Page Lookups/sec Counter
Measures number of times SQL Server
attempted to find page in buffer pool
SQLRx.com
Page Life Expectancy vs. Page
Lookups Graph
SQL Server Page Life Expectancy & Page Lookups
0 50,000 100,000 150,000 200,000 250,000 Fri Mar 9 9:30 AM 11:15 AM 1:00 PM 2:45 PM 4:30 PM 6:15 PM 8:00 PM 9:45 PM 11:30 PM Sat Mar 10 1:15 AM 3:00 AM 4:45 AM 6:30 AM 8:15 AM 10:00 AM 11:45 AM 1:30 PM 3:15 PM 5:00 PM 6:45 PM 8:30 PM 10:15 PM Sun Mar 11 12:00 AM 1:45 AM 4:30 AM 6:15 AM 8:00 AM 9:45 AM 11:30 AM 0 100 200 300 400 500 600 700 800 900 1,000
Lookups Page Life Expectancy Page Life Expectancy Threshold
Batch Requests/sec
Number of select, insert, and delete
statements
Each
of these statements triggers a batch
event, which increments the counter
Note: Also includes
each of these statement
SQLRx.com
Batch Requests vs. Page Lookups
Graph
Conclusions
PerfMon should
always
be used to focus
application troubleshooting and tuning
efforts
Extremely
important to combine Windows
system performance for
ALL
servers used by
the application
• Especially true for processor, memory, and I/O
• Include SQL Server PerfMon and internal metrics when applicable
SQLRx.com
Follow-Up
Please attend next session
•
Will discuss SQL Server Dynamic Management Views
(DMVs)
Please complete the four question
evaluation to let us know
• How we can help