Users are Complaining that the System is Slow What Should I Do Now? Part 1

(1)

Users are Complaining that the

System is Slow – What Should I

Do Now? Part 1

Jeffry A. Schwartz

July 15, 2014 SQLRx Seminar

(2)

SQLRx.com

Overview

Most of you have had to deal with vague user

complaints quoting “slowness” or “hangs”

Obviously, this information is not very helpful with

getting to the source of the problem

Remember that erratic response times

• Proven to be more frustrating and counterproductive than consistently

slow response times

• People usually remember the 90th_{or 95}th_{percentile response times}

as “average”

Database is often the first place to be blamed

Dilemmas

• Many times analysts do not know where to start

- Process especially time-consuming when analysts inexperienced in performance

(3)

Overview

DBAs need techniques for determining

• Which hardware or software components, including servers, are in trouble?

• Causes of poor performance - What is to blame?

- Is the software itself the problem?

- Are the hardware’s problems caused by • Inefficient software?

• Simply too much work?

• For SQL Server servers, which queries are most troublesome? • How to develop appropriate solutions

(4)

SQLRx.com

Today’s Session

First of several 30-minute presentations

Discusses methodology for pinpointing the sources

of problems using PerfMon data alone

Subsequent presentations will delve into more SQL

Server-centric metrics and tools

Discusses high-level hardware-related and

preliminary SQL Server performance analysis

Describes information and techniques applicable to

2005+ SQL Server and all versions of Windows

(5)

Analysis Objectives

Use measurements to corroborate or discredit user

perceptions of performance

• Capture performance data on ALL servers involved in user experience

- IIS, Application, Database, etc.

- If SAN is shared, then collect on ALL servers that use it

• If perceptions are valid

- When?

- How bad?

- Duration?

- How do users know it is bad from 11:00 AM – noon?

If hardware in trouble, prove whether application is

the cause or just the amount of work

• Too many database trips per transaction?

(6)

SQLRx.com

Analysis Methodology

Use Windows Performance Monitor, a.k.a.

PerfMon, to determine

• When problems occur and their duration

• Which servers and hardware components could be involved • Are the physical servers or SQL Server short of memory? • Use graphs to visually correlate problem periods on servers

Using PerfMon to focus analyses is highly

recommended by Microsoft

(7)

Analysis Techniques Covered

Interpretation and usage of informative

performance counters

- Processor - Memory

- Physical I/O - SQL Server

• Expand upon and clarify PerfMon explanations

Provide

• Useful graphical techniques

• Insights acquired from over 1,000 customer engagements • Possible courses of action

(8)

SQLRx.com

PerfMon Hardware and SQL Server

Metrics

Invaluable hardware and SQL Server metrics

• % User Time (Processor)

• % Privileged Time (Processor) • % Interrupt Time (Processor) • % DPC Time (Processor) • % Idle Time (each Disk)

• Avg. Disk sec/Transfer (each Disk)

• Avg. Disk sec/Write (each Disk)

• Available Bytes (Memory)

• Page Life Expectancy (SQLServer:Buffer Manager)

(9)

Assessing Your System

Potential problems exist if

consistently

…

Counter Criterion

% Processor Time > 70%

% Privileged Time > 30% (Processor) % Interrupt Time > 20% (Processor) % DPC Time > 25% (Processor)

% Idle Time < 40% for any Disk LUN and especially SQL LUNs Avg. Disk sec/Transfer > 0.040 seconds (40 ms)

Avg. Disk sec/Write > 0.040 seconds (40 ms) Available Bytes < 1 GB (Memory)

(10)

SQLRx.com

Useful Processor Performance

Counters

Processor Object

• Interrupt counters are isolated to specific processors and frequently ignored (often single-threaded through interrupt processor)

• Windows only reports % Privileged Time (contains kernel mode and interrupt times) and interrupt times

- Actual kernel (Windows) time must be computed manually - REAL kernel mode time is difference between % Privileged

Time and the sum of the interrupt times (%Interrupt and %DPC)

System Object

• Processor Queue Length (waiting list length ONLY) - Can be affected by application design

(11)

Processor Usage Overview

0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% Fri Jul 21 6:00 AM 7:30 AM 9:00 AM 10:30 AM 12:00 PM 1:30 PM 3:00 PM 4:30 PM 6:00 PM Mon Jul 24 6:30 AM 8:00 AM 9:30 AM 11:00 AM 12:30 PM 2:00 PM 3:30 PM 5:00 PM 6:30 PM Tue Jul 25 7:00 AM 8:30 AM 10:00 AM 11:30 AM 1:00 PM 2:30 PM 4:00 PM 5:30 PM -900 1,800 2,700 3,600 4,500 5,400 6,300 7,200 8,100 9,000

DPC Interrupt Kernel User Deferred Procedure Calls Queued Hardware Interrupts

(12)

SQLRx.com

Physical I/O Measurements

Critical for SQL Server systems because they

are often I/O constrained

I/O time measured

directly

by disk driver,

which provides transfer times to Windows

I/O time = service time

+ queue time

due to

driver’s location in I/O path

• Disk response time

Queuing

not always

the cause of large I/O

times

• May not be possible to improve large service (working) times because of physical or financial constraints

(13)

Physical I/O Measurements

Most frequently (and badly) misunderstood

Windows metrics

Data reported for

• Each physical disk or LUN (PhysicalDisk)

- When RAID implemented in hardware, many actual disk drives can appear as one “physical” disk or LUN

- Use disk idle and disk transfer times to detect contention

within RAID itself (see I/O Performance Counters Incomplete slide)

(14)

SQLRx.com

Useful Disk Performance Counters

Physical Disk

• Avg. Disk sec/Transfer

- Should be 0.020 seconds (20 ms) at most unless I/O size huge

• % Idle Time

- Surprising how often this is ignored!

- Once this reaches zero, no more I/Os can be processed

- Performance usually degrades as it approaches zero

- Easier to represent as utilization, i.e., 100 - % Idle Time

• Disk Transfers/sec, Disk Bytes/sec

- Beware of disk specs because they usually cite very large I/Os

- Beware of “cliffs,” even on SSDs

• Read and Write-specific counters also valuable, especially when a

read/write performance disparity exists or using RAID 5

Logical Disk

• Same counters available plus space-related ones

(15)

Interpreting Performance Counters

Disk Queue lengths

• Unlike processor queue length, this INCLUDES I/Os in progress

• By far, most commonly quoted and used disk performance measurement

- Assumes relationships among transfer times, utilizations, and queue lengths

- Actually least useful, except when outrageously high or from a VM guest on a busy host (> 50% processor utilization)

- Use Transfer times and % Idle instead

• Interpretation very difficult because # of physical disks in a LUN is usually unknown unless the values are obscenely high

(16)

SQLRx.com

(17)

Misunderstood PerfMon Counters

Many PerfMon counters misunderstood, e.g., % Disk Time

• Many people continue in 2014 to believe that this metric is a

utilization metric! It most definitely is NOT!

PerfMon explanation does not help!

• “% Disk Time is the percentage of elapsed time that the selected

disk drive was busy servicing read or write requests.”

% Disk Time actually = 100 * Avg. Disk Queue Length

• Artificially constrained to 100% by PerfMon

• Actually useless, but frequently referenced and interpreted as disk

“busy” times

Actual busy = 100 - % Idle Time

• Can indicate a capacity constraint even when performance is

(18)

SQLRx.com

Disk LUN Driver Activity

Mon Jul 31 6:00 AM 8:00 AM 10:00 AM 12:00 PM 2:00 PM 4:00 PM 6:00 PM Tue Aug 1 7:00 AM 9:00 AM 11:00 AM 1:00 PM 3:00 PM 5:00 PM Wed Aug 2 6:00 AM 8:00 AM 10:00 AM 12:00 PM 2:00 PM 4:00 PM 6:00 PM Disk 03 E Disk 04 E Disk 05 D Disk 00 C 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%

Disk 03 E Disk 01 Disk 04 E Disk 03 F Disk 05 D Disk 02 F Disk 00 C Disk 06 D

(19)

Disk LUN Driver Activity

Disk 00 C D Disk 02 T 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% Mon Mar 1 2:49 PM 2:52 PM 2:55PM 2:58PM 3:19PM 3:22PM 3:25PM 3:28PM 3:31PM 3:34PM 3:37PM 3:54PM 3:57PM 4:00PM 4:03PM 4:06PM 4:09PM 4:12PM

Disk 00 C D Disk 01 R S V Disk 02 T

(20)

SQLRx.com

Disk LUN I/O Times

Disk 00 C D Disk 01 R… 0 50 100 150 200 250 300 Mon Mar 1 2:49 PM 2:52 PM 2:55PM 2:58PM 3:19PM 3:22PM 3:25PM 3:28PM 3:31PM 3:34PM 3:37PM 3:54PM 3:57PM 4:00PM 4:03PM 4:06PM 4:09PM 4:12PM

Disk 00 C D Disk 02 T Disk 01 R S V

(21)

Disk I/O Byte Traffic Overview

Mon Jul 17 5:00 PM Tue Jul 18 6:00 AM 8:00 AM 10:00 AM 12:00 PM 2:00 PM 4:00 PM 6:00 PM Wed Jul 19 7:00 AM 9:00 AM 11:00 AM 1:00 PM 3:00 PM 5:00 PM Thu Jul 20 6:00 AM 8:00 AM 10:00 AM 12:00 PM 2:00 PM 4:00 PM 6:00 PM Disk 13 Disk 12 Disk 07 0 5,000,000 10,000,000 15,000,000 20,000,000 25,000,000 30,000,000 35,000,000 40,000,000 Disk Bytes/second

(22)

SQLRx.com

I/O Performance Counters Incomplete

Some important metrics not measured

directly

• Avg. Disk Service Time per Transfer • Avg. Disk Queuing Time per Transfer

Missing values can be computed using the

Utilization Law

(23)

Utilization Law

**U = X * S**

• U => utilization of a resource • X => completion rate

• S => average service time required of a resource

System assumed to be in steady state

Not perfect, but an excellent tool

(24)

SQLRx.com

Using Utilization Law to Compute

Missing I/O-Related Times

Restate the Utilization Law

S = U / X

All calculations use PhysicalDisk counters

• LogicalDisk counters can be used, if necessary

Italicized

entities are PerfMon counters

Disk Utilization

= (100 -

% Idle Time

) / 100

Disk service time (sec)

=

Disk Utilization

/

Disk Transfers/sec

Disk queue time (sec) =

Avg. Disk

(25)

RAID Example Calculations #1 and #2

LUN #1 LUN #2 Disk Utilization 36.57% Disk Utilization 77.67% Disk Transfers/sec 0.65 Disk Transfers/sec 30.89

Avg. Disk sec/Transfer

2.0095 seconds!

Avg. Disk sec/Transfer

2.4424 seconds!

Disk service time

0.3657 / 0.65 = 0.563 seconds or

563 milliseconds

Disk service time

0.7767 / 30.89 = 0.025 seconds or 25 milliseconds

Disk queue time

2.0095 – 0.563 = 1.447 seconds or 1,447 milliseconds

Disk queue time

2.4424 – 0.025 = 2.4174 seconds or 2,417 milliseconds Bytes/Transfer 1,307 Bytes/Transfer 22,437

(26)

SQLRx.com

RAID Example #1 vs. #2

I/O times (

2.0095

vs.

2.4424

) outrageously

high

Queuing occurred on both disks

Low I/O rate of Disk #1 appears to contribute

to high service times

• 1,307 bytes should not require 563 milliseconds

How could this happen?

(27)

RAID Example #1 vs. #2

Problems began when faster processor complex

attached to an existing disk subsystem

• Customer blamed new processor for poor performance

• Customer wanted vendor to take it back because architecture was supposedly defective and slower than original

• In reality, it was MUCH faster and it was swamping the disk subsystem!

Solution was to reconfigure disk drives

• Customer refused to state exactly what they changed • Probably multiple LUNs shared same physical drives

- Becoming a more common problem as many LUNs are spread across the same physical disks

(28)

SQLRx.com

Database I/O Counters

Page reads/sec and Page writes/sec counters

Measures

physical

I/Os, not logical I/Os

• SQL Trace measures logical I/Os

May indicate

• Insufficient database memory

• Applications improperly accessing database • Improper database table implementation

Plot reads and writes on same graph

• Highlights changes in workload behavior

• Heavy write activity may coincide with periods of poor performance, especially when RAID 5 disks involved

(29)

I/O Activity vs. Scans and Forwarded

Records Graph

(30)

SQLRx.com

Detecting Insufficient SQL Memory

Use Page Life Expectancy

• Measures time unlocked buffers allowed to remain in buffer pool • Aged number that tends to drop quickly and increase slowly

If Page Life Expectancy too low (< 300)

• Allocate more memory to SQL Server or optimize queries

• Malformed queries that read inappropriate amounts of data can cause low Page Life Expectancy because of data churn

(31)

Page Lookups/sec Counter

Measures number of times SQL Server

attempted to find page in buffer pool

(32)

SQLRx.com

Page Life Expectancy vs. Page

Lookups Graph

SQL Server Page Life Expectancy & Page Lookups

0 50,000 100,000 150,000 200,000 250,000 Fri Mar 9 9:30 AM 11:15 AM 1:00 PM 2:45 PM 4:30 PM 6:15 PM 8:00 PM 9:45 PM 11:30 PM Sat Mar 10 1:15 AM 3:00 AM 4:45 AM 6:30 AM 8:15 AM 10:00 AM 11:45 AM 1:30 PM 3:15 PM 5:00 PM 6:45 PM 8:30 PM 10:15 PM Sun Mar 11 12:00 AM 1:45 AM 4:30 AM 6:15 AM 8:00 AM 9:45 AM 11:30 AM 0 100 200 300 400 500 600 700 800 900 1,000