1
High Availability for Microsoft
Servers
Brendan Murphy
IFIP, New York
June 2000
2
Agenda
• Windows 2000 dependability.
• Tomorrow's problems.
3
Dependability Goal Setting.
From the customer's perspective.
• Win 2k dependability > NT4.
Irrespective of additional new features.
[Figure: OpenVMS Metrics (running on VAX Systems), Post Installation Behaviour. X axis: Operating System Version (V5.5, V5.5-1, V5.5-2, V6.0, V6.1, V6.2, V7.0, V7.1). Y axis: Rate Of System Outages (0 to 100). Series: Upper Confidence Bound, Average, Lower Confidence Bound.]
VAX Servers: ©FTSC 1999 Madison, Murphy, Davies, Compaq.
Note reliability improvements between each ‘service pack’ release.
Note drop in reliability between ‘service pack’ and major release.
Exception
4
Making NT Measurable
Logging System Events
• Capturing time and type of outage.
  – 6006 = clean shutdown.
  – 6008 = blue screen.
  – 6005 = time of reboot.
• Capturing OS version and upgrade.
  – 6009 = OS version on reboot.
  – 4353 = installation of service pack.
• Capturing cause of system crash.
  – 1001 = brief description of crash details.
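The event IDs above are enough to count outages by type. A minimal sketch, assuming the events have already been exported as (timestamp, event ID) pairs; the function name, the sample data, and the rule that planned downtime runs from a 6006 to the next 6005 are illustrative assumptions, not part of the talk:

```python
from datetime import datetime, timedelta

# Event ID meanings exactly as listed on the slide.
EVENT_MEANING = {
    6006: "clean shutdown",
    6008: "blue screen",
    6005: "time of reboot",
    6009: "OS version on reboot",
    4353: "installation of service pack",
    1001: "brief description of crash details",
}

def summarize_outages(events):
    """Count outages by type from (timestamp, event_id) pairs.

    Assumption: planned downtime is the gap between a clean shutdown
    (6006) and the next reboot (6005). Crash downtime is not estimated,
    because the slide does not say how the crash time is recovered.
    """
    events = sorted(events)
    clean = sum(1 for _, eid in events if eid == 6006)
    crashes = sum(1 for _, eid in events if eid == 6008)
    planned_downtime = timedelta()
    shutdown_at = None
    for ts, eid in events:
        if eid == 6006:
            shutdown_at = ts
        elif eid == 6005 and shutdown_at is not None:
            planned_downtime += ts - shutdown_at
            shutdown_at = None
    return {
        "clean_shutdowns": clean,
        "blue_screens": crashes,
        "planned_downtime": planned_downtime,
    }

# Hypothetical sample log: one clean restart, one crash.
SAMPLE_EVENTS = [
    (datetime(2000, 1, 1, 2, 0), 6006),
    (datetime(2000, 1, 1, 2, 10), 6005),
    (datetime(2000, 1, 1, 2, 10), 6009),
    (datetime(2000, 1, 5, 9, 0), 6005),
    (datetime(2000, 1, 5, 9, 0), 6008),
    (datetime(2000, 1, 5, 9, 1), 1001),
]

if __name__ == "__main__":
    print(summarize_outages(SAMPLE_EVENTS))
```

Events 6009 and 4353 are not used in the count; they would let the same tally be split by OS version and service pack level.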
5
NT4 Failure Classes
• Core NT: 43%
• Device drivers: 16%
• Other third-party drivers: 16%
• Hardware failure: 13%
• Anti-virus: 12%
Source: Sample from PSS Incidents: NT Server 4.0 8/5/96-4/7/98
6
Windows 2000 Bluescreen Reduction
• Anti-virus
  – Anti-virus dev labs
  – Driver verifier
  – Better DDK
• Core NT
  – Kernel verifier
  – Driver verifier
  – Full-time source code reviewers
  – Better longhaul testing
  – Better component stress
  – PREfix source scanning
• Hardware failures
  – Hardware compatibility list
• Device drivers
  – Broader device test matrix
  – Driver verifier
  – Better DDK
• Other third-party drivers
  – File system dev labs
  – Driver verifier
7
NT4 Server Reboot Causes
• OS install: 27%
• Hardware install/config: 3%
• Preventative reboots: 20%
• OS configuration: 7%
• Application install/config: 8%
• Application failure: 21%
• System failure: 14%
65% of outages are planned.
35% of outages are unplanned.
Source: One site, 1,180 servers, 9/1/98-5/7/99, SP4
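The planned/unplanned split can be checked against the individual causes. A small sketch; the slide gives only the two totals, so the assignment of each cause to a class below is an assumption, chosen because it reproduces the 65%/35% figures:

```python
# Percentages from the slide; the "planned"/"unplanned" labels are assumed.
REBOOT_CAUSES = {
    "OS install": (27, "planned"),
    "Hardware install/config": (3, "planned"),
    "Preventative reboots": (20, "planned"),
    "OS configuration": (7, "planned"),
    "Application install/config": (8, "planned"),
    "Application failure": (21, "unplanned"),
    "System failure": (14, "unplanned"),
}

def share(kind):
    """Sum the percentages of all causes with the given class label."""
    return sum(pct for pct, label in REBOOT_CAUSES.values() if label == kind)

if __name__ == "__main__":
    print("planned:", share("planned"))      # 27 + 3 + 20 + 7 + 8
    print("unplanned:", share("unplanned"))  # 21 + 14
```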
8
Windows 2000 Reboot Reduction
Unplanned reboot reduction
• System failure
  – Bluescreen reduction
• App failure
  – Resource partitioning
  – Task mgr “kill proc tree”
  – IIS restart
Planned reboot reduction
• Preventative reboots
  – Published best practices
  – Resource partitioning
  – IIS restart
• App install/config
  – Windows file protection
  – Windows installer
• OS configuration
  – Eliminated dozens of configuration reboots
• OS install
  – Service pack slipstreaming
• Hardware install/config
  – Plug-n-Play
9
Windows 2000 Verification
[Diagram: Windows 2000 verification. Labels, in extraction order: Problem; Features; Bug Fixes + features; Test; Development; Field Tests (Beta Testing); 38 packs/working day; 288 variants/working day; 7000+; >5000; Customer Servers; 5000+ Supported Computers; Application Test & Development.]
10
Testing Process.
Incremental Improvement
Daily Builds
Weekly Builds
RC/Beta Builds
RTM
11
Windows 2000
Failure Analysis.
• System config: 34%
• HW failure: 22%
• Drivers for non-HCL HW: 20%
• Other 3rd party kernel code: 11%
• Drivers for HCL HW: 7%
• Anti-virus: 4%
• MS internal code: 2%
• Other IFS drivers: 0%
NT4 failure classes, as on the earlier NT4 Failure Classes slide:
• Core NT: 43%
• Device drivers: 16%
• Other third-party drivers: 16%
• Hardware failure: 13%
• Anti-virus: 12%
Source: Sample from PSS Incidents:
12
Future/current dependability
issues
• Scalability
  – Vertical
    • Complex faults.
    • Change control.
    • Application configuration management.
  – Horizontal
    • System management.
    • Shared data.
13
Future/current dependability
issues (cont.)
• Measurement and characterization.
  – System not the metal box.
    • Measure MSN behaviour?
  – Relate to reality.
    • Not 99.999% availability claims.
  – Drive correct behaviour.
    • For both developers and customers.
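One way to relate an availability claim to reality is to convert it into the downtime it allows. A minimal sketch, assuming a 365-day year; the function name is illustrative and not from the talk:

```python
def downtime_minutes_per_year(availability_percent, days_per_year=365):
    """Downtime per year, in minutes, implied by an availability percentage."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * days_per_year * 24 * 60

if __name__ == "__main__":
    for claim in (99.0, 99.9, 99.99, 99.999):
        minutes = downtime_minutes_per_year(claim)
        print(f"{claim}% -> {minutes:.2f} minutes/year")
```

Under this assumption 99.999% allows roughly 5.26 minutes of downtime a year, less than a single reboot typically takes.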
14
Future / current dependability
issues (cont.)
• Size.
  – Geographic solutions.
  – Reassess failure rates.
    • Failure rates per TB of memory.
• Real-time change management.
• Cost of ownership.
  – Decrease the skill level and number of system managers required.
15
Future / current dependability
issues (cont.)
• Lack of hierarchy.
  – Storage, computer, network, applications.
• The system is not the computer.
  – MSN gets a rack of computers every 2 weeks.
• Mobiles.
  – Security/trust.
  – Synchronization.
16
Microsoft Research
Development tools
• Code analysis.
  – Generic analysis applied to all new code.
    • Decrease false positives.
  – Automating verification of best practices (e.g. driver development).
• Driver verifiers.
  – Fault injection(ish).
  – Software development environments.
  – Using OS as test environment.
  – Objective is to protect the operating system.
17
Testing
$162 million problem and rising
• Testing.
  – Virtual labs.
  – Specific labs.
  – Test development.
  – Attribute testing.
  – Measurements.
  – Release criteria.
  – Motivation is $$$$$$$$
18
Failure predictions
• Are hardware failure prediction methodologies applicable to software failures?
• Analysis of SQL and Exchange using NT logs.
• Apply to other applications.
19
System Fault Management.
• Assume no hierarchy.
• Assume limited knowledge between different elements (i.e. abstraction).
• System is not a metal box.
• Need to work under failure conditions.
• Pattern recognition is time based.
  – Pattern variations based on system speed.
20
Conclusions/Issues
• Dependability is improving.
  – In spite of using ‘wrong’ methods?
• Industry has problems and research has solutions.
  – But what is the relationship?
• Industry is addressing current problems through innovation and brute force.
  – Opportunities to address future issues (predicting the future is risky).