Mission-Critical Availability
• Unified solution
• Easy to configure, manage, and monitor • Reuse existing investments
• SAN/DAS environments
• Allow using HA hardware resources • Fast seamless failover
• Detect failures reliably
AlwaysOn
SQL Server HA/DR Technologies
Pre-existent Server failover
Useful in consolidation scenarios
Shared storage (SAN / SMB)
Depends on storage redundancy
Failover takes minutes
Server restart
Multi-node instance
Passive secondary nodes
Failover Cluster Instances
(for servers)
New
Multi-database Failover
DBs that app depends on
Direct attached storage
Log synchronization
Failover takes seconds
Secondary replicas are online
Multiple Secondary Replicas Active Secondary Replicas
Availability Groups
• Multi-Site Clusters
Clusters across subnets without stretch vLAN
• TempDB on local disk
Improved performance, better SAN utilization
• Indirect Checkpoints
More predictable DB recovery (failover) times
• Flexible Failover Policy
Sensitivity to failures for automatic failover
Failover Cluster Instances
Multi-database Failover
Multiple
secondaries (4)
Sync (max 2) / Async
An Availability Group Deployment
Sync Log
AlwaysOn
SQL Server HA/DR Technologies
Pre-existent Server failover
Useful in consolidation scenarios
Shared storage (SAN / SMB)
Depends on storage redundancy
Failover takes minutes
Server restart
Multi-node instance
Passive secondary nodes
Availability Groups
(for groups of databases)
New
Multi-database Failover
DBs that app depends on
Direct attached storage
Log synchronization
Failover takes seconds
Secondary replicas are online
Multiple Secondary Replicas Active Secondary Replicas
Failover Cluster Instances (for servers) Increased Number of Secondaries Increased Availability of Readable Secondaries Add Azure Replica Wizard
Support for Windows Cluster Shared
Volumes
• SQL Server 2012: Customers using (max 4) readable
secondaries to offload read workloads
• Single technology to configure / manage • Higher throughput (~7x) than Replication • Customers asking for more replicas
• Reduce query latency in geo-distributed environments (e.g.
Bwin)
• Scale-out read workloads (e.g. Baltika) • SQL Server 2014: Max 8 secondaries
• Max 2 sync secondaries for high availability
• Secondary delay depends on network latency and I/O
• ~1s within data center ~5s between data centers
Availability Groups
• Minimal performance impact
• Commits don’t wait for async secondaries • Log sender threads share log pool
• Added transaction latency of 8 async secondaries: <1% • Scoped-out: Load balancing via connection string
• Read_Only connections still routed to first available readable
secondary
• Load balancing possible via DNS round-robin or specialized
load balancers (e.g. NLB)
Availability Groups
• SQL Server 2012: Read workloads killed during network failures • Geo-distributed environments (e.g. failure/upgrade of network
equipment, ISP failures)
• Hybrid (on-premise to Azure) deployments
• SQL Server 2014: read workloads not impacted during network
failures.. or primary down.. or cluster quorum loss..
• Readable secondaries remain available during “Resolving” state • Requires direct connections to readable secondaries (Read-only
routing not supported yet)
• Replica state and last commit time available in DMV/Dashboard
Availability Groups
Availability Groups
Increased Readable Secondaries Availability
Sync Log
“The increased readable secondaries availability means our users can still find answers online and the world keeps spinning - StackOverflow
http://nickcraver.com/blog/2013/11/18/running-stack-overflow-sql-2014-ctp-2/
Availability Groups
StackOverflow can now:
• Offload more critical read workloads to readable secondary in main data
center
• Network glitches even within the same DC can happen
• Use readable secondary in DR site while data center is down (70% reads) • Simpler to change DNS than force failover and failback
• Doesn’t result in data loss
Availability Groups
Demo
• Many customers can’t afford a DR site • Site rent + maintenance, hardware, Ops
• SQL Server 2012: Started supporting replicas on Windows Azure VMs
this year
• Offload read workloads
• Offload backups (policy compliance) • Disaster recovery
• At best region
• West US, East US, East Asia, Southeast Asia, North Europe, West Europe • Latency / political considerations
Availability Groups
Availability Groups
Add Azure Replica Wizard
Sync Log
• Low TCO
• VM and storage
• Free ingress traffic
• Case studies
• Lufthansa, Thomson Reuters, Buffalo Hospital Supply
• SQL Server 2014: “Add Replica Wizard” supports Windows Azure • E2E: From provisioning VM to starting log synchronization
• Validates environment • Handles failures
• Does cleanup
Availability Groups
Demo
Screen Shots
• 24 Enhancements on Dashboard, Error Messages, DMVs, XEvents • Simplify troubleshooting & prevent issues
• Based on feedback from customers & CSS
Availability Groups & Failover Cluster Instances
Availability Groups & Failover Cluster Instances
Enhanced Diagnostics
Title Component
Show timestamps in XEL output in UTC (not adjusted to client SSMS computer) XEvents Viewer Warning about log synchronization behavior when primary replica is async Dashboard
System function IsPrimaryReplica(database_name) System function Add AG name (and replica name and DB name if relevant) to many more XEvents to
allow better data correlation between the logs XEvents Report major HADRON Manager transitions to AlwaysOn XEvent session XEvents Add Replica name context to connection established error log entry Error Log Dump relevant output from sys.dm_hadr_database_replica_states to SQL error log
when replicas change to resolving state XEvents
Add new error message to detect AG startup failure when quorum is forced Error Log Separate error msg 41142 (replica can't become primary) - raised for two importantly
different reasons Error Log
AlwaysOn Functions/DMVs should also support FCIs where applicable DMVs Improve the CREATE AG error message “AG already exists”, to say “It’s possible that a
previous DROP AG operation, executed during cluster quorum loss, didn’t delete the AG from the cluster. If so, please retry the DROP operation”
Error Message Remove FCI setup dependency on cluster.exe (deprecated) – Use Powershell Error Log
Show timestamps in XEL output in UTC (not adjusted to client SSMS computer) Warning about log synchronization behavior when primary replica is async System function IsPrimaryReplica(database_name)
Separate error msg 41142 (replica can't become primary) - raised for two importantly different reasons
•
Cluster Shared Volume (CSV)
•
Shared disk accessible to all nodes (over SMB)
•
One or more per physical drive
•
Failover Cluster Instances on CSV
•Improves SAN utilization
Removes limitation of 24 drives •
Increases I/O resiliency
Retry read/write via other nodes •
Increases failover resiliency
Disks don’t need to be unmounted/mounted
Failover Cluster Instances
Failover Cluster Instances
AlwaysOn and Windows Server
Windows Cluster Enhancements
• Windows Server 2012 • Dynamic Quorum
Removes votes from unavailable nodes Enables “last man standing”
• Increased network resiliency Handle more exceptions Avoid node evictions • Windows Server 2012 R2
• Network names without Active Directory