Mission-Critical Availability

(1)

(2)

(3)

Mission-Critical Availability

• Unified solution

• Easy to configure, manage, and monitor • Reuse existing investments

• SAN/DAS environments

• Allow using HA hardware resources • Fast seamless failover

• Detect failures reliably

(4)

AlwaysOn

SQL Server HA/DR Technologies

Pre-existent Server failover

Useful in consolidation scenarios

Shared storage (SAN / SMB)

Depends on storage redundancy

Failover takes minutes

Server restart

Multi-node instance

Passive secondary nodes

Failover Cluster Instances

(for servers)

New

Multi-database Failover

DBs that app depends on

Direct attached storage

Log synchronization

Failover takes seconds

Secondary replicas are online

Multiple Secondary Replicas Active Secondary Replicas

Availability Groups

(5)

• Multi-Site Clusters

Clusters across subnets without stretch vLAN

• TempDB on local disk

Improved performance, better SAN utilization

• Indirect Checkpoints

More predictable DB recovery (failover) times

• Flexible Failover Policy

Sensitivity to failures for automatic failover

Failover Cluster Instances

(6)

Multiple

secondaries (4)

Sync (max 2) / Async

(7)

An Availability Group Deployment

Sync Log

(8)

(9)

AlwaysOn

SQL Server HA/DR Technologies

Pre-existent Server failover

Useful in consolidation scenarios

Shared storage (SAN / SMB)

Depends on storage redundancy

Failover takes minutes

Server restart

Multi-node instance

Passive secondary nodes

Availability Groups

(for groups of databases)

New

DBs that app depends on

Direct attached storage

Log synchronization

Failover takes seconds

Secondary replicas are online

Multiple Secondary Replicas Active Secondary Replicas

Failover Cluster Instances (for servers) Increased Number of Secondaries Increased Availability of Readable Secondaries Add Azure Replica Wizard

Support for Windows Cluster Shared

Volumes

(10)

• SQL Server 2012: Customers using (max 4) readable

secondaries to offload read workloads

• Single technology to configure / manage • Higher throughput (~7x) than Replication • Customers asking for more replicas

• Reduce query latency in geo-distributed environments (e.g.

Bwin)

• Scale-out read workloads (e.g. Baltika) • SQL Server 2014: Max 8 secondaries

• Max 2 sync secondaries for high availability

• Secondary delay depends on network latency and I/O

• ~1s within data center ~5s between data centers

Availability Groups

(11)

• Minimal performance impact

• Commits don’t wait for async secondaries • Log sender threads share log pool

• Added transaction latency of 8 async secondaries: <1% • Scoped-out: Load balancing via connection string

• Read_Only connections still routed to first available readable

secondary

• Load balancing possible via DNS round-robin or specialized

load balancers (e.g. NLB)

Availability Groups

(12)

• SQL Server 2012: Read workloads killed during network failures • Geo-distributed environments (e.g. failure/upgrade of network

equipment, ISP failures)

• Hybrid (on-premise to Azure) deployments

• SQL Server 2014: read workloads not impacted during network

failures.. or primary down.. or cluster quorum loss..

• Readable secondaries remain available during “Resolving” state • Requires direct connections to readable secondaries (Read-only

routing not supported yet)

• Replica state and last commit time available in DMV/Dashboard

Availability Groups

(13)

Availability Groups

Increased Readable Secondaries Availability

Sync Log

(14)

“The increased readable secondaries availability means our users can still find answers online and the world keeps spinning - StackOverflow

http://nickcraver.com/blog/2013/11/18/running-stack-overflow-sql-2014-ctp-2/

Availability Groups

(15)

StackOverflow can now:

• Offload more critical read workloads to readable secondary in main data

center

• Network glitches even within the same DC can happen

• Use readable secondary in DR site while data center is down (70% reads) • Simpler to change DNS than force failover and failback

• Doesn’t result in data loss

Availability Groups

(16)

Demo

(17)

• Many customers can’t afford a DR site • Site rent + maintenance, hardware, Ops

• SQL Server 2012: Started supporting replicas on Windows Azure VMs

this year

• Offload read workloads

• Offload backups (policy compliance) • Disaster recovery

• At best region

• West US, East US, East Asia, Southeast Asia, North Europe, West Europe • Latency / political considerations

Availability Groups

(18)

Availability Groups

Add Azure Replica Wizard

Sync Log

(19)

• Low TCO

• VM and storage

• Free ingress traffic

• Case studies

• Lufthansa, Thomson Reuters, Buffalo Hospital Supply

• SQL Server 2014: “Add Replica Wizard” supports Windows Azure • E2E: From provisioning VM to starting log synchronization

• Validates environment • Handles failures

• Does cleanup

Availability Groups

(20)

Demo

Screen Shots

(21)

• 24 Enhancements on Dashboard, Error Messages, DMVs, XEvents • Simplify troubleshooting & prevent issues

• Based on feedback from customers & CSS

Availability Groups & Failover Cluster Instances

(22)

Availability Groups & Failover Cluster Instances

Enhanced Diagnostics

Title Component

Show timestamps in XEL output in UTC (not adjusted to client SSMS computer) XEvents Viewer Warning about log synchronization behavior when primary replica is async Dashboard

System function IsPrimaryReplica(database_name) System function Add AG name (and replica name and DB name if relevant) to many more XEvents to

allow better data correlation between the logs XEvents Report major HADRON Manager transitions to AlwaysOn XEvent session XEvents Add Replica name context to connection established error log entry Error Log Dump relevant output from sys.dm_hadr_database_replica_states to SQL error log

when replicas change to resolving state XEvents

Add new error message to detect AG startup failure when quorum is forced Error Log Separate error msg 41142 (replica can't become primary) - raised for two importantly

different reasons Error Log

AlwaysOn Functions/DMVs should also support FCIs where applicable DMVs Improve the CREATE AG error message “AG already exists”, to say “It’s possible that a

previous DROP AG operation, executed during cluster quorum loss, didn’t delete the AG from the cluster. If so, please retry the DROP operation”

Error Message Remove FCI setup dependency on cluster.exe (deprecated) – Use Powershell Error Log

Show timestamps in XEL output in UTC (not adjusted to client SSMS computer) Warning about log synchronization behavior when primary replica is async System function IsPrimaryReplica(database_name)

Separate error msg 41142 (replica can't become primary) - raised for two importantly different reasons

(23)

•

Cluster Shared Volume (CSV)

•

Shared disk accessible to all nodes (over SMB)

•

One or more per physical drive

•

Failover Cluster Instances on CSV

•

Improves SAN utilization

Removes limitation of 24 drives •

Increases I/O resiliency

Retry read/write via other nodes •

Increases failover resiliency

Disks don’t need to be unmounted/mounted

Failover Cluster Instances

(24)

Failover Cluster Instances

(25)

AlwaysOn and Windows Server

Windows Cluster Enhancements

• Windows Server 2012 • Dynamic Quorum

Removes votes from unavailable nodes Enables “last man standing”

• Increased network resiliency Handle more exceptions Avoid node evictions • Windows Server 2012 R2

• Network names without Active Directory

(26)

(27)

(28)