Mission-Critical Availability

Academic year: 2021

• Unified solution
• Easy to configure, manage, and monitor
• Reuse existing investments
• SAN/DAS environments
• Allow using HA hardware resources
• Fast seamless failover
• Detect failures reliably

AlwaysOn

SQL Server HA/DR Technologies

Failover Cluster Instances (for servers): pre-existent
• Server failover: useful in consolidation scenarios
• Shared storage (SAN / SMB): depends on storage redundancy
• Failover takes minutes: server restart
• Multi-node instance: passive secondary nodes

Availability Groups (for groups of databases): new
• Multi-database failover: DBs that the app depends on
• Direct attached storage: log synchronization
• Failover takes seconds: secondary replicas are online
• Multiple secondary replicas
• Active secondary replicas

Failover Cluster Instances

• Multi-Site Clusters: clusters across subnets without stretch VLAN
• TempDB on local disk: improved performance, better SAN utilization
• Indirect Checkpoints: more predictable DB recovery (failover) times
• Flexible Failover Policy: sensitivity to failures for automatic failover
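The "Indirect Checkpoints" item above can be illustrated with a small sketch. It only builds the T-SQL statement that sets a per-database recovery-time target; the helper name and the example database name are hypothetical, and actually executing the statement (connection, permissions) is left out.

```python
def target_recovery_time_sql(database: str, seconds: int) -> str:
    """Build the ALTER DATABASE statement that sets an indirect-checkpoint target.

    `database` is a hypothetical database name; a value of 0 seconds leaves the
    database on automatic (non-indirect) checkpoints.
    """
    if seconds < 0:
        raise ValueError("seconds must be zero or positive")
    # Bracket-quote the identifier, doubling any closing bracket inside it.
    quoted = "[" + database.replace("]", "]]") + "]"
    return f"ALTER DATABASE {quoted} SET TARGET_RECOVERY_TIME = {seconds} SECONDS;"


if __name__ == "__main__":
    print(target_recovery_time_sql("Sales", 60))
```

A lower target generally means more frequent background writes in exchange for a shorter, more predictable recovery after failover; the right value depends on the workload.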

Availability Groups

• Multi-database failover
• Multiple secondaries (4)
• Sync (max 2) / Async

[Figure: An Availability Group Deployment; diagram label: Sync Log]

AlwaysOn

SQL Server HA/DR Technologies

Availability Groups (for groups of databases)
• Increased Number of Secondaries
• Increased Availability of Readable Secondaries
• Add Azure Replica Wizard

Failover Cluster Instances (for servers)
• Support for Windows Cluster Shared Volumes

Availability Groups

SQL Server 2012: Customers using (max 4) readable secondaries to offload read workloads
• Single technology to configure / manage
• Higher throughput (~7x) than Replication
• Customers asking for more replicas
• Reduce query latency in geo-distributed environments (e.g. Bwin)
• Scale-out read workloads (e.g. Baltika)
• SQL Server 2014: Max 8 secondaries
• Max 2 sync secondaries for high availability
• Secondary delay depends on network latency and I/O
• ~1s within data center, ~5s between data centers
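The replica limits quoted above (max 4 secondaries in SQL Server 2012, max 8 in SQL Server 2014, at most 2 synchronous in both) can be expressed as a small planning check. This is a minimal sketch: the `Secondary` type, the replica names, and the `validate_topology` helper are hypothetical, and the limits table contains only what the slides state.

```python
from dataclasses import dataclass

# Limits as quoted on the slides; nothing else is assumed about the product.
LIMITS = {
    "2012": {"max_secondaries": 4, "max_sync_secondaries": 2},
    "2014": {"max_secondaries": 8, "max_sync_secondaries": 2},
}


@dataclass(frozen=True)
class Secondary:
    name: str          # hypothetical replica name
    synchronous: bool  # True = synchronous commit, False = asynchronous


def validate_topology(version: str, secondaries: list) -> list:
    """Return a list of violation messages; an empty list means the plan fits."""
    limits = LIMITS[version]
    problems = []
    if len(secondaries) > limits["max_secondaries"]:
        problems.append(
            f"{len(secondaries)} secondaries exceeds the limit of "
            f"{limits['max_secondaries']} for SQL Server {version}"
        )
    sync_count = sum(1 for s in secondaries if s.synchronous)
    if sync_count > limits["max_sync_secondaries"]:
        problems.append(
            f"{sync_count} synchronous secondaries exceeds the limit of "
            f"{limits['max_sync_secondaries']}"
        )
    return problems


if __name__ == "__main__":
    plan = [Secondary(f"replica{i}", synchronous=(i < 2)) for i in range(6)]
    print(validate_topology("2012", plan))  # one violation: too many secondaries
    print(validate_topology("2014", plan))  # no violations
```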

Availability Groups

Minimal performance impact
• Commits don’t wait for async secondaries
• Log sender threads share log pool
• Added transaction latency of 8 async secondaries: <1%
• Scoped-out: Load balancing via connection string
• Read_Only connections still routed to first available readable secondary
• Load balancing possible via DNS round-robin or specialized load balancers (e.g. NLB)
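Since load balancing through the connection string was scoped out, spreading reads across secondaries has to happen outside SQL Server. The sketch below is a client-side stand-in for what DNS round-robin or a load balancer would do, not the mechanism the slides describe: it rotates through a list of readable secondaries and builds a read-intent connection string for each. The server names, database name, driver name, and the `ReadOnlyPool` class are all assumptions for illustration.

```python
from itertools import cycle


def read_only_connection_string(server: str, database: str) -> str:
    """Build an ODBC-style connection string that declares read-only intent."""
    return (
        "Driver={ODBC Driver 17 for SQL Server};"  # assumed driver name
        f"Server={server};Database={database};"
        "Trusted_Connection=yes;ApplicationIntent=ReadOnly;"
    )


class ReadOnlyPool:
    """Hand out connection strings for readable secondaries in rotation."""

    def __init__(self, servers, database: str):
        servers = list(servers)
        if not servers:
            raise ValueError("at least one readable secondary is required")
        self._servers = cycle(servers)
        self._database = database

    def next_connection_string(self) -> str:
        return read_only_connection_string(next(self._servers), self._database)


if __name__ == "__main__":
    pool = ReadOnlyPool(["sqlsec1", "sqlsec2"], "Sales")  # hypothetical names
    for _ in range(3):
        print(pool.next_connection_string())
```

A plain rotation ignores replica health and lag, so a production setup would more likely rely on DNS or a load balancer with health probes, as the slide suggests.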

Availability Groups

SQL Server 2012: Read workloads killed during network failures
• Geo-distributed environments (e.g. failure/upgrade of network equipment, ISP failures)
• Hybrid (on-premises to Azure) deployments

SQL Server 2014: Read workloads not impacted during network failures, when the primary is down, or on cluster quorum loss

Readable secondaries remain available during “Resolving” state
• Requires direct connections to readable secondaries (read-only routing not supported yet)
• Replica state and last commit time available in DMV/Dashboard
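Because a disconnected secondary keeps serving reads, an application may want to check how old its data could be before using it. The sketch below assumes the last commit time has already been read from sys.dm_hadr_database_replica_states (the query text is included for reference only and is not executed here); the helper names and the staleness threshold are hypothetical.

```python
from datetime import datetime, timedelta

# Reference query; running it needs a live connection, which this sketch omits.
REPLICA_STATE_QUERY = """
SELECT database_id, synchronization_state_desc, last_commit_time
FROM sys.dm_hadr_database_replica_states
WHERE is_local = 1;
"""


def staleness(last_commit_time: datetime, now: datetime) -> timedelta:
    """Time elapsed since the last commit seen on this replica (never negative)."""
    return max(now - last_commit_time, timedelta(0))


def acceptable_for_reads(
    last_commit_time: datetime, now: datetime, max_staleness: timedelta
) -> bool:
    """True if the replica's last commit is recent enough for this workload."""
    return staleness(last_commit_time, now) <= max_staleness


if __name__ == "__main__":
    now = datetime(2014, 1, 1, 12, 0, 0)
    print(acceptable_for_reads(datetime(2014, 1, 1, 11, 59, 30), now, timedelta(minutes=1)))
```

The elapsed time is only an upper bound on how far behind the replica is: if the primary has been idle, an old last commit time does not mean any data is missing.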

[Figure: Availability Groups, Increased Readable Secondaries Availability; diagram label: Sync Log]

Availability Groups

“The increased readable secondaries availability means our users can still find answers online and the world keeps spinning.” - StackOverflow

http://nickcraver.com/blog/2013/11/18/running-stack-overflow-sql-2014-ctp-2/

Availability Groups

StackOverflow can now:

Offload more critical read workloads to readable secondary in main data center
• Network glitches even within the same DC can happen

Use readable secondary in DR site while data center is down (70% reads)
• Simpler to change DNS than force failover and failback
• Doesn’t result in data loss

Demo


Availability Groups

Many customers can’t afford a DR site
• Site rent + maintenance, hardware, Ops

SQL Server 2012: Started supporting replicas on Windows Azure VMs this year
• Offload read workloads
• Offload backups (policy compliance)
• Disaster recovery

At best region
• West US, East US, East Asia, Southeast Asia, North Europe, West Europe
• Latency / political considerations

[Figure: Availability Groups, Add Azure Replica Wizard; diagram label: Sync Log]

Availability Groups

Low TCO
• VM and storage
• Free ingress traffic

Case studies
• Lufthansa, Thomson Reuters, Buffalo Hospital Supply

SQL Server 2014: “Add Replica Wizard” supports Windows Azure
• E2E: From provisioning VM to starting log synchronization
• Validates environment
• Handles failures
• Does cleanup

Demo

Screen Shots


Availability Groups & Failover Cluster Instances

24 Enhancements on Dashboard, Error Messages, DMVs, XEvents
• Simplify troubleshooting & prevent issues
• Based on feedback from customers & CSS

Availability Groups & Failover Cluster Instances

Enhanced Diagnostics

Title | Component
Show timestamps in XEL output in UTC (not adjusted to client SSMS computer) | XEvents Viewer
Warning about log synchronization behavior when primary replica is async | Dashboard
System function IsPrimaryReplica(database_name) | System function
Add AG name (and replica name and DB name if relevant) to many more XEvents to allow better data correlation between the logs | XEvents
Report major HADRON Manager transitions to AlwaysOn XEvent session | XEvents
Add Replica name context to connection established error log entry | Error Log
Dump relevant output from sys.dm_hadr_database_replica_states to SQL error log when replicas change to resolving state | XEvents
Add new error message to detect AG startup failure when quorum is forced | Error Log
Separate error msg 41142 (replica can't become primary) - raised for two importantly different reasons | Error Log
AlwaysOn Functions/DMVs should also support FCIs where applicable | DMVs
Improve the CREATE AG error message “AG already exists”, to say “It’s possible that a previous DROP AG operation, executed during cluster quorum loss, didn’t delete the AG from the cluster. If so, please retry the DROP operation” | Error Message
Remove FCI setup dependency on cluster.exe (deprecated); use PowerShell | Error Log

Failover Cluster Instances

Cluster Shared Volume (CSV)
• Shared disk accessible to all nodes (over SMB)
• One or more per physical drive

Failover Cluster Instances on CSV
• Improves SAN utilization: removes limitation of 24 drives
• Increases I/O resiliency: retry read/write via other nodes
• Increases failover resiliency: disks don’t need to be unmounted/mounted

AlwaysOn and Windows Server

Windows Cluster Enhancements

Windows Server 2012
• Dynamic Quorum: removes votes from unavailable nodes, enables “last man standing”
• Increased network resiliency: handles more exceptions, avoids node evictions

Windows Server 2012 R2
• Network names without Active Directory
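The effect of removing votes from unavailable nodes can be seen in a toy model. This is a deliberately simplified sketch, not the Windows clustering algorithm: every node has one vote, there is no witness, failures arrive in batches, and an exact 50/50 split is assumed to be survivable only when `tie_survives` is set (in a real cluster that depends on which node holds the vote or on a witness). All function names are hypothetical.

```python
def has_quorum(survivors: int, votes: int, tie_survives: bool) -> bool:
    """Strict majority of current votes, with an optional tie-break assumption."""
    if survivors * 2 > votes:
        return True
    return tie_survives and survivors > 0 and survivors * 2 == votes


def survives_static(total_nodes: int, failure_batches, tie_survives: bool = False) -> bool:
    """Static quorum: the vote count never changes as nodes fail."""
    survivors = total_nodes
    for batch in failure_batches:
        survivors -= batch
        if not has_quorum(survivors, total_nodes, tie_survives):
            return False
    return True


def survives_dynamic(total_nodes: int, failure_batches, tie_survives: bool = True) -> bool:
    """Dynamic quorum: after each survived failure, lost nodes stop counting."""
    votes = total_nodes
    for batch in failure_batches:
        survivors = votes - batch
        if not has_quorum(survivors, votes, tie_survives):
            return False
        votes = survivors  # votes of unavailable nodes are removed
    return True


if __name__ == "__main__":
    one_at_a_time = [1, 1, 1, 1]  # five nodes shrinking to a single survivor
    print(survives_static(5, one_at_a_time))   # static quorum is lost at two survivors
    print(survives_dynamic(5, one_at_a_time))  # dynamic model reaches "last man standing"
    print(survives_dynamic(5, [3]))            # a simultaneous majority loss still fails
```

The model shows why "last man standing" only applies to sequential failures: the cluster needs time between failures to recalculate votes.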
