• No results found

Database High Availability. Solutions 2010

N/A
N/A
Protected

Academic year: 2021

Share "Database High Availability. Solutions 2010"

Copied!
22
0
0

Loading.... (view fulltext now)

Full text

(1)

DB2 9 DBA certification

exam 731

Database High Availability

Solutions 2010

P.O. Box 200, 5520 AE Eersel, The Netherlands Tel.:(+31) 497-530190, Fax: (+31) 497-530191

E-mail: [email protected]

Disclaimer

 The information contained in this presentation is based on techniques, algorithms, and documentation published by the several authors and companies, and in addition is the result of research. It is therefore subject to change at any time without notice or warning.

 The information contained in this presentation has not been submitted to any formal tests or review and is distributed on an “As is” basis without any warranty, either expressed or implied.  The use of this information or the implementation of any of these techniques is a client

responsibility and depends on the client’s ability to evaluate and integrate them into the client’s operational environment.

 While each item may have been reviewed for accuracy in a specific situation, there is no guarantee that the same or similar results will be obtained elsewhere.

 Clients attempting to adapt these techniques to their own environments do so at their own risks.  Foils, handouts, and additional materials distributed as part of this presentation or seminar

should be reviewed in their entirety.

Note: This presentation gives you an overview of techniques used by database vendors. It can not be used for making company decisions regarding high availability without further studies.

(2)

Copyright © KBCE b.v., 2010 – All Intellectual Rights Reserved 3

Trademarks

This presentation contains many trademarks in use by database vendors if we are aware of a trademark we put it in capitals.

Agenda

What is downtime Techniques in use SQL Server Cluster

DB2 Data Sharing / PureScale vs Oracle RAC Wise Words

(3)

What is downtime?

Terminology in use

Term

Business Risk

Solution

Data Recovery

Downtime and

Data loss

Redundant data

High

Availability

Downtime

Redundant

system

components

Disaster

Recovery

Permanent Data

loss and

"unable to continue"

Redundant

systems

and facilities

Not investing in hardware, software and knowledge

(4)

Copyright © KBCE b.v., 2010 – All Intellectual Rights Reserved 7

About Downtime and Data loss

Permitted downtime?

Acceptable data/transaction loss (if any)?

Mean time to recovery?

• Difference for "normal down" and disaster?

How much damage is done after how much time? $$?

Who is first in case of disaster?

Note:

Database uptime ≠ application availability

• Application failures

• Hardware Outages (Power, Network, etc.)

Uptime SLA Downtimeper Year Downtimeper Month 99.9% 8.76 hours 43.8 minutes

99.99% 52.6 minutes 4.38 minutes

99.999% 5.26 minutes 0.438 minutes

What is causing downtime?

Downtime

Unplanned

down

Hardware

Failure

Data

Corruption

Planned

down

Data / Appl

Changes

System

Upgrades

(5)

Copyright © KBCE b.v., 2010 – All Intellectual Rights Reserved 9

Unplanned: Hardware Failure

Storage subsystem

• Disk or Controller…

• Firmware or driver problem

Network

• Often causing a partial down (difficult to measure)

• Often rely on third party (SLA!)

Server

Cluster Support:

Oracle RAC / Microsoft MSCS / IBM Sysplex and PureScale • Where is the backup / cluster server?

• Can virtual server be a solution?

Power Outage

• Environment change, too many requests (unstable grid) etc.

• Third party, difficult to SLA

Unplanned down Hardware Failure Data Corruption

Unplanned: Site failure

Site Failure

• Complete Server room down (e.g. fire)

• Can always happen because you depend on external party (e.g. power)

More than just a database problem

• All data and hardware is involved

Can you handle?

• Network changes

• Workload (different config)

• Fail-back situation

Isn't there a hidden Single Point of Failure

• E.g. glassfiber back-up in the same bundle

Unplanned down Hardware Failure Data Corruption

(6)

Copyright © KBCE b.v., 2010 – All Intellectual Rights Reserved 11

Unplanned: Data Corruption

Human error

Biggest problem difficult because many

scenarios which need different solutions

Logical Corruption

• Difficult to detect (sometimes after years strange data emerges)

• Can you detect the cause (e.g. program) and how much data is affected?

– Special techniques to go back in time and select data again – Can you re-process the data / transactions?

Unplanned down Hardware Failure Data Corruption

Planned: Data / Application changes

Application upgrades

• Schema changes still difficult

– Still a market for vendor tools

– DB2 versioning has performance impact

• Running systems need to shutdown in order to reload

– E.g. middleware transaction refresh

• How are we testing / accepting this

– If test fails, how do we undo the change?

Data maintenance

• Offline REORG (sometimes needed)

• Roll-in / Roll-out data

Planned down Data/Appl Changes System Upgrades

(7)

Copyright © KBCE b.v., 2010 – All Intellectual Rights Reserved 13

Planned: System Upgrades

Hardware upgrades

• Growth

• End of life cycle / vendor support

• Redundant is not always hot-swap

Software upgrades

• Operating system, middleware, DBMS etc.

– Wise to combine upgrades?

– What if the new combination is not stable

How to respond to vendor patches

• Needed? What if we don't

• Policy? One size fits all?

Planned down Data/Appl Changes System Upgrades

Down vs Solution

(8)

Copyright © KBCE b.v., 2010 – All Intellectual Rights Reserved 15

Partial down

Most common down

• System does work but certain function are unavailable

– Examples:

 Transactions with certain input abort  Certain location cannot connect  Single database corrupt

End-user to database goes through many layers

• How to report?

Layered approach can buy to downtime

• Let user work with front-end as if it is real-time

• Bring down backend for maintenance and queue requests

– E.g. online banking without full history or real-time transactions

How long does it take to fix it…

0 2 4 6 8 10 Time N ee d ed

Recovery Time

Breakdown of recovery

 Many problems the analysis takes a (very) long time

• Human errors

• Corruptions

 Many companies suffer a knowledge problem

 How to fix it

• Creating the scenario

• Testing the scenario?

 Speed of recovery itself

• Best parameters

(9)

Techniques in use

to minimize downtime

Database Backups

Traditional backup types

• Full backup – Online / Offline (z/OS Sharelevel)

• Incremental & Differential backup

• Include log backup

• Any other than Full backup is substitute for log!

Disk is better than tape

• First backup to disk (separate physical disk volume)

• Detect exceptions encountered during backup

• Verify backup files

• Copy backup files to tape, remote disk or storage manager (TSM)

Data retention policy for backup files

• What are you going to do with these backups?

(10)

Copyright © KBCE b.v., 2010 – All Intellectual Rights Reserved 19

Backup Retention Policy

Location of backup files

Duration of retention

Protection of sensitive data!

• Sarbanes/Oxley (SOX)

• HIPAA

• Internal policies for data management and protection

Access to backups from offsite data storage

• Often the weakest link in security scenario's

Snapshot technology

Two techniques

• Real disk mirroring

• Share disk after snap until an update is done

How useful is the snapped disk

• Was DBMS aware of the snap? Was it up?

• Did the DBMS participate in the snap?

Snap can be extremely useful

• As backup

• As a fallback (e.g. after failed upgrade)

Be careful

(11)

Copyright © KBCE b.v., 2010 – All Intellectual Rights Reserved 21

Rebuild database (restore & rollforward)

DB2 LUW more flexible than z/OS

• Include log with backup

• Differential and delta backups

Used for:

• Redirected restore (problem investigation)

• Simple scenario's which allow for down-time

• As a safety net when all else fails

Less safe than you might imagine

• Backup just reads files (does not analyze them)

Replication

Create offline solution

• Based on master / slave

• Can be a "High Availability" solution

All database support this (sometimes very advanced)

• Often horizontal / vertical segmented

Mostly row based but sometimes optimized

• E.g. MySQL DRBD is like RAID1 over network

Replication does not take care of:

• IP takeover

• Heartbeat & automate takeover

• Slave becoming master

• Fail back and resync

(12)

Copyright © KBCE b.v., 2010 – All Intellectual Rights Reserved 23

Replication using Log Shipping

DB2 for LUW HADR

• Many limitation compared to DataGuard

Xkoto Gridscale for DB2

• More options than HADR

Data Guard

• Suports both log shipping and SQL shipping

• Very mature / flexible product

DB2 for z/OS Trackersite

• Very basic (really in use?)

Microsoft SQL Server mirror

• Witness server automates the takeover

• Xkoto has also gridscale for SQL Server

Clustering solution

Withdraw a "node" from the solution

• Network problem (IP needs re-routed)

• Failover might not be active

• Can it be automated?

– Heartbeat or Timeout – Client re-route – Split Brain problem

What happens to running Units Of Work?

• Locked data or other node backout?

• Take-over of transaction / restart of transaction / fail transaction

Controlled "failure" for planned down (e.g. upgrade)

Fail back / Insert node into the cluster

(13)

Copyright © KBCE b.v., 2010 – All Intellectual Rights Reserved 25

Shared nothing vs shared everything

Shared Everything

• DB2 Data Sharing & PureScale (HA & Perf) • Oracle RAC (HA & Perf)

– Does not share memory, only disk

• Microsoft SQL Server Cluster (HA)

– Based on hardware / operating system solution

• Sybase IQ ??

Shared Nothing

• Microsoft SQL Server mirror (HA)

• Oracle Data Guard (HA)

• DB2 HADR (HA)

• MySQL Cluster (HA) MySQL Replication (Perf)

– Both have many limitations, still large system use them

• Perf: Terradata, Postgres (Greenplum), Netezza , DB2 LUW DPF

High Availabilty Replication v.s. Cluster

Both can have Single Point of Failure (SPF)

Wrong configuration can destroy data

SAN/NAS I/O overhead when shared storage

• With RAID SAN is no longer SPF

• Make sure network to the SAN is not SPF

Replication is easy to break

• Inconsistent data (e.g. middle transaction)

(14)

Copyright © KBCE b.v., 2010 – All Intellectual Rights Reserved 27

Split Brain condition

Due to communications failures nodes are separated

• Missing heartbeat is really down or communication failure

If multiple nodes control of the cluster, then it's called a

split-brain condition

• If this happens, then bad things will happen

Special software solutions are needed to 100% secure a

down of the other node(s)

• This software can become Single Point of Failure

(15)

Copyright © KBCE b.v., 2010 – All Intellectual Rights Reserved 29 Client PCs Node A Node B Disk cabinet A Disk cabinet B Heartbeat SQL Passive Node Failure Occurs! SCSI Reserve Broken SQL fails over and is available to clients SQL New Reservation Established

SQL Server cluster: Failover clustering

Mirror Principal Witness Log Application SQL Server SQL Server 2 2 4 5 1

Data Log Data

3

>2 >3

Commit

(16)

Copyright © KBCE b.v., 2010 – All Intellectual Rights Reserved 31

SQL Server Comparison

Database Mirroring

 Scope: user Database

 Standard hardware

 Very fast failover (seconds)

 OS flexible (e.g. 32/64)

 Independent storage

 Reporting on mirror (Read-Only)

 Geographic separation OK

Failover Clustering

 Scope: Full instance

 Certified hardware

 Automatic failover (minutes)

 Enterprise OS

 Shared storage

 Standby not available

 Servers co-located (site failure!)

Compare DB2 to Oracle

DB2 Data Sharing / PureScale

 Single system image

 Dynamic workload management

 Software / Hardware Solution

 No Single Point of Failure

ORACLE RAC

 High Speed inter-system links

 Lots of communications

 No Global cache

• Cache Fusion / Interconnect

• Passes data around a lot

 Extra communication overhead

Data Log Log Log DB2 DB2 DB2

(17)

Copyright © KBCE b.v., 2010 – All Intellectual Rights Reserved 33

Components in a RAC Cluster

Global Cache Services (GCS)

• Manage Data Page Synchronization

• Sends DATA to other nodes

Global Enqueue Service (GES)

• Manages Global Locks for non-data pages

Oracle RAC in action – cache fusion

Example how Oracle RAC moves data around

Instance A Instance B Instance C

Instance D

Master Node

8741 1

2 3 Read

into buffer cache 8741

Want to read

data block 8741

Do READ

(18)

Copyright © KBCE b.v., 2010 – All Intellectual Rights Reserved 35

Oracle RAC in action – cache fusion

Example how Oracle RAC moves data around

Instance A Instance B Instance C

Instance D Master Node 8741 8741 Want to read data block 8741 1 2 8741 3 Please send 8741 to node C Send 8741

4 data block 8741Received from Node B

Data no longer comes from disk, owners forward them. Multiple copies around

Oracle RAC in action – cache fusion

Example how Oracle RAC moves data around

Instance A Instance C Instance D Master Node 8741 8741 1 3 8741 6 Flush PI data block 8741

4 data block 8741Write 5 8741 Flush PI data block 8741 6 2 Forward 8741 to node B PI = Previous Image

Owners have to write, after write caches are flushed New read requests moves the data again around

(19)

Copyright © KBCE b.v., 2010 – All Intellectual Rights Reserved 37

What happens if RAC node fails

Crash recovery by other node

• Freeze GCS, not allowing updates to database anymore

• Data block remaster and recover the pages using redo/undo

• Invalidate blocks recovered

False node failure detection

• Can have split brain problem

• Oracle Custer software

– Heartbeat based, Single Point of Failure (Yes, says IBM)

(20)

Copyright © KBCE b.v., 2010 – All Intellectual Rights Reserved 39

RAC vs Purescale Scalability

IBM PureScale benchmark:

95% scalability  32 members81% scalability  112 members

Oracle RAC??

No figures…

License forbids publication of measurements

Even Larry admits…

eWEEK (www.eweek.com) 31-Oct-2003:

I make fun of a lot of other databases – all other

databases, in fact, except the mainframe version of

DB2. It's a first-rate piece of technology.

Larry Ellison, Oracle's Founder and CEO

(21)

Copyright © KBCE b.v., 2010 – All Intellectual Rights Reserved 41

Wise words…

Prepare for failure… DISASTER WILL HAPPEN

Ensure that no important data is lost

• Think of the different types of unavailability (there is no golden bullet)

Keep It Simple, Stupid (KISS)

Complexity is the enemy of reliability

• Saving on education is like stopping a watch to save time…

Automate as much as possible

• Careful you still understand it, so document it (incl. what if scenario)

Test it! Frequently!! Use good scenarios!

Audit it

• You need a devile's advocate to find the holes…

New technology

Source: channelinsider.com

If you can see it an touch it then it is: physical

If you cannot see it but you can touch it then it is: transparent If you can see it but not touch it then it is: virtual

If you cannot see it nor touch it then… It's gone!

(22)

Copyright © KBCE b.v., 2010 – All Intellectual Rights Reserved 43

QUESTIONS ?

References

Related documents