ASE120: Choosing the Right High Availability Solution

Chris N. Brown
Principal Systems Consultant

(2)

Agenda

• Intro: Cutting through all the hype
• Why clustering is not enough
• Ways to achieve HA
  - Physical Copy
  - Logical Copy
• Client-Side High Availability
  - OpenClient 12.x
  - OpenSwitch
  - DNS Update
• DBA administration in a 24x7 environment
• Summary

(3)

HA – Do you need it?

• HA (High Availability) has evolved almost into an industry buzzword
  - Everyone talks about it
  - Everyone says that they can do it
  - Everyone wants to sell you a “solution”
  - ... but in the end, what are they really offering you?
• HA isn't something new; it's been around in one form or another for years
  - Now seen as something critical because of heavy reliance on computing systems for business-critical processes

(4)

And That's Why You're Here!

• We are going to answer that question in the next 90 minutes
• Examine what's out there
• Analyze what is being offered (both Sybase and non-Sybase)
• Discuss how they work
  - ... when they are appropriate
  - ... when they aren't appropriate
• Talk about how to make them all work together
• And address the <gasp!> users out there as well
• Let's cut through the Bull and make it simple
• So stand by for some business-level discussion... but it sets the

(5)

So do you need it?

• Before you embark down the HA path...
  - There are some questions that you should ask...
  - And know the answers to!
• Many times, people THINK they need HA when they really don't.
  - As a result, thousands (and even millions) are spent that don't need to be
  - What about the man-hour cost?
  - What about the increased administration?

(6)

Think about it....

• What you are buying is an “insurance policy”.
  - A protection from incurring a loss.
• This is very similar to an auto insurance policy.
  - You can have high deductibles or low ones
  - You can have liability only or full coverage
  - You can choose the level of protection
  - You might be covered if you hit a Yugo, but will you be covered if you hit a Mercedes or Jaguar or Bentley?
  - How much of an out-of-pocket loss are you willing to take?
• The same principle is true with an HA architecture.
  - Decide how much loss (downtime and cost) is acceptable.
  - Architect around THAT

(7)

How critical is the system in question?

• The first thing to ask is... how critical is the system?
  - How much does it cost if it goes down?
  - How long could your company operate without it?
  - How much would your company lose if it went down?
  - ... and for how long?
  - Sometimes the costs are intangible
• SLAs (Service Level Agreements) are usually put into place and should take this question into account.
  - Blanket SLAs are sometimes used, but they are not always prudent
  - Some systems (billing, customer service) are more important than others (email, instant messaging, LAN-based fileserver).

(8)

How much can you spend?

• Highly available systems can cost from $ to $$$$$$$
  - That incremental .09% or .009% can hike the cost exponentially.
  - Big derailer of HA implementations.
• It's extremely important to understand the business requirements as discussed earlier.
  - I.e., does the system REALLY have to be 24x7 or can it be 24x5 or 9x5?
  - This will dramatically change the architecture chosen and, of course, the cost of implementation.
• Raw system cost should not (completely) drive the architecture
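
To put numbers on that incremental .09% or .009%, here is a minimal sketch that converts an availability percentage into the downtime it allows per year. The function name, the 365-day year, and the sample tiers are illustrative assumptions, not figures from this talk.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year, system required 24x7


def downtime_hours_per_year(availability_pct: float) -> float:
    """Return the hours of downtime per year allowed at a given availability."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    return HOURS_PER_YEAR * (1.0 - availability_pct / 100.0)


if __name__ == "__main__":
    # Illustrative tiers only: each step adds 0.9, 0.09, then 0.009 points.
    for pct in (99.0, 99.9, 99.99, 99.999):
        hours = downtime_hours_per_year(pct)
        print(f"{pct:>7}% -> {hours:8.2f} hours/year ({hours * 60:9.1f} minutes)")
```

Under these assumptions, 99.9% still allows roughly 8.76 hours of downtime a year, while 99.99% allows under an hour. That gap is what the business requirements have to justify.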

(9)

High Availability Levels

[Figure: levels of availability on a scale from High Availability to Continuous Availability. Labels, in order: Hardware Redundancy; RAID/Mirroring/HW Cluster; Cold Standby; Backup/Restore; Warm Standby; Database Replication; Automatic Failover; DBMS HA; Continuous Operations; Online Maintenance]

(11)

Clustering: The traditional solution

• For many years, the traditional HA solution was hardware-level clustering.
• Generally what most IT professionals think of when you say “HA”.
• Mainly addresses hardware failures.
• Has now evolved to watch for process failures and will restart them.
• When a failure is detected, tries to restart services on a redundant host “as fast as possible”

(12)

Why isn't clustering enough?

• In today's computing environments, hardware-level redundancy isn't always enough.
  - It only provides the foundation.
• Sometimes the amount of time required to restart services is unacceptable
  - Can take minutes when seconds are required ($$$$$).
• What if the problem is with a shared resource?

(14)

How can I achieve HA (50,000 ft view)?

• There are 2 main ways to achieve HA at the database level:
  - Physical database re-creation
  - Logical database re-creation
• They range from the simple to the down and dirty to the complex.
• Which one you use depends on your requirements... and your budget... and your level of risk
• In the 'ideal world', a combination of these strategies provides the

(15)

Physical Database Recreation: Dump and Load

• This is the easiest way to get a backup copy and VERY basic HA
• Dump the primary DB, and then load it to an ASE running on another host.
• Keep in sync via incremental transaction log dumps
• Inexpensive to implement
• Issues:
  - Size of dump and getting it to backup site.
  - What if a dump is corrupt?
  - What if tranlog loads get out of sync?
  - Usually manual; can be automated but requires some 'babysitting' and has quite a few moving parts.
• Good for non-critical systems or those whose data does not change much (think Log Size).

(16)

Physical Database Recreation: Quiesce Database

• If you have a SAN, another easy way to achieve basic HA is via the 'quiesce database' functionality.
• New feature of ASE 12.x
  - Works best with ASE 12.5 and higher
• Similar in principle to dump and load, but much faster.
• Quiesce DB suspends writes to databases, so that underlying devices can quickly be copied
  - Reads still allowed
• Initially targeted for:
  - Quick refreshes of production to development
  - Quick creation of DSS environment
  - Quick troubleshooting of production with a 'snapshot'.
• However, customers wanted to use it for more of an HA solution

(17)

Primary
  2:00 AM   quiesce database hold; <copy database using external command>; quiesce database release
  7:00 AM   dump tran with standby_access
  9:00 AM   dump tran with standby_access
  10:00 AM  dump tran with standby_access
  Repeat each hour until activity tapers off; then lengthen intervals accordingly

Secondary
  2:10 AM   dataserver -q ..
  7:05 AM   load tran; online database for standby_access
  9:07 AM   load tran; online database for standby_access
  10:10 AM  load tran; online database for standby_access

(18)

Things To Think About With Quiesce Database

• Quiesce Database is a solution to a specific problem.
  - Can be very, very fast
  - True physical copy, so WYSIWYG. But that may not be what you want
  - HA-ish solution with tranlog loads works best with ASE 12.5.x
  - You can do maintenance on the replicate copy (dbcc, etc.)
• However...
  - It's a physical copy
  - Dependent on tranlog loads
  - Can't really use the replicate since users must be kicked off for a tranlog load to occur

(19)

Physical Database Recreation: Block Replication

• This is something that is offered by SAN vendors.
• Very attractive:
  - Copies data from one area in a SAN to another
  - Copies data from one SAN to another.
• Oftentimes pitched as an HA or DR solution, WHICH IT CAN BE.
• Operates in 2 modes: synchronous and asynchronous
• Because it is a block-level copy, what exists on the primary will exist on the replicate.

(20)

Block Copy – How It Works (sync)

• Methodology:
  - The host OS writes its I/O to the primary SAN cache
  - The primary cache copies its I/O to the secondary SAN cache
  - The secondary SAN sends an ACK to the primary SAN that it received the I/O
  - Both the primary and the secondary then write their I/O to disk
• In this case, every disk I/O is copied.
• This is similar to RAID-1 or a variation of a 2-phase commit.
• The standby server (ASE in this case) can restart at the same spot that the primary ended.
• Used over shorter distances.

(21)

Block Copy – How It Works (async)

• Methodology:
  - The primary OS writes its I/O to the primary SAN cache
  - The primary SAN sends an ACK to the primary OS that it received the I/O.
  - The primary SAN copies that I/O to the secondary SAN cache
• Here is where it gets tricky
  - Not every I/O is copied
    » The block could have changed many times
    » Changed blocks are 'scored' and the latest change is what is sent over.
  - Both SANs write to disk
• Think of the replicate as a point-in-time
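
The asynchronous behaviour described here, where a block may change many times but only its latest contents are shipped, can be sketched as a toy model. The class and method names are hypothetical and do not correspond to any SAN vendor's API; real arrays track and score changed blocks in their own ways.

```python
class AsyncBlockReplicator:
    """Toy model of asynchronous block replication (hypothetical, not a vendor API)."""

    def __init__(self):
        self.primary = {}      # block number -> contents on the primary
        self.secondary = {}    # block number -> contents on the replicate
        self.dirty = set()     # blocks changed since the last transfer
        self.writes_acked = 0  # host writes acknowledged by the primary
        self.blocks_sent = 0   # blocks actually shipped to the replicate

    def write(self, block_no, data):
        """Host write: acknowledged as soon as the primary has it."""
        self.primary[block_no] = data
        self.dirty.add(block_no)
        self.writes_acked += 1

    def transfer(self):
        """Ship only the latest contents of each changed block."""
        for block_no in sorted(self.dirty):
            self.secondary[block_no] = self.primary[block_no]
            self.blocks_sent += 1
        self.dirty.clear()


if __name__ == "__main__":
    rep = AsyncBlockReplicator()
    for version in (b"v1", b"v2", b"v3"):
        rep.write(7, version)  # the same block changes three times
    rep.write(8, b"log")
    rep.transfer()
    print(rep.writes_acked, rep.blocks_sent, rep.secondary[7])
```

In this model the host sees four acknowledged writes, but only two blocks cross the link. Until `transfer()` runs, the replicate lags the primary.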

(22)

This sounds like A Great Thing! (TM)

• Since the SAN is copying data at the bit level, it makes sense as a DR / HA mechanism
  - No data loss
  - Server can be restarted where the other one crashed
  - Copy from primary to secondary is usually very fast
• There are some issues to be aware of though.
  - Sometimes, what you see on disk isn't what you want at the replicate (corruption)
  - Be aware of how ASE writes data to disk, and how the OS writes data
    » We write 2k (4k, 8k, 16k) pages, they write 512k Blocks
    » We log first then write the data for consistency, so what happens if data pages are written in the SAN before the log pages are, and you go down? (eeeeek)
• Overall, this is a good strategy that many people use, but it cannot

(23)

ASE HA Option: Riding The Clustering Wave

• ASE 12.0 introduced a new feature we call the HA option.
  - It brings clustering technology to the database server.
  - No logical IP needed for the ASE to 'listen on'.
  - Failover designed to be very, very fast.
• You can utilize both nodes in a 2-way cluster (previously one was usually idle)
  - This results in better leverage of your hardware investment and can make multiple systems highly available with less cost.

(24)

[Diagram: HA System. Labels: Node 1 (S1), Node 2 (S2), Shared Disk Storage (Disk, Disk), Companion, Establish, Replicate Users/Logins]

(25)

[Diagram: HA System, Fail Over. Labels: Node 2 (S2), Shared Disk Storage (Disk, Disk)]

(26)

[Diagram: HA System, Fail Back (Prepare). Labels: Node 2 (S2), Shared Disk Storage (Disk, Disk)]

(27)

[Diagram: HA System, Fail Back. Labels: Node 1 (S1), Node 2 (S2), Shared Disk Storage (Disk, Disk), Companion, Establish, Replicate Users/Logins]

(28)

Some notes on the HA option

• We rely on the HA “Heartbeat” to notify us when one ASE fails.
• Brings up several administration aspects
  - Both ASEs must be at the same version
  - Currently we only support 2-node failover
  - One of the 2 ASEs must be a fresh install
• It's possible to access data from one server on another
  - Via proxy tables; this is done via CIS
  - Performance issues to consider
  - Might be a feasible load-balancing option
• We fail over fast (since that's unplanned), but failing back is planned and manual (and slower).
  - Significant improvements in this area since the 12.0 release.

(29)

Logical Database Recreation

• So far, we have only discussed ways of re-creating the database server “physically”
  - Meaning, copying the data (disk, devices, dumps) from point A to point B
• All of these work well and in some cases work very fast
  - They provide near zero or zero data loss
• However, they all suffer from the same common drawbacks
  - What you see is what you get (WYSIWYG)
  - Corruption is almost always copied over, making the backup copy useless.
  - You cannot change the data as it is being moved over
  - In most cases, the replicate is down or not usable.
• The only way to get around these problems today is to use a logical database recreation scheme.

(30)

Quickies on Queues and 3rd parties

• We will quickly discuss message queueing and 3rd parties.
• Message queueing takes “events” and publishes them out on a bus
  - The event could be a data event or an application-level event
  - A listener subscribes to certain events
  - Data can be manipulated based on rules.
• There are 3rd party products out there that can also do this
  - DataMirror
  - UPSuite
• They may not use log-based replication though ... some use

(31)

Replication Server Architecture

[Diagram: Client Applications, Primary Data Server, Replication Agent, Replication Server, and Replicate Data Server, with numbered steps 1 to 4]

1) The client application updates data on the primary.
2) The primary data server manages its local data.
3) Replication Agent notifies Replication Server of primary server data updates.
4) Replication Server coordinates data replication of those updates with other Replication Servers.

(32)

How Replication Server Works

[Diagram labels]
• Primary Data Server: Primary db
• Replication Agent: Monitors Transaction Log; Truncation Point; Marked Tables; Creates LTL
• LTL
• RSSD: Rep-Defs; Publications; Subscriptions; Routes
• Stable Device / Stable Queues: Inbound Queue; Outbound Queue; Materialization Queue
• LAN/WAN – DSI (Data Srv Int)
• Replicate Data Server: Replicate db

(33)

[Diagram labels: Physical ASE “A” (IP Address 192.233.56.20); Logical ASE “XYZ”; Physical ASE “B” (IP Address 192.233.56.21)]

(34)

Some notes on Replication Server

• Warm Standby is a variant of “traditional replication”
  - You can replicate DDL changes if you replicate at a database level
• It can be tuned to near zero latency
• Better to have the RepServer on its own host or on the replicate host.
  - Beware of failure points and how they might affect your application.
• The primary and the secondary must be controlled by the same RepServer
• Currently limited to 1 primary, 1 warm standby (will change in

(36)

What about the client?

• Oftentimes, HA solutions only include the back-end.
• Architectures consider only how quickly we can recover the downed system, but what about the end user?
• Some questions to ponder:
  - How is uptime and availability measured?
  - If the system was down for 5 minutes but the user couldn't connect for 30, how long was the outage?
• What if the system were down, but the user didn't really know or notice?
  - It's possible today!

(37)

Method #1: OpenClient 12.x

• OpenClient 12.x integrates with the HA option of ASE.
• Provides client-side failover from the failed ASE server to the surviving ASE server.
• ONLY useful if you are using the HA option.
• Can ONLY be used if you can recompile your applications against OpenClient 12.x

[Diagram labels: Primary Server; Companion Server]

(38)

OpenClient 12.x and HA: How Does It Work?

• To support this feature, 2 things need to be done
• The first thing is to change the interfaces file.
  - A typical entry would contain master/query syntax and connectivity info
  - A new entry is added in the interfaces file at the end
  - It indicates what server is the failover (companion) server for a primary node
• For example:

ASTRO
    master tcp ether stewie 5000
    query tcp ether stewie 5000
    hafailover ELROY

ELROY
    master tcp ether felix 5000
    query tcp ether felix 5000

• If using LDAP, you would add an entry to the LDAP server containing the same information
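
To make the interfaces-file change concrete, here is a small sketch that reads entries written one attribute per indented line (server name in column 0, then `master`, `query`, and optional `hafailover` lines) and reports which server is named as the failover companion. The parser and its function name are hypothetical and handle only that simple layout; they are not part of OpenClient.

```python
def parse_hafailover(text: str) -> dict:
    """Map each server name to its hafailover companion (None if it has none).

    Assumes a simple layout: a server name starting in column 0, followed by
    indented 'master', 'query' and optional 'hafailover' lines.
    """
    companions = {}
    current = None
    for raw in text.splitlines():
        if not raw.strip() or raw.lstrip().startswith("#"):
            continue  # skip blank lines and comments
        if not raw[0].isspace():
            current = raw.split()[0]  # start of a new server entry
            companions[current] = None
        elif current is not None:
            fields = raw.split()
            if fields[0] == "hafailover" and len(fields) >= 2:
                companions[current] = fields[1]
    return companions


EXAMPLE = """\
ASTRO
\tmaster tcp ether stewie 5000
\tquery tcp ether stewie 5000
\thafailover ELROY

ELROY
\tmaster tcp ether felix 5000
\tquery tcp ether felix 5000
"""

if __name__ == "__main__":
    print(parse_hafailover(EXAMPLE))
```

A check like this could run before a rollout to confirm that every primary entry names a companion that is itself defined in the file.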

(39)

OpenClient 12.x and HA: How Does It Work (Cont'd)?

• The second thing that needs to be done is to recompile against OpenClient 12.x
• There is a new property that needs to be addressed to utilize the HA functionality
  - CS_HAFAILOVER
  - CS_RET_HAFAILOVER
• These are set using the ct_config and ct_con_props syntax at the connection or context level
  - This is only with ctlib (dblib DOES NOT support this functionality)
• Client will receive an error 1205
  - Client failed over to the server listed as the hafailover server in the interfaces file

(40)

Method #2: Sybase OpenSwitch

• Much more flexible than OpenClient 12.x
  - Does not require recompile of applications
  - Not tied to HA option of ASE
  - Can be used against existing and 3rd party applications
• Allows for increased flexibility and user management.
  - User logs directly into OpenSwitch, not into ASE.
  - OpenSwitch manages the user connections and migrates them when it detects a 'failure'.
  - Integrates with business logic.

(41)

Transparent Connection Management

• For each incoming connection, OpenSwitch decides where it should go and opens up a new connection
• Manual switch capability

[Diagram labels: ISQL; PowerBuilder; EAServer; Any Open Client Application and Platform; OpenSwitch; Administrator (ISQL); RPC Switch Request]

(42)

HA Coordination

• Coordination Module provides an API to coordinate with third party HA solutions
• This is the “brains” of OpenSwitch
• OpenSwitch defers switching decision to CM if present

[Diagram labels: Application; OpenSwitch; CM; ASE Server A; “What do I do?”; Response; Action]

(43)

Typical OpenSwitch Usage Scenario

[Diagram labels: Application; OpenSwitch; CM; New York; Denver]

• OpenSwitch: Connection Lost! What do I do?
• The CM checks:
  - Check if it is a real failure or a network hiccup
  - Check if warm-standby is really up and functional
  - Check if transactions are pending in Rep Server queue
• CM: OK, Failover

CM = Coordination Module
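
The checks in this scenario can be sketched as a small decision function. This is not the OpenSwitch Coordination Module API. The function, its parameters, and the returned action names are hypothetical stand-ins for whatever probes a site would actually wire in.

```python
from typing import Callable


def decide_action(
    primary_reachable: Callable[[], bool],
    standby_healthy: Callable[[], bool],
    pending_in_rep_queue: Callable[[], int],
    retries: int = 3,
) -> str:
    """Return 'RETRY_PRIMARY', 'WAIT' or 'FAILOVER' (hypothetical action names)."""
    # 1) Real failure or a network hiccup? Probe the primary a few times.
    for _ in range(retries):
        if primary_reachable():
            return "RETRY_PRIMARY"
    # 2) Is the warm standby really up and functional?
    if not standby_healthy():
        return "WAIT"
    # 3) Are transactions still pending in the Rep Server queue?
    if pending_in_rep_queue() > 0:
        return "WAIT"
    # All checks passed: OK, fail over.
    return "FAILOVER"


if __name__ == "__main__":
    print(decide_action(lambda: False, lambda: True, lambda: 0))
```

Passing the probes in as callables keeps the decision logic testable without a live primary, standby, or Replication Server.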

(44)

The “Pie In The Sky”

[Diagram label: Via CM]

• This covers all possible areas: physical, logical, and users (and

(46)

24x7 DBA Administration

• Any HA scenario MUST allow for DBA maintenance activities
  - DBCC
  - Dump and Load
  - Update Statistics
  - Reorg Rebuilds
  - Upgrades
• If it doesn't, then by definition it's not highly available
  - Simply because to do any of the above actions, you have to take the server down
  - ... or you might impact performance so much that the system becomes 'pseudo-down'.

(47)

Well....

• Update stats: always the “Achilles heel”
  - Attend Eric Miner's ASE 126 class on speeding up Update Stats
  - Thursday 3:30pm, 90 minutes, Sun “A”
• DBCC
  - Use a physical database recreation scheme (bit-level rep, quiesce, etc.)
  - Then run DBCC on the copy
  - Since it's a physical recreation, errors in the copy will also be in the primary, so you can take action accordingly to fix them.
• Reorg
  - Starting with ASE 12, you can specify parameters around its use
  - Done at an extent level; doesn't lock the entire table down.
• Dump and Load

(48)

Well... (cont'd)

• Upgrades
  - Best done with a logical database recovery scheme (like RepServer, etc.)
  - This will allow you to keep both the old and new version in sync and you can

(50)

HA Questions To Ask....

• How does the HA solution cover …
  - … host machine failures
  - … operating system faults
  - … database failures/corruptions
  - … datacenter loss
  - … online maintenance
    » ... DBMS, OS, Database Schema, User Admin
• How well does it handle …
  - … latency between synchronization
  - … outage during failover
  - … client connections

(51)

SDN Presents CodeXchange

• New SDN feature enables community collaboration
  - Download tools created by Sybase
  - Leverage contributions of others to help administer and monitor your servers
  - Contribute your own code or start your own collaborative project with input from other ASE experts
• Any SDN member can participate
  - Log in using your MySybase account via SDN
• Join the collaboration already underway
  - http://ase.codexchange.sybase.com or via SDN at www.sybase.com/developer
