Anda di halaman 1dari 49

MICROSOFT SQL SERVER

HIGH AVAILABILITY
AND DISASTER RECOVERY
Michael Poremba // October 2008

Database HA & DR
Experience

Work with business to determine HA or DR


requirements for applications and data?

Design HA or DR solutions?

Administer HA or DR process?

Still learning MS SQL Server HA & DR


capabilities?

Scope of this Presentation


3

Presentation Focus

Data Availability

Data recovery
High availability
Disaster recovery

Technology Focus

MS SQL Server
Physical servers
SANs

Beyond Scope of
Presentation

In-depth how-to

(available elsewhere)

Partitioned views (federated)


Advanced DBA techniques
Custom application logic
3rd-party software solutions
Alternate DBMS engines
(e.g. Oracle; DB2)

HA on virtual machines
Complex scenarios &
solutions
Load balancing

Introduction to Data
Availability
So, you need to make your
production database bulletproof

Data Availability Continuum


5

Degrees of protection for information


systems:
Business Risk
Solution
Data
Recovery

Data loss

Redundant data

High
Availability

Downtime of
Redundant
database service system
components

Disaster
Recovery

Downtime of
business
operations

Redundant
systems
and facilities

Business Case for


Availability
High Availability

Keep businesscritical
applications
available
Secondary:

Server
maintenance

Disaster Recovery

Protect against
loss of data center
Secondary:

Application
upgrades
Infrastructure
upgrades

Service Level Agreement


(SLA)

Permitted downtime (planned vs. unplanned?)


Uptime
SLA

Downtime
per Year

Downtime
per Month

99.9%

8.76 hours

43.8 minutes

99.99%

52.6 minutes

4.38 minutes

99.999%

5.26 minutes

0.438
minutes

Acceptable data/transaction loss


Application response times
Mean time to recovery

Note: Database uptime is not equivalent to application


availability

Failures of other application services


Network outages

Protect What?
8

Application data stores

Database services

Databases
Files
Other data repositories
DBMS availability for applications

Application services

Application availability for users and external


systems

Databases are the heart of most information


systems;
they deserve the highest affordable protection.

Database Failure Scenarios


9

Physical Infrastructure
Failures

Storage
subsystem

Disk
Controller

Network
Server
Power

Logical Data Failures

Operator errors

DBMS interruption
Drops / deletes

Application
defects
DBMS defects
Data corruption

Service Recovery Strategies


10

Stand Failover Behavior


by
Mode

SQL Server
Feature

Cold
stand
by

Backup and
restore

Manual intervention
required to restore
offline data copy

Warm Data copy online and


stand
ready
by
Manual failover
required

Transaction log
shipping
Database
mirroring

Hot
stand
by

Database
mirroring
Failover

Automatic failover

11

Data Recovery
Terminology
Terminology varies for source vs. copy
High Availability
Strategy

Data Source

Data Copy

Backup and
Restore

Database

Backup

Log Shipping

Primary

Secondary
Standby

Database Mirroring Principal

Mirror

Failover Clustering

Secondary
Passive
Standby
Inactive

Primary
Active

12

Data Recovery
[Briefly]

Database Backups
13

Traditional backup types

Disk is better than tape

Full backup
Differential backup
Transaction log backup
First backup to disk (separate physical disk
volume)
Detect exceptions encountered during backup
Verify backup files
Copy backup files to tape or remote disk

Data retention policy for backup files

Database Backup Strategy


14

Backup of user databases not sufficient for


recovery
System database
Master database
MSDB database
Model database
External data stores

15

Synch with External Data


Stores
Synchronize recovered database with
external data stores:
Identity column seeds
Full-text indexes
(SQL Server 2000)

LDAP entries
File system objects
Other databases

Backup Retention Policy


16

Location of backup files


Duration of retention
Protection of sensitive data

Sarbanes/Oxley (SOX)
HIPAA
Internal policies for data management and
protection

Access to backups from offsite data


storage

Data Recovery Process


17

Backup file sets

Full baseline,
differential, and
transaction logs

Retrieving backup files

Offsite storage
Tape
Network copy
Dependency on
multiple people to get
access to backup files

Recovery strategy
depends on failure
scenario

Create comprehensive
failure matrix
Devise recovery strategy
for each scenario
Does worst-case
recovery scenario fit
within SLA parameters?

Recovery time; SLA


Include future data
growth in recovery plan
Fully test recovery
strategiespractice is
essential

18

High Availability

High Availability
19

Minimize or avoid service downtime

When components fail,


service interruption is brief or non-existent

Whether planned or unplanned

Automatic failover

Eliminate single points of failure (as


affordable)

Redundant components
Fault-tolerant servers

Redundant Components
20

Objective: Avoid single points of failure (where affordable)


Approach: Use redundant components for database service
Database server nodes
Server components

DBMS instance
User databases
Storage devices
Storage unit components

MPIO: Interfaces; paths; switches; controllers


RAID: Disks

Networking

ECC RAM; failure-tolerant HW & OS

MPIO: Interfaces; paths; switches

Data copies

E.g. Recovering torn page from mirror in SQL Server 2008

Transaction Log Shipping


21

Warm standby solution


Duplicate user database

Database available for read-only access

Copy transaction logs to standby server &


restore
Users must disconnect for logs to be applied
Two database licenses required if querying
standby

Manual application failover


Supported on standard hardware
Possible data loss (unapplied transactions)

Database Mirroring
22

Redundancy at user database level

Mirrored over private network channel

Requires witness server

Mirror-aware application client connection

High-availability: commit @ log on mirror;


automatic failover
High-protection: commit @ log on mirror; manual
failover
High-performance: commit when logged on
principal

Very fast automatic failoverseconds

Mirror always redoing transactions from principal


Negligible impact on transaction throughput

Multiple mirroring modes:

Duplicate copy of user database


Independent storage devices
Multiple copies of instance databases

Provided by client library


Database connection string must specify both
servers

Mirror may be available for read-only access


(snapshots)
Works with standard hardware

Mirror Witness
23

With mirroring, more than one server is required


to decide on failover
Witness automates failover from primary to mirror

Runs in separate SQL Server instance (Express is


OK)
Prevents split brain scenario
Very low resource consumption

Watches database availability


Reports observations back to principal and mirror

Can be witness for multiple databases

Not a single point of failure

24

SQL Server Failover


Clustering

Two clustered nodes

MS SQL services

Active/Passive config
Running on virtual
server

Shared storage device

User databases
System databases
Quorum drive
Redundant internal
components

25

Active/Passive Failover
Clustering

Redundancy at database instance


level

Single data copy on shared storage


device

SQL Agent; Analysis Services; Full-Text


engine, MS DTC

Automatic failover (up to minutes)


DBMS accessed over virtual IP
Database not available from inactive
node for DB client connections

No I/O overhead reducing throughput


Storage unit is single point of failure
for cluster

All database services are clustered

All databases fail over together


Shared copy of system databases

Storage is controlled by one cluster


node at a time

Requires hardware certified by


Microsoft for Microsoft Cluster
Service

HA Comparison
26

Database Mirroring

Scope: user DB
Standard hardware
One SQL license
(unless querying
snapshots on mirror)
Very fast failover (seconds)
OS flexible (e.g. 32/64)
Independent storage
Independent services
Reporting on mirror
Geographic separation OK

Failover Clustering

Scope: DBMS instance


Certified hardware
One SQL license
(only one node can access
database)
Automatic failover (up to
minutes)
Enterprise OS
Shared storage
Clustered services
Standby not available
Servers are usually co-located

Considerations for HA
27

HA complements backup and recovery strategy

Application service availability is often determined


by a network of interdependent services

Does not replace data recovery plan


Availability can be difficult to define (e.g. partial
failures)
Failure probability difficult to measure or compute

Increased system complexity could lead to lower


service availability!

Operator error a leading cause of availability issues


Increased number/types of system components
More complex to configure and administer

28

Data Recovery
Requirements

29

Disaster Recovery

Disaster Recovery
30

Minimize downtime of business


operations

SQL Server features:

Redundant systems and facilities


Transaction log shipping
Database mirroring
Failover clustering

Other technologies

Storage-based mirroring

Disaster Recovery Planning


31

Data security requirements


Clarify SLA, data loss allowance
Evaluate system cost vs. data protection
Failure analysis
System redundancy
Process validation
Training for personnel

Prevention practices
Executing disaster recovery and business continuity

Practice, practice, practice

Business Continuity Facility


32

System redundancy

Alternate facilities

Systems: Web servers app servers; database, etc.


Data: Databases; data files on OS; security info,
etc.
Networking: Domain, routing, subnet, VIPs, etc.
Network bandwidth
Physical or network access by operations staff

Failover

Often a deliberate decision, using manual failover

Data Redundancy
33

Synchronous redundancy

Asynchronous redundancy

Network bandwidth cost


Network latency and application performance
Network reliability
Risk of data loss
More cost-effective
Resilient to network latency issues

Candidate Technologies

SQL Server database mirroring


Failover clustering with SAN-based mirroring

34

DR Using Database
Mirroring

Two sites: Primary and DR location


Separate failover clusters at each site
SQL Server database mirroring between
sites

35

DR Using SAN-Based
Mirroring

Two sites: Primary and DR location


Four-node failover cluster; one virtual IP
address
SAN-based mirroring between sites
Manual cluster failover

36

Complimentary
Technologies
[Skip if time is running short.]

SAN-Based Data Mirroring


37

Data blocks duplicated at storage level

Copy performed in sequence and coordinated with


database checkpoint

Ensures consistency of mirrored data files

Synchronous or asynchronous mirroring


Co-located or geographically dispersedboth are
OK

Similar to transaction log shipping

SAN link bandwidth must support database I/O rate

May require extra feature support from SAN


vendor
Could rely on Failover Clustering for HA

38

SQL Server Database


Snapshots

Read-only point-in-time database


snapshot
No data is copiedinstantaneous

Snapshots can be maintained indefinitely

Historical snapshot pages tracked


separately from changing pages
Limited only by available storage

Snapshot copy can be used for reporting

Read-only, so no locking issues

SQL Server Replication


39

Transactional replication

Merge replication

High transaction volume


Low data latency required
Mixed technologies:
Integrates with other
DBMS
Bi-directional data
changes
Typically server-to-client

Snapshot replication

Large, infrequent data


changes
Data change latency OK
Best for smaller data sets

Subscriber
databases available
for reporting
Replicate data
subsets
Some data loss is
possible
Periodically validate
replicated data

40

App Development and


Admin

41

Considerations for App


Developers

App services tolerant to database service interruptions


Application transactions must be handled in codedata
consistency
Exception handling for transaction retry, connection
recovery
Requires coding standards, code reviews, and testing
Bulk data operations
Transaction volume impacts rollback time during failover
Batch jobs must be run on alternate nodes
Dont bypass transaction logging
Synchronization with external data sources?
Be aware of database recovery model
Mirroring uses FailoverPartner in connection string
Use TCP/IP as client protocol

Considerations for Admins


42

Use identical server hardware, when possible


Design network redundancies, when feasible

Always manage through virtual cluster, not individual cluster


nodes
Retest failover/failback after HA maintenance
Diagnose after failover

Consider network latency for geographic separation

Repair alternate node


Resynchronize data, as necessary
Be aware of primary/secondary locations
Ensure application services are connected and functioning properly

Keep server node configurations synchronized:

Service pack and patch levels


Duplicate non-redundant resources
Jobs; logins and permissions; OS & sys objects

HA Risks
43

System performance degradation


HA system complexity leads to availability
issues
Some system failures not planned for
Backup and recovery planning incomplete
Administrators not fully trained or informed
User databases not synchronized with other
data sources

Common Admin Use Cases


44

Maintain HA nodes

Resynchronize the redundant copy

Hardware maintenance
Rolling upgrades and software patches
Re-synch mirror
Restart log shipping

Diagnose and repair

Diagnose cause of failover


Repair failed node and restore failover
capabilities
Test failover and failback

Common Admin Actions


45

Train and practice administrators to:


Initiate a database mirror
Manually failover mirror database or
cluster node
Add/remove passive node from mirror or
cluster
Upgrade/patch servers nodes
Restart or redirect application services

46

More Information

ReferencesBooks
47

High Availability

Microsoft SQL Server 2008


High Availability with
Clustering & Database
Mirroring
by Michael Otey, 2009.
Microsoft SQL Server High
Availability
by Paul Bertucci, 2004.
Pro SQL Server 2005 High
Availability
by Allan Hirt, 2007.

Related Topics

Pro SQL Server 2005


Replication
by Sujoy Paul, 2006.

Pro SQL Server 2005 Service


Broker
by Klaus Aschenbrenner,
2007.

The Rational Guide to SQL


Server 2005 Service Broker
by Roger Wolter, 2006.

ReferencesPresentations
48

Microsoft Load Balancing and Clustering


http://ce.sharif.edu/courses/84-85/2/ce317/resources/root/lecture%20slides
/14.%20Microsoft%20Load%20Balancing%20and%20Clustering.ppt
SQL Server 2005 High Availability
http://www.atlantamdf.com/Presentations/AtlantaMDF_111207HA.ppt

High Availability Technologies In SQL Server 2000 And SQL Server 2005
http://202.181.238.2/hk/teched2004/ppt/Day_2_Rm407/DAT431(13301445).ppt

Meeting the Availability Challenge


http://download.microsoft.com/download/E/D/C/EDCF54DB-19CD-4882-9FC44F7D46FCEAA6/HighAvailability.ppt

Disaster Recovery Mistakes


http://www.sqlsig.org/Oct%2011%20DASSUG%20-%20Jason%20Hall%2010-1107%20MM.ppt

SQL Server 2005 High Availability


http://blogs.msdn.com/sql2005event/attachment/564303.ashx

Effective Usage of SQL Server 2005 Database Mirroring


http://www.sqlserver-qa.net/SSQA-Effective%20Usage%20of%20SQL%20Server
%202005%20Database%20Mirroring_show.ppt

ReferencesArticles
49

Achieve High Availability for SQL Server


http://technet.microsoft.com/en-us/magazine/cc162477.aspx

Geographically Dispersed Clusters in Windows


Server 2003
http://www.microsoft.com/windowsserver2003/techinfo/overview/clusterg
eo.mspx

Restoring file and filegroup backups


http://support.microsoft.com/kb/281122/en-us

Restoring specific tables or rows from backups


http://support.microsoft.com/kb/321836/en-us

Maintaining Availability During Upgrades


http://msdn.microsoft.com/en-us/library/ms191449.aspx