
IBM System p5 and eServer p5

Introduction to High Availability Cluster Multi-Processing (HACMP)
and HACMP Extended Distance (HACMP-XD)

Shawn Bodily
ATS HACMP Specialist

© 2006 IBM Corporation



Although hardware is now very reliable, hardware failures account for only a small minority of system outages
- Several studies place the proportion between 20% and 45%
- Human error, software error, and planned maintenance cause the majority of service outages

2 © 2005 IBM Corporation



Downtime and poor performance are expensive, both financially and in terms of customer perception
- "Overall downtime costs average 3.6% of annual revenue." – Infonetics
- Many studies estimate the average cost of downtime at over $5,000/hour
- Popular Web sites estimate the cost of downtime at millions of dollars
  - A 22-hour crash in June 2003 cost eBay an estimated $5M
- Losses go beyond immediate sales revenue
  - To clients, availability equates to reliability and trustworthiness
  - Internal application failures prevent employees from working

HACMP – Proven Technology for Business
- Mature product, now in its 17th major release
- Averaging 40,000 licenses sold worldwide annually
- Built on a decade of IBM cluster leadership
- HACMP allows you to create highly available environments with minimal hardware
- HACMP scales up to 32 nodes, allowing your cluster to adapt to the growing demands of your business
- The optional XD feature allows your clusters to span unlimited geographic distances


HACMP is NOT the right solution if:
- Your environment is not secure
- Network security is not in place
- Change management procedures are not respected
- You do not have trained administrators
- The environment is prone to "user fiddle-faddle"
- The application requires manual intervention

HACMP will never be an out-of-the-box solution to availability; a certain degree of skill will always be required.


Reducing both Planned and Unplanned Downtime

Unplanned outages
- System failure
  - Hardware
  - Operating system crash
  - Power loss
  - User error
- Component failure
  - NIC
  - SCSI/SAN adapter
  - Network hub/switch
  - SAN switch
  - Disk failure (both OS and application data)

Planned outages
- Maintenance
  - System hardware change/upgrade
  - OS and application upgrades and fixes
- Testing
  - Applied fixes
  - Failure scenarios for HA and DR


HACMP™ protects against service outages by detecting problems and quickly "failing over" to backup hardware
- Two nodes (A and B)
- Two networks
  - Private (internal) network
  - Public (shared) company network
- Shared disk
  - All data in shared storage is available to both nodes
- Critical applications
  - Database server
  - Web server (dependent on the database)

(Diagram: pSeries node A runs the database server and node B the Web server; the nodes are linked by a private network and the shared company network, and both attach to the shared disk.)

Example Failure #1: Node failure
- Node A fails completely
- Node B detects the loss of Node A
- Node B starts up its own instance of the database
- The database is temporarily taken over by Node B until Node A is brought back online

(Diagram: Node A marked failed; Node B now runs both the database and the Web server against the shared disk.)


Example Failure #2: Loss of network connection
- Node A loses a NIC
- Because of NIC redundancy, the service IP swaps locally
- Operations continue normally while the problem is resolved
- If total public network connectivity were lost, a fallover could occur

(Diagram: Node A's failed NIC is bypassed by its second interface; both nodes stay connected to the company network and shared disk.)


Failover possibilities
- One to one
- One to any
- Any to one
- Any to any


Custom Resource Groups

Startup preferences
- Online On Home Node Only (cascading) – OHNO
- Online On First Available Node (rotating, or cascading with inactive takeover) – OFAN
- Online On All Available Nodes (concurrent) – OAAN
- Startup distribution

Fallover preferences
- Fallover To Next Priority Node In The List – FOHP
- Fallover Using Dynamic Node Priority – FDNP
- Bring Offline (On Error Node Only) – BOEN

Fallback preferences
- Fallback To Higher Priority Node – FBHP
- Never Fallback – NFB


Common resources to make highly available

Service IP address(es)
- The IP addresses that users and client applications will use for production
- Can be one or multiple addresses
- Not limited to the number of interfaces when using IP aliasing

Application (server)
- The application(s) you want HACMP to control and protect
- In many cases a user-provided start/stop script
- May take advantage of pre-packaged application Smart Assists

Shared storage
- Volume groups
- Logical volumes
- JFS
- NFS
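A user-provided start/stop script of the kind mentioned above can be very simple. The sketch below is illustrative only: the PID file path is a hypothetical choice, and a `sleep` stands in for the real daemon (your database or web server) so the script is self-contained.

```shell
#!/bin/sh
# Sketch of a user-provided HACMP application server start/stop script.
# HACMP runs the start method on the node acquiring the resource group
# and the stop method on the node releasing it. "sleep 300" is a
# stand-in for the real daemon; the PID file path is hypothetical.
PIDFILE=/tmp/myappd.pid

app_start() {
    sleep 300 &                  # stand-in for launching the real daemon
    echo $! > "$PIDFILE"         # record the PID for the stop method
}

app_stop() {
    if [ -f "$PIDFILE" ]; then
        kill "$(cat "$PIDFILE")" 2>/dev/null
        rm -f "$PIDFILE"
    fi
}

# HACMP invokes the script as "<script> start" or "<script> stop";
# default to "start" so the sketch also runs standalone.
case "${1:-start}" in
    start) app_start ;;
    stop)  app_stop ;;
esac
```

The key design point is that both methods must return promptly and be safe to re-run: the start method should not fail if the application is already up, and the stop method should not fail if it is already down.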


Additional granular options

Resource group dependencies
- Parent/child relationships
  - Great for multi-tier environments
- Location dependencies
  - Online On Same Node: all resource groups must be online on the same node
  - Online On Different Nodes: all resource groups must be online on different nodes
  - Online On Same Site: all resource groups must be online on the same site

Resource group priorities (for the Different Nodes dependency)
- Low
- Intermediate
- High


Application Monitoring

HACMP can monitor applications in one of two ways:
- Process monitor – detects the death of a process
- Custom monitor – checks the health of the application using a monitor method you provide

Decisions upon failure
- Restart – you can set a number of local restarts; if the application continues to fail after the restart count is reached, you can escalate to a fallover
- Notify – send an email notification
- Fallover – move the application and its associated resource group to the next candidate node

Application monitoring can be suspended and resumed at any time.
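A custom monitor method like the one described above is just a script whose exit status reports application health. The sketch below checks the PID recorded by a start script; the PID file path is a hypothetical choice, and the demo records the shell's own PID so the sketch can run anywhere.

```shell
#!/bin/sh
# Sketch of a custom application monitor method for HACMP.
# HACMP runs the monitor on a configured interval: exit status 0 means
# the application is healthy; non-zero triggers the configured
# restart/notify/fallover policy. The PID file path is hypothetical.
PIDFILE=/tmp/appmon.pid

monitor() {
    # healthy only if the PID file exists and that process is alive
    [ -f "$1" ] && kill -0 "$(cat "$1")" 2>/dev/null
}

# Demo: record this shell's own PID so the sketch is self-contained;
# a real monitor would check the PID written by the application.
echo $$ > "$PIDFILE"
monitor "$PIDFILE"    # the exit status becomes the health verdict
```

A monitor can be as thorough as you like (for example, running a trivial query against a database) as long as it is fast relative to the monitoring interval and returns a clean zero/non-zero status.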


DLPAR/CUoD configuration
- HACMP on the primary machine detects the failure
- Running in a partition on another server, HACMP grows the backup partition, activates the required inactive processors, and restarts the application

(Diagram: a production database server fails over to a partition on a DLPAR/CUoD server already running Order Entry and Web Server partitions on its active processors; inactive processors are activated for the database, and all nodes share disk.)

Recent HACMP releases greatly improve ease of use

Enhancements include:
- Configuration wizard for a typical two-node cluster
- Automatic detection and configuration of IP networks
- "Online Planning Worksheets" guide you through configuration
- Simplified Web-based interface for management and monitoring

(Screenshot: Online Planning Worksheets for resource groups.)


With HACMP V5.x, you can configure a cluster by answering just five questions:

1. What is the address of the backup node?
2. What is the name of the application?
3. What script should HACMP use to start it?
4. What script should HACMP use to stop it?
5. What is the service IP label that clients will use to access the application?


WebSMIT Overview Demo


HACMP Cluster Test Tool

- The Cluster Test Tool reduces implementation costs by simplifying validation of cluster functionality.
- It reduces support costs by automating testing of an HACMP cluster to ensure correct behavior in the event of a real cluster failure.
- The Cluster Test Tool executes a test plan, which consists of a series of individual tests.
- Tests are carried out in sequence and the results are analyzed by the test tool.
- Administrators may define a custom test plan or use the automated test procedure.
- Test results and other important data are collected in the test tool's log file.
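The test plan mentioned above is a plain-text file of test directives. The fragment below is illustrative only: the event names and the comma-separated syntax are an assumption based on the tool's documented automated tests, so consult the HACMP Administration Guide for the exact format supported by your release.

```
# Hypothetical cluster test plan fragment (syntax is an assumption;
# verify event names and format against the HACMP Administration Guide)
NODE_UP,node1,Start cluster services on node1
NODE_UP,node2,Start cluster services on node2
NODE_DOWN_GRACEFUL,node1,Stop node1 and verify resource group takeover
NETWORK_DOWN_LOCAL,node2,net_ether_01,Fail the public network on node2
```

Each line names a test event, the node(s) or network it applies to, and a free-form comment; the tool runs the entries in order and records pass/fail results in its log file.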


New features make HACMP V5.x easier to use and more flexible
- Automatic detection and correction of common cluster configuration problems
- Enhanced support for complex multi-tier applications, relationships, and dependencies
- Clusters can be configured with simple ASCII files
- Parallel resource processing recovers applications faster
- Simpler, more flexible configuration and management
- New "Smart Assists" simplify HACMP implementation in DB2®, Oracle, and WebSphere® environments
  - An inexpensive option includes all three Smart Assists


HACMP with Oracle 10g fallover demo

Environment:
- (1) p52A, (1) p505, (1) HMC
- HACMP 5.4 on AIX 5.3 TL5
- Oracle 10g
- DS4300 storage
- LPARMon (http://www.alphaworks.ibm.com/tech/lparmon)
- Swingbench (http://www.dominicgiles.com/swingbench.html)
- Web-based System Manager

The cluster shown was created using the two-node configuration assistant within HACMP.


HACMP Extended Distance
(HACMP-XD)



Do you really need HA or DR?

What is the target recovery time? Minutes? Hours? Days?

Costs associated with implementing and maintaining an HA or DR solution:
- Redundant hardware
- Inter-site networking
- Operations staff

HA/DR is a balance of recovery time requirements and cost.



Tiers of Disaster Recovery: Level Setting

Best DR practice is to blend tiers of solutions in order to maximize application coverage at the lowest possible cost. One size, one technology, or one methodology doesn't fit all applications.

- Tier 7 – Highly automated, business-wide, integrated solution (e.g. GDPS***/PPRC/VTS P2P, AIX HACMP/XD, OS/400 HABP); zero or near-zero data recreation; for applications with low tolerance to outage. HACMP/XD fits here.
- Tier 6 – Storage mirroring (e.g. XRC, PPRC, VTS Peer-to-Peer); zero or near-zero data recreation
- Tier 5 – Software two-site, two-phase commit (transaction integrity); minutes to hours of data recreation
- Tier 4 – Batch/online database shadowing and journaling, point-in-time disk copy (FlashCopy), TSM-DRM; up to 24 hours of data recreation; for applications somewhat tolerant to outage
- Tier 3 – Electronic vaulting, TSM**; 24–48 hours of data recreation
- Tier 2 – PTAM*, hot site, TSM**
- Tier 1 – PTAM*; for applications very tolerant to outage

Recovery time ranges from 15 minutes at the highest tiers to days at the lowest.

*PTAM = Pickup Truck Access Method with tape
**TSM = Tivoli Storage Manager
***GDPS = Geographically Dispersed Parallel Sysplex
Tiers based on SHARE definitions

HACMP Extended Distance (XD) is an optional component for cross-site geographic disaster recovery
- Backup systems may be physically separate from primary operations, for protection in the event of power failure, flood, earthquake, etc.
- The XD option provides a basket of disaster recovery capabilities and integration points
- XD provides multiple options:
  - IP-based data mirroring (GLVM, HAGEO)
  - Support for hardware-based data mirroring (Metro Mirror/PPRC)


HACMP XD – Extended Distance for Disaster Recovery
- Data replication between sites ensures a copy of the data is available after a site-wide disaster
- The choice of technology depends on distance and performance requirements

Campus-wide: use LVM split-site mirroring

(Diagram: two campus sites connected over a LAN/MAN, with SAN-attached storage mirrored between them.)
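At the campus tier, split-site LVM mirroring simply keeps a second physical copy of every logical volume on the remote site's SAN-attached disk. A minimal AIX sketch, assuming a volume group `datavg` with its local copy on `hdisk1` and a remote-site disk `hdisk2` (all names hypothetical):

```
# Campus-wide LVM split-site mirroring sketch (AIX only; names are
# hypothetical — hdisk2 is the SAN-attached disk at the remote site)
extendvg datavg hdisk2      # bring the remote disk into the volume group
mirrorvg -S datavg hdisk2   # mirror every LV onto it (-S: sync in background)
syncvg -v datavg            # complete/verify mirror synchronization
```

Because both copies are ordinary LVM mirrors, the surviving site keeps a full, immediately usable copy of the data if the other site is lost.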


HACMP XD – Extended Distance for Disaster Recovery

Metro-wide: use SVC or ESS/PPRC mirroring

(Diagram: production-site servers A and B and recovery-site servers C and D, connected through routers; SVC-to-SVC mirroring replicates from the primary to the secondary ESS/DS storage via PPRC/Metro Mirror or eRCMF.)

HACMP XD – Extended Distance for Disaster Recovery

Unlimited distance: use GLVM mirroring
- A subset of disks is defined as "Remote Physical Volumes" (RPVs)
- The RPV driver replicates data over the WAN
- An LVM mirrored volume group holds copy 1 and copy 2 of each mirror
- Both sites always have a complete copy of all mirrors



The new HACMP "Geographic Logical Volume Manager" (GLVM) is a reliable, easy-to-use data mirror and failover capability
- GLVM provides unlimited-distance IP-based data mirroring
  - Fully integrated with AIX 5L™ logical volume management
- Easier to use than the existing HAGEO solution
  - No need to define and manage separate state maps
  - Long-term replacement for HAGEO
- Automatically reverses the direction of data replication on failover
- Supports all IBM TotalStorage® products certified with base HACMP


HACMP XD – HACMP automates the solution

HACMP integrates support for all the replication options:
- Manages data replication direction, switching, and resync after recovery
- Recovers locally or moves the entire application to the backup site
- A common infrastructure supports all solutions – choose the one that meets your performance and distance requirements


Thank You

Questions?


Backup Slides on Networking


Typical Local HACMP Clustering Configuration

(Diagram: two nodes, each with en0 and en1 interfaces, connected through a pair of switches on the common 10.70.10.x subnet.)

A single-network view on a common subnet. Multiple networks can be used.


HACMP Clustering Across Sites

(Diagram: one site on 10.70.10.x and the other on 10.50.10.x; each node's en0 and en1 connect through local switches to routers linking the two sites.)

Different subnets, with routers connected to allow cross-subnet communication.
