
HACMP

HIGH AVAILABILITY CLUSTER


MULTIPROCESSING

BEST PRACTICES

January 2008

• Alex Abderrazag • Version 03.00 •

Table of Contents

I. Overview

II. Designing High Availability

• Risk Analysis

III. Cluster Components

• Nodes

• Networks

• Adapters

• Applications

IV. Testing

V. Maintenance

• Upgrading the Cluster Environment

VI. Monitoring

VII. HACMP in a Virtualized World

• Maintenance of the VIOS partition – Applying Updates

• Workload Partitions (WPAR)

VIII. Summary

IX. Appendix A

• Sample WPAR start, stop and monitor scripts for HACMP

X. References

XI. About the Author

HACMP Best Practices

WHITE PAPER
Overview
The IBM High Availability Cluster Multiprocessing (HACMP™) product was first shipped in 1991 and is now
in its 15th release, with over 60,000 HACMP clusters in production worldwide. It is generally recognized
as a robust, mature high availability product. HACMP supports a wide variety of configurations and
provides the cluster administrator with a great deal of flexibility. With this flexibility comes the
responsibility to make wise choices: there are many cluster configurations that are workable in the sense
that the cluster will pass verification and come on line, but which are not optimal in terms of providing
availability. This document discusses the choices that the cluster designer can make, and suggests the
alternatives that make for the highest level of availability*.

Designing High Availability


“…A fundamental design goal of (successful) cluster design is the elimination of single points of failure (SPOFs)…”

A High Availability Solution helps ensure that the failure of any component of the solution, be it hardware,
software, or system management, does not cause the application and its data to be inaccessible to the user
community. This is achieved through the elimination or masking of both planned and unplanned down-
time. High availability solutions should eliminate single points of failure (SPOF) through appropriate de-
sign, planning, selection of hardware, configuration of software, and carefully controlled change manage-
ment discipline.

While the principle of "no single point of failure" is generally accepted, it is sometimes deliberately or in-
advertently violated. It is inadvertently violated when the cluster designer does not appreciate the conse-
quences of the failure of a specific component. It is deliberately violated when the cluster designer chooses
not to put redundant hardware in the cluster. The most common instance of this is when cluster nodes are
chosen that do not have enough I/O slots to support redundant adapters. This choice is often made to re-
duce the price of a cluster, and is generally a false economy: the resulting cluster is still more expensive
than a single node, but has no better availability.

A cluster should be carefully planned so that every cluster element has a backup (some would say two of
everything!). Best practice is that either the paper or on-line planning worksheets be used to do this plan-
ning, and saved as part of the on-going documentation of the system. Fig 1.0 provides a list of typical
SPOFs within a cluster.

“….cluster design decisions should be based on whether they contribute to availability (that is, eliminate a SPOF) or
detract from availability (gratuitously complex) …”

* This document applies to HACMP running under AIX®, although general best practice concepts are also applicable to HACMP
running under Linux®.

Fig 1.0 Eliminating SPOFs

Risk Analysis

In reality, however, it is sometimes not feasible to eliminate every SPOF within a cluster. Examples may
include the network ¹ and the site ². Risk analysis techniques should be used to determine which SPOFs
simply must be dealt with and which can be tolerated. One should:

• Study the current environment. For example, the server room may be on a properly sized UPS while
there is no disk mirroring today.
• Perform requirements analysis. How much availability is required, and what is the acceptable
likelihood of a long outage?
• Hypothesize all possible vulnerabilities. What could go wrong?
• Identify and quantify risks. Estimate the cost of a failure versus the probability that it occurs.
• Evaluate countermeasures. What does it take to reduce the risk or consequence to an acceptable
level?

Finally, make decisions, create a budget and design the cluster.
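
The "identify and quantify" step can be reduced to simple arithmetic: the expected annual loss from a risk is its estimated frequency multiplied by the cost of one occurrence, and a countermeasure is worth considering when it costs less than the loss it removes. The sketch below is only an illustration of that comparison; the function names and all figures are hypothetical planning inputs, not values from this paper.

```shell
# Expected annual loss for one risk = (failures per year) x (cost per failure).
expected_loss() {
    # $1 = estimated failures per year, $2 = cost of one failure
    awk -v p="$1" -v c="$2" 'BEGIN { printf "%.0f\n", p * c }'
}

# Succeeds (exit 0) when the countermeasure costs less per year
# than the expected loss it is meant to remove.
worth_mitigating() {
    # $1 = expected annual loss, $2 = annual cost of the countermeasure
    awk -v l="$1" -v m="$2" 'BEGIN { exit !(m < l) }'
}
```

For example, a failure expected once every five years (0.2 per year) that costs 50,000 per occurrence gives `expected_loss 0.2 50000`, which can then be compared with the yearly cost of the redundant hardware using `worth_mitigating`.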

¹ If the network as a SPOF must be eliminated, then the cluster requires at least two networks. Unfortunately,
this only eliminates the network directly connected to the cluster as a SPOF. It is not unusual for the users
to be located some number of hops away from the cluster. Each of these hops involves routers, switches and
cabling, each of which typically represents yet another SPOF. Truly eliminating the network as a SPOF can
become a massive undertaking.

² Eliminating the site as a SPOF depends on distance and the corporate disaster recovery strategy. Generally,
this involves using HACMP eXtended Distance (XD, previously known as HAGEO). However, if the sites can be
covered by a common storage area network (say, buildings within a 2 km radius), then the Cross-site LVM
mirroring function described in the HACMP Administration Guide is most appropriate, providing the best
performance at no additional expense. If the sites are within the range of PPRC (roughly 100 km) and
compatible ESS/DS/SVC storage systems are used, then one of the HACMP/XD: PPRC technologies is
appropriate. Otherwise, consider HACMP/XD: GLVM. These topics are beyond the scope of this white paper.
Cluster Components
Here are the recommended practices for important cluster components.

Nodes
HACMP supports clusters of up to 32 nodes, with any combination of active and standby nodes. While it
is possible to have all nodes in the cluster running applications (a configuration referred to as "mutual
takeover"), the most reliable and available clusters have at least one standby node - one node that is nor-
mally not running any applications, but is available to take them over in the event of a failure on an active
node.

Additionally, it is important to pay attention to environmental considerations. Nodes should not have a
common power supply, which may happen if they are placed in a single rack. Similarly, building a cluster
of nodes that are actually logical partitions (LPARs) within a single footprint is useful as a test cluster,
but should not be considered for availability of production applications.

Nodes should be chosen that have sufficient I/O slots to install redundant network and disk adapters.
That is, twice as many slots as would be required for single node operation. This naturally suggests that
processors with small numbers of slots should be avoided. Use of nodes without redundant adapters
should not be considered best practice. Blades are an outstanding example of this. And, just as every clus-
ter resource should have a backup, the root volume group in each node should be mirrored, or be on a
RAID device.

Nodes should also be chosen so that when the production applications are run at peak load, there are still
sufficient CPU cycles and I/O bandwidth to allow HACMP to operate. The production application
should be carefully benchmarked (preferable) or modeled (if benchmarking is not feasible) and nodes cho-
sen so that they will not exceed 85% busy, even under the heaviest expected load.

Note that the takeover node should be sized to accommodate all possible workloads: if there is a single
standby backing up multiple primaries, it must be capable of servicing multiple workloads. On hardware
that supports dynamic LPAR operations, HACMP can be configured to allocate processors and memory to
a takeover node before applications are started. However, these resources must actually be available, or
acquirable through Capacity Upgrade on Demand. The worst case situation – e.g., all the applications on
a single node – must be understood and planned for.
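
The sizing rule can be checked with simple arithmetic. The sketch below assumes that each workload's benchmarked peak CPU demand is expressed as a percentage of the takeover node's capacity, and it applies the 85% ceiling mentioned above to the worst case in which every listed workload lands on that node. The function name and the figures are illustrative only.

```shell
# Sum the peak CPU demand (percent of the takeover node) of every workload
# that could end up on the node, print the total, and succeed only if the
# total stays at or below the 85% ceiling.
fits_on_takeover_node() {
    limit=85
    total=0
    for load in "$@"; do
        total=$((total + load))
    done
    echo "$total"
    [ "$total" -le "$limit" ]
}
```

A standby backing up two primaries that peak at 40% and 30% would be checked with `fits_on_takeover_node 40 30`; adding a third workload is just another argument.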

Networks
HACMP is a network centric application. HACMP networks not only provide client access to the applica-
tions but are used to detect and diagnose node, network and adapter failures. To do this, HACMP uses
RSCT which sends heartbeats (UDP packets) over ALL defined networks. By gathering heartbeat informa-
tion on multiple nodes, HACMP can determine what type of failure has occurred and initiate the appro-
priate recovery action. Being able to distinguish between certain failures, for example the failure of a net-
work and the failure of a node, requires a second network! Although this additional network can be “IP
based” it is possible that the entire IP subsystem could fail within a given node. Therefore, in addition
there should be at least one, ideally two, non-IP networks. Failure to implement a non-IP network can
potentially lead to a partitioned cluster, sometimes referred to as 'Split Brain' syndrome. This situation
can occur if the IP network(s) between nodes become severed or, in some cases, congested. Since each node
is in fact still alive, HACMP on each node would conclude that the other nodes are down and initiate a
takeover. After takeover has occurred, the application(s) could potentially be running simultaneously on
both nodes. If the shared disks are also online to both nodes, the result could be data divergence (massive
data corruption). This is a situation which must be avoided at all costs.

The most convenient way of configuring non-IP networks is to use Disk Heartbeating, as it removes the
distance limitations of RS232 serial networks. Disk heartbeat networks only require a small disk or
LUN. Be careful not to put application data on these disks. Although it is possible to do so, you don't
want any conflict with the disk heartbeat mechanism!

Important network best practices for high availability:

• Failure detection is only possible if at least two physical adapters per node are in the same physical
network/VLAN. Take extreme care when making subsequent changes to the networks with regard to IP
addresses, subnet masks, intelligent switch port settings and VLANs.
• Ensure there is at least one non-IP network configured.
• Where possible, use an Etherchannel configuration in conjunction with HACMP to aid availability. This
can be achieved by ensuring the configuration contains a backup adapter which plugs into an alternate
switch. Note, however, that HACMP sees Etherchannel configurations as single-adapter networks. To
aid problem determination, configure the netmon.cf file to allow ICMP echo requests to be sent to
other interfaces outside of the cluster. See the Administration Guide for further details.
• Each physical adapter in each node needs an IP address in a different subnet using the same subnet
mask, unless Heartbeating over IP Aliasing is used.
• Currently, there is NO support in HACMP for Virtual IP Addressing (VIPA), IPv6 or IEEE 802.3
standard (et) interfaces.
• Ensure you have in place the correct network configuration rules for the cluster with regard to IPAT
via Replacement/Aliasing, Etherchannel, H/W Address Take-over (HWAT), Virtual Adapter support,
service and persistent addressing. For more information, check the HACMP Planning Guide
documentation.
• Name resolution is essential for HACMP. External resolvers are deactivated under certain event
processing conditions. Avoid problems by configuring /etc/netsvc.conf and the NSORDER variable in
/etc/environment to ensure the host command checks the local /etc/hosts file first.
• Read the release notes stored in /usr/es/sbin/cluster/release_notes. Look out for new
and enhanced features, such as collocation rules, persistent addressing and fast failure detection.
• Configure persistent IP labels on each node. These IP addresses are available at AIX® boot time and
HACMP will strive to keep them highly available. They are useful for remote administration,
monitoring and secure node-to-node communications. Consider implementing a host-to-host IPsec
tunnel between the persistent labels on the nodes. This will ensure sensitive data such as passwords
is not sent unencrypted across the network, for example when using the C-SPOC option "Change a
user's password".
• If you have several virtual clusters split across frames, ensure boot subnet addresses are unique per
cluster. This will avoid problems with netmon reporting that the network is up when the physical
network outside the cluster may in fact be down.
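
For the name resolution item, the usual AIX settings are a line such as `hosts = local, bind` in /etc/netsvc.conf and `NSORDER=local,bind` in /etc/environment. The helper below is a minimal sketch that checks a netsvc.conf-style file and reports whether local lookup comes first; the function name is hypothetical and the check deliberately ignores less common variants of the syntax.

```shell
# Succeeds (exit 0) when the given netsvc.conf-style file has a "hosts"
# entry whose first resolver is "local", i.e. /etc/hosts is consulted first.
hosts_local_first() {
    # $1 = path to the file to check (normally /etc/netsvc.conf)
    awk -F= '
        /^[ \t]*#/ { next }                      # ignore comment lines
        $1 ~ /^[ \t]*hosts[ \t]*$/ {
            split($2, order, ",")                # resolver order, left to right
            gsub(/[ \t]/, "", order[1])
            found = (order[1] == "local")
        }
        END { exit !found }
    ' "$1"
}
```

On a cluster node this could be run as `hosts_local_first /etc/netsvc.conf || echo "check name resolution order"` as part of a routine health check.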

Adapters

As stated above, each network defined to HACMP should have at least two adapters per node. While it is
possible to build a cluster with fewer, the reaction to adapter failures is more severe: the resource group
must be moved to another node. AIX provides support for Etherchannel, a facility that can be used to
aggregate adapters (increase bandwidth) and provide network resilience. Etherchannel is particularly useful
for fast responses to adapter / switch failures. This must be set up with some care in an HACMP cluster.
When done properly, this provides the highest level of availability against adapter failure. Refer to the IBM
techdocs website: http://www-03.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/TD101785 for further
details.

Many System p™ servers contain built-in Ethernet adapters. If the nodes are physically close together, it
is possible to use the built-in Ethernet adapters on two nodes and a "cross-over" Ethernet cable (sometimes
referred to as a "data transfer" cable) to build an inexpensive Ethernet network between two nodes for
heartbeating. Note that this is not a substitute for a non-IP network.

Some adapters provide multiple ports. One port on such an adapter should not be used to back up an-
other port on that adapter, since the adapter card itself is a common point of failure. The same thing is true
of the built-in Ethernet adapters in most System p servers and currently available blades: the ports have a
common adapter. When the built-in Ethernet adapter can be used, best practice is to provide an addi-
tional adapter in the node, with the two backing up each other.

Be aware of network detection settings for the cluster and consider tuning these values. In HACMP terms,
these are referred to as NIM values. There are four settings per network type which can be used: slow,
normal, fast and custom. With the default setting of normal for a standard Ethernet network, the network
failure detection time would be approximately 20 seconds. With today's switched network technology this
is a large amount of time. Switching to the fast setting reduces the detection time by 50% (to 10
seconds), which in most cases would be more acceptable. Be careful, however, when using custom settings,
as setting these values too low can cause false takeovers to occur. These settings can be viewed using a
variety of techniques, including the lssrc -ls topsvcs command (from a node which is active), odmget
HACMPnim | grep -p ether, and smitty hacmp.
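
The 20-second and 10-second figures are consistent with the relationship commonly quoted for topology services: detection time = heartbeat interval × failure cycle × 2. The sketch below assumes that formula, and it assumes a 1-second heartbeat with failure cycles of 10 (normal) and 5 (fast) for Ethernet. Treat these inputs as assumptions and confirm the real values with lssrc -ls topsvcs on your own cluster before tuning anything.

```shell
# Approximate failure detection time in seconds, assuming
# detection time = heartbeat interval x failure cycle x 2.
detection_time() {
    # $1 = heartbeat interval in seconds
    # $2 = failure cycle (number of missed heartbeats tolerated)
    echo $(( $1 * $2 * 2 ))
}
```

Under these assumptions `detection_time 1 10` corresponds to the normal setting and `detection_time 1 5` to the fast setting; the same arithmetic shows how quickly a custom setting with very small values approaches the point where a brief network glitch looks like a failure.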

Applications

The most important part of making an application run well in an HACMP cluster is understanding the
application's requirements. This is particularly important when designing the Resource Group policy be-
havior and dependencies. For high availability to be achieved, the application must have the ability to
stop and start cleanly and not explicitly prompt for interactive input. Some applications tend to bond to a
particular OS characteristic such as a uname, serial number or IP address. In most situations, these prob-
lems can be overcome. The vast majority of commercial software products which run under AIX are well
suited to be clustered with HACMP.

Application Data Location
Where should application binaries and configuration data reside? There are many sides to this
discussion. Generally, keep all the application binaries and data on the shared disk where possible, as it
is easy to forget to update local copies on every cluster node when something changes. An out-of-date copy
can prevent the application from starting or working correctly when it is run on a backup node. However, the
correct answer is not fixed. Many application vendors have suggestions on how to set up the applications in
a cluster, but these are recommendations. Just when it seems to be clear cut as to how to implement an
application, someone thinks of a new set of circumstances. Here are some rules of thumb:

If the application is packaged in LPP format, it is usually installed on the local file systems in rootvg. This
behavior can be overcome, by bffcreate’ing the packages to disk and restoring them with the preview op-
tion. This action will show the install paths, then symbolic links can be created prior to install which point
to the shared storage area. If the application is to be used on multiple nodes with different data or configu-
ration, then the application and configuration data would probably be on local disks and the data sets on
shared disk with application scripts altering the configuration files during fallover. Also, remember the
HACMP File Collections facility can be used to keep the relevant configuration files in sync across the clus-
ter. This is particularly useful for applications which are installed locally.

Start/Stop Scripts
Application start scripts should not assume the status of the environment. Intelligent programming should
correct any irregular conditions that may occur. The cluster manager spawns these scripts off as a separate
job in the background and carries on processing. Some things a start script should do are:

• First, check that the application is not currently running! This is especially crucial for v5.4 users, as
resource groups can be placed into an unmanaged state (forced down action, in previous versions).
Using the default startup options, HACMP will rerun the application start script, which may cause
problems if the application is actually running. A simple and effective solution is to check the state
of the application on startup. If the application is found to be running, simply end the start script
with exit 0.

• Verify the environment. Are all the disks, file systems, and IP labels available?

• If different commands are to be run on different nodes, store the executing HOSTNAME in a variable.

• Check the state of the data. Does it require recovery? Always assume the data is in an unknown state,
since the conditions that caused the takeover cannot be assumed.

• Are there prerequisite services that must be running? Is it feasible to start all prerequisite services
from within the start script? Is there an inter-resource group dependency or resource group
sequencing that can guarantee the previous resource group has started correctly? HACMP v5.2 and later
have facilities to implement checks on resource group dependencies, including collocation rules in
HACMP v5.3.

• Finally, when the environment looks right, start the application. If the environment is not correct and
error recovery procedures cannot fix the problem, ensure there are adequate alerts (email, SMS,
SNMP traps etc.) sent out via the network to the appropriate support administrators.
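
The first two checks and the final start step can be sketched as a small shell skeleton. This is a minimal illustration rather than a supplied HACMP script: the process name myappd, the file system /shared/myapp and the start command are hypothetical placeholders, and the alerting step is reduced to a message on stderr.

```shell
# Hypothetical placeholders; replace with the real application details.
APP_PROC="${APP_PROC:-myappd}"
APP_FS="${APP_FS:-/shared/myapp}"
APP_START="${APP_START:-/shared/myapp/bin/start_myapp}"

app_running() {
    # True when a process with exactly this command name exists.
    ps -e -o comm= | grep -qx "$1"
}

fs_mounted() {
    # True when df reports the directory itself as a mount point.
    df -P "$1" 2>/dev/null |
        awk -v fs="$1" 'NR > 1 && $NF == fs { ok = 1 } END { exit !ok }'
}

start_app() {
    # 1. Never start a second copy: report success and stop here.
    if app_running "$APP_PROC"; then
        echo "$APP_PROC already running; nothing to do"
        return 0
    fi
    # 2. Verify the environment before touching the application.
    if ! fs_mounted "$APP_FS"; then
        echo "$APP_FS is not mounted; cannot start" >&2   # raise an alert here
        return 1
    fi
    # 3. The environment looks right, so start the application.
    "$APP_START"
}
```

A real start script would call `start_app` and exit with its return code, adding the data-state and prerequisite-service checks that are specific to the application.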

Stop scripts are different from start scripts in that most applications have a documented start-up routine
and not necessarily a stop routine. The assumption is once the application is started why stop it? Relying
on a failure of a node to stop an application will be effective, but to use some of the more advanced fea-
tures of HACMP the requirement exists to stop an application cleanly. Some of the issues to avoid are:
• Be sure to terminate any child or spawned processes that may be using the disk resources. Consider
implementing child resource groups.

• Verify that the application is stopped to the point that the file system is free to be unmounted. The
fuser command may be used to verify that the file system is free.

• In some cases it may be necessary to double check that the application vendor’s stop script did
actually stop all the processes, and occasionally it may be necessary to forcibly terminate some
processes. Clearly the goal is to return the machine to the state it was in before the application start
script was run.

• Always exit the stop script with a zero return code; failure to do so will stop cluster processing.
* Note: This is not the case with start scripts!
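
A stop script often needs two helpers: one that asks a process to terminate and escalates only if it does not go away, and one that confirms the file system is no longer in use. The sketch below shows one possible shape for them; the function names and the default 30-second grace period are assumptions for illustration.

```shell
stop_pid() {
    # $1 = process ID, $2 = seconds to wait before escalating (default 30)
    pid=$1
    wait_secs=${2:-30}
    kill -TERM "$pid" 2>/dev/null            # polite request first
    while [ "$wait_secs" -gt 0 ] && kill -0 "$pid" 2>/dev/null; do
        sleep 1
        wait_secs=$((wait_secs - 1))
    done
    if kill -0 "$pid" 2>/dev/null; then
        kill -KILL "$pid" 2>/dev/null        # forcibly terminate as a last resort
    fi
    return 0                                 # always report success to the caller
}

fs_busy() {
    # True when fuser lists at least one process using the file system.
    [ -n "$(fuser -c "$1" 2>/dev/null)" ]
}
```

A stop script built from these helpers would run the vendor's stop routine, call `stop_pid` for any stragglers, confirm with `fs_busy` that the shared file system is free, and then end with exit 0.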

Remember, most vendor stop/start scripts are not designed to be cluster proof! A useful tip is to have the
stop and start scripts write verbose output, in a common format, to the /tmp/hacmp.out file. This can be
achieved by including the following line in the header of the script:
set -x && PS4="${0##*/}"'[$LINENO] '

Application Monitoring
HACMP provides the ability to monitor the state of an application. Although optional, implementation is
highly recommended. This mechanism provides for self-healing clusters. In order to ensure that event
processing does not hang due to failures in the (user supplied) script and to prevent hold-up during event
processing, HACMP has always started the application in the background. This approach has disadvan-
tages :

• There is no wait or error checking.
• In a multi-tiered environment there is no easy way to ensure that applications of higher tiers have
been started.

Application monitoring can either check for process death, or run a user-supplied custom monitor method
during the start-up or continued running of the application. The latter is particularly useful when the ap-
plication provides some form of transaction processing - a monitor can run a null transaction to ensure that
the application is functional. Best practice for applications is to have both process death and user-
supplied application monitors in place.
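
A custom monitor method is simply a script whose exit code tells HACMP whether the application is healthy: zero for healthy, non-zero for failed. The sketch below wraps a "null transaction" probe in that convention. The probe command is a hypothetical placeholder for whatever harmless request the application supports.

```shell
# $1 = probe command that performs a harmless null transaction and exits 0
#      on success (hypothetical; supplied by the application owner).
monitor_app() {
    probe=$1
    if $probe >/dev/null 2>&1; then
        return 0      # healthy: no action is taken
    fi
    return 1          # failed: the configured restart or fallover policy applies
}
```

Because a false alarm triggers recovery actions, the probe should fail only when the application is genuinely unusable, and it should complete well within the monitor interval.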

Don’t forget to test the monitoring, start, restart and stop methods carefully! Poor start, stop and monitor
scripts can cause cluster problems, not just in maintaining application availability but also in avoiding
data corruption ³.

In addition, HACMP also supplies a number of tools and utilities to help in customization efforts like pre-
and post- event scripts. Care should be taken to use only those for which HACMP also supplies a man
page (lslpp -f cluster.man.en_US.es.data) – those are the only ones for which upwards com-
patibility is guaranteed. A good best practice example for this use would be for application provisioning.

³ Having monitoring scripts exit with non-zero return codes when the application has not failed, in conjunction
with poor start/stop scripts, can result in undesirable behavior (i.e. data corruption). Not only is the
application down, but it is in need of emergency repair, which may involve a data restore from backup.
⁴ CoD support includes: On/Off CoD inc. Trial, CUoD and CBU for high-end only. See
http://www-03.ibm.com/servers/eserver/about/cod for further details.
Application Provisioning
HACMP has the capability of driving Dynamic LPAR and some Capacity on Demand (CoD) operations ⁴
to ensure there is adequate processing and memory available for the application(s) upon start-up. This is
shown in Fig 1.1.

Fig 1.1 Application Provisioning example.

This process can be driven using HACMP smit panels. However, this approach does have several
limitations:

• Support for POWER4™ architecture only (whole CPUs and 256 MB memory chunks)
• No provisions or flexibility for shutting down or "stealing from" other LPARs
• The CoD activation key must have been entered manually prior to any HACMP Dynamic Logical
Partitioning (DLPAR) event
• Must have LPAR name = AIX OS hostname = HACMP node name
• Large memory moves will be actioned in one operation. This will invariably take some time and
hold up event processing
• The LPAR hostname must be resolvable at the HMC
• The HACMP driver script hmc_cmd does not log the DLPAR / CoD commands it sends to the HMC.
Debugging is limited and it is often necessary to hack the script, which is far from ideal!
• If the acquisition / release fails, the operation is not repeated on another HMC, even if one is defined

Given these drawbacks, I would recommend this behavior is implemented using user supplied custom
scripts. Practical examples can be explored in the AU61G Education class - see Reference section.
Testing
Simplistic as it may seem, the most important thing about testing is to actually do it.

A cluster should be thoroughly tested prior to initial production (and once clverify runs without errors or
warnings). This means that every cluster node and every interface that HACMP uses should be brought
down and up again, to validate that HACMP responds as expected. Best practice would be to perform the
same level of testing after each change to the cluster. HACMP provides a cluster test tool that can be run
on a cluster before it is put into production. This will verify that the applications are brought back on line
after node, network and adapter failures. The test tool should be run as part of any comprehensive cluster
test effort.

Additionally, regular testing should be planned. It’s a common safety recommendation that home smoke
detectors be tested twice a year - the switch to and from daylight savings time being well-known points.
Similarly, if the enterprise can afford to schedule it, node fallover and fallback tests should be scheduled
biannually. These tests will at least indicate whether any problems have crept in, and allow for correction
before the cluster fails in production.

On a more regular basis, clverify should be run. Not only errors but also warning messages should be
taken quite seriously and fixed at the first opportunity. Starting with HACMP v5.2, clverify is run
automatically every day at 00:00 hrs. Administrators should make a practice of checking the logs daily and
reacting to any warnings or errors.

Maintenance
Even the most carefully planned and configured cluster will have problems if it is not well maintained. A
large part of best practice for an HACMP cluster is associated with maintaining the initial working state of
the cluster through hardware and software changes.

Prior to any change to a cluster node, take an HACMP snapshot. If the change involves installing an
HACMP, AIX or other software fix, also take a mksysb backup. On successful completion of the change,
use SMIT to display the cluster configuration, then print out and save the smit.log file. The Online
Planning Worksheets facility can also be used to generate an HTML report of the cluster configuration.

All mission critical HA Cluster Enterprises should, as best practice, maintain a test cluster identical to the
production ones. All changes to applications, cluster configuration, or software should be first thoroughly
tested on the test cluster prior to being put on the production clusters. The HACMP cluster test tool can be
used to at least partially automate this effort.

Change control is vitally important in an HACMP cluster. In some organizations, databases, networks and
clusters are administered by separate individuals or groups. When any group plans maintenance on a
cluster node, it should be planned and coordinated amongst all the parties. All should be aware of the
changes being made to avoid introducing problems. Organizational policy must preclude “unilateral”
changes to a cluster node. Additionally, change control in an HACMP cluster needs to include a goal of

having all cluster nodes at the same level. It is insufficient (and unwise!) to upgrade just the node
running the application. Develop a process which encompasses the following set of questions:

• Is the change necessary?
• How urgent is the change?
• How important is the change? (Not the same as urgent.)
• What impact does the change have on other aspects of the cluster?
• What is the impact if the change is not allowed to occur?
• Are all of the steps required to implement the change clearly understood and documented?
• How is the change to be tested?
• What is the plan for backing out the change if necessary?
• Will the appropriate expertise be available should problems develop?
• When is the change scheduled?
• Have the users been notified?
• Does the maintenance period include sufficient time for a full set of backups prior to the change and
sufficient time for a full restore afterwards should the change fail testing?

This process should include an electronic form which requires appropriate sign-offs before the change can
go ahead. Every change, even a minor one, must follow the process. No change, however small, may be
permitted (or sneaked through) without following the process.

To this end, the best practice is to use the HACMP C-SPOC facility where possible for any change, espe-
cially with regards to shared volume groups. If the installation uses AIX password control on the cluster
nodes (as opposed to NIS or LDAP), C-SPOC should also be used for any changes to users and groups.
HACMP will then ensure that the change is properly reflected to all cluster nodes.

Upgrading the Cluster Environment

OK, so you want to upgrade? Start by reading the upgrade chapter in the HACMP installation
documentation and make a detailed plan. Taking the time to review and plan thoroughly will save many 'I
forgot to do that!' problems during and after the migration/upgrade process. Don’t forget to check all the
version compatibilities between the different levels of software/firmware and, most importantly, the
application software certification against the level of AIX and HACMP. If you are not sure, check with IBM
support and/or use the Fix Level Recommendation Tool (FLRT), which is available at:
http://www14.software.ibm.com/webapp/set2/flrt/home.

Don’t even think about upgrading AIX or HACMP without first taking a backup and checking that it is
restorable. In all cases, it is extremely useful to complete the process in a test environment before
actually doing it for real. AIX facilities such as alt_disk_copy and multibos, which create an alternative
rootvg that can be activated via a reboot, are very useful tools worth exploring and using.

Before attempting the upgrade, ensure you carry out the following steps:

• Check that the cluster and application are stable and that the cluster can synchronize cleanly.

• Take a cluster snapshot and save it to a temporary non-cluster directory
(export SNAPSHOTPATH=<some other directory>).

• Save event script customization files / user-supplied scripts to a temporary non-cluster directory. If
you are unsure whether any custom scripts are included, check with odmget HACMPcustom.

• Check that the same level of cluster software (including PTFs) is on all nodes before beginning a
migration.

• Ensure that the cluster software is committed (and not just applied).
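
The last check can be automated by scanning the fileset listing for anything still in the APPLIED state. The sketch below reads output in the column layout of lslpp -l (fileset, level, state, description) on standard input; the function name is hypothetical.

```shell
# Reads lslpp -l style output on stdin, prints every fileset whose state is
# APPLIED, and exits 0 only when nothing is left uncommitted.
uncommitted_filesets() {
    awk '$3 == "APPLIED" { print $1; n++ } END { exit (n > 0) }'
}
```

On a cluster node this might be run as `lslpp -l "cluster.*" | uncommitted_filesets`, repeated on every node so that the levels can also be compared.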

Where possible, the Rolling Migration method should be used, as this ensures maximum availability.
Effectively, cluster services are stopped one node at a time using the takeover option ('Move resource
groups' in HACMP v5.4). The node/system is updated accordingly and cluster services are restarted. This
operation is completed one node at a time until all nodes are at the same level and operational. Note: While
HACMP will work with mixed levels of AIX or HACMP in the cluster, the goal should be to have all nodes
at exactly the same levels of AIX, HACMP and application software. Additionally, HACMP prevents
changes to the cluster configuration when mixed levels of HACMP are present.

Starting with HACMP v5.4, PTFs can be applied using a 'Non-disruptive upgrade' method. The process is
identical to the rolling migration, except that resource groups are placed into an 'Unmanaged' state so
that the applications remain available. Note: During this state the application(s) are not under the
control of HACMP (i.e. not highly available!). Using the default start-up options, HACMP relies on an
application monitor to determine the application state and hence the appropriate actions to undertake.

Alternatively, the entire cluster and applications can be gracefully shut down to update the cluster using
either the 'snapshot' or 'Offline' conversion methods. Historically, upgrading the cluster this way has
resulted in fewer errors, but it requires a period of downtime.

Monitoring
HACMP provides a rich set of facilities for monitoring a cluster, such as Tivoli Integration filesets and
commands such as cldisp, cldump and clstat. The actual facilities used may well be set by enterprise
policy (e.g., Tivoli is used to monitor all enterprise systems). The SNMP protocol is the key to obtaining
the status of the cluster. HACMP implements a private Management Information Base (MIB) branch,
maintained via a SMUX peer subagent to SNMP contained in the clstrmgrES daemon, as shown in Fig 2.0.

Fig 2.0 SNMP and HACMP

The clinfo daemon status facility does have several restrictions, and many users/administrators of HACMP
clusters implement custom monitoring scripts. This may seem complex but it is actually remarkably
straightforward. The cluster SNMP MIB data can be pulled over a secure session by typing:

ssh $NODE snmpinfo -v -m dump -o /usr/es/sbin/cluster/hacmp.defs risc6000clsmuxpd > $OUTFILE

The output can be parsed with perl or shell scripts to produce a cluster status report. A
little further scripting can render the output in HTML format so the cluster status can be obtained
through a CGI web-driven program, as shown in Fig 2.1. Further details are covered in the AU61 World-Wide
HACMP Education class. Other parties also offer HACMP-aware add-ons for SNMP monitors;
these include HP OpenView, Tivoli Universal Agent and BMC PATROL (HACMP Observe Knowledge
Module by to/max/x).
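
As an illustration of how little scripting is needed, here is a minimal sketch of a parser for the dump. It rests on two assumptions that should be checked against your own output and against the hacmp.my MIB definition: that the verbose dump prints lines of the form name.instance = value (for example nodeName.1 = "nodeA"), and that node state codes 2, 4, 32 and 64 mean up, down, joining and leaving. The function names are hypothetical.

```shell
#!/bin/sh
# Sketch only: turn an snmpinfo dump (read from stdin) into a node status list.

# Map a numeric node state to text (codes are an assumption, see above).
state_name() {
    case "$1" in
        2)  echo UP ;;
        4)  echo DOWN ;;
        32) echo JOINING ;;
        64) echo LEAVING ;;
        *)  echo "UNKNOWN($1)" ;;
    esac
}

# Print "<node name> <state>" for every node found in the dump.
node_report() {
    awk -F' = ' '
        /^nodeName\./  { split($1, a, "[.]"); gsub(/"/, "", $2); name[a[2]] = $2 }
        /^nodeState\./ { split($1, a, "[.]"); state[a[2]] = $2 }
        END { for (i in name) print i, name[i], state[i] }
    ' | sort -n | while read -r idx name code; do
        echo "$name $(state_name "$code")"
    done
}

# Hypothetical usage:
#   node_report < "$OUTFILE"
```

The same pattern extends to the cluster, network and resource group objects in the dump, and the output lines are easy to wrap in HTML for a web status page.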

Furthermore, HACMP can invoke notification methods such as SMS, pager and e-mail messages on cluster
event execution, and execute scripts on entry of error log reports. Best practice is to have notification of
some form in place for all cluster events associated with hardware or software failures and for significant
events such as adapter, network and node failures.

Fig 2.1 Custom HACMP Monitor

HACMP in a Virtualized World


HACMP will work with virtual devices; however, some restrictions apply when using virtual Ethernet or
virtual disk access. Creating a cluster in a virtualized environment adds new SPOFs which need to be
taken into account. HACMP nodes inside the same physical footprint (frame) must be avoided if high
availability is to be achieved; this configuration should be considered only for test environments. To eliminate
the additional SPOFs in a virtual cluster, a second VIOS should be implemented in each
frame, with the Virtual I/O Client (VIOC) LPARs located within different frames, ideally some distance apart.

Redundancy for disk access can be achieved through LVM mirroring or Multi-Path I/O (MPIO). LVM mirroring
is best suited to eliminating the VIOC rootvg as a SPOF, as shown in Fig 3.0. The root volume group
can be mirrored using standard AIX practices. In the event of a VIOS failure, the LPAR will see stale partitions
and the volume group will need to be resynchronized using syncvg. This procedure can also utilize
logical volumes as backing storage to maximize flexibility. For test environments, where each VIOC
is located in the same frame, LVM mirroring could also be used for data volume groups.
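
To show what the resynchronization step might look like in practice, here is a minimal sketch that lists the logical volumes reported as stale. It assumes the usual lsvg -l layout, in which two header lines are followed by one line per logical volume and the sixth column holds the LV STATE (for example open/syncd or open/stale). The function name is hypothetical.

```shell
#!/bin/sh
# Sketch only: list logical volumes whose state ends in "/stale".
# Input on stdin: output of `lsvg -l <volume group>`.
stale_lvs() {
    # NR > 2 skips the "rootvg:" line and the column titles.
    awk 'NR > 2 && $6 ~ /\/stale$/ { print $1 }'
}

# Hypothetical usage after the failed VIOS is back in service:
#   if [ -n "$(lsvg -l rootvg | stale_lvs)" ]; then
#       syncvg -v rootvg
#   fi
```

Running the check from a scheduled job or an error notification method avoids leaving the mirror exposed for longer than necessary.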

Fig 3.0 Redundancy using LVM Mirroring

For shared data volume groups, the MPIO method should be deployed. See Fig 4.0. A LUN is mapped to
both VIOSs in the SAN. From both VIOSs, the LUN is mapped again to the same VIOC. The VIOC LPAR
will correctly identify the disk as an MPIO-capable device and create one hdisk device with two paths. The
configuration is then duplicated on the backup frame/node. Currently, the virtual storage devices
work only in failover mode; other modes are not yet supported. All devices accessed through a VIO server
must support a "no_reserve" attribute. If the device driver is not able to "ignore" the reservation, the device
cannot be mapped to a second VIOS. Currently, the reservation held by a VIO server cannot be broken
by HACMP, hence only devices that will not be reserved on open are supported. Therefore, HACMP
requires the use of enhanced concurrent mode volume groups (ECVGs). The use of ECVGs is generally
considered best practice!
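
A quick way to confirm the reservation setting on a backing disk is sketched below. It assumes the disk driver exposes the setting as the reserve_policy attribute, which is common for MPIO disks but not universal (some drivers use a different attribute name), and that lsattr -El prints it as "reserve_policy no_reserve Reserve Policy True". The function names and hdisk4 are placeholders.

```shell
#!/bin/sh
# Sketch only: read the reserve policy from `lsattr -El hdiskX -a reserve_policy`
# output supplied on stdin.
reserve_policy_of() {
    awk '$1 == "reserve_policy" { print $2 }'
}

# Succeeds (exit 0) only when the policy read from stdin is no_reserve.
is_no_reserve() {
    [ "$(reserve_policy_of)" = "no_reserve" ]
}

# Hypothetical usage from a root shell on each VIOS:
#   lsattr -El hdisk4 -a reserve_policy | is_no_reserve ||
#       chdev -l hdisk4 -a reserve_policy=no_reserve
```

The attribute generally has to be changed before the disk is mapped to a client, so it is worth checking as part of the VIOS build rather than after a failed mapping.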

Fig 4.0 Redundancy using MPIO

In a virtualized networking environment, a VIOS is needed for access to the outside world via a layer-2
based Ethernet bridge, which is referred to as a Shared Ethernet Adapter (SEA). Now, the physical network
devices along with the SEA are the new SPOFs. How are these SPOFs eliminated? Again, through the
use of a second VIOS. Etherchannel technology from within the VIOS can be used to eliminate both the
network adapters and the switch as SPOFs. To eliminate the VIOS as a SPOF there are two choices:

1. Etherchannel (configured in backup mode ONLY - No Aggregation) in the VIOC. See Fig 5.0
2. SEA failover via the Hypervisor. See Fig 6.0.

There are advantages and disadvantages with both methods. However, SEA failover is generally considered
best practice as it allows the use of Virtual LAN ID (VID) tags and keeps the client configuration
cleaner.

From the client perspective only a single virtual adapter is required, and hence IPAT via Aliasing must be
used. IPAT via Replacement and H/W Address Takeover (HWAT) are not supported. Having a second virtual
adapter will not eliminate a SPOF, as the adapter is not real! The SPOF is the Hypervisor! Generally,
single interface networks are not best practice as this limits the error detection capabilities of HACMP. In
this case it cannot be avoided, so to aid additional analysis, add external IP addresses to the netmon.cf file.
In addition, at least two physical adapters per SEA should be used in the VIOS in an Etherchannel configuration.
Adapters in this channel can also form an aggregate, but remember that most vendors require
adapters which form an aggregate to share the same backplane (a SPOF! So don't forget to define a
backup adapter). An exception to this rule is Nortel's Split Multi-Link Trunking. Depending on your environment,
this technology may be worth investigating.
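
For the netmon.cf advice, a minimal sketch follows. It assumes the file lives at /usr/es/sbin/cluster/netmon.cf on each node and holds one address per line; the addresses shown are placeholders for targets outside the frame (a default gateway, for example), and the helper name is hypothetical.

```shell
#!/bin/sh
# Sketch only: print netmon.cf content, one external address per line.
# Stops with a non-zero status at the first argument that is not a
# dotted-quad IPv4 address.
netmon_entries() {
    for addr in "$@"; do
        if echo "$addr" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'; then
            echo "$addr"
        else
            echo "invalid address: $addr" >&2
            return 1
        fi
    done
}

# Hypothetical usage on each cluster node:
#   netmon_entries 10.47.0.1 10.47.0.254 > /usr/es/sbin/cluster/netmon.cf
```

Choose targets that do not depend on the same VIOS or switch as the interface being checked, otherwise they add nothing to the analysis.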

Fig 5.0 Etherchannel in Backup Mode

Fig 6.0 SEA Failover

And finally a view of the big picture. Be methodical in your planning. As you can see from Fig 7.0 even a
simple cluster design can soon become rather complex!

Fig 7.0 A HACMP Cluster in a virtualized world


Maintenance of the VIOS partition – Applying Updates

The VIOS must be updated in isolation, i.e. with no client access. A simple way of achieving this is to start
by creating a new profile for the VIO server by copying the existing one. Then delete all virtual devices
from the profile and reactivate the VIOS using the new profile. This ensures that no client partition can
access any devices and the VIOS is ready for maintenance.

Prior to restarting the VIOS, manual failover from the client must be performed so all disk access and
networking goes through the alternate VIOS. Steps to accomplish this are as follows. For:

MPIO storage, disable the active path by typing:
chpath -l hdiskX -p vscsiX -s disable
LVM mirrored disks, set the virtual SCSI target devices to 'defined' state in the VIO server partition.
SEA failover, initiate the failover from the active VIOS by typing (entX being the SEA device):
chdev -dev entX -attr ha_mode=standby
Etherchannel in the VIOC, initiate a forced failover using smitty etherchannel.

After the update has been applied the VIOS must be rebooted. The client should then be redirected to the
newly updated VIOS and the same procedure followed on the alternative VIOS. It’s important that each
VIOS used has the same code level.
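
Before disabling an MPIO path for maintenance it is prudent to confirm that another path is still usable. The sketch below assumes lspath prints one line per path in the form "Enabled hdisk0 vscsi0"; the function names and device names are placeholders.

```shell
#!/bin/sh
# Sketch only: count the Enabled paths to a disk other than the one
# about to be disabled.
#   $1 = disk (e.g. hdisk0), $2 = parent adapter to be disabled (e.g. vscsi0)
# Input on stdin: output of `lspath -l <disk>`.
other_enabled_paths() {
    awk -v d="$1" -v p="$2" \
        '$1 == "Enabled" && $2 == d && $3 != p { n++ } END { print n + 0 }'
}

# Succeeds only when at least one other Enabled path remains.
safe_to_disable() {
    [ "$(other_enabled_paths "$1" "$2")" -ge 1 ]
}

# Hypothetical usage on the client LPAR:
#   lspath -l hdisk0 | safe_to_disable hdisk0 vscsi0 &&
#       chpath -l hdisk0 -p vscsi0 -s disable
```

The same guard is useful when re-enabling paths afterwards, to avoid starting maintenance on the second VIOS while the first path is still Failed or Disabled.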

Workload Partitions (WPAR)

The November 2007 release of AIX 6 introduced workload partitions (WPARs), an exciting new feature with
the potential to improve administration efficiency and assist with server consolidation. System WPARs
carve a single instance of the AIX OS into multiple virtual instances, allowing for separate "virtual partitions."
This capability allows administrators to deploy multiple AIX environments without the overhead of
managing individual AIX images. When deploying WPAR environments, careful consideration is
needed to ensure maximum availability. Potentially, new SPOFs are introduced into the environment; these
may include:

the network between the WPAR host and the NFS server
the NFS server
the WPAR(s)
the OS hosting the WPAR(s)
the WPAR applications

The HACMP 5.4.1 release introduces (as part of the base product) basic WPAR support, which allows a resource
group to be started within a WPAR. As discussed previously with application provisioning, several
limitations currently exist. The biggest drawback is that, currently, HACMP does not manage or monitor
the WPAR itself; it manages the applications that run within a WPAR. This is a key point, and for this reason
best practice recommends a very different method of supporting WPARs with HACMP from the default
method supported by HACMP.

Firstly, and most importantly, I recommend that all WPARs to be clustered for High Availability
are designed and created so that they are not fixed to any one individual node or LPAR, using what is known
as WPAR mobility. WPAR mobility is an extension to WPAR which allows a WPAR to be moved from node
to node (independently of products such as HACMP). This involves the use of at least a third node, which
acts as an NFS server to host, at minimum, the /, /home, /var and /tmp filesystems for each of the WPARs.
This approach has several major advantages, as by default we can move the WPARs by checkpointing and
restoring the system. The checkpoint facility saves the current status of the WPAR and application(s) and
then restarts them at their previously saved state. Checkpoints can be taken at any time, for example
every hour. During recovery the appropriate checkpoint can be restored. This way the system can be restored
to its last known good state. Furthermore, I recommend that:

the service type IP addresses for the WPARs are defined to the WPARs themselves and not to
HACMP.
WPAR mobility is created and tested outside of HACMP control before being put under HACMP
control.
HACMP controls the start-up, shutdown/checkpoint and movement of the WPAR to the backup
nodes.
HACMP monitors the health of the WPARs using custom application monitoring.
HACMP monitors the health of the WPAR applications using process and/or custom application
monitoring.
multiple WPAR configurations use a mutual takeover implementation to balance the workload of
the AIX servers hosting the WPARs and the NFS servers.

Figure 8.0 shows an example of a highly available WPAR environment with resilience for both the NFS
server and the WPAR hosting partitions. The WPAR zion is under the control of HACMP and shares
filesystems from both the local host and the NFS server, as shown in Figure 9.0. Note: the movement of wparRG
will checkpoint all running applications, which will automatically resume from the checkpoint state on the
backup node (no application start-up is required, but a small period of downtime is experienced!).

Using this method of integration makes WPAR support with HACMP release-independent, i.e. the same
implementation steps can be carried out with any supported version of HACMP, not just 5.4.1.

Fig 8.0 Highly available WPAR environment

Fig 9.0 Example layout for WPAR: zion

Implementation Overview

On the NFS server create the local filesystems to export for each WPAR. Optionally decide whether
to cluster the NFS server for improved resilience.

Remember that HACMP uses /usr/es/sbin/cluster/etc/exports as the default NFS
export file. Here is an example line from the export file:

/wpars/home -sec=sys,rw,access=hacmp_wparA:hacmp_wparB:zion,root=hacmp_wparA:hacmp_wparB:zion

Also, ensure you configure high availability for the NFS servers using cross-mounting. Cross-mounting
causes each HACMP cluster node to NFS-mount the exported filesystem, including the node that holds it
locally; this results in a much faster failover time. Here is an example of the home directory export for
WPAR zion.

Change/Show All Resources and Attributes for a Resource Group

[MORE...16]                                       [Entry Fields]
  Filesystems Recovery Method                     sequential               +
  Filesystems mounted before IP configured        true                     +
  Filesystems/Directories to Export (NFSv2/3)     [/wpar/home]
  Filesystems/Directories to NFS Mount            [/zionhome;/wpar/home]
  Network For NFS Mount                           []                       +

This causes HACMP to export /wpar/home and mount the filesystem on all nodes as:
mount wpar_nfs:/wpar/home /zionhome

On each of the cluster (WPAR) nodes install the metacluster checkpoint fileset, mcr.rte. This will enable
WPAR mobility.

On the primary cluster node create the WPAR, e.g.

mkwpar -c -r -o /tmp/w.log -R active=yes \
    -M directory=/ vfs=nfs host=count dev=/wpars/slash \
    -M directory=/home vfs=nfs host=count dev=/wpars/home \
    -M directory=/tmp vfs=nfs host=count dev=/wpars/tmp \
    -M directory=/var vfs=nfs host=count dev=/wpars/var \
    -M directory=/cpr vfs=nfs host=count dev=/wpars/cpr \
    -h zion -N interface='en0' address='10.47.1.3' netmask='255.255.0.0' \
    -n WPARname

Go to smitty clonewpar_sys and create a WPAR clone file. Copy the clone file to the secondary
node(s).

On the secondary node(s) create the WPAR definition


mkwpar -p -f <path to the WPAR definition file>

Manually perform a WPAR relocation from the primary to the secondary node, e.g.
Primary # /opt/mcr/bin/chkptwpar -d <path to statefile> -o <path to logfile> -k <WPARname>
Secondary # /opt/mcr/bin/restartwpar -d <path to statefile> -o <path to logfile> <WPARname>

Note: The Statefile must be visible to both the Global AIX and WPAR environment!

Create a HACMP Application Server and Resource Group for the WPAR. See Appendix A, sample
scripts.

Test the WPAR operation under HACMP's control. Tip! Start by moving the RG between nodes using
C-SPOC, before attempting any destructive testing.

Implement HACMP custom monitoring to monitor the state of both the WPAR(s) and application(s).
See Appendix A, sample scripts.

Test the integration thoroughly.

Summary
‘Some final words of advice ....’

Spend considerable time in the planning stage. This is where the bulk of the documentation will be produced,
and it will lay the foundation for a successful production environment! Start by building a detailed
requirements document⁵. Focus on ensuring the cluster does what the users want/need it to do and that
the cluster behaves how you intend it to. Next, build a detailed technical design document⁶. Details
should include a thorough description of the Storage / Network / Application / Cluster environment
(H/W & S/W configuration) and the Cluster Behavior (RG policies, location dependencies etc). Finally, make
certain the cluster undergoes comprehensive and thorough testing⁷ before going live and at regular
intervals thereafter.

Once the cluster is in production, all changes must be made in accordance with a documented Change
Management procedure, and the specific changes must follow the Operational Procedures using (where
possible) cluster aware tools⁸.

Following the above steps from the initial start phase will greatly reduce the likelihood of problems and
change once the cluster is put into production. In addition, to conclude this white paper, here is a general
summary list of HACMP do’s and don’ts.

Do :

Where feasible, use IPAT via Aliasing style networking and enhanced concurrent VGs.
Ensure the H/W & S/W environment has a reasonable degree of currency. Take regular cluster
snapshots and system backups.
Configure application monitors to enhance availability and aid self healing.
Implement a test environment to ensure changes are adequately tested.
Implement a reliable heartbeat mechanism and include at least one non IP network.
Ensure there are mechanisms in place which will send out alerts via SNMP, SMS or email when fail-
ures are encountered within the cluster.
Implement verification and validation scripts that capture common problems (or problems that are
discovered in the environment) eg. volume group settings, NFS mount/export settings, application
changes. In addition, ensure that these mechanisms are kept up-to-date.
Make use of available HACMP features, such as: application monitoring, extended cluster verification
methods, 'automated' cluster testing (in TEST only), file collections, fast disk takeover and fast
failure detection.
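
As an illustration of the verification scripts recommended above, here is a minimal sketch of a drift check. It is generic: it only compares one value per node (a checksum of a file, for instance) and reports the nodes that disagree with the first one. The function name, node names and the choice of file are assumptions for the example.

```shell
#!/bin/sh
# Sketch only: detect configuration drift between nodes.
# Input on stdin: one "node value" pair per line; the value must not
#                 contain spaces (e.g. a checksum).
# Output: the nodes whose value differs from the first node listed.
# Exit status: 0 when all nodes agree, 1 otherwise.
report_drift() {
    awk 'NR == 1 { ref = $2; next }
         $2 != ref { print $1; bad = 1 }
         END { exit bad }'
}

# Hypothetical usage, comparing the HACMP NFS exports file on two nodes:
#   for n in nodeA nodeB; do
#       echo "$n $(ssh "$n" cksum /usr/es/sbin/cluster/etc/exports | awk '{print $1}')"
#   done | report_drift || echo "exports file differs between nodes"
```

Because the function signals a mismatch through its exit status, it can be called from a scheduled job that raises an alert, or wrapped in whatever verification mechanism the site already uses.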

Do not :

Introduce changes to one side of the cluster whilst not keeping the other nodes in sync. Always ensure
changes are synchronized immediately. If some nodes are up and others down, ensure the
change is made and synchronized from an active node.
Attempt change outside of HACMP's control using custom mechanisms. Where possible use C-SPOC.
Configure applications to bind in any way to node-specific attributes, such as IP addresses, hostnames,
CPU IDs etc. It is best practice to move the applications from node to node manually before
putting them in resource groups under the control of HACMP.
Make the architecture too complex or implement a configuration which is hard to test.

Deploy basic application start and stop scripts which do not include pre-requisite checking and error
recovery routines. Always ensure these scripts log verbosely to stdout and stderr.
Implement nested file systems that create dependencies or waits and other steps that elongate failovers.
Provide root access to untrained and cluster-unaware administrators.
Change failure detection rates on networks without very careful thought and consideration.
Use operations such as # kill `ps -ef | grep appname | awk '{print $2}'` when
stopping an application. This may also result in killing the HACMP application monitor.
Rely on standard AIX volume groups (VGs) if databases use raw logical volumes. Consider instead
implementing Big or Scalable VGs. This way, user, group and permission information can be stored
in the VGDA header, which will reduce the likelihood of problems during failover.
Rely on any form of manual effort or intervention which may be involved in keeping the applications
highly available.
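
One alternative to the grep-and-kill pattern warned against above is to have the start script record the application's process ID and have the stop script act only on that ID. The following is a minimal sketch under that assumption; the PID file location, the timeout and the function name are all hypothetical.

```shell
#!/bin/sh
# Sketch only: stop exactly one process, identified by a PID file that the
# start script is assumed to have written.
#   $1 = PID file, $2 = seconds to wait before forcing (default 10)
stop_by_pidfile() {
    pidfile=$1
    timeout=${2:-10}
    [ -r "$pidfile" ] || { echo "no pid file $pidfile; nothing to stop"; return 0; }
    pid=$(cat "$pidfile")
    case "$pid" in
        ''|*[!0-9]*) echo "invalid pid in $pidfile" >&2; return 1 ;;
    esac
    kill "$pid" 2>/dev/null            # polite SIGTERM first
    while [ "$timeout" -gt 0 ] && kill -0 "$pid" 2>/dev/null; do
        sleep 1
        timeout=$((timeout - 1))
    done
    if kill -0 "$pid" 2>/dev/null; then
        echo "pid $pid ignored SIGTERM; sending SIGKILL" >&2
        kill -9 "$pid"
    fi
    rm -f "$pidfile"
    echo "stopped pid $pid"
}

# Hypothetical usage in an application stop script:
#   stop_by_pidfile /var/run/appname.pid 30
```

Because only the recorded process is signalled, a monitor or any other process that happens to have the application name on its command line is left alone, and every branch logs what it did.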

⁵A written cluster requirements document allows you to carry out a coherent and focused discussion with the users about what they
want done. It also allows you to refer to these requirements while you design the cluster and while you develop the cluster test
plan.
⁶A written cluster design document describes, from a technical perspective, exactly how you intend to configure the cluster
environment. The behavior of the environment should meet all requirements specified in ⁵.
⁷A written test plan allows you to test the cluster against the requirements (which describe what you were supposed to build) and
against the cluster design document (which describes what you intended to build). The test plan should be formatted in a way
which allows you to record the pass or failure of each test, as this lets you easily see what is broken and
eventually demonstrate that the cluster actually does what the users wanted it to do and what you intended it to do.
⁸Do not make the mistake of assuming that you have time to write the operational documentation once the cluster is in production.

Appendix A
Sample WPAR start, stop and monitor scripts for HACMP

wparstart.sh

#!/bin/ksh
# HACMP application server start script for the WPAR.

WPARname=WPARalex
LOG=/tmp/REC
STATEFILE=/wpars/$WPARname/cpr/alex.state
LOGFILE=/wpars/$WPARname/cpr/alex.log
CHECKPOINT=/wpars/$WPARname/CHECKPOINT

echo "Starting WPAR on $(date)" >> $LOG

# Check for the CHECKPOINT marker file on the WPAR NFS server
mount count:/wpars/slash /wpars/$WPARname

if [ -f "$CHECKPOINT" ]; then
    umount /wpars/$WPARname
    echo "checkpoint restoring ...."
    /opt/mcr/bin/restartwpar -d $STATEFILE \
        -o $LOGFILE $WPARname | tee -a $LOG
    # Remove the marker so that the next start is a normal start
    rm "$CHECKPOINT"
else
    # Release the temporary mount before starting the WPAR
    umount /wpars/$WPARname
    startwpar $WPARname
fi

exit 0

wparstop.sh

#!/bin/ksh
# HACMP application server stop script for the WPAR.
# Pass CLEAN as the first argument to stop the WPAR without a checkpoint.

WPARname=WPARalex
LOG=/tmp/REC
STATEFILE=/wpars/$WPARname/cpr/alex.state
LOGFILE=/wpars/$WPARname/cpr/alex.log
CHECKPOINT=/wpars/$WPARname/CHECKPOINT

echo "Stopping WPAR on $(date)" >> $LOG

# Remove any state directory and log left over from a previous checkpoint
if [ -d "$STATEFILE" ]; then
    rm -r "$STATEFILE"; rm -f "$LOGFILE"
fi

key=${1:-null}
if [ "$key" = "CLEAN" ]; then
    # may want to cleanly stop all applications running in the WPAR first!
    stopwpar -F $WPARname
else
    echo "checkpoint stopping..."
    /opt/mcr/bin/chkptwpar -d $STATEFILE \
        -o $LOGFILE -k $WPARname | tee -a $LOG
    # create a CHECKPOINT marker file on the WPAR NFS server
    mount count:/wpars/slash /wpars/$WPARname
    touch "$CHECKPOINT"
    umount /wpars/$WPARname
fi

exit 0

wparmonitor.sh

#!/bin/ksh
# HACMP custom application monitor for the WPAR.

LOG=/tmp/REC
WPARname=WPARalex
WPARNODE=zion

# get the state of this WPAR only (second colon-separated field);
# quoting $STATE below keeps the test valid when lswpar returns nothing
STATE=$(lswpar -c -q $WPARname | awk -F: '{print $2}')
if [ "$STATE" = "A" ]; then
    ping -w 1 -c1 $WPARNODE > /dev/null 2>&1
    if [ $? -eq 0 ]; then
        # Probably want to perform an application health check here!
        echo "WPAR is alive and well" | tee -a $LOG
        exit 0
    else
        echo "WPAR not responding" | tee -a $LOG
        exit 31
    fi
else
    echo "WPAR is not active - ($STATE)" | tee -a $LOG
    exit 33
fi

References
HACMP Web Pages:

http://www-03.ibm.com/systems/p/software/hacmp.html
http://lpar.co.uk

IBM Training Education:

HACMP System Administration I: Planning and Implementation, course code AU54G
HACMP System Administration II: Administration and Problem Determination, course code AU61G
HACMP System Administration III: Virtualization and Disaster Recovery, course code AU62G
System p LPAR and Virtualization II: Implementing Advanced Configurations, course code AU78G

IBM Redbooks:

IBM System p Advanced Power Virtualization Best Practice - REDP-4194
Implementing High Availability Cluster Multi-Processing Cookbook - SG24-6769
Introduction to Workload Partition Management in IBM AIX Version 6.1 - SG24-7431-00

For more information about HACMP, or to comment on this document please email:

hafeedbk@us.ibm.com

About the Author

Alex Abderrazag

Alex has worked for IBM since 1994 and has been part of IBM Training for the past
six years specializing in POWER technology, TCP/IP, Security and High Availability.
Alex has over 17 years experience working with UNIX® systems and has been actively
responsible for managing, teaching and developing the AIX/Linux® education
curriculum.

Version   Date         Author            Description

V03.00    Jan, 2008    Alex Abderrazag   Minor update to include section on WPAR.
V02.00    July, 2007   Alex Abderrazag   Major re-write.
V01.00    Feb, 2005    Tom Weaver        Document Created by HACMP development.

Special Thanks To: Tony 'Red' Steel, Bill Miller, Grant McLaughlin and Susan Schreitmueller for
reviewing this document.

© IBM Corporation 2007

IBM Corporation
Systems and Technology Group
Route 100
Somers, New York 10589

Produced in the United States of America
July 2007
All Rights Reserved

This document was developed for products and/or services offered in the United States. IBM may not offer
the products, features, or services discussed in this document in other countries.

The information may be subject to change without notice. Consult your local IBM business contact for
information on the products, features and services available in your area.

All statements regarding IBM future directions and intent are subject to change or withdrawal without
notice and represent goals and objectives only.

IBM, the IBM logo, AIX 5L, Micro-Partitioning, POWER, POWER4, POWER5, Power Architecture, System p,
HACMP, Tivoli are trademarks or registered trademarks of International Business Machines Corporation in
the United States or other countries or both. A full list of U.S. trademarks owned by IBM may be found at:
http://www.ibm.com/legal/copytrade.shtml.

Other company, product, and service names may be trademarks or service marks of others.

IBM hardware products are manufactured from new parts, or new and used parts. In some cases, the
hardware product may not be new and may have been previously installed. Regardless, our warranty
terms apply.

Photographs show engineering and design models. Changes may be incorporated in production models.

Copying or downloading the images contained in this document is expressly prohibited without the
written consent of IBM.

This equipment is subject to FCC rules. It will comply with the appropriate FCC rules before final delivery
to the buyer.

Information concerning non-IBM products was obtained from the suppliers of these products or other
public sources. Questions on the capabilities of the non-IBM products should be addressed with those
suppliers.

All performance information was determined in a controlled environment. Actual results may vary.
Performance information is provided "AS IS" and no warranties or guarantees are expressed or implied by
IBM. Buyers should consult other sources of information, including system benchmarks, to evaluate the
performance of a system they are considering buying.

When referring to storage capacity, 1 TB equals total GB divided by 1000; accessible capacity may be less.

The IBM home page on the Internet can be found at: http://www.ibm.com.

The IBM System p home page on the Internet can be found at: http://www.ibm.com/systems/p.

PSW03025-GBEN-01
