
Achieving High Availability Objectives

WHITE PAPER

In this white paper we will discuss the need for high availability and how it is defined and
measured. Then we’ll outline the shortcomings in common high availability designs and
describe a methodology to address those shortcomings.

Table of Contents
Introduction
Defining and measuring high availability
Supporting the four pillars of high availability
Achieving functional high availability
Conclusion

Introduction
In today’s IT environments, the need for the highest levels of availability is a well-established
principle. Businesses increasingly require immediate and continuous access to their information
systems, and regularly set up traditional high availability software clusters to meet
this business objective. This common solution, however, is often not enough to meet
business’s high availability objectives.

The purpose of this report is twofold. First, we will discuss the need for high availability
and how availability is defined and measured. Secondly, we will discuss how to weed out
the shortcomings in common high availability designs and describe a methodology to
address those needs.

This report is written from a technology independent perspective. It takes no bias towards
high availability software products, server, storage and network hardware vendors. Also, our
high availability discussion will focus primarily on open systems technology such as UNIX
servers and Windows 2000/Windows NT servers.

Defining and measuring high availability


What is high availability?
To begin with, it is important that we are all speaking the same language when we speak
of high availability (HA). For the sake of the discussion here in this report, we will use
the following definition:

An application environment is highly available if it possesses the ability to recover automatically
within a prescribed minimal outage window.¹

HA implies that no single point of failure (SPOF) exists in the application environment.
An SPOF is any software, hardware or environmental component that, if it should fail,
would take the application environment offline for an extended outage and require
human intervention to correct.

It is also important to note what HA is not. HA is not continuous access to the application
environment throughout failures. This area of availability, called continuous availability, is
addressed by such technologies as fault tolerant hardware, data center site redundancy, and
real-time remote data replication. Application environments requiring continuous
availability cannot sustain any kind of failure.

[Figure 1: causes of downtime. Bar chart of the percentage of service interruptions by cause:
software, hardware, human error, network, local environment, other. Source: Data Quest,
November 1999]

What is downtime?
The goal of high availability solutions is to automatically recover from functional downtime
within a minimal outage window. By downtime we mean a service interruption at
any layer of the application environment.

It is important to understand clearly what we mean by the application environment. The
application environment refers to all the hardware and software required to support a
function provided to the business from IT. Largely we think of this environment as the
servers, software, network and storage required for users to be able to execute a program.
For example, the application environment of a web-based inventory application
would consist of the following:
• Employee and their workspace
• Desktop server
• All LAN connections
• Presentation server (http server)
• Application server
• Database server
• Operating System (at each level)
• Application software (at each level)
• Data storage subsystems for each server

Downtime is a service interruption at any or all of these layers.

¹ The minimal outage window depends on the critical nature of the business being executed. Generally these windows are from 3 to 7 minutes.

As mentioned above, downtime must be considered from a functional, or user, perspective.
Put another way, anything that keeps a user from being able to conduct IT-supported
business is functional downtime. It is important to note what a user means
here. A user is any person or system that executes business functions with the designated
application environment. Users could be employees, customers, business partners, or
other IT application environments.

What causes downtime?


Traditional high availability software packages typically focus their attention solely on
hardware related failures. Yet a 1999 Data Quest study reported that only 23 percent of
service interruptions were caused by hardware related failures. (See Figure 1.)

The study showed that fully 27 percent of service interruptions were software related.
From this statistic alone, we can gather that installing any solution that addresses only
hardware related failures would be incomplete.
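
As a rough illustration of this point, the following Python sketch adds up the Figure 1 percentages to show how much of the reported downtime a given solution could address at best. The function name `share_addressed` is hypothetical, and the sketch assumes, optimistically, that a solution eliminates every interruption in the categories it covers.

```python
# Shares of service interruptions by cause (percent), as reported in Figure 1
# (Data Quest, November 1999).
DOWNTIME_CAUSES = {
    "software": 27,
    "hardware": 23,
    "human error": 18,
    "network": 17,
    "local environment": 8,
    "other": 7,
}


def share_addressed(covered_causes):
    """Return the percentage of interruptions whose cause is in covered_causes.

    Assumption for illustration only: a solution that covers a category
    removes all interruptions in that category.
    """
    return sum(DOWNTIME_CAUSES[cause] for cause in covered_causes)


hardware_only = share_addressed(["hardware"])
print("Addressed by a hardware-only solution:", hardware_only, "percent")
print("Left unaddressed:", 100 - hardware_only, "percent")
```

Even under this generous assumption, a hardware-only solution leaves most of the reported causes of downtime untouched.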
Not surprisingly, software monitoring utilities have recently become more commonplace,
suggesting that software related failures are being acknowledged and addressed
more frequently.

Other sources of downtime included human error (18 percent), network issues (17
percent), local environment issues (8 percent), and other issues (7 percent). True high
availability can only be achieved when consideration is given to all areas that may cause
downtime.

What is the cost of downtime?
Now that we understand what is meant by downtime, the first step in planning for high
availability is to understand the exposure posed to a company by the interruption of the
application environment. It is often prudent to quantify the cost of downtime per hour of
that environment. Quantifying the cost of downtime is helpful as it clearly and concisely
details the risk a company faces. Deploying high availability solutions is one way of
mitigating that risk. Simply put, if you know how much you can lose, you know how much to
spend in prevention.

The cost of downtime varies tremendously by industry. A study recently published by the
Meta Group puts the cost of downtime for many common industries anywhere between
$0.5 million per hour and $3 million per hour (see Figure 2). These figures are based on
an entire IT operations center being off-line. However, outages of a single server can
run to several thousand dollars per hour. These figures are just to be used as an
example to show that the revenue lost in an outage is substantial and should be
investigated on an application environment basis for each company.

Figure 2: cost of downtime
Source: Meta Group, Individual.com, October 2000

Industry                $ per hour
Energy                  $3M
Telecommunications      $3M
Finance                 $1.5M
IT-dependent mfg.       $1.5M
Healthcare              $0.5M
Media                   $0.5M
Hospitality/Travel      $0.5M
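
To make the idea of quantifying exposure concrete, here is a minimal Python sketch that multiplies an hourly cost of downtime by an expected number of outage hours per year. The hourly figures are the Figure 2 values, which apply to an entire IT operations center being off-line. The function name and the four-hour outage estimate are hypothetical and chosen only for illustration. Indirect costs are not included.

```python
# Hourly cost of downtime by industry, in dollars (Figure 2, Meta Group).
COST_PER_HOUR = {
    "Energy": 3_000_000,
    "Telecommunications": 3_000_000,
    "Finance": 1_500_000,
    "IT-dependent mfg.": 1_500_000,
    "Healthcare": 500_000,
    "Media": 500_000,
    "Hospitality/Travel": 500_000,
}


def annual_downtime_exposure(cost_per_hour, expected_downtime_hours):
    """Direct cost of the expected yearly downtime, in dollars."""
    return cost_per_hour * expected_downtime_hours


# Hypothetical estimate: four hours of full outage per year.
EXPECTED_OUTAGE_HOURS = 4

for industry, hourly_cost in COST_PER_HOUR.items():
    exposure = annual_downtime_exposure(hourly_cost, EXPECTED_OUTAGE_HOURS)
    print(f"{industry}: ${exposure:,} per year")
```

A figure of this kind can serve as a rough ceiling when deciding how much to spend on prevention for a given application environment.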

Any accurate cost of downtime study must also consider the indirect costs associated
with service interruptions. It is difficult to translate these costs into loss per hour of
downtime, but to deny that such intangibles contribute to cost is shortsighted. Examples
of these intangible costs include decreased customer satisfaction, penalties for failure to
meet service level agreements, or legal liability associated with failure to provide service.
This is especially relevant in the healthcare and financial industries.

A formal Business Impact Analysis (BIA) may be an appropriate means to calculate the
cost of downtime. These types of studies can be conducted to outline all risks associated
with service disruption, but they also go further than strictly investigating IT infrastructure.
A BIA is a more robust study that gives a clear understanding of the exposures faced by a company.

The cost of availability


Once we have quantified the cost of downtime we can then evaluate availability technology
on the market. Different recovery and availability objectives dictate different costs;
these range from large outage window, low-cost solutions such as offsite backup tape
storage, to small outage window, high-end solutions such as remote disk mirroring and
multiple data centers. Figure 3 shows that the initial cost of narrowing our recovery
windows grows substantially as we approach fault tolerant solutions.

Figure 3: cost of availability
[Chart of cost (vertical axis) against recovery time (horizontal axis). Solutions as listed in
the figure, from redundant sites (high cost, small outage window) down to weekly off-site tape
copies (low cost, large outage window):]
• Redundant sites
• Hot site disk mirroring
• Hot site remote tape vaulting
• Local high availability clusters
• Local disk copies
• Local tape vaulting
• Daily tape copy off site
• Weekly tape copy off site

How do you calculate expected uptime?


In today’s market it is very much in vogue to brag about the number of “nines” of availability
your hardware solution demonstrates. For instance, a data storage subsystem may
be built with tremendous redundancy so that it is capable of 99.999 percent uptime.
In a 24x7x365 shop, 99.999 percent translates to about 5.25 minutes of downtime per
year. However, your entire application environment likely does not run solely on your
data storage subsystem. You likely have LAN connections and multiple server layers, as
well as storage subsystems. All of these have an impact on your uptime.

For the sake of argument, let’s enhance the above application environment to include a
realistic scenario. Figure 4 describes a typical application environment, with each element’s
anticipated uptime/downtime. This is a rough number of expected downtime
within the environment per year. The most important point to note is that these availability
numbers assume that all other criteria for successful high availability solutions
have been met. The next section describes these criteria and how they can dramatically
affect functional availability of an application environment.

Figure 4: expected downtime in a typical application environment

Layer                              Uptime     Downtime per year
All LAN connections                99.9%      8.76 hours
Presentation server                99.2%      70.08 hours
Application server                 99.7%      26.28 hours
Database server                    99.995%    26.25 minutes
Database application software      99.3%      61.32 hours
Data subsystem for database only   99.999%    5.25 minutes
Total                                         167 hours
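
The arithmetic behind Figure 4 can be sketched in a few lines of Python. The uptime percentages are taken from the figure; the function names are hypothetical. Summing the per-layer downtime, as the figure does, treats the outages as non-overlapping. The `combined_availability` function shows an alternative estimate that multiplies the layer availabilities, which assumes (an assumption not stated in the figure) that every layer is required and that layers fail independently. Exact arithmetic gives roughly 26.28 and 5.26 minutes for the two most available layers, so the results differ slightly from the rounded values in the figure.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours in a 24x7x365 shop

# Anticipated uptime per layer, in percent (Figure 4).
LAYER_UPTIME_PERCENT = {
    "All LAN connections": 99.9,
    "Presentation server": 99.2,
    "Application server": 99.7,
    "Database server": 99.995,
    "Database application software": 99.3,
    "Data subsystem for database only": 99.999,
}


def downtime_hours_per_year(uptime_percent):
    """Expected yearly downtime, in hours, for one layer."""
    return (100.0 - uptime_percent) / 100.0 * HOURS_PER_YEAR


def total_downtime_hours(layers):
    """Sum of per-layer downtime, as in the Total row of Figure 4."""
    return sum(downtime_hours_per_year(pct) for pct in layers.values())


def combined_availability(layers):
    """Availability of the whole environment as a fraction between 0 and 1.

    Assumption: all layers are needed and fail independently, so the
    individual availabilities multiply.
    """
    result = 1.0
    for pct in layers.values():
        result *= pct / 100.0
    return result


for layer, pct in LAYER_UPTIME_PERCENT.items():
    print(f"{layer}: {downtime_hours_per_year(pct):.2f} hours per year")

print(f"Total: {total_downtime_hours(LAYER_UPTIME_PERCENT):.0f} hours per year")
print(f"Combined availability: {combined_availability(LAYER_UPTIME_PERCENT):.4%}")
```

Either way, the environment as a whole delivers far fewer "nines" than its most reliable component.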

Supporting the four pillars of high availability


When you look at all the potential causes of downtime and the extraordinary costs that
those outages can bring, you realize that successfully achieving your availability objectives
is critical and complex. Standard high availability software packages only begin the
process of addressing complete functional availability needs.

The technology of the hardware and software is simply not enough. We must also
address other areas that affect availability such as:
• An adequate and well trained staff
• Change management policies and problem determination policies that are detailed and
specific. These policies must be known and respected by all the staff
• Adequate environment monitoring tools
• Successful backup/recovery and disaster recovery tools and plans

If any one of these areas is not adequately addressed the availability of the application
environment will be in jeopardy.

We call these different areas of application environment support “pillars,” and categorize
them into four groups: infrastructure, business contingency, support services, and operations.

We take the approach that availability objectives are achieved with a combination of
hardware and software technology brought together by a philosophy of availability.

Simply put, the philosophy is as follows:

High Availability of an application environment is achieved when all pillars of that environment
are adequately supported.

[Figure 5: pillars of high availability. The Business Application rests on four pillars:
Business Continuity, Support Services, Infrastructure, and Operations.]

In some capacity every application environment contains the four components we list as
pillars. However, it is a matter of opinion as to which items are placed in which pillar.
Often the items within each pillar address a multitude of application environments.
What is crucial is that all items that affect availability are placed in pillars for examination.
Understanding these pillars and shoring up any weaknesses provides a solid foundation for
addressing availability effectiveness.

The infrastructure pillar


We divide the infrastructure pillar into three parts. The first area focuses on the hardware
and software associated with an application environment. These are largely the
servers, the operating systems, databases, specific availability software solutions and
other relevant applications.

The second major area of the infrastructure pillar is the shared storage infrastructure. While
technically another component of the hardware solution, the shared storage really stands
alone as a crucial piece of the overall availability solution. This area of the pillar requires
focus on the storage hardware technologies such as enterprise storage arrays and their
associated data management software. These tools can move data from one storage device to
another or provide for real time mirror copies, both in local as well as remote locations.
Also important to the storage infrastructure is networked storage such as the storage area
network (SAN), the network attached storage (NAS), and the IP storage solutions.

The last part of the infrastructure pillar is the physical environment in which the solution
is housed. This refers to environmental conditions such as raised floor space, proper
cooling, independent power circuits and placement of servers in racks and on floors.

The business contingency pillar


The business contingency pillar is designed to focus attention on the technology of
restarting business once it has been interrupted. This restart normally comes in the form
of a manual effort such as restoring data in a local recovery solution, or restoring a
production operating environment in the form of business continuance.

This pillar covers two major areas: local backup and recovery solutions, and business
continuance solutions. The local backup and recovery solution has obvious influence on
application availability as nearly all applications have a significant data impact. Focus here
is on the use of the technology and the retention policies of the data.

The business continuance, or disaster recovery, solution is also closely coupled with the
availability of application environments. The focus here is on the use of technology,
information from reports such as a business impact analysis, and execution of disaster
recovery testing.

The support services pillar


The support services pillar focuses primarily on two areas: security and networks. It is
clear that lack of sufficient security can negatively affect application availability. This area
should focus on the use of firewalls and intrusion detection software. In addition, policies
concerning password management and server access must be closely examined.

Networks and connectivity are also critical to application availability. Areas such as
redundancy in the network architecture and throughput analysis should be investigated
to understand their influence on the ability to execute a business process. Another key
piece of network support services is the ability to quickly diagnose and repair network
related issues. Critical to this success are detailed diagrams of all network segments;
these should be regularly updated and distributed to support teams.

The operations pillar


The final pillar affecting availability is the operations pillar. This pillar covers all areas
pertaining to the routine day to day management of the application environment. These
areas include system administration, problem management, change management, 24x7
monitoring of the environments and compliance with business service level agreements.

Achieving functional high availability


Once the pillars of availability are defined in a given application environment, careful
consideration must be given to all areas that can negatively affect availability. These areas
should be investigated to determine if they are sufficient to meet the application
environment’s availability objectives. If specific areas leave holes in the availability umbrella,
action can be taken to correct the shortcomings.

Perform an assessment of availability effectiveness

The best way to ensure availability objectives are being met is to perform an availability
effectiveness assessment of the application environment. This investigation should be
conducted through a series of server and environment interrogations and interviews
with key staff. The investigation should study each of the four pillars in three different
dimensions: tools, staff, and procedures (see Figure 6).

[Figure 6: supporting the pillars of high availability. High Availability rests on four pillars
(Business Continuity, Support Services, Infrastructure, Operations), each examined in three
dimensions: Tools, Staff, Procedures.]

In general the tools of a pillar refer to the hardware and software components installed
to meet specified technology needs. We must discover whether tools exist to support
this pillar, whether they are used or known, and whether the current tool is effective in
supporting the pillar. A critical tool to investigate is the presence of customized diagrams
and documentation that clearly depict the application environment. These can be server,
network and storage configurations.

The staff associated with a pillar consists of the employees, the managers and the consultants
needed to support that pillar. This staff must be adequately trained and adequate in
number. For example, it is never a good idea to have only a single person who is capable of
providing system administration duties for critical servers. The staff must be well supported
and represented by management. And they must have adequate training in technology
supporting future IT initiatives. Overall, to support functional high availability,
critical staff should be self-sufficient. Contracted remote monitoring services are beneficial
for supplementing and aiding critical staff, but avoid dependence upon outside
groups and contractors for critical functions.

The procedures associated with the support of a pillar should focus on how and why
technology is used to meet availability objectives. Most importantly, they must be documented
and known to all. Far too often we allow smart minds to contain far too much
critical information without asking them to write it down. These procedures should
be clear so that even the simplest of minds can follow them. All should be educated
on, and instructed to follow, the procedures. Lastly, all procedures should constantly
be evolving, or regularly reviewed and updated.

The availability effectiveness assessment should provide feedback in two fundamental
ways. First, there should be a quantitative analysis of the findings. This can be nothing
more than a score for each dimension of each pillar. A more detailed report would
describe a numerical response to a constant set of questions. These scores can then be
weighted by their importance in an application environment. The score should reflect
one of three fundamental categories affecting availability: 1. An item does not exist, 2.
An item exists but is insufficient, or 3. An item exists and is sufficient. For example, a
standard question to ask in an availability effectiveness assessment is whether the company
has a change management policy to govern these critical servers. If you ask that question
of different IT staff you may get different answers. The manager who authored the
policy may reply the policy exists and is sufficient (score of 3). However, the half dozen
people who would regularly use the policy may state it does not even exist (score of 1).

This quantitative score can then be compared to a perfect score, or if similar questions are
asked, the score of another application environment. A quick overview of such scores can
show whether deficiencies exist in tools, staff, or procedures within a particular pillar.
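
One possible way to tabulate such scores is sketched below in Python. The 1 to 3 scale and the pillar and dimension names come from this paper. Averaging the answers of several interviewees, the weighting scheme, and all function names are assumptions made for illustration, not a prescribed scoring method.

```python
from statistics import mean

PILLARS = ("infrastructure", "business contingency", "support services", "operations")
DIMENSIONS = ("tools", "staff", "procedures")

# 1 = item does not exist, 2 = exists but is insufficient, 3 = exists and is sufficient
PERFECT_SCORE = 3


def question_score(responses):
    """Average the 1-3 answers given by all interviewees to one question."""
    return mean(responses)


def disagreement(responses):
    """Spread between the highest and lowest answer to one question."""
    return max(responses) - min(responses)


def percent_of_perfect(scores, weights=None):
    """Compare weighted scores to a perfect score.

    scores:  {(pillar, dimension): score between 1 and 3}
    weights: optional {(pillar, dimension): importance}, defaulting to 1.0
    """
    weights = weights or {}
    achieved = sum(score * weights.get(key, 1.0) for key, score in scores.items())
    best = sum(PERFECT_SCORE * weights.get(key, 1.0) for key in scores)
    return 100.0 * achieved / best


# The change management example: the manager answers 3, six staff members answer 1.
change_policy_answers = [3, 1, 1, 1, 1, 1, 1]

scores = {
    ("operations", "procedures"): question_score(change_policy_answers),
    ("operations", "tools"): 3,  # hypothetical value
}

print(f"Change management score: {question_score(change_policy_answers):.2f}")
print(f"Disagreement among answers: {disagreement(change_policy_answers)}")
print(f"Operations pillar, percent of perfect: {percent_of_perfect(scores):.1f}")
```

A large disagreement value on a single question is itself a finding worth carrying into the qualitative part of the assessment.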

The second crucial method for providing feedback should be a qualitative approach.
The person evaluating the availability effectiveness based on interviews should draft
this. It should report on responses to prepared questions, especially when those
responses differ from person to person. For example, an employee may report that a
backup and recovery tool exists, but is totally worthless. On the other hand, a manager
who is asked the same question might reply that the backup and recovery tool
exists and completely meets their needs.

In a qualitative evaluation it is also important to note what is not said. If interviewees
consistently avoid discussing a peer’s ability to manage the systems, further investigation
may be required to determine whether that person is competent.

Finally, the findings from an availability effectiveness assessment should be communicated to
those people responsible for the availability of an application environment. This is often the
Chief Information Officer, or IT director. This can be done by creation of a formal report or
presentation. It is important that these results are documented so that follow-up investigations
can be performed to see if these problems still exist or if they have been alleviated.



Focus on shoring up the weaknesses found in each pillar
Based on the reported findings of an availability effectiveness assessment, plans can be
put in place to address the exposed shortcomings. These rollout plans should become
part of the overall high availability implementation project. They should be documented,
tested and implemented in conjunction with the chosen high availability technology.

Since the hardware and software technology associated with high availability is fairly well
understood, the changes required to improve availability effectiveness often do not
require the capital acquisition of technology. Rather, they require proper creation and
management of policies and procedures. It is often most effective if these policies are
created and managed internal to a company, as full-time employees tend to have the best
insights into what will be fruitful solutions.

In general, it is most effective to first address issues that are most important to an
application environment. For example, having clustering software installed and running without
a trained staff to support it can often affect system availability more negatively than not
having clustering software at all.

Conclusion
In nearly all aspects of today’s business world the availability of the underlying IT
infrastructure is crucial. Even being off-line for a short time can have a tremendous effect on
a company’s health and economic viability. But these outages cannot be prevented
simply by a technology solution. People, policy and procedures can have a far
more significant impact on availability. Functional high availability can only be achieved
through an effort to investigate all areas that can stifle the business transaction.

If any company is considering deploying high availability solutions, it is also critical that
they consider availability from the users’ perspective. Specifically, an investigation should
be performed to understand how effective a solution would be in terms of user interaction
and satisfaction.

CNT offers highly trained professionals to perform availability effectiveness assessments
and design solutions that provide the highest levels of functional availability. These
assessments can be structured to pinpoint holes in existing high availability architectures as
well. This gap between the business needs for IT availability and an organization’s ability
to meet those needs must be bridged. CNT follows up each assessment with a comprehensive
plan to implement solutions that will help your company meet its functional
high availability needs.

CNT is one of the world’s largest providers of comprehensive storage networking solutions.
For over 20 years, our experts have analyzed, designed, and built enterprise storage networks.
Visit www.cnt.com to learn about our solutions, products, partnerships, career opportunities,
and more.

USA: 1-800-638-8324, Canada: 905-595-1500, UK: 44-1753-792400, France: 33-1-4130-1212,
Australia: 61-2-9540-5486, Germany: 49-89-42 74 11-0, Switzerland: 41-1-73 35-733,
Belgium: 32-2-737 76 42, Italy: 39-06-51 49 31, Brazil: 55-11-5509-1504,
Japan: 813-5403-4858, Other locations: 1-763-268-6000

© 2003 by Computer Network Technology Corporation (Nasdaq: CMNT). All rights reserved. Any
reproduction of these materials without the prior written consent of CNT is strictly
prohibited. CNT, the CNT logo, Channelink, and UltraNet are registered trademarks of
Computer Network Technology Corporation. All other trademarks identified herein are the
property of their respective owners. CNT is an equal opportunity employer. CNT corporate
headquarters’ QMS is registered to ISO 9001: 2000. Certificate #006765. PL581 | 0803