
Availability & Related Concepts

High Availability; Calculation of Availability; Components of Availability; Characteristics; Rules; Seven Rs


Dr. Neelu J. Ahuja, College of Engineering Studies

Knowing Availability

Availability is the percentage of total time that a network, system, or service is available for use.

What is HA (High Availability)?

A network, system, or service with specific design elements intended to keep availability above a high threshold (e.g., 99.99%).

High Availability.

It is a measure of the probability that a service is available for use at any given instant. A system is considered highly available if it has an uptime of five nines (99.999%). Availability is basically a function of:

System Reliability
System Reparability
System Redundancy

RR to achieve HA

Rapid Recovery (RR) systems are another way to achieve high availability: a network, system, or service with specific design elements intended to recover from downtime very quickly. The delay is kept as small as possible so that users are not inconvenienced. The acceptable recovery time varies with the kind of application.

What Is System Reliability?

It is a measure of continuous system uptime in the absence of failure. A system is said to be highly reliable if it has a high mean time between failures (MTBF). This is also called MTBSIF, mean time between service-impacting failures, a term taken from the telecom industry.

What Is System Reparability?

It is a measure of how quickly a failed device or system can be restored to service. Reparability is measured as mean time to repair (MTTR). The less reliable the system, the greater the need for a low MTTR to support overall system availability.

What Is System Redundancy?

Redundancy augments the reparability of individual components by establishing a backup or standby. This means that there are multiple resources providing the same service. The effectiveness of redundancy is a function of how quickly a backup component can be brought into service.

Availability

$$\text{Availability} = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}}$$

High Availability = High MTBF or Low MTTR
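
As a worked illustration of the formula, the sketch below computes availability from MTBF and MTTR. The function name and the sample figures (a failure every 1,000 hours, repaired in 10 hours or in 1 hour) are assumptions chosen for illustration, not values from these slides.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Return availability as a fraction between 0 and 1."""
    if mtbf_hours < 0 or mttr_hours < 0 or mtbf_hours + mttr_hours == 0:
        raise ValueError("MTBF and MTTR must be non-negative and not both zero")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical component that fails every 1,000 hours on average.
slow_repair = availability(1000, 10)  # 10-hour repairs
fast_repair = availability(1000, 1)   # 1-hour repairs
print(f"{slow_repair:.4%} vs {fast_repair:.4%}")
```

With the same MTBF, cutting the repair time raises availability, which is the "low MTTR" half of the rule above.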

Calculating Availability

Availability can be measured directly through periodic polling. Polling is a communications control method used by some computer/terminal systems in which a "master" station asks each of the devices attached to a common transmission medium, in turn, whether it has information to send. Common ways to do this are SNMP (Simple Network Management Protocol) and Nagios.
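
A minimal sketch of how poll results turn into an availability figure. It does not use SNMP or Nagios; the `is_up` callback, the target names, and the sample history are hypothetical stand-ins for whatever check a real monitor performs.

```python
from typing import Callable, Dict, Sequence


def poll(targets: Sequence[str], is_up: Callable[[str], bool]) -> Dict[str, bool]:
    """One polling round: ask each target in turn whether it is up."""
    return {name: is_up(name) for name in targets}


def measured_availability(poll_results: Sequence[bool]) -> float:
    """Fraction of polls in which the target answered."""
    if not poll_results:
        raise ValueError("at least one poll result is required")
    return sum(poll_results) / len(poll_results)


# Hypothetical history for one device: 1 missed poll out of 20.
history = [True] * 19 + [False]
print(f"{measured_availability(history):.1%}")
```

The estimate is only as fine-grained as the polling interval: an outage shorter than the gap between two polls can go unnoticed.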

Using SNMP

Simple Network Management Protocol (SNMP) is an application-layer protocol that facilitates the exchange of management information between network devices. It is part of the Transmission Control Protocol/Internet Protocol (TCP/IP) protocol suite. SNMP enables network administrators to manage network performance, find and solve network problems, and plan for network growth.

Nagios

Nagios is a host and service monitor designed to inform administrators of network problems before clients, end users, or managers encounter them. It was initially designed to run under the Linux operating system, but works fine under most *NIX variants as well. The monitoring daemon runs intermittent checks on hosts and services, which return status information to Nagios. When problems are encountered, the daemon can send notifications to administrative contacts in a variety of ways (email, instant message, SMS, etc.).

Calculating Availability
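These tools are used to measure the availability of the system in question. A formula for predicting the availability of a single component is

$$\text{Availability} = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}} \quad \text{or} \quad \text{Availability} = 1 - \frac{\text{MTTR}}{\text{MTBF} + \text{MTTR}}$$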
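
The two forms give the same value. A short derivation, taking the repair-time term in both forms to be the mean time to repair (MTTR):

```latex
1 - \frac{\text{MTTR}}{\text{MTBF} + \text{MTTR}}
  = \frac{(\text{MTBF} + \text{MTTR}) - \text{MTTR}}{\text{MTBF} + \text{MTTR}}
  = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}}
```

The second form reads as "one minus the fraction of time spent in repair", which is often the easier quantity to estimate from outage records.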

Components of Availability

Data centre facility
Server hardware (processor, memory, communication channels)
Server system software (OS, program products such as SDKs, WebSphere software, SOA packs)
Application software (programs, DBMS, etc.)
Disk hardware (data files, control files)

Contd.

Database software (data files, data dictionary files)
Network software (protocols, network component drivers)
Network hardware (controllers, hubs, routers, repeaters, modems, etc.)
Desktop software (office suite, general-purpose applications)
Desktop hardware (processors, memory, disk, interface cards)

For Good Availability

Characteristics by priority

High priority:
Knowledge of systems software and components
Knowledge of network software and components
Knowledge of database systems
Knowledge of power and air conditioning systems
Ability to think and act tactically
Knowledge of software configuration
Knowledge of hardware configuration

Medium priority:
Knowledge of backup systems
Knowledge of desktop hardware and software
Knowledge of applications
Ability to work effectively with developers
Ability to communicate effectively with IT executives

Low priority:
Ability to manage diversity
Ability to think and plan strategically

Rule of Nines Availability

No. of Nines   % Availability   Weekly Hours Down   Weekly Minutes Down   Missed Events out of 10,000   Missed Events out of 1,000,000
1              90.000           10.000              600.00                1,000.0                       100,000
2              99.000            1.000               60.00                  100.0                        10,000
3              99.900            0.100                6.00                   10.0                         1,000
4              99.990            0.010                0.60                    1.0                           100
5              99.999            0.001                0.06                    0.1                            10

Note: the downtime columns correspond to a 100-hour service week (10% of 100 hours is 10 hours). Over a full 168-hour week, one nine allows 16.8 hours of downtime.
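
The table rows follow directly from the number of nines. The sketch below recomputes them; the function names are illustrative, and the `week_hours` parameter is an assumption added so the slide's figures (which match a 100-hour week, e.g. 600 minutes for one nine) can be compared with a full 168-hour week.

```python
def availability_for_nines(nines: int) -> float:
    """Availability percentage for a number of nines (1 -> 90.0, 3 -> 99.9)."""
    if nines < 1:
        raise ValueError("nines must be at least 1")
    return 100.0 * (1 - 10 ** -nines)


def weekly_downtime_minutes(nines: int, week_hours: float = 168.0) -> float:
    """Allowed downtime per week, in minutes, for the given service week."""
    if nines < 1:
        raise ValueError("nines must be at least 1")
    return week_hours * 60 * 10 ** -nines


# Columns: nines, % availability, minutes down (100-hour week), minutes down (168-hour week)
for n in range(1, 6):
    print(n,
          f"{availability_for_nines(n):.3f}",
          f"{weekly_downtime_minutes(n, week_hours=100):.2f}",
          f"{weekly_downtime_minutes(n):.2f}")
```

Each extra nine divides the allowed downtime by ten, which is why the step from four to five nines is so much harder to engineer than the step from one to two.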

Business Continuity

During periods of unavailability, the main goal is to keep the business going. This is technically referred to as Business Continuity. Business Continuity is the preparation for, response to, and recovery from an application outage that adversely affects business operations. Business Continuity solutions address systems unavailability, degraded application performance, or unacceptable recovery strategies.

Why Assure Availability?

Know the downtime costs (per hour, day, two days...)

Lost productivity: number of employees impacted × hours out × hourly rate
Lost revenue: direct loss, compensatory payments, lost future revenue, billing losses, investment losses
Damaged reputation: customers, suppliers, financial markets, banks, business partners
Financial performance: revenue recognition, cash flow, lost discounts (A/P), payment guarantees, credit rating, stock price
Other expenses: temporary employees, equipment rental, overtime costs, extra shipping costs, travel expenses...
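
The lost-productivity item above is a simple product, and it can be combined with the other cost categories into a rough per-outage estimate. This is a sketch only: the way the categories are added together, the parameter names, and every figure in the example are assumptions for illustration, not data from these slides.

```python
def lost_productivity(employees: int, hours_out: float, hourly_rate: float) -> float:
    """Number of employees impacted x hours out x hourly rate."""
    return employees * hours_out * hourly_rate


def downtime_cost(hours_out: float, employees: int, hourly_rate: float,
                  lost_revenue_per_hour: float = 0.0,
                  other_expenses: float = 0.0) -> float:
    """Rough outage cost: lost productivity + lost revenue + other expenses."""
    return (lost_productivity(employees, hours_out, hourly_rate)
            + lost_revenue_per_hour * hours_out
            + other_expenses)


# Hypothetical business: 200 staff at 40 per hour, 5,000 per hour in lost revenue.
for hours in (1, 24, 48):  # per hour, per day, per two days
    cost = downtime_cost(hours, employees=200, hourly_rate=40.0,
                         lost_revenue_per_hour=5000.0)
    print(hours, cost)
```

Reputation damage and effects on financial performance are left out because they rarely reduce to an hourly rate.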

Statistics showing loss due to unavailability (loss in millions of US dollars)

Retail                             1.1
Insurance                          1.2
Information technology             1.3
Financial institutions             1.5
Manufacturing                      1.6
Call location                      1.6
Telecommunications                 2.0
Credit card sales authorization    2.6
Energy                             2.8
Point of sale                      3.6
Retail brokerage                   6.5

Source: Meta Group, 2005

Data availability on systems is crucial.

Disruptors of Data Availability

Disaster (<1% of occurrences): natural or man-made
Flood, fire, earthquake
Contaminated building

Unplanned occurrences (13% of occurrences): failure
Database corruption
Component failure
Human error

Planned occurrences (87% of occurrences): competing workloads
Backup, reporting
Data warehouse extracts
Application and data restore

BCP is done to handle business continuity during unavailability.

BCP stands for Business Continuity Planning. It is a plan created to ensure business continuity in adverse situations. Its broad objectives are:
1. Identifying the mission or critical business functions
2. Collecting data on current business processes
3. Assessing, prioritizing, mitigating, and managing risk: risk analysis and business impact analysis (BIA)
4. Designing and developing contingency plans and a disaster recovery plan (DR plan)
5. Training, testing, and maintenance

Seven Rs of Ensuring High Availability

Redundancy
Reputation
Reliability
Reparability
Recoverability
Responsiveness
Robustness

Redundancy

Redundancy augments the reparability of individual components by establishing a backup or standby. This means that there are multiple resources providing the same service. The effectiveness of redundancy is a function of how quickly a backup component can be brought into service.

Reputation

An organization earns its reputation from the availability standards that it maintains. Unavailability seriously affects the reputation of the business, and the damage may have serious cascading effects on the overall setup.

Reliability

The reliability of hardware and software can be verified from customer references and industry analysts. Beyond that, consider performing an Empirical Component Reliability Analysis, which consists of the following steps:
Review and analyze problem management logs.
Review and analyze supplier logs.
Acquire feedback from operations personnel.
Acquire feedback from support personnel.
Acquire feedback from supplier repair personnel.
Compare experiences with other shops.
Study reports from industry analysts.

More

An analysis of problem logs should reveal any unusual patterns of failure. The logs should be broken down by supplier, product, and using department, and examined for the day and time of failures, the frequency of failures, and the time to repair. Suppliers often keep onsite repair logs that can be perused to conduct a similar analysis.
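
A minimal sketch of such a log analysis, grouping failures by one field and reporting how often they occur and how long repairs take. The log layout, supplier names, and component names are hypothetical; a real problem management system will have its own export format.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical log entries: (supplier, component, weekday, hours_to_repair)
problem_log = [
    ("SupplierA", "router-1", "Mon", 2.0),
    ("SupplierA", "router-1", "Mon", 1.0),
    ("SupplierB", "disk-7", "Thu", 6.0),
    ("SupplierA", "switch-3", "Fri", 3.0),
]


def summarize(log, key_index):
    """Group entries by one field; return {key: (failure_count, mean_repair_hours)}."""
    repair_times = defaultdict(list)
    for entry in log:
        repair_times[entry[key_index]].append(entry[3])
    return {key: (len(times), mean(times)) for key, times in repair_times.items()}


print(summarize(problem_log, 0))  # by supplier
print(summarize(problem_log, 2))  # by weekday
```

Grouping the same entries by different fields is what exposes patterns such as one component failing repeatedly on the same day of the week.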

More.

Feedback from operations personnel, especially offsite operators, is often candid and can be revealing as to how components truly perform. For example, operators may be doing numerous resets on a particular network component every morning prior to startup, but they may not bother to log these activities since the network always comes up. Conversations with support personnel such as systems administrators, network administrators, and database administrators may elicit similar revelations.

More.

There may be bias when canvassing a supplier's repair personnel about the true reliability of their products, but these people can be just as candid and revealing as the people using the product. This makes them another valuable source of information for evaluating component reliability. Yet another is comparing experiences with other shops or setups, particularly shops that are closely aligned with the organization's way of working, use similar platforms and configurations, and offer similar services. Customers can be especially helpful too. Reports from reputable industry analysts can also be used to predict component reliability.

Reparability

Reparability is the relative ease with which service technicians can resolve or replace failing components. Two common metrics used to evaluate this trait are how long it takes to do the actual repair, and how often the repair work needs to be repeated. In more sophisticated systems, initial repair work can be done from remote diagnostic centers where failures are detected, circumvented, and arrangements made for permanent resolution with little or no involvement of operations personnel.

Recoverability

This refers to the ability to overcome a momentary failure in such a way that there is no impact on end-user availability. It could be as small as a portion of main memory recovering from a single-bit memory error, or as large as an entire server system switching over to its standby system with no loss of data or transactions. Recoverability also includes retries of attempted reads and writes out to disk or tape, as well as retries of transmissions down network lines.
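
The retry behaviour described above can be sketched as a small wrapper that repeats a failed operation a few times before giving up. The helper name, the choice of `OSError` as the transient error type, and the simulated flaky read are all assumptions for illustration.

```python
import time


def with_retries(operation, attempts=3, delay_seconds=0.0):
    """Call operation(); on a transient error, retry up to `attempts` times in total."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except OSError as error:  # e.g. a failed read, write, or transmission
            last_error = error
            time.sleep(delay_seconds)
    raise last_error


# Hypothetical flaky read: fails twice, then succeeds.
calls = {"count": 0}


def flaky_read():
    calls["count"] += 1
    if calls["count"] < 3:
        raise OSError("transient read error")
    return b"payload"


print(with_retries(flaky_read))
```

From the caller's point of view the read simply succeeds, which is the sense in which a recovered failure has no impact on end-user availability.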

Responsiveness

Responsiveness is the sense of urgency that all people involved with high availability need to exhibit. This includes having well-trained suppliers and in-house support personnel who can respond to problems quickly and efficiently. It also pertains to how quickly the automated recovery of resources such as disks or servers can be enacted.

Robustness

A robust process is able to withstand a variety of forces, both internal and external, that could easily disrupt and undermine availability in a weaker environment. Robustness puts a high premium on documentation and training to withstand technical changes related to platforms, products, services, and customers; personnel changes related to turnover, expansion, and rotation; and business changes related to new direction, acquisitions, and mergers.

10 Factors to evaluate System Availability

1. Executive Support: Management supports and sponsors availability with actions such as analyzing outage reports and holding groups accountable.
2. Process Owner: The process owner is the person who initiated the process in the system. The owner must ensure timely and accurate analysis and distribution of outage reports.

Contd

3. Customer Involvement: Processes built on a basic understanding of customer needs tend to be more successful than those that are not. The degree of customer involvement in the design and use of the process is therefore an important indicator of availability.

Contd

4. Supplier Involvement: Involvement of key suppliers of hardware and software and of service providers is necessary.
5. Service Metrics: Analysis of service metrics for trends such as percentage of downtime and value of time lost due to outages.

Contd

6. Process Metrics: The extent to which process metrics are analyzed for trends such as the ease and quickness with which servers can be rebooted.
7. Process Integration: The degree to which the availability process integrates with other processes and tools, such as overall network management.

Contd

8. Streamlining or Automation: The extent to which the availability process is streamlined by automating actions such as the generation, notification, or issuing of outage tickets.
9. Training of Staff: The extent to which staff are trained on the availability process.

Contd

10. Process Documentation: The quality and value of availability documentation, measured and maintained for future analysis.
