Concepts Guide
© 2003, LEGATO Systems, Inc. All rights reserved. This product may be covered by one or more of the following patents: U.S. 5,359,713; 5,519,853; 5,649,152; 5,799,141; 5,812,748; 5,835,953; 5,978,565; 6,073,222; 6,085,298; 6,145,089; 6,308,283; 6,324,654; 6,338,126. Other U.S. and international patents pending. Legato Automated Availability Manager, Release 5.1, Concepts Guide, September 2003, 01-6168-5.1
LEGATO and the LEGATO logo are registered trademarks, and LEGATO NetWorker, NetWorker, NetWorker DiskBackup, LM:, Celestra, PowerSnap, SnapImage, GEMS, Co-StandbyServer, RepliStor, SnapShotServer, QuikStartz, SAN Academy, AlphaStor, ClientPak, Xtender, XtenderSolutions, DiskXtender, ApplicationXtender, ArchiveXtender, EmailXtender, and EmailXaminar are trademarks or registered trademarks of LEGATO Systems, Inc. This is a nonexhaustive list of LEGATO trademarks, and other trademarks may be the property of their respective owners. The following may be trademarks or registered trademarks of the companies identified next to them, and may be used in this document for identification purposes only. Acrobat, Adobe / Adobe Systems, Inc. Apple, Macintosh / Apple Computer, Inc. Caldera Systems, SCO, SCO OpenServer, UnixWare / Caldera, Inc. TELEform / Cardiff Check Point, FireWall-1 / Check Point Software Technologies, Ltd. Unicenter / Computer Associates International, Inc. Access Logix, Celerra, Centera, CLARiiON, EMC, EMC2, MirrorView, MOSAIC:2000, Navisphere, SnapView, SRDF, Symmetrix, TimeFinder / EMC Corporation Fujitsu / Fujitsu, Ltd. Hewlett-Packard, HP, HP-UX, HP Tru64, HP TruCluster, OpenVMS, ProLiant / Hewlett-Packard Company AIX, DB2, DB2 Universal Database, Domino, DYNIX, DYNIXptx, IBM, Informix, Lotus, Lotus Notes, OS/2, PTX, ptx/ADMIN, Raid Plus, ServeRAID, Sequent, Symmetry, Tivoli / IBM Corporation InstallShield / InstallShield Software Corporation Intel, Itanium / Intel Corporation Linux / Linus Torvalds Active Directory, Microsoft, MS-DOS, Outlook, SQL Server, Windows, Windows NT / Microsoft Corporation Netscape, Netscape Navigator / Netscape Communications Corporation Data ONTAP, NetApp, NetCache, Network Appliance, SnapMirror, SnapRestore / Network Appliance, Inc. IntraNetWare, NetWare, Novell / Novell, Inc. Oracle, Oracle8i, Oracle9i / Oracle Corporation NetFORCE / Procom Technology, Inc. DLTtape / Quantum Corporation Red Hat / Red Hat, Inc.
R/3, SAP / SAP AG IRIX, OpenVault, SGI / Silicon Graphics, Inc. SPARC / SPARC International, Inc. (b) ACSLS, REELbackup, StorageTek / Storage Technology Corporation Solaris, Solstice Backup, Sun, SunOS, Sun StorEdge, Ultra / Sun Microsystems, Inc. SuSE / SuSE, Inc. Sybase / Sybase, Inc. Turbolinux / Turbolinux, Inc. VERITAS, VERITAS File System / VERITAS Software Corporation WumpusWare / WumpusWare, LLC UNIX / X/Open Company Ltd. (a) Unicode / Unicode, Inc. Notes: (a) UNIX is a registered trademark in the United States and other countries, licensed exclusively through X/Open Company, Ltd. (b) Products bearing SPARC trademarks are based on an architecture developed by Sun Microsystems, Inc.
5. LIMITED WARRANTY 5.1 Media and Documentation. Legato warrants that if the media or documentation are damaged or physically defective at the time of delivery of the first copy of the Software to Licensee and if defective or damaged product is returned to Legato (postage prepaid) within thirty (30) days thereafter, then Legato will provide Licensee with replacements at no cost. 5.2 Limited Software Warranty. Subject to the conditions and limitations of liability stated herein, Legato warrants for a period of thirty (30) days from the delivery of the first copy of the Software to Licensee that the Software, as delivered, will materially conform to Legato's then-current published Documentation for the Software. This warranty covers only problems reported to Legato during the warranty period. For customers outside of the United States, this Limited Software Warranty shall be construed to limit the warranty to the minimum warranty required by law. 5.3 Remedies. The remedies available to Licensee hereunder for any such Software which does not perform as set out herein shall be either repair or replacement, or, if such remedy is not practicable in Legato's opinion, refund of the license fees paid by Licensee upon a return of all copies of the Software to Legato. In the event of a refund this Agreement shall terminate immediately without notice. 6. TERM AND TERMINATION 6.1 Term. The term of this Agreement is perpetual unless terminated in accordance with its provisions. 6.2 Termination. Legato may terminate this Agreement, without notice, upon Licensee's breach of any of the provisions hereof. 6.3 Effect of Termination. Upon termination of this Agreement, Licensee agrees to cease all use of the Software and to return to Legato or destroy the Software and all Documentation and related materials in Licensee's possession, and so certify to Legato. Except for the License granted herein and as expressly provided herein, the terms of this Agreement shall survive termination. 7.
DISCLAIMER AND LIMITATIONS 7.1 Warranty Disclaimer. EXCEPT FOR THE LIMITED WARRANTY PROVIDED IN SECTION 5 ABOVE, LEGATO AND ITS LICENSORS MAKE NO WARRANTIES WITH RESPECT TO ANY SOFTWARE AND DISCLAIM ALL STATUTORY OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, OR ARISING FROM A COURSE OF DEALING OR USAGE OF TRADE AND ANY WARRANTIES OF NONINFRINGEMENT. ALL SOFTWARE IS PROVIDED AS IS AND LEGATO DOES NOT WARRANT THAT THE SOFTWARE WILL MEET ANY REQUIREMENTS OR THAT THE OPERATION OF SOFTWARE WILL BE UNINTERRUPTED OR ERROR FREE. ANY LIABILITY OF LEGATO WITH RESPECT TO THE SOFTWARE OR THE PERFORMANCE THEREOF UNDER ANY WARRANTY, NEGLIGENCE, STRICT LIABILITY OR OTHER THEORY WILL BE LIMITED EXCLUSIVELY TO THE REMEDIES SPECIFIED IN SECTION 5.3 ABOVE. Some jurisdictions do not allow the exclusion of implied warranties or limitations on how long an implied warranty may last, so the above limitations may not be applicable. 8. LIMITATION OF LIABILITY 8.1 Limitation of Liability. EXCEPT FOR BODILY INJURY, LEGATO (AND ITS LICENSORS) WILL NOT BE LIABLE OR RESPONSIBLE WITH RESPECT TO THE SUBJECT MATTER OF THIS AGREEMENT UNDER ANY CONTRACT, NEGLIGENCE, STRICT LIABILITY, OR OTHER LEGAL OR EQUITABLE THEORY FOR: (I) ANY INDIRECT, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES, HOWEVER CAUSED AND WHETHER OR NOT ADVISED IN ADVANCE OF THE POSSIBILITY OF SUCH DAMAGES; OR (II) DAMAGES FOR LOST PROFITS OR LOST DATA; OR (III) COST OF PROCUREMENT OF SUBSTITUTE GOODS, TECHNOLOGY, SERVICES, OR RIGHTS; OR FOR AMOUNTS IN EXCESS OF THOSE RECEIVED BY LEGATO FOR THE PARTICULAR LEGATO SOFTWARE THAT CAUSED THE LIABILITY. Because some jurisdictions do not allow the exclusion or limitation of incidental or consequential damages, Legato's liability in such jurisdictions shall be limited to the extent permitted by law.
9. MISCELLANEOUS 9.1 Governing Law. This Agreement shall be governed by the laws of the State of California, as applied to agreements entered into and to be performed entirely within California between California residents, without regard to the principles of conflict of laws or the United Nations Convention on Contracts for the International Sale of Goods. 9.2 Government Restricted Rights. This provision applies to Software acquired directly or indirectly by or on behalf of any government. The Software is a commercial software product, licensed on the open market at market prices, and was developed entirely at private expense and without the use of any government funds. All Software and accompanying Documentation provided in connection with this Agreement are commercial items, commercial computer software and/or commercial computer software documentation. Any use, modification, reproduction, release, performance, display, or disclosure of the Software by any government shall be governed solely by the terms of this Agreement and shall be prohibited except to the extent expressly permitted by the terms of this Agreement, and no license to the Software is granted to any government requiring different terms. Licensee shall ensure that each copy used or possessed by or for any government is labeled to reflect the foregoing. 9.3 Export and Import Controls. Regardless of any disclosure made by Licensee to Legato of an ultimate destination of the Products, Licensee will not directly or indirectly export or transfer any portion of the Software, or any system containing a portion of the Software, to anyone outside the United States (including further export if Licensee took delivery outside the U.S.) without first complying with any export or import controls that may be imposed on the Software by the U.S. Government or any country or organization of nations within whose jurisdiction Licensee operates or does business. Licensee shall at all times strictly comply with all such laws, regulations, and orders, and agrees to commit no act which, directly or indirectly, would violate any such law, regulation or order. 9.4 Assignment. This Agreement may not be assigned or transferred by Licensee without the prior written consent of Legato, which shall not be unreasonably withheld. Legato may assign or otherwise transfer any or all of its rights and obligations under this Agreement upon notice to Licensee. 9.5 Sole Remedy and Allocation of Risk. Licensee's sole and exclusive remedies are set forth in this Agreement. This Agreement defines a mutually agreed-upon allocation of risk, and the License price reflects such allocation of risk. 9.6 Equitable Relief. The parties agree that a breach of this Agreement adversely affecting Legato's intellectual property rights in the Software may cause irreparable injury to Legato for which monetary damages may not be an adequate remedy, and Legato shall be entitled to equitable relief in addition to any remedies it may have hereunder or at law. 9.7 No Waiver. Failure by either party to enforce any provision of this Agreement will not be deemed a waiver of future enforcement of that or any other provision, nor will any single or partial exercise of any right or power hereunder preclude further exercise of any other right hereunder. 9.8 Severability. If for any reason a court of competent jurisdiction finds any provision of this Agreement, or portion thereof, to be unenforceable, that provision of the Agreement will be enforced to the maximum extent permissible so as to effect the intent of the parties, and the remainder of this Agreement will continue in full force and effect. 10. ENTIRE AGREEMENT 10.1 This Agreement sets forth the entire understanding and agreement between the parties and may be amended only in a writing signed by authorized representatives of both parties. No vendor, distributor, dealer, retailer, sales person, or other person is authorized by Legato to modify this Agreement or to make any warranty, representation, or promise which is different than, or in addition to, the warranties, representations, or promises made in this Agreement. No pre-printed purchase order terms shall in any way modify, replace or supersede the terms of this Agreement.
Contents
Preface
    Audience
    Product Documentation
    Conventions
    Information and Services
        General Information
        Technical Support
        Licensing and Registration
    Customer Feedback
Scalable Solutions for Different Needs
    AAM Two Node Solution
        Features and Management Capabilities
    Customizing AAM for Larger Environments
        Features
    Automated Availability Manager SDK
    Automated Availability Manager Modules
    Time of Day
    Application Migration and System Changes
    New Systems
Triggers, Sensors, and Actuators
    Triggers
        Sensor-based Triggers
        Scheduled Triggers
        On-Demand Triggers
        Windows Event Log Triggers
    Sensors
    Actuators
Proxies
Agents
    Agent Components
        Agent Process
        Replicated Database
        Process Monitor
        Rule Interpreter
        Events and Rules Engine
    Primary Agents
    Secondary Agents
    Agents and Scaling
Network Architecture
    Network Configuration
    Network Redundancy
    Failure Detection
Data Source Architecture
    AIX Logical Volume Manager (LVM)
    HP-UX Logical Volume Manager (LVM)
    Windows Shared Disk
    Windows Network Share
    Legato Mirroring for Windows 2000
    Windows Veritas Volume Manager (VxVM)
    RepliStor Data Source
    Solaris Solstice Disk Suite (SDS)
    Solaris Veritas Volume Manager (VxVM)
    UNIX File System
    EMC SRDF
    EMC PowerPath Volume Manager
Rule and Monitoring Architecture
    Events and Rules Engine
    Triggers, Sensors, and Actuators
        Sensors and Sensor-based Triggers
        Scheduled Triggers
        Triggering the Rule
        Actuators
    Proxies
Resource Group Architecture
Preface
The Legato Automated Availability Manager Concepts Guide contains conceptual information about Legato Automated Availability Manager (AAM) software.
Audience
The information in this guide is intended for system administrators who are responsible for installing software and maintaining the servers and clients on a network. Operators who monitor the daily backups may also find this manual useful.
Product Documentation
Legato offers an extensive archive of product documentation at its web site www.legato.com. Most of the documents are in Adobe Acrobat Portable Document Format (PDF), and can be viewed by downloading and installing the Adobe Acrobat Reader. The Reader is available in the /viewers/acroread directory on the Legato Documentation Suite CD-ROM, or directly from Adobe at www.adobe.com. To install and use the Reader on the preferred platform, refer to the instructions in the CD-ROM's /viewers/acroread/readme.txt file or at the Adobe web site.
Conventions
This document uses the following typographic conventions and symbols to make information easier to access and understand.

boldface
    Indicates: names of line commands, daemons, options, programs, or scripts.
    Example: The nsradmin command starts the command line version of the administration program.

italic in text
    Indicates: pathnames, filenames, computer names, new terms defined in the Glossary or within the chapter, or emphasized words.
    Example: Displayed messages are also written to /nsr/logs/daemon.log.

italic in command line
    Indicates: a variable that must be provided in the command line.
    Example: nwadmin -s server-name

fixed-width
    Indicates: examples and information displayed on the screen.
    Example: media waiting: recover waiting for 8mm 5GB tape volume name

fixed-width, boldface
    Indicates: commands and options that must be typed exactly as shown.
    Example: nsr_shutdown -a

Menu_Name>Command
    Indicates: a path or an order to follow for making selections in the GUI.
    Example: Volume>Change Mode>Appendable

Important:
    Indicates: information that must be read and followed to ensure successful backup and recovery of data.
    Example: Important: Use the no_verify option with extreme caution.
General Information
The Legato web site provides most of the information that customers might need. Technical bulletins and binary patches are also accessible on the Legato FTP site. For specific sales or training needs, e-mail or call Legato.

www.legato.com
    Company and product information, technical bulletins, binary patches, and training program information
ftp.legato.com (log in as anonymous)
    Technical bulletins and binary patches
Legato Sales: (650) 210-7000 (option 1), sales@legato.com
    Company and product information
Legato Education Services: (650) 842-9357, training@legato.com
    Training program information
Technical Support
The Support section of the Legato web site provides contact information, software patches, technical documentation, and information about available support programs. Customers with an active support agreement have access to TechDialog, Legato's integrated product knowledge base. Help with Legato software issues is also available through Legato Technical Support. Customers without an active support agreement can contact Support Sales and Renewal to purchase annual Software Update Subscriptions, or Legato Technical Support services for per-update/per-incident support.
Contact information is provided separately for the Americas, Asia, and the Pacific, and for Europe, the Middle East, and Africa.
Customer Feedback
Legato welcomes comments and suggestions about software features, the installation procedure, and documentation. Please send any suggestions and comments to feedback@legato.com. Legato confirms receipt of all e-mail correspondence. Although Legato cannot respond personally to every request, all comments and suggestions are considered during product design. Help improve Legato documentation and be eligible to win a prize by completing a brief survey. Visit the Legato web site at www.legato.com, navigate to the documentation page, and click on the link to the survey.
Chapter 1: Introduction
Businesses need reliable access to mission-critical information and applications in order to stay competitive and be successful in today's ever-changing marketplace. With companies relying heavily on distributed computing architectures, multi-vendor environments, and web-based solutions, the need for high availability has never been greater. Every minute of downtime, planned or unplanned, can cost thousands of dollars. As such, you need a cost-effective way to monitor the state of your computer environment, automate repetitive tasks, and keep applications available and running.

By definition, High Availability (HA) means an environment where there is immediate response to failures, either at the application level or the machine level. When the application or environment fails, the HA solution relocates the affected resources and restarts the protected application. Reaction to failure, however, is only one aspect of a complete and coherent availability solution. A total HA solution goes far beyond this simple failover definition of traditional high availability and allows you to monitor and automate responses to events. This means that factors that can impact availability and effectiveness, such as system maintenance, load fluctuations, business workflow cycles, and even human error, can be monitored and managed.

In addition to availability issues, you want a solution that fits your current business environment as well as your business plan; a solution that:
- Leverages your existing investment in computing and networking resources
- Reacts and adapts to your dynamic environment
- Can be customized to your specific business needs
- Expands as your business expands
15
Legato Automated Availability Manager, also known as AAM, allows you to keep your company's applications up and running smoothly during both planned and unplanned events. AAM works within your existing environment, and is scalable in order to grow and change as your business needs require.
Failure. AAM provides complete and flexible failover capabilities at the application and node level. AAM minimizes downtime by detecting failures quickly when they occur, and allows you to have complete control over AAM migration by configuring shutdown and startup sequences.
Comprehensive Protection
Legato Automated Availability Manager is the answer to availability and automation needs over the continuum of service events. AAM provides the most comprehensive solution available for proactively monitoring, managing, and protecting resources. Preventing, detecting, and responding to resource failures, planned system changes, application migration, time-of-day occurrences, and performance degradation are all included in AAM's scope of protection. With AAM, capabilities that have been available in the past as individual static tools that cooperated poorly at best, and usually only when provided by the same vendor, are now available in one package to provide a comprehensive and dynamic solution, customized for the needs of your business.

When you install AAM on a set of nodes, you create a domain, the largest unit of management. AAM's comprehensive Events and Rules Engine provides a reliable framework through which the full range of service events are received and acted upon throughout the domain. As these events occur, the Events and Rules Engine automatically carries out your set of prepackaged and customized management policies. AAM provides advanced failover capabilities and full management capabilities for applications and services, as well as data sources for disks and file systems, managed IP addresses, and node name aliases for machines. With AAM you can manage anything from single objects to large installations of nodes whose applications have complex interdependencies. Machines can be grouped and managed by resource groups, which in turn can use all or any number of the resources contained within this group. To create more complex and intricate functionality, you can develop customized scripts specifically for your environment.
Proactive Management
Enterprise solutions often need sophisticated or proactive management capabilities. AAM provides management solutions at an enterprise level. At the heart of AAM's extensive availability capabilities are rules, management policies that can be developed to handle almost any operation. Rules can be as simple or advanced as the situation dictates.
Rules
The proactive, automated environment management capabilities provided by rules can monitor almost any data point within your environment or application and keep your managed applications available and responsive. Rules offer nearly limitless flexibility and control in implementing not only high availability but also environment management solutions and policies. Rules provide you with total control over the management solution by allowing you to:
- Define the states, or triggers, that drive the rule
- Determine the data, or sensor, value that a trigger monitors, or provide a schedule on which the trigger fires
- Determine the actions to be taken, in the form of a rule, as a result of the trigger notification
- Determine whether any function calls, or actuators, are made from within the rule
Each of these components is part of the rule process used by AAM's fault-tolerant Events and Rules Engine for management. Rules provide the mechanism through which your AAM management policies are defined and carried out. Platform-specific sensors and sensors for monitoring node status, process status, and time of day are provided with the package. For highly customized applications, the AAM SDK allows you to create your own sets of custom sensors and objects called actuators to extend these management capabilities. For more details on triggers, sensors, and actuators, see "Chapter 2: Continuum of Service Events" on page 29. For more details on how rules and the Events and Rules Engine work, see "Chapter 4: Architecture Highlights" on page 57.
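To make the relationship between these components concrete, the sensor-trigger-rule-actuator pipeline can be sketched in ordinary Python. This is an illustrative thought experiment only: the class names and the restart action are invented for the sketch and do not reflect AAM's actual interfaces or rule language.

```python
class Sensor:
    """Publishes a data value, e.g. a process status or free disk space."""
    def __init__(self, read_fn):
        self.read_fn = read_fn

    def value(self):
        return self.read_fn()


class Trigger:
    """Fires when the monitored sensor value satisfies a condition."""
    def __init__(self, sensor, predicate):
        self.sensor = sensor
        self.predicate = predicate

    def fired(self):
        return self.predicate(self.sensor.value())


class Rule:
    """Carries out an action (which may call actuators) when its trigger fires."""
    def __init__(self, trigger, action):
        self.trigger = trigger
        self.action = action

    def evaluate(self):
        if self.trigger.fired():
            return self.action()
        return None


# Example: a rule that asks for a service restart when its state is "down".
service_state = {"status": "down"}
sensor = Sensor(lambda: service_state["status"])
trigger = Trigger(sensor, lambda v: v == "down")
rule = Rule(trigger, lambda: "restart-service")  # stands in for an actuator call
print(rule.evaluate())  # prints restart-service
```

In a real engine the evaluate step would run continuously in an event loop; here a single call is enough to show how the trigger drives the rule and the rule invokes the action.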
Peer-to-Peer Architecture
AAM implements a peer-to-peer architecture that eliminates the need for idle standby nodes, preventing unnecessary expenditures for additional hardware. This active-active architecture allows each node to run its normal applications, making productive use of your computing resources, while still providing a target location in the event of a failure, maintenance operation, or other scheduled event. This type of environment, in which different types of applications are running on different nodes within the domain, is a mixed workload environment, one that is fully supported by AAM without modification to the product or the applications it manages.

For example, consider the mixed workload environment of three applications running on three different nodes. If each application uses only 40 percent of its node's resources, then, if any single node fails, the application can migrate to a surviving node without causing a problem, although the surviving node now runs at 80 percent capacity. To avoid overloading a node when an application migration takes place, you could configure a hot standby node, one configured to be idle under normal circumstances but able to support the application when a migration event occurs. AAM supports this architecture but, unlike many other products, does not require this configuration. Instead, you could write a rule that checks the load of the potential destination node and decides whether the application should relocate there based on potential overload.
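The capacity arithmetic in that example is easy to verify, and the overload check such a rule would perform can be sketched as a one-line policy. The 90 percent limit below is an assumed illustration, not an AAM default:

```python
def can_relocate(dest_load, app_load, limit=0.9):
    """Return True if the destination node can absorb the migrating
    application without exceeding the configured load limit."""
    return dest_load + app_load <= limit


# A surviving node at 40% load can absorb another 40% application (80% total):
print(can_relocate(0.40, 0.40))  # True
# A node already at 60% cannot take a 40% application under a 90% cap:
print(can_relocate(0.60, 0.40))  # False
```

A destination-selection rule would apply this check to each candidate node in the failover order and relocate to the first one that passes.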
Load Balancing
AAM's Events and Rules Engine allows you to define sophisticated load balancing policies. Rules can be developed to monitor system resources or application load, and respond at runtime to move or stop applications dynamically to address load problems. AAM can also start multiple instances of the same application, providing dynamic load balancing across multiple nodes as processing requirements increase.

Consider again the example of three applications and three nodes. Imagine that each night at 8:00 a backup procedure is started on a node, and the response time of the application running on that machine increases significantly during its 45-minute backup period. To eliminate this problem, an AAM rule can be developed to automate the relocation of the application from the node during the backup and its return when the backup is done. Interestingly, there are a number of ways to implement the condition that initiates the migration. For example, the migration could be tied to the load on the machine, in which case AAM moves the application whenever the node gets busy. Another approach would be to move the application each night at 8:00 regardless of the load. Yet another approach might be to use AAM to start the backup procedure and design the rule to move the application before it starts the backup process. How AAM manages your environment is entirely up to you.
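The first two of those migration conditions, a fixed nightly window and a load threshold, can be sketched as a single check in plain Python. The window, threshold, and function name are assumptions made for the sketch, not AAM rule syntax:

```python
from datetime import time

def should_relocate(now, node_load,
                    backup_window=(time(20, 0), time(20, 45)),
                    load_threshold=0.85):
    """Relocate the application if the nightly backup window is open
    or the node's load has crossed the threshold."""
    start, end = backup_window
    in_backup = start <= now <= end
    overloaded = node_load > load_threshold
    return in_backup or overloaded


print(should_relocate(time(20, 15), 0.30))  # True: inside the 8:00-8:45 window
print(should_relocate(time(14, 0), 0.95))   # True: node is overloaded
print(should_relocate(time(14, 0), 0.30))   # False: neither condition holds
```

The third approach, moving the application before starting the backup itself, would instead sequence the relocation and the backup launch inside one rule action.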
Cross-Application Activation
Using the sensors available in AAM, rules can be created that take data values from one application and activate functions within that, or an entirely different, application. Even applications to which AAM does not have direct access can be monitored and managed using AAM's Proxy Processes, which publish sensor and actuator values on behalf of a closed application. Once-isolated applications can now be tied together to leverage the functional capabilities of other applications in the environment.
Simplified Failover
AAM allows you to perform failover for any resource object, including processes, services, IP addresses, data sources, and node aliases, on any machine in the domain (machines are called nodes when managed by AAM). AAM's failover system enables migration of an individual application, or of all applications and system resources running on or attached to a single node.
Chapter 1: Introduction
Other solutions may provide failover, but those solutions are typically built around a pair of servers, providing protection within a restricted environment. Scalability is impossible since applications are typically limited to a small number of machines. AAM, meanwhile, allows you to group up to 100 nodes and an unlimited number of applications into a single high availability unit, even in a mixed platform environment. It can switch over a single application, allowing the original node to stay operational, or, if necessary, switch over all applications and system resources running on or attached to the node. AAM also provides the ability to automatically fail the resources back to the original node if the node returns to a healthy state.
Resource Groups
AAM uses resource groups for failover. Each resource group coordinates the migration and other capabilities and defines the nodes and resources that are to be managed. The node definition provides the mechanism for AAM's cascading failover for up to 100 nodes in the domain. The resources, such as processes, disks, IP addresses, and node name aliases, define the bounds of management. Additionally, a resource group definition can contain further customization through the support of scripts, timing delays, utility processes, and failback options. In AAM's solution, the number of resource groups is limited only by the available resources in your environment. Your solutions are limited only by your needs. Making use of AAM's reliable rule architecture, resource groups are implemented internally as rules. Each resource group uses built-in sensors and triggers, which monitor and react to process and node failures. The flexibility of the rule engine allows for simple one-node application monitoring, two-node failover, and other traditional HA features, as well as a framework to do much more. AAM can relocate one or more resources independently, without waiting for the entire server to fail and without forcing the need to relocate every resource from that server concurrently. AAM can also automatically relocate resource groups to their original nodes when failed servers come back online. AAM's resource groups provide a comprehensive and easy-to-use mechanism for adding advanced multi-node failover capabilities to your AAM environment.
Failure Prevention
AAM resource group management provides load balancing upon failure. That is, when a node fails or comes down for maintenance, each resource group on that node can be relocated to a different node, so the entire application load can be better balanced. Without this ability, all of your applications could be failed over indiscriminately to a single node, potentially causing serious performance degradation for end users. AAM also provides the ability to relocate a resource group manually during times of high usage, or for routine system maintenance. Rules can also be written to monitor load across the domain, and move the resource group as needed.
Cross-Platform Capabilities
AAM is hardware independent and compatible with equipment supported by platform vendors for Windows and supported UNIX operating systems. AAM also supports leading UNIX platforms such as Sun Solaris 8 and above, HP-UX 11.0, Red Hat Linux, and AIX 5.1. AAM supports near-identical functionality across platforms, making mixed-platform environments transparent to system administrators and clients. This consistency allows transparent cross-platform migration between and among nodes running Windows and UNIX.
AAM's Management Console can be used to manage all nodes, Windows or UNIX. All nodes, regardless of platform, can be supported from a single console running on any node. The AAM Management Console automatically detects the platforms on which agents are running in the domain and displays the correct information as needed.
Centralized Management
AAM provides a centralized Management Console for full administration of all managed objects in the AAM domain. This single administration interface allows you to manage and view information about nodes, resource groups, individual resources, and rules for both Windows and UNIX nodes. The Management Console can be run from any machine on the network. In addition, any changes you make through the Management Console are automatically and immediately reflected across the entire AAM domain, and are stored in AAM's fully replicated distributed database. The AAM Management Console provides:
• A centralized monitoring and administration tool for taking managed resources and resource groups online and offline.
• A real-time reflection of managed cluster object states. As soon as AAM detects a state change for an object, the Management Console updates its display to reflect that change.
• An interface to define and configure all the managed resources in the cluster from a single local or remote location.
• An interface to create, configure, and run rules.
• Ability to manage multiple AAM domains from a single interface.
Often, high availability solutions are not redundant or highly available themselves. When the HA solution fails, the environment becomes vulnerable. There is little point in trying to provide a high availability solution with a low availability product, so AAM was designed to be reliable and highly available, with a redundant, self-monitoring architecture to keep its own processes running. AAM is built on top of a reliable messaging technology, ensuring that messages are distributed throughout the set of nodes. AAM automatically uses redundant network messaging if multiple networks are in place and maintains a fully replicated database of all its components and configuration information. The AAM agent is implemented as a group of specialized, self-checking processes that monitor each other's health and automatically restart any component that fails. The Events and Rules Engine guarantees that rules execute, even if the node on which they were executing fails.
• Resource Groups: These groups of applications and resource objects are at the heart of AAM's failover and migration capabilities. In the event of a failure or a user-initiated migration, all objects in the resource group are managed as a single package, providing reliability and consistency.
• Centralized Management: AAM provides centralized management from the Management Console, a graphical interface that can be run on any node, Windows or UNIX, on the network to provide a comprehensive view of all nodes and resources in the domain, regardless of platform or location. For more information, see "Centralized Management" on page 23.
• Data Sources: A data source is a domain-wide name used to specify a storage device. Defining a shared data source allows access to the device from any node in the domain. Data sources can be defined for a shared disk where AAM controls which node has access to the data source.
• Network Interface Cards: AAM enables you to test NICs on a node and to determine how the card will be used.
• Managed IP Addresses: AAM enables you to associate unique IP addresses with nodes in the domain. The IP address can be moved from node to node in the domain, allowing client connections to applications to continue in the event of migration. Clients that connect to a server using a managed IP address need not know which physical machine in the cluster is hosting it. AAM also allows advanced IP management, including MAC address modification, default NIC selection on a per-IP-address basis, and physical address migration.
• Node Aliases: AAM enables you to associate a unique name or names with Windows nodes in the domain. Node aliases allow client processes to use host names for connections rather than IP addresses to find the available resource group objects. The node alias can be moved from node to node in the domain, allowing client connections to continue even in the event of migration.
Clients that connect to a server using a node alias are connecting to the logical node name and need not know which physical machine in the cluster is hosting it. Note that this works only for applications that access the application through NetBIOS.
• Security: AAM provides a secure interface for defining and controlling domain-wide objects. AAM provides three types of security: user, operator, and administrator. Nodes must be added to the database before they can join the domain. Likewise, users must be added to the database and assigned a security level before they can access the domain.
• Event Viewer: AAM log files provide a history of events, which system administrators can use to analyze patterns of events that have led up to failures. Event messages from AAM appear in AAM's own Event Viewer and log files, as well as in the Windows Event Log.
• Failure Detection: The nodes in an AAM domain use a communication network for application and domain communication. Each participating AAM node is connected to at least one communication network, although multiple-network support is built in to AAM. If a network failure occurs in a cluster with multiple cluster networks, AAM automatically utilizes the working network.
• Isolation Detection: AAM provides a mechanism that allows a node to determine whether it has lost its network connectivity, and to respond appropriately. This behavior prevents resource groups and rules from starting on the isolated node when these objects are already running correctly on the unaffected portion of the network.
• Command Line Interface (CLI): The CLI provides the same functions as the Management Console, but allows AAM commands to be executed from within scripts or in batch mode.
Features
AAM users have access to built-in sensors, triggers, and actuators, as well as over 70 API calls for use in their rules, which leverage all of AAM's power. Rules are created in AAM's Rule Editor. The Rule API calls provide management control over all elements in the domain, provide access to the replicated database for data from both inside and outside of the domain, and provide a consistent view of the domain from all rules. AAM's tracing capabilities allow inclusion of descriptive bug-tracking information in rules. AAM's NIC failover capability ensures that network communication is not interrupted.
In addition, Proxy Processes (aware applications that publish sensors on behalf of other applications) are also included in AAM. These applications open up the management capabilities of the domain and create an integrated environment between AAM and the managed application.
Legato Automated Availability Manager (AAM) provides solutions for events across the continuum of service events described in "Chapter 1: Introduction" on page 15. With its reliable system of resource groups for failover and rules for management solutions ranging from simple to sophisticated, AAM not only addresses failure in the domain, but also allows customized solutions specific to your business needs. The group of machines where AAM agents are installed and configured to work together is known collectively as the AAM domain, which is at the center of the AAM architecture. Domains can be as small as one node, or as large as 100. All of the nodes contained within the AAM domain share certain characteristics, such as a common replicated database, event log, and failure detection settings. The AAM agent that is placed on each machine is a reliable, self-monitoring set of processes that performs all of AAM's monitoring and management functionality. Sensor processing, rule triggering, rule execution, and actuator firing in response to events all take place through the Events and Rules Engine, provided through the agent. Once the AAM agent is installed on a machine, that machine is referred to as a node. Nodes comprise the domain, over which rules can be used for flexible management. A node can also host multiple domains. Rules can be designed to handle events in the domain, and the solution can be as simple or sophisticated as the problem dictates. Nodes can also be included in resource groups, which provide management and availability for applications and other resource objects such as data sources, IP addresses, and node aliases. For added protection, data sources can be replicated using Legato Replication. These tools provide all of the functionality necessary to provide availability protection for your system over the continuum of service events.
Resource Groups
AAM's failover capabilities are implemented using resource groups. Each resource group coordinates the migration of a group or set of resources across the domain. With their failover and migration capabilities, resource groups go a long way toward covering events on the continuum of service events, providing solutions for:
• Failure
• Application migration
• System changes
• New systems
Overview
Resource groups provide a standard, easy-to-define solution for application migration, either planned or unplanned. Resource groups are defined in the Resource Group Editor, a point-and-click interface that allows quick configuration of the nodes in the domain and the objects to be managed. AAM's resource groups provide a comprehensive and easy-to-use mechanism for adding advanced multi-node failover and clustering capabilities to your AAM environment.
The preferred node list allows you to configure static load balancing for applications in your domain. Since the order of nodes is configurable, multiple resource groups running on the same node need not migrate to the same node in the event of failure. Instead, each resource group can migrate to a different node, preventing the possibility of overloading the destination nodes. If advanced operation requires a rule to determine the migration destination of a resource group, AAM supports nodeless resource groups. In this case, no node list is configured, and the rule is written to determine the most suitable location for the resource group if node failure occurs.
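Cascading failover over a preferred node list can be sketched as follows. The node names are invented, and this is only an illustration of the selection semantics, not AAM's implementation.

```python
# Minimal sketch: a resource group fails over to the first node in its
# configured preference order that is currently healthy.

def failover_target(preferred, healthy):
    """Return the first healthy node in preference order, or None."""
    for node in preferred:
        if node in healthy:
            return node
    return None

# Two groups hosted on the failed node "alpha" have different preferred
# orders, so they land on different survivors: static load balancing.
print(failover_target(["alpha", "beta", "gamma"], {"beta", "gamma"}))  # beta
print(failover_target(["alpha", "gamma", "beta"], {"beta", "gamma"}))  # gamma
```

A nodeless resource group would replace the fixed `preferred` list with a rule that computes the destination at runtime.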
Resource Objects
Resource groups are composed of a group of related AAM-managed objects. These objects are processes, services, managed IP addresses, network interface cards, node aliases, and data sources. Once you define the resource group, AAM treats it and all of its resource objects as a package. All objects within the resource group always run on the same node. If one of the objects within the resource group cannot run properly on a particular node, the whole resource group migrates to the next node in the preferred node list. Additionally, a resource group definition can contain further customization through scripts, timing delays, utility processes, and failback options.
Migration
Migration is the movement of a resource group's objects from one node to another. When a resource group migrates, AAM moves all objects within the resource group from the current node to the specified node. All managed objects within a single resource group migrate to the same node. Resource groups can only migrate to nodes specified in the Preferred Node List that have the software properly installed. Automatic migration occurs in the event of a node failure or unrecoverable resource failure. Users can also initiate migration of a resource group for events such as planned node outages or software upgrades on the current node. Resource groups can optionally be configured to provide automatic failback of a resource group when the preferred node for the group comes back online after a failure. This functionality ensures that a resource group can always run on the node most suited for operation if that node is available.
Tracking
AAM allows you to get statistics for uptime and planned versus unplanned downtime using its availability tracking feature. This feature allows you to see at a glance the availability of the resource group over any given period since the resource group was created.
Failure Protection
Resource groups protect your system against failure using AAM's advanced capabilities. In the event of node failure, the resource group migrates to the next node in the preferred node list. Since AAM allows up to 100 nodes in a resource group, the availability of your applications and resources is almost assured. Because of the way resource groups are implemented, you can also be assured that all objects within the resource group always run together on the same node.
AAM's ability to load balance in the event of failure also ensures that you can prevent overloading destination nodes, which could cause another failure. If an individual process or Windows service that AAM monitors within a resource group fails, AAM performs one of two actions based on the user's preference. AAM either:
• Restarts the failed process or Windows service, or
• Executes the resource group shutdown sequence followed by the resource group startup sequence on the same node to bring the resources back online.
If AAM is unable to restart the process or service on the original node, it then migrates the resource group to the next designated node specified for the resource group.
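The restart-then-migrate policy just described can be sketched as below. The retry limit of three is a hypothetical parameter for the illustration; the actual limit in AAM is user-configurable and this Python sketch only models the decision sequence.

```python
# Hedged sketch: try restarting a failed process on the current node up to
# a retry limit; if no restart succeeds, migrate the whole resource group.

def recover(restart_ok_on_attempt, max_restarts=3):
    """Return the recovery action taken for a failed process.

    restart_ok_on_attempt: attempt number on which a local restart would
    succeed (0 means restarts never succeed on this node).
    """
    for attempt in range(1, max_restarts + 1):
        if restart_ok_on_attempt and attempt >= restart_ok_on_attempt:
            return f"restarted locally on attempt {attempt}"
    return "migrated resource group to next designated node"

print(recover(restart_ok_on_attempt=2))  # restarted locally on attempt 2
print(recover(restart_ok_on_attempt=0))  # migrated resource group to next designated node
```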
Application Migration
AAM's migration capabilities also provide an efficient mechanism for moving the resource group and its objects to another node in the cluster. Because migration is automatic, downtime caused by migration is minimal. Because AAM does all the work of moving the resources, you are assured that they all migrate correctly. With AAM, migration of a resource group from one node to another is as simple as the click of a button.
System Changes
System changes can also be easily handled with AAM's resource group capabilities. When hardware is replaced or upgraded, resource groups can be migrated to other nodes in the cluster while the changes are taking place, and moved back as soon as the work is complete. During application upgrades, migration is also an option, or users can choose to disable resource group monitoring and upgrade the software in place. With resource group monitoring disabled, the processes and services in the group can be stopped without causing a restart or migration. The affected process can be upgraded while the other applications on the node continue to run unaffected. Once the upgrade is complete, monitoring can be re-enabled, and AAM's failure protection immediately resumes monitoring for failure.
New Systems
Adding a new system to the cluster is a matter of installing the proper processes and services on the node, adding the node to the domain, and then adding the node to the resource group's preferred node list.
Rules
AAM's proactive and reactive management capabilities are implemented using rules. Rules provide far more flexibility and customization than resource groups. While resource groups provide resource monitoring and migration capabilities, rules can be created to perform much more complex procedures while still monitoring and managing resource objects in the domain just as resource groups do, as well as the resource groups themselves. Rules that utilize AAM's built-in library of rule API subroutines provide centralized administration, a consistent domain-wide view, and direct access to the replicated database. They can provide solutions to problems within your domain ranging from simple to complex. Because of their flexibility, rules can provide solutions across the full range of the continuum of service events.
Overview
With rules, you can move your solution beyond simple failover to specialized proactive and reactive responses to events within your domain. Rules are applicable over the entire domain. They are activated by a trigger, which can be specified to monitor sensor points from all nodes or a subset, or to fire at a set time. When the triggering condition is met, the rule executes. Rules are defined through the AAM Rule Editor and are written in Perl, an interpreted scripting language. Perl is portable, allowing the same rule text to execute in exactly the same manner on any platform. Rules are executed within the Events and Rules Engine, which provides visibility at runtime into the entire domain and can make decisions at the domain level. With rules, availability decisions no longer must be static or made in advance. Rules can sense and react dynamically to changes in the environment and make decisions based on up-to-the-second runtime information. Multiple triggers can be associated with a rule, and one trigger can activate multiple rules. Figure 1 on page 36 provides an overview of relationships within the Events and Rules Engine.
[Figure 1 illustrates the relationships within the Events and Rules Engine: user-defined sensors and AAM-provided sensors feed sensor-based triggers, while the time of day drives scheduled triggers.]
The result is a management tool, configurable from a central location, that watches the entire domain and responds accordingly.
Failure Protection
Because rules have visibility over the entire domain, they can respond to failures at both the node and application level. A typical rule could respond to either type of failure by migrating to another node. Rules can also perform actions such as sending e-mail notification to system administrators and writing messages to the AAM and Windows event logs. Through AAM's actuator capabilities, rules can interact with or notify third-party management facilities. In the event of a failure, the third party could be notified by a page or through other means.
Performance Degradation
Using AAM sensors, you can define a rule to execute based on CPU load or disk space. Using such conditions, a rule could be written to provide load balancing when CPU load reaches a certain level, or to clean up the disk when disk usage reaches a certain value.
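The disk-cleanup condition might look like the following sketch. The 90 percent threshold is an assumption for illustration, not an AAM default, and the action string stands in for invoking a cleanup actuator from Perl rule text.

```python
# Illustration of a sensor-based condition: fire a cleanup action when
# disk usage crosses a configured threshold.

def disk_rule(percent_used, threshold=90.0):
    """Return the action a disk-space rule would take for this reading."""
    if percent_used >= threshold:
        return "run cleanup actuator"
    return "no action"

print(disk_rule(94.5))  # run cleanup actuator
print(disk_rule(40.0))  # no action
```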
Business Changes
Rules can be created to respond to dynamic resource deployment. For example, if your business is involved in e-commerce, a rule can be written to monitor the number of users on your Web page, or the system's CPU load. When the load becomes too great, AAM can launch another web service on another node, relocate IP addresses, and respond to the additional traffic. Once levels return to normal, the second web server can be shut down, returning the system to its normal processing configuration.
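The scale-up, scale-down policy described above can be sketched with a simple hysteresis band so the second web server is not launched and retired on every small fluctuation. The user-count thresholds are invented for the example.

```python
# Sketch of the e-commerce scaling policy: start a second web service
# under heavy load, retire it when traffic drops, and leave things alone
# in between (hysteresis).

HIGH_WATER, LOW_WATER = 500, 200  # concurrent users (hypothetical)

def desired_instances(users, current):
    """Return how many web service instances should be running."""
    if users > HIGH_WATER:
        return 2          # launch the second server on another node
    if users < LOW_WATER:
        return 1          # back to the normal configuration
    return current        # inside the band: no change

print(desired_instances(750, current=1))  # 2
print(desired_instances(300, current=2))  # 2
print(desired_instances(150, current=2))  # 1
```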
Time of Day
Using AAM's scheduled triggers, you can run any rule once at a specific time, or on a recurring, scheduled basis. Once triggered, the rule can perform any scheduled operation, such as a nightly backup or weekly report generation.
For example, in an environment where AAM has been integrated with a data mirroring product such as Legato RepliStor, each night you can temporarily turn off mirroring of the source disk, start a backup procedure against the target disk, and then restart data replication updates when the backup is completed. Another application of scheduled rules could be shifting processing priority to the accounting and finance department on the 29th of the month for end-of-month processing.
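The nightly sequence above (pause mirroring, back up the target disk, resume replication) can be sketched as follows. The step functions are placeholders; a real rule would invoke RepliStor and backup actuators from Perl, and the try/finally shape here illustrates one way to make sure mirroring resumes even if the backup fails.

```python
# Hedged sketch of the nightly backup orchestration, with placeholder
# callables standing in for the real mirroring and backup actuators.

def nightly_backup(pause_mirroring, run_backup, resume_mirroring):
    """Run the three phases in order, resuming mirroring even on failure."""
    steps = []
    pause_mirroring()
    steps.append("mirroring paused")
    try:
        run_backup()
        steps.append("backup complete")
    finally:
        resume_mirroring()          # always restart replication updates
        steps.append("mirroring resumed")
    return steps

log = nightly_backup(lambda: None, lambda: None, lambda: None)
print(log)  # ['mirroring paused', 'backup complete', 'mirroring resumed']
```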
Application Migration and System Changes
Although resource groups are specially designed to respond to the migration and maintenance needs of your domain, rules can also be designed to perform the same tasks, or to manage the operation of a number of resource groups. For example, you could create several resource groups to perform special tasks, and create a rule to run concurrently to coordinate the use of the resource groups.
New Systems
Because rules are not node specific, they are immediately available to nodes that are added to the domain. As soon as the triggers are activated, the same management capabilities that are available to other nodes in the domain are available on the new node. Information about the node and its processes is available immediately.
Actuators are external function calls to an AAM-aware process that are defined and provided through the AAM SDK. Actuators can be invoked from rule text. The process that the actuator takes action upon need not be the same process whose sensor value fired the trigger. Both resource groups and rules use this system of triggers, sensors, and actuators. In rules, triggers, sensors, and actuators are visible to, and configurable by, the user; AAM automatically creates them when resource groups are used. This combination of sensors, actuators, and triggers provides a powerful tool kit for solving a wide range of availability and clustering problems.
Triggers
Triggers are the objects that drive rules. Triggers may fire based on a scheduled time or on a data value received from a sensor within a process. The firing of a trigger forces the evaluation of any rule associated with the trigger by the Events and Rules Engine. Note that multiple triggers can be associated with a single rule, allowing different conditions to trigger the same rule. A trigger can also be associated with more than one rule, allowing a single event to drive multiple rules. Triggers are defined using the Management Console and are stored in AAM's replicated database. When a rule is enabled, the associated triggers are activated and start monitoring the associated sensor. The trigger is then monitored and fired whenever the configured condition is met. There are four types of triggers:
• Sensor-based: The trigger fires when its sensor returns a data value matching the condition criterion. There are a number of condition types that can be set for sensor-based triggers. A single trigger can monitor no more than one sensor.
• Scheduled: Triggers fire at a specific time or on demand.
• On-Demand: Allows the user to run rules as needed. This can be useful during rule creation and testing to ensure that the rule is running as intended.
• Windows Event Log: Enables a trigger to fire when a particular Windows event occurs or when a series of events occur within a specific time interval.
Sensor-based Triggers
A sensor-based trigger fires when its associated sensor returns a data value that matches the condition criterion set for the trigger. There are a number of condition types that you can set for sensor-based triggers, based on the expected values from the sensor:
• Threshold: A value is entered, and the trigger fires when the sensor value goes over, under, or is equal to the specified value.
• Absolute Change: A value is entered, and the trigger fires when the sensor value changes by the absolute value of the amount indicated.
• Percent Change (Relative To Range): A percentage and a value range are entered. The trigger fires when the sensor value changes by the selected percentage of the entered range.
• Percent Change (Relative To Last Value): A percentage is entered, and the trigger fires when the sensor value changes by the selected percentage relative to the sensor's last reported value.
• State Sensor Comparison: The trigger fires when the process or node state is equal (or not equal) to the indicated state.
• State Sensor Change: The trigger fires when the process or node state changes in any way.
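The semantics of a few of the condition types listed above can be sketched as simple predicates. These are illustrations of the described behavior only, not AAM's internal implementation, and the "over" variant of the threshold condition is shown as one example among the over/under/equal options.

```python
# Illustrative evaluation of three sensor-based trigger conditions.

def threshold_fires(value, limit):
    """Threshold ("over" variant): fire when the value exceeds the limit."""
    return value > limit

def absolute_change_fires(last, value, delta):
    """Absolute Change: fire when the value moves by at least delta."""
    return abs(value - last) >= delta

def percent_change_fires(last, value, pct):
    """Percent Change (Relative To Last Value)."""
    return last != 0 and abs(value - last) / abs(last) * 100 >= pct

print(threshold_fires(95.0, 90.0))             # True
print(absolute_change_fires(50.0, 58.0, 5.0))  # True
print(percent_change_fires(50.0, 52.0, 10.0))  # False (only a 4% change)
```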
Scheduled Triggers
Scheduled triggers allow you to configure the trigger to fire at a specific time or time interval. You can also configure the trigger to fire on demand.
• Date: The trigger fires based on the time and date. Triggers can also be configured to fire at a regular interval.
• Day Of The Week: The trigger fires based on the time on specific days of the week.
• On-Demand: The trigger fires when invoked from a rule or through the AAM Management Console or command line interface.
Scheduled triggers are not connected to a sensor. At the configured time, the trigger fires. This allows AAM to manage scheduled events in the domain. On-demand triggers allow the user to run rules as needed, and can be useful during rule creation and testing to ensure that the rule is running as intended. On-demand triggers can be fired from within a rule, or from the Management Console.
On-Demand Triggers
On-demand triggers allow the user to run rules as needed. This can be useful during rule creation and testing to ensure that the rule is running as intended.
Sensors
A sensor is an AAM object that publishes values from a data point such as a counter or environmental state. AAM publishes state sensors for managed processes and managed nodes. Users can also create additional sensors to publish data from within AAM-aware processes using the AAM SDK API. AAM also provides several aware processes, called proxies, which provide sensors on behalf of nodes and processes. During operation, sensors provide data to sensor-based triggers. Sensor values can also be explicitly polled through the Management Console, the CLI, or a running rule. AAM supplies two special state sensors that publish values for every node and managed process in an AAM domain. Unlike user-defined sensors, these sensors are not published from within a process, but are published by the AAM agent. AAM provides sensors for monitoring node status, process status, and time of day. Platform-specific sensors are also provided.
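The publish-and-poll idea behind sensors can be sketched as a tiny class. The sensor name and state values below are invented for illustration; real sensors are created through the AAM SDK API, not this interface.

```python
# Minimal sketch of the sensor concept: a process publishes the latest
# value of one named data point, and a rule or console can poll it.

class Sensor:
    """Publishes the latest value of a single named data point."""

    def __init__(self, name, initial=None):
        self.name = name
        self.value = initial

    def publish(self, value):
        """Record a new reading (would notify attached triggers in AAM)."""
        self.value = value

    def poll(self):
        """Return the most recently published value."""
        return self.value

state = Sensor("node.gold.state", "ALIVE")   # hypothetical state sensor
state.publish("UNREACHABLE")
print(state.poll())  # UNREACHABLE
```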
User-defined sensors provide data from aware processes based on parameters that are defined and programmed by the user, allowing nearly limitless options for the types of information that can be gathered to trigger events.
Actuators
Actuators are function calls provided by an AAM-aware application that can be invoked from rules. When the rule text invokes the actuator, the actuator's associated function is executed within the aware application. The actuator can be called either for a specific application or for a number of processes within a class. Further, the application that the actuator takes action upon need not be the same process whose sensor fired the trigger.
Proxies
Proxies are AAM-aware processes that publish sensor values and actuators on behalf of another process. AAM provides two proxy processes for each platform, a node proxy and a process proxy, which are used to publish information about nodes and processes respectively. Generalized platform- and operating-system-specific sensors and actuators are provided. With proxies, developers can define sensors to monitor the detailed behavior of an application, allowing monitoring of just about any data point for any user-defined condition. Actuators can then initiate actions within the application based on these triggered conditions. Proxies essentially provide the functionality of an AAM-aware application in any off-the-shelf product. Proxies are a powerful tool for total system integration and management through AAM.
This chapter describes how a company goes about implementing the Legato Automated Availability Manager (AAM) for business-critical applications. The case study describes what actions the company administrator wants to take, the initial and acquired hardware resources, and the configuration of AAM for the company's environment.
High Availability
Marvellus Corporation, a software reseller, has long offered its products through its own product catalog. Now, the company wants to enter the Internet commerce market and will be using a database to record its transactions. Since Marvellus is a PC-based company, it chooses Microsoft Internet Information Server (IIS) for its online needs. IIS provides the World Wide Web Publishing Service that Marvellus wants. Marvellus also uses Microsoft SQL Server for its database.
Traditionally, all of these pieces run on a single machine. Marvellus buys a machine to use as a server, installs the two services it will be using, configures the IP address, and creates its Web pages. The company implements the single node configuration shown in Figure 2 on page 44, which has no backup and no failover scheme. Figure 2. Initial Hardware Configuration
[Figure 2 shows the node gold running the Web and SQL Server applications, with data on local disk D: and live client connections through the static IP address 1.2.3.4 over the communication network.]
The figure shows the single node configuration. Marvellus's system administrator, Jack, installs and runs the Web and SQL services on the node called gold. Data, including Web pages and related CGI scripts for the Web service and data files for the database, is stored on the local disk drive D:. Internet communication for the Web service takes place through the network IP address 1.2.3.4. Soon Marvellus's Web site generates 40 percent of their sales, with many of the orders being placed during the evening and over the weekend. Operators process the weekend Web-based orders during normal work hours on Monday. Late one Friday night, the main power supply to gold is accidentally unplugged, and the server goes down. No one discovers the failure until Monday morning. The company estimates that they lost 20 percent of their potential revenue for the week because of this incident and the lack of a failover solution. The most serious drawback of the single node configuration is that when any sort of failure occurs, the application is unavailable until the machine and applications are brought back online. Most often, the applications are brought back through manual procedures and after significant down time. The single node configuration is vulnerable to various failure scenarios, including node failure,
application failure, and disk failure. Failure of a disk not only stops the application but also prolongs downtime, since a system administrator must retrieve data from a backup. For Marvellus, the failure of one component results in a total node failure. The entire Web operation is unavailable until someone discovers the failure and brings the node back online manually.
Hardware Configuration
Marvellus decides to create a two-node cluster. Along with gold, the machine used as the Web server, Marvellus adds a machine called silver. The Marvellus Order Database will run on silver. Because each node runs only a single application, the CPU load on each should be low, making each a good failover location should the other fail. To increase availability, for a small additional expense Jack adds an Uninterruptible Power Supply (UPS) and one additional network interface card (NIC) per node. The extra NIC provides a redundant network in case one network fails. Now each machine has two static IP addresses, allowing two communication paths between the nodes. Web files are replicated on each machine's local disk, and the Web service uses SQL commands to access the data from the database. Finally, Jack moves the database files to an external shared disk connected to both nodes and managed by Automated Availability Manager as a shared disk data source. To avoid a single point of failure, the data on the disk may be replicated using a RAID array or a mirroring product such as Legato Replication. Figure 3 on page 46 shows the hardware and high-level AAM configuration the company has set up to keep its services highly available.
Figure 3. High Availability Solution Hardware Configuration
In this configuration, Jack connects each node to both communication networks so that each node has two static IP addresses, one each on the 1.2.3 and 1.2.4 subnets. In addition, he has configured the nodes to accept two managed AAM IP addresses, 1.2.3.200 and 1.2.3.210. Managed IP addresses are standard IP addresses, but they can be moved from node to node without rebooting. Since SQL Server uses a NetBIOS node name for default communication, Jack creates a movable node alias, Mercury, for SQL communication. Like a managed IP address, the node alias can be moved from node to node. Jack then installs the Web and SQL services on each node. Finally, he moves the database files from the local disks on each node to a shared disk, managed by AAM using a data source. The data source manages disk access for both nodes and ensures that only one node can access the shared disk at a time. The AAM resource group ensures the proper data source connection follows the application as it moves from node to node. The Web service runs initially on gold and the SQL service on silver. The AAM agent, configured to monitor the services and nodes for failure, actively runs on both nodes. Web browser connections from clients communicate with the Web service through the managed IP address, not through either machine's standard IP address. In this case, the managed IP address is 1.2.3.200. Marvellus has published 1.2.3.200 through DNS as www.marvellus.com, and this is the address that Web browsers use for access. SQL Server has been configured to communicate using the node name Mercury, which has been defined as a movable node alias.
Resource Groups
The easiest way to configure application migration is through AAM's resource groups. Define resource groups using the Resource Group Editor, where you specify all of the objects to manage and the preferred nodes on which the resource group runs. Once you bring the resource group online, AAM manages the defined objects, providing high availability. Since Marvellus wants to manage two separate services that typically run on separate nodes (the SQL service on silver and the Web service on gold), the availability solution requires two resource groups.
In preparation, Jack installs the Web and SQL services on each machine. Although AAM manages the cluster so that only one instance of each service is active at a time, the services must be installed on each node so that they can be activated if AAM migrates the resource group to that node. Once he completes the installation, he can begin creating resource groups. Each resource group manages its own service. The Web resource group uses a managed IP address, while the SQL resource group uses a managed IP address, a node alias, and a shared disk data source. Once Jack defines the two resource groups and brings them online, AAM begins managing the services. If node gold, running the Web service, fails, AAM recognizes the failure and moves the Web service and the managed IP address 1.2.3.200 to silver. The SQL service and related objects are configured to run on silver with failover to gold: if node silver fails, AAM moves the SQL service, the managed IP address 1.2.3.210, and the data source connection to gold.
If AAM encounters a problem, for example the service does not start or the managed IP address cannot be assigned, the shutdown sequence executes on the node. The resource group then attempts to execute on the next node in the list, in this example silver. If AAM encounters more problems, it attempts to start up on each successive node in the Preferred Node List until the resource group is online. When it reaches the end of the list, it pauses and returns to the first node on the list.
In the Event of Failure
Assuming a normal startup, the resource group enters the Online state and remains running on the most preferred node until some type of failure occurs. If one of the services fails, AAM attempts to restart it on the same node, gold. If that fails, the resource group executes the shutdown sequence and then executes the startup sequence on silver. If the node itself fails, AAM recognizes the failure and executes the startup sequence on silver. If at any point gold comes back online, the objects move back to gold because Auto-Failback is selected.
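The iteration over the Preferred Node List described above can be sketched as follows. This is an illustrative model only, not AAM code: try_startup stands in for AAM's internal startup sequence, and the pause and retry limits are arbitrary assumptions.

```python
import time

def bring_online(preferred_nodes, try_startup, pause=0, max_cycles=3):
    """Attempt the startup sequence on each node in the Preferred Node
    List in order. On reaching the end of the list, pause, then return
    to the first node. Returns the node where the resource group came
    online, or None if every node failed for max_cycles passes."""
    for cycle in range(max_cycles):
        for node in preferred_nodes:
            if try_startup(node):       # startup sequence succeeded here
                return node             # resource group is now Online
        time.sleep(pause)               # end of list: pause, then retry
    return None

# Example: gold fails to start the service, so the group comes up on silver.
up = {"gold": False, "silver": True}
assert bring_online(["gold", "silver"], lambda n: up[n]) == "silver"
```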
Once the startup sequence completes successfully, the resource group enters the Online state on the failover node, which then runs the Web service and responds to IP address 1.2.3.200.
The SQL Resource Group
The second resource group monitors the SQL Server service and is very similar to the Web service resource group. It manages the SQL service, a managed IP address, a shared disk data source, and a node alias.
Bringing the Resource Group Online
For this resource group, silver is the first node in the Preferred Node List. On startup, AAM attempts to execute the startup sequence on silver, and then on gold if problems are encountered. AAM repeats the process until the resource group enters the Online state on a node.
In the Event of Failure
Like its Web counterpart, once in the Online state this resource group remains running on the preferred node, silver in this case, until some type of failure occurs. Upon failure, the resource group moves to gold. Since the Auto-Failback option is selected, the resource group returns to silver once silver comes back online.
This task can be performed using a combination of resource groups and AAM's rule capabilities. Rules can be reactive, responding to an event within the domain, or proactive, performing an action before a serious condition occurs. AAM rules are implemented as Perl scripts; AAM provides an extended version of the Perl interpreter whose API extensions give direct access to AAM's management infrastructure and agents. Since Perl is an interpreted rather than a compiled language, changes to rules can be deployed without recompiling. Also, since Perl is platform independent, scripts written for AAM on one platform will run on another without modification. AAM also provides an extensive set of API calls for rules that allow user interaction with the AAM domain. You use the Rule Editor to create and edit rules. AAM uses triggers to initiate the evaluation of rules. Scheduled triggers allow you to configure events to take place once, at a repeating interval, at a time of day, or on a day of the month. Sensor-based triggers get data values from sensors, which monitor data values from an application. When a trigger fires, that is, initiates an event based on a change in its associated sensor value or its scheduled time, the AAM agent evaluates the text of every rule in the domain associated with that trigger. More than one rule can be associated with a trigger. Rules are generally written to be stateless and conditional, with the rule reacting to the state and value of the received trigger. The Perl code, for example, could start with an if statement that causes action to be taken only when a threshold trigger is in the ON state, ignoring triggers that indicate the OFF state.
Hardware Configuration
To provide this level of load balancing, Jack must reconfigure the connections within the AAM domain to include the node platinum. The SQL server on silver is still a member of the domain, with failover of its resource group still taking place to gold. To provide the sensor data needed to trigger the additional Web server, the Web resource group must be modified to include the proxy process called processProxy, which obtains sensor values from applications that are not AAM-aware. A proxy process is an application that publishes sensors and actuators on behalf of an unaware process. The processProxy and nodeProxy proxy processes are supplied with AAM, and other proxy processes can be programmed using the AAM SDK. The updated domain configuration is shown in Figure 4 on page 52.
The figure details the configuration of nodes gold and platinum. Node silver's configuration remains unchanged, as shown in Figure 3 on page 46. In the figure, gold is still the primary Web server, with failover backup on silver. Node platinum is added to the AAM domain to be used as a backup node when the load on the Web server node becomes too high. Marvellus is now using a round-robin DNS server, which splits the Web server requests between IP addresses 1.2.3.200 and 1.2.3.205. The Web pages on both gold and platinum access the SQL database on silver.
Resource Groups
The resource groups used in the first solution are still valid, with the addition of the processProxy process for the Web service to monitor the CPU load of the node where the resource group is running. The Web service on platinum must also be configured within AAM to auto restart so that it is always running.
Rules
The rule is triggered by a sensor that tracks the CPU load on the node where the main Web service is running. Normally, if gold is running, the Web service runs there; if gold is not available, the Web service runs on silver. The rule must be written to take into account the possibility of the Web service running on either node; however, for simplicity, assume that gold is the location of the Web service. Rule creation consists of defining triggers and creating Perl scripts that AAM evaluates when a trigger fires. In this case, the Web service runs actively on gold and platinum at all times. When the CPU load on node gold reaches 70 percent, the managed IP address 1.2.3.205 moves to node platinum. When the CPU load on gold drops below 20 percent, the managed IP address moves back. See Figure 5 on page 53.
Figure 5. Migration Graph
The figure maps the CPU load on gold to migration behavior: from 0 to 20 percent, the IP address returns to gold; between 20 and 70 percent, no IP migration takes place; from 70 to 100 percent, the IP address migrates to platinum.
Creating Triggers
To get the value of the CPU load, Jack uses the processProxy process provided with AAM. One of the sensors in that process provides the CPU load of a node, and triggers can be defined using this sensor to fire the rule. Using the AAM SDK, other proxies could be programmed to get the average number of hits per hour or the number of Web connections. Jack needs to create two triggers, both checking the value of the CPU load sensor. He sets the threshold of the first trigger, LoadHigh, to >= 70. During operation, whenever the CPU load of node gold is greater than or equal to 70 percent, the managed IP address 1.2.3.205 migrates to node platinum. He sets the threshold of the second trigger, LoadLow, to < 20. When the CPU load on gold drops below 20 percent, the managed IP address migrates back to that node. Jack configures the triggers to check the sensor value only every 60 seconds. Because the triggers are Threshold triggers, they fire whenever the threshold is crossed. Each time a trigger fires, any rule text associated with that trigger is evaluated. The state of the LoadHigh trigger is ON when the sensor value is
greater than or equal to 70 percent and OFF when it is less than 70 percent. The state of the LoadLow trigger is ON when the sensor value is less than 20 percent and OFF when it is 20 percent or greater. Assume the LoadHigh trigger checks the sensor value five times. The values, in percent, are 60, 72, 85, 10, and 50. The first value of 60 percent fires the trigger, providing an initial value. Since 60 percent is below the threshold, the trigger state is OFF. The value 72 percent crosses the threshold of 70 percent, so the trigger fires and the trigger state changes to ON. The next value of 85 percent does not cause the trigger to fire, since the condition >= 70 is still true; the state is still ON. The fourth value, 10 percent, causes the trigger to fire, since the value dropped below the threshold, and the trigger state toggles to OFF. The final value of 50 percent does not cause the trigger to fire, and the state remains OFF.
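The firing behavior just described can be modeled with a short script. This is an illustration of the Threshold trigger concept, not AAM code:

```python
def threshold_trigger(values, threshold):
    """Simulate a Threshold trigger: it fires on the first sample (to
    establish an initial state) and afterwards only when a sample
    crosses the threshold, toggling the state between ON and OFF."""
    state = None
    events = []                       # one (value, fired, state) per sample
    for v in values:
        new_state = "ON" if v >= threshold else "OFF"
        fired = state is None or new_state != state
        state = new_state
        events.append((v, fired, state))
    return events

# The LoadHigh (>= 70) example from the text: 60, 72, 85, 10, and 50.
events = threshold_trigger([60, 72, 85, 10, 50], 70)
# Fires at 60 (initial, OFF), 72 (ON), and 10 (OFF); 85 and 50 do not fire.
assert [fired for (_, fired, _) in events] == [True, True, False, True, False]
assert [state for (_, _, state) in events] == ["OFF", "ON", "ON", "OFF", "OFF"]
```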
Creating Rules
When the triggers fire, the rule evaluates the states of the two nodes and performs the following tasks: If the initial CPU load of the node where the Web resource group is running, most likely gold, is less than 70 percent, the managed IP address 1.2.3.205 is also assigned to that node. Later, if the CPU load on that node is greater than 20 percent and less than 70 percent, the managed IP address remains on the current node. If the CPU load of the resource group's node is 70 percent or greater, managed IP address 1.2.3.205 migrates to node platinum.
When the rule runs for the first time, AAM checks to see whether the managed IP address 1.2.3.205 is assigned to a node. If it is not assigned, AAM assigns it to node platinum, assuming that during operation it will move to the proper location. Only if the domain has just been brought up should the IP address be unassigned. The rule text uses a conditional statement to check the state of the triggers. When the CPU load of node gold is 70 percent or greater, the LoadHigh trigger state is ON. If the LoadHigh trigger state is ON, the rule executes the portion of the conditional dealing with that condition: if platinum is running, AAM migrates the managed IP address 1.2.3.205 to that node. When the CPU load of the node running the Web resource group is less than 20 percent, the LoadLow trigger state is ON. If that trigger state is ON, the rule executes the portion of the conditional dealing with this condition and migrates the managed IP address 1.2.3.205 from platinum back to the resource group node.
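Although AAM rules are actually written in Perl against the AAM rule API, the conditional logic just described can be modeled in Python for illustration. The names evaluate_rule and migrate_ip are hypothetical stand-ins, not AAM functions:

```python
def evaluate_rule(trigger, trigger_state, ip_assigned_to, rg_node,
                  platinum_running, migrate_ip):
    """Illustrative model of the rule text: react to LoadHigh/LoadLow
    trigger events by moving the managed IP address 1.2.3.205 between
    the resource group's node and platinum. migrate_ip is a stand-in
    for the API call that moves a managed IP address; it returns the
    node now holding the address."""
    if ip_assigned_to is None:
        # First run after the domain comes up: assign the IP somewhere.
        return migrate_ip("platinum")
    if trigger == "LoadHigh" and trigger_state == "ON" and platinum_running:
        return migrate_ip("platinum")   # offload: move the IP to platinum
    if trigger == "LoadLow" and trigger_state == "ON":
        return migrate_ip(rg_node)      # load is low again: move it back
    return ip_assigned_to               # between thresholds: no change

moves = []
migrate = lambda node: (moves.append(node), node)[1]
assert evaluate_rule("LoadHigh", "ON", "gold", "gold", True, migrate) == "platinum"
assert evaluate_rule("LoadLow", "ON", "platinum", "gold", True, migrate) == "gold"
```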
If the CPU value is between 20 percent and 70 percent, the state of both triggers is OFF. In this case, unless a failure occurs, the managed IP address remains on its current node. Marvellus has now created a load balancing solution for its Web server. This solution could be expanded to include three or more nodes using the same principles.
Legato Automated Availability Manager (AAM) uses a number of components during operation. Previous chapters have touched on some of these components. This chapter describes the following components in greater detail:
"AAM Agent Architecture" on page 57
"Network Architecture" on page 62
"Data Source Architecture" on page 63
"Rule and Monitoring Architecture" on page 68
"Resource Group Architecture" on page 72
Agents
Each node has an AAM agent installed. The agent provides the monitoring and management capabilities within the node. AAM supports two types of agents, primary and secondary; configuration determines the agent type for each node. Primary agents provide application management and monitoring, and maintain AAM's replicated databases of configuration and state information. Primary agents also log events, evaluate rules, and service all AAM client requests. For correct operation, there must be at least one primary agent
running in the domain. Primary agents provide high availability of AAM itself: with multiple primaries, one can fail and management within the domain continues. Secondary agents only perform monitoring and management operations on behalf of, and are dependent on, primary agents. They perform no database replication or rule interpretation. Because secondary agents require fewer system resources than primary agents, they allow an AAM domain to scale up to 100 nodes. If necessary, you can promote and demote agents to maintain the processing balance in the domain.
Agent Components
Four components make up a primary agent: the agent process, the process monitor, the replicated database, and the rule interpreter. Secondary agents run only the agent process and process monitor. The following sections describe each agent component, and Figure 6 on page 58 illustrates their interaction.
Agent Process
The agent process coordinates AAM activities on the node and manages the activities that occur in the domain, either in response to a rule or to user interaction. These activities include starting and stopping processes, attaching and detaching data sources, and assigning or unassigning node aliases and managed IP addresses. The agent processes provide self-checking fault tolerance to ensure high availability within AAM itself: if any of the agent processes fail, the others automatically restart the failed process within seconds. When changes occur in the database, such as the definition of a resource group, this information is immediately, or synchronously, reflected in all instances of the replicated database. If a primary agent is not running at the time of the database change, it is automatically updated as soon as it starts up again. Agent processes are also responsible for communicating with aware processes for sensor and trigger data and for invoking actuators. If an aware process tries to communicate trigger or event information while the AAM agent is down, the AAM SDK stores up to 200 events until the agent restarts.
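The SDK-side event buffering can be pictured as a small bounded queue. This is only an illustration; whether AAM discards the oldest or the newest events once 200 are stored is not specified here, so dropping the oldest is an assumption of this sketch:

```python
from collections import deque

class EventBuffer:
    """Sketch of the buffering described above: while the agent is down,
    up to a fixed number of trigger/event messages are held; the oldest
    events are dropped once the buffer is full (an assumption)."""
    def __init__(self, capacity=200):
        self.events = deque(maxlen=capacity)   # oldest dropped when full

    def record(self, event):
        self.events.append(event)

    def flush(self, send):
        while self.events:                     # agent is back: replay all
            send(self.events.popleft())

buf = EventBuffer(capacity=3)
for e in ["e1", "e2", "e3", "e4"]:
    buf.record(e)
sent = []
buf.flush(sent.append)
assert sent == ["e2", "e3", "e4"]              # e1 was dropped at capacity
```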
Replicated Database
Each primary agent maintains a copy of AAM's replicated database, which stores information about the AAM domain. All object definitions and states are stored in the database, and whenever information changes in the domain, the AAM agent recognizes the changes and sends the information to each database copy. As a result, the database replicas always remain synchronized.
Process Monitor
The AAM Process Monitor is responsible for monitoring the state of all nodes, as well as the managed processes and services on its node. It is also responsible for sending out heartbeats to the other agents in the domain to provide node-level failure detection. In the event of a node or process failure, the process monitor reports the failure to the AAM agent process, which then manages the recovery and passes the information to any rules or resource groups that must execute because of the failure.
Rule Interpreter
The Rule Interpreter runs only on primary agent nodes. AAM's rule interpreter offers coherent, consistent, predictable rule behavior: when a rule executes, it executes in exactly the same way every time, performing the steps in the same order. Each rule is assigned a rule interpreter at the time the rule is enabled; AAM chooses the rule interpreter randomly from the active rule interpreters at that time. The rule remains on that rule interpreter until the interpreter fails, the node fails, or the rule is disabled. If the rule interpreter or node fails, all of the rules assigned to that rule interpreter are automatically reassigned to a rule interpreter on a surviving node. After they are reassigned, all rules are evaluated. The rule interpreter has direct access to the AAM agent and its replicated database, and it is this access that allows rules to perform almost any management operation automatically. For more information about AAM's rule architecture, see "Rule and Monitoring Architecture" on page 68.
Primary Agents
A node running a primary agent comprises all four agent components. It manages services and processes, executes rules, manages resource groups, and monitors the state of other nodes and of the processes running on its own node. It also maintains a copy of the AAM replicated database. Primary agents communicate with all other agents in the domain, ensuring a consistent view of the domain. Events from secondary agents that require action are handled by primary agents, and any object or state information is maintained in the AAM replicated database.
Primary agents cannot be deleted directly. However, they can be demoted to secondary agents and then deleted.
Secondary Agents
A secondary agent provides management and monitoring for all of the managed resources on its node, but does not execute rules or maintain a copy of the replicated database; the primary agents handle rule execution and database updates. Secondary agents connect over the network to an AAM backbone process running on a primary agent. If the connected primary agent fails, the secondary agent shuts down momentarily and restarts after connecting to another primary agent in the domain. Secondary agents can be promoted to primary agents.
Network Architecture
The nodes in an AAM domain use a communication network to communicate with each other. Each participating AAM node connects to at least one network. The nodes in the domain may be attached to a single network subnet, or spread over multiple subnets using a router. The architecture fully supports redundant networks and provides automatic failover between networks.
Network Configuration
Each node in the domain requires at least one network connection, over which it sends internal messages. AAM supports multiple networks but does not require them. AAM can be configured to share the same network that the managed applications use, or to use one or more private networks for domain communication. The network configuration for your domain may be on one network subnet, or span multiple subnets with communication through a router. If the router supports multicast broadcasts, AAM communicates normally without modification. If your router does not support multicasts, your domain can still span multiple subnets by using AAM's point-to-point communication option. However, when using managed IP addresses along with managed processes, the managed IP address must be in the same subnet as the static IP address defined for the Network Interface Card (NIC). For example, if the defined managed IP address is 1.2.3.250, the address works correctly on the five nodes that are part of the 1.2.3 subnet, but cannot interface with the five nodes in the 1.2.4 subnet. If a split configuration is necessary, you must ensure that the 1.2.3.250 managed IP address is configured only on nodes in the 1.2.3 subnet.
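The subnet constraint can be checked mechanically. The following sketch, assuming a /24 network mask for the 1.2.3.x example above, shows the kind of test a configuration script might apply; it is not part of AAM:

```python
import ipaddress

def same_subnet(managed_ip, nic_ip, prefix=24):
    """Check that a managed IP address falls in the same subnet as the
    static IP address defined for a node's NIC (assuming a /24 mask)."""
    net = ipaddress.ip_network(f"{nic_ip}/{prefix}", strict=False)
    return ipaddress.ip_address(managed_ip) in net

# 1.2.3.250 works on nodes in the 1.2.3 subnet, but not in the 1.2.4 subnet.
assert same_subnet("1.2.3.250", "1.2.3.4") is True
assert same_subnet("1.2.3.250", "1.2.4.5") is False
```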
Network Redundancy
Although redundant networks are not required, Legato recommends that you configure each cluster to have at least two private networks to eliminate the network as a single point of failure. If a network failure occurs in a cluster with multiple cluster networks, AAM automatically fails over to the working network. However, if managed applications are also using the cluster communication networks, it is the responsibility of the application to move to another network. Application communication may cease if the network is unavailable. Client applications using network communications may not respond until the first network returns.
Redundant networks may also be necessary to provide adequate protection when using shared disks. If only one network is used and a network partition occurs between the machines accessing the disk, communication is lost and there are no checks to ensure that only one machine accesses the disk at one time. Disk corruption may occur due to concurrent writes. The addition of redundant networks greatly reduces the risk of this scenario; if one network fails, AAM communication within the domain continues, and AAM can continue to manage the disk properly.
Failure Detection
Agents communicate with each other using a heartbeat mechanism over the domain communication networks. On primary agents, the process monitor both sends and receives heartbeats; on secondary agents, the process monitor only sends heartbeats. Once a primary agent receives a heartbeat from a node, it expects the heartbeats from that node to continue. If an agent no longer receives heartbeats from the source node, it assumes that the remote agent has failed. The agent then pings the source node. If the ping succeeds, the agent assumes that only the agent on the source node has failed. If the ping fails, the agent assumes that the entire node has failed. For added reliability, AAM can be configured to ping the node over multiple network interfaces.
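The detection sequence can be summarized as a simple decision model. This is illustrative only; the timeout value and the ping callback are placeholders, not AAM parameters:

```python
def diagnose(last_heartbeat, now, timeout, ping):
    """Illustrative model of the failure detection described above: if
    heartbeats from a node stop arriving, ping the node to decide
    whether only the agent failed (node still reachable) or the whole
    node failed (no answer)."""
    if now - last_heartbeat <= timeout:
        return "healthy"
    if ping():                  # node answers: only the agent process died
        return "agent-failed"
    return "node-failed"        # no answer: assume the entire node is down

assert diagnose(100, 105, 10, lambda: True) == "healthy"
assert diagnose(100, 120, 10, lambda: True) == "agent-failed"
assert diagnose(100, 120, 10, lambda: False) == "node-failed"
```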
Data Source Architecture
AAM supports the following data source types:
AIX_LVM: AAM manages attaches using AIX Logical Volume Manager.
HP_LVM: AAM manages attaches using HP Logical Volume Manager for HP-UX.
Shared_Disk: AAM manages the attaches to a shared disk from two or more nodes in the domain.
Network_Share: AAM manages attaches from nodes within the AAM domain to a Windows share. The Windows share may be on a machine outside the AAM domain.
Legato Mirroring for Windows 2000: The Mirroring for Windows 2000 Servers data source provides built-in volume-level mirroring for Windows 2000 servers.
NT_VxVM: AAM manages attaches using Veritas Volume Manager.
RepliStor File System: This data source allows access to RepliStor mirrored applications.
SUN_SDS: AAM manages attaches using Solaris Solstice Disk Suite.
SUN_VxVM: AAM manages attaches using Veritas Volume Manager for Solaris.
UX_File_System: AAM manages attaches using a UNIX file system, either UFS, VxFS, or NFS.
EMC SRDF: AAM manages attaches using EMC SRDF.
EMC PowerPath Volume Manager: AAM manages attaches using EMC PowerPath Volume Manager.
Because a Windows share does not require a physical connection to the node, it is much more scalable than a shared disk, and a connection can be made from any node in the domain. You can limit access by determining the number of connections that can be made from nodes in the domain. AAM cannot control access to the share from machines outside of the domain.
EMC SRDF
The EMC SRDF data source provides application availability to critical data, even in the event of node or data source failure. A typical configuration is an AAM domain of at least two nodes connected to a pair of EMC Symmetrix data arrays.
The EMC SRDF software controls access within the Symmetrix data array, providing access to one or more devices managed as a device group. The device groups are configured into two sides: the R1 (primary) side and the R2 (secondary) side. When a machine is connected to a pair of Symmetrix data arrays, the side that is its primary connection is always referred to as the R1. Normally, data is mirrored dynamically from R1 to R2. If R1 becomes inaccessible, a failover takes place and the data is then accessible from R2, the secondary connection. When a host is connected to R2, no mirroring takes place, although the hosts have read and write access on R2. After a failover, data can be resynchronized by an R1 Update operation, which is initiated manually, or by a failback, in which the host is reconnected to R1. In the case of a failback, the synchronization of data takes place automatically. The EMC SRDF data source controls data access from the node to the Symmetrix. This data source can be included in a resource group to provide high availability in the event that the data source or other resources fail. The resource group controls node access to the Symmetrix. The data source uses a predefined device group name to determine which Symmetrix and corresponding devices it should use for connection.
When a rule is enabled, AAM chooses a primary agent on which to run the rule. This can be any primary agent; as a result, rules must be node independent. Once AAM determines which agent is responsible for the rule, the rule is loaded into the rule interpreter on that node and can then run whenever any of its associated triggers fire. If the rule interpreter or node fails, the rule is immediately and automatically relocated to another rule interpreter.
Sensor-based triggers fire when their sensors return data values matching the condition criterion; a number of condition types can be set for sensor-based triggers. Scheduled triggers fire at a specific time or on demand.
Scheduled Triggers
Scheduled triggers are not connected to a sensor but rather are configured to fire at a pre-scheduled time. Once a scheduled trigger is configured, the agent tracks its schedule, and at the configured time the trigger fires. A special case of the scheduled trigger is the on-demand trigger. On-demand triggers allow the user to trigger rules to run as needed and can be fired from within the rule itself or from the Management Console. When the user or a rule invokes the trigger, the agent receives the message and passes it on to any rules driven by the trigger.
Actuators
Actuators are functions provided by an AAM-aware process which can be invoked from rule text. When the rule interpreter evaluates the rule text and executes the actuator function, the message is sent to the agent, which in turn communicates with the appropriate aware process to invoke the function. Actuators allow users to expand the functionality available to AAM to manage the environment.
Using the AAM SDK, users can program sensors and actuators. Some actuators are included with the AAM proxy processes.
Proxies
Proxies are AAM-aware processes that publish sensor values on behalf of another process. Actuators that act back on the application through a proxy are also provided, bringing aware-process functionality to an unaware application. Communication to and from the proxy process takes place in the same manner as normal sensor, trigger, and actuator operation. The proxy process gathers sensor values from the unaware application and reports them to the agent. When the agent fires an actuator for the process proxy, the proxy runs the function, which affects the unaware process. The agent communicates with the proxy process through the AAM SDK library, which gets sensor data and invokes functions through the managed application's API or CLI. Figure 9 on page 72 illustrates the relationship of the proxy to the AAM agent and the managed application.

Figure 9. Proxy Process Architecture
[Figure: the AAM agent communicates through the AAM SDK library with the proxy process, which drives the shrink-wrapped managed application through the application's API/CLI.]
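The proxy relationship shown in Figure 9 can be sketched as follows. This is a conceptual illustration, not the AAM SDK library: the class and method names are invented, and the unaware application is reduced to a trivial stand-in.

```python
class UnawareApp:
    """Stand-in for shrink-wrapped software that only offers its own API/CLI."""

    def __init__(self):
        self.running = True

    def status(self):
        return "up" if self.running else "down"

    def restart(self):
        self.running = True


class Proxy:
    """Sketch of a proxy process: it publishes sensor values gathered from
    the unaware application and exposes actuators that act on the
    application through its native API."""

    def __init__(self, app):
        self.app = app

    def sensor_app_status(self):
        # sensor side: gather a value from the unaware app, report it upward
        return self.app.status()

    def actuator_restart(self):
        # actuator side: fired by the agent, affects the unaware process
        self.app.restart()
        return self.app.status()


app = UnawareApp()
proxy = Proxy(app)

app.running = False                  # simulate an application failure
before = proxy.sensor_app_status()   # proxy reports "down" to the agent
after = proxy.actuator_restart()     # agent fires the actuator; app is "up" again
```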
Resource groups use the same sensor and trigger system that rules use. Triggers and sensors for resource groups are created automatically when the resource group is saved, so the user does not have to configure them.
Glossary
This glossary contains terms and definitions found in this manual. Most of the terms are specific to Legato Automated Availability Manager (AAM).

Actuator

An external function call in an AAM-aware process, defined and provided through the AAM SDK. Actuators can be invoked from a rule or the Management Console. The process that the actuator takes action upon need not be the same process whose sensor value fired the trigger.

Attributes

Name/value labels that can be attached to the domain, resource groups, nodes, or processes. Attributes can be used by rules to organize objects into subgroups, or to attach user-defined information to objects for use by rules.

Auto-failback

Selecting this option in a resource group causes the resource group to be taken offline, then brought online on the failback node, if it is hosted on a node other than the failback node when the failback node comes online.

Aware Process

A process defined to AAM that has bound the AAM programming library to publish sensors or actuators.

Backbone

The processes running on AAM primary agents that provide messaging services. The backbone includes the process monitor, rule interpreter, and the AAM replicated database.
Bring Online

Initiates an attempt to execute the resource group's startup sequence. When the startup sequence completes successfully, the resource group is in the Online state.

Command Line Interface

A program that provides access to the capabilities of the Management Console, including additional capabilities for debugging and batch processing. Also referred to as the CLI.

Data Source

A resource that provides data content to an application. With AAM, data sources can be configured to detach from a failed node and attach to a node that takes over the application associated with the data source.

Disable Monitoring

Turns off resource group monitoring. When monitoring is disabled, AAM does not restart resource objects if they go offline. This mode is useful for manual intervention and maintenance of resource objects without the need to relocate the entire resource group.

Disable Rule

Turns off a rule and its triggers.

Domain

A group of nodes that participate in a particular management scope. A node can belong to more than one domain. All nodes on a network with an AAM agent installed are known collectively as the domain. The AAM domain should not be confused with the Windows networking domain.

Inter-agent messages form the network AAM agents use to communicate with each other. AAM clusters can be configured to use an optional independent network or to use the existing service network.
Enable Monitoring

Turns on resource group monitoring. When monitoring is enabled, AAM attempts to keep the resource group and all of its resource objects online in the event an object goes offline.

The AAM monitoring model includes the following components: sensors to monitor the resources in the environment, triggers to initiate a response to received sensor data, and actuators to cause the appropriate actions to occur.

Existence Test

A test to check for the physical existence of one or more managed processes. AAM provides the option of user-defined existence monitors for processes, which can provide a more detailed and accurate status of the process.

Failure Detection

Settings for the heartbeat interval and the multicast or point-to-point addresses used by the AAM failure detection mechanism.

Fire

A trigger fires when its monitoring condition changes. The execution of a trigger is often referred to as firing.

Heartbeat

A periodic message sent from agent to agent for managing node state; also, a message sent from an aware process to the AAM agent to indicate its responsiveness. Heartbeats are used to determine node failure and isolation.

I/O Redirection

A system that allows input and output of a managed process, which would normally go to the file handles STDIN, STDOUT, and STDERR, to be redirected to an alternate destination. Because AAM usually runs managed processes as background processes, the information sent to these locations may be lost without redirection.
Isolation Detection

A mechanism that allows an AAM node to determine whether it has become isolated on the network. Its primary purpose is to protect against a situation in which a node cannot communicate with any resource. Isolation detection works in conjunction with the Minimum Detection Time set in the Domain Settings.

Isolation Script

The script that is invoked on an AAM agent when the node detects that it is isolated. This script includes the instructions that the node follows to ensure that the resource objects, resource groups, and rules on both the isolated node and on other nodes in the domain respond appropriately to the isolation.

MAC Address

Media Access Control address. A 48-bit address assigned by the interface card manufacturer. Most manufacturers specify a unique MAC address for each card. When managed IP addresses are moved among interface cards, MAC addresses can be modified by AAM if the model of network interface card allows it.

Managed Objects

Process, service, managed IP address, node alias, data source.

Managed Process

A process or service that has been defined to AAM and is started and stopped under the control of AAM from the GUI, a rule, or a resource group.

Management Console

The graphical user interface that provides access to create and manage cluster objects.

Network Interface Card

The hardware component that allows the node to communicate over the network. IP addresses are assigned through software to the network interface card, and those addresses are used to determine the source and destination of network communication. Also referred to as a NIC.
NIC Group

A mechanism used by Legato Automated Availability Manager to group NICs on the same subnet. Once NICs are included in a group, NIC management tasks such as NIC testing and NIC-to-NIC failover can be performed.

NIC-to-NIC Failover

A failover scheme that allows all IP addresses assigned to a Network Interface Card (NIC), including subnet and MAC address information if any, to move to another NIC on the same node if the first NIC fails.

Node Alias

A name that can be added to a Windows node that NetBIOS can recognize, allowing access from computers on the network. Called a Virtual Server by MSCS, the AAM node alias name is not registered with DNS or WINS and is only accessible from NetBIOS clients.

Node Proxy

An aware proxy process provided by AAM that publishes node-related sensors and actuators on behalf of a node. See also Proxy Process.

Node States

Running, Shutdown, Failed, Agent Failed, Unknown.

Nodeless Resource Group

A resource group intended to be controlled from a rule. Because the rule determines the nodes on which the resource group will run, no nodes are assigned in the Preferred Node List. The nodeless resource group is configured and, once assigned to a node by the controlling rule, runs like a standard resource group. See also Resource Group.

Persistent Variable

A variable that is maintained in the AAM Replicated Database rather than in local memory on the node. It remains available even if the rule referencing it fails.
Physical IP Address

The IP address that is manually assigned to a Network Interface Card (NIC) by a user. This IP address is usually one of the primary addresses used by the NIC. Once in place, this IP address can be managed by AAM. See also Virtual IP Address, Network Interface Card.

Preferred Node List

The subset of nodes on which a resource group can be brought online.

Primary Agent

A node running an agent, process monitor, and rule interpreter, which provide the main activities for management of a cluster. A primary agent also maintains a copy of the replicated database. At least one primary agent must be running at all times for AAM to function properly. There are typically two to five primary agents in a domain; any domain of two or more nodes should include at least two primary agent nodes. Other nodes in a domain should run as secondary agents.

Process Monitor

Monitors the state of processes and services on a node, and has the secondary function of managing the node failure detection mechanism using heartbeats. Also referred to as the ProcMon.

Process Proxy

A proxy process provided by AAM that publishes process-related sensors and actuators on behalf of a managed process. The process proxy is associated with the process in AAM's process configuration and publishes sensors to the given sensor class. See also Proxy Process.

Process States

Unknown, Running, No Response, Stopping, Stopped, Failed.

Properties

Static characteristics of a node, such as OS version, physical memory, and so on.

Proxy Process

Allows an application to publish actuators and sensors on behalf of another (unaware) process in a tightly integrated manner. See also Node Proxy, Process Proxy, Managed Process.
Publish

The act of providing sensor data to AAM.

Relocate

Move a resource group currently running on one node to another node in the Preferred Node List. Relocation allows for manual load balancing.

Replicated Database

Internal database where objects and object states are kept. Each primary agent maintains a synchronous copy.

Resource Group

The collection of objects, such as services, processes, IP addresses, data sources, and others, that comprise a failover group. Also, the Management Console screen that allows for the selection of objects, nodes, and attributes of a managed failover group. See also Nodeless Resource Group.

Shutdown Sequence

The list of resource objects that will be taken offline, the scripts that will execute, the events that will be posted, and the delays that will be executed, in the order specified, when attempting to take the resource group offline.

Startup Sequence

The list of resource objects that will be brought online, the scripts that will execute, the events that will be posted, and the delays that will be executed, in the order specified, when attempting to bring the resource group online.

Resource Objects

The objects that can be managed by AAM, such as processes, services, IP addresses, node aliases, and data sources.

Response Test

An optional test of a process's response state. Response monitors test that the target process is healthy and responding correctly. AAM provides the option of user-defined response monitors for processes and services, which can provide a more detailed and accurate status of the process or service.
Response Time

The time period within which the agent should expect to receive a message from an aware process.

Rule

A user-definable Perl script that is evaluated when a trigger condition changes.

Rule API

A collection of AAM subroutines that are accessible from a rule for performing AAM-related management capabilities. Most rule API subroutines can also be used in resource group scripts.

Rule Interpreter

The agent process that evaluates custom rules (and resource groups) on primary agent nodes.

Scope

The range over which a trigger applies. For node sensors, the scope indicates the nodes to which the trigger applies; for process sensors, the scope allows for the specification of the processes to which the trigger applies.

Secondary Agent

Two AAM processes, the agent and process monitor, that provide management capabilities for a node. A secondary agent does not execute rules, nor does it maintain a copy of the replicated database. Secondary agents allow AAM domains to scale to as many as 100 nodes with minimal overhead.

Security

The list of users with permissions in the domain and the level of access allowed to each user. AAM provides three forms of user-level access: User, Operator, and Administrator.

Sensor

A piece of data to be monitored. Provides data values for use by triggers and rules. Sensors are published by the agent, by proxy processes, and by any other AAM-aware processes.
SDK

Provides the AAM programming API, enabling users to develop their own aware applications. With the SDK, users can write additional sensors and actuators to complement and extend those provided by AAM.

Start Script

A script used in conjunction with a managed process that performs specific actions each time the process is started. The start script is often used to launch a number of related processes that are necessary for the proper execution of an application. See also Stop Script.

State Monitor

AAM provides state monitoring capabilities for processes and services defined in the domain. There are two types of monitors to test different aspects of a process's health: the Response Monitor and the Existence Monitor.

Stop Script

A script used in conjunction with a managed process that performs specific actions each time the process is shut down gracefully by AAM. The stop script may ensure that all processes associated with an application are properly closed. See also Start Script.
Subnet Mask
A TCP/IP configuration parameter that extracts network and host configuration from an IP address. This 32-bit value allows TCP/IP to distinguish the network ID portion of the IP address from the host ID portion; the host ID identifies individual computers on the network. TCP/IP hosts use the subnet mask to determine whether a destination host is located on the local network or a remote network. A subnet mask is expressed as four decimal numbers from 0 to 255 separated by periods, for example: 255.255.0.0.
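The network ID/host ID split described above is a bitwise AND of the address and the mask. This short example uses the Python standard library's ipaddress module; the addresses are arbitrary illustrations.

```python
import ipaddress

# The mask 255.255.0.0 keeps the first 16 bits as the network ID and
# leaves the remaining 16 bits as the host ID.
host = ipaddress.ip_address("192.168.37.42")
mask = ipaddress.ip_address("255.255.0.0")
network_id = ipaddress.ip_address(int(host) & int(mask))   # 192.168.0.0

# Two hosts are on the same local network when their network IDs match.
peer = ipaddress.ip_address("192.168.200.9")
same_network = (int(peer) & int(mask)) == int(network_id)  # True
```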
Trigger
An object that monitors a sensor for a logic condition and reports any matching conditions to one or more rules. Triggers are the objects that cause rules to execute.

Unaware Process

A process that does not use the AAM runtime API and so is unaware of AAM. This includes third-party and shrink-wrapped software that is monitored by AAM. Unaware processes can be associated with a proxy process to allow AAM to monitor various conditions.

Utility Process

A process that can be executed from AAM through a rule or from the Management Console. The utility process is started, but not managed, by AAM. It is typically a short-lived program, often used to perform some auxiliary function. Utility processes are also referred to as UtilProcs.

Virtual IP Address

The movable IP address that is assigned to a Network Interface Card (NIC) by AAM. See also Physical IP Address, Network Interface Card.