Concepts Guide
© 2003, LEGATO Systems, Inc. All rights reserved. This product may be covered by one or more of the following patents: U.S. 5,359,713; 5,519,853; 5,649,152; 5,799,141; 5,812,748; 5,835,953; 5,978,565; 6,073,222; 6,085,298; 6,145,089; 6,308,283; 6,324,654; 6,338,126. Other U.S. and international patents pending. Legato Automated Availability Manager, Release 5.1, Concepts Guide, September 2003, 01-6168-5.1
LEGATO and the LEGATO logo are registered trademarks, and LEGATO NetWorker, NetWorker, NetWorker DiskBackup, LM:, Celestra, PowerSnap, SnapImage, GEMS, Co-StandbyServer, RepliStor, SnapShotServer, QuikStartz, SAN Academy, AlphaStor, ClientPak, Xtender, XtenderSolutions, DiskXtender, ApplicationXtender, ArchiveXtender, EmailXtender, and EmailXaminar are trademarks or registered trademarks of LEGATO Systems, Inc. This is a nonexhaustive list of LEGATO trademarks, and other trademarks may be the property of their respective owners. The following may be trademarks or registered trademarks of the companies identified next to them, and may be used in this document for identification purposes only. Acrobat, Adobe / Adobe Systems, Inc. Apple, Macintosh / Apple Computer, Inc. Caldera Systems, SCO, SCO OpenServer, UnixWare / Caldera, Inc. TELEform / Cardiff Check Point, FireWall-1 / Check Point Software Technologies, Ltd. Unicenter / Computer Associates International, Inc. Access Logix, Celerra, Centera, CLARiiON, EMC, EMC2, MirrorView, MOSAIC:2000, Navisphere, SnapView, SRDF, Symmetrix, TimeFinder / EMC Corporation Fujitsu / Fujitsu, Ltd. Hewlett-Packard, HP, HP-UX, HP Tru64, HP TruCluster, OpenVMS, ProLiant / Hewlett-Packard Company AIX, DB2, DB2 Universal Database, Domino, DYNIX, DYNIXptx, IBM, Informix, Lotus, Lotus Notes, OS/2, PTX, ptx/ADMIN, Raid Plus, ServeRAID, Sequent, Symmetry, Tivoli / IBM Corporation InstallShield / InstallShield Software Corporation Intel, Itanium / Intel Corporation Linux / Linus Torvalds Active Directory, Microsoft, MS-DOS, Outlook, SQL Server, Windows, Windows NT / Microsoft Corporation Netscape, Netscape Navigator / Netscape Communications Corporation Data ONTAP, NetApp, NetCache, Network Appliance, SnapMirror, SnapRestore / Network Appliance, Inc. IntraNetWare, NetWare, Novell / Novell, Inc. Oracle, Oracle8i, Oracle9i / Oracle Corporation NetFORCE / Procom Technology, Inc. DLTtape / Quantum Corporation Red Hat / Red Hat, Inc.
R/3, SAP / SAP AG IRIX, OpenVault, SGI / Silicon Graphics, Inc. SPARC / SPARC International, Inc. (b) ACSLS, REELbackup, StorageTek / Storage Technology Corporation Solaris, Solstice Backup, Sun, SunOS, Sun StorEdge, Ultra / Sun Microsystems, Inc. SuSE / SuSE, Inc. Sybase / Sybase, Inc. Turbolinux / Turbolinux, Inc. VERITAS, VERITAS File System / VERITAS Software Corporation WumpusWare / WumpusWare, LLC UNIX / X/Open Company Ltd. (a) Unicode / Unicode, Inc. Notes: (a) UNIX is a registered trademark in the United States and other countries, licensed exclusively through X/Open Company, Ltd. (b) Products bearing SPARC trademarks are based on an architecture developed by Sun Microsystems, Inc.
5. LIMITED WARRANTY 5.1 Media and Documentation. Legato warrants that if the media or documentation are damaged or physically defective at the time of delivery of the first copy of the Software to Licensee and if defective or damaged product is returned to Legato (postage prepaid) within thirty (30) days thereafter, then Legato will provide Licensee with replacements at no cost. 5.2 Limited Software Warranty. Subject to the conditions and limitations of liability stated herein, Legato warrants for a period of thirty (30) days from the delivery of the first copy of the Software to Licensee that the Software, as delivered, will materially conform to Legato's then-current published Documentation for the Software. This warranty covers only problems reported to Legato during the warranty period. For customers outside of the United States, this Limited Software Warranty shall be construed to limit the warranty to the minimum warranty required by law. 5.3 Remedies. The remedies available to Licensee hereunder for any such Software which does not perform as set out herein shall be either repair or replacement, or, if such remedy is not practicable in Legato's opinion, refund of the license fees paid by Licensee upon a return of all copies of the Software to Legato. In the event of a refund this Agreement shall terminate immediately without notice. 6. TERM AND TERMINATION 6.1 Term. The term of this Agreement is perpetual unless terminated in accordance with its provisions. 6.2 Termination. Legato may terminate this Agreement, without notice, upon Licensee's breach of any of the provisions hereof. 6.3 Effect of Termination. Upon termination of this Agreement, Licensee agrees to cease all use of the Software and to return to Legato or destroy the Software and all Documentation and related materials in Licensee's possession, and so certify to Legato. Except for the License granted herein and as expressly provided herein, the terms of this Agreement shall survive termination. 7.
DISCLAIMER AND LIMITATIONS 7.1 Warranty Disclaimer. EXCEPT FOR THE LIMITED WARRANTY PROVIDED IN SECTION 5 ABOVE, LEGATO AND ITS LICENSORS MAKE NO WARRANTIES WITH RESPECT TO ANY SOFTWARE AND DISCLAIM ALL STATUTORY OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, OR ARISING FROM A COURSE OF DEALING OR USAGE OF TRADE AND ANY WARRANTIES OF NONINFRINGEMENT. ALL SOFTWARE IS PROVIDED AS IS AND LEGATO DOES NOT WARRANT THAT THE SOFTWARE WILL MEET ANY REQUIREMENTS OR THAT THE OPERATION OF SOFTWARE WILL BE UNINTERRUPTED OR ERROR FREE. ANY LIABILITY OF LEGATO WITH RESPECT TO THE SOFTWARE OR THE PERFORMANCE THEREOF UNDER ANY WARRANTY, NEGLIGENCE, STRICT LIABILITY OR OTHER THEORY WILL BE LIMITED EXCLUSIVELY TO THE REMEDIES SPECIFIED IN SECTION 5.3 ABOVE. Some jurisdictions do not allow the exclusion of implied warranties or limitations on how long an implied warranty may last, so the above limitations may not be applicable. 8. LIMITATION OF LIABILITY 8.1 Limitation of Liability. EXCEPT FOR BODILY INJURY, LEGATO (AND ITS LICENSORS) WILL NOT BE LIABLE OR RESPONSIBLE WITH RESPECT TO THE SUBJECT MATTER OF THIS AGREEMENT UNDER ANY CONTRACT, NEGLIGENCE, STRICT LIABILITY, OR OTHER LEGAL OR EQUITABLE THEORY FOR: (I) ANY INDIRECT, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES, HOWEVER CAUSED AND WHETHER OR NOT ADVISED IN ADVANCE OF THE POSSIBILITY OF SUCH DAMAGES; OR (II) DAMAGES FOR LOST PROFITS OR LOST DATA; OR (III) COST OF PROCUREMENT OF SUBSTITUTE GOODS, TECHNOLOGY, SERVICES, OR RIGHTS; OR FOR AMOUNTS IN EXCESS OF THOSE RECEIVED BY LEGATO FOR THE PARTICULAR LEGATO SOFTWARE THAT CAUSED THE LIABILITY. Because some jurisdictions do not allow the exclusion or limitation of incidental or consequential damages, Legato's liability in such jurisdictions shall be limited to the extent permitted by law.
9. MISCELLANEOUS 9.1 Governing Law. This Agreement shall be governed by the laws of the State of California, as applied to agreements entered into and to be performed entirely within California between California residents, without regard to the principles of conflict of laws or the United Nations Convention on Contracts for the International Sale of Goods. 9.2 Government Restricted Rights. This provision applies to Software acquired directly or indirectly by or on behalf of any government. The Software is a commercial software product, licensed on the open market at market prices, and was developed entirely at private expense and without the use of any government funds. All Software and accompanying Documentation provided in connection with this Agreement are commercial items, commercial computer software and/or commercial computer software documentation. Any use, modification, reproduction, release, performance, display, or disclosure of the Software by any government shall be governed solely by the terms of this Agreement and shall be prohibited except to the extent expressly permitted by the terms of this Agreement, and no license to the Software is granted to any government requiring different terms. Licensee shall ensure that each copy used or possessed by or for any government is labeled to reflect the foregoing. 9.3 Export and Import Controls. Regardless of any disclosure made by Licensee to Legato of an ultimate destination of the Products, Licensee will not directly or indirectly export or transfer any portion of the Software, or any system containing a portion of the Software, to anyone outside the United States (including further export if Licensee took delivery outside the U.S.) without first complying with any export or import controls that may be imposed on the Software by the U.S. Government or any country or organization of nations within whose jurisdiction Licensee operates or does business. Licensee shall at all times strictly comply with all such laws, regulations, and orders, and agrees to commit no act which, directly or indirectly, would violate any such law, regulation or order. 9.4 Assignment. This Agreement may not be assigned or transferred by Licensee without the prior written consent of Legato, which shall not be unreasonably withheld. Legato may assign or otherwise transfer any or all of its rights and obligations under this Agreement upon notice to Licensee. 9.5 Sole Remedy and Allocation of Risk. Licensee's sole and exclusive remedies are set forth in this Agreement. This Agreement defines a mutually agreed-upon allocation of risk, and the License price reflects such allocation of risk. 9.6 Equitable Relief. The parties agree that a breach of this Agreement adversely affecting Legato's intellectual property rights in the Software may cause irreparable injury to Legato for which monetary damages may not be an adequate remedy, and Legato shall be entitled to equitable relief in addition to any remedies it may have hereunder or at law. 9.7 No Waiver. Failure by either party to enforce any provision of this Agreement will not be deemed a waiver of future enforcement of that or any other provision, nor will any single or partial exercise of any right or power hereunder preclude further exercise of any other right hereunder. 9.8 Severability. If for any reason a court of competent jurisdiction finds any provision of this Agreement, or portion thereof, to be unenforceable, that provision of the Agreement will be enforced to the maximum extent permissible so as to effect the intent of the parties, and the remainder of this Agreement will continue in full force and effect. 10. ENTIRE AGREEMENT 10.1 This Agreement sets forth the entire understanding and agreement between the parties and may be amended only in a writing signed by authorized representatives of both parties. No vendor, distributor, dealer, retailer, sales person, or other person is authorized by Legato to modify this Agreement or to make any warranty, representation, or promise which is different than, or in addition to, the warranties, representations, or promises made in this Agreement. No pre-printed purchase order terms shall in any way modify, replace or supersede the terms of this Agreement.
Contents
Preface
    Audience
    Product Documentation
    Conventions
    Information and Services
        General Information
        Technical Support
        Licensing and Registration
    Customer Feedback
Scalable Solutions for Different Needs
    AAM Two Node Solution
        Features and Management Capabilities
    Customizing AAM for Larger Environments
        Features
    Automated Availability Manager SDK
    Automated Availability Manager Modules
    Time of Day
    Application Migration and System Changes
    New Systems
Triggers, Sensors, and Actuators
    Triggers
        Sensor-based Triggers
        Scheduled Triggers
        On-Demand Triggers
        Windows Event Log Triggers
    Sensors
    Actuators
Proxies
Agents
    Agent Components
        Agent Process
        Replicated Database
        Process Monitor
        Rule Interpreter
        Events and Rules Engine
    Primary Agents
    Secondary Agents
    Agents and Scaling
Network Architecture
    Network Configuration
    Network Redundancy
    Failure Detection
Data Source Architecture
    AIX Logical Volume Manager (LVM)
    HP-UX Logical Volume Manager (LVM)
    Windows Shared Disk
    Windows Network Share
    Legato Mirroring for Windows 2000
    Windows Veritas Volume Manager (VxVM)
    RepliStor Data Source
    Solaris Solstice Disk Suite (SDS)
    Solaris Veritas Volume Manager (VxVM)
    UNIX File System
    EMC SRDF
    EMC PowerPath Volume Manager
Rule and Monitoring Architecture
    Events and Rules Engine
    Triggers, Sensors, and Actuators
        Sensors and Sensor-based Triggers
        Scheduled Triggers
        Triggering the Rule
        Actuators
    Proxies
Resource Group Architecture
Preface
The Legato Automated Availability Manager Concepts Guide contains conceptual information about Legato Automated Availability Manager (AAM) software.
Audience
The information in this guide is intended for system administrators who are responsible for installing software and maintaining the servers and clients on a network. Operators who monitor the daily backups may also find this manual useful.
Product Documentation
Legato offers an extensive archive of product documentation at its web site www.legato.com. Most of the documents are in Adobe Acrobat Portable Document Format (PDF), and can be viewed by downloading and installing the Adobe Acrobat Reader. The Reader is available in the /viewers/acroread directory on the Legato Documentation Suite CD-ROM, or directly from Adobe at www.adobe.com. To install and use the Reader on the preferred platform, refer to the instructions in the CD-ROM's /viewers/acroread/readme.txt file or at the Adobe web site.
Conventions
This document uses the following typographic conventions and symbols to make information easier to access and understand.

boldface
    Indicates: names of line commands, daemons, options, programs, or scripts.
    Example: The nsradmin command starts the command line version of the administration program.

italic in text
    Indicates: pathnames, filenames, computer names, new terms defined in the Glossary or within the chapter, or emphasized words.
    Example: Displayed messages are also written to /nsr/logs/daemon.log.

italic in command line
    Indicates: a variable that must be provided in the command line.
    Example: nwadmin -s server-name

fixed-width
    Indicates: examples and information displayed on the screen.
    Example: media waiting: recover waiting for 8mm 5GB tape volume name

fixed-width, boldface
    Indicates: commands and options that must be typed exactly as shown.
    Example: nsr_shutdown -a

Menu_Name>Command
    Indicates: a path or an order to follow for making selections in the GUI.
    Example: Volume>Change Mode>Appendable

Important:
    Indicates: information that must be read and followed to ensure successful backup and recovery of data.
    Example: Important: Use the no_verify option with extreme caution.
General Information
The Legato web site provides most of the information that customers might need. Technical bulletins and binary patches are also accessible on the Legato FTP site. For specific sales or training needs, e-mail or call Legato.

www.legato.com
    Company and product information, technical bulletins, binary patches, and training program information
ftp.legato.com (log in as anonymous)
    Technical bulletins and binary patches
Legato Sales: (650) 210-7000 (option 1), sales@legato.com
    Company and product information
Legato Education Services: (650) 842-9357, training@legato.com
    Training program information
Technical Support
The Support section of the Legato web site provides contact information, software patches, technical documentation, and information about available support programs. Customers with an active support agreement have access to TechDialog, Legato's integrated product knowledge base. Help with Legato software issues is also available through Legato Technical Support. Customers without an active support agreement can contact Support Sales and Renewal to purchase annual Software Update Subscriptions, or Legato Technical Support services for per-update/per-incident support.
Contact information is provided separately for the Americas, Asia, and the Pacific, and for Europe, the Middle East, and Africa.
Customer Feedback
Legato welcomes comments and suggestions about software features, the installation procedure, and documentation. Please send any suggestions and comments to feedback@legato.com. Legato confirms receipt of all e-mail correspondence. Although Legato cannot respond personally to every request, all comments and suggestions are considered during product design. Help improve Legato documentation and be eligible to win a prize by completing a brief survey. Visit the Legato web site at www.legato.com, navigate to the documentation page, and click on the link to the survey.
Chapter 1: Introduction
Businesses need reliable access to mission-critical information and applications in order to stay competitive and be successful in today's ever-changing marketplace. With companies relying heavily on distributed computing architectures, multi-vendor environments, and web-based solutions, the need for high availability has never been greater. Every minute of downtime, planned or unplanned, can cost thousands of dollars. As such, you need a cost-effective way to monitor the state of your computer environment, automate repetitive tasks, and keep applications available and running.

By definition, High Availability (HA) means an environment where there is immediate response to failures, either at the application level or the machine level. When the application or environment fails, the HA solution relocates the affected resources and restarts the protected application. Reaction to failure, however, is only one aspect of a complete and coherent availability solution. A total HA solution goes far beyond this simple failover definition of traditional high availability and allows you to monitor and automate responses to events. This means that factors that can impact availability and effectiveness, such as system maintenance, load fluctuations, business workflow cycles, and even human error, can be monitored and managed.

In addition to availability issues, you want a solution that fits your current business environment as well as your business plan; a solution that:
- Leverages your existing investment in computing and networking resources
- Reacts and adapts to your dynamic environment
- Can be customized to your specific business needs
- Expands as your business expands
15
Legato Automated Availability Manager, also known as AAM, allows you to keep your company's applications up and running smoothly during both planned and unplanned events. AAM works within your existing environment, and is scalable in order to grow and change as your business needs require.
Failure. AAM provides complete and flexible failover capabilities at the application and node level. AAM minimizes downtime by detecting failures quickly when they occur, and allows you to have complete control over AAM migration by configuring shutdown and startup sequences.
Comprehensive Protection
Legato Automated Availability Manager is the answer to availability and automation needs over the continuum of service events. AAM provides the most comprehensive solution available for proactively monitoring, managing, and protecting resources. Preventing, detecting, and responding to resource failures, planned system changes, application migration, time-of-day occurrences, and performance degradation are all included in AAM's scope of protection. With AAM, capabilities that have been available in the past as individual static tools that cooperated poorly at best, and usually only when provided by the same vendor, are now available in one package to provide a comprehensive and dynamic solution, customized for the needs of your business.

When you install AAM on a set of nodes, you create a domain, the largest unit of management. AAM's comprehensive Events and Rules Engine provides a reliable framework through which the full range of service events are received and acted upon throughout the domain. As these events occur, the Events and Rules Engine automatically carries out your set of prepackaged and customized management policies. AAM provides advanced failover capabilities and full management capabilities for applications and services, as well as data sources for disks and file systems, managed IP addresses, and node name aliases for machines. With AAM you can manage anything from single objects to large installations of nodes whose applications have complex interdependencies. Machines can be grouped and managed by resource groups, which in turn can use all or any number of the resources contained within this group. To create more complex and intricate functionality, you can develop customized scripts specifically for your environment.
Proactive Management
Enterprise solutions often need sophisticated or proactive management capabilities. AAM provides management solutions at an enterprise level. At the heart of AAM's extensive availability capabilities are rules, management policies that can be developed to handle almost any operation. Rules can be as simple or advanced as the situation dictates.
Rules
The proactive, automated environment management capabilities provided by rules can monitor almost any data point within your environment or application and keep your managed applications available and responsive. Rules offer nearly limitless flexibility and control in implementing not only high availability but also environment management solutions and policies. Rules provide you with total control over the management solution by allowing you to:
- Define the states, or triggers, that drive the rule
- Determine the data, or sensor, value that a trigger monitors, or provide a schedule on which the trigger fires
- Determine the actions to be taken, in the form of a rule, as a result of the trigger notification
- Determine whether any function calls, or actuators, are made from within the rule
Each of these components is part of the rule process used by AAM's fault-tolerant Events and Rules Engine for management. Rules provide the mechanism through which your AAM management policies are defined and carried out. Platform-specific sensors and sensors for monitoring node status, process status, and time of day are provided with the package. For highly customized applications, the AAM SDK allows you to create your own sets of custom sensors and objects called actuators to extend these management capabilities. For more details on triggers, sensors, and actuators, see "Chapter 2: Continuum of Service Events" on page 29. For more details on how rules and the Events and Rules Engine work, see "Chapter 4: Architecture Highlights" on page 57.
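To make the relationship between these components concrete, the sensor-trigger-rule-actuator pipeline can be sketched in ordinary Python. This is an illustrative thought experiment only: the class names and the restart action are invented for the sketch and do not reflect AAM's actual interfaces or rule language.

```python
class Sensor:
    """Publishes a data value, e.g. a process status or free disk space."""
    def __init__(self, read_fn):
        self.read_fn = read_fn

    def value(self):
        return self.read_fn()


class Trigger:
    """Fires when the monitored sensor value satisfies a condition."""
    def __init__(self, sensor, predicate):
        self.sensor = sensor
        self.predicate = predicate

    def fired(self):
        return self.predicate(self.sensor.value())


class Rule:
    """Carries out an action (which may call actuators) when its trigger fires."""
    def __init__(self, trigger, action):
        self.trigger = trigger
        self.action = action

    def evaluate(self):
        if self.trigger.fired():
            return self.action()
        return None


# Example: a rule that asks for a service restart when its state is "down".
service_state = {"status": "down"}
sensor = Sensor(lambda: service_state["status"])
trigger = Trigger(sensor, lambda v: v == "down")
rule = Rule(trigger, lambda: "restart-service")  # stands in for an actuator call
print(rule.evaluate())  # prints restart-service
```

In a real engine the evaluate step would run continuously in an event loop; here a single call is enough to show how the trigger drives the rule and the rule invokes the action.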
Peer-to-Peer Architecture
AAM implements a peer-to-peer architecture that eliminates the need for idle standby nodes, preventing unnecessary expenditures for additional hardware. This active-active architecture allows each node to run its normal applications, making productive use of your computing resources, while still providing a target location in the event of a failure, maintenance operation, or other scheduled event. This type of environment, in which different types of applications are running on different nodes within the domain, is a mixed workload environment, one that is fully supported by AAM without modification to the product or the applications it manages.

For example, consider the mixed workload environment of three applications running on three different nodes. If each application uses only 40 percent of its node's resources, then, if any single node fails, the application can migrate to a surviving node without causing a problem, although the surviving node now runs at 80 percent capacity. To avoid overloading a node when an application migration takes place, you could configure a hot standby node, one configured to be idle under normal circumstances but able to support the application when a migration event occurs. AAM supports this architecture but, unlike many other products, does not require this configuration. Instead, you could write a rule that checks the load of the potential destination node and decides whether the application should relocate there based on potential overload.
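The capacity arithmetic in that example is easy to verify, and the overload check such a rule would perform can be sketched as a one-line policy. The 90 percent limit below is an assumed illustration, not an AAM default:

```python
def can_relocate(dest_load, app_load, limit=0.9):
    """Return True if the destination node can absorb the migrating
    application without exceeding the configured load limit."""
    return dest_load + app_load <= limit


# A surviving node at 40% load can absorb another 40% application (80% total):
print(can_relocate(0.40, 0.40))  # True
# A node already at 60% cannot take a 40% application under a 90% cap:
print(can_relocate(0.60, 0.40))  # False
```

A destination-selection rule would apply this check to each candidate node in the failover order and relocate to the first one that passes.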
Load Balancing
AAM's Events and Rules Engine allows you to define sophisticated load balancing policies. Rules can be developed to monitor system resources or application load, and respond at runtime to move or stop applications dynamically to address load problems. AAM can also start multiple instances of the same application, providing dynamic load balancing across multiple nodes as processing requirements increase.

Consider again the example of three applications and three nodes. Imagine that each night at 8:00 a backup procedure is started on a node, and the response time of the application running on that machine increases significantly during its 45-minute backup period. To eliminate this problem, an AAM rule can be developed to automate the relocation of the application from the node during the backup and its return when the backup is done. Interestingly, there are a number of ways to implement the condition that initiates the migration. For example, the migration could be tied to the load on the machine, in which case AAM moves the application whenever the node gets busy. Another approach would be to move the application each night at 8:00 regardless of the load. Yet another approach might be to use AAM to start the backup procedure and design the rule to move the application before it starts the backup process. How AAM manages your environment is entirely up to you.
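The first two of those migration conditions, a fixed nightly window and a load threshold, can be sketched as a single check in plain Python. The window, threshold, and function name are assumptions made for the sketch, not AAM rule syntax:

```python
from datetime import time

def should_relocate(now, node_load,
                    backup_window=(time(20, 0), time(20, 45)),
                    load_threshold=0.85):
    """Relocate the application if the nightly backup window is open
    or the node's load has crossed the threshold."""
    start, end = backup_window
    in_backup = start <= now <= end
    overloaded = node_load > load_threshold
    return in_backup or overloaded


print(should_relocate(time(20, 15), 0.30))  # True: inside the 8:00-8:45 window
print(should_relocate(time(14, 0), 0.95))   # True: node is overloaded
print(should_relocate(time(14, 0), 0.30))   # False: neither condition holds
```

The third approach, moving the application before starting the backup itself, would instead sequence the relocation and the backup launch inside one rule action.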
Cross-Application Activation
Using the sensors available in AAM, rules can be created that take data values from one application and activate functions within that, or an entirely different, application. Even applications to which AAM does not have direct access can be monitored and managed using AAM's Proxy Processes, which publish sensor and actuator values on behalf of a closed application. Once-isolated applications can now be tied together to leverage the functional capabilities of other applications in the environment.
Simplified Failover
AAM allows you to perform failover for any resource object, including processes, services, IP addresses, data sources, and node aliases, on any machine in the domain (machines are called nodes when managed by AAM). AAM's failover system enables migration of an individual application, or of all applications and system resources running on or attached to a single node.
Chapter 1: Introduction
Other solutions may provide failover, but those solutions are typically built around a pair of servers, providing protection within a restricted environment. Scalability is impossible since applications are typically limited to a small number of machines. AAM, meanwhile, allows you to group up to 100 nodes and an unlimited number of applications into a single high availability unit, even in a mixed platform environment. It can switch over a single application, allowing the original node to stay operational, or, if necessary, switch over all applications and system resources running on or attached to the node. AAM also provides the ability to automatically fail the resources back to the original node if the node returns to a healthy state.
Resource Groups
AAM uses resource groups for failover. Each resource group coordinates the migration and other capabilities and defines the nodes and resources that are to be managed. The node definition provides the mechanism for AAM's cascading failover for up to 100 nodes in the domain. The resources, such as processes, disks, IP addresses, and node name aliases, define the bounds of management. Additionally, a resource group definition can contain further customization through the support of scripts, timing delays, utility processes, and failback options. In AAM's solution, the number of resource groups is limited only by the available resources in your environment. Your solutions are limited only by your needs. Making use of AAM's reliable rule architecture, resource groups are implemented internally as rules. Each resource group uses built-in sensors and triggers, which monitor and react to process and node failures. The flexibility of the rule engine allows for simple one-node application monitoring, two-node failover, and other traditional HA features, as well as a framework to do much more. AAM can relocate one or more resources independently, without waiting for the entire server to fail and without forcing the need to relocate every resource from that server concurrently. AAM can also automatically relocate resource groups to their original nodes when failed servers come back online. AAM's resource groups provide a comprehensive and easy-to-use mechanism for adding advanced multi-node failover capabilities to your AAM environment.
Failure Prevention
AAM resource group management provides load balancing upon failure. That is, when a node fails or comes down for maintenance, each resource group on that node can be relocated to a different node, so the entire application load can be better balanced. Without this ability, all of your applications could be failed over indiscriminately to a single node, potentially causing serious performance degradation for end users. AAM also provides the ability to relocate a resource group manually during times of high usage, or for routine system maintenance. Rules can also be written to monitor load across the domain, and move the resource group as needed.
Cross-Platform Capabilities
AAM is hardware independent and compatible with equipment supported by platform vendors for Windows and supported UNIX operating systems. AAM also supports leading UNIX platforms such as Sun Solaris 8 and above, HP-UX 11.0, Red Hat Linux, and AIX 5.1. AAM supports near-identical functionality across platforms, making mixed-platform environments transparent to system administrators and clients. This consistency allows transparent cross-platform migration between and among nodes running Windows and UNIX.
AAM's Management Console can be used to manage all nodes, Windows or UNIX. All nodes, regardless of platform, can be supported from a single console running on any node. The AAM Management Console automatically detects the platforms on which agents are running in the domain and displays the correct information as needed.
Centralized Management
AAM provides a centralized Management Console for full administration of all managed objects in the AAM domain. This single administration interface allows you to manage and view information about nodes, resource groups, individual resources, and rules for both Windows and UNIX nodes. The Management Console can be run from any machine on the network. In addition, any changes you make through the Management Console are automatically and immediately reflected across the entire AAM domain, and are stored in AAM's fully replicated distributed database. The AAM Management Console provides:
• A centralized monitoring and administration tool for taking managed resources and resource groups online and offline.
• A real-time reflection of managed cluster object states. As soon as AAM detects a state change for an object, the Management Console updates its display to reflect that change.
• An interface to define and configure all the managed resources in the cluster from a single local or remote location.
• An interface to create, configure, and run rules.
• Ability to manage multiple AAM domains from a single interface.
Often, high availability solutions are not redundant or highly available themselves. When the HA solution fails, the environment becomes vulnerable. There is little point in trying to provide a high availability solution with a low availability product, so AAM was designed to be reliable and highly available, with a redundant, self-monitoring architecture to keep its own processes running. AAM is built on top of a reliable messaging technology, ensuring that messages are distributed throughout the set of nodes. AAM automatically uses redundant network messaging if multiple networks are in place and maintains a fully replicated database of all its components and configuration information. The AAM agent is implemented as a group of specialized, self-checking processes that monitor each other's health and automatically restart any component that fails. The Events and Rules Engine guarantees that rules execute, even if the node on which they were executing fails.
• Resource Groups: These groups of applications and resource objects are at the heart of AAM's failover and migration capabilities. In the event of a failure or a user-initiated migration, all objects in the resource group are managed as a single package, providing reliability and consistency.
• Centralized Management: AAM provides centralized management from the Management Console, a graphical interface that can be run on any node, Windows or UNIX, on the network to provide a comprehensive view of all nodes and resources in the domain, regardless of platform or location. For more information, see "Centralized Management" on page 23.
• Data Sources: A data source is a domain-wide name used to specify a storage device. Defining a shared data source allows access to the device from any node in the domain. Data sources can be defined for a shared disk where AAM controls which node has access to the data source.
• Network Interface Cards: AAM enables you to test NICs on a node and to determine how the card will be used.
• Managed IP Addresses: AAM enables you to associate unique IP addresses with nodes in the domain. The IP address can be moved from node to node in the domain, allowing client connections to applications to continue in the event of migration. Clients that connect to a server using a managed IP address need not know which physical machine in the cluster is hosting it. AAM also allows advanced IP management, including MAC address modification, default NIC selection on a per-IP-address basis, and physical address migration.
• Node Aliases: AAM enables you to associate a unique name or names with Windows nodes in the domain. Node aliases allow client processes to use host names for connections rather than IP addresses to find the available resource group objects. The node alias can be moved from node to node in the domain, allowing client connections to continue even in the event of migration.
Clients that connect to a server using a node alias are connecting to the logical node name and need not know which physical machine in the cluster is hosting it. Note that this works only for applications that access the application through NetBIOS.
• Security: AAM provides a secure interface for defining and controlling domain-wide objects. AAM provides three types of security: user, operator, and administrator. Nodes must be added to the database before they can join the domain. Likewise, users must be added to the database and assigned a security level before they can access the domain.
• Event Viewer: AAM log files provide a history of events, which system administrators can use to analyze patterns of events that have led up to failures. Event messages from AAM appear in AAM's own Event Viewer and log files, as well as in the Windows Event Log.
• Failure Detection: The nodes in an AAM domain use a communication network for application and domain communication. Each participating AAM node is connected to at least one communication network, although multiple-network support is built in to AAM. If a network failure occurs in a cluster with multiple cluster networks, AAM automatically utilizes the working network.
• Isolation Detection: AAM provides a mechanism that allows a node to determine whether it has lost its network connectivity, and to respond appropriately. This behavior prevents resource groups and rules from starting on the isolated node when these objects are already running correctly on the unaffected portion of the network.
• Command Line Interface (CLI): The CLI provides the same functions as the Management Console, but allows AAM commands to be executed from within scripts or in batch mode.
Features
AAM users have access to built-in sensors, triggers, and actuators, as well as over 70 API calls for use in their rules, which leverage all of AAM's power. Rules are created in AAM's Rule Editor. The Rule API calls provide management control over all elements in the domain, provide access to the replicated database for data from both inside and outside of the domain, and provide a consistent view of the domain from all rules. AAM's tracing capabilities allow inclusion of descriptive bug-tracking information in rules. AAM's NIC failover capability ensures that network communication is not interrupted.
In addition, Proxy Processes (aware applications that publish sensors on behalf of other applications) are also included in AAM. These applications open up the management capabilities of the domain and create an integrated environment between AAM and the managed application.
Legato Automated Availability Manager (AAM) provides solutions for events across the continuum of service events described in "Chapter 1: Introduction" on page 15. With its reliable system of resource groups for failover and rules for management solutions ranging from simple to sophisticated, AAM not only addresses failure in the domain, but also allows customized solutions specific to your business needs. The group of machines where AAM agents are installed and configured to work together is known collectively as the AAM domain, which is at the center of the AAM architecture. Domains can be as small as one node, or as large as 100. All of the nodes contained within the AAM domain share certain characteristics, such as a common replicated database, event log, and failure detection settings. The AAM agent that is placed on each machine is a reliable, self-monitoring set of processes that performs all of AAM's monitoring and management functionality. Sensor processing, rule triggering, rule execution, and actuator firing in response to events all take place through the Events and Rules Engine, provided through the agent. Once the AAM agent is installed on a machine, that machine is referred to as a node. Nodes comprise the domain, over which rules can be used for flexible management. A node can also host multiple domains. Rules can be designed to handle events in the domain, and the solution can be as simple or sophisticated as the problem dictates. Nodes can also be included in resource groups, which provide management and availability for applications and other resource objects such as data sources, IP addresses, and node aliases. For added protection, data sources can be replicated using Legato Replication. These tools provide all of the functionality necessary to provide availability protection for your system over the continuum of service events.
Resource Groups
AAM's failover capabilities are implemented using resource groups. Each resource group coordinates the migration of a group or set of resources across the domain. With their failover and migration capabilities, resource groups go a long way toward covering events on the continuum of service events, providing solutions for:
• Failure
• Application migration
• System changes
• New systems
Overview
Resource groups provide a standard, easy-to-define solution for application migration, either planned or unplanned. Resource groups are defined in the Resource Group Editor, a point-and-click interface that allows quick configuration of the nodes in the domain and the objects to be managed. AAM's resource groups provide a comprehensive and easy-to-use mechanism for adding advanced multi-node failover and clustering capabilities to your AAM environment.
The preferred node list allows you to configure static load balancing for applications in your domain. Since the order of nodes is configurable, multiple resource groups running on the same node need not migrate to the same node in the event of failure. Instead, each resource group can migrate to a different node, preventing the possibility of overloading the destination nodes. If advanced operation requires a rule to determine the migration destination of a resource group, AAM supports nodeless resource groups. In this case, no node list is configured, and the rule is written to determine the most suitable location for the resource group if node failure occurs.
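Cascading failover over a preferred node list can be sketched as follows. The node names are invented, and this is only an illustration of the selection semantics, not AAM's implementation.

```python
# Minimal sketch: a resource group fails over to the first node in its
# configured preference order that is currently healthy.

def failover_target(preferred, healthy):
    """Return the first healthy node in preference order, or None."""
    for node in preferred:
        if node in healthy:
            return node
    return None

# Two groups hosted on the failed node "alpha" have different preferred
# orders, so they land on different survivors: static load balancing.
print(failover_target(["alpha", "beta", "gamma"], {"beta", "gamma"}))  # beta
print(failover_target(["alpha", "gamma", "beta"], {"beta", "gamma"}))  # gamma
```

A nodeless resource group would replace the fixed `preferred` list with a rule that computes the destination at runtime.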
Resource Objects
Resource groups are composed of a group of related AAM-managed objects. These objects are processes, services, managed IP addresses, network interface cards, node aliases, and data sources. Once you define the resource group, AAM treats it and all of its resource objects as a package. All objects within the resource group always run on the same node. If one of the objects within the resource group cannot run properly on a particular node, the whole resource group migrates to the next node in the preferred node list. Additionally, a resource group definition can contain further customization through scripts, timing delays, utility processes, and failback options.
Migration
Migration is the movement of a resource group's objects from one node to another. When a resource group migrates, AAM moves all objects within the resource group from the current node to the specified node. All managed objects within a single resource group migrate to the same node. Resource groups can only migrate to nodes specified in the Preferred Node List that have the software properly installed. Automatic migration occurs in the event of a node failure or unrecoverable resource failure. Users can also initiate migration of a resource group for events such as planned node outages or software upgrades on the current node. Resource groups can optionally be configured to provide automatic failback of a resource group when the preferred node for the group comes back online after a failure. This functionality ensures that a resource group can always run on the node most suited for operation if that node is available.
Tracking
AAM allows you to get statistics for uptime and planned versus unplanned downtime using its availability tracking feature. This feature allows you to see at a glance the availability of the resource group over any given period since the resource group was created.
Failure Protection
Resource groups protect your system against failure using AAM's advanced capabilities. In the event of node failure, the resource group migrates to the next node in the preferred node list. Since AAM allows up to 100 nodes in a resource group, the availability of your applications and resources is almost assured. Because of the way resource groups are implemented, you can also be assured that all objects within the resource group always run together on the same node.
AAM's ability to load balance in the event of failure also ensures that you can prevent overloading destination nodes, which could cause another failure. If an individual process or Windows service that AAM monitors within a resource group fails, AAM performs one of two actions based on the user's preference. AAM either:
• Restarts the failed process or Windows service, or
• Executes the resource group shutdown sequence followed by the resource group startup sequence on the same node to bring the resources back online.
If AAM is unable to restart the process or service on the original node, it then migrates the resource group to the next designated node specified for the resource group.
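The restart-then-migrate policy just described can be sketched as below. The retry limit of three is a hypothetical parameter for the illustration; the actual limit in AAM is user-configurable and this Python sketch only models the decision sequence.

```python
# Hedged sketch: try restarting a failed process on the current node up to
# a retry limit; if no restart succeeds, migrate the whole resource group.

def recover(restart_ok_on_attempt, max_restarts=3):
    """Return the recovery action taken for a failed process.

    restart_ok_on_attempt: attempt number on which a local restart would
    succeed (0 means restarts never succeed on this node).
    """
    for attempt in range(1, max_restarts + 1):
        if restart_ok_on_attempt and attempt >= restart_ok_on_attempt:
            return f"restarted locally on attempt {attempt}"
    return "migrated resource group to next designated node"

print(recover(restart_ok_on_attempt=2))  # restarted locally on attempt 2
print(recover(restart_ok_on_attempt=0))  # migrated resource group to next designated node
```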
Application Migration
AAM's migration capabilities also provide an efficient mechanism for moving the resource group and its objects to another node in the cluster. Because migration is automatic, downtime caused by migration is minimal. Because AAM does all the work of moving the resources, you are assured that they all migrate correctly. With AAM, migration of a resource group from one node to another is as simple as the click of a button.
System Changes
System changes can also be easily handled with AAM's resource group capabilities. When hardware is replaced or upgraded, resource groups can be migrated to other nodes in the cluster while the changes are taking place, and moved back as soon as the work is complete. During application upgrades, migration is also an option, or users can choose to disable resource group monitoring and upgrade the software in place. With resource group monitoring disabled, the processes and services in the group can be stopped without causing a restart or migration. The affected process can be upgraded while the other applications on the node continue to run unaffected. Once the upgrade is complete, monitoring can be re-enabled, and AAM's failure protection immediately resumes monitoring for failure.
New Systems
Adding a new system to the cluster is a matter of installing the proper processes and services on the node, adding the node to the domain, and then adding the node to the resource group's preferred node list.
Rules
AAM's proactive and reactive management capabilities are implemented using rules. Rules provide far more flexibility and customization than resource groups. While resource groups provide resource monitoring and migration capabilities, rules can be created to perform much more complex procedures while still monitoring and managing resource objects in the domain just as resource groups do, as well as the resource groups themselves. Rules that utilize AAM's built-in library of rule API subroutines provide centralized administration, a consistent domain-wide view, and direct access to the replicated database. They can provide solutions to problems within your domain ranging from simple to complex. Because of their flexibility, rules can provide solutions across the full range of the continuum of service events.
Overview
With rules, you can move your solution beyond simple failover to specialized proactive and reactive responses to events within your domain. Rules are applicable over the entire domain. They are activated by a trigger, which can be specified to monitor sensor points from all nodes or a subset, or to fire at a set time. When the triggering condition is met, the rule executes. Rules are defined through the AAM Rule Editor and are written in Perl, an interpreted scripting language. Perl is portable, allowing the same rule text to execute in exactly the same manner on any platform. Rules are executed within the Events and Rules Engine, which provides visibility at runtime into the entire domain and can make decisions at the domain level. With rules, availability decisions no longer must be static or made in advance. Rules can sense and react dynamically to changes in the environment and make decisions based on up-to-the-second runtime information. Multiple triggers can be associated with a rule, and one trigger can activate multiple rules. Figure 1 on page 36 provides an overview of relationships within the Events and Rules Engine.
[Figure 1 illustrates the relationships within the Events and Rules Engine: user-defined sensors and AAM-provided sensors feed sensor-based triggers, while the time of day drives scheduled triggers.]
The result is a management tool, configurable from a central location, that watches the entire domain and responds accordingly.
Failure Protection
Because rules have visibility over the entire domain, they can respond to failures at both the node and application level. A typical rule could respond to either type of failure by migrating to another node. Rules can also perform actions such as sending e-mail notification to system administrators and writing messages to the AAM and Windows event logs. Through AAM's actuator capabilities, rules can interact with or notify third-party management facilities. In the event of a failure, the third party could be notified by a page or through other means.
Performance Degradation
Using AAM sensors, you can define a rule to execute based on CPU load or disk space. Using such conditions, a rule could be written to provide load balancing when CPU load reaches a certain level, or to clean up the disk when disk usage reaches a certain value.
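The disk-cleanup condition might look like the following sketch. The 90 percent threshold is an assumption for illustration, not an AAM default, and the action string stands in for invoking a cleanup actuator from Perl rule text.

```python
# Illustration of a sensor-based condition: fire a cleanup action when
# disk usage crosses a configured threshold.

def disk_rule(percent_used, threshold=90.0):
    """Return the action a disk-space rule would take for this reading."""
    if percent_used >= threshold:
        return "run cleanup actuator"
    return "no action"

print(disk_rule(94.5))  # run cleanup actuator
print(disk_rule(40.0))  # no action
```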
Business Changes
Rules can be created to respond to dynamic resource deployment. For example, if your business is involved in e-commerce, a rule can be written to monitor the number of users on your Web page, or the system's CPU load. When the load becomes too great, AAM can launch another web service on another node, relocate IP addresses, and respond to the additional traffic. Once levels return to normal, the second web server can be shut down, returning the system to its normal processing configuration.
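The scale-up, scale-down policy described above can be sketched with a simple hysteresis band so the second web server is not launched and retired on every small fluctuation. The user-count thresholds are invented for the example.

```python
# Sketch of the e-commerce scaling policy: start a second web service
# under heavy load, retire it when traffic drops, and leave things alone
# in between (hysteresis).

HIGH_WATER, LOW_WATER = 500, 200  # concurrent users (hypothetical)

def desired_instances(users, current):
    """Return how many web service instances should be running."""
    if users > HIGH_WATER:
        return 2          # launch the second server on another node
    if users < LOW_WATER:
        return 1          # back to the normal configuration
    return current        # inside the band: no change

print(desired_instances(750, current=1))  # 2
print(desired_instances(300, current=2))  # 2
print(desired_instances(150, current=2))  # 1
```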
Time of Day
Using AAM's scheduled triggers, you can run any rule once at a specific time, or on a recurring, scheduled basis. Once triggered, the rule can perform any scheduled operation, such as a nightly backup or weekly report generation.
For example, in an environment where AAM has been integrated with a data mirroring product such as Legato RepliStor, each night you can temporarily turn off mirroring of the source disk, start a backup procedure against the target disk, and then restart data replication updates when the backup is completed. Another application of scheduled rules could be shifting processing priority to the accounting and finance department on the 29th of the month for end-of-month processing.
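The nightly sequence above (pause mirroring, back up the target disk, resume replication) can be sketched as follows. The step functions are placeholders; a real rule would invoke RepliStor and backup actuators from Perl, and the try/finally shape here illustrates one way to make sure mirroring resumes even if the backup fails.

```python
# Hedged sketch of the nightly backup orchestration, with placeholder
# callables standing in for the real mirroring and backup actuators.

def nightly_backup(pause_mirroring, run_backup, resume_mirroring):
    """Run the three phases in order, resuming mirroring even on failure."""
    steps = []
    pause_mirroring()
    steps.append("mirroring paused")
    try:
        run_backup()
        steps.append("backup complete")
    finally:
        resume_mirroring()          # always restart replication updates
        steps.append("mirroring resumed")
    return steps

log = nightly_backup(lambda: None, lambda: None, lambda: None)
print(log)  # ['mirroring paused', 'backup complete', 'mirroring resumed']
```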
Application Migration and System Changes
Although resource groups are specially designed to respond to the migration and maintenance needs of your domain, rules can also be designed to perform the same tasks, or to manage the operation of a number of resource groups. For example, you could create several resource groups to perform special tasks, and create a rule to run concurrently to coordinate the use of the resource groups.
New Systems
Because rules are not node specific, they are immediately available to nodes that are added to the domain. As soon as the triggers are activated, the same management capabilities that are available to other nodes in the domain are available on the new node. Information about the node and its processes is available immediately.
Actuators are external function calls to an AAM-aware process that are defined and provided through the AAM SDK. Actuators can be invoked from rule text. The process that the actuator takes action upon need not be the same process whose sensor value fired the trigger. Both resource groups and rules use this system of triggers, sensors, and actuators. In rules, triggers, sensors, and actuators are visible to, and configurable by, the user; AAM automatically creates them when resource groups are used. This combination of sensors, actuators, and triggers provides a powerful tool kit for solving a wide range of availability and clustering problems.
Triggers
Triggers are the objects that drive rules. Triggers may fire based on a scheduled time or on a data value received from a sensor within a process. The firing of a trigger forces the evaluation of any rule associated with the trigger by the Events and Rules Engine. Note that multiple triggers can be associated with a single rule, allowing different conditions to trigger the same rule. A trigger can also be associated with more than one rule, allowing a single event to drive multiple rules. Triggers are defined using the Management Console and are stored in AAM's replicated database. When a rule is enabled, the associated triggers are activated and start monitoring the associated sensor. The trigger is then monitored and fired whenever the configured condition is met. There are four types of triggers:
• Sensor-based: The trigger fires when its sensor returns a data value matching the condition criterion. There are a number of condition types that can be set for sensor-based triggers. A single trigger can monitor no more than one sensor.
• Scheduled: Triggers fire at a specific time or on demand.
• On-Demand: Allows the user to run rules as needed. This can be useful during rule creation and testing to ensure that the rule is running as intended.
• Windows Event Log: Enables a trigger to fire when a particular Windows event occurs or when a series of events occur within a specific time interval.
Sensor-based Triggers
A sensor-based trigger fires when its associated sensor returns a data value that matches the condition criterion set for the trigger. There are a number of condition types that you can set for sensor-based triggers, based on the expected values from the sensor:
• Threshold: A value is entered, and the trigger fires when the sensor value goes over, under, or is equal to the specified value.
• Absolute Change: A value is entered, and the trigger fires when the sensor value changes by the absolute value of the amount indicated.
• Percent Change (Relative To Range): A percentage and a value range are entered. The trigger fires when the sensor value changes by the selected percentage of the entered range.
• Percent Change (Relative To Last Value): A percentage is entered, and the trigger fires when the sensor value changes by the selected percentage relative to the sensor's last reported value.
• State Sensor Comparison: The trigger fires when the process or node state is equal (or not equal) to the indicated state.
• State Sensor Change: The trigger fires when the process or node state changes in any way.
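The semantics of a few of the condition types listed above can be sketched as simple predicates. These are illustrations of the described behavior only, not AAM's internal implementation, and the "over" variant of the threshold condition is shown as one example among the over/under/equal options.

```python
# Illustrative evaluation of three sensor-based trigger conditions.

def threshold_fires(value, limit):
    """Threshold ("over" variant): fire when the value exceeds the limit."""
    return value > limit

def absolute_change_fires(last, value, delta):
    """Absolute Change: fire when the value moves by at least delta."""
    return abs(value - last) >= delta

def percent_change_fires(last, value, pct):
    """Percent Change (Relative To Last Value)."""
    return last != 0 and abs(value - last) / abs(last) * 100 >= pct

print(threshold_fires(95.0, 90.0))             # True
print(absolute_change_fires(50.0, 58.0, 5.0))  # True
print(percent_change_fires(50.0, 52.0, 10.0))  # False (only a 4% change)
```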
Scheduled Triggers
Scheduled triggers allow you to configure the trigger to fire at a specific time or time interval. You can also configure the trigger to fire on demand.
• Date: The trigger fires based on the time and date. Triggers can also be configured to fire at a regular interval.
• Day Of The Week: The trigger fires based on the time on specific days of the week.
• On-Demand: The trigger fires when invoked from a rule or through the AAM Management Console or command line interface.
Scheduled triggers are not connected to a sensor. At the configured time, the trigger fires. This allows AAM to manage scheduled events in the domain. On-demand triggers allow the user to run rules as needed, and can be useful during rule creation and testing to ensure that the rule is running as intended. On-demand triggers can be fired from within a rule, or from the Management Console.
On-Demand Triggers
On-demand triggers allow the user to run rules as needed. This can be useful during rule creation and testing to ensure that the rule is running as intended.
Sensors
A sensor is an AAM object that publishes values from a data point such as a counter or environmental state. AAM publishes state sensors for managed processes and managed nodes. Users can also create additional sensors to publish data from within AAM-aware processes using the AAM SDK API. AAM also provides several aware processes, called proxies, which provide sensors on behalf of nodes and processes. During operation, sensors provide data to sensor-based triggers. Sensor values can also be explicitly polled through the Management Console, the CLI, or a running rule. AAM supplies two special state sensors that publish values for every node and managed process in an AAM domain. Unlike user-defined sensors, these sensors are not published from within a process, but are published by the AAM agent. AAM provides sensors for monitoring node status, process status, and time of day. Platform-specific sensors are also provided.
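The publish-and-poll idea behind sensors can be sketched as a tiny class. The sensor name and state values below are invented for illustration; real sensors are created through the AAM SDK API, not this interface.

```python
# Minimal sketch of the sensor concept: a process publishes the latest
# value of one named data point, and a rule or console can poll it.

class Sensor:
    """Publishes the latest value of a single named data point."""

    def __init__(self, name, initial=None):
        self.name = name
        self.value = initial

    def publish(self, value):
        """Record a new reading (would notify attached triggers in AAM)."""
        self.value = value

    def poll(self):
        """Return the most recently published value."""
        return self.value

state = Sensor("node.gold.state", "ALIVE")   # hypothetical state sensor
state.publish("UNREACHABLE")
print(state.poll())  # UNREACHABLE
```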
User-defined sensors provide data from aware processes based on parameters that are defined and programmed by the user, allowing nearly limitless options for the types of information that can be gathered to trigger events.
Actuators
Actuators are function calls provided by an AAM-aware application that can be invoked from rules. When the rule text invokes the actuator, the actuator's associated function is executed within the aware application. The actuator can be called either for a specific application or for a number of processes within a class. Further, the application that the actuator takes action upon need not be the same process whose sensor fired the trigger.
Proxies
Proxies are AAM-aware processes that publish sensor values and actuators on behalf of another process. AAM provides two proxy processes for each platform, a node proxy and a process proxy, which are used to publish information about nodes and processes respectively. Generalized platform- and operating-system-specific sensors and actuators are provided. With proxies, developers can define sensors to monitor the detailed behavior of an application, allowing monitoring of just about any data point for any user-defined condition. Actuators can then initiate actions within the application based on these triggered conditions. Proxies essentially provide the functionality of an AAM-aware application in any off-the-shelf product. Proxies are a powerful tool for total system integration and management through AAM.
This chapter describes how a company goes about implementing the Legato Automated Availability Manager (AAM) for business-critical applications. The case study describes what actions the company administrator wants to take, the initial and acquired hardware resources, and the configuration of AAM for the company's environment.
High Availability
Marvellus Corporation, a software reseller, has long offered its products through its own product catalog. Now, the company wants to enter the Internet commerce market and will be using a database to record its transactions. Since Marvellus is a PC-based company, it chooses Microsoft Internet Information Server (IIS) for its online needs. IIS provides the World Wide Web Publishing Service that Marvellus wants. Marvellus also uses Microsoft SQL Server for its database.
Traditionally, all of these pieces run on a single machine. Marvellus buys a machine to use as a server, installs the two services it will be using, configures the IP address, and creates its Web pages. The company implements the single node configuration shown in Figure 2 on page 44, which has no backup and no failover scheme. Figure 2. Initial Hardware Configuration
[Figure 2 shows the node gold running the Web and SQL Server applications, with data on local disk D: and live client connections through the static IP address 1.2.3.4 over the communication network.]
The figure shows the single node configuration. Marvellus's system administrator, Jack, installs and runs the Web and SQL services on the node called gold. Data, including Web pages and related CGI scripts for the Web service and data files for the database, is stored on the local disk drive D:. Internet communication for the Web service takes place through the network IP address 1.2.3.4. Soon Marvellus's Web site generates 40 percent of their sales, with many of the orders being placed during the evening and over the weekend. Operators process the weekend Web-based orders during normal work hours on Monday. Late one Friday night, the main power supply to gold is accidentally unplugged, and the server goes down. No one discovers the failure until Monday morning. The company estimates that they lost 20 percent of their potential revenue for the week because of this incident and the lack of a failover solution. The most serious drawback of the single node configuration is that when any sort of failure occurs, the application is unavailable until the machine and applications are brought back online. Most often, the applications are brought back through manual procedures and after significant down time. The single node configuration is vulnerable to various failure scenarios, including node failure,
application failure, and disk failure. Failure of a disk not only stops the application but also prolongs downtime, since a system administrator must retrieve data from a backup. For Marvellus, the failure of one component results in a total node failure. The entire Web operation is unavailable until someone discovers the failure and brings the node back online manually.
Hardware Configuration
Marvellus decides to create a two-node cluster. Along with gold, the machine used as the Web server, Marvellus adds a machine called silver. The Marvellus Order Database will run on silver. Because each node runs only a single application, the CPU load on each should be low, making each a good failover location should the other fail. To increase availability, for a small additional expense Jack adds an Uninterruptible Power Supply (UPS) and one additional network interface card (NIC) per node. The extra NIC provides a redundant network in case one network fails. Now each machine has two static IP addresses, allowing two communication paths between the nodes. Web files are replicated on each machine's local disk, and the Web service uses SQL commands to access the data from the database. Finally, Jack moves the database files to an external shared disk connected to both nodes and managed by Automated Availability Manager as a shared disk data source. To avoid a single point of failure, the data on the disk may be replicated using a RAID array or a mirroring product such as Legato Replication. Figure 3 on page 46 shows the hardware and high-level AAM configuration the company has set up to keep its services highly available.
Figure 3. High Availability Solution Hardware Configuration
In this configuration, Jack connects each node to both communication networks so that each node has two static IP addresses, one each on the 1.2.3 and 1.2.4 subnets. In addition, he has configured the nodes to accept two managed AAM IP addresses, 1.2.3.200 and 1.2.3.210. Managed IP addresses are standard IP addresses, but they can be moved from node to node without rebooting. Since SQL Server uses a NetBIOS node name for default communication, Jack creates a movable node alias, Mercury, for SQL communication. Like a managed IP address, the node alias can be moved from node to node. Jack then installs the Web and SQL services on each node. Finally, he moves the database files from the local disks on each node to a shared disk, managed by AAM using a data source. The data source manages disk access for both nodes and ensures that only one node can access the shared disk at a time. The AAM resource group ensures the proper data source connection follows the application as it moves from node to node. The Web service runs initially on gold and the SQL service on silver. The AAM agent, configured to monitor the services and nodes for failure, actively runs on both nodes. Web browser connections from clients communicate with the Web service through the managed IP address, not through either machine's standard IP address. In this case, the managed IP address is 1.2.3.200. Marvellus has published 1.2.3.200 through DNS as www.marvellus.com, and this is the address that Web browsers use for access. SQL Server has been configured to communicate using the node name Mercury, which has been defined as a movable node alias.
Resource Groups
The easiest way to configure application migration is through AAM's resource groups. Define resource groups using the Resource Group Editor, where you specify all of the objects to manage and the preferred nodes on which the resource group runs. Once you bring the resource group online, AAM manages the defined objects, providing high availability. Since Marvellus wants to manage two separate services that typically run on separate nodes (the SQL service on silver and the Web service on gold), the availability solution requires two resource groups.
In preparation, Jack installs the Web and SQL services on each machine. Although AAM manages the cluster so that only one instance of each service is active at a time, the services must be installed on each node so that they can be activated if AAM migrates the resource group to that node. Once he completes the installation, he can begin creating resource groups. Each resource group manages its own service. The Web resource group uses a managed IP address, while the SQL resource group uses a managed IP address, a node alias, and a shared disk data source. Once Jack defines the two resource groups and brings them online, AAM begins managing the services. If node gold, running the Web service, fails, AAM recognizes the failure and moves the Web service and the managed IP address 1.2.3.200 to silver. The SQL service and related objects are configured to run on silver with failover to gold: if node silver fails, AAM moves the SQL service, the managed IP address 1.2.3.210, and the data source connection to gold.
If AAM encounters a problem, for example the service does not start or the managed IP address cannot be assigned, the shutdown sequence executes on the node. The resource group then attempts to execute on the next node in the list, in this example silver. If AAM encounters more problems, it attempts to start up on each successive node in the Preferred Node List until the resource group is online. When it reaches the end of the list, it pauses and returns to the first node on the list.
In the Event of Failure
Assuming a normal startup, the resource group enters the Online state and remains running on the most preferred node until some type of failure occurs. If one of the services fails, AAM attempts to restart it on the same node, gold. If that fails, the resource group executes the shutdown sequence and then executes the startup sequence on silver. If the node itself fails, AAM recognizes the failure and executes the startup sequence on silver. If at any point gold comes back online, the objects move back to gold because Auto-Failback is selected.
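The iteration over the Preferred Node List described above can be sketched as follows. This is an illustrative model only, not AAM code: try_startup stands in for AAM's internal startup sequence, and the pause and retry limits are arbitrary assumptions.

```python
import time

def bring_online(preferred_nodes, try_startup, pause=0, max_cycles=3):
    """Attempt the startup sequence on each node in the Preferred Node
    List in order. On reaching the end of the list, pause, then return
    to the first node. Returns the node where the resource group came
    online, or None if every node failed for max_cycles passes."""
    for cycle in range(max_cycles):
        for node in preferred_nodes:
            if try_startup(node):       # startup sequence succeeded here
                return node             # resource group is now Online
        time.sleep(pause)               # end of list: pause, then retry
    return None

# Example: gold fails to start the service, so the group comes up on silver.
up = {"gold": False, "silver": True}
assert bring_online(["gold", "silver"], lambda n: up[n]) == "silver"
```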
Once the startup sequence completes successfully, the resource group enters the Online state on the failover node, which then runs the Web service and responds to IP address 1.2.3.200.
The SQL Resource Group
The second resource group monitors the SQL Server service and is very similar to the Web service resource group. It manages the SQL service, a managed IP address, a shared disk data source, and a node alias.
Bringing the Resource Group Online
For this resource group, silver is the first node in the Preferred Node List. On startup, AAM attempts to execute the startup sequence on silver, and then on gold if problems are encountered. AAM repeats the process until the resource group enters the Online state on a node.
In the Event of Failure
Like its Web counterpart, once in the Online state this resource group remains running on the preferred node, silver in this case, until some type of failure occurs. Upon failure, the resource group moves to gold. Since the Auto-Failback option is selected, the resource group returns to silver once silver comes back online.
This task can be performed using a combination of resource groups and AAM's rule capabilities. Rules can be reactive, responding to an event within the domain, or proactive, performing an action before a serious condition occurs. AAM rules are implemented as Perl scripts; AAM provides an extended version of the Perl interpreter whose API extensions give direct access to AAM's management infrastructure and agents. Since Perl is an interpreted rather than a compiled language, changes to rules can be deployed without recompiling. Also, since Perl is platform independent, scripts written for AAM on one platform will run on another without modification. AAM also provides an extensive set of API calls for rules that allow user interaction with the AAM domain. You use the Rule Editor to create and edit rules. AAM uses triggers to initiate the evaluation of rules. Scheduled triggers allow you to configure events to take place once, at a repeating interval, at a time of day, or on a day of the month. Sensor-based triggers get data values from sensors, which monitor data values from an application. When a trigger fires, that is, initiates an event based on a change in its associated sensor value or its scheduled time, the AAM agent evaluates the text of every rule in the domain associated with that trigger. More than one rule can be associated with a trigger. Rules are generally written to be stateless and conditional, with the rule reacting to the state and value of the received trigger. The Perl code, for example, could start with an if statement that causes action to be taken only when a threshold trigger is in the ON state, ignoring triggers that indicate the OFF state.
Hardware Configuration
To provide this level of load balancing, Jack must reconfigure the connections within the AAM domain to include the node platinum. The SQL server on silver is still a member of the domain, with failover of its resource group still taking place to gold. To provide the sensor data needed to trigger the additional Web server, the Web resource group must be modified to include the proxy process called processProxy, which obtains sensor values from applications that are not AAM-aware. A proxy process is an application that publishes sensors and actuators on behalf of an unaware process. The processProxy and nodeProxy proxy processes are supplied with AAM, and other proxy processes can be programmed using the AAM SDK. The updated domain configuration is shown in Figure 4 on page 52.
The figure details the configuration of nodes gold and platinum. Node silver's configuration remains unchanged, as shown in Figure 3 on page 46. In the figure, gold is still the primary Web server, with failover backup on silver. Node platinum is added to the AAM domain to be used as a backup node when the load on the Web server node becomes too high. Marvellus is now using a round-robin DNS server, which splits the Web server requests between IP addresses 1.2.3.200 and 1.2.3.205. The Web pages on both gold and platinum access the SQL database on silver.
Resource Groups
The resource groups used in the first solution are still valid, with the addition of the processProxy process for the Web service to monitor the CPU load of the node where the resource group is running. The Web service on platinum must also be configured within AAM to auto restart so that it is always running.
Rules
The rule is triggered by a sensor that tracks the CPU load on the node where the main Web service is running. Normally, if gold is running, the Web service runs there; if gold is not available, the Web service runs on silver. The rule must be written to take into account the possibility of the Web service running on either node; however, for simplicity, assume that gold is the location of the Web service. Rule creation consists of defining triggers and creating Perl scripts that AAM evaluates when a trigger fires. In this case, the Web service runs actively on gold and platinum at all times. When the CPU load on node gold reaches 70 percent, the managed IP address 1.2.3.205 moves to node platinum. When the CPU load on gold drops below 20 percent, the managed IP address moves back. See Figure 5 on page 53.
Figure 5. Migration Graph
The figure maps the CPU load on gold to migration behavior: from 0 to 20 percent, the IP address returns to gold; between 20 and 70 percent, no IP migration takes place; from 70 to 100 percent, the IP address migrates to platinum.
Creating Triggers
To get the value of the CPU load, Jack uses the processProxy process provided with AAM. One of the sensors in that process provides the CPU load of a node, and triggers can be defined using this sensor to fire the rule. Using the AAM SDK, other proxies could be programmed to get the average number of hits per hour or the number of Web connections. Jack needs to create two triggers, both checking the value of the CPU load sensor. He sets the threshold of the first trigger, LoadHigh, to >= 70. During operation, whenever the CPU load of node gold is greater than or equal to 70 percent, the managed IP address 1.2.3.205 migrates to node platinum. He sets the threshold of the second trigger, LoadLow, to < 20. When the CPU load on gold drops below 20 percent, the managed IP address migrates back to that node. Jack configures the triggers to check the sensor value only every 60 seconds. Because the triggers are Threshold triggers, they fire whenever the threshold is crossed. Each time a trigger fires, any rule text associated with that trigger is evaluated. The state of the LoadHigh trigger is ON when the sensor value is
greater than or equal to 70 percent and OFF when it is less than 70 percent. The state of the LoadLow trigger is ON when the sensor value is less than 20 percent and OFF when it is 20 percent or greater. Assume the LoadHigh trigger checks the sensor value five times. The values, in percent, are 60, 72, 85, 10, and 50. The first value of 60 percent fires the trigger, providing an initial value. Since 60 percent is below the threshold, the trigger state is OFF. The value 72 percent crosses the threshold of 70 percent, so the trigger fires and the trigger state changes to ON. The next value of 85 percent does not cause the trigger to fire, since the condition >= 70 is still true; the state is still ON. The fourth value, 10 percent, causes the trigger to fire, since the value dropped below the threshold, and the trigger state toggles to OFF. The final value of 50 percent does not cause the trigger to fire, and the state remains OFF.
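The firing behavior just described can be modeled with a short script. This is an illustration of the Threshold trigger concept, not AAM code:

```python
def threshold_trigger(values, threshold):
    """Simulate a Threshold trigger: it fires on the first sample (to
    establish an initial state) and afterwards only when a sample
    crosses the threshold, toggling the state between ON and OFF."""
    state = None
    events = []                       # one (value, fired, state) per sample
    for v in values:
        new_state = "ON" if v >= threshold else "OFF"
        fired = state is None or new_state != state
        state = new_state
        events.append((v, fired, state))
    return events

# The LoadHigh (>= 70) example from the text: 60, 72, 85, 10, and 50.
events = threshold_trigger([60, 72, 85, 10, 50], 70)
# Fires at 60 (initial, OFF), 72 (ON), and 10 (OFF); 85 and 50 do not fire.
assert [fired for (_, fired, _) in events] == [True, True, False, True, False]
assert [state for (_, _, state) in events] == ["OFF", "ON", "ON", "OFF", "OFF"]
```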
Creating Rules
When the triggers fire, the rule evaluates the states of the two nodes and performs the following tasks: If the initial CPU load of the node where the Web resource group is running, most likely gold, is less than 70 percent, the managed IP address 1.2.3.205 is also assigned to that node. Later, if the CPU load on that node is greater than 20 percent and less than 70 percent, the managed IP address remains on the current node. If the CPU load of the resource group's node is 70 percent or greater, managed IP address 1.2.3.205 migrates to node platinum.
When the rule runs for the first time, AAM checks to see whether the managed IP address 1.2.3.205 is assigned to a node. If it is not assigned, AAM assigns it to node platinum, assuming that during operation it will move to the proper location. Only if the domain has just been brought up should the IP address be unassigned. The rule text uses a conditional statement to check the state of the triggers. When the CPU load of node gold is 70 percent or greater, the LoadHigh trigger state is ON. If the LoadHigh trigger state is ON, the rule executes the portion of the conditional dealing with that condition: if platinum is running, AAM migrates the managed IP address 1.2.3.205 to that node. When the CPU load of the node running the Web resource group is less than 20 percent, the LoadLow trigger state is ON. If that trigger state is ON, the rule executes the portion of the conditional dealing with this condition and migrates the managed IP address 1.2.3.205 from platinum back to the resource group node.
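Although AAM rules are actually written in Perl against the AAM rule API, the conditional logic just described can be modeled in Python for illustration. The names evaluate_rule and migrate_ip are hypothetical stand-ins, not AAM functions:

```python
def evaluate_rule(trigger, trigger_state, ip_assigned_to, rg_node,
                  platinum_running, migrate_ip):
    """Illustrative model of the rule text: react to LoadHigh/LoadLow
    trigger events by moving the managed IP address 1.2.3.205 between
    the resource group's node and platinum. migrate_ip is a stand-in
    for the API call that moves a managed IP address; it returns the
    node now holding the address."""
    if ip_assigned_to is None:
        # First run after the domain comes up: assign the IP somewhere.
        return migrate_ip("platinum")
    if trigger == "LoadHigh" and trigger_state == "ON" and platinum_running:
        return migrate_ip("platinum")   # offload: move the IP to platinum
    if trigger == "LoadLow" and trigger_state == "ON":
        return migrate_ip(rg_node)      # load is low again: move it back
    return ip_assigned_to               # between thresholds: no change

moves = []
migrate = lambda node: (moves.append(node), node)[1]
assert evaluate_rule("LoadHigh", "ON", "gold", "gold", True, migrate) == "platinum"
assert evaluate_rule("LoadLow", "ON", "platinum", "gold", True, migrate) == "gold"
```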
If the CPU value is between 20 percent and 70 percent, the state of both triggers is OFF. In this case, unless a failure occurs, the managed IP address remains on its current node. Marvellus has now created a load balancing solution for its Web server. This solution could be expanded to include three or more nodes using the same principles.
Legato Automated Availability Manager (AAM) uses a number of components during operation. Previous chapters have touched on some of these components. This chapter describes the following components in greater detail:
"AAM Agent Architecture" on page 57
"Network Architecture" on page 62
"Data Source Architecture" on page 63
"Rule and Monitoring Architecture" on page 68
"Resource Group Architecture" on page 72
Agents
Each node has an AAM agent installed. The agent provides the monitoring and management capabilities within the node. AAM supports two types of agents, primary and secondary; configuration determines the agent type for each node. Primary agents provide application management and monitoring, and maintain AAM's replicated databases of configuration and state information. Primary agents also log events, evaluate rules, and service all AAM client requests. For correct operation, there must be at least one primary agent
running in the domain. Primary agents provide high availability of AAM itself: with multiple primaries, one can fail and management within the domain continues. Secondary agents only perform monitoring and management operations on behalf of, and are dependent on, primary agents. They perform no database replication or rule interpretation. Because secondary agents require fewer system resources than primary agents, they allow an AAM domain to scale up to 100 nodes. If necessary, you can promote and demote agents to maintain the processing balance in the domain.
Agent Components
Four components make up a primary agent: the agent process, the process monitor, the replicated database, and the rule interpreter. Secondary agents run only the agent process and process monitor. The following sections describe each agent component, and Figure 6 on page 58 illustrates their interaction.
Agent Process
The agent process coordinates AAM activities on the node and manages the activities that occur in the domain, either in response to a rule or to user interaction. These activities include starting and stopping processes, attaching and detaching data sources, and assigning or unassigning node aliases and managed IP addresses. The agent processes provide self-checking fault tolerance to ensure high availability within AAM itself: if any of the agent processes fail, the others automatically restart the failed process within seconds. When changes occur in the database, such as the definition of a resource group, this information is immediately, or synchronously, reflected in all instances of the replicated database. If a primary agent is not running at the time of the database change, it is automatically updated as soon as it starts up again. Agent processes are also responsible for communicating with aware processes for sensor and trigger data and for invoking actuators. If an aware process tries to communicate trigger or event information while the AAM agent is down, the AAM SDK stores up to 200 events until the agent restarts.
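The SDK-side event buffering can be pictured as a small bounded queue. This is only an illustration; whether AAM discards the oldest or the newest events once 200 are stored is not specified here, so dropping the oldest is an assumption of this sketch:

```python
from collections import deque

class EventBuffer:
    """Sketch of the buffering described above: while the agent is down,
    up to a fixed number of trigger/event messages are held; the oldest
    events are dropped once the buffer is full (an assumption)."""
    def __init__(self, capacity=200):
        self.events = deque(maxlen=capacity)   # oldest dropped when full

    def record(self, event):
        self.events.append(event)

    def flush(self, send):
        while self.events:                     # agent is back: replay all
            send(self.events.popleft())

buf = EventBuffer(capacity=3)
for e in ["e1", "e2", "e3", "e4"]:
    buf.record(e)
sent = []
buf.flush(sent.append)
assert sent == ["e2", "e3", "e4"]              # e1 was dropped at capacity
```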
Replicated Database
Each primary agent maintains a copy of AAM's replicated database, which stores information about the AAM domain. All object definitions and states are stored in the database, and whenever information changes in the domain, the AAM agent recognizes the changes and sends the information to each database copy. As a result, the database replicas always remain synchronized.
Process Monitor
The AAM Process Monitor is responsible for monitoring the state of all nodes, as well as the managed processes and services on its node. It is also responsible for sending out heartbeats to the other agents in the domain to provide node-level failure detection. In the event of a node or process failure, the process monitor reports the failure to the AAM agent process, which then manages the recovery and passes the information to any rules or resource groups that must execute because of the failure.
Rule Interpreter
The Rule Interpreter runs only on primary agent nodes. AAM's rule interpreter offers coherent, consistent, predictable rule behavior: when a rule executes, it executes in exactly the same way every time, performing the steps in the same order. Each rule is assigned a rule interpreter at the time the rule is enabled; AAM chooses the rule interpreter randomly from the active rule interpreters at that time. The rule remains on that rule interpreter until the interpreter fails, the node fails, or the rule is disabled. If the rule interpreter or node fails, all of the rules assigned to that rule interpreter are automatically reassigned to a rule interpreter on a surviving node. After they are reassigned, all rules are evaluated. The rule interpreter has direct access to the AAM agent and its replicated database, and it is this access that allows rules to perform almost any management operation automatically. For more information about AAM's rule architecture, see "Rule and Monitoring Architecture" on page 68.
Primary Agents
A node running a primary agent comprises all four agent components. It manages services and processes, executes rules, manages resource groups, and monitors the state of other nodes and of the processes running on its own node. It also maintains a copy of the AAM replicated database. Primary agents communicate with all other agents in the domain, ensuring a consistent view of the domain. Events from secondary agents that require action are handled by primary agents, and any object or state information is maintained in the AAM replicated database.
Primary agents cannot be deleted directly. However, they can be demoted to secondary agents and then deleted.
Secondary Agents
A secondary agent provides management and monitoring for all of the managed resources on its node, but does not execute rules or maintain a copy of the replicated database; the primary agents handle rule execution and database updates. Secondary agents connect over the network to an AAM backbone process running on a primary agent. If the connected primary agent fails, the secondary agent shuts down momentarily and restarts after connecting to another primary agent in the domain. Secondary agents can be promoted to primary agents.
Network Architecture
The nodes in an AAM domain use a communication network to communicate with each other. Each participating AAM node connects to at least one network. The nodes in the domain may be attached to a single network subnet, or spread over multiple subnets using a router. The architecture fully supports redundant networks and provides automatic failover between networks.
Network Configuration
Each node in the domain requires at least one network connection, over which it sends internal messages. AAM supports multiple networks but does not require them. AAM can be configured to share the same network that the managed applications use, or to use one or more private networks for domain communication. The network configuration for your domain may be on one network subnet, or span multiple subnets with communication through a router. If the router supports multicast broadcasts, AAM communicates normally without modification. If your router does not support multicasts, your domain can still span multiple subnets by using AAM's point-to-point communication option. However, when using managed IP addresses along with managed processes, the managed IP address must be in the same subnet as the static IP address defined for the Network Interface Card (NIC). For example, if the defined managed IP address is 1.2.3.250, the address works correctly on the five nodes that are part of the 1.2.3 subnet, but cannot interface with the five nodes in the 1.2.4 subnet. If a split configuration is necessary, you must ensure that the 1.2.3.250 managed IP address is configured only on nodes in the 1.2.3 subnet.
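The subnet constraint can be checked mechanically. The following sketch, assuming a /24 network mask for the 1.2.3.x example above, shows the kind of test a configuration script might apply; it is not part of AAM:

```python
import ipaddress

def same_subnet(managed_ip, nic_ip, prefix=24):
    """Check that a managed IP address falls in the same subnet as the
    static IP address defined for a node's NIC (assuming a /24 mask)."""
    net = ipaddress.ip_network(f"{nic_ip}/{prefix}", strict=False)
    return ipaddress.ip_address(managed_ip) in net

# 1.2.3.250 works on nodes in the 1.2.3 subnet, but not in the 1.2.4 subnet.
assert same_subnet("1.2.3.250", "1.2.3.4") is True
assert same_subnet("1.2.3.250", "1.2.4.5") is False
```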
Network Redundancy
Although redundant networks are not required, Legato recommends that you configure each cluster to have at least two private networks to eliminate the network as a single point of failure. If a network failure occurs in a cluster with multiple cluster networks, AAM automatically fails over to the working network. However, if managed applications are also using the cluster communication networks, it is the responsibility of the application to move to another network. Application communication may cease if the network is unavailable. Client applications using network communications may not respond until the first network returns.
Redundant networks may also be necessary to provide adequate protection when using shared disks. If only one network is used and a network partition occurs between the machines accessing the disk, communication is lost and there are no checks to ensure that only one machine accesses the disk at one time. Disk corruption may occur due to concurrent writes. The addition of redundant networks greatly reduces the risk of this scenario; if one network fails, AAM communication within the domain continues, and AAM can continue to manage the disk properly.
Failure Detection
Agents communicate with each other using a heartbeat mechanism over the domain communication networks. On primary agents, the process monitor both sends and receives heartbeats; on secondary agents, the process monitor only sends heartbeats. Once a primary agent receives a heartbeat from a node, it expects the heartbeats from that node to continue. If an agent no longer receives heartbeats from the source node, it assumes that the remote agent has failed. The agent then pings the source node. If the ping succeeds, the agent assumes that only the agent on the source node has failed. If the ping fails, the agent assumes that the entire node has failed. For added reliability, AAM can be configured to ping the node over multiple network interfaces.
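The detection sequence can be summarized as a simple decision model. This is illustrative only; the timeout value and the ping callback are placeholders, not AAM parameters:

```python
def diagnose(last_heartbeat, now, timeout, ping):
    """Illustrative model of the failure detection described above: if
    heartbeats from a node stop arriving, ping the node to decide
    whether only the agent failed (node still reachable) or the whole
    node failed (no answer)."""
    if now - last_heartbeat <= timeout:
        return "healthy"
    if ping():                  # node answers: only the agent process died
        return "agent-failed"
    return "node-failed"        # no answer: assume the entire node is down

assert diagnose(100, 105, 10, lambda: True) == "healthy"
assert diagnose(100, 120, 10, lambda: True) == "agent-failed"
assert diagnose(100, 120, 10, lambda: False) == "node-failed"
```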
Data Source Architecture
AAM supports the following data source types:
AIX_LVM: AAM manages attaches using AIX Logical Volume Manager.
HP_LVM: AAM manages attaches using HP Logical Volume Manager for HP-UX.
Shared_Disk: AAM manages the attaches to a shared disk from two or more nodes in the domain.
Network_Share: AAM manages attaches from nodes within the AAM domain to a Windows share. The Windows share may be on a machine outside the AAM domain.
Legato Mirroring for Windows 2000: The Mirroring for Windows 2000 Servers data source provides built-in volume-level mirroring for Windows 2000 servers.
NT_VxVM: AAM manages attaches using Veritas Volume Manager.
RepliStor File System: This data source allows access to RepliStor mirrored applications.
SUN_SDS: AAM manages attaches using Solaris Solstice Disk Suite.
SUN_VxVM: AAM manages attaches using Veritas Volume Manager for Solaris.
UX_File_System: AAM manages attaches using a UNIX file system, either UFS, VxFS, or NFS.
EMC SRDF: AAM manages attaches using EMC SRDF.
EMC PowerPath Volume Manager: AAM manages attaches using EMC PowerPath Volume Manager.
Because a Windows share does not require a physical connection to the node, it is much more scalable than a shared disk, and a connection can be made from any node in the domain. You can limit access by determining the number of connections that can be made from nodes in the domain. AAM cannot control access to the share from machines outside of the domain.
EMC SRDF
The EMC SRDF data source provides application availability to critical data, even in the event of node or data source failure. A typical configuration is an AAM domain of at least two nodes connected to a pair of EMC Symmetrix data arrays.
The EMC SRDF software controls access within the Symmetrix data array, providing access to one or more devices managed as a device group. The device groups are configured into two sides: the R1 (primary) side and the R2 (secondary) side. When a machine is connected to a pair of Symmetrix data arrays, the side that is its primary connection is always referred to as the R1. Normally, data is mirrored dynamically from R1 to R2. If R1 becomes inaccessible, a failover takes place and the data is then accessible from R2, the secondary connection. When a host is connected to R2, no mirroring takes place, although the hosts have read and write access on R2. After a failover, data can be resynchronized by an R1 Update operation, which is initiated manually, or by a failback, in which the host is reconnected to R1. In the case of a failback, the synchronization of data takes place automatically. The EMC SRDF data source controls data access from the node to the Symmetrix. This data source can be included in a resource group to provide high availability in the event that the data source or other resources fail. The resource group controls node access to the Symmetrix. The data source uses a predefined device group name to determine which Symmetrix and corresponding devices it should use for connection.
When a rule is enabled, AAM chooses a primary agent on which to run the rule. This can be any primary agent; as a result, rules must be node independent. Once AAM determines which agent is responsible for the rule, the rule is loaded into the rule interpreter on that node and can then run whenever any of its associated triggers fire. If the rule interpreter or node fails, the rule is immediately and automatically relocated to another rule interpreter.
Sensor-based triggers fire when their sensors return data values matching the condition criterion; a number of condition types can be set for sensor-based triggers. Scheduled triggers fire at a specific time or on demand.
Scheduled Triggers
Scheduled triggers are not connected to a sensor but rather are configured to fire at a pre-scheduled time. Once a scheduled trigger is configured, the agent tracks its schedule, and at the configured time the trigger fires. A special case of the scheduled trigger is the on-demand trigger. On-demand triggers allow the user to trigger rules to run as needed and can be fired from within the rule itself or from the Management Console. When the user or a rule invokes the trigger, the agent receives the message and passes it on to any rules driven by the trigger.
Actuators
Actuators are functions provided by an AAM-aware process which can be invoked from rule text. When the rule interpreter evaluates the rule text and executes the actuator function, the message is sent to the agent, which in turn communicates with the appropriate aware process to invoke the function. Actuators allow users to expand the functionality available to AAM to manage the environment.
Using the AAM SDK, users can program sensors and actuators. Some actuators are included with the AAM proxy processes.
Proxies
Proxies are AAM-aware processes that publish sensor values on behalf of another process. Actuators that act back on the application through a proxy are also provided, bringing aware-process functionality to an unaware application. Communication to and from the proxy process takes place in the same manner as normal sensor, trigger, and actuator operation. The proxy process gathers sensor values from the unaware application and reports them to the agent. When the agent fires an actuator for the process proxy, the proxy runs the function, which affects the unaware process. The agent communicates with the proxy process through the AAM SDK library, which gets sensor data and invokes functions through the managed application's API or CLI. Figure 9 on page 72 illustrates the relationship of the proxy to the AAM agent and the managed application.

Figure 9. Proxy Process Architecture
[Figure: the AAM agent communicates through the AAM SDK library with the proxy process, which drives the shrink-wrapped managed application through the application's API/CLI.]
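The proxy relationship shown in Figure 9 can be sketched as follows. This is a conceptual illustration, not the AAM SDK library: the class and method names are invented, and the unaware application is reduced to a trivial stand-in.

```python
class UnawareApp:
    """Stand-in for shrink-wrapped software that only offers its own API/CLI."""

    def __init__(self):
        self.running = True

    def status(self):
        return "up" if self.running else "down"

    def restart(self):
        self.running = True


class Proxy:
    """Sketch of a proxy process: it publishes sensor values gathered from
    the unaware application and exposes actuators that act on the
    application through its native API."""

    def __init__(self, app):
        self.app = app

    def sensor_app_status(self):
        # sensor side: gather a value from the unaware app, report it upward
        return self.app.status()

    def actuator_restart(self):
        # actuator side: fired by the agent, affects the unaware process
        self.app.restart()
        return self.app.status()


app = UnawareApp()
proxy = Proxy(app)

app.running = False                  # simulate an application failure
before = proxy.sensor_app_status()   # proxy reports "down" to the agent
after = proxy.actuator_restart()     # agent fires the actuator; app is "up" again
```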
Resource groups use the same sensor and trigger system that rules use. Triggers and sensors for resource groups are created automatically when the resource group is saved, so the user does not have to configure them.
Glossary
This glossary contains terms and definitions found in this manual. Most of the terms are specific to Legato Automated Availability Manager (AAM).

Actuator

An external function call in an AAM-aware process, defined and provided through the AAM SDK. Actuators can be invoked from a rule or the Management Console. The process that the actuator takes action upon need not be the same process whose sensor value fired the trigger.

Attributes

Name/value labels that can be attached to the domain, resource groups, nodes, or processes. Attributes can be used by rules to organize objects into subgroups, or to attach user-defined information to objects for use by rules.

Auto-failback

Selecting this option in a resource group causes the resource group to be taken offline, then brought online on the failback node, if it is hosted on a node other than the failback node when the failback node comes online.

Aware Process

A process defined to AAM that has bound the AAM programming library to publish sensors or actuators.

Backbone

The processes running on AAM primary agents that provide messaging services. The backbone includes the process monitor, rule interpreter, and the AAM replicated database.
Bring Online

Initiates an attempt to execute the resource group's startup sequence. When the startup sequence completes successfully, the resource group is in the Online state.

Command Line Interface

A program that provides access to the capabilities of the Management Console, including additional capabilities for debugging and batch processing. Also referred to as the CLI.

Data Source

A resource that provides data content to an application. With AAM, data sources can be configured to detach from a failed node and attach to a node that takes over the application associated with the data source.

Disable Monitoring

Turns off resource group monitoring. When monitoring is disabled, AAM does not restart resource objects if they go offline. This mode is useful for manual intervention and maintenance of resource objects without the need to relocate the entire resource group.

Disable Rule

Turns off a rule and its triggers.

Domain

A group of nodes that participate in a particular management scope. A node can belong to more than one domain. All nodes on a network with an AAM agent installed are known collectively as the domain. The AAM domain should not be confused with the Windows networking domain.

Inter-agent messages form the network AAM agents use to communicate with each other. AAM clusters can be configured to use an optional independent network or to use the existing service network.
Enable Monitoring

Turns on resource group monitoring. When monitoring is enabled, AAM attempts to keep the resource group and all of its resource objects online in the event an object goes offline.

The AAM monitoring model includes the following components: sensors to monitor the resources in the environment, triggers to initiate a response to received sensor data, and actuators to cause the appropriate actions to occur.

Existence Test

A test to check for the physical existence of one or more managed processes. AAM provides the option of user-defined existence monitors for processes, which can provide a more detailed and accurate status of the process.

Failure Detection

Settings for the heartbeat interval and the multicast or point-to-point addresses used by the AAM failure detection mechanism.

Fire

A trigger fires when its monitoring condition changes. The execution of a trigger is often referred to as firing.

Heartbeat

A periodic message sent from agent to agent for managing node state; also, a message sent from an aware process to the AAM agent to indicate its responsiveness. Heartbeats are used to determine node failure and isolation.

I/O Redirection

A system that allows input and output of a managed process, which would normally go to the file handles STDIN, STDOUT, and STDERR, to be redirected to an alternate destination. Because AAM usually runs managed processes as background processes, the information sent to these locations may be lost without redirection.
Isolation Detection

A mechanism that allows an AAM node to determine whether it has become isolated on the network. Its primary purpose is to protect against a situation in which a node cannot communicate with any resource. Isolation detection works in conjunction with the Minimum Detection Time set in the Domain Settings.

Isolation Script

The script that is invoked on an AAM agent when the node detects that it is isolated. This script includes the instructions that the node follows to ensure that the resource objects, resource groups, and rules on both the isolated node and on other nodes in the domain respond appropriately to the isolation.

MAC Address

Media Access Control address. A 48-bit address assigned by the interface card manufacturer. Most manufacturers specify a unique MAC address for each card. When managed IP addresses are moved among interface cards, MAC addresses can be modified by AAM if the model of network interface card allows it.

Managed Objects

Process, service, managed IP address, node alias, data source.

Managed Process

A process or service that has been defined to AAM and is started and stopped under the control of AAM from the GUI, a rule, or a resource group.

Management Console

The graphical user interface that provides access to create and manage cluster objects.

Network Interface Card

The hardware component that allows the node to communicate over the network. IP addresses are assigned through software to the network interface card, and those addresses are used to determine the source and destination of network communication. Also referred to as a NIC.
NIC Group

A mechanism used by Legato Automated Availability Manager to group NICs on the same subnet. Once NICs are included in a group, NIC management tasks such as NIC testing and NIC-to-NIC failover can be performed.

NIC-to-NIC Failover

A failover scheme that allows all IP addresses assigned to a Network Interface Card (NIC), including subnet and MAC address information if any, to move to another NIC on the same node if the first NIC fails.

Node Alias

A name that can be added to a Windows node that NetBIOS can recognize, allowing access from computers on the network. Called a Virtual Server by MSCS, the AAM node alias name is not registered with DNS or WINS and is only accessible from NetBIOS clients.

Node Proxy

An aware proxy process provided by AAM that publishes node-related sensors and actuators on behalf of a node. See also Proxy Process.

Node States

Running, Shutdown, Failed, Agent Failed, Unknown.

Nodeless Resource Group

A resource group intended to be controlled from a rule. Because the rule determines the nodes on which the resource group will run, no nodes are assigned in the Preferred Node List. The nodeless resource group is configured and, once assigned to a node by the controlling rule, runs like a standard resource group. See also Resource Group.

Persistent Variable

A variable that is maintained in the AAM Replicated Database rather than in local memory on the node. It remains available even if the rule referencing it fails.
Physical IP Address

The IP address that is manually assigned to a Network Interface Card (NIC) by a user. This IP address is usually one of the primary addresses used by the NIC. Once in place, this IP address can be managed by AAM. See also Virtual IP Address, Network Interface Card.

Preferred Node List

The subset of nodes on which a resource group can be brought online.

Primary Agent

A node running an agent, process monitor, and rule interpreter, which provide the main activities for management of a cluster. A primary agent also maintains a copy of the replicated database. At least one primary agent must be running at all times for AAM to function properly. There are typically two to five primary agents in a domain; any domain of two or more nodes should include at least two primary agent nodes. Other nodes in a domain should run as secondary agents.

Process Monitor

Monitors the state of processes and services on a node, and has the secondary function of managing the node failure detection mechanism using heartbeats. Also referred to as the ProcMon.

Process Proxy

A proxy process provided by AAM that publishes process-related sensors and actuators on behalf of a managed process. The process proxy is associated with the process in AAM's process configuration and publishes sensors to the given sensor class. See also Proxy Process.

Process States

Unknown, Running, No Response, Stopping, Stopped, Failed.

Properties

Static characteristics of a node, such as OS version, physical memory, and so on.

Proxy Process

Allows an application to publish actuators and sensors on behalf of another (unaware) process in a tightly integrated manner. See also Node Proxy, Process Proxy, Managed Process.
Publish

The act of providing sensor data to AAM.

Relocate

Move a resource group currently running on one node to another node in the Preferred Node List. Relocation allows for manual load balancing.

Replicated Database

Internal database where objects and object states are kept. Each primary agent maintains a synchronous copy.

Resource Group

The collection of objects, such as services, processes, IP addresses, data sources, and others, that comprise a failover group. Also, the Management Console screen that allows for the selection of objects, nodes, and attributes of a managed failover group. See also Nodeless Resource Group.

Shutdown Sequence

The list of resource objects that will be taken offline, the scripts that will execute, the events that will be posted, and the delays that will be executed, in the order specified, when attempting to take the resource group offline.

Startup Sequence

The list of resource objects that will be brought online, the scripts that will execute, the events that will be posted, and the delays that will be executed, in the order specified, when attempting to bring the resource group online.

Resource Objects

The objects that can be managed by AAM, such as processes, services, IP addresses, node aliases, and data sources.

Response Test

An optional test of a process's response state. Response monitors test that the target process is healthy and responding correctly. AAM provides the option of user-defined response monitors for processes and services, which can provide a more detailed and accurate status of the process or service.
Response Time

The time period within which the agent should expect to receive a message from an aware process.

Rule

A user-definable Perl script that is evaluated when a trigger condition changes.

Rule API

A collection of AAM subroutines that are accessible from a rule for performing AAM-related management capabilities. Most rule API subroutines can also be used in resource group scripts.

Rule Interpreter

The agent process that evaluates custom rules (and resource groups) on primary agent nodes.

Scope

The range over which a trigger applies. For node sensors, the scope indicates the nodes to which the trigger applies; for process sensors, the scope allows for the specification of the processes to which the trigger applies.

Secondary Agent

Two AAM processes, the agent and process monitor, that provide management capabilities for a node. A secondary agent does not execute rules, nor does it maintain a copy of the replicated database. Secondary agents allow AAM domains to scale to as many as 100 nodes with minimal overhead.

Security

The list of users with permissions in the domain and the level of access allowed to each user. AAM provides three forms of user-level access: User, Operator, and Administrator.

Sensor

A piece of data to be monitored. Provides data values for use by triggers and rules. Sensors are published by the agent, by proxy processes, and by any other AAM-aware processes.
SDK

Provides the AAM programming API, enabling users to develop their own aware applications. With the SDK, users can write additional sensors and actuators to complement and extend those provided by AAM.

Start Script

A script used in conjunction with a managed process that performs specific actions each time the process is started. The start script is often used to launch a number of related processes that are necessary for the proper execution of an application. See also Stop Script.

State Monitor

AAM provides state monitoring capabilities for processes and services defined in the domain. There are two types of monitors to test different aspects of a process's health: the Response Monitor and the Existence Monitor.

Stop Script

A script used in conjunction with a managed process that performs specific actions each time the process is shut down gracefully by AAM. The stop script may ensure that all processes associated with an application are properly closed. See also Start Script.
Subnet Mask
A TCP/IP configuration parameter that extracts network and host configuration from an IP address. This 32-bit value allows TCP/IP to distinguish the network ID portion of the IP address from the host ID portion; the host ID identifies individual computers on the network. TCP/IP hosts use the subnet mask to determine whether a destination host is located on the local network or a remote network. A subnet mask is expressed as four decimal numbers from 0 to 255 separated by periods, for example: 255.255.0.0.
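The network ID/host ID split described above is a bitwise AND of the address and the mask. This short example uses the Python standard library's ipaddress module; the addresses are arbitrary illustrations.

```python
import ipaddress

# The mask 255.255.0.0 keeps the first 16 bits as the network ID and
# leaves the remaining 16 bits as the host ID.
host = ipaddress.ip_address("192.168.37.42")
mask = ipaddress.ip_address("255.255.0.0")
network_id = ipaddress.ip_address(int(host) & int(mask))   # 192.168.0.0

# Two hosts are on the same local network when their network IDs match.
peer = ipaddress.ip_address("192.168.200.9")
same_network = (int(peer) & int(mask)) == int(network_id)  # True
```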
Trigger
An object that monitors a sensor for a logic condition and reports any matching conditions to one or more rules. Triggers are the objects that cause rules to execute.

Unaware Process

A process that does not use the AAM runtime API and so is unaware of AAM. This includes third-party and shrink-wrapped software that is monitored by AAM. Unaware processes can be associated with a proxy process to allow AAM to monitor various conditions.

Utility Process

A process that can be executed from AAM through a rule or from the Management Console. The utility process is started, but not managed, by AAM. It is typically a short-lived program, often used to perform some auxiliary function. Utility processes are also referred to as UtilProcs.

Virtual IP Address

The movable IP address that is assigned to a Network Interface Card (NIC) by AAM. See also Physical IP Address, Network Interface Card.