
Veritas Cluster

Veritas Cluster enables one system to fail over to another. All related software processes are moved from one system to the other with minimal downtime. As configured here, Veritas Cluster does NOT have both boxes up at once servicing requests; it only offers a hot standby system. This keeps the service running (with a short transfer period) if a machine fails or system maintenance needs to be done.

Veritas Cluster Install Preparation & Planning


This is the MOST important part of the Veritas cluster installation process. If you skip a step, you pay for it later!

- Machines racked
- Machines jumpstarted or equivalent (the last 4 E3500s needed to be done by hand)
- On one machine, set scsi-initiator to 7
- Array installed; verify disks can be seen on both machines
- Additional patches for Veritas installed (download the latest Sun recommended patches); these are not in jumpstart!
- Send off for the Veritas license; if it is not there in time, get a temp license
- Software ready for install
- Hardware meets specifications
- Veritas checklist filled out & ready to go
- Change the internet/network link to qfe1; put crossover cables on hme0 & qfe0 (REMOVE hostname.hme0 and hostname.qfe0)

Veritas Cluster Install


Verify that preplanning & preparation was completed properly. If not, do NOT proceed.

If preparation was completed, on BOTH machines:

- /etc/init.d/volmgt start
- Insert the cdrom for db edition 2.1 [or higher]
- cd /cdrom/cdrom0
- ./installDBED [answer default/yes to all EXCEPT say no to single / group]
- [may need to change cdroms here - not sure]
- installvx
- Remove the cdrom
- vxdiskadm -- encapsulate root; specify your 2 root drives
- The process will reboot twice
- Create mount points for your db

Once done, on ONE machine:

- vxdiskadm -- initialize drives in array
- vxassist -g oradg -maxsize drive1 drive2 -- set up array drives
- Mount disks
- Add to vfstab

DBAs install oracle on ONE machine, update /etc/system on BOTH machines, and add the table for vcs.

On BOTH machines:

- /etc/init.d/volmgt start
- Insert the cluster server cd
- cd /cdrom/cdrom0
- pkgadd -d . (add packages 3,2,5,1,4,6; yes to everything)
- Eject cdrom; mount the oracle cluster agent cd
- cd /cdrom/cdrom0
- pkgadd -d .
- Eject cdrom
- cd /opt/VRTSllt
- cp llttab /etc
- cd /etc
- vi /etc/llttab -- uncomment/change the following:
    set-node 0 [on one machine set to 1]
    set-cluster 0
    link hme0 & qfe0
    link-lowpri qfe1
    start (at bottom)
- vi /etc/gabtab -- uncomment gabconfig -c -n 2
- cd /etc/rc2.d
- Start llt and gab on both machines
- /sbin/lltconfig -a list [check for all 3 interfaces]
- gabconfig -a [check for membership]
- Add /sbin and /opt/VRTSvcs/bin to PATH in /.profile

On ONE machine:

- mkdir /etc/VRTSvcs/conf/config
- cd /etc/VRTSvcs/conf
- cp *.cf config
- cd sample-oracle
- cp main.cf ../config [may need to copy other files - check]
- cd ../config
- vi main.cf:
    update systema and systemb
    update SystemList and AutoStartList
    add diskgroups
    IP qfe1
    nic-qfe1
    add mountpoints
    update oracle info
    build dependencies: listener, oracle, mount, volumes, diskgroup, vip, nic
- hacf -verify .
- Manually stop listener [lsnrctl stop]
- Manually stop db [svrmgrl; connect internal; shutdown immediate]
- Take mountpoints out of vfstab
- hastart [start cluster]
- hagrp -switch oragrp -to systemb [test switchover]
- Run veritas testing on both machines.
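Before starting llt and gab, it can help to confirm that the directives named in the llttab step are actually uncommented. The following is only a sketch: check_llttab is a hypothetical helper, not a Veritas tool, and it assumes each directive starts its own line in the file.

```shell
#!/bin/sh
# check_llttab FILE: report which of the directives named in the install
# steps (set-node, set-cluster, link, start) are missing or still commented
# out. Hypothetical helper; it only reads the file.
check_llttab() {
    file=$1
    missing=0
    for directive in set-node set-cluster link start; do
        # Match the directive as the first word of an uncommented line.
        if ! grep -Eq "^[[:space:]]*${directive}([[:space:]]|\$)" "$file"; then
            echo "missing: $directive"
            missing=1
        fi
    done
    return $missing
}
```

A typical call would be check_llttab /etc/llttab on each machine; it returns non-zero if anything is missing.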

Cluster Startup
Here is what the cluster does at startup:

- The node checks if the other node is already started; if so, it stays OFFLINE.
- If no other machine is running, it checks communication (gabconfig). This may need system admin intervention if the cluster requires both nodes to be available (/sbin/gabconfig -c -x).
- Once communication between machines is open, or gabconfig has been started, it sets up the network (nic & ip address) (starts cluster server).
- It also brings up volume manager, file system, and then oracle.

If any of the critical processes fail, the whole system is faulted. The most common reason for failing is expired licenses, so check licenses with vxlicense -p before doing any work.

File Locations (Logs, Conf, Executables)


Log location: /var/VRTSvcs/log

There are several logs in this directory:

- hashadow-log_A: hashadow checks to see if the ha cluster daemon (had) is up and restarts it if needed. This is the log of that process.
- engine.log_A: primary log, usually what you will be reading for debugging
- Oracle_A: oracle process log (related to cluster only)
- Sqlnet_A: sqlnet process log (related to cluster only)
- IP_A: related to shared IP
- Volume_A: related to Volume Manager
- Mount_A: related to mounting actual filesystems
- DiskGroup_A: related to Volume Manager/Cluster Server
- NIC_A: related to actual network device

Look at the most recent ones for debugging purposes (ls -ltr).

Conf files:

- Llt conf: /etc/llttab [should NOT need to access this]
- Network conf: /etc/gabtab. If it has /sbin/gabconfig -c -n2, you will need to run /sbin/gabconfig -c -x if only one system comes up and both systems were down.
- Cluster conf: /etc/VRTSvcs/conf/config/main.cf. Has exact details on what the cluster contains.

Most executables are in /opt/VRTSvcs/bin or /sbin.
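The "look at the most recent ones" habit can be wrapped in a small function. This is a minimal sketch; newest_logs is a hypothetical name, and the default count of 5 is an arbitrary choice.

```shell
#!/bin/sh
# newest_logs [DIR] [COUNT]: print the COUNT most recently modified files in
# DIR, newest last (the same ordering as "ls -ltr").
newest_logs() {
    dir=${1:-/var/VRTSvcs/log}
    count=${2:-5}
    ls -tr "$dir" | tail -n "$count"
}
```

Run with no arguments it lists the five newest files in /var/VRTSvcs/log; the last name printed is the log that was written most recently.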

Changing Configurations
ALWAYS be very careful when changing the cluster configuration. The only time I needed to change the cluster configuration was when Vipul upgraded Oracle versions and ORACLE_HOME changed directories. This is a very dangerous thing to do.

There are two ways of changing the configuration.

If the system is up (cluster is running on at least one node, preferably on both):

    haconf -makerw
    run needed commands (ie. hasys ....)
    haconf -dump -makero

If both systems are down:

    hastop -all    (shouldn't need this as cluster is down)
    cp main.cf main.cf.save
    vi main.cf
    hacf -verify /etc/VRTSvcs/conf/config
    hacf -generate /etc/VRTSvcs/conf/config
    hastart
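For the online method, one risk is forgetting to make the configuration read-only again after a failed command. The sketch below pairs the two haconf calls around a single command; with_rw_config is a hypothetical wrapper name, and haconf is assumed to be on PATH.

```shell
#!/bin/sh
# with_rw_config CMD [ARGS...]: open the running cluster configuration for
# writing, run one command, then dump it and make it read-only again.
# Hypothetical wrapper around the haconf calls shown above.
with_rw_config() {
    haconf -makerw || return 1
    status=0
    "$@" || status=$?
    # Always close the configuration, even if the command failed.
    haconf -dump -makero || return 1
    return $status
}
```

For example: with_rw_config hagrp -modify oragrp AutoStartList system1 system2. The wrapper returns the exit status of the wrapped command.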

Veritas Volume Manager

Veritas Volume Manager divides disks into disk groups and partitions these groups as desired. There is a nice GUI which helps a lot. You can even pull up a command window to see what the GUI is running. The newest version of the GUI is vmsa.

General Commands:

- Veritas Volume Manager license info: /usr/sbin/vxlicense -p
- What volume groups: vxdg list
- Import volume group (see details on cluster debugging): vxdg import oradg
- Specific volume group info: vxprint -ht
- What is veritas doing (if running another command and it is hanging): vxtask list
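When checking which disk groups are imported on a node, it is handy to filter rootdg out of the vxdg list output. This sketch assumes the output has a header line starting with NAME followed by one line per disk group with the name in the first column; extra_diskgroups is a hypothetical helper name.

```shell
#!/bin/sh
# extra_diskgroups: read "vxdg list" output on stdin and print every disk
# group name other than rootdg. Assumes a "NAME STATE ID" header line and
# the group name in the first column.
extra_diskgroups() {
    awk 'NF > 0 && $1 != "NAME" && $1 != "rootdg" { print $1 }'
}
```

Usage: vxdg list | extra_diskgroups. Empty output means only rootdg is imported on that node.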

Veritas file systems


VERITAS filesystem is a journaling filesystem. This means transactions are logged so that if a failure occurs, the transactions can be rolled forward. This prevents filesystem corruption and makes the filesystem check (fsck) faster. Normally, at boot-up after a crash, a system needs to check the integrity of every filesystem in full. With VERITAS, it checks the journaled logs, and then comes up. This can save 30-60 minutes on large filesystems. Maintenance for VERITAS file system is often done with VERITAS Volume Manager. Further documentation for Veritas FileSystem is in /opt/VRTSfsdoc.

Veritas Cluster Debugging Tips

Veritas cluster server is a high availability server. This means that processes switch between servers when a server fails. All database processes are run through this server, and as such, it needs to run smoothly. Note that the oracle process should only actually be running on the server which is active. On monitoring tools, the procs light for whichever box is secondary should be yellow, because oracle is not running. Yet, the cluster is running on both systems.

Cluster Not Up -- HELP

The normal debugging steps include: checking on status, restarting if no faults, checking licenses, clearing faults if needed, and checking logs.

To find out Current Status:

    /opt/VRTSvcs/bin/hastatus -summary

This will give the general status of each machine and its processes.

    /opt/VRTSvcs/bin/hares -display

This gives much more detail, down to the resource level.

If hastatus fails on both machines (it returns that the cluster is not up or returns nothing), try to start the cluster:

    /opt/VRTSvcs/bin/hastart
    /opt/VRTSvcs/bin/hastatus -summary

hastatus -summary will tell you if processes started properly. hastart will NOT start processes on a FAULTED system.

Starting Single System NOT Faulted

If the system is NOT FAULTED and only one system is up, the cluster probably needs to have gabconfig manually started. Do this by running:

    /sbin/gabconfig -c -x
    /opt/VRTSvcs/bin/hastart
    /opt/VRTSvcs/bin/hastatus -summary

If the system is faulted, check licenses and clear the faults as described next.

To check licenses:

    vxlicense -p

Make sure all licenses are current and NOT expired! If they are expired, that is your problem. Call VERITAS to get temporary licenses.

There is a BUG with veritas licenses. Veritas will not run if there are ANY expired licenses, even if you have the valid ones you need. To get veritas to run, you will need to MOVE the expired licenses. [Note: you will minimally need VXFS, VxVM and RAID licenses to NOT be expired from what I understand.]

    vxlicense -p         [note the NUMBER after the license (ie: Feature name: DATABASE_EDITION [100])]
    cd /etc/vx/elm
    mkdir old
    mv lic.number old    [do this for all expired licenses]
    vxlicense -p         [make sure there are no expired licenses AND your good licenses are there]
    hastart

If it still fails, call veritas for temp licenses. Otherwise, be certain to do the same on your second machine.

To clear FAULTS:

    hares -display

For each resource that is faulted run:

    hares -clear resource-name -sys faulted-system

If all of these clear, then run hastatus -summary and make sure that these are clear. If some don't clear, you MAY be able to clear them on the group level. Only do this as a last resort:

    hagrp -disableresources groupname
    hagrp -flush group -sys sysname
    hagrp -enableresources groupname

To get a group to go online:

    hagrp -online group -sys desired-system

If it did NOT clear, did you check licenses?
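Clearing faults one resource at a time gets tedious when several have faulted. The sketch below turns hares -display output into the matching hares -clear commands without running them. It assumes four whitespace-separated columns (resource, attribute, system, value), which is the layout of the hares -display lines quoted in this document; fault_clear_cmds is a hypothetical name.

```shell
#!/bin/sh
# fault_clear_cmds: read "hares -display" output on stdin and print one
# "hares -clear" command per faulted resource. Assumes the columns are
# resource, attribute, system, value. Prints commands only; runs nothing.
fault_clear_cmds() {
    awk '$2 == "State" && $4 ~ /FAULTED/ {
        printf "hares -clear %s -sys %s\n", $1, $3
    }'
}
```

Usage: hares -display | fault_clear_cmds. Review the printed commands before running any of them.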

System has the following EXACT status:

    gedb002# hastatus -summary
    -- SYSTEM STATE
    -- System     State      Frozen
    A  gedb001    RUNNING    0
    A  gedb002    RUNNING    0

    -- GROUP STATE
    -- Group    System     Probed   AutoDisabled   State
    B  oragrp   gedb001    Y        N              OFFLINE
    B  oragrp   gedb002    Y        N              OFFLINE

    gedb002# hares -display | grep ONLINE
    nic-qfe3   State   gedb001   ONLINE
    nic-qfe3   State   gedb002   ONLINE

    gedb002# vxdg list
    NAME     STATE     ID
    rootdg   enabled   957265489.1025.gedb002

    gedb001# vxdg list
    NAME     STATE     ID
    rootdg   enabled   957266358.1025.gedb001

Recovery Commands:

- hastop -all
- On one machine: hastart
- Wait a few minutes
- On the other machine: hastart

Reviewing Log Files

If you are still having trouble, look at the logs in /var/VRTSvcs/log. Look at the most recent ones for debugging purposes (ls -ltr). Here is a short description of the logs in /var/VRTSvcs/log:

- hashadow-log_A: hashadow checks to see if the ha cluster daemon (had) is up and restarts it if needed. This is the log of that process.
- engine.log_A: primary log, usually what you will be reading for debugging
- Oracle_A: oracle process log (related to cluster only)
- Sqlnet_A: sqlnet process log (related to cluster only)
- IP_A: related to shared IP
- Volume_A: related to Volume Manager
- Mount_A: related to mounting actual filesystems
- DiskGroup_A: related to Volume Manager/Cluster Server
- NIC_A: related to actual network device

By looking at the most recent logs, you can tell what failed last (or most recently). You can also tell what did NOT run, which may be just as much of a clue. Of course, if none of this helps, open a call with veritas tech support.

Calling Tech Support:

If you have tried the previously described debugging methods, call Veritas tech support: 800-634-4747. Your company needs to have a Veritas support contract.

Restarting Services:

If a system is gracefully shut down and it was running oracle or other high availability services, it will NOT transfer them. It only transfers services when the system crashes or has an error.

    hastart
    hastatus -summary

hastatus -summary will tell you if processes started properly. hastart will NOT start processes on a FAULTED system. If the system is faulted, clear the faults as described above.

Doing Maintenance on DBs:

BEFORE working on the DB, run:

    hastop -all -force

AFTER working on the DBs:

- You MUST bring up oracle on the same machine.
- Once Oracle is up, run hastart on the same machine you started the work on (the system with oracle running goes first).
- Wait 3-5 minutes, then run hastart on the other system.
- If you need the instance to run on the other system, you can run: hagrp -switch oragrp -to othersystem

Shutting down db machines:

If you shut down the machine that is running veritas cluster, it will NOT start on the other machine. It only fails over if the machine crashes. You need to manually switch the services if you shut down the machine. To switch processes:

- Find out which groups to transfer over: hagrp -display
- Switch over each group: hagrp -switch group-to-move -to new-system
- Then shut down the machine as desired. When rebooted, it will start the cluster daemon automatically.

Doing Maintenance on Admin Network:

If the admin network (that the veritas cluster uses) is brought down, veritas WILL fault both machines AND bring down oracle (nicely). You will need to do the following to recover:

- hastop -all
- On ONE machine: hastart
- Wait 5 minutes
- On the other machine: hastart

Manual start/stop WITHOUT veritas cluster:

THIS IS ONLY USED WHEN THERE ARE DB FAILURES. If possible, use the section on DB Maintenance. Only use this if the system fails on coming up AND you KNOW that it is due to a db configuration error. If you manually start up filesystems/oracle, manually shut them down and restart using hastart when done.

To startup:

Make sure ONLY the rootdg volume group is active on BOTH NODEs [ie. oradg or xxoradg is NOT present]. This is EXTREMELY important: if the oracle disk group is active on both nodes, corruption occurs.

    vxdg list
    hastatus       (stop on both as you are faulted on both machines)
    hastop -all    (if either was active make sure you are truly shutdown!)

Once you have confirmed that the oracle disk group is not active, on ONE machine do the following:

    vxdg import oradg    [this may be xxoradg where xx is the client 2 char code]
    vxvol -g oradg startall
    mount -F vxfs /dev/vx/dsk/oradg/name /mountpoint    [find volumes and mount points in /etc/VRTSvcs/conf/config/main.cf]

Let DBAs do their stuff.

To shutdown:

    umount /mountpoint    [for each mountpoint]
    vxvol -g oradg stopall
    vxdg deport oradg
    clear faults; start cluster as described above
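The manual sequence is easy to get out of order under pressure, so here is a dry-run sketch that only prints the commands for a given disk group and its volume:mountpoint pairs. manual_cmds is a hypothetical helper, and the volume and mount point names must come from your own main.cf. Note that this sketch stops the volumes before deporting the disk group, because the volumes of a deported group can no longer be addressed.

```shell
#!/bin/sh
# manual_cmds start|stop DISKGROUP VOLUME:MOUNTPOINT...
# Print, without running, the manual command sequence for bringing the
# oracle disk group up or down outside the cluster. Hypothetical helper.
manual_cmds() {
    action=$1
    dg=$2
    shift 2
    case $action in
    start)
        echo "vxdg import $dg"
        echo "vxvol -g $dg startall"
        for pair in "$@"; do
            echo "mount -F vxfs /dev/vx/dsk/$dg/${pair%%:*} ${pair#*:}"
        done
        ;;
    stop)
        for pair in "$@"; do
            echo "umount ${pair#*:}"
        done
        # Volumes are stopped while the disk group is still imported.
        echo "vxvol -g $dg stopall"
        echo "vxdg deport $dg"
        ;;
    *)
        echo "usage: manual_cmds start|stop diskgroup volume:mountpoint..." >&2
        return 2
        ;;
    esac
}
```

For example, manual_cmds stop oradg oravol:/u01 prints the umount, stopall and deport lines in that order (oravol and /u01 are placeholder names).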

Testing Veritas Clusters


1. Check Veritas Licenses - for FileSystem, Volume Manager AND Cluster

    vxlicense -p

If any licenses are not valid or expired, get them FIXED before continuing! All licenses should say "No expiration". If ANY license has an actual expiration date, the test failed. Permanent licenses do NOT have an expiration date. Non-essential licenses may be moved; however, a senior admin should do this.

2. Hand check SystemList & AutoStartList

On either machine:

    grep SystemList /etc/VRTSvcs/conf/config/main.cf

You should get:

    SystemList = { system1, system2 }

Then:

    grep AutoStartList /etc/VRTSvcs/conf/config/main.cf

You should get:

    AutoStartList = { system1, system2 }

Each list should contain both machines. If not, many of the next tests will fail. If your lists do NOT contain both systems, you will probably need to modify them with the commands that follow.

    more /etc/VRTSvcs/conf/config/main.cf    (see if it is reasonable; it is likely that the systems aren't fully set up)
    haconf -makerw                           (this lets you write the conf file)
    hagrp -modify oragrp SystemList system1 0 system2 1
    hagrp -modify oragrp AutoStartList system1 system2
    haconf -dump -makero                     (this makes conf file read only again)
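The hand check in step 2 can be scripted roughly as follows. This is a sketch only: list_has_both is a hypothetical helper, it looks at the first line of the file that mentions the attribute, and it uses a plain substring match, so a name like system1 would also match system10.

```shell
#!/bin/sh
# list_has_both FILE ATTR SYS1 SYS2: succeed only if the first line of FILE
# mentioning ATTR names both systems. Hypothetical helper for the hand check.
list_has_both() {
    line=$(grep "$2" "$1" | head -n 1)
    case $line in
        *"$3"*) ;;
        *) return 1 ;;
    esac
    case $line in
        *"$4"*) ;;
        *) return 1 ;;
    esac
}
```

For example: list_has_both /etc/VRTSvcs/conf/config/main.cf AutoStartList system1 system2 || echo "AutoStartList is incomplete".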

3. Verify Cluster is Running

First verify that veritas is up & running:

    hastatus -summary

If this command could NOT be found, add /opt/VRTSvcs/bin to the PATH variable in root's /.profile:

    vi /.profile

If /.profile does not already exist, use this one:

    PATH=/usr/bin:/usr/sbin:/usr/ucb:/usr/local/bin:/opt/VRTSvcs/bin:/sbin:$PATH
    export PATH

If you changed /.profile, source it and re-verify that the command now runs:

    . /.profile
    hastatus -summary

Here is the expected result (your SYSTEMs/GROUPs may vary). One system should be OFFLINE and one system should be ONLINE, ie:

    # hastatus -summary
    -- SYSTEM STATE
    -- System     State      Frozen
    A  e4500a     RUNNING    0
    A  e4500b     RUNNING    0

    -- GROUP STATE
    -- Group    System    Probed   AutoDisabled   State
    B  oragrp   e4500a    Y        N              ONLINE
    B  oragrp   e4500b    Y        N              OFFLINE
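The "exactly one system ONLINE" check can be sketched as a filter over hastatus -summary output. The column layout is an assumption here: group lines are taken to start with B and to read group, system, probed, autodisabled, state. online_systems is a hypothetical name.

```shell
#!/bin/sh
# online_systems GROUP: read "hastatus -summary" output on stdin and print
# the systems where GROUP is ONLINE. Assumes group lines of the form
# "B group system probed autodisabled state".
online_systems() {
    awk -v grp="$1" '$1 == "B" && $2 == grp && $6 == "ONLINE" { print $3 }'
}
```

Usage: hastatus -summary | online_systems oragrp. For this configuration the output should be exactly one system name; no output means the group is offline everywhere.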

If your systems do not show the above status, try these debugging steps:

- If NO systems are up, run hastart on both systems and run hastatus -summary again.
- If only one system is shown, start the other system with hastart. Note: one system should ALWAYS be OFFLINE for the way we configure systems here. (If we ran oracle parallel server, this could change, but currently we run standard oracle server.)
- If both systems are up but are OFFLINE, hastart did NOT correct the problem, and oracle filesystems are not running on either system, the cluster needs to be reset. (This happens under strange network situations with GE Access.) [You ran hastart and that wasn't enough to get the full cluster to work.]

Verify that the systems have the following EXACT status (though your machine names will vary for other customers):

    gedb002# hastatus -summary
    -- SYSTEM STATE
    -- System     State      Frozen
    A  gedb001    RUNNING    0
    A  gedb002    RUNNING    0

    -- GROUP STATE
    -- Group    System     Probed   AutoDisabled   State
    B  oragrp   gedb001    Y        N              OFFLINE
    B  oragrp   gedb002    Y        N              OFFLINE

    gedb002# hares -display | grep ONLINE
    nic-qfe3   State   gedb001   ONLINE
    nic-qfe3   State   gedb002   ONLINE

    gedb002# vxdg list
    NAME     STATE     ID
    rootdg   enabled   957265489.1025.gedb002

    gedb001# vxdg list
    NAME     STATE     ID
    rootdg   enabled   957266358.1025.gedb001

Recovery Commands:

- hastop -all
- On one machine: hastart
- Wait a few minutes
- On the other machine: hastart
- hastatus -summary (make sure one is OFFLINE && one is ONLINE)

If none of these steps resolved the situation, contact Lorraine or Luke (possibly Russ Button or Jen Redman if they made it to Veritas Cluster class) or a Veritas Consultant.

4. Verify Services Can Switch Between Systems

Once hastatus -summary works, note the GROUP name used. Usually it will be "oragrp", but the installer can use any name, so please determine its name.

First check if the group can switch back and forth. On the system that is running (system1), switch veritas to the other system (system2):

    hagrp -switch groupname -to system2    [ie: hagrp -switch oragrp -to e4500b]

Watch the failover with hastatus -summary. Once it has failed over, switch it back:

    hagrp -switch groupname -to system1

5. Verify OTHER System Can Go Up & Down Smoothly For Maintenance

On the system that is OFFLINE (should be system2 at this point), reboot the computer:

    ssh system2 /usr/sbin/shutdown -i6 -g0 -y

Make sure that the system comes up & is running after the reboot. That is, when the reboot is finished, the second system should say it is offline using hastatus:

    hastatus -summary

Once this is done, switch the group to system2 and repeat the reboot for the other system:

    hagrp -switch groupname -to system2
    ssh system1 /usr/sbin/shutdown -i6 -g0 -y

Verify that system1 is in the cluster once rebooted:

    hastatus -summary

6. Test Actual Failover For System 2 (and pray db is okay)

To do this, we will kill off the listener process, which should force a failover. This test SHOULD be okay for the db (that is why we choose LISTENER) but there is a very small chance things will go wrong .. hence the "pray" part :).

On the system that is online (should be system2), kill off the ORACLE LISTENER process:

    ps -ef | grep LISTENER

Output should be like:

    root    1415   600  0 20:43:58 pts/0  0:00 grep LISTENER
    oracle   831     1  0 20:27:06 ?      0:00 /apps/oracle/product/8.1.5/bin/tnslsnr LISTENER -inherit

    kill -9 process-id    (the first # on the tnslsnr line; in this case 831)

Failover will take a few minutes. You will note that system 2 is faulted and system 1 is now online. You need to CLEAR the fault before trying to fail back over.

    hares -display | grep FAULT

For the resource that is failed (in this case, LISTENER), clear the fault:

    hares -clear resource-name -sys faulted-system    [ie: hares -clear LISTENER -sys e4500b]

7. Test Actual Failover For System 1 (and pray db is okay)

Now we do the same thing for the other system. First verify that the other system is NOT faulted:

    hastatus -summary

Again, we will kill off the listener process, which should force a failover. On the system that is online (should be system1), kill off the ORACLE LISTENER process:

    ps -ef | grep LISTENER

Output should be like:

    oracle   987     1  0 20:49:19 ?      0:00 /apps/oracle/product/8.1.5/bin/tnslsnr LISTENER -inherit
    root    1330   631  0 20:58:29 pts/0  0:00 grep LISTENER

    kill -9 process-id    (the first # on the tnslsnr line; in this case 987)

Failover will take a few minutes. You will note that system 1 is faulted and system 2 is now online. You need to CLEAR the fault before trying to fail back over.

    hares -display | grep FAULT

For the resource that is failed (in this case, LISTENER), clear the fault:

    hares -clear resource-name -sys faulted-system    [ie: hares -clear LISTENER -sys e4500a]

Run:

    hastatus -summary

to make sure everything is okay.
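In steps 6 and 7 the process id has to be read off the ps output by eye, and the grep line itself is easy to pick by mistake. A minimal sketch of extracting it, assuming the listener binary is named tnslsnr as in the sample output (listener_pid is a hypothetical name):

```shell
#!/bin/sh
# listener_pid: read "ps -ef" output on stdin and print the PID (second
# column) of the tnslsnr process, ignoring any grep line.
listener_pid() {
    awk '/tnslsnr/ && !/grep/ { print $2 }'
}
```

Usage: ps -ef | listener_pid. Check that exactly one PID comes back before passing it to kill -9.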

Links to Veritas Software products, services, and support.


1. Veritas Software Home Page Symantec Corp. and VERITAS Software merged on July 5, 2005. As a result, we are transitioning content from the VERITAS web sites to www.symantec.com.

2. Veritas Software Technical Services


Now on the Symantec site.

3. VERITAS Cluster Server Simulator


VERITAS Cluster Server Simulator is a new feature of VERITAS Cluster Server. The Simulator allows cluster administrators to simulate application failover scenarios and familiarize themselves with VERITAS Cluster Server.

4. Veritas Cluster Server definitions and links

From Wikipedia, the free encyclopedia

5. Veritas Cluster Server datasheet

This is an Adobe PDF of the VCS

6. Veritas Cluster Server Overview, from Symantec

7. Security advisory for Veritas Cluster Server for Unix, local access buffer overflow

Dated Nov. 14, 2005
