Fixing troubled agents - Kevin Holman's System Center Blog - Site Home... http://blogs.technet.com/b/kevinholman/archive/2009/10/01/fixing-troub...

Fixing troubled agents


Kevin Holman 1 Oct 2009 5:23 PM

Sometimes agents will not talk to the management server upon initial installation, and sometimes an agent becomes unhealthy long after working fine. Agent health is an ongoing task in any OpsMgr admin's life. This post is NOT an end-to-end manual of all the factors that influence agent health, but that is something I am working on for a later time. There are many factors in an agent's ability to communicate and work as expected. A few key areas that commonly affect this are:

DNS name resolution (Agent to MS, and MS to Agent)
DNS domain membership (disjointed)
DNS suffix search order
Kerberos connectivity
Kerberos SPNs accessible
Firewalls blocking 5723
Firewalls blocking access to AD for authentication
Packet loss
Invalid or old registry entries
Missing registry entries
Corrupt registry
Default agent action accounts locked down/out (HSLockdown)
HealthService Certificate configuration issues
Hotfixes required for OS Compatibility
Management Server rejecting the agent
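
Two of the areas above, DNS name resolution and firewalls blocking 5723, are easy to test from the agent machine. The following is a minimal Python sketch, not an official OpsMgr tool; the function names are hypothetical, and it only checks name resolution and a plain TCP connect, not Kerberos or certificates.

```python
import socket

OPSMGR_PORT = 5723  # agent-to-management-server port named in the post


def resolve_host(name):
    """Return the sorted IP addresses that `name` resolves to, or [] on failure."""
    try:
        infos = socket.getaddrinfo(name, None)
    except socket.gaierror:
        return []
    return sorted({info[4][0] for info in infos})


def port_is_open(host, port=OPSMGR_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_management_server(name, port=OPSMGR_PORT):
    """Combine both checks into a small report for one management server name."""
    addresses = resolve_host(name)
    return {
        "name": name,
        "addresses": addresses,
        "port_open": bool(addresses) and port_is_open(name, port),
    }
```

Calling `check_management_server("ms01.example.com")` (a placeholder name) from the agent returns a small dictionary. An empty `addresses` list points at DNS, while resolved addresses with `port_open` set to False point at a firewall or a management server that is not listening.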

How do you detect agent issues from the console? The problem might be that they are not showing up in the console at all! Perhaps it is a manual install that never shows up in Pending Actions, or a push deployment that stays stuck in Pending Actions and never shows up under Agent Managed. Or even one that does show up under Agent Managed but never shows as being monitored, returning agent version data, etc.

One of the BEST things you can do when faced with an agent health issue is to look on the agent, in the OperationsManager event log. This is a fairly verbose log that will almost always give you a good hint as to the trouble with the agent. That is ALWAYS one of my first steps in troubleshooting.
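
For admins who prefer a command line to Event Viewer, here is a hedged sketch that pulls the most recent warnings and errors with the built-in `wevtutil` tool. It assumes the log is registered under the name "Operations Manager" and that the script runs on the agent itself; the function names are hypothetical.

```python
import subprocess

# Assumption: the agent's log is registered under the name "Operations Manager".
LOG_NAME = "Operations Manager"


def build_event_query(count=20, levels=(1, 2, 3)):
    """Build a wevtutil command listing the newest `count` events at the given levels.

    Levels follow the Windows event schema: 1 = Critical, 2 = Error, 3 = Warning.
    """
    level_filter = " or ".join(f"Level={level}" for level in levels)
    return [
        "wevtutil", "qe", LOG_NAME,
        f"/q:*[System[({level_filter})]]",
        f"/c:{count}",
        "/rd:true",  # newest events first
        "/f:text",   # human-readable output
    ]


def read_recent_events(count=20):
    """Run the query on the local Windows machine and return its text output."""
    result = subprocess.run(
        build_event_query(count), capture_output=True, text=True, check=True
    )
    return result.stdout
```

Building the command separately from running it makes it easy to print the exact `wevtutil` line and try it by hand first.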

Another way of examining agent health is through the built-in views in OpsMgr. In the console there is a view located at the following:

This view is important because it gives us a perspective of the agent from two different points:

1. The perspective of the agent monitors running on the agent, measuring its own health.
2. The perspective of the Health Service Watcher, which is the agent being monitored from a Management Server.

If any of these are red or yellow, that is an excellent place to start. This should be an area that your level 1 support for Operations Manager checks DAILY. We should never have a high number of agents that are not green here. If they aren't green, this is indicative of an unhealthy environment, or of the admin team not adhering to best practices (such as keeping up with hotfixes, using maintenance mode correctly, etc.). Use Health Explorer on these views to drill down into exactly what is causing the Agent or Health Service Watcher state to be unhealthy.

Now, the following are some general steps to take to fix broken agents. These are not in definitive order. The order of steps really comes down to what you find when looking at the logs after taking these steps.

Start the HealthService on the agent. You might find the HealthService is just not running. This should not be common or systemic. Consider enabling the recovery for this condition to restart the HealthService on Heartbeat failure. However, if this is systemic, it is indicative of something causing your HealthService to restart too frequently, or of administrators stopping SCOM. Look in the OpsMgr event log for verification.

Bounce the HealthService on the agent. Sometimes this is all that is needed to resolve an agent issue. Look in the OpsMgr event log after a HealthService restart, to make sure it is clean with no errors.
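
The start and bounce steps can be scripted. The sketch below is one possible approach, assuming an English-language Windows install (the `sc query` output is parsed as text) and an elevated prompt; the helper names are hypothetical, and the `run` parameter exists only so the logic can be exercised without touching a real service.

```python
import subprocess

SERVICE = "HealthService"  # service name used throughout the post


def parse_sc_state(sc_output):
    """Extract the state word (e.g. RUNNING, STOPPED) from `sc query` output."""
    for line in sc_output.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "STATE":
            parts = value.split()
            # The value looks like "4  RUNNING"; the word after the number is the state.
            return parts[1] if len(parts) > 1 else None
    return None


def bounce_commands(service=SERVICE):
    """Commands that stop and then start the service."""
    return [["net", "stop", service], ["net", "start", service]]


def bounce(service=SERVICE, run=subprocess.run):
    """Restart the service and return its state afterwards."""
    for command in bounce_commands(service):
        run(command, check=True)
    result = run(["sc", "query", service], capture_output=True, text=True, check=True)
    return parse_sc_state(result.stdout)
```

A return value of `"RUNNING"` only confirms the service came back; the event log is still where you check that the restart was clean.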

Clear the HealthService queue and config (manually). This is done by stopping the HealthService, then deleting the \Program Files\System Center Operations Manager 2007\Health Service State folder, then starting the HealthService. This removes the agent config file and the agent queue files. The agent starts up with no configuration, so it will resort to the registry to determine which management server to talk to. From the registry it will find out whether it is AD integrated or, if not, which fixed management server to talk to. This is located under HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Agent Management Groups\PROD1\Parent Health Services\ (where PROD1 is the management group name), in the \<#>\NetworkName string value. The agent will contact the management server, request config, receive config, download the appropriate management packs, apply them, run the discoveries, send up discovery data, and repeat the cycle for a little while. This is very much what happens on a new agent during initial deployment.
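
The manual flush is a stop, delete, start sequence, which can be sketched in Python as follows. This is an illustration under assumptions, not a supported tool: the drive letter in the default path is assumed, the function name is hypothetical, and it must run elevated on the agent.

```python
import shutil
import subprocess
from pathlib import Path

SERVICE = "HealthService"
# Default folder from the post; the C: drive letter is an assumption.
STATE_DIR = Path(
    r"C:\Program Files\System Center Operations Manager 2007\Health Service State"
)


def flush_health_service_state(state_dir=STATE_DIR, service=SERVICE, run=subprocess.run):
    """Stop the service, delete the state folder, and start the service again.

    Returns True if a state folder existed and was removed.
    """
    # net stop exits non-zero if the service is already stopped,
    # so its exit code is deliberately not checked here.
    run(["net", "stop", service], check=False)
    try:
        existed = Path(state_dir).is_dir()
        if existed:
            shutil.rmtree(state_dir)
    finally:
        # Always try to bring the service back, even if the delete failed.
        run(["net", "start", service], check=True)
    return existed
```

The `finally` block is a deliberate design choice: if the folder cannot be deleted (for example because a file is still locked), the agent is at least left running rather than stopped.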

Clear the HealthService queue and config (from the console). When looking at the above view (or any state view or discovered inventory view which targets the HealthService or Agent class), there is a task in the actions pane: Flush Health Service State and Cache. This performs a very similar action to the one above, as a console task. It will only work on an agent that is somewhat responsive; if it does not work, you need to perform this manually, as the agent is really broken from communication with the management server. This task will never complete and will not return success, because the task breaks off from itself as the queue is flushed.

Repair the agent from the console. This is done from the Administration pane > Agent Managed. You should not run a repair on any AD-integrated agent, as this will break the AD integration and assign it to the management server that ran the repair action. A repair technically just reinstalls the agent in a push fashion, just like an initial agent deployment. It will also apply/reapply any agent-related hotfixes in the management server's \Program Files\System Center Operations Manager 2007\AgentManagement\ directories.

Reinstall the agent (manually). This would be for manual installs or when push/repair is not possible. This section is where the combination of options gets a little tricky. When you are at the point where you have given up, I find just going all the way with a brute force reinstall is the best way. This means performing the following steps:

1. Uninstall the agent via Add/Remove Programs.
2. Run the Operations Manager Cleanup Tool, CleanMom.exe or CleanMOM64.exe. This is designed to make sure that the service, files, and all registry entries are removed.
3. Ensure that the agent's folder is removed at: \Program Files\System Center Operations Manager 2007\
4. Ensure that the following registry keys are deleted:
   HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft Operations Manager
   HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\HealthService
5. Reboot the agent machine (if possible).
6. Delete the agent from Agent Managed in the OpsMgr console. This will allow a new HealthService ID to be detected and is sometimes a required step to get an agent to work properly, although not always.
7. Now that the agent is gone cleanly from both the OpsMgr console and the agent operating system, manually reinstall the agent. Keep it simple: install it using a named management server/management group, and use Local System for the agent action account (this removes any common issues with a low priv domain account, and with AD integration if used). If it works correctly, you can always reinstall again using low priv or AD integration.
8. Remember to import certificates at this point if you are using those on the individual agent.
9. As always, look in the OperationsManager event log. This will tell you if it connected and is working, or if there is a connectivity issue.
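
Two parts of the reinstall lend themselves to a script: confirming the registry keys are really gone, and building a "keep it simple" silent install command. The sketch below is illustrative only. The MSI property names are assumptions based on the documented OpsMgr 2007 agent command-line install and should be verified against your version; the function names are hypothetical, and the management group name is assumed to contain no spaces.

```python
# Registry keys (under HKEY_LOCAL_MACHINE) that the post says must be gone.
LEFTOVER_KEYS = [
    r"SOFTWARE\Microsoft\Microsoft Operations Manager",
    r"SYSTEM\CurrentControlSet\Services\HealthService",
]


def hklm_key_exists(subkey):
    """Return True if the key exists under HKEY_LOCAL_MACHINE (Windows only)."""
    import winreg  # imported here so the rest of the module loads on any OS
    try:
        winreg.CloseKey(winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, subkey))
        return True
    except FileNotFoundError:
        return False


def leftover_registry_keys(key_exists=hklm_key_exists):
    """List the keys from LEFTOVER_KEYS that are still present."""
    return [key for key in LEFTOVER_KEYS if key_exists(key)]


def build_agent_install_command(msi_path, management_group, management_server, port=5723):
    """Build a silent msiexec command for a manual agent install.

    Property names are assumptions; check them against your agent version.
    """
    return [
        "msiexec", "/i", msi_path, "/qn",
        "USE_SETTINGS_FROM_AD=0",          # named management server, no AD integration
        f"MANAGEMENT_GROUP={management_group}",
        f"MANAGEMENT_SERVER_DNS={management_server}",
        f"SECURE_PORT={port}",
        "ACTIONS_USE_COMPUTER_ACCOUNT=1",  # Local System as the agent action account
    ]
```

An empty list from `leftover_registry_keys()` means the cleanup is complete. The install command can be printed and reviewed before being passed to `subprocess.run`.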

To summarize, there are many things that can cause an agent issue, and many methods to troubleshoot. At a very general level, my typical steps are:

1. Review the OpsMgr event log on the agent.
2. Bounce the HealthService.
3. Bounce the HealthService, clearing the \Health Service State folder.
4. Complete brute force reinstall of the agent.

If an external issue is causing the problem (DNS, Kerberos, firewall), then these steps likely will not help you, but those issues should be visible in the OpsMgr event log.

Also make sure you see my other posts on agent health and troubleshooting during deployment:

Console based Agent Deployment Troubleshooting table
Agent discovery and push troubleshooting in OpsMgr 2007
Getting lots of Script Failed To Run alerts? WMI Probe Failed Execution? Backward Compatibility?
Agent Pending Actions can get out of synch between the Console, and the database
Which hotfixes should I apply?

Comments

Sameer Dave 2 Oct 2009 5:19 PM

That's a very good article, Kevin. I have seen one more problem, where agents are hung in one of the SQL tables, especially during new installations. I have seen that once you delete that information from the tables, you can install the agent again fine. Thanks for the great article once again.

Kevin Holman 2 Oct 2009 5:35 PM

We really try hard to come up with ways to solve the problem without resorting to editing a SQL table directly... doing so is really unsupported and should only be done under the direct guidance (or should I say order) of PSS in a case with Microsoft. There are a few circumstances where that seems to be the only recourse... but we should exhaust all other options first.

TechJet 2010 30 Jun 2010 2:52 PM

Thanks Kevin, very good blog. Can you provide any additional advice or reasons why an agent's health turns grey? We get this a lot. The agents are multi-homed; could this be having an impact?

DJ 24 Sep 2010 8:16 AM

This blog is very useful. As to other potential issues with grey agents, check out this KB: support.microsoft.com/.../2288515.

Muhammad Saad 4 Apr 2011 5:06 AM

Simply log on to the DC and run the following commands:

1. hslockdown /L
   You will see that NT Authority\system is in a denied state.
2. Then run this command to bring it into an allowed state:
   hslockdown /A "NT AUTHORITY\System"

Cheers,
Saad

Coolz203 7 Apr 2011 6:18 PM

Hi Kevin, great article. I have used this advice a few times to help with agent issues. However, I have come across an issue I am having a hard time with. I have an agent deployed, and the agent is showing healthy in the Agent State view. This particular agent is on a Windows 2008 R2 server. For some reason the discovery of this Windows 2008 server is not working. I have other Windows 2008 servers that are working fine. The agent knows enough that it is on a Windows server, but all of the OS-specific monitors are not active. The logs show nothing. I am at a loss here. I have cleared the cache and repaired the agent. Any help is appreciated. Thanks.

Varun Kaura 18 Aug 2011 3:55 AM

Useful information, Kevin. I'm also in a situation where 1-2 agents are not healthy. While checking, I found the config.xml file is not updated, though I cleared the cache and even allowed the system to recreate the Health Service State folder; that also failed to update the XML file. I've also noticed Temp folders are not getting created on these agents. I've reinstalled the agent as well. The agent gets into a greyed state after a while, even if I restart the service. In the event log I see a lot of entries such as: Rule/Monitor "Microsoft.SystemCenter.LearningModule.FailedInitialization.Alert" cannot be initialized and will not be loaded, and many more similar to this. Any help is appreciated.

jayson 20 Dec 2011 2:32 PM

Kevin, my problem lies on the Root Management Server. Absolutely everything is running with no issues, but for some reason the server is greyed out. I can restart the service and it is okay for a few minutes, then it goes right back into the greyed out status... Operationally, everything functions correctly, but it just never looks good to see the RMS greyed out... Any ideas?

MattM 12 Jan 2012 5:37 AM

Jayson, your issue may be related to SPNs. We just wrestled with it and finally got it right.
www2.wolzak.com/.../15-the-opsmgr-connector-could-not-connect-to-msomhsvcrms01local
mymomexperience.blogspot.com/.../ops-manager-2007-agent-not-connecting.html

shahar 15 Jan 2012 11:26 PM

After upgrading System Center Essentials 2007 with the latest OS Management Pack, the agent on the owner node of the Hyper-V cluster became grayed out. If I move the cluster's current host server to another server, that one becomes grayed out and the previous one (which was the current host server before) becomes healthy again.

Hemant 13 Apr 2012 11:53 PM

How can I flush the Health Service State and Cache on multiple machines at a time? Is any command line utility available?
