A Guide To Unexpected System Restarts - Red Hat Customer Portal

12/17/2014
A Guide to Unexpected System Restarts

Updated June 27 2014 at 9:16 PM
Contents
1. Introduction
2. Understanding the Environment
3. Investigating /var/log/messages
4. Where to Go Next
5. Kernel Panic
6. SysRq
7. IPMI and Baseboard Management Controllers
8. Failing Hardware
Introduction
Red Hat provides a Kernel Oops Analyzer (https://access.redhat.com/labs/kerneloopsanalyzer/)
tool to help you diagnose a kernel crash issue. When you input text or a file that includes one or more
kernel oops messages, the tool walks you through diagnosing the kernel crash issue. Try using the tool
before you perform the manual steps below. It may find a solution for your kernel crash issue in
seconds. You can leave feedback on the tool at Kernel Oops App Info
(https://access.redhat.com/site/labsinfo/kerneloopsanalyzer).
While a Red Hat Enterprise Linux system will not reboot unless specifically configured to do so,
there are still several instances in which an unexpected reboot can occur. At a basic level, these
occurrences fall into three categories:
A deliberate action on the part of a user (fence event, shutdown commands, etc.)
A software fault on the server (kernel panic, NMI, etc.)
A hardware fault or power failure in the server (power supply failure, disk or memory corruption,
etc.)
In this article we discuss how to identify these occurrences and steps to alter or prevent future
occurrences.
Understanding the Environment

https://access.redhat.com/articles/206873
There are some important questions to ask when an unexpected reboot has recently occurred that
will help narrow down likely causes. Taking our lead from the categories above:
Identifying deliberate actions/configurations that would cause a restart:
Is the server in question a cluster node with an attached fence device?
Was the software on the server performing any tasks which would change its typical
resource use?
Is the server configured with health monitoring software, such as HP ASR?
Is there a Baseboard Management Controller connected to the system? HP iLO, Dell
DRAC, etc.
Potential software faults will most typically leave traces in /var/log/messages, investigated
in the next section.
Potential hardware faults are difficult to diagnose from an operating system level, but it
remains important to note power failures, maintenance events, or other environmental
occurrences around the time of the restart.
Investigating /var/log/messages
Many of the most common restart causes will leave traces in /var/log/messages. All full system
restarts will begin by listing the kernel command line, so searching the message log for the phrase
"Command line" is a good first step when beginning an investigation.
For example:
Sep 29 04:18:15 <hostname> kernel: Command line: ro root=LABEL=/ rhgb quiet crashkernel=128M@
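The search described above can be sketched as a small shell helper. The function name `list_boots` and its optional log path argument are illustrative assumptions; only the "Command line" phrase and the /var/log/messages path come from this article.

```shell
# list_boots: print every "Command line" entry together with its line number.
# Each match marks the start of a full system boot.
# $1: log file to search (hypothetical parameter; defaults to /var/log/messages).
list_boots() {
    logfile="${1:-/var/log/messages}"
    grep -n 'Command line' "$logfile"
}
```

Running `list_boots | tail -n 1` shows only the most recent boot. The line number in front of each match gives the point from which to read backwards.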
Starting from this point and working backwards, look for messages similar to the following. Note
that these are examples of trouble indicators; the actual errors found may vary by application and
release version:
User-initiated Shutdown (https://access.redhat.com/site/solutions/31411)
shutdown: shutting down for system reboot
init: Switching to runlevel: 6
exiting on signal 15
Got SIGTERM, quitting.
Veritas Cluster Fence Event (https://access.redhat.com/site/solutions/22910)
GAB WARNING V-15-1-20138 Port h isolated due to client process failure
RHEL High-Availability Cluster Suite Fence Event (https://access.redhat.com/site/solutions/15575)
fenced[xxxx]: fencing node "node1.example.com"
[TOTEM] A processor failed, forming new configuration.
[TOTEM] The token was lost in the OPERATIONAL state.
Hardware Fault (https://access.redhat.com/site/solutions/18723)
CPU 1: Machine Check Exception: 4 Bank 4: ba00000000070f0f
Kernel panic - not syncing: Machine check
Kernel panic - not syncing: Uncorrected machine check
Thermal Event/Cooling Failure (https://access.redhat.com/site/solutions/134973)
kernel: CPUX: Temperature above threshold, cpu clock throttled
kernel: CPUX: Core power limit notification (total events = 1)
Power Button Pressed (https://access.redhat.com/site/solutions/43732)
received event "button/power PWRF 00000000 00000000"
Non-Maskable Interrupt Received (https://access.redhat.com/site/articles/267533)
kernel: Uhhuh. NMI received for unknown reason XX.
kernel: NMI received for unknown reason 00
kernel: Dazed and confused, but trying to continue
kernel: Do you have a strange power saving mode enabled?
Kernel Soft Lockup (https://access.redhat.com/site/articles/371803)
kernel: BUG: soft lockup - CPU#7 stuck for 10s!
Task Blocked for Too Long (https://access.redhat.com/site/solutions/31453)
kernel: INFO: task <process>:60 blocked for more than 120 seconds.
These messages may not necessarily be the root cause of the reboot, but are important clues worth
investigating further.
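As a rough sketch of this backwards search, the helper below filters everything logged before the most recent boot for the example indicators listed above. The names `INDICATORS` and `clues_before_last_boot` are hypothetical, and the pattern list is illustrative rather than exhaustive; real messages may differ by application and release.

```shell
# Patterns taken from the example messages in this article (not exhaustive).
INDICATORS='shutting down for system reboot|Switching to runlevel: 6|fencing node|Machine Check Exception|Kernel panic|Temperature above threshold|button/power|NMI received|soft lockup|blocked for more than'

# clues_before_last_boot: print indicator lines that were logged before the
# most recent "Command line" entry.
# $1: log file to search (hypothetical parameter; defaults to /var/log/messages).
clues_before_last_boot() {
    logfile="${1:-/var/log/messages}"
    # Line number of the most recent boot marker; empty if none was found.
    boot_line=$(grep -n 'Command line' "$logfile" | tail -n 1 | cut -d: -f1)
    [ -n "$boot_line" ] || return 1
    # Keep only the lines before that marker, then filter for indicators.
    awk -v n="$boot_line" 'NR < n' "$logfile" | grep -E "$INDICATORS"
}
```

Piping the result through `tail` keeps the matches closest to the restart, which are usually the most relevant. An empty result suggests the cause did not log a message, which is the case covered in the next section.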
Where to Go Next
Should it become apparent that the system suffered a hang, lockup, or loss of
service that caused an external application to reboot it, then an investigation of server load and
performance leading up to the event is in order. By default, the System Activity Reporter facility
provided by the sysstat package is the recorder of such data. Analyzing any SAR files collected is
detailed further in our Knowledge Base. See How to analyze and interpret sar data
(https://access.redhat.com/site/articles/325783).
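To locate the SAR data for the day of the restart, a small helper such as the following may be useful. It assumes the sysstat default of one file per day of the month under /var/log/sa, named saDD; the function name `sa_file_for_day` and its parameters are hypothetical.

```shell
# sa_file_for_day: print the path of the sysstat data file for a day of the month.
# $1: day of the month, 1-31 (hypothetical parameter).
# $2: data directory (assumed default: /var/log/sa).
sa_file_for_day() {
    case "$1" in
        [1-9]) day="0$1" ;;   # pad single digits: 9 becomes 09
        *)     day="$1" ;;
    esac
    printf '%s/sa%s\n' "${2:-/var/log/sa}" "$day"
}
```

For example, `sar -q -f "$(sa_file_for_day 29)"` reports run queue length and load averages recorded on the 29th, and adding `-s 04:00:00 -e 04:20:00` narrows the report to the minutes before the restart.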
Should none of the above messages show up in the logs, then the reboot cause can be narrowed
down to an event that does not print messages to the logs. There are a limited number of operations
that perform in this manner. The most prevalent of these follow.
Kernel Panic
A Red Hat Enterprise Linux system can be configured to reboot after experiencing a kernel panic.
The kernel parameter by which this is set represents the number of seconds after a panic has been
experienced before a reboot command will be issued, and is exposed in the /proc filesystem:
# cat /proc/sys/kernel/panic
If this value is set to 0, this functionality is disabled. Should an unexpected restart occur when this
feature is enabled, there is a strong likelihood that the system is experiencing kernel panics. In these
cases, we strongly recommend configuring netdump
(https://access.redhat.com/site/solutions/6854) (Red Hat Enterprise Linux 4 or below) or kdump
(https://access.redhat.com/site/solutions/6038) (Red Hat Enterprise Linux 5 or above) on the affected system to
gather information regarding the panic cause.
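The meaning of the value can be summarized in a short helper. This is a sketch only: the function name `describe_panic_timeout` is hypothetical, and the negative-value branch reflects the upstream kernel sysctl documentation rather than anything stated in this article.

```shell
# describe_panic_timeout: explain a kernel.panic value.
# $1: value to explain (hypothetical parameter; defaults to the running system's setting).
describe_panic_timeout() {
    value="${1:-$(cat /proc/sys/kernel/panic)}"
    if [ "$value" -eq 0 ]; then
        echo "kernel.panic=0: no automatic reboot after a panic"
    elif [ "$value" -lt 0 ]; then
        # Assumption from upstream kernel documentation: negative means immediate reboot.
        echo "kernel.panic=$value: reboot immediately after a panic"
    else
        echo "kernel.panic=$value: reboot $value seconds after a panic"
    fi
}
```

If the system reports a non-zero value and the logs show no shutdown messages, a panic followed by an automatic reboot becomes a plausible explanation worth confirming with kdump.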
NOTE: On a Red Hat Enterprise Linux 6 system, you can often speed up analysis of a kernel panic
through use of a small file called the kernel log. See RHEL6: Speeding up kernel crash / hang
analysis with the kernel log (https://access.redhat.com/site/articles/424743) for more information.
SysRq
The SysRq facility (https://access.redhat.com/knowledge/articles/231663) contains functionality
that can force an instantaneous system reboot. While shutdown commands are generally logged to
the system's messages file, SysRq commands are not always captured in the same way. There are
two ways a SysRq can be issued to cause a reboot. If the "Magic" SysRq key sequence has been
enabled, then the key sequence Alt+PrintScreen+b will trigger a system reboot on the spot. This
can be enabled and disabled with the kernel parameter kernel.sysrq, again exposed through the
/proc filesystem:
# cat /proc/sys/kernel/sysrq
If this command returns 0, then triggering a SysRq command with the above key sequence is disabled.
A 1 indicates that this functionality is enabled.
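The check above can be wrapped in a helper that reports whether the keyboard reboot sequence is permitted. The function name `sysrq_reboot_key_enabled` is hypothetical. The handling of values greater than 1 is an assumption based on the upstream kernel SysRq documentation, where such values form a bitmask and bit 128 allows reboot and power-off; this article itself only describes 0 and 1.

```shell
# sysrq_reboot_key_enabled: report whether the Alt+PrintScreen+b sequence can reboot.
# $1: kernel.sysrq value to check (hypothetical parameter; defaults to the running system's setting).
sysrq_reboot_key_enabled() {
    value="${1:-$(cat /proc/sys/kernel/sysrq)}"
    # 1 enables every SysRq function; otherwise test the assumed reboot bit (128).
    if [ "$value" -eq 1 ] || [ "$(( $value & 128 ))" -ne 0 ]; then
        echo "enabled"
    else
        echo "disabled"
    fi
}
```

A result of "disabled" rules out the key sequence only. It says nothing about writes to the SysRq trigger file under /proc, which work regardless of this setting.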
Alternatively, the file /proc/sysrq-trigger can be used to issue a SysRq command whether or not
the "Magic" key sequence is enabled. The command
# echo 'b' > /proc/sysrq-trigger
will instantly trigger a system reboot. Many clustering software suites use this file and
functionality as a fencing solution. The cluster management software monitors the cluster nodes
for errors or hangs, and upon detecting that a node has become unresponsive it issues the above command
on that node, resulting in a restart. The Red Hat High Availability
clustering software does not use this functionality, but if non-Red Hat clustering software is
present on the system, we recommend investigating which fencing solution that cluster software
employs.
IPMI and Baseboard Management Controllers

There are many pieces of software that will monitor a system for perceived performance difficulties,
and if detected will use an IPMI signal to a BMC on the system board to restart the poorly
performing server. Different implementations of IPMI exist on different hardware platforms,
including HP iLO and Dell DRAC. A frequent culprit of this type of unexpected reboot is the
Automated System Recovery (ASR) functionality provided by the hp-health package on HP
hardware with iLO cards. If this package is installed, one can check for ASR events with the
following commands:
# hpasmcli -s "show asr"
# hpasmcli -s "show iml"
Additionally, some clustering software, including Red Hat's own, can use IPMI signals to fence
unresponsive nodes. If the server in question has such hardware installed, investigating the related
hardware logs and/or cluster logs can shed further light on reboot occurrences.
Failing Hardware
Should no evidence of the above be present, then the remaining piece of the equation to investigate
is hardware. There have been previous cases where a bad motherboard, faulty CPU, or a failing
Power Supply Unit has caused power to be lost to the machine causing a hard shutdown. This
behaviour is entirely dependent on the hardware within the system, and performing full hardware
diagnostics against the machine is generally the only method to rule this out as a possibility.