
Autonomic computing and IBM System z10 active resource monitoring

T. B. Mathias
P. J. Callaghan
Among the essential components of the IBM System z10* platform are the hardware management console (HMC) and the IBM System z* support element (SE). Both the SE and the HMC are closed fixed-function computer systems that include an operating system, many middleware open-source packages, and millions of lines of C, C++, and Java** application code developed by IBM. The code on the SE and HMC is required to remain operational without a restart or reboot over long periods of time. In the first step toward the autonomic computing goal of continuous operation, an integrated, automatic software resource monitoring program has been implemented and integrated in the SE and HMC to look for resource, performance, and operational problems, and, when appropriate, initiate recovery actions. This paper describes the embedded resource monitoring program in detail. Included are the types of resources being monitored, the algorithms and frequency used for the monitoring, the information that is collected when a resource problem is detected, and actions executed as a result. It also covers the types of problems the resource monitoring program has detected so far and improvements that have been made on the basis of empirical evidence.

Introduction

A typical IBM System z10* platform is shown in Figure 1(a). The customer operates the z10* system using a graphical user interface (GUI) available on the hardware management console (HMC) [1]. The HMC is a desktop computer running an operating system (OS), many integrated open-source packages (such as communications, security, and GUI packages), and a few million lines of custom software, often described as firmware or Licensed Internal Code (LIC). Using a local area network (LAN), it communicates with the two support elements (SEs) that are located in the Z-frame [2]. While only one HMC is required, customers normally purchase more for redundancy. Both the HMC and SEs are closed fixed-function devices, and the customer cannot install any code or applications on the systems other than the firmware supplied by IBM.

A typical high-availability configuration is to have at least two HMCs communicating via two LANs to the two SEs, as shown in Figure 1(b). An SE is a laptop computer running an OS, many integrated open-source packages, and several million lines of LIC, developed by several IBM development teams throughout the world. The SEs and HMCs are critical to the continuous operation of the System z10 platform, because they are essential for controlling the system.

Autonomic computing is part of the IBM information technology (IT) service management vision. Autonomic computing systems have the ability to manage themselves and dynamically adapt to change in accordance with business policies and objectives, enabling computers to identify and correct problems, often before they are noticed by IT personnel. A key goal of IBM autonomic computing is continuous operation, achieved by eliminating system outages and making systems more resilient and responsive. Autonomic computing takes its inspiration from the autonomic nervous system, which automatically regulates many lower-level functions in
animals [3], for example, balance, digestion, and blood circulation [4, 5].

[Figure 1. IBM System z* mainframe: (a) basic configuration; (b) a typical configuration. The A-frame contains the processors, memory, and some I/O; the Z-frame contains the support elements (SEs) and some I/O; the hardware management console (HMC) communicates with the system over a local area network.]

According to Ganek and Corbi [3], four key fundamental attributes are required for a computer system to be autonomic: It must be self-configuring, self-healing, self-optimizing, and self-protecting. Ganek and Corbi outline how these features can be implemented in a computer system via a control loop that can be described as a monitor, analyze, plan, and execute (MAPE) loop. Also required for implementation are the resources that can be monitored (via sensors) and changed (via effectors). This means that an autonomic system will monitor resources, analyze the data, figure out what to do, and then make the necessary changes—all within a loop intended to ensure that the system performs the desired functions. Additionally, autonomic elements may be cascaded together to interact with each other over an autonomic signal channel [6]. Autonomic function can be implemented in three different ways: locally, using knowledge obtainable on the system; through a peer group autonomic function that requires the local community to share knowledge; and through a network-based autonomic function that can include software updating and backup and restore [7]. Bantz et al. [7] state that "a computer system is autonomic if it possesses at least one of the four key attributes."

Referring to Figure 1, one could consider the SEs and HMCs to be merely part of a single z10 system. However, for purposes of this paper, Figure 2 is a more appropriate representation because it shows the HMCs and SEs as separate computer systems that happen to work with other firmware components to form the complete IBM System z10 platform. Thus, here we treat the SE and the HMC as separate systems.

Solution requirements

The SE and HMC have had considerable problem detection and reporting capability for many years. For example, the firmware is written to verify the ability to allocate new resources and, if resources are exhausted, report an error back to IBM through the IBM RETAIN* database (Remote Technical Assistance Information Network) used for system health monitoring and preventive maintenance. This would result in the customer taking some action through the GUI or in service personnel being dispatched to resolve the problem. In severe cases, the SE or HMC would automatically reboot, with the intention of cleaning up the resource problem; however, even the few minutes required to do this can have a negative impact on the customer's business.

Despite a rigorous development process, such as design reviews, code reviews, and extensive manual and automated simulation testing, code bugs still persist. As the first step toward an autonomic SE and HMC, it was decided to focus on the self-healing aspect of autonomic computing and, in particular, on an autonomic element (AE) that primarily monitors and performs low-risk actions to remediate problems when they are detected.

In addition to detecting and solving problems, any solution also had to meet the following requirements:

- Any management or remediation to be performed had to be low risk. Because the AE is code, there is a potential for it to make a wrong decision. It is important that an action taken does not create a more severe problem or adversely affect other parts of the system.
- The solution had to have a relatively minimal impact on SE and HMC performance and other resources, such as RAM (random access memory) and direct access storage devices (DASDs).
- The solution had to be something that can be shipped to the field with few to no false positives (see the section "False positives" below), but at the same time it has to be powerful enough to use during development and test in order to find resource problems. (To date, quite a few resource problems have been detected and fixed in development and test and a few were found in the field.)
- The solution had to work with different OSs and not be limited to only a few compiler languages.
- The solution could not require the LIC in the SE or HMC to be instrumented, as this was impractical given the many different packages that comprise the LIC. An additional concern was that instrumentation could slow down the SE or HMC.
- The solution had to work well in two environments: the customer site where the SE and HMC were shipped, and in the development test environment, where the SEs and HMCs are restarted on a very regular basis, often daily.
- The solution could not be limited to only the System z platform SE and HMC. It had to also be operational in the HMC for the IBM pSeries* [8] product.

[Figure 2. System z10 firmware structure.]

Related work

Monitoring of computer systems is not new. OSs normally offer tools and additional resource monitoring programs that are available for common OSs. Examples of basic monitoring tools include the procps package for Linux** [9] (which includes several commands, such as top, ps, free, and vmstat), the IBM z/OS* Resource Measurement Facility (RMF) [10], or the Microsoft Windows** Task Manager [11]. These tools show utilization of resources, but they are premised on an expert reviewing the data to determine whether there is a problem and where. Also, there are many excellent tools available to help with the determination of the problem, but in order to function, the tools have to be in use during the time the problem is occurring. These tools, for example, Valgrind [12], are not normally run in production since the impact on system performance is too severe.

There are basic resource-limiting tools, for example, the ulimit command built into shells such as Bash [13], that can be used to restrict the resource usage for a process started by the shell. However, the configuration of these limits does not allow for notification or first failure data capture (FFDC) as the limits are approached. For more details, see the section "First failure data capture" below.

More sophisticated tools include the Microsoft Management Console (MMC) Performance Console [14] and Nagios** [15]. Both of these tools provide a means to monitor resources and log and make notification of problems, but they still require an expert to set up the thresholds and, when a notification of a problem is received, diagnose the system to determine the cause of the problem. Similar to the active monitoring solution, Nagios also has plug-ins to perform actions, such as restarting an HTTP (Hypertext Transfer Protocol) server if it goes into a nonoperational state.

Nagios also provides a rich infrastructure for plug-ins to be developed. This infrastructure can be used to monitor resources and to configure services for event handling and event notification. However, it does not support the concept of the infrastructure aggregating information from several sources and sending it along with a notification—a capability that is required to analyze the problem offline from a system perspective. For example, when the IBM solution described in the next section detects a problem, data is collected on how the entire system is performing—including aggregated system-level information such as the levels of CPU, memory, file handling, and network resource usage and internal process information, such as the trace events in the HMC and SE firmware processes that are spread across multiple processes and address spaces. When the first collection phase has completed, the aggregated information is automatically sent to the RETAIN system.

Several autonomic monitors have been described, for example, Personal Autonomic Computing Tools [16]. This work describes self-monitoring for potential personal system failures and outlines several components
that were monitored in a systems monitor utility prototype: processor work rate, memory usage, thread monitoring, and services and processes. However, this was only a prototype and it did not include any support to actually execute an action automatically.

Another area of personal computer health monitoring is the concept of a pulse monitor [17]. The assumption behind pulse monitoring is that dying or hanging processes can indicate how healthy or unhealthy a system is. The problem with using an approach such as this on the HMC or SE is that many processes are considered so critical to normal operations that if they die, the SE or HMC must be restarted. In other words, even one dying or hung process is a critical failure that must be avoided.

The IBM solution: Active resource monitoring

The approach we took was to implement a permanent active resource monitoring (ARM) program as an AE for the SE and HMC using the standard MAPE approach. In the monitoring phase, it periodically runs and gathers resource information. Then it analyzes the data and determines whether a problem exists. In the plan and execute phases, a set of actions is determined and performed. For example, upon detection of a problem, FFDC information is collected and saved, and a notification of the service being required is generated. In some cases, the program will also initiate a recovery action or actions.

Currently, the actual recovery actions are limited by previously stated requirements to minimize risk. They consist of erasing selected files and terminating programs. Only programs that are deemed noncritical are eligible for termination unless the overall SE or HMC resources are so low that terminating the program is preferable to problems that would occur if the entire SE or HMC ran out of that resource.

Compared with the related work described above, the ARM solution has the following benefits:

1. Automatic cleanup of some resources is performed.
2. Under certain conditions, programs with unhealthy amounts of resource utilization are terminated.
3. When a problem occurs, instead of just knowing that there is a problem, sufficient data is captured from multiple sources to enable an offsite expert at IBM to later investigate the problem and correct the firmware bug without requiring a recreation of the problem.
4. The code is not a prototype; it is being used by IBM customers. It has undergone extensive use and modification to verify that it captures the problems we want to find and that it does not report nonproblems (false positives).
5. Programmatic control of the monitor and its thresholds is provided.

Overview

When ARM is initiated, it reads in a properties file that is used to configure it. The properties file is a standard key and property value file and is easily edited if an expert needs to alter the behavior of ARM. The properties file contains the following: a flag to disable the checking if desired, the checking frequency, the thresholds (i.e., the rules to indicate when something is a problem and to control automatic recovery actions), and a list of monitor extensions to call.

ARM is basically a large MAPE loop. It periodically checks resources (60 seconds is currently specified in the properties file). It obtains data about a type of resource and then checks that resource against the thresholds. If needed, it will initiate recovery actions at that time if supported. If any problem is detected, at most, one service call is placed to service personnel.

In addition, after a second longer period (currently 6 hours), additional long-term trends are checked and a snapshot of the collected resource data is saved in a file. This file is one of the pieces of data collected when a resource problem is identified because it can show long-term trends in the system.

Because ARM monitors some resources that are inaccessible to low-privilege users, and since the monitor performs actions such as collecting FFDC data and terminating processes, it runs under a high-privilege user identification.

Configurable thresholds

All resource issues that can be reported have configurable thresholds in the properties file. The thresholds are relative, that is, instead of absolute values, they are expressed as a percentage of the capacity of a resource. For example, most DASD partitions have a threshold of 80.0%, so if that DASD is 80% or more full, then it is considered a problem. If we change the size of the DASD partition, the properties file does not have to change. Likewise, the threshold for the total amount of memory (RAM plus swapped memory) is also a percentage. Again, if the amount of the RAM or the swap space changes, it does not require a change in the properties file.

Absolute thresholds include those related to memory usage within a process. The SE and HMC are currently running a 32-bit OS, so it is important to make sure that all processes fit within the address space limitations of such a system.
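
To make this configuration style concrete, the sketch below shows what a few entries of such a properties file could look like, together with a minimal Java fragment that reads them. The key names, default values, file name, and class name are invented for illustration and are not taken from the ARM implementation; the point is simply that relative thresholds are expressed as percentages so that they remain valid if the underlying resource is resized.

# Hypothetical ARM-style configuration (key names are illustrative only)
monitor.enabled=true
monitor.check.interval.seconds=60
monitor.snapshot.interval.hours=6
dasd.partition.used.percent.max=80.0
memory.free.percent.min=10.0
process.default.virtual.memory.mb.max=768

import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

public class MonitorConfig {
    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        try (FileInputStream in = new FileInputStream("arm.properties")) {
            props.load(in); // a standard key and property value file, easily edited by an expert
        }
        // Percentages survive resizing of the DASD partition or memory; only absolute
        // per-process limits (e.g., for a 32-bit address space) are stated in fixed units.
        boolean enabled = Boolean.parseBoolean(props.getProperty("monitor.enabled", "true"));
        long intervalSeconds = Long.parseLong(props.getProperty("monitor.check.interval.seconds", "60"));
        double dasdMaxPercent = Double.parseDouble(props.getProperty("dasd.partition.used.percent.max", "80.0"));
        System.out.printf("enabled=%b, interval=%ds, DASD threshold=%.1f%%%n",
                enabled, intervalSeconds, dasdMaxPercent);
    }
}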

In addition to a properties file, application programming interfaces (APIs) are provided with ARM to allow other firmware to dynamically adjust the thresholds. For example, some firmware needs an extra large amount of memory when performing a specific task, such as the power-on reset. This program calls the API to raise its process limits, and then when the power-on reset is complete, it calls the API again to restore the checking to normal limits. All process-specific thresholds can be overridden, and the system-wide limit for processor utilization can be overridden as well. The DASD limits cannot be overridden at this time because sufficient DASD is provisioned on the SE and HMC so that the thresholds should not be reached.

Establishing thresholds

Initially, thresholds were established by simply looking at the available resources in the SE and HMC. For example, it was decided that we did not want used file space to exceed 80% of that available. This threshold has remained set at this value; when we found we were nearing this limit due to the system design, the file system was changed to provide more space.

For process memory utilization, the monitoring program long-term trend files were captured from test systems that had run for a lengthy period of time or that had performed testing that stresses the SE or HMC. Fortunately for the SE and HMC, most processes have a unique name. A tool was written to find the maximum memory utilization for each process, and the thresholds were then set to 125% of this value. During the internal test cycles, the monitoring continues to run, and as problems are detected, they are manually analyzed. Some have turned out to be due to resource problems; in those cases, the program was corrected. However, if the resource utilization was determined to be correct (perhaps it increased due to a required design change), then the threshold was increased.

The SE and HMC do have advantages over a normal personal computing environment in that they are closed. The SE or HMC installs itself from the master media, and after restoring any previous user customizations, it automatically starts and runs. This master image is created through a very tightly controlled development process, so it is well known when a new piece of code will appear requiring that experimentation be performed to determine the proper threshold.

While we could fully automate the setting of thresholds on the basis of a collection of trend data from various systems, we decided that, for now, we still want to have an expert in the process. By requiring that a developer justify a significant increase in resource utilization, it forces a design or code inspection of the changed area, and that ensures that the design and code are acceptable.

Specific resource monitoring

Many different types of resources and usage characteristics are checked with each iteration of the checking loop. Unless otherwise stated, the checks described in the following sections are performed on every loop.

Memory usage and trends

ARM checks that the amount of free memory (RAM plus swap space) is at least a minimum percentage. It also checks that the total amount of free space in the swap area is above a minimum threshold and the total swap size is above a minimum. Together, these two checks ensure that the swap area is accessible and functioning as expected. In earlier versions, these checks were not made, so we did not become aware of swap problems until the total amount of free memory was too small. With these new checks, we can immediately identify a problem rather than waiting for the system to run low on memory.

The percentage of free memory in the Java** Virtual Machine (JVM**) [18] is checked. For each process, the amount of real and virtual memory used is checked. There is one threshold for reporting an error and a second set of higher thresholds that, if exceeded, cause ARM to terminate the offending process. In addition to the data directly collected, existing firmware in the SE or HMC then captures data from the terminating process. Also, the existing firmware may restart the application if it is critical enough.

The thresholds for a process can be tailored to each process. Regular expressions are used to match the line in the properties file with the name of the process. This allows a process to be assigned unique thresholds and ensures that all processes have a threshold.

In addition to checking on every loop, a snapshot of the usage is stored in a table periodically at a longer interval, called the snapshot interval (currently once every 6 hours). It is then analyzed for memory leaks. The algorithm is fairly straightforward: A process is deemed to be leaking if over the last n snapshots (currently n = 8, which means the code is looking at the last 48 hours of data), its memory usage has increased m times (currently m = 7) and it has never gone down. The one process that is skipped for this analysis is Java. A JVM loads in classes only as needed, so it takes a very long time (longer than 48 hours) for it to stop increasing in size. (See the section "JVM operational monitoring," below, for details on how ARM detects memory leaks in the JVM.)
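
As a summary of the snapshot analysis just described, the following sketch shows the shape of such a check. The class name, the use of kilobyte samples, and the way the Java process is recognized are assumptions made here for illustration; the production code may differ.

import java.util.List;

public class LeakHeuristic {
    // A process is flagged as potentially leaking if, over the last n snapshots
    // (n = 8, i.e., 48 hours at a 6-hour snapshot interval), its memory usage
    // increased at least m times (m = 7) and never decreased.
    static boolean looksLikeLeak(String processName, List<Long> snapshotsKb) {
        final int n = 8;
        final int m = 7;
        if (processName.startsWith("java")) {
            return false; // assumed name check: the JVM is excluded because it loads classes for days
        }
        if (snapshotsKb.size() < n) {
            return false; // not enough history collected yet
        }
        List<Long> window = snapshotsKb.subList(snapshotsKb.size() - n, snapshotsKb.size());
        int increases = 0;
        for (int i = 1; i < window.size(); i++) {
            long delta = window.get(i) - window.get(i - 1);
            if (delta < 0) {
                return false; // any decrease clears the suspicion
            }
            if (delta > 0) {
                increases++;
            }
        }
        return increases >= m;
    }
}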

DASD usage: File size, number of files, and number of open files

For each DASD partition, the amount of free space and the number of free inodes is checked. (An inode is a Linux structure that stores basic information about a regular file. It is possible to have so many files that the file system runs out of inodes but still has room on the hard drive.) In addition, the number of file descriptors used by each process is checked and the total number of file descriptors in use by the entire system is checked. Currently, the list of partitions checked is fixed in the code in order to ensure a unique error code being reported for each partition.

If a DASD partition has a problem, then a full list of all files and their sizes is collected for FFDC purposes. Then, special cleanup firmware is called. This cleanup firmware will automatically erase files in known temporary directories or with certain names that, by convention, indicate files that should be erasable.

CPU usage

The average CPU usage of a process and the entire system is examined over the last n minutes (currently 10 minutes). There is a threshold for a process and a higher threshold for the system, and a problem is reported if either threshold is exceeded.

Certain programs tend to make the CPU busy for a much longer period of time. These programs use APIs to completely disable the CPU checking while they execute. The override is valid only for a relatively short period of time (currently 1 hour). Thus, if the program forgets to restart the CPU checking, the checking will automatically resume. Additionally, a record of all overrides of CPU usage is kept in a log file for later analysis if necessary.

Performance and dispatch testing

Not all code on the SE or HMC runs with the same priority level. This means that it is possible for a high-priority program to monopolize the CPU time, thus starving other programs of CPU time. Also, DASD problems can sometimes result in programs not being dispatched (i.e., not being given processor time to run). In addition, sometimes firmware developers think the SE or HMC is running too slowly, either because it is encountering a timeout in its firmware or because they are examining FFDC data for another problem.

In order to try to detect problems such as this during the time between full resource checks, ARM divides the full resource checking time (currently 60 seconds) into 1-second intervals. Every second, it wakes up and adds the elapsed time to the remaining number of seconds and compares that against the expected time. If the computed time exceeds a threshold (currently 150%), then it concludes that dispatching has been delayed and it reports a problem. For example, suppose the code has iterated ten times with 50 iterations (i.e., 50 seconds) remaining. If the elapsed time since the start of the loop is 12 seconds, then no problem is detected because 12 seconds plus the remaining 50 iterations would yield a time of only 62 seconds, which is below the 150% threshold. However, if it woke up and found an elapsed time of 45 seconds, then an error would be reported because the elapsed time of 45 seconds plus the remaining time of 50 seconds totals 95 seconds, and that is more than 150% of the nominal 60-second period.

When ARM wakes up, the time consumed by the previous periods of sleeping is not used in estimating how long the remaining loops might take, because we only want to know whether it looks as if there has been a problem. If there are, for example, 45 loops remaining, we know that these will take at least 45 seconds. What we really want to know is, given the length of time already taken, if we factor in the minimum amount of time the remaining loops will take, will we be over the threshold? If so, then we want to flag an error immediately, as we want to collect data as close to the onset of the problem as possible.

If a firmware developer thinks that the SE or HMC is running too slowly, there is an API call to trigger an immediate report of a performance-related problem, and appropriate data is collected. Not only is the current performance data collected, but also the performance data associated with the previous main polling period. The delta between these two sets of data can be used to see which threads ran in the interval and hopefully identify whether a high-priority thread monopolized the CPU during the time period.

Thus, if someone is looking at FFDC data for a different problem, such as a timeout, and they think that a slow system is involved, then the presence or absence of one of these problem reports can prove or disprove the hypothesis.
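
The arithmetic of this dispatch check is small enough to show directly. The sketch below restates it with invented names and a simplified interface; the threshold factor and the two worked examples come from the text above, but the class itself is illustrative only.

public class DispatchDelayCheck {
    // ARM divides the 60-second full check period into 1-second sleeps. After each sleep
    // it asks: does the elapsed time so far, plus the minimum time the remaining loops
    // will take (one second each), exceed 150% of the nominal 60-second period?
    // If so, dispatching has been delayed and an error is flagged immediately.
    static final long NOMINAL_PERIOD_SECONDS = 60;
    static final double THRESHOLD_FACTOR = 1.5;

    static boolean dispatchDelayed(long elapsedSeconds, long loopsRemaining) {
        long projectedMinimum = elapsedSeconds + loopsRemaining; // each remaining loop takes >= 1 s
        return projectedMinimum > THRESHOLD_FACTOR * NOMINAL_PERIOD_SECONDS;
    }

    public static void main(String[] args) {
        // The two examples from the text: 12 + 50 = 62 < 90 (no problem); 45 + 50 = 95 > 90 (problem).
        System.out.println(dispatchDelayed(12, 50)); // false
        System.out.println(dispatchDelayed(45, 50)); // true
    }
}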

JVM operational monitoring

A key process monitored on the HMC and SE is the Java bytecode interpreter (i.e., the JVM) because this process provides many key functions, including providing the internal Web server and servicing all GUI functions selected by the user. Therefore, ARM ensures that this process does not have operational problems such as hang conditions or resource problems such as out of memory.

ARM provides the ability to add extensions via code supplied in a shared library. These extensions are listed in the configuration file and are coded to a designed interface. The interface allows the extension to be presented with the values in the configuration file of the active monitor and for the extension to return a free format description of the status of the extension. The JVM monitoring support is provided as one of these extensions.

The JVM monitoring extension provides four functions by default. The first is to perform an HTTP request to the internal Web server running in the JVM. This call is made in order to measure the roundtrip response time for a call between the active monitoring process and the Web server. If a response is not received in a configurable amount of time (the default is 10 seconds), then the request is assumed to fail. If there is a configurable pattern of these failures (the current default being eight consecutive failures), the JVM is assumed to be not operating normally, and FFDC information is obtained.

The second function performed by the JVM monitoring support is to use the infrastructure in which other non-JVM processes on the system communicate with the JVM. The ARM process uses this infrastructure to communicate with the JVM and to ask the JVM for its unique instance identifier. This request is performed for two purposes. First, the identifier is used so that the active monitor can track unique JVM invocations and ensure that any JVM problem is reported once and only once for that invocation. Second, the call is used to measure the response time for the roundtrip processing of a request from the active monitor process to the JVM process. If a response is not received in a configurable amount of time (the default is 10 seconds), then the request is assumed to fail, and if there is a configurable pattern of these failures (the default is eight consecutive failures), the JVM is assumed to be not operating normally. Also, if there are configurable patterns of failure in both the first function and the second function (the default is four consecutive failures), the JVM is assumed to be not operating. In other words, failures in both functions indicate stronger evidence that the JVM is unhealthy.

The third function performed is to measure the Java heap usage in the JVM. If the usage is over a configurable threshold, a log entry is taken along with a JVM heap dump [19]. Additional log entries are not taken for high usage unless the heap usage drops below a lower configurable threshold and then, once again, subsequently surpasses the threshold.

The fourth function performed by the JVM monitoring support is to attempt to discover memory leaks in the JVM process that are outside of the Java heap, for example, in the JNI** (Java Native Interface) [20] extensions. To implement this support, a function was created, ProcessCheck(x,y,z,b), which returns true if the virtual address space of the JVM increases at least x times out of y checks, and where the duration between the checks in seconds is z, and b is the minimum number of bytes that have to be increased; false is returned if the specified increase does not occur.

The algorithm is coded such that it can detect large memory leaks quickly before they cause the system to become unstable, and at the same time, it can detect smaller memory leaks if they are observed over a prolonged period and before they can cause the system to become unstable. This is done while attempting to prevent false positives from occurring (see the section "False positives," below) and without significantly impacting the system performance. For example, if a single ProcessCheck(48,48,60*60,3*1024*1024) rule was defined, then a large memory leak could be detected, but it would not catch a small memory leak. Conversely, if a single ProcessCheck(24,24,6*60*60,5*1024*1024) rule was defined, a large memory leak would probably cause the system to become unstable before the leak could be detected. The algorithm in Listing 1 attempts to catch large, medium, and small memory leaks as quickly as possible.

Listing 1  Graduated memory-leak detection algorithm.

if (ProcessCheck(48, 48, 60*60, 3*1024*1024)) {            // 3 Meg per hour for 48 consecutive hours
    logErrorLog();
} else if (ProcessCheck(96, 96, 60*60, 1024*1024)) {       // 1 Meg per hour for 96 consecutive hours
    logErrorLog();
} else if (ProcessCheck(16, 16, 6*60*60, 10*1024*1024)) {  // 10 Meg per 6 hours consecutive for 4 days
    logErrorLog();
} else if (ProcessCheck(24, 24, 6*60*60, 5*1024*1024)) {   // 5 Meg per 6 hours consecutive for 6 days
    logErrorLog();
} else if (ProcessCheck(40, 40, 6*60*60, 2*1024*1024)) {   // 2 Meg per 6 hours consecutive for 10 days
    logErrorLog();
} else if (ProcessCheck(22, 24, 6*60*60, 5*1024*1024)) {   // 5 Meg per 6 hours for 22 of 24 checks (over 6 days)
    logErrorLog();
} else if (ProcessCheck(75, 80, 6*60*60, 2*1024*1024)) {   // 2 Meg per 6 hours for 75 of 80 checks (over 20 days)
    logErrorLog();
} else if (ProcessCheck(110, 120, 6*60*60, 1024*1024)) {   // 1 Meg per 6 hours for 110 of 120 checks (over 30 days)
    logErrorLog();
}
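
For illustration only, the following sketch shows one way a ProcessCheck-style predicate over a history of virtual-storage samples could be written. The class name, the sampling container, and the omission of the z parameter (the sampling interval is assumed to be handled by whoever records the samples) are choices made here, not details of the ARM code.

import java.util.ArrayDeque;
import java.util.Deque;

public class ProcessGrowthCheck {
    // One sample of the monitored process's virtual address space size, recorded every z seconds.
    private final Deque<Long> samples = new ArrayDeque<>();
    private final int maxSamples = 121; // enough history for the largest rule (110 of 120 checks)

    public void recordSample(long virtualBytes) {
        samples.addLast(virtualBytes);
        while (samples.size() > maxSamples) {
            samples.removeFirst();
        }
    }

    // Returns true if, over the most recent y intervals, the size grew by at least b bytes
    // in at least x of them; returns false otherwise or when there is not enough history.
    public boolean processCheck(int x, int y, long b) {
        if (samples.size() < y + 1) {
            return false;
        }
        Long[] recent = samples.toArray(new Long[0]);
        int start = recent.length - (y + 1);
        int bigIncreases = 0;
        for (int i = start + 1; i < recent.length; i++) {
            if (recent[i] - recent[i - 1] >= b) {
                bigIncreases++;
            }
        }
        return bigIncreases >= x;
    }
}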

Reporting problems found

First failure data capture

The LIC package for the System z platform was designed to support FFDC. The goal is to collect all of the data necessary to fix a problem (hardware or LIC) anywhere in the system at the time the problem occurs without having to rely on the ability to reproduce a problem in order to fix it.

ARM was designed to meet this goal. On any detected problem, it logs all of the resource data to a file and also an error log. Firmware in the SE or HMC then analyzes the problem and gathers up the error log, the resource data file, and other data into a bundle that is sent to IBM for further investigation and service. Included in this bundle are many types of data, including various views of performance and recent trace data from the firmware. IBM service personnel have access to tools that sort and interpret the data. The tools can convert the data to a human-readable form, consolidate the information as a sequence of events sorted by time, and automatically attempt to find problems in the data.

One of our lessons learned is that it is extremely important to make sure that enough data is collected to understand the problem being identified and where in the customer's system the problem might originate. A second lesson learned on the SE and HMC is that the collection of data must be done as close to the detection of the problem as possible. In early implementations, when this was not always the case, we would sometimes see that the DASD was very full, but by the time the data was collected, the large files were gone. We had a similar problem with performance problems, where the offending process had terminated or stopped doing whatever had led to the issue.

Long-term history and trend file

ARM maintains a long-term history and trend file. Every so often (every 6 hours), the program writes a snapshot of all resource information available. It prunes this file as needed to make sure that it does not exceed a size specified in its properties file. The size of this file was selected to allow for several weeks of data to be retained. This file is included with any problem reported and can be used to determine whether the problem began sometime in the past or whether it occurred suddenly. In addition to the periodic entries, if an error is detected, then a snapshot of the data at the time of the error is also appended to the file.

Automatic recovery actions

If the memory usage of a process exceeds a relatively high threshold, then the process is terminated.

Even before ARM was implemented, the firmware contained code to restart a process that trapped or otherwise terminated abnormally, and it also contained support to restart all of the firmware code if a critical piece were to fail. ARM can, therefore, terminate a process if necessary; the process, or even the entire SE or HMC, is automatically restarted if necessary.

If a DASD partition becomes too full, then there is a program that will automatically erase files that are deemed by convention to be erasable.

Extensions

On every pass through the checking loop, ARM calls extensions described in the properties file. The properties file controls the order in which these extensions are called and describes the module and function name to call. Each called function must conform to a standard template that gives it access to the options file (so all thresholds are in one file) and provides a way for it to return resource information. This allows ARM to save a snapshot of monitored resource data when desired. Currently, the JVM monitoring described in this paper is implemented via an extension.
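
To give a feel for what such an extension template might look like, here is a hypothetical interface with one illustrative implementation. The interface name, method signature, and property key are invented for this sketch and are not taken from the ARM source, which supplies extensions in shared libraries rather than as Java classes.

import java.util.Properties;

// Hypothetical shape of a monitor extension: it receives the shared properties
// (so all thresholds live in one file) and returns a free-format status string
// that the monitor can include in its resource snapshots and FFDC bundles.
public interface MonitorExtension {

    // Called once per pass of the main checking loop.
    String check(Properties monitorProperties);

    // Illustrative extension that reports how full the root DASD partition is.
    class ExampleDasdExtension implements MonitorExtension {
        @Override
        public String check(Properties monitorProperties) {
            double maxPercent = Double.parseDouble(
                    monitorProperties.getProperty("dasd.partition.used.percent.max", "80.0"));
            java.io.File root = new java.io.File("/");
            double usedPercent =
                    100.0 * (root.getTotalSpace() - root.getUsableSpace()) / root.getTotalSpace();
            String state = usedPercent > maxPercent ? "OVER THRESHOLD" : "OK";
            return String.format("/ used %.1f%% (threshold %.1f%%): %s", usedPercent, maxPercent, state);
        }
    }
}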

Practical experience

This section outlines a number of resource problems that we have investigated and fixed. In most cases, the monitor worked well, and many resource problems were found. However, we found instances where we needed to enhance the monitoring and FFDC.

Java thread leak

In a Java thread leak, Java code that made calls to the JNI was incorrect and did not properly terminate these native threads. Over time, a large number of the threads built up. ARM detected this when the number exceeded the threshold, and there was enough FFDC data to investigate a handful of specific Java classes and determine the source of the problem.

Java heap leak

Since the introduction of the Java heap support in ARM, there have been numerous detections of excessive heap usage. For all of these, a Java heap dump was requested and made available in the FFDC information collected for these types of problems. When this heap dump was processed with JVM tools, it was always clear which Java objects were being leaked, and usually an examination of the source code was enough to determine the problem. For some instances of these problems, examination of the source code was not enough, and trace buffers included in the FFDC information were used to determine the problem. All occurrences of this type of problem have been successfully diagnosed and corrected.

Priority inversion problem

In this classic priority inversion problem, some low-priority code obtained a resource but then high-priority code ran and tried to obtain the resource. Because the high-priority program polled instead of yielding the processor, the low-priority program could not finish with the resource and release it. Eventually the high-priority program timed out, and the firmware resumed normal operations. This problem could be recreated every so often (perhaps once every few hours), so it was hard to track down. It was finally identified by altering ARM to collect more historical CPU utilization kept by the OS, which allowed an expert to manually discover the thread that was consuming an unexpectedly large amount of CPU, and after a brief code inspection, he found the problem.

As a result of this problem, the CPU dispatching test support was added to ARM, and additional historical CPU utilization data from the OS was also permanently collected.

Memory leak

Before support was added to monitor long-term process memory usage, we had a problem from the field whereby a process was using too much memory and the memory leak was fairly severe. Within a few weeks, a process would leak enough memory to trigger the threshold for problem reporting and, within a few more weeks, would have completely run the system out of memory or exhausted its addressing space. While we did have enough information to find the problem, the lesson learned was to try to identify memory leaks sooner.

As a result of this problem, support for long-term memory checking was added to ARM. Since its introduction, this support has successfully identified many memory leaks found in the development and testing phase.

Running out of space on a DASD partition

When space began to run out on a DASD partition, ARM started to report a problem on the DASD partition. Further investigation revealed that this was a design problem in that the DASD partition was not large enough to handle what needed to be stored. The result was a decision to increase the size of the DASD partition prior to shipment to customers.

JVM hang detection

Before JVM monitoring support was added to ARM, there were several JVM hangs that went undiagnosed. This was because the system user, upon encountering the HMC or SE in an unusable state, would reboot the system and then report the problem to development. Since the system had been rebooted, there was little information about what caused the problem. When JVM monitoring support was added, the user performed the same action of power recycling the system, but in all instances, the JVM was hung long enough so that the monitor detected the problem, collected FFDC information, and saved the data to disk. This FFDC information proved to be very useful in diagnosing these types of problems.

False positives

One thing that all implementations of ARM must deal with is a false positive, that is, an incident when ARM reports a problem when no problem actually exists. One example relates to checking the long-term memory usage. The initial checking was for no decreases over the last eight 6-hour samples and at least four increases. This was found to be too sensitive, so the threshold was changed to the current one, which is to report a potential leak if there are no decreases in the last seven 6-hour samples and at least six increases.

Another example of a false positive occurred when checking the long-term memory usage of the JVM. Because the JVM loads a class the first time it is needed, it was found that it was still loading classes several days after it was started. This fooled the long-term analysis routine enough that we had to develop the more sophisticated JVM memory monitoring described above in the section "JVM operational monitoring."

Potential enhancements

Automatic learning

One potential enhancement to ARM is to have it learn what amount of memory usage for each process is considered normal and to report unusual changes in order to minimize the number of false positives. This will require further investigation and experimentation.

Restarting the JVM

A potential enhancement to ARM is in the area of monitoring the JVM. Currently, upon detecting a hung JVM, FFDC information is captured and a log entry is created. This is useful because it captures FFDC information very close to the time of the failure. However, it probably does not alert anyone about the problem until the JVM is restarted, because the code to send an alert about the problem runs in the hung JVM process. Therefore, a potential enhancement is to automatically restart the JVM upon detecting that the process is hung. After it is restarted, the problem analysis code detects that the problem was logged but not yet reported, so it does so. (Problem analysis is a component on the SE and HMC that analyzes all errors. It correlates these errors and decides which one or ones are the most important and then proceeds to collect data related to these problems before transmitting the data to IBM for service.) In addition, the restart of the JVM should allow for the HMC or SE to be usable once again. Because restarting the JVM is a potentially destructive action, there would be two levels of confidence that the JVM is hung: a lower level of confidence needed for capturing FFDC information and a higher level needed before restarting the JVM.

Conclusions

ARM has proven to be a useful technique to detect design and programming defects. It has also proven to be extendable to monitor different types of resources and different types of problems than the original implementation did. We expect to continue to extend ARM as necessary in the future.

Acknowledgment

We thank Kurt Schroeder, who contributed some code to monitor the Java heap utilization of the JVM.

*Trademark, service mark, or registered trademark of International Business Machines Corporation in the United States, other countries, or both.

**Trademark, service mark, or registered trademark of Linus Torvalds, Microsoft Corporation, Nagios Enterprises, LLC, and Sun Microsystems, Inc., in the United States, other countries, or both.

References

1. IBM Corporation, System z Hardware Management Console Operations Guide, Version 2.10.0, Document No. SC28-6867, July 22, 2008; see http://www-1.ibm.com/support/docview.wss?uid=isg2bac11e0b02e3aa73852573f70056c860.
2. IBM Corporation, System z10 Enterprise Class Support Element Operations Guide, Version 2.10.0, Document No. SC28-6868, February 26, 2008; see http://www-1.ibm.com/support/docview.wss?uid=isg2e4d256a8a69d49da852573f7006c82db.
3. A. G. Ganek and T. A. Corbi, "The Dawning of the Autonomic Computing Era," IBM Syst. J. 42, No. 1, 5–18 (2003).
4. D. M. Russell, P. P. Maglio, R. Dordick, and C. Neti, "Dealing with Ghosts: Managing the User Experience of Autonomic Computing," IBM Syst. J. 42, No. 1, 177–188 (2003).
5. IBM Corporation, "An Architectural Blueprint for Autonomic Computing," white paper (June 2005); see http://www-03.ibm.com/autonomic/pdfs/AC%20Blueprint%20White%20Paper%20V7.pdf.
6. R. Sterritt and D. Bustard, "Towards an Autonomic Computing Environment," Proceedings of the 14th International Workshop on Database and Expert Systems Applications, Prague, Czech Republic, 2003, pp. 694–698.
7. D. F. Bantz, C. Bisdikian, D. Challener, J. P. Karidis, S. Mastrianni, A. Mohindra, D. G. Shea, and M. Vanover, "Autonomic Personal Computing," IBM Syst. J. 41, No. 1, 165–176 (2003).
8. IBM Corporation, Operations Guide for the Hardware Management Console and Managed Systems, Version 7, Release 3, Document No. SA76-0085-04, April 2008; see http://publib.boulder.ibm.com/infocenter/systems/scope/hw/topic/iphdx/sa76-0085.pdf.
9. SOURCEFORGE.NET, Procps - The /proc File System Utilities; see http://procps.sourceforge.net/.
10. IBM Corporation, Resource Measurement Facility User's Guide, Document No. SC33-7990-11, September 2006; see http://publibz.boulder.ibm.com/epubs/pdf/erbzug60.pdf.
11. Microsoft Corporation, Task Manager; see http://www.microsoft.com/technet/prodtechnol/windows2000serv/reskit/core/fneb_mon_oyjs.mspx?mfr=true.
12. Valgrind Developers, Valgrind; see http://valgrind.org.
13. GNU Project, Bash Reference Manual; see http://www.gnu.org/software/bash/manual/bashref.html.
14. Microsoft Corporation, Microsoft Management Console; see http://technet2.microsoft.com/windowsserver/en/library/329ce1bd-9bb4-4b63-947e-0d1e993dc27d1033.mspx?mfr=true.
15. Nagios Enterprises, LLC, Nagios Open Source Project; see http://www.nagios.org.
16. R. Sterritt, B. Smyth, and M. Bradley, "PACT: Personal Autonomic Computing Tools," Proceedings of the 12th IEEE International Conference and Workshops on the Engineering of Computer-Based Systems, Greenbelt, MD, 2005, pp. 519–527.
17. R. Sterritt and S. Chung, "Personal Autonomic Computing Self-Healing Tool," Proceedings of the 11th IEEE International Conference and Workshop on the Engineering of Computer-Based Systems, Brno, Czech Republic, 2004, pp. 513–520.
18. IBM Corporation, System z10 Enterprise Class System Overview, Document No. SA22-1084, June 2008; see http://www-1.ibm.com/support/docview.wss?uid=isg29ea3f936978cba27852573f900774732.
19. IBM Corporation, Java Diagnostics Guide 5.0; see http://publib.boulder.ibm.com/infocenter/javasdk/v5r0/index.jsp?topic=/com.ibm.java.doc.diagnostics.50/diag/welcome.html.
20. S. Liang, The Java Native Interface: Programmer's Guide and Specification, Prentice Hall PTR, Upper Saddle River, NJ, 1999; ISBN 0-201-32577-2.

Received January 18, 2008; accepted for publication June 4, 2008

Thomas B. Mathias IBM Systems and Technology Group,
1701 North Street, Endicott, New York 13760 (mathiast@us.ibm.com).
Mr. Mathias is a Senior Engineer. He received his B.S. degree in
electrical engineering from Ohio State University. He worked in
System z hardware development and later in firmware
development. He is a licensed Professional Engineer in the state of
New York. He is coinventor of three U.S. patents, and he has one
pending patent application. He has received numerous IBM
awards.

Patrick J. Callaghan IBM Systems and Technology Group,
1701 North Street, Endicott, New York 13760 (patrickc@us.ibm.com).
Mr. Callaghan is a Senior Engineer. He received his B.S. degree in
computer science from the State University of New York at
Buffalo. He worked on a variety of advanced technology projects
and recently worked on the team developing System z firmware. He
is the inventor or coinventor of three U.S. patents, and he has four
pending patent applications. He has published five articles as IBM
Technical Disclosure Bulletins and received numerous IBM
awards.
