
Failure Modes Analysis and
Equipment Maintenance
Plan Development
Reliability... it's in our DNA.

www.AlliedReliabilityGroup.com
Table of Contents
INTRODUCTION
DEVELOPING A FAILURE MODES DRIVEN MAINTENANCE STRATEGY
Understanding Asset Defects
Three-Tiered Approach to Failure Mode Analysis
Tier 1: The Asset Health Matrix
Tier 2: RCM Blitz
Tier 3: RCMCost
Failure Mode Analysis Process
Step 1: Construct the 100% Theoretical Model
Step 2: Construct the Validated AHM
Step 3: Conduct Bad Actor Analysis
Reliability Centered Maintenance
RCM Blitz
Why Blitz?
How Is RCM Blitz Different?
Root Cause Analysis
Step 4: Complete the Equipment Maintenance Plan
Step 5: Optimize the Equipment Maintenance Plan
Failure Reporting, Analysis, and Corrective Action System
Failure Mode Analysis
Failure Codes Creation
Work Order History Analysis
Root Cause Analysis
Strategy Adjustments
Failure Modes Analysis
SUMMARY
Allied Reliability Group, Copyright 2013

Introduction
In 1978, F. Stanley Nowlan and Howard F. Heap released a United Airlines study that contained a
new methodology called Reliability Centered Maintenance (RCM). This methodology quickly
became the gold standard for risk assessment and risk management. As the years progressed, two
major improvements happened:
1. Improvements in the use of advanced statistics to analyze machinery failure patterns shed
new light on the nature of how machinery fails; and
2. Advancements in computer technology produced extremely powerful inspection methods
that changed the face of assessing a component's condition.
Very few practitioners of RCM kept up with either of these changes, and as such, the development
of a comprehensive and efficient Equipment Maintenance Plan (EMP) became a resource-
consuming process.
This paper will lay out a new system for using a combination of traditional RCM Analysis,
Condition Monitoring technologies, and statistical data to produce a complete EMP in a fraction of
the time that it would take traditional RCM Analysis alone to produce. This paper will show how
the reliability engineering field as a whole can fully integrate the real power of Condition Based
Monitoring (CBM) technologies, sometimes referred to as Predictive Maintenance (PdM)
technologies, with traditional RCM Analysis by means of a three-tiered approach to Failure Mode
Analysis (FMA), thereby accelerating EMP development.
While this paper will not instruct the reader on the finer points of RCM Analysis, a short review is
in order. A recollection of the fundamentals will be necessary later in this paper. The core of RCM
Analysis lies in the answer to seven questions:
1. What are the functions and associated desired standards of performance of the asset in its
present operating context (functions)?
2. In what ways can the asset fail to fulfill its functions (functional failures)?
3. What causes each functional failure (failure modes)?
4. What happens when each failure occurs (failure effects)?
5. In what way does each failure matter (failure consequences)?
6. What should be done to predict or prevent each failure (proactive tasks and task intervals)?
7. What should be done if a suitable proactive task cannot be found (default actions)?
NOTE: For question number 6, inspection techniques that adequately identify
defects during normal operation are preferred over those that require downtime.
Less invasive action is preferred to more invasive. This is one of the fundamental
concepts of any well-defined maintenance strategy. It is a bit ironic that RCM
prefers inspections during operations (the use of Condition Monitoring techniques),
and it is this category that most faithful RCM facilitators have traditionally
overlooked or misunderstood.
Contained within question number 3 is a term we have not previously defined, failure mode. A
failure mode is the condition that exists that will cause a functional failure. Another way to think
of it is simply: the part + what is wrong with the part + the reason.
a. The part is a group of pieces that make up a component. Examples from the ISO standard
include impeller, seal, shaft, bolt, nut, and bearing.
b. What is wrong with the part refers to the effect of the failure mechanism. Examples
include failed, damaged, out of adjustment, abrasion, erosion, corrosion, fatigued, burnt,
and broken.
c. The reason is the physical cause of the problem. This could be age, improper lubrication,
misalignment, imbalance, or improper installation, among others.
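The three-part structure described above (part + what is wrong with the part + reason) can be sketched as a small record type. This is only an illustrative sketch: the class name, field names, and the hyphen-separated label format are assumptions made for this example, not a format defined by RCM or by any standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    """Hypothetical record for one failure mode (names are assumptions)."""
    part: str      # the part, e.g. "Seal"
    problem: str   # what is wrong with the part, e.g. "Damaged"
    reason: str    # the physical cause, e.g. "Improper Installation"

    def label(self) -> str:
        # Join the three elements into a single readable description.
        return f"{self.part} - {self.problem} - {self.reason}"


example = FailureMode(part="Seal", problem="Damaged", reason="Improper Installation")
print(example.label())  # prints: Seal - Damaged - Improper Installation
```

Keeping the three elements as separate fields, rather than one free-text string, makes it possible to group reported failure modes by part or by reason later on.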
An example that illustrates failure modes follows.
Example: One month, a misalignment condition is noted on a pump that is directly coupled to an electric
motor. The pump was recently replaced due to wear on the impeller. The post-maintenance follow-up by the
Vibration Analyst revealed that the alignment was performed improperly. The motor-pump combination is
now running in a misaligned condition. The failure mode noted on the Vibration Analyst's Exception Report
would be "Shaft - Misalignment - Improper Installation." Just two months later, the misalignment
condition having never been corrected, the Vibration Analyst now detects an outer race defect on the pump
bearing. The failure mode noted on the Exception Report is now "Bearing - Fatigued - Shaft Misalignment."
This example should help demonstrate how Condition Monitoring programs allow for the
acceleration of Root Cause Failure Analysis (RCFA). Having the Vibration Analysis reports to
review during the RCFA makes the process much faster because documented proof is readily
available. The above situation is an extremely typical example of defects creating increased
maintenance labor and material costs while increasing the amount of unplanned (or even planned)
downtime. Planned downtime is increased by the fact that replacing the pump bearings takes
significantly longer than properly aligning the shafts, the initial failure that was noted. This
example also builds an excellent case for procedure-based maintenance and improved craft skills.
Had the craftsmen aligned the shafts properly upon replacing the impeller, none of this would
have occurred. This is also an excellent example of why the RCM Analysis Team should include a
CBM specialist.
NOTE: The list of failure modes covered in an RCM Analysis need not be an
exhaustive list. It should only include the predominant failure modes that represent
the failures that have previously occurred and the failures that are very likely to
happen. This philosophy is decidedly different from the original system laid out by
Nowlan and Heap, which John Moubray later formalized. They recommended that
the list of failure modes take into account every possible failure mode, however
unlikely, so as to avoid as much risk as possible. Modern variations focus the
attention of the Analysis Team on known failure modes and those with a high
conditional probability of occurrence. Some people tend to label this "streamlined
RCM," though this is not accurate. Streamlined RCM is one that is not fully SAE
JA1011 compliant, due to some shortcut being taken in the interest of time.
The biggest advantage of RCM is the fact that the Analysis Team, and by extension the
organization, begins to think in a failure modes manner. They realize that there are a myriad of
non-value-added tasks in a typical maintenance program that not only waste valuable craft time,
but, by means of intrusive inspections, actually increase the chances of infant mortality
problems like improper reassembly or lubrication contamination. The Analysis Team also gains
a clear set of guidelines for determining what tasks are to remain a part of the maintenance
strategy. Any task that is to remain in the maintenance program must meet at least one of the
following criteria:
1. Task prevents a failure mode from occurring.
These are the most powerful tasks. Anything that can be done to prevent a failure mode
from occurring without simultaneously creating more risk is the best answer.
Examples include inspecting and calibrating a meter to prevent equipment malfunction
and lubricating a motor bearing to prevent damage.
2. Task detects failure modes once they have occurred.
These are the most prevalent. Remember, operating inspections are most preferred
because they require no downtime. These are also known as failure finding tasks.
One type of failure finding task is an inspection to check for the functional failure of a
hidden function component that is not/may not be evident to the operating crew during
normal operation. A classic example is the testing of an Emergency Stop (E-Stop) on a
machine. This can only be tested by actually activating the E-Stop, which is not
normally done during the operation of the machine.
3. Task is statutory or regulatory in nature (i.e. required by federal or state agencies for
environmental, health, and safety reasons).
4. Task is administrative in nature.
This failure modes style of thinking quickly separates the value added from the non-value added.
Additionally, this type of analysis need not be limited to the creation of an EMP; it can also be
applied to the redesign of an existing EMP.
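The four retention criteria listed above amount to a simple "at least one must hold" test. The sketch below shows that test in Python under stated assumptions: the `PMTask` class, its field names, and the sample task descriptions are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class PMTask:
    """Hypothetical PM task record; each flag mirrors one retention criterion."""
    description: str
    prevents_failure_mode: bool = False   # criterion 1
    detects_failure_mode: bool = False    # criterion 2
    regulatory: bool = False              # criterion 3
    administrative: bool = False          # criterion 4


def is_value_added(task: PMTask) -> bool:
    # A task stays in the program if it meets at least one criterion.
    return any((
        task.prevents_failure_mode,
        task.detects_failure_mode,
        task.regulatory,
        task.administrative,
    ))


tasks = [
    PMTask("Lubricate motor bearing", prevents_failure_mode=True),
    PMTask("Test Emergency Stop", detects_failure_mode=True),
    PMTask("Task with no link to any failure mode (placeholder)"),
]
for task in tasks:
    verdict = "keep" if is_value_added(task) else "non-value added"
    print(f"{task.description}: {verdict}")
```

In practice the judgment behind each flag comes from the Analysis Team; the code only records the outcome of that judgment.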
Performing the preceding analysis on an already existing Preventive Maintenance (PM) strategy
allows for the non-value added tasks to be removed from the PM program and either deleted or
reassigned to more appropriate personnel. It also calls out which PM tasks need to be kept and
which ones need to be cleaned up in terms of wording and formatting to create a more
quantitative, repeatable procedure. This exercise is called a Preventive Maintenance Evaluation
(PME).
A PME can be done in one of two ways. First, a sample PME can be done at the beginning of a reliability
improvement initiative to build some momentum around the types of changes possible and start to
define the size of the changes that could happen. This is typically performed on 200-300 PM tasks
that are deemed to be representative of the entire PM program. The PM tasks should be selected
from across 20-25 different equipment types in the plant and from a combination of monthly,
quarterly, and annual PM tasks.
Second, a full PME can be done. This is typically performed on the entire PM library, towards
the middle of a reliability improvement initiative. This is done to calculate precisely how many
craft resources will be freed up and how many PM tasks need to be reengineered into the proper
format.
The organization does not have to implement the results of the PME right away. The output of the
study should follow a staged implementation, as outlined below.
1. All tasks deemed "Non-Value Added": These tasks can and should be deleted from the
program immediately. This will not be detrimental to the equipment performance because
the tasks, by definition, hold no value. This is an excellent time to begin reengineering the
tasks that need to be made more quantitative and repeatable.
2. All tasks deemed "Non-Value Added: Reassign to Operator Care": These tasks do
not require a skilled maintenance craft person to be successfully completed. These tasks
should only be assigned to the individual operator(s) after the proper task procedure has
been created and the operator has been task qualified to both the written procedure and the
physical procedure. This step gets the operator(s) more intimately involved with the
maintenance of the equipment and provides another line of defense against equipment
failures.
3. All tasks deemed "Non-Value Added: Reassign to Lube Route": Lubrication tasks
require a significant amount of training to be performed correctly. Contamination control
and sound lubrication fundamentals are broad topics and should be accounted for in the
design of the procedure. These tasks, like the operator care tasks, should only be assigned
to the individual lube technician after the proper task procedure has been created and the
lube technician has been task qualified to both the written procedure and the physical
procedure.
4. All tasks deemed "Reassign to PdM": Parallel to the PME process, the PdM
improvement process should be taking shape (more on that later) and it should be time to
relieve the PM program of all tasks that were previously deemed "Reassign to PdM." This
should only be performed after the PdM program is up and running, and like all of these
steps, can be done department by department as the technicians are ready to increase their
coverage.
Remember, this will not be as simple as flipping a light switch. Changes to the process workflows
have to be made and people need to be trained on changes in the workflow. Different metrics may
need to be created to measure implementation effectiveness and overall system efficiency. All four
of these steps can be done department by department, as the Reliability Improvement Team
tackles tasks and gets everything ready for rollout.

Developing a Failure Modes Driven Maintenance Strategy

As a part of our effort to create a failure modes driven maintenance strategy, we have to decide how
much of our equipment will receive a full RCM Analysis. The old methodology was predicated on
every asset in the facility eventually getting an RCM Analysis. The Analysis Teams would start
with the top 10% of the machines in that area, based on criticality. Criticality is a combination of
relative importance to the process and failure frequency. Once they were done with the top 10%, the
next 10% would be targeted and so on until all of the assets in that area had been analyzed. This is
the classical method and is not economically feasible to perform on all of the assets in a plant.
This situation gave birth to all of the different variations of streamlined RCM methodologies.
Streamlined methodologies are RCM Analysis methods that are not fully SAE JA1011 compliant
because they skip one or more steps of the original 7-step process.
An analysis of the history of failures for a given plant will normally indicate that a large portion of
the failures are happening on a small portion of the assets. Often it is found that the 80/20 Rule
applies: 80% of the problems are being created by only 20% of the machines. Even more
interesting, the 80/20 of the 80/20 generally applies as well. This means 64% (80% of 80%) of the
problems are being caused by only 4% (20% of 20%) of the assets. As shocking as this may seem, it
has been found to be true more often than not. This is further evidence that the streamlined RCM
approach is the correct analysis strategy. Therefore, a simple statistical calculation dictates that a
full-blown RCM Analysis need only be performed on 5-20% of all of the machines in a facility to
address the majority of the equipment failures creating the unplanned downtime. This 5-20% is
determined by the Criticality Analysis process.
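The "80/20 of the 80/20" arithmetic above can be checked in a few lines. This is a minimal sketch: the function name is hypothetical, and the 80% and 20% shares are the rule-of-thumb figures quoted in the text, not measured plant data.

```python
def pareto_levels(problem_share=0.80, asset_share=0.20, depth=2):
    """Apply the same problem/asset split repeatedly.

    Returns a list of (share_of_problems, share_of_assets) pairs,
    one per level of nesting.
    """
    return [(problem_share ** k, asset_share ** k) for k in range(1, depth + 1)]


for problems, assets in pareto_levels():
    print(f"{problems:.0%} of the problems come from {assets:.0%} of the assets")
# prints:
# 80% of the problems come from 20% of the assets
# 64% of the problems come from 4% of the assets
```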
NOTE: This does not preclude an RCM Analysis from being performed on any
machine at any time. After a maintenance strategy is put into place, RCM Analysis
is a very powerful tool to use on any machine, regardless of criticality ranking, that
exhibits chronic failures (otherwise known as a bad actor).
Understanding Asset Defects

Before we talk about which RCM methodology to use, a quick discussion of the different sources of
defects might indicate why particular FMA methodologies work better for some defect types than
they do for others.
Defects come from two different sources: people and environment. The first source of defects is
people and how they deal with the machinery. As people work on machines, they sometimes create
defects, such as:
• We do not properly align the shafts before we operate the fan;
• We do not properly address pipe strain on a pump;
• We do not clean the machine properly before operation;
• We do not follow the start-up procedure correctly or there is not a start-up procedure; and
• We over-speed or overload the machine during operation.

All of these cause defects and have everything to do with how we interact with the machine. We
will refer to these as systemic problems, with the system being the man-machine system.
The impact of systemic problems is easily seen in an I-P-F curve. An I-P-F curve is the standard P-
F curve with an I-P portion added. Point I is defined as the point of installation of the component.
The I-P portion of the I-P-F curve is the failure free period. This is the time during which the
operation is defect free. On machines that were installed improperly, this may be just a few
seconds. On machines that were installed and/or repaired properly, this may be years.
An excellent way to determine the maturity of the maintenance effort is not by looking at the age
of the program, but by looking at where the organization's focus is on the I-P-F curve. An organization
that is constantly focused on Point F, and on staying clear of it, will undoubtedly have a reactive culture.
"How long can we run it before it fails?" and "Just how bad is it?" are typical questions that might be
heard around this type of organization.
As the organization matures, the focus shifts from Point F to Point P. The organization then
focuses its efforts on understanding how things fail and on its ability to detect these failures early.
Typical things overheard in this type of organization may be something like "Is this the best way
to detect these defects early?" or "I appreciate you letting me know about this problem, even
though it is very early."
Another transition follows, moving the focus from Point P to Point I. Overheard in the hallways of
these organizations are statements similar to "Take the time to do it right, it will pay big dividends
for us not too far down the road" and "Let's update the procedures for that job to reflect what we
just learned." This organization is trying to prevent failures from occurring in the first place by
applying best practices with fits, tolerances, alignment standards, contamination control, and
well-documented procedures. They are the ones who will see the step change in performance and they
are the ones whom we label mature, not the organization that has had a maintenance effort in
place for a longer period of time and follows it poorly.
Secondly, there are the defects that stem from the operating environment or the operating context.
These defects are the result of the environment in which the machine operates. Some examples of
this type of defect are detailed below:
• Overhead crane gearboxes in plants located in hot regions of the United States will see
drastically hotter temperatures in the summer than the same gearbox in Wisconsin. As a
result, lubrication problems may be much more prevalent.
• Motors operating in a dusty environment, such as inside a building where carbon is
produced, are much more likely to have their cooling fins clogged than a motor operating a
fresh water pump down by the river.
• Pumps that have to change speeds to constantly adjust flow characteristics to account for
system adjustments are much more likely to run with flow turbulence issues, such as
cavitation, than a pump that never has to change speeds and has a constant inlet and
outlet pressure.
• Electrical equipment that sees constant changes in load is more likely to develop thermal
anomalies or hot spots as a result of the heating-up and cooling-down process than
electrical equipment that sees much more stable loads.
• Motors that are constantly starting under high load conditions or motors that experience
large variations in load during normal operations are much more prone to rotor bar defects
than motors that run all of the time and at the same load level.
These are just a few examples of how operating context can be a source of defects. We will refer to
these causes as operating envelope problems.
Both systemic problems and operating envelope problems manifest themselves in the same way: as
specific component-level or part-level defects. Some examples include:
• Rolling element defects
• Gear defects
• Contaminated lubricant
• Loose/dirty electrical connections
• Shorted motor windings
Knowing that both systemic problems and operating envelope problems produce the same type of
defects, a maintenance strategy that merely attempts to discover the defects and correct them will
never be able to reach a proactive state. Technicians will be too busy fixing the symptoms of
problems instead of addressing the root cause. To reach a truly proactive state, the root cause of
the defects will need to be identified and eliminated. Maintenance strategies that accomplish this
are able to see the step change in performance and achieve incredible cost savings. Maintenance
strategies that do not attempt to address the root cause of defects will continue to see lackluster
results and struggle with financial performance.
Another way to maintain focus on identifying and eliminating the root cause of the problem is to
focus on the defect and not the failure. Organizations that operate in a reactive mode typically
spend all of their time reacting to failures. This behavior finds them with all of their focus
around Point F on the curve. For that organization to make a step change in performance, they
will need to shift their focus to Point P, where a defect enters the system. As an organization shifts
its focus to detecting the defects early, it buys itself the single most valuable commodity a
maintenance organization can have: time. Detecting defects the moment they begin allows for the
maximum amount of time for the defect to be eliminated. While detecting defects the moment they
begin is not exactly possible, understanding the nature of the defects, how they are initiated, and
how they propagate is possible. A comprehensive inspection strategy, performed at the correct
intervals, will increase the conditional probability that the defects will be found very near their
origination. In addition, detecting the defects early allows for a proper RCFA to be performed
because many of the conditions that led to the defect are still in effect and can be more accurately
analyzed. Letting the defect progress down the curve, or degrade, changes the nature of the defect
and makes proper analysis more challenging. Perhaps just as importantly, it makes the failure
that much more expensive to correct.

So, how can an organization successfully shift their focus from F to P and get a step change in
performance?
The implementation of an Asset Health Management program will accomplish that goal. Asset
Health is defined as the percent of machines in the plant that do not have an identifiable defect.
Notice that this measure is not based on the machines that have experienced a failure or that are
about to fail; it is based on whether a machine has any type of defect, no matter how small.
Using this as a measure pushes the focus from Point F to Point P. No longer is an organization
putting all of their attention toward what is about to fail, thus keeping themselves in a short cycle,
high intensity, resource draining reactive environment. By detecting defects early in their life and
initiating the maintenance planning cycle upon detection, they transform themselves into an
organization that values planning over firefighting. They move themselves into the long cycle, low
intensity, resource maximizing proactive environment.
It should be noted here that the planning and scheduling function of an organization should
already be well established to be able to take advantage of the information that a Condition
Monitoring program provides. That being said, through the early detection of defects, the
Condition Monitoring program becomes an empowering force for the planning and scheduling
effort.
Technical Note on Asset Health Management:
An asset that has an identifiable defect is said to be in condition RED. An
asset that does not have an identifiable defect is said to be in condition
GREEN.
That is it. It is that simple. There are no other "but ifs," "what ifs," or "if
thens" to consider. If there is an identifiable defect, the asset is RED. If there
is no identifiable defect, it is GREEN. The percentage of machines that are in
condition GREEN is the Asset Health, stated as a percentage, for that plant
or area.
Example: A plant has 1,000 pieces of equipment. Of that number, 750 of them
have no identifiable defects. The plant is said to have 75% Asset Health.
There is an interesting aspect about Asset Health. Once this change is
underway, Asset Health, as a metric, becomes what most maintenance
managers and plant managers have wanted for a long time: a leading
indicator of maintenance costs. Given two areas of the plant or even two
different plants, it is a mathematical certainty that the one with the better
Asset Health in January will have lower relative maintenance costs in
February. There are no two ways around it. The asset base with the lower
number of defects will always have the lower relative maintenance costs.
You may be reading this and thinking, "The amount of Asset Health I have at
a given moment depends on what inspection techniques I employ as a part of
my Asset Health Management Program." You are correct. If a plant is only
considering one inspection technique for their Asset Health Management
program, then of course, it will be quite easy to attain and maintain a high
level of Asset Health. Make no mistake though, people who subscribe to this
philosophy are only managing to a metric, which of course, is a nice way of
saying they are fooling themselves.
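The RED/GREEN rule and the Asset Health percentage described in this note reduce to a short calculation. The sketch below reproduces the 1,000-asset example from the note; the function names and the list-of-strings representation are assumptions made for illustration.

```python
def asset_condition(defects_found):
    """Return "RED" if any inspection reported an identifiable defect, else "GREEN".

    defects_found: iterable of booleans, one per inspection technique applied.
    """
    return "RED" if any(defects_found) else "GREEN"


def asset_health(conditions):
    """Percentage of assets in condition GREEN."""
    conditions = list(conditions)
    if not conditions:
        raise ValueError("at least one asset is required")
    green = sum(1 for c in conditions if c == "GREEN")
    return 100.0 * green / len(conditions)


# The example from the note: 750 of 1,000 assets have no identifiable defect.
plant = ["GREEN"] * 750 + ["RED"] * 250
print(f"Asset Health: {asset_health(plant):.0f}%")  # prints: Asset Health: 75%
```

Because `asset_condition` only sees the inspections that were actually performed, the sketch also shows why a program built on a single technique will tend to report a flattering number.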
Three-Tiered Approach to Failure Mode Analysis

Now that we have discussed the sources of defects, how they manifest themselves, and how they
are reported, let's discuss the FMA methodologies that can be used to create maintenance
strategies capable of detecting and dealing with these defects. FMA methodologies are as varied as
the machinery on which we use them. Each was invented for a specific reason and each has its own
advantages and disadvantages.
While this paper will not go into detail for each methodology, we will spend some time discussing
the three different methodologies that Allied Reliability Group has used with considerable success.
This is the three-tiered approach mentioned in the introductory paragraph. Each of these
methods/tools will be explained in greater detail in the sections following this summary.
Tier 1: The Asset Health Matrix

The Asset Health Matrix (AHM) tool is a software application developed by Allied Reliability
Group that is used for several purposes. The primary purpose is to quickly design Condition
Monitoring programs for groups of assets. Think of the AHM as a streamlined FMA methodology.
The AHM tool utilizes the most common failure modes in asset components to identify which PdM
technologies could apply to the asset. Similarly, a brief description is provided of which
Quantitative Preventive Maintenance (QPM) inspection tasks could be performed to identify a
given defect or failure mode. This tool also identifies tasks that could be performed by task-
qualified operators, sometimes referred to as Basic Care. This tool utilizes asset criticality to
determine maintenance strategy.
Using benchmark data, each technology is assigned percent coverage models to populate the Asset
Versus Technology Matrix. A complete, granular Criticality Analysis is vital. The accuracy of the
resulting EMP is directly proportional to the ability of the Criticality Analysis to differentiate one
asset from another.
Inside the matrix, failure modes are mapped to specific component types and then Condition
Monitoring technologies and PM tasks are mapped to the applicable failure modes. This creates a
database of sorts where simply knowing the component types means knowing the Condition
Monitoring technologies that apply. The tool allows the user to map PM tasks to failure modes as
well. This is a critical function as there are failure modes that no Condition Monitoring technology
can detect and a PM task will be required.
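The lookup structure described above (component type to failure modes, failure modes to technologies and PM tasks) can be sketched with plain dictionaries. Everything in this sketch is an assumption for illustration: the mappings are a handful of generic examples, not the contents or data model of the actual AHM tool, and the placeholder failure mode exists only to show the case where no Condition Monitoring technology applies.

```python
# Illustrative mappings only; not taken from the AHM tool itself.
FAILURE_MODES_BY_COMPONENT = {
    "motor": [
        "rolling element defect",
        "shorted motor windings",
        "loose/dirty electrical connection",
    ],
    "gearbox": [
        "gear defect",
        "contaminated lubricant",
        "placeholder failure mode with no CM coverage",
    ],
}

TECHNOLOGIES_BY_FAILURE_MODE = {
    "rolling element defect": ["vibration analysis"],
    "gear defect": ["vibration analysis", "oil analysis"],
    "contaminated lubricant": ["oil analysis"],
    "loose/dirty electrical connection": ["infrared thermography"],
    "shorted motor windings": ["motor circuit analysis"],
}

PM_TASKS_BY_FAILURE_MODE = {
    "placeholder failure mode with no CM coverage": ["quantitative PM inspection"],
}


def applicable_technologies(component_type):
    """Technologies that apply to a component type, via its failure modes."""
    technologies = set()
    for mode in FAILURE_MODES_BY_COMPONENT.get(component_type, []):
        technologies.update(TECHNOLOGIES_BY_FAILURE_MODE.get(mode, []))
    return sorted(technologies)


def modes_needing_pm_task(component_type):
    """Failure modes that no mapped technology detects, so a PM task is required."""
    return [
        mode
        for mode in FAILURE_MODES_BY_COMPONENT.get(component_type, [])
        if not TECHNOLOGIES_BY_FAILURE_MODE.get(mode)
    ]


print(applicable_technologies("gearbox"))
print(modes_needing_pm_task("gearbox"))
```

With the mappings in place, knowing only the component type is enough to list the candidate technologies, which is the "database of sorts" behavior the text describes.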
There are two different types of AHM models: the 100% Theoretical Model and the validated AHM.

Tier 2: RCM Blitz

RCM Blitz is a fully SAE JA1011 compliant RCM Analysis methodology that incorporates the
use of two facilitators who focus the Analysis Team on small portions of the plant, accomplishing
rapid results. The output of the RCM Blitz process is a complete maintenance strategy, from
Condition Monitoring tasks to plans surrounding the run-to-failure components. The RCM Blitz
process is very effectively used to analyze both systemic and operating envelope problems that
manifest themselves as component level defects.
Tier 3: RCMCost
RCMCost is a powerful software tool that combines the Failure Mode, Effects, and Criticality
Analysis (FMECA) library developed using RCM Blitz with the failure data from the plant's
assets to produce an optimized maintenance strategy. The RCMCost software uses Weibull
statistics and Monte Carlo Simulation routines to calculate the optimum maintenance strategy,
failure mode by failure mode, over the entire life cycle of the asset.
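To give a feel for how Weibull statistics and Monte Carlo simulation can be combined to compare maintenance options for a single failure mode, here is a generic sketch. It is not the RCMCost algorithm, whose internals are not described in this paper. The Weibull parameters, costs, and candidate replacement ages are all hypothetical, and the model assumes a simple age-based replacement policy with instantaneous repairs.

```python
import random


def cost_rate(shape, scale, replace_age, cost_planned, cost_failure,
              trials=20000, seed=0):
    """Estimate long-run cost per operating hour for one failure mode.

    replace_age=None means run to failure (no planned replacement).
    """
    rng = random.Random(seed)
    total_cost = 0.0
    total_hours = 0.0
    for _ in range(trials):
        # random.weibullvariate takes (scale, shape) in that order.
        life = rng.weibullvariate(scale, shape)
        if replace_age is not None and life > replace_age:
            # Part survived to the planned replacement age.
            total_cost += cost_planned
            total_hours += replace_age
        else:
            # Part failed in service before any planned replacement.
            total_cost += cost_failure
            total_hours += life
    return total_cost / total_hours


# Hypothetical inputs: a wear-out failure mode (shape > 1) where an
# in-service failure costs ten times as much as a planned replacement.
SHAPE, SCALE = 3.0, 1000.0
COST_PLANNED, COST_FAILURE = 1.0, 10.0
candidates = [None, 200.0, 400.0, 600.0, 800.0]

rates = {
    age: cost_rate(SHAPE, SCALE, age, COST_PLANNED, COST_FAILURE)
    for age in candidates
}
for age, rate in rates.items():
    label = "run to failure" if age is None else f"replace at {age:.0f} h"
    print(f"{label}: {rate:.5f} cost units per hour")

best = min(rates, key=rates.get)
print("lowest simulated cost:",
      "run to failure" if best is None else f"replace at {best:.0f} h")
```

Repeating this comparison for each failure mode, with parameters fitted from the plant's own failure data, is the general idea behind optimizing a strategy failure mode by failure mode.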
Failure Mode Analysis Process

Using the three-tiered approach summarized in the previous section, we will now lay out the
specific steps in the Allied Reliability Group FMA methodology. This methodology has been
designed to take advantage of the power of Condition Monitoring and RCM simultaneously, thus
enabling rapid EMP development. Once the program matures, the power of advanced statistics is
added to optimize the tasks and task frequencies.
Each of the five steps contained in the process map below has a specific tool/facilitation style or
standard that is used. Each step has a specific purpose and creates unique value for the
maintenance organization.
Step 1 Construct the 100% Theoretical Model

The 100% Theoretical Model is the "everything you could do" scenario, created by simply taking an
equipment list, declaring an equipment type, and running the model. The model does not take into
account asset criticality, nor does it take into account which technologies are going to be used. It
simply lays out all of the possibilities.
The 100% Theoretical Model is always done first. It is a quick way to begin showing the
maintenance department what the depth and breadth of a Condition Monitoring program
could be once the program has reached full maturity. This model lays the groundwork for what is
to come in the complete EMP process.
This model is usually created based solely on the equipment list. If the plant has not already
declared an equipment type for each asset (and most plants have not), then the Analyst makes a
best guess as to what type of machine it is. This process is sometimes done remotely and sight
unseen. A major challenge for this step is that the equipment list is seldom accurate. It normally
has machines that are no longer in the plant and it does not always contain new machines that
have been added. The more accurate the list is, the more complete the picture is at this stage.

Step 2 Construct the Validated AHM

This is where the 100% Theoretical Model is turned into a full-fledged design document for the
Condition Monitoring program. The validated AHM becomes the foundation for the improvement
plan.
A validated AHM requires the following items:
1. The 100% Theoretical Model.
2. The organization's decisions for technology inclusions.
   Early on in a program, it may be prudent not to implement every technology immediately
   upon commencing Condition Monitoring. Things like workflow maturity, craft skills, and
   knowledge/acceptance of the technologies may be barriers to implementation. Pick the
   technologies that will generate the quickest Return on Investment (ROI). Use these early
   wins to generate positive momentum for the Condition Monitoring program. We will
   discuss these items in much more detail in later sections.
3. A complete asset criticality ranking.
   The validated AHM requires that the criticality ranking be known for the assets in
   question. Some people choose to begin with a focus area. This is fine as long as there is a
   realization that once other areas are analyzed and considered for program inclusion, certain
   items in the initial area may have to be dropped from the program due to resource
   constraints.
4. The organization's decisions on percent coverage.
   This is a function of workflow maturity. Where the workflow is well established and
   efficient, an organization can afford to start with a relatively high percent of coverage
   and a decent set of technologies. Where the workflow is not as stable and efficient, there is
   certainly nothing wrong with beginning the program at a small percent of coverage. In
   some cases, this is a wise thing to do. In doing so, the organization has time to get used to
   the idea of technologies finding defects and the differences in workflow that represents. It
   also gives the planners and crafts personnel time to get used to the new information and
   adjust their practices to incorporate the additional data into their daily decision matrices.
   However, there should be a Master Plan in place that provides expectations to increase
   coverage levels as the workflow matures. This maturation should occur over a 3-5 year
   period, and by the end of that time, the percent coverage and the number of technologies
   implemented should be at best practices levels.
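The four inputs listed above can be read as a filter applied to the 100% Theoretical Model. The sketch below shows one possible way to express that filter. The `Candidate` record, the criticality scale (higher means more critical), and the rule that coverage is applied to the most critical assets first are assumptions made for illustration, not a description of the AHM tool.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One row of a (hypothetical) 100% Theoretical Model."""
    asset: str
    technology: str
    criticality: int  # assumed scale: higher means more critical


def validate_model(theoretical, technologies, coverage):
    """Keep rows for the chosen technologies, then keep the most critical
    `coverage` fraction of assets (rounded up)."""
    rows = [c for c in theoretical if c.technology in technologies]
    criticality = {c.asset: c.criticality for c in rows}
    # Most critical first; asset name breaks ties so the result is stable.
    ranked = sorted(criticality, key=lambda a: (-criticality[a], a))
    kept = set(ranked[:math.ceil(len(ranked) * coverage)])
    return [c for c in rows if c.asset in kept]


theoretical = [
    Candidate("P-101", "Vibration Analysis", 9),
    Candidate("P-101", "Infrared Thermography", 9),
    Candidate("F-201", "Vibration Analysis", 5),
    Candidate("C-301", "Vibration Analysis", 2),
    Candidate("C-301", "Lubricant Analysis", 2),
]
validated = validate_model(theoretical, {"Vibration Analysis"}, coverage=0.5)
```

Raising `coverage` or adding technologies over the 3-5 year Master Plan then grows the validated model toward the theoretical one.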
Technical Note on the Asset Health Matrix as a Streamlined RCM Tool:
The RCM faithful may be asking: "If the AHM is a tool that is predominantly
focused on developing the inspection strategy for an asset base, what of the
preventive strategy for those same assets?" That is an excellent question.

While it is true that the AHM tool generally focuses on the inspection
strategy to identify individual component level failures, the preventive
strategy is more complex than can be generalized and put into a tool that is
universally applicable across multiple industry verticals. So, while the tool
has the capability of mapping PM strategies to different equipment types, the
tasks themselves are to be analyzed within the operating context and then
put into the tool.
In general, for the balance of plant assets (those assets not earmarked for a
full RCM Analysis), there are only a few necessary strategies that would
address the lion's share of the preventive maintenance needs. They are:
Craft Skills
More often than not, the root cause of many of the failures found in
modern facilities is poor craft skills. Specifically, these are
the precision skills of the crafts personnel to properly address fits,
tolerances, and alignment.
Lubrication Program
The storage and handling of lubricants is usually poor and oftentimes
atrocious. The importance of contamination control and its effect on
equipment reliability cannot be overstated.
Operator Care Program
The operators are the first line of defense for any asset. Where the
operator is involved in the care and maintenance of the asset, the
equipment always runs more reliably.
If these three items were addressed at the plant level rather than, as is often
the case, at the individual component level, reliability would improve significantly.
A comprehensive strategy of education and operating discipline would see the
immediate elimination of a tremendous number of unplanned failures and
the improvement of Overall Equipment Effectiveness (OEE). As such, these
systemic level initiatives are considered adequate as the preventive strategy
for the balance of plant assets. After deployment, any asset that displays
chronic failures can be earmarked for RCA/RCM Analysis as well.
Additionally, the preventive strategy for these assets was addressed during
the PME portion of the process. The existing PM program was streamlined to
contain only the PM tasks that added value by addressing specific failure
modes that were not going to be covered by PdM technologies. It should be
noted that, in the past, some organizations have chosen to eliminate a PM
task by scheduling it to be covered by a PdM technology and when it came
time to implement the different technologies, that particular technology was
not chosen. This left those failure modes undetected and exposed the
organization to significant risk. The PME process and the design of the
Condition Monitoring program have to work in concert to ensure that all of
the failure modes are covered by the most appropriate technique(s).
Step 3 Conduct Bad Actor Analysis

As noted previously, a bad actor is a machine that has displayed chronic problems. It frequently
causes downtime, and after a little bit of analysis, it becomes apparent that the root cause of the
problem is not known.
A two-tiered approach consisting of RCM and RCA is applied to combat these bad actors.
Reliability Centered Maintenance

RCM is used to develop comprehensive maintenance strategies. Due to the rigors associated with
this type of analysis, it is typically applied to only the top 5-20% of a facility's most critical
assets, but can be applied to other equipment on an as-needed basis to solve chronic problems.
RCM Blitz
If a site is looking for more information than is found in the AHM, and wants to be assured that all
failure modes and effects, as well as a spares inventory, are identified through a formal RCM
process, RCM Blitz should be utilized.
Why Blitz?
The challenge at hand was to develop an RCM process that delivered a thorough analysis in a
much shorter cycle. It was from this challenge that the RCM Blitz Method was created. RCM
Blitz is a process that completes the traditional RCM Analysis cycle in seven days or less and
delivers actionable tasks the first day of analysis.
The dramatic RCM Analysis cycle reduction is accomplished by combining advanced facilitation
techniques with the traditional RCM thought process.
The advancement in facilitation came from the idea of co-facilitation and the use of a real time
RCM database. These advancements made it possible to facilitate the RCM Analysis and enter the
information into the database at the same time. While one person facilitates, the other captures
the information. This provides the RCM Team and its company with more accurate information
and a complete RCM report with prioritized and assigned action items at the end of the analysis
cycle.
As previously stated, RCM Blitz is a fully SAE JA1011 compliant RCM Analysis methodology. Do
not let the term "Blitz" lead you to think that shortcuts are taken; the output is a complete
maintenance strategy for the failure modes analyzed. The term "Blitz" comes from the way the
analysis process is facilitated and is based on two factors.
1. The Analysis Team is fully dedicated for one week and one week only. The size of the
system being considered for analysis is such that the Analysis Team can complete the
analysis in that week. In this way, the analysis of the system is completed quickly (Blitz)
and not dragged out over weeks or months due to partial attention from the Analysis Team.
2. The process utilizes two facilitators: one facilitator works with the Analysis Team to keep
the process rolling while the other facilitator works the RCM Blitz software tool (typically
projected on the wall for the entire team to view). In this way, the conversation does not
have to stop for the facilitator to update the software tool when results need to be recorded.
This accelerates the analysis process (Blitz).
How Is RCM Blitz Different?

Allied Reliability Group's RCM Blitz methodology differs from most RCM offerings in that the
client's personnel are trained to become RCM Blitz facilitators, enabling them to perform their own RCM
Analysis. Unlike most RCM offerings, Allied Reliability Group does not simply train a client on
RCM theory and leave them dependent upon us to complete their analyses.
RCM Blitz results in a complete maintenance strategy for the failure modes that were analyzed.
Below is a list of the different tasks and strategies that are the typical output of the RCM Blitz
process.
Implementation Strategy: When the RCM Analysis is complete, there is often a large
action list that needs to be acted on and implemented. One thing we quickly realized was
that if you did not also have a structured way to prioritize and assign these actions, the
speed and thoroughness of implementation would be compromised. RCM Blitz provides a
way to quickly prioritize tasks, assign actions, establish due dates, and track the
implementation of the RCM Analysis.
Predictive Maintenance Tasks: Condition Monitoring or PdM tasks are extremely
powerful defect detection inspections that are usually performed while the equipment is in
normal operation. These are preferred inspection types versus inspections that require the
equipment to be shut down. Condition Monitoring technologies include, but are not limited
to, Vibration Analysis, Infrared Thermography, Ultrasonic Analysis, Lubricant Analysis,
and Motor Circuit Analysis.
Preventive Maintenance Tasks: Some failure modes cannot be detected with Condition
Monitoring and, therefore, require some other type of inspection. Additionally, some failure
modes can be prevented with the correct intervention technique at the correct interval.
Process Verification Techniques: One of the most cost effective on-condition
maintenance tasks, process verification uses field instruments and Programmable Logic
Controller (PLC) programming of trends, set-points, and alarms to detect and give
notification of any potential failure conditions. Unlike many single-point-in-time PdM
techniques, process verification tasks are monitored continuously as your process operates.
Listed below are just a few process verification tasks:
o Pump output pressure or flow
o Motor amps
o Process temperature
o Process sequence timers
Failure Finding Tasks: Whenever a component is subject to functional failure that will
not be evident to the operating crew under normal operating conditions, a scheduled
maintenance task is needed to protect the availability of that function. While hidden-
function failures have no immediate or noticeable consequences, these undetected failures
increase the exposure or probability of a possible multiple failure. As a result, hidden
function items are assigned failure finding tasks, which are scheduled inspections put into
place to detect functional failures versus potential failures.
Consequence Reduction Tasks: In teaching RCM, many vendors instruct that run-to-
failure is the remaining option when no preventive or predictive task can be used to
eliminate or reduce the failure to an acceptable level. If run-to-failure is your strategy, you
should be prepared to reduce the consequences of that failure. Allied Reliability Group's
process not only tracks the tasks that need to be implemented for predictive and preventive
maintenance, but also tracks the tasks that need to be completed to ensure all failure
consequences are kept to a minimum. To reduce failure consequences, the proper spare
parts, procedures, and resources need to be in place. Reducing failure consequences is the
key tool in reducing costly equipment downtime. If this strategy is not a part of your RCM
process, the process is incomplete.
Spare Parts Strategy: Allied Reliability Group's spare parts strategy offers a decision
process to ensure you have the parts needed to reduce the consequences of all failures. If
your RCM methodology ends the discussion of maintenance tasks with the phrase "No
Scheduled Maintenance," your company is in danger of running a critical item to failure
without ever considering if you should have a spare on hand. RCM Blitz uses a risk-based
flow diagram to make sound spare parts decisions that consider:
o The probability that the failure will occur;
o The consequences to your business should the failure occur;
o The availability and lead time to place the component at your facility; and
o Obsolescence.
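The four spare parts considerations listed above lend themselves to a simple risk comparison. The RCM Blitz flow diagram itself is not reproduced in this paper, so the rule below is a hypothetical stand-in: it compares the expected annual loss of waiting for a part against the cost of holding one, and treats an obsolete part as having an unbounded lead time. The function name, parameters, and figures are all assumptions.

```python
def spare_decision(annual_failure_prob, downtime_cost_per_day, lead_time_days,
                   annual_holding_cost, obsolete=False):
    """Return "stock" or "do not stock" using a hypothetical risk-based rule."""
    if annual_failure_prob <= 0:
        return "do not stock"
    if obsolete:
        # Assumption: an obsolete part cannot be reordered, so any credible
        # failure probability justifies holding a spare.
        return "stock"
    # Probability x consequence, where the consequence grows with lead time.
    expected_annual_loss = (annual_failure_prob * downtime_cost_per_day
                            * lead_time_days)
    return "stock" if expected_annual_loss > annual_holding_cost else "do not stock"


# Illustrative figures only.
print(spare_decision(0.10, 50_000, 30, 2_000))   # long lead time, costly outage
print(spare_decision(0.01, 1_000, 2, 500))       # cheap, quickly available part
```

A real decision process would likely weigh further factors such as shared spares across sites or repairability, which this sketch leaves out.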
Root Cause Analysis

RCA is used to:
Verify the EMP
Drive continuous improvement
Address selected failures/Bad Actors
RCA should be used when certain triggers are tripped and should not necessarily be based on asset
criticality. The RCA methodologies that Allied Reliability Group uses are 5 Why, Cause Mapping,
and Logic Tree. The methodologies each have specific strengths and should be used at different
times for optimal results.

RCA Methodology: 5 Why
Specific Strength: Simple and quick. Easy enough for anyone to use.
Typical Application: Everyday problems that occur in operations.
Example Triggers: Pressure transmitter failure resulting in 2 hours of downtime on packager.
Operator error in starting up equipment resulting in loss of one product batch.

RCA Methodology: Cause Mapping
Specific Strength: Identifies creative solutions. Looks for best point to break causal chain of events.
Typical Application: Safety or operations incidents. Equipment failures (chronic or sporadic).
Example Triggers: Procedure violations resulting in near-miss major injuries.
Loss of equipment function resulting in 12 hours of production line outage.

RCA Methodology: Logic Tree
Specific Strength: Emphasis on verification of cause/effect relationships. Finds solutions to systemic root causes.
Typical Application: Safety or operations incidents. Equipment failures (chronic or sporadic).
Example Triggers: Three hand injuries in one quarter.
Chronic equipment failures causing 30,000 lbs of reject product in a month.

Technical Note About the Difference Between RCM and RCA and
When Each Should Be Used:
Much discussion has taken place in the reliability engineering world with
regard to the true differences between these two methodologies and which
one is appropriate for a given situation. A thorough analysis of these two
methods reveals that, at their core, they are all but identical. Each deals with
cause and effect. It is just that simple.
Traditionally, the only difference lies in when each is used. If the analysis is
trigger based, which is another way of saying reactive, then we are most
likely looking at performing RCA or RCFA. These two terms are typically
used synonymously, though the RCA faithful will argue vehemently that
there is a distinct difference. We will address this shortly.
If someone is analyzing the potential failure modes of a system proactively
(before they have happened), then they will typically employ RCM as the
methodology.
So there lies the only real difference. RCA is generally used to figure out the
cause(s) of a specific event(s) so that they can be eliminated in the future, and
RCM is trying to put a strategy in place to eliminate/mitigate all of the
probable causes.
It is doubtful that RCA experts enjoy the fact that their methodology has
been labeled reactive. Enter the RCA faithful who say that RCFA is reactive
and RCA is proactive. This is true, but the techniques used are still the same.
The only difference is that if the event has already happened, it may be
possible to produce evidence to confirm or deny a particular theory.
Either way, both of the methodologies are powerful and necessary for the
development of a comprehensive FMA program.
Step 4 Complete the Equipment Maintenance Plan

An EMP only includes interval-based inspections and inspection-based activities designed to
identify a failure mode, prevent a failure mode, or address activity that is statutory or regulatory
in nature. An EMP is a detailed maintenance plan for the lowest maintainable item and includes
the following:
1. The equipment hierarchy.
   o It needs to be clear to the end user which piece of equipment is being discussed or
     identified.
   o Whether the client's Computerized Maintenance Management System (CMMS) or
     Enterprise Asset Management (EAM) system hierarchy supports it or not, the hierarchy in
     the EMP needs to go down to the component level. Component is defined as the lowest
     maintainable item (e.g., pump, fan, motor, gearbox). Reference ISO 14224, Section 8.2
     Taxonomy.
2. The failure mode(s) that has/have been identified.
   o The definition of a failure mode is the part + what is wrong with the part + the reason.
     (See Introduction for a detailed definition.)
   o The list of failure modes covered in an EMP need not be an exhaustive list. It should only
     include the predominant failure modes that represent both the failures that have happened
     and the failures that are likely to happen.
3. The task(s) employed against each failure mode, divided into one of four categories.
   o Inspection: identify the presence of a failure mode
   o Preventive: prevent the occurrence of a failure mode
     - Adjustment
     - Time-based replacement
     - Lubrication
   o Run-to-failure
   o Statutory or regulatory tasks: for liability reasons, the responsibility of defining these
     tasks and adding them to the EMP is reserved specifically for the asset owner
4. The frequency for each task.
   o Frequency is defined as interval based.
   o Intervals can be defined in terms of time (hours, days, weeks) or throughput (units, tons,
     pounds).
5. The person who is going to perform each task (by title, not by name).
   o Task assignments should be made based on task complexity and required skill levels.
   o Operators may be assigned equipment care tasks or inspection tasks depending on task
     qualification.
6. The amount of manpower that will be required to complete each task, reported in total
   man-hours.
7. The amount of equipment downtime required to complete each task.
NOTE: When addressing item 7, keep in mind that certain pieces of equipment may
require additional time to cool down or warm up and that additional time needs to
be considered as a part of the maintenance plan.
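The seven EMP items above map naturally onto a record structure. The sketch below is one possible representation, assuming hypothetical field names and a calendar-hours basis for converting intervals to yearly workload. It is not the schema of any particular CMMS or EAM system.

```python
from dataclasses import dataclass
from enum import Enum


class TaskCategory(Enum):
    INSPECTION = "inspection"
    PREVENTIVE = "preventive"
    RUN_TO_FAILURE = "run-to-failure"
    STATUTORY = "statutory or regulatory"


@dataclass
class FailureMode:
    part: str      # the part
    problem: str   # what is wrong with the part
    reason: str    # the reason

    def describe(self):
        return f"{self.part} {self.problem} due to {self.reason}"


@dataclass
class EmpTask:
    hierarchy: tuple        # e.g. (area, system, asset, component)
    failure_mode: FailureMode
    category: TaskCategory
    description: str
    interval: float
    interval_unit: str      # time ("hours", "days", "weeks") or throughput
    performed_by: str       # job title, not a person's name
    man_hours: float
    downtime_hours: float   # include any cool-down or warm-up time


HOURS_PER_UNIT = {"hours": 1, "days": 24, "weeks": 168}


def annual_man_hours(task, hours_per_year=8760):
    """Yearly labor for a time-based task, assuming calendar hours."""
    if task.interval_unit not in HOURS_PER_UNIT:
        raise ValueError("throughput-based intervals need a production rate")
    per_year = hours_per_year / (task.interval * HOURS_PER_UNIT[task.interval_unit])
    return per_year * task.man_hours


mode = FailureMode("bearing", "worn", "inadequate lubrication")
task = EmpTask(("Line 1", "Cooling", "Pump P-101", "motor"), mode,
               TaskCategory.INSPECTION, "Vibration route", 4, "weeks",
               "PdM Technician", 0.5, 0.0)
```

Summing `annual_man_hours` over all tasks gives a first estimate of the labor the plan demands, which can then be compared with available resources.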

Step 5 Optimize the Equipment Maintenance Plan

EMP optimization is accomplished through a detailed analysis of the failure rates of the plant
equipment. The optimized task frequency is calculated by accounting for the cost of downtime, the
cost of repair, and the cost of the inspections.
This option is for mature reliability programs only. To be considered mature, the following criteria
should be met:
1. Have a complete FMECA library for the top 20% of the equipment in the facility;
2. Be dissatisfied with the current task frequencies;
3. Have extensive failure data on most of the failure modes in the FMECA library; and
4. Be willing to utilize the WeiBayes methodology for failure profile estimation.
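For readers unfamiliar with the Weibayes approach mentioned in item 4, the commonly cited form of the estimate assumes a Weibull shape parameter (beta) from prior experience and then computes the characteristic life as eta = (sum of t_i^beta / r)^(1/beta), where the t_i are the operating times of all units, failed or not, and r is the number of failures. When no failures have occurred, r is conventionally set to 1, which yields a conservative lower bound. The sketch below implements that textbook formula; it is not a description of how RCMCost performs the calculation, and the function name is hypothetical.

```python
def weibayes_eta(times, failures, beta):
    """Weibayes estimate of the Weibull characteristic life (eta).

    times    -- operating time of every unit, failed or still running
    failures -- number of failures observed among those units
    beta     -- assumed Weibull shape parameter (from prior knowledge)
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    if not times:
        raise ValueError("at least one operating time is required")
    r = max(failures, 1)  # convention: treat zero failures as one
    return (sum(t ** beta for t in times) / r) ** (1.0 / beta)


# Four units at 1,000 hours each with one failure and an assumed beta of 2.
eta = weibayes_eta([1000, 1000, 1000, 1000], failures=1, beta=2.0)
```

The quality of the result depends heavily on how well the assumed beta matches the true failure pattern, which is why extensive failure data (item 3) is listed as a prerequisite.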
Allied Reliability Group uses the software tool RCMCost from Isograph, which allows the
relationship between maintenance plans and the needs of the business to be quantified. The
process uses local knowledge, maintenance history, and inspection results to develop models based
on RCM principles. Failure effects can be assessed in terms of dollar impact, safety/environmental
risk, and operational criticality. These models are used to evaluate, compare, and optimize
maintenance and inspection strategies over the entire life cycle of the machine. RCMCost can
generate reports on the justification of maintenance strategies and quantify the implications of
strategy changes on costs to the business and availability of your equipment.
As with all computer modeling processes, the quality of the data in the model determines the
quality of the results. Fortunately, most organizations have enough data to determine the
parameters required for the model. Once built, models can be easily updated and refined as
parameters change. In order to generate results that provide a definitive comparison of
alternatives, all parameters in the modeling process need to be brought back to a common unit.
Simply put, the only way to demonstrate the benefit of any maintenance policy is to compare
alternatives in terms of dollars or reduction in business risk. Sometimes the lowest cost
alternative does not provide the required reduction in safety, environmental, or operational risk.
In those cases, the goal is to find the lowest cost alternative that will achieve the required
reduction.
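The selection rule in the preceding paragraph, lowest cost subject to a required risk reduction, can be sketched as a constrained minimum. The strategy names, costs, and residual risk figures below are invented for illustration, and the single numeric risk score is a simplifying assumption.

```python
def pick_strategy(alternatives, max_residual_risk):
    """Return the name of the cheapest alternative whose residual risk is
    acceptable, or None if no alternative qualifies.

    alternatives -- list of (name, annual_cost, residual_risk) tuples
    """
    acceptable = [a for a in alternatives if a[2] <= max_residual_risk]
    if not acceptable:
        return None
    return min(acceptable, key=lambda a: a[1])[0]


# Hypothetical alternatives for one failure mode.
alternatives = [
    ("run-to-failure", 10_000, 0.20),
    ("quarterly vibration route", 14_000, 0.04),
    ("monthly vibration route", 22_000, 0.02),
]
choice = pick_strategy(alternatives, max_residual_risk=0.05)
```

Here the cheapest option overall fails the risk requirement, so the next cheapest alternative that meets it is chosen.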
The reality is that maintenance tasks should be determined or optimized based on many more
parameters, including:
Cost of failure
Risk of injury or death
Risk of environmental releases
Risk of business shutdown
Cost of corrective maintenance
Cost of inspections
Cost of PM
Cost of PdM
Cost of on-condition alarms
Cost of replacement parts
Cost of labor
Cost of delays associated with unpredicted failures
Equipment failure characteristics
System redundancies
Given the number of variables, the need for computer simulation should be apparent. RCMCost
can quickly generate results based on calculations that involve several parameters. This allows the
impact of parameter changes to be seen immediately.
Many organizations have followed an RCM type approach for their maintenance activities and
have generated effective maintenance policies directed towards the identified modes of failure. The
next step in the optimization process is to quantify the maintenance strategy decisions and apply
real life parameters to an entire system model so that asset life costing and predictions can be
performed. In several cases, organizations have generated what they believed to be their optimum
policies, but because they could not substantiate these, they had trouble applying the decisions to
real life. Generally, these decisions are made, but the results are not clearly understood or
checked. The optimized model should assist with these decisions and allow the user to understand
the consequences.
RCMCost is a strategy analysis/modeling package that utilizes simulation techniques that
provide flexibility in manipulating operational and equipment variables. Once the impacts of these
variables are understood, decisions can be made with confidence.
RCMCost provides an environment where information from RCM processes can be simulated to
test the resulting policies and add real world parameters to decision making.
Failure Reporting, Analysis, and Corrective Action System

Once the EMP is developed and implemented, continuous improvement to the strategy should take
place. This is referred to as a Failure Reporting, Analysis, and Corrective Action System
(FRACAS). An easy way to think of FRACAS is the following graphic.

[Figure: FRACAS Kaizen Loop. Failure Mode Analysis -> Failure Codes Creation -> Work Order History Analysis -> Root Cause Analysis -> Strategy Adjustments -> back to Failure Mode Analysis]
Failure Mode Analysis

This step represents the entire EMP analysis and all of the known failure modes that are expected
to occur.
Failure Codes Creation

The failure codes in a CMMS should match the failure modes listed in the failure modes
library very closely, if not exactly.
Work Order History Analysis

One of the roles of the Reliability Engineer (RE) is to periodically analyze the failures that have
occurred in the plant and compare those failures to the failure modes library. Do the failures that
have occurred match up with the failures that were expected? If not, why not? Best Practice is for
the RE to perform this analysis on a monthly basis.
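The monthly comparison described above amounts to a set comparison between the failure codes recorded on work orders and the failure modes library. A minimal sketch follows, assuming the CMMS export is simply a list of failure code strings. The codes shown are hypothetical.

```python
from collections import Counter


def compare_to_library(work_order_codes, library):
    """Compare failure codes from work orders against the failure modes library.

    Returns (unexpected, not_seen):
      unexpected -- {code: count} for failures that are not in the library
      not_seen   -- sorted library entries with no recorded failure
    """
    counts = Counter(work_order_codes)
    unexpected = {code: n for code, n in counts.items() if code not in library}
    not_seen = sorted(set(library) - set(counts))
    return unexpected, not_seen


# Hypothetical library and one month of work order failure codes.
library = {"BRG-WEAR", "SEAL-LEAK", "CPL-MISALIGN"}
month = ["BRG-WEAR", "BRG-WEAR", "SHAFT-CRACK"]
unexpected, not_seen = compare_to_library(month, library)
```

Entries in `unexpected` are candidates for RCA and for addition to the library, while `not_seen` simply lists expected failure modes that did not occur in the period.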
Root Cause Analysis

When the failures that have occurred do not match the expected failures, there is room for
improvement. Methodologies like RCA provide an opportunity to analyze these situations and
devise an adjustment to the current strategies.

Strategy Adjustments
Modifications to the EMP, or even to the asset management plan to compensate for operational
changes, are warranted after the RCA has been completed.
Failure Modes Analysis

After the cycle has been completed, the failure modes library should be updated. For example:
New failure modes
New failure finding tasks
New PM/PdM tasks
Different task frequencies
Now the failure modes library is complete and the cycle is ready to begin again. The process never
ends, which is why the failure modes library and the EMP are referred to as "living documents."
As you learn more about the way the equipment fails, the strategy gets updated and improved.
This continuous improvement process is a great example of kaizen.
Summary
Understanding how equipment fails is the core of any effective maintenance strategy. Once these
failure modes are understood, they can be eliminated, prevented, or simply detected with enough
time to respond.
A combination of RCM and RCA is required for the top 5-20% of the equipment in a plant and for
any other machine that has displayed chronic problems. For the balance of the plant equipment, a
full Reliability Analysis is not economically feasible. For these machines, a combination of
strategies is sufficient to ensure reliable operation. Those strategies are:
A comprehensive Condition Monitoring program;
A PM program that has been relieved of non-value-added tasks and tasks that are more
appropriately addressed by operators, lubrication specialists, and Condition Monitoring
technicians;
Effective craft skills; and
Lubrication excellence.
A comprehensive Condition Monitoring program for the whole plant is easily and quickly designed
through the use of the AHM tool. The AHM quickly maps the PdM tasks to the known failure modes
of the components in a drive train. Thus the RCM Analysis Team can spend their time on the
systemic and operating envelope problems.
Once the EMP is developed and some failure history is collected, it can then be optimized. The
optimization is accomplished through the use of powerful statistical software that analyzes failure
patterns and accounts for the cost of failures. The optimum task frequency is calculated, and the
Life Cycle Costs (LCC) of operating and maintaining the machine can also be calculated. Changes
to the maintenance strategy can then be analyzed to understand the overall impact to the system,
and decisions can be made in the best financial interest of the company.
The combination of RCM Analysis and the AHM creates an environment where EMPs can be
developed more rapidly than ever before. The addition of Weibull Analysis and LCC Analysis
ensures that the maintenance strategy is optimized to produce the most favorable results possible.

Allied Reliability Group 2 Copyright 2013
