Equipment Maintenance
Plan Development
Table of Contents
INTRODUCTION
DEVELOPING A FAILURE MODES DRIVEN MAINTENANCE STRATEGY
Understanding Asset Defects
Three-Tiered Approach to Failure Mode Analysis
Tier 1 The Asset Health Matrix
Tier 2 RCM Blitz
Tier 3 RCMCost
Failure Mode Analysis Process
Step 1 Construct the 100% Theoretical Model
Step 2 Construct the Validated AHM
Step 3 Conduct Bad Actor Analysis
Reliability Centered Maintenance
RCM Blitz
Why Blitz?
How Is RCM Blitz Different?
Root Cause Analysis
Step 4 Complete the Equipment Maintenance Plan
Step 5 Optimize the Equipment Maintenance Plan
Failure Reporting, Analysis, and Corrective Action System
Failure Mode Analysis
Failure Codes Creation
Work Order History Analysis
Root Cause Analysis
Strategy Adjustments
Failure Modes Analysis
SUMMARY
Introduction
In 1978, F. Stanley Nowlan and Howard F. Heap released a United Airlines study that contained a
new methodology called Reliability Centered Maintenance (RCM). This methodology quickly
became the gold standard for risk assessment and risk management. As the years progressed, two
major improvements happened:
1. Improvements in the use of advanced statistics to analyze machinery failure patterns shed
new light on the nature of how machinery fails; and
2. Advancements in computer technology produced extremely powerful inspection methods
that changed the face of assessing a component's condition.
Very few practitioners of RCM kept up with either of these changes, and as such, the development
of a comprehensive and efficient Equipment Maintenance Plan (EMP) became a resource-
consuming process.
This paper will lay out a new system for using a combination of traditional RCM Analysis,
Condition Monitoring technologies, and statistical data to produce a complete EMP in a fraction of
the time that it would take traditional RCM Analysis alone to produce. This paper will show how
the reliability engineering field as a whole can fully integrate the real power of Condition Based
Monitoring (CBM) technologies, sometimes referred to as Predictive Maintenance (PdM)
technologies, with traditional RCM Analysis by means of a three-Tiered approach to Failure Mode
Analysis (FMA), thereby accelerating the EMP development.
While this paper will not instruct the reader on the finer points of RCM Analysis, a short review is
in order. A recollection of the fundamentals will be necessary later in this paper. The core of RCM
Analysis lies in the answer to seven questions:
1. What are the functions and associated desired standards of performance of the asset in its
present operating context (functions)?
2. In what ways can the asset fail to fulfill its functions (functional failures)?
3. What causes each functional failure (failure modes)?
4. What happens when each failure occurs (failure effects)?
5. In what way does each failure matter (failure consequences)?
6. What should be done to predict or prevent each failure (proactive tasks and task intervals)?
7. What should be done if a suitable proactive task cannot be found (default actions)?
and it is this category that most faithful RCM facilitators have traditionally
overlooked or misunderstood.
Contained within question number 3 is a term we have not previously defined, failure mode. A
failure mode is the condition that exists that will cause a functional failure. Another way to think
of it is simply: the part + what is wrong with the part + the reason.
a. The part is a group of pieces that make up a component. Examples from the ISO standard
include impeller, seal, shaft, bolt, nut, and bearing.
b. What is wrong with the part refers to the effect of the failure mechanism. Examples
include failed, damaged, out of adjustment, abrasion, erosion, corrosion, fatigued, burnt,
and broken.
c. The reason is the physical cause of the problem. This could be age, improper lubrication,
misalignment, imbalance, or improper installation, among others.
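The three-element structure above (the part + what is wrong with the part + the reason) can be captured as a small record. The following is a minimal sketch only; the class name, field names, and label format are assumptions made for illustration and are not taken from any standard or tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    """A failure mode expressed as part + what is wrong + reason."""

    part: str       # the part, e.g. "Bearing"
    condition: str  # what is wrong with the part, e.g. "Fatigued"
    reason: str     # the physical cause, e.g. "Misalignment"

    def label(self) -> str:
        # Join the three elements into one report-style label.
        return f"{self.part} - {self.condition} - {self.reason}"


# Illustrative values taken from the example lists above.
mode = FailureMode(part="Bearing", condition="Fatigued", reason="Misalignment")
print(mode.label())  # Bearing - Fatigued - Misalignment
```

Keeping the three elements as separate fields, rather than one free-text string, makes it possible to group records later by part, by condition, or by reason.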
Example: One month, a misalignment condition is noted on a pump that is directly coupled to an electric
motor. The pump was recently replaced due to wear on the impeller. The post-maintenance follow-up by the
Vibration Analyst revealed that the alignment was performed improperly. The motor-pump combination is
now running in a misaligned condition. The failure mode noted on the Vibration Analyst's Exception Report
would be Shaft - Misalignment - Improper Installation. Just two months later, the misalignment
condition having never been corrected, the Vibration Analyst now detects an outer race defect on the pump
bearing. The failure mode noted on the Exception Report is now Bearing - Fatigued - Shaft Misalignment.
This example should help demonstrate how Condition Monitoring programs allow for the
acceleration of Root Cause Failure Analysis (RCFA). Having the Vibration Analysis reports to
review during the RCFA makes the process much faster because documented proof is readily
available. The above situation is an extremely typical example of defects creating increased
maintenance labor and material costs while increasing the amount of unplanned (or even planned)
downtime. Planned downtime is increased by the fact that replacing the pump bearings takes
significantly longer than properly aligning the shafts, the initial failure that was noted. This
example also builds an excellent case for procedure-based maintenance and improved craft skills.
Had the craftsmen aligned the shafts properly upon replacing the impeller, none of this would
have occurred. This is also an excellent example of why the RCM Analysis Team should include a
CBM specialist.
NOTE: The list of failure modes covered in an RCM Analysis need not be an
exhaustive list. It should only include the predominant failure modes that represent
the failures that have previously occurred and the failures that are very likely to
happen. This philosophy is decidedly different from the original system laid out by
Nowlan and Heap, which John Moubray later formalized. They recommended that
the list of failure modes take into account every possible failure mode, however
unlikely, so as to avoid as much risk as possible. Modern variations focus the
attention of the Analysis Team on known failure modes and those with a high
conditional probability of occurrence. Some people tend to label this streamlined
RCM, though this is not accurate. Streamlined RCM is one that is not fully SAE
JA1011 compliant, due to some shortcut being taken in the interest of time.
The biggest advantage of RCM is the fact that the Analysis Team, and by extension the
organization, begins to think in a failure modes manner. They realize that there are a myriad of
non-value added tasks in a typical maintenance program that not only waste valuable craft time
but, by means of intrusive inspections, actually increase the chances of infant mortality
problems like improper reassembly or lubrication contamination. The Analysis Team also gains
a clear set of guidelines for determining what tasks are to remain a part of the maintenance
strategy. Any task that is to remain in the maintenance program must meet at least one of the
following criteria:
This failure modes style of thinking quickly separates the value added from the non-value added.
Additionally, this type of analysis need not be limited to the creation of an EMP; it can also be
applied to the redesign of an existing EMP.
Performing the preceding analysis on an already existing Preventive Maintenance (PM) strategy
allows for the non-value added tasks to be removed from the PM program and either deleted or
reassigned to more appropriate personnel. It also calls out which PM tasks need to be kept and
which ones need to be cleaned up in terms of wording and formatting to create a more
quantitative, repeatable procedure. This exercise is called a Preventive Maintenance Evaluation
(PME).
A PME can be done one of two ways. A sample PME can be done at the beginning of a reliability
improvement initiative to build some momentum around the types of changes possible and start to
define the size of the changes that could happen. This is typically performed on 200-300 PM tasks
that are deemed to be representative of the entire PM program. The PM tasks should be selected
from across 20-25 different equipment types in the plant and from a combination of monthly,
quarterly, and annual PM tasks.
Secondly, a full PME can be done. This is typically performed on the entire PM library, towards
the middle of a reliability improvement initiative. This is done to calculate precisely how many
craft resources will be freed up and how many PM tasks need to be reengineered into the proper
format.
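The sample PME selection described above (a few hundred tasks spread across equipment types and PM frequencies) can be sketched as a simple stratified draw. This is only an illustration: the function name, the task fields `equipment_type` and `frequency`, and the round-robin rule are assumptions, not part of the PME method itself.

```python
import random
from collections import defaultdict


def sample_pm_tasks(tasks, sample_size, seed=0):
    """Draw a sample spread across (equipment type, frequency) groups."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for task in tasks:
        groups[(task["equipment_type"], task["frequency"])].append(task)
    for members in groups.values():
        rng.shuffle(members)
    sample = []
    # Round-robin over the groups so every group contributes one task
    # before any group contributes a second.
    while len(sample) < sample_size and any(groups.values()):
        for members in groups.values():
            if members and len(sample) < sample_size:
                sample.append(members.pop())
    return sample


# Hypothetical PM library: 3 equipment types x 3 frequencies x 10 tasks.
library = [
    {"id": f"{etype}-{freq}-{n}", "equipment_type": etype, "frequency": freq}
    for etype in ("pump", "motor", "gearbox")
    for freq in ("monthly", "quarterly", "annual")
    for n in range(10)
]
sample = sample_pm_tasks(library, sample_size=18)
covered = {(t["equipment_type"], t["frequency"]) for t in sample}
print(len(sample), len(covered))
```

In a real plant the library would come from the maintenance management system, and the sample size would be set in the 200-300 range given in the text.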
The organization does not have to implement the results of the PME right away. The output of the
study should follow a staged implementation, as outlined below.
1. All tasks deemed Non-Value Added: These tasks can and should be deleted from the
program immediately. This will not be detrimental to the equipment performance because
the tasks, by definition, hold no value. This is an excellent time to begin reengineering the
tasks that need to be made more quantitative and repeatable.
2. All tasks deemed Non-Value Added: Reassign to Operator Care: These tasks do
not require a skilled maintenance craft person to be successfully completed. These tasks
should only be assigned to the individual operator(s) after the proper task procedure has
been created and the operator has been task qualified to both the written procedure and the
physical procedure. This step gets the operator(s) more intimately involved with the
maintenance of the equipment and provides another line of defense against equipment
failures.
3. All tasks deemed Non-Value Added: Reassign to Lube Route: Lubrication tasks
require a significant amount of training to be performed correctly. Contamination control
and sound lubrication fundamentals are broad topics and should be accounted for in the
design of the procedure. These tasks, like the operator care tasks, should only be assigned
to the individual lube technician after the proper task procedure has been created and the
lube technician has been task qualified to both the written procedure and the physical
procedure.
4. All tasks deemed Reassign to PdM: Parallel to the PME process, the PdM
improvement process should be taking shape (more on that later) and it should be time to
relieve the PM program of all tasks that were previously deemed Reassign to PdM. This
should only be performed after the PdM program is up and running, and like all of these
steps, can be done department by department as the technicians are ready to increase their
coverage.
Remember, this will not be as simple as throwing a light switch. Changes to the process workflows
have to be made and people need to be trained on changes in the workflow. Different metrics may
need to be created to measure implementation effectiveness and overall system efficiency. All four
of these steps can be done department by department, as the Reliability Improvement Team
tackles tasks and gets everything ready for rollout.
This situation gave birth to all of the different variations of streamlined RCM methodologies.
Streamlined methodologies are RCM Analysis methods that are not fully SAE JA1011 compliant
because they skip one or more steps of the original 7-step process.
An analysis of the history of failures for a given plant will normally indicate that a large portion of
the failures are happening on a small portion of the assets. Often it is found that the 80/20 Rule
applies: 80% of the problems are being created by only 20% of the machines. Even more
interesting, the 80/20 of the 80/20 generally applies as well. This means 64% (80% of 80%) of the
problems are being caused by only 4% (20% of 20%) of the assets. As shocking as this may seem, it
has been found to be true more often than not. This is further evidence that the streamlined RCM
approach is the correct analysis strategy. Therefore, a simple statistical calculation dictates that a
full-blown RCM Analysis need only be performed on 5-20% of all of the machines in a facility to
address the majority of the equipment failures creating the unplanned downtime. This 5-20% is
determined by the Criticality Analysis process.
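The 80/20 arithmetic above, and the check of whether a given plant's failure history actually follows it, can be sketched as follows. The asset tags and failure counts are hypothetical values chosen for illustration; they are not data from this paper.

```python
def pareto_of_pareto(problem_share=0.80, asset_share=0.20):
    """Apply the 80/20 rule to itself: 80% of 80%, 20% of 20%."""
    return problem_share * problem_share, asset_share * asset_share


def asset_share_for(failure_counts, target=0.80):
    """Fraction of assets that accounts for `target` of all failures.

    Assets are ranked from most to fewest failures and accumulated
    until the target share of total failures is reached.
    """
    total = sum(failure_counts.values())
    running = 0
    ranked = sorted(failure_counts.values(), reverse=True)
    for rank, count in enumerate(ranked, start=1):
        running += count
        if running >= target * total:
            return rank / len(failure_counts)
    return 1.0


problems, assets = pareto_of_pareto()
print(f"{problems:.0%} of problems from {assets:.0%} of assets")

# Hypothetical work order history: number of failures per asset tag.
history = {"P-101": 42, "P-102": 3, "M-201": 2, "C-301": 1, "F-401": 2}
print(f"{asset_share_for(history):.0%} of assets cause 80% of failures")
```

Running the second function against real work order history shows how closely a plant follows the 80/20 pattern before the scope of a full RCM Analysis is set.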
NOTE: This does not preclude an RCM Analysis from being performed on any
machine at any time. After a maintenance strategy is put into place, RCM Analysis
is a very powerful tool to use on any machine, regardless of criticality ranking, that
exhibits chronic failures (otherwise known as a bad actor).
Defects come from two different sources: people and environment. The first source of defects is
people and how they deal with the machinery. As people work on machines, they sometimes create
defects, such as:
All of these cause defects and have everything to do with how we interact with the machine. We
will refer to these as systemic problems, with the system being the man-machine system.
The impact of systemic problems is easily seen in an I-P-F curve. An I-P-F curve is the standard P-
F curve with an I-P portion added. Point I is defined as the point of installation of the component.
The I-P portion of the I-P-F curve is the failure free period. This is the time during which the
operation is defect free. On machines that were installed improperly, this may be just a few
seconds. On machines that were installed and/or repaired properly, this may be years.
An excellent way to determine the maturity of the maintenance effort is not by looking at the age
of the program, but by looking at where the organization's focus is on the I-P-F curve. An organization that is
constantly focused on Point F and staying clear of it will undoubtedly be a reactive culture. "How
long can we run it before it fails?" and "Just how bad is it?" are typical questions that might be
heard around this type of organization.
As the organization matures, the focus shifts from Point F to Point P. The organization then
focuses its efforts on understanding how things fail and its ability to detect these failures early.
Typical things overheard in this type of organization may be something like "Is this the best way
to detect these defects early?" or "I appreciate you letting me know about this problem, even
though it is very early."
Another transition follows, moving the focus from Point P to Point I. Overheard in the hallways of
these organizations are statements similar to "Take the time to do it right; it will pay big dividends
for us not too far down the road" and "Let's update the procedures for that job to reflect what we
just learned." This organization is trying to prevent failures from occurring in the first place by
applying best practices with fits, tolerances, alignment standards, contamination control, and well-
documented procedures. They are the ones who will see the step change in performance and they
are the ones whom we label mature, not the organization that has had a maintenance effort in
place for a longer period of time and follows it poorly.
Secondly, there are the defects that stem from the operating environment or the operating context.
These defects are the result of the environment in which the machine operates. Some examples of
this type of defect are detailed below:
Overhead crane gearboxes in plants located in hot regions of the United States will see
drastically hotter temperatures in the summer than the same gearbox in Wisconsin. As a
result, lubrication problems may be much more prevalent.
Motors operating in a dusty environment, such as inside a building where carbon is
produced, are much more likely to have their cooling fins clogged than a motor operating a
fresh water pump down by the river.
Pumps that have to change speeds to constantly adjust flow characteristics to account for
system adjustments are much more likely to run with flow turbulence issues, such as
cavitation, than a pump that never has to change speeds and has a constant inlet and
outlet pressure.
Electrical equipment that sees constant changes in load is more likely to develop thermal
anomalies or hot spots as a result of the heating-up and cooling-down process than
electrical equipment that sees much more stable loads.
Motors that are constantly starting under high load conditions or motors that experience
large variations in load during normal operations are much more prone to rotor bar defects
than motors that run all of the time and at the same load level.
These are just a few examples of how operating context can be a source of defects. We will refer to
these causes as operating envelope problems.
Both systemic problems and operating envelope problems manifest themselves in the same way as
specific component or part level defects. Some examples include:
Knowing that both systemic problems and operating envelope problems produce the same type of
defects, a maintenance strategy that merely attempts to discover the defects and correct them will
never be able to reach a proactive state. Technicians will be too busy fixing the symptoms of
problems instead of addressing the root cause. To reach a truly proactive state, the root cause of
the defects will need to be identified and eliminated. Maintenance strategies that accomplish this
are able to see the step change in performance and achieve incredible cost savings. Maintenance
strategies that do not attempt to address the root cause of defects will continue to see lackluster
results and struggle with financial performance.
Another way to maintain focus on identifying and eliminating the root cause of the problem is to
focus on the defect and not the failure. Organizations that operate in a reactive mode typically
spend all of their time reacting to failures. This behavior finds them with all of their focus
around Point F on the curve. For that organization to make a step change in performance, they
will need to shift their focus to Point P, where a defect enters the system. As an organization shifts
its focus to detecting the defects early, it buys itself the single most valuable commodity a
maintenance organization can have: time. Detecting defects the moment they begin allows for the
maximum amount of time for the defect to be eliminated. While detecting defects the moment they
begin is not exactly possible, understanding the nature of the defects, how they are initiated, and
how they propagate is possible. A comprehensive inspection strategy, performed at the correct
intervals, will increase the conditional probability that the defects will be found very near their
origination. In addition, detecting the defects early allows for a proper RCFA to be performed
because many of the conditions that led to the defect are still in effect and can be more accurately
analyzed. Letting the defect progress down the curve, or degrade, changes the nature of the defect
and makes proper analysis more challenging. Perhaps just as importantly, it makes the failure
that much more expensive to correct.
So, how can an organization successfully shift their focus from F to P and get a step change in
performance?
The implementation of an Asset Health Management program will accomplish that goal. Asset
Health is defined as the percent of machines in the plant that do not have an identifiable defect.
Notice that this measure is not based on the machines that have experienced a failure or are about
to fail, but on the machines that have some type of defect, no matter how small.
Using this as a measure pushes the focus from Point F to Point P. No longer is an organization
putting all of their attention toward what is about to fail, thus keeping themselves in a short cycle,
high intensity, resource draining reactive environment. By detecting defects early in their life and
initiating the maintenance planning cycle upon detection, they transform themselves into an
organization that values planning over firefighting. They move themselves into the long cycle, low
intensity, resource maximizing proactive environment.
It should be noted here that the planning and scheduling function of an organization should
already be well established to be able to take advantage of the information that a Condition
Monitoring program provides. That being said, through the early detection of defects, the
Condition Monitoring program becomes an empowering force for the planning and scheduling
effort.
That is it. It is that simple. There are no other "but ifs," "what ifs," or "if
thens" to consider. If there is an identifiable defect, the asset is RED. If there
is no identifiable defect, it is GREEN. The percentage of machines that are in
condition GREEN is the Asset Health, stated as a percentage, for that plant
or area.
Example: A plant has 1,000 pieces of equipment. Of that number, 750 of them
have no identifiable defects. The plant is said to have 75% Asset Health.
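The RED/GREEN rule and the percentage above reduce to a short calculation. The sketch below reproduces the 1,000-asset example; the function name and the asset tags are assumptions made for illustration.

```python
def asset_health(defect_flags):
    """Percent of assets with no identifiable defect (condition GREEN).

    `defect_flags` maps an asset tag to True when the asset has an
    identifiable defect (RED) and False when it does not (GREEN).
    """
    if not defect_flags:
        raise ValueError("no assets to evaluate")
    green = sum(1 for has_defect in defect_flags.values() if not has_defect)
    return 100.0 * green / len(defect_flags)


# Reproduce the example: 1,000 assets, 750 with no identifiable defect.
plant = {f"ASSET-{i:04d}": i >= 750 for i in range(1000)}
print(asset_health(plant))  # 75.0
```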
You may be reading this and thinking, The amount of Asset Health I have at
a given moment depends on what inspection techniques I employ as a part of
While this paper will not go into detail for each methodology, we will spend some time discussing
the three different methodologies that Allied Reliability Group has used with considerable success.
This is the three-Tiered approach mentioned in the introductory paragraph. Each of these
methods/tools will be explained in greater detail in the sections following this summary.
The AHM tool utilizes the most common failure modes in asset components to identify which PdM
technologies could apply to the asset. Similarly, a brief description is provided of which
Quantitative Preventive Maintenance (QPM) inspection tasks could be performed to identify a
given defect or failure mode. This tool also identifies tasks that could be performed by task-
qualified operators, sometimes referred to as Basic Care. This tool utilizes asset criticality to
determine maintenance strategy.
Using benchmark data, each technology is assigned percent coverage models to populate the Asset
Versus Technology Matrix. A complete, granular Criticality Analysis is vital. The accuracy of the
resulting EMP is directly proportional to the ability of the Criticality Analysis to differentiate one
asset from another.
Inside the matrix, failure modes are mapped to specific component types and then Condition
Monitoring technologies and PM tasks are mapped to the applicable failure modes. This creates a
database of sorts where simply knowing the component types means knowing the Condition
Monitoring technologies that apply. The tool allows the user to map PM tasks to failure modes as
well. This is a critical function as there are failure modes that no Condition Monitoring technology
can detect and a PM task will be required.
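The mapping described above (component types to failure modes, and failure modes to Condition Monitoring technologies and PM tasks) behaves like a nested lookup table. The sketch below is not the AHM tool itself; the structure, the function names, and every entry in the table are illustrative placeholders rather than benchmark data.

```python
# Component type -> failure mode -> tasks that detect or address it.
# All entries are illustrative placeholders. An empty "cbm" list here is
# not a claim that no technology could detect that failure mode.
MATRIX = {
    "electric motor": {
        "bearing fatigued": {
            "cbm": ["Vibration Analysis", "Ultrasonic Analysis"],
            "pm": [],
        },
        "rotor bar defect": {"cbm": ["Motor Circuit Analysis"], "pm": []},
        "cooling fins clogged": {
            "cbm": [],
            "pm": ["Inspect and clean cooling fins"],
        },
    },
    "pump": {
        "bearing fatigued": {"cbm": ["Vibration Analysis"], "pm": []},
    },
}


def technologies_for(component_type):
    """Condition Monitoring technologies mapped to a component type."""
    modes = MATRIX.get(component_type, {})
    return sorted({tech for tasks in modes.values() for tech in tasks["cbm"]})


def pm_only_modes(component_type):
    """Failure modes with no mapped technology, so a PM task is needed."""
    modes = MATRIX.get(component_type, {})
    return [name for name, tasks in modes.items() if not tasks["cbm"]]


print(technologies_for("electric motor"))
print(pm_only_modes("electric motor"))
```

With a table like this, knowing only the component type is enough to list the applicable technologies and to flag the failure modes that still depend on a PM task.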
There are two different types of AHM models: the 100% Theoretical Model and the validated AHM.
Tier 3 RCMCost
RCMCost is a powerful software tool that combines the Failure Mode, Effects, and Criticality
Analysis (FMECA) library developed using RCM Blitz with the failure data from the plant's
assets to produce an optimized maintenance strategy. The RCMCost software uses Weibull
statistics and Monte Carlo Simulation routines to calculate the optimum maintenance strategy,
failure mode by failure mode, over the entire life cycle of the asset.
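To give a feel for how Weibull statistics and Monte Carlo simulation can be combined to compare maintenance intervals for a single failure mode, the sketch below simulates a textbook age-replacement policy. It does not describe how the RCMCost software works internally; the cost model, the function names, and all parameter values are assumptions chosen for illustration.

```python
import random


def cost_rate(interval, shape, scale, cost_pm, cost_failure,
              cycles=20000, seed=1):
    """Estimate long-run cost per hour of replacing a part every
    `interval` hours, or at failure if it fails sooner."""
    rng = random.Random(seed)
    total_cost = 0.0
    total_time = 0.0
    for _ in range(cycles):
        # weibullvariate(alpha, beta): alpha is the scale, beta the shape.
        life = rng.weibullvariate(scale, shape)
        if life < interval:
            total_cost += cost_failure  # unplanned failure before the PM
            total_time += life
        else:
            total_cost += cost_pm       # planned replacement at the interval
            total_time += interval
    return total_cost / total_time


def best_interval(candidates, **params):
    """Candidate interval with the lowest simulated cost rate."""
    return min(candidates, key=lambda t: cost_rate(t, **params))


# Hypothetical wear-out failure mode: shape > 1, characteristic life of
# 1,000 hours, and a failure costing ten times a planned replacement.
params = dict(shape=3.0, scale=1000.0, cost_pm=1.0, cost_failure=10.0)
candidates = [200, 400, 600, 800, 1000, 5000]
best = best_interval(candidates, **params)
print(best, round(cost_rate(best, **params), 5))
```

The 5,000-hour candidate approximates run-to-failure, so comparing it with the shorter intervals shows whether a time-based task is worth doing at all for this failure mode.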
Each of the five steps contained in the process map below has a specific tool/facilitation style or
standard that is used. Each step has a specific purpose and creates unique value for the
maintenance organization.
The 100% Theoretical Model is always done first. It is a quick method of starting to build in the
maintenance department's mind what the depth and breadth of a Condition Monitoring program
could be once the program has reached full maturity. This model lays the groundwork for what is
to come in the complete EMP process.
This model is usually created based solely on the equipment list. If the plant has not already
declared an equipment type for each asset (and most plants have not), then the Analyst makes a
best guess as to what type of machine it is. This process is sometimes done remotely and sight
unseen. A major challenge for this step is that the equipment list is seldom accurate. It normally
has machines that are no longer in the plant and it does not always contain new machines that
have been added. The more accurate the list is, the more complete the picture is at this stage.
The RCM faithful may be asking: "If the AHM is a tool that is predominantly
focused on developing the inspection strategy for an asset base, what of the
preventive strategy for those same assets?" That is an excellent question.
While it is true that the AHM tool generally focuses on the inspection
strategy to identify individual component level failures, the preventive
strategy is more complex than can be generalized and put into a tool that is
universally applicable across multiple industry verticals. So, while the tool
has the capability of mapping PM strategies to different equipment types, the
tasks themselves are to be analyzed within the operating context and then
put into the tool.
In general, for the balance of plant assets (those assets not earmarked for a
full RCM Analysis), there are only a few necessary strategies that would
address the lion's share of the preventive maintenance needs. They are:
Craft Skills
More often than not, the root cause of many of the failures found in
modern facilities is poor craft skills. Specifically, these are
the precision skills that crafts personnel need to properly address fits,
tolerances, and alignment.
Lubrication Program
Operator Care
The operators are the first line of defense for any asset. Where the
operator is involved in the care and maintenance of the asset, the
equipment always runs more reliably.
If these three items were considered at the plant level rather than at the individual
component level, as is often the case, reliability would improve significantly.
A comprehensive strategy of education and operating discipline would see the
immediate elimination of a tremendous amount of unplanned failures and
the improvement of Overall Equipment Effectiveness (OEE). As such, these
systemic level initiatives are considered adequate as the preventive strategy
for the balance of plant assets. After deployment, any asset that displays
chronic failures can be earmarked for RCA/RCM Analysis as well.
Additionally, the preventive strategy for these assets was addressed during
the PME portion of the process. The existing PM program was streamlined to
contain only the PM tasks that added value by addressing specific failure
modes that were not going to be covered by PdM technologies. It should be
noted that, in the past, some organizations have chosen to eliminate a PM
task by scheduling it to be covered by a PdM technology and when it came
time to implement the different technologies, that particular technology was
not chosen. This left those failure modes undetected and exposed the
organization to significant risk. The PME process and the design of the
Condition Monitoring program have to work in concert to ensure that all of
the failure modes are covered by the most appropriate technique(s).
A two-tiered approach consisting of RCM and RCA is applied to combat these bad actors.
RCM Blitz
If a site is looking for more information than is found in the AHM, and wants to be assured that all
failure modes and effects, as well as a spares inventory, are identified through a formal RCM
process, RCM Blitz should be utilized.
Why Blitz?
The challenge at hand was to develop an RCM process that delivered a thorough analysis in a
much shorter cycle. It was from this challenge that the RCM Blitz Method was created. RCM
Blitz is a process that completes the traditional RCM Analysis cycle in seven days or less and
delivers actionable tasks the first day of analysis.
The dramatic RCM Analysis cycle reduction is accomplished by combining advanced facilitation
techniques with the traditional RCM thought process.
The advancement in facilitation came from the idea of co-facilitation and the use of a real time
RCM database. These advancements made it possible to facilitate the RCM Analysis and enter the
information into the database at the same time. While one person facilitates, the other captures
the information. This provides the RCM Team and its company with more accurate information
and a complete RCM report with prioritized and assigned action items at the end of the analysis
cycle.
As previously stated, RCM Blitz is a fully SAE JA1011 compliant RCM Analysis methodology. Do
not let the term Blitz lead you to think that shortcuts are taken; the output is a complete
maintenance strategy for the failure modes analyzed. The term Blitz comes from the way the
analysis process is facilitated and is based on two factors.
1. The Analysis Team is fully dedicated for one week and one week only. The size of the
system being considered for analysis is such that the Analysis Team can complete the
analysis in that week. In this way, the analysis of the system is completed quickly (Blitz)
and not dragged out over weeks or months due to partial attention from the Analysis Team.
2. The process utilizes two facilitators: one facilitator works with the Analysis Team to keep
the process rolling while the other facilitator works the RCM Blitz software tool (typically
projected on the wall for the entire team to view). In this way, the conversation does not
have to stop for the facilitator to update the software tool when results need to be recorded.
This accelerates the analysis process (Blitz).
RCM Blitz results in a complete maintenance strategy for the failure modes that were analyzed.
Below is a list of the different tasks and strategies that are the typical output of the RCM Blitz
process.
Implementation Strategy: When the RCM Analysis is complete, there is often a large
action list that needs to be acted on and implemented. One thing we quickly realized was
that if you did not also have a structured way to prioritize and assign these actions, the
speed and thoroughness of implementation would be compromised. RCM Blitz provides a
way to quickly prioritize tasks, assign actions, establish due dates, and track the
implementation of the RCM Analysis.
Predictive Maintenance Tasks: Condition Monitoring or PdM tasks are extremely
powerful defect detection inspections that are usually performed while the equipment is in
normal operation. These inspections are preferred over inspections that require the
equipment to be shut down. Condition Monitoring technologies include, but are not limited
to, Vibration Analysis, Infrared Thermography, Ultrasonic Analysis, Lubricant Analysis,
and Motor Circuit Analysis.
Preventive Maintenance Tasks: Some failure modes cannot be detected with Condition
Monitoring and, therefore, require some other type of inspection. Additionally, some failure
modes can be prevented with the correct intervention technique at the correct interval.
Process Verification Techniques: One of the most cost-effective on-condition
maintenance tasks, process verification uses field instruments and Programmable Logic
Controller (PLC) programming of trends, set-points, and alarms to detect and give
notification of any potential failure conditions. Unlike many single-point-in-time PdM
techniques, process verification tasks are monitored continuously as your process operates.
Listed below are just a few process verification tasks:
o Pump output pressure or flow
o Motor amps
o Process temperature
o Process sequence timers
Failure Finding Tasks: Whenever a component is subject to functional failure that will
not be evident to the operating crew under normal operating conditions, a scheduled
maintenance task is needed to protect the availability of that function. While hidden-
function failures have no immediate or noticeable consequences, these undetected failures
increase the exposure or probability of a possible multiple failure. As a result, hidden
function items are assigned failure finding tasks, which are scheduled inspections put into
place to detect functional failures rather than potential failures.
Consequence Reduction Tasks: In teaching RCM, many vendors instruct that run-to-
failure is the remaining option when no preventive or predictive task can be used to
eliminate or reduce the failure to an acceptable level. If run-to-failure is your strategy, you
should be prepared to reduce the consequences of that failure. Allied Reliability Group's
process not only tracks the tasks that need to be implemented for predictive and preventive
maintenance, but also tracks the tasks that need to be completed to ensure all failure
consequences are kept to a minimum. To reduce failure consequences, the proper spare
parts, procedures, and resources need to be in place. Reducing failure consequences is the
key tool in reducing costly equipment downtime. If this strategy is not a part of your RCM
process, the process is incomplete.
Spare Parts Strategy: Allied Reliability Group's spare parts strategy offers a decision
process to ensure you have the parts needed to reduce the consequences of all failures. If
your RCM methodology ends the discussion of maintenance tasks with the phrase "No
Scheduled Maintenance," your company is in danger of running a critical item to failure
without ever considering whether you should have a spare on hand. RCM Blitz uses a risk-based
flow diagram to make sound spare parts decisions that consider:
o The probability that the failure will occur;
o The consequences to your business should the failure occur;
o The availability and lead time to place the component at your facility; and
o Obsolescence.
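The four considerations above can be sketched as a simple decision rule. This is a minimal illustration only, not the RCM Blitz flow diagram itself: the names `SpareCandidate` and `spare_decision`, the cost fields, and the rule that obsolete parts are routed to a manual review are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class SpareCandidate:
    """Hypothetical inputs for one component; all field names are assumptions."""
    name: str
    failures_per_year: float      # estimated likelihood that the failure occurs
    downtime_cost_per_day: float  # business consequence while waiting for the part
    lead_time_days: float         # time needed to get the component on site
    obsolete: bool                # True if the part is no longer manufactured
    annual_holding_cost: float    # cost of keeping one spare on the shelf


def spare_decision(part: SpareCandidate) -> str:
    """Return 'review', 'stock', or 'no spare' for a component."""
    # Assumption: obsolescence cannot be settled by arithmetic alone,
    # so obsolete parts are sent to a person for a decision.
    if part.obsolete:
        return "review"
    # Expected yearly loss from waiting on delivery after each failure.
    expected_annual_loss = (
        part.failures_per_year * part.lead_time_days * part.downtime_cost_per_day
    )
    return "stock" if expected_annual_loss > part.annual_holding_cost else "no spare"


gearbox = SpareCandidate("gearbox", 0.2, 5000.0, 30.0, False, 1500.0)
print(gearbox.name, spare_decision(gearbox))
```

The point of the sketch is that a "No Scheduled Maintenance" outcome still leaves a spare parts question to answer for the same failure mode.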
RCA should be used when certain triggers are tripped and should not necessarily be based on asset
criticality. The RCA methodologies that Allied Reliability Group uses are 5 Why, Cause Mapping,
and Logic Tree. The methodologies each have specific strengths and should be used at different
times for optimal results.
Technical Note About the Difference Between RCM and RCA and
When Each Should Be Used:
Much discussion has taken place in the reliability engineering world with
regard to the true differences between these two methodologies and which
one is appropriate for a given situation. A thorough analysis of these two
methods reveals that, at their core, they are all but identical. Each deals with
cause and effect. It is just that simple.
Traditionally, the only difference lies in when each is used. If the analysis is
trigger based, which is another way of saying reactive, then we are most
likely looking at performing RCA or RCFA. These two terms are typically
used synonymously, though the RCA faithful will argue vehemently that
there is a distinct difference. We will address this shortly.
So there lies the only real difference. RCA is generally used to figure out the
cause(s) of a specific event(s) so that they can be eliminated in the future, and
RCM is trying to put a strategy in place to eliminate/mitigate all of the
probable causes.
It is doubtful that RCA experts enjoy the fact that their methodology has
been labeled reactive. Enter the RCA faithful who say that RCFA is reactive
and RCA is proactive. This is true, but the techniques used are still the same.
The only difference is that if the event has already happened, it may be
possible to produce evidence to confirm or deny a particular theory.
Either way, both of the methodologies are powerful and necessary for the
development of a comprehensive FMA program.
maintainable item (i.e. pump, fan, motor, gearbox). Reference ISO 14224, Section 8.2
Taxonomy.
2. The failure mode(s) that has/have been identified.
   a. The definition of a failure mode is the part + what is wrong with the part + the reason.
      (See Introduction for a detailed definition.)
   b. The list of failure modes covered in an EMP need not be an exhaustive list. It should only
      include the predominant failure modes that represent both the failures that have happened
      and the failures that are likely to happen.
3. The task(s) employed against each failure mode, divided into one of four categories.
   a. Inspection: identify the presence of a failure mode
   b. Preventative: prevent the occurrence of a failure mode
      i. Adjustment
      ii. Time-based replacement
      iii. Lubrication
   c. Run-to-failure
   d. Statutory or regulatory tasks: for liability reasons, the responsibility of defining these
      tasks and adding them to the EMP is reserved specifically for the asset owner
4. The frequency for each task.
   a. Frequency is defined as interval based.
   b. Intervals can be defined in terms of time (hours, days, weeks) or throughput (units, tons,
      pounds).
5. The person who is going to perform each task (by title not by name).
   a. Task assignments should be made based on task complexity and required skill levels.
   b. Operators may be assigned equipment care tasks or inspection tasks depending on task
      qualification.
6. The amount of manpower that will be required to complete each task, reported in total
   man-hours.
7. The amount of equipment downtime required to complete each task.
NOTE: When addressing item 7, keep in mind that certain pieces of equipment may
require additional time to cool down or warm-up and that additional time needs to
be considered as a part of the maintenance plan.
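Because an EMP is communicated in table format, each row can be thought of as a record with one field per item in the list. The sketch below is one possible layout under that assumption; the class and field names (`EmpTask`, `TaskCategory`, `warmup_cooldown_hours`, and so on) are hypothetical and are not taken from any Allied Reliability Group tool.

```python
from dataclasses import dataclass
from enum import Enum


class TaskCategory(Enum):
    """The four task categories named in the EMP definition."""
    INSPECTION = "inspection"
    PREVENTATIVE = "preventative"
    RUN_TO_FAILURE = "run-to-failure"
    STATUTORY = "statutory or regulatory"


@dataclass
class EmpTask:
    """One row of an EMP table; field names are illustrative assumptions."""
    maintainable_item: str   # e.g. pump, fan, motor, gearbox
    failure_mode: str        # part + what is wrong with the part + the reason
    task: str
    category: TaskCategory
    interval: float          # how often the task is performed
    interval_unit: str       # time (hours, days, weeks) or throughput (units, tons)
    performed_by: str        # a job title, not a person's name
    man_hours: float         # total man-hours to complete the task
    downtime_hours: float    # equipment downtime for the task itself
    warmup_cooldown_hours: float = 0.0  # extra time covered by the NOTE above

    def total_downtime_hours(self) -> float:
        """Downtime to plan for, including cool-down and warm-up time."""
        return self.downtime_hours + self.warmup_cooldown_hours


row = EmpTask(
    maintainable_item="gearbox",
    failure_mode="bearing worn due to lack of lubrication",
    task="Replace bearing",
    category=TaskCategory.PREVENTATIVE,
    interval=52,
    interval_unit="weeks",
    performed_by="Mechanic",
    man_hours=4.0,
    downtime_hours=2.0,
    warmup_cooldown_hours=1.5,
)
print(row.total_downtime_hours())
```

Keeping cool-down and warm-up time as a separate field makes it harder to forget when downtime is scheduled.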
This option is for mature reliability programs only. To be considered mature, a program should
meet the following criteria:
1. Have a complete FMECA library for the top 20% of the equipment in the facility;
2. Be dissatisfied with the current task frequencies;
3. Have extensive failure data on most of the failure modes in the FMECA library; and
4. Be willing to utilize the WeiBayes methodology for failure profile estimation.
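For readers unfamiliar with the WeiBayes methodology mentioned in the last criterion, the sketch below shows the commonly published form of the estimate: the Weibull shape parameter (beta) is assumed from experience or a failure library, and the characteristic life (eta) is then estimated from the accumulated operating times. The function name `weibayes_eta` and the sample values are assumptions for illustration, and the sketch does not describe how any particular software package implements the method.

```python
def weibayes_eta(times, beta, failures=0):
    """Estimate Weibull characteristic life (eta) for an assumed shape (beta).

    times    -- operating time of every unit, failed or still running
    beta     -- assumed Weibull shape parameter
    failures -- number of observed failures; when none have occurred,
                one failure is assumed, which gives a conservative estimate
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    if not times:
        raise ValueError("at least one operating time is required")
    r = max(failures, 1)
    total = sum(t ** beta for t in times)
    return (total / r) ** (1.0 / beta)


# Hypothetical example: three units with no failures yet, assumed beta of 2.
print(weibayes_eta([1200.0, 1500.0, 900.0], beta=2.0))
```

The method is useful when failure data are sparse, which is why willingness to accept an assumed shape parameter is listed as a prerequisite.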
Allied Reliability Group uses the software tool RCMCost from Isograph, which allows the
relationship between maintenance plans and the needs of the business to be quantified. The
process uses local knowledge, maintenance history, and inspection results to develop models based
on RCM principles. Failure effects can be assessed in terms of dollar impact, safety/environmental
risk, and operational criticality. These models are used to evaluate, compare, and optimize
maintenance and inspection strategies over the entire life cycle of the machine. RCMCost can
generate reports on the justification of maintenance strategies and quantify the implications of
strategy changes on costs to the business and availability of your equipment.
As with all computer modeling processes, the quality of the data in the model determines the
quality of the results. Fortunately, most organizations have enough data to determine the
parameters required for the model. Once built, models can be easily updated and refined as
parameters change. In order to generate results that provide a definitive comparison of
alternatives, all parameters in the modeling process need to be brought back to a common unit.
Simply put, the only way to demonstrate the benefit of any maintenance policy is to compare
alternatives in terms of dollars or reduction in business risk. Sometimes the lowest cost
alternative does not provide the required reduction in safety, environmental, or operational risk.
In those cases, the goal is to find the lowest cost alternative that will achieve the required
reduction.
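The selection rule described above can be written in a few lines. This is a minimal sketch, assuming each alternative has already been reduced to an annual cost and a residual risk expressed in a common unit; the function name, dictionary keys, and figures are hypothetical.

```python
def cheapest_acceptable(alternatives, max_risk):
    """Return the lowest-cost alternative whose residual risk is acceptable.

    alternatives -- list of dicts with 'name', 'annual_cost', 'residual_risk'
    max_risk     -- highest residual risk the business will accept
    Returns None when no alternative achieves the required reduction.
    """
    acceptable = [a for a in alternatives if a["residual_risk"] <= max_risk]
    if not acceptable:
        return None
    return min(acceptable, key=lambda a: a["annual_cost"])


# Hypothetical alternatives for one failure mode.
options = [
    {"name": "run-to-failure", "annual_cost": 2000.0, "residual_risk": 0.30},
    {"name": "monthly vibration route", "annual_cost": 3500.0, "residual_risk": 0.05},
    {"name": "online monitoring", "annual_cost": 9000.0, "residual_risk": 0.01},
]
choice = cheapest_acceptable(options, max_risk=0.10)
print(choice["name"])
```

Here the cheapest option overall is rejected because it does not meet the risk limit, so the next cheapest one that does is chosen.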
The reality is that maintenance tasks should be determined or optimized based on many more
parameters, including:
Cost of failure
Risk of injury or death
Risk of environmental releases
Risk of business shutdown
Cost of corrective maintenance
Cost of inspections
Cost of PM
Cost of PdM
Given the number of variables, the need for computer simulation should be apparent. RCMCost
can quickly generate results based on calculations that involve several parameters. This allows the
impact of parameter changes to be seen immediately.
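To show how a few of these parameters interact, the sketch below uses the classic age-replacement cost-rate model with a Weibull failure distribution: the expected cost per operating hour is the expected cost of one replacement cycle divided by its expected length. This is a textbook model offered as an assumption-laden illustration, not a description of how RCMCost performs its calculations; the function names and all numbers are hypothetical.

```python
import math


def reliability(t, beta, eta):
    """Weibull probability of surviving to age t."""
    return math.exp(-((t / eta) ** beta))


def cost_rate(interval, beta, eta, cost_pm, cost_failure, steps=2000):
    """Expected cost per unit time when replacing at `interval` or at failure."""
    r_end = reliability(interval, beta, eta)
    # Expected cost of one cycle: planned replacement if the item survives,
    # failure cost otherwise.
    expected_cost = cost_pm * r_end + cost_failure * (1.0 - r_end)
    # Expected cycle length = integral of R(t) from 0 to interval (trapezoid rule).
    h = interval / steps
    area = 0.5 * (1.0 + r_end)  # R(0) = 1
    for i in range(1, steps):
        area += reliability(i * h, beta, eta)
    expected_length = area * h
    return expected_cost / expected_length


def best_interval(candidates, beta, eta, cost_pm, cost_failure):
    """Pick the candidate interval with the lowest expected cost rate."""
    return min(
        candidates,
        key=lambda t: cost_rate(t, beta, eta, cost_pm, cost_failure),
    )


# Hypothetical wear-out failure mode: beta 3, eta 1000 hours,
# planned replacement costs 1 unit, an unplanned failure costs 10.
print(best_interval(range(100, 2001, 100), 3.0, 1000.0, 1.0, 10.0))
```

With a shape parameter of 1 (random failures) the same model always favors the longest interval, which matches the RCM principle that time-based replacement does not help when failures are not age related.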
Many organizations have followed an RCM type approach for their maintenance activities and
have generated effective maintenance policies directed towards the identified modes of failure. The
next step in the optimization process is to quantify the maintenance strategy decisions and apply
real life parameters to an entire system model so that asset life costing and predictions can be
performed. In several cases, organizations have generated what they believed to be their optimum
policies, but because they could not substantiate these, they had trouble applying the decisions to
real life. Generally, these decisions are made, but the results are not clearly understood or
checked. The optimized model should assist with these decisions and allow the user to understand
the consequences.
RCMCost provides an environment where information from RCM processes can be simulated to
test the resulting policies and add real world parameters to decision making.
[Figure: diagram labels: Failure Mode Analysis, Work Order History Analysis, Root Cause Analysis]
Strategy Adjustments
Modifications to the EMP, or even to the asset management plan to compensate for operational
changes, are warranted after the RCA has been completed.
Now the failure modes library is complete and the cycle is ready to begin again. The process never
ends, which is why the failure modes library and the EMP are referred to as living documents.
As you learn more about the way the equipment fails, the strategy gets updated and improved.
This continuous improvement process is a great example of kaizen.
Summary
Understanding how equipment fails is the core of any effective maintenance strategy. Once these
failure modes are understood, they can be eliminated, prevented, or simply detected with enough
time to respond.
A combination of RCM and RCA is required for the top 5-20% of the equipment in a plant and for
any other machine that has displayed chronic problems. For the balance of the plant equipment, a
full Reliability Analysis is not economically feasible. For these machines, a combination of
strategies is sufficient to ensure reliable operation. Those strategies are:
A comprehensive Condition Monitoring program for the whole plant is easily and quickly designed
through the use of the AHM tool. The AHM quickly maps the PdM tasks to known failure modes
of the components in a drive train. Thus the RCM Analysis Team can spend their time on the
systemic and operating envelope problems.
Once the EMP is developed and some failure history is collected, it can then be optimized. The
optimization is accomplished through the use of powerful statistical software that analyzes failure
patterns and accounts for the cost of failures. The optimum task frequency is calculated, and the
Life Cycle Costs (LCC) of operating and maintaining the machine can also be calculated. Changes
to the maintenance strategy can then be analyzed to understand the overall impact to the system,
and decisions can then be made in the best financial interest of the company.
The combination of RCM Analysis and the AHM creates an environment where EMPs can be
developed more rapidly than ever before. The addition of Weibull Analysis and LCC Analysis
ensures that the maintenance strategy is optimized to produce the most favorable results possible.