
Back to Basics

Understanding Operating Data

Tony Cooper
Tony Cooper, LLC

Operating data offer valuable clues and are essential to managing and improving processes. However, they can also be misleading and hide the underlying issues. The principles and statistical tools presented here can help you make effective use of process data.

Understanding the performance of production processes is fundamentally an engineering issue, and operating data extracted from the process can be a powerful stimulus. Statistical skills allow the engineer to create useful information out of these data. By applying the concepts discussed in this article, these skills can be developed and put to good use.

Consider variation
Variation in output is due to an inability to control nuisance variables and to hold setpoints. The measurement system is sometimes incapable of detecting the variation, while at other times the variation is so large that it dominates. Variation can lead to a misunderstanding of the actual reaction stoichiometry, actual mass balance, or actual energy consumption. In such circumstances, a deterministic model provides a poor framework for understanding and managing chemical operations. This complicates the integration of a theoretical understanding with an empirical view of the process.

Because of variation, analysis of data involves not just a single point, but rather a distribution of points. If the process is stable, the sampled data will follow some type of statistical distribution. This is often the normal (or Gaussian) distribution depicted in Figure 1. Regardless of the assumed distribution, statistics are used to estimate parameters of the underlying distribution.

[Figure 1. A normal, or Gaussian, distribution is shaped like a bell curve, with data clustered around the mean. The average, computed as Σx/n, estimates the balance point of the distribution (the mean, μ); the standard deviation, computed as √[Σ(x − average)²/(n − 1)], estimates its spread or width (σ). Roughly 67% of the data fall within ±1σ of the mean, 95% within ±2σ, and 99.7% within ±3σ.]
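As a minimal illustration of these estimates (the data below are simulated for demonstration and are not from the article), the mean and standard deviation shown in Figure 1 can be computed directly and compared with the normal-distribution coverage rule of thumb:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=92.0, scale=1.5, size=500)   # hypothetical stable process data

x_bar = x.sum() / len(x)                               # estimates the center (mu)
s = np.sqrt(((x - x_bar) ** 2).sum() / (len(x) - 1))   # estimates the spread (sigma)

for k in (1, 2, 3):
    within = np.mean(np.abs(x - x_bar) <= k * s)
    print(f"within +/-{k}s: {within:.1%}")   # roughly 68%, 95% and 99.7% for near-normal data
```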
Gather all the available data
Keep in mind the relationship y = f(x), where both y and x are vectors or matrices. A quality product must meet multiple criteria, such as purity and homogeneity. Numerous inputs and process parameters influencing y are xs. Although each y is a synergistic function of multiple xs, statistics are often incorrectly used to empirically build a relationship between one y and one x.

Since the data used to evaluate the operation of chemical processes are typically kept in various locations, the first step in data analysis is often to compile all the accumulated data on both ys and xs. For instance, analytical data on customer-critical ys may be kept in laboratory reports, while x data, such as information on sensors, switches, valves, flowrates, etc., are kept in historical process records. Other important data, which may be kept in other locations, include:
• raw material supplier data, costs, and order dates
• downtimes and maintenance records
• operator assignments
• environmental conditions
• customer feedback.

Investigate the integrity of the data
The lack of integrity in a set of data can provide valuable clues, but can also be misleading. Three common data-integrity issues are missing records, incomplete records, and measurement error.

Missing data can take various forms. For instance, records from certain time periods or for certain batches may be missing. There is a temptation to analyze the remaining data and ignore the information that is not there, assuming someone inadvertently forgot to enter it, a sample was lost, or a computer crashed. The information, however, could have been lost as a result of a significant event. The fact that data are missing is often a serious clue.

A subtler form of missing data involves incomplete records because not all of the important variables were measured. This can result in the inability to explain the processing issues, or it may appear that there is another cause, either of which can lead to misguided and ineffective action.

The integrity of the data may also be suspect when y or x is incorrectly measured. Mass balances and raw material loss data are simple, but powerful, tools for evaluating the integrity of data such as flowrates and valve positions. Mass balances may not close due to a missing source or due to an inaccurate value. A raw material loss analysis might compare the amount of delivered material and the theoretical usage to the amount of product produced. If they do not match, there might be issues with the supplier, theoretical usage, or the measurement. If the mass balance does not close, it may not be clear what the problem is, but it is clear that some data are not completely trustworthy.
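A raw-material loss check of this kind is simple arithmetic. The sketch below illustrates it; the quantities and the 2% tolerance are hypothetical and are not values from the article:

```python
# Hypothetical raw-material loss analysis for one reporting period.
delivered_kg = 10_450.0                 # raw material received
theoretical_usage_kg_per_unit = 2.05    # usage predicted by the recipe/stoichiometry
units_produced = 5_010

expected_usage_kg = theoretical_usage_kg_per_unit * units_produced
loss_kg = delivered_kg - expected_usage_kg
loss_pct = 100.0 * loss_kg / delivered_kg

print(f"unexplained loss: {loss_kg:.0f} kg ({loss_pct:.1f}%)")
if abs(loss_pct) > 2.0:   # tolerance chosen for illustration only
    print("Balance does not close: check supplier data, theoretical usage, or the measurements.")
```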
Common measurement system evaluations (1) can be used to review the data from analytical laboratories by asking such questions as:
• Is the characteristic being measured relevant?
• Are the samples being taken at the right locations and frequencies?
• Are the measurements reproducible?
It is often necessary to verify information from laboratories. A good strategy is to periodically measure (or have the lab measure) duplicate samples in a well-designed study. This should be a blind study, i.e., the analyst should not know which samples are duplicates or even that duplicate samples are being used. If there is a large variation between the duplicate readings, the historical data on that characteristic are likely to be misleading.
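One common way to put a number on "large variation between the duplicate readings" (a minimal sketch, not a method prescribed by the article) is to estimate the repeatability standard deviation from the paired differences of the blind duplicates:

```python
import numpy as np

# Hypothetical blind-duplicate results: first and second analysis of the same samples.
first  = np.array([4.12, 3.98, 4.25, 4.05, 4.18, 4.01])
second = np.array([4.30, 3.80, 4.41, 4.22, 3.95, 4.20])

d = first - second
s_repeat = np.sqrt(np.mean(d ** 2) / 2.0)   # repeatability std dev estimated from paired duplicates
print(f"estimated repeatability standard deviation: {s_repeat:.3f}")
# If s_repeat is large relative to the spread of the historical data on this
# characteristic, that history is probably misleading.
```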
Time as a source of variation
Statistics describe and estimate characteristics of processes and products, often over a period of time. An example of a typical statistic is the average yield from a campaign, but sometimes how the statistic (in this case yield) fluctuated over time is more interesting.

Time is a surrogate for underlying sources of variation. Fluctuations in a statistic over time provide valuable information: the process average might trend, variation could increase, or ratios might change.

Tool 1: Time series and run plots
Using averages to summarize large quantities of data is one of the weakest and often misleading uses of statistics. The summary may be needed because a table of the raw data can be intimidating and not very informative, but there are often better ways to provide a summary than to report data averages.

Data should typically be presented in a graphical format. Plot the data over time and consider ways that the data can be broken down or disaggregated, which makes the plots much more sensitive. For example, are multiple production lines represented, or are there different suppliers of the raw material?

Consider a product for which particle size is an important measure of quality: smaller particle size indicates higher quality. The average particle sizes for batches from two different crystallizers are 32.308 μm and 32.005 μm. Figure 2 demonstrates how these single numbers hide important information about the management of the crystallizers over time. When the data are disaggregated by unit, the variations over time are revealed. It is misleading to quote a single average that aggregates across time if the process is not stable, or across units if they perform differently.

[Figure 2. Plotting operational data over time is often more revealing than reporting an average. Particle size is plotted batch by batch over time for Unit 1 and Unit 2, showing the variation within each unit that the two averages conceal.]
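A disaggregated run plot like Figure 2 is straightforward to produce. The sketch below assumes a hypothetical data export with columns named batch, unit, and particle_size_um; the file and column names are illustrative, not from the article:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("particle_size.csv")   # hypothetical columns: batch, unit, particle_size_um

units = sorted(df["unit"].unique())
fig, axes = plt.subplots(1, len(units), sharey=True, squeeze=False, figsize=(10, 4))
for ax, unit in zip(axes[0], units):
    grp = df[df["unit"] == unit].sort_values("batch")   # one panel per unit, in time order
    ax.plot(grp["batch"], grp["particle_size_um"], marker="+")
    ax.set_title(f"Unit {unit} (avg = {grp['particle_size_um'].mean():.3f})")
    ax.set_xlabel("Batches over time")
axes[0][0].set_ylabel("Particle size, um")
plt.tight_layout()
plt.show()
```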
Tool 2: Control charts
A control chart is one of the most powerful tools for displaying statistics. Some essentials of this tool need to be understood to avoid misinterpretation.

Control charts (2, 3) are generated from data collected from the process, and control limits are mathematical functions of those data. At a fundamental level, if the data are contained by the control limits, the process is said to be acting consistently and in control. Since the control limits are a function of the empirical data, they are said to report the voice of the process. Requirements and specifications, which represent the voice of the customer, do not factor into the control limit calculations.

Statistical control charts have little in common with process controllers (e.g., feedback controllers). The purpose of process controllers is to move variability from the product stream to more benign locations to control the effects of variation. For example, product quality might be maintained by adjusting the flowrate of steam through a jacket to ensure uniform temperature, or by adjusting the amount of water added to a waste stream to keep its concentration within acceptable limits.

Control charts assist in debugging the variation in the process. They do not control the variation, but rather report on whether the variation is likely to be either the result of an assignable event or consistent with the process itself and its many common, but unassignable, causes. Failure to understand the cause of the variation can lead to changes that make the situation worse.

Control charts are a departure from typical quality-control inspections. Inspection is intended to find bad products before they are shipped, while control charts focus on the causes of bad product, in particular where and when the variation in the process occurs and whether it is consistent or inconsistent. A control chart is the most powerful statistical tool for effective process management.

There are several kinds of control charts. Charts can be used to understand the behavior of one variable (univariate) or many variables (multivariate). Two types of univariate control charts frequently used are individual and moving range charts, which are run plots with control limits, and mean (or Xbar) and range charts, which consider subgrouping.

As an example, consider the sampling and subgrouping (within and between batches) represented in Figure 3, where three samples were taken from random locations in each of 18 batches. The data are plotted in Figure 4, which shows that three samples contain off-spec product. These batches need to be quarantined or reworked, but this conventional process management work does not improve the process. The control charts in Figure 5 show that one sample per batch would not have revealed the true issue of considerable variation within each batch. Every sample from every batch had the same probability of being outside the customer's requirements (not just the three off-spec samples in the data). According to the control charts in Figure 5, the process is in control, but the comparison of the sample data to specifications (Figure 4) shows that the process needs improvement. The control charts indicate how to improve the process.

[Figure 3. Samples were taken from random locations in each of 18 batches. Three quality readings are tabulated for each batch; the values for batches 1 through 5 and batch 18 are shown.]

[Figure 4. A frequency plot shows that three of the batches sampled according to the plan given in Figure 3 contained off-spec product. Quality is plotted for batches 1 through 18 against the specification; three points fall below it.]

[Figure 5. Range and mean control charts reveal that despite variation within batches, the process is considered to be in control. Range chart: UCL = 7.45, average range = 2.89, LCL = 0.00. Mean chart: UCL = 95.00, average = 92.04, LCL = 89.08. Subgroup ranges and means for the batches shown in Figure 3 include R = 1.1, 1.23, 6.33, 3.05, 1.95, 2.18 and means of 93.1, 91.33, 91.97, 91.46, 92.39, 91.05.]
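As a rough check on Figure 5 (a minimal sketch, not the article's own calculation), the mean and range chart limits can be computed from the subgroup means and ranges using the standard Xbar-R constants for subgroups of three. Only the six batches reproduced in Figure 3 are listed here, so the limits will differ slightly from those in Figure 5, which are based on all 18 batches:

```python
import numpy as np

# Quality readings for the six batches visible in Figure 3 (subgroup size n = 3).
data = np.array([
    [95.11, 88.78, 92.02],
    [91.53, 92.16, 93.48],
    [92.46, 90.28, 90.42],
    [92.37, 93.48, 93.45],
    [90.30, 90.73, 93.35],
    [91.87, 90.64, 91.48],
])

A2, D3, D4 = 1.023, 0.0, 2.574   # standard Xbar-R chart constants for n = 3

xbar = data.mean(axis=1)                    # subgroup means
r = data.max(axis=1) - data.min(axis=1)     # subgroup ranges
xbarbar, rbar = xbar.mean(), r.mean()

print(f"Mean chart:  LCL = {xbarbar - A2 * rbar:.2f}, center = {xbarbar:.2f}, UCL = {xbarbar + A2 * rbar:.2f}")
print(f"Range chart: LCL = {D3 * rbar:.2f}, center = {rbar:.2f}, UCL = {D4 * rbar:.2f}")
```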
To make better product, the engineer needs to consider the ubiquitous variation sources that create variation in every batch. In this case, increased mixing might help, but high viscosity and the potential for shearing make this an expensive option. A more elegant way to make the batch more homogeneous is to address the charge pattern and rate.

Basic summary statistics
Statistics are useful for summarizing historical events, such as the average quality of product shipped over a specific time period, but they often lack context. Time series are lost, y = f(x) relationships are not shown, data are aggregated over important sources of variation, and the multivariate nature of the situation can be hidden.

The average (mean), standard deviation, and correlation are snapshot statistics.

Tool 3: Correlation
A scatter plot is a basic, but powerful, tool for exploring relationships between variables. It is simply a collection of data points on a two-dimensional graph, with the independent variable plotted on the horizontal axis and the dependent (response) variable on the vertical axis.

When two variables move together, they are said to be correlated. Correlation can be characterized by the Pearson correlation coefficient. This statistic ranges from -1 to +1 and measures the strength of a linear relationship. A positive value implies a positive association (large values of x are associated with large values of y, and small values of x are associated with small values of y), and a negative value implies a negative or inverse relationship. As the coefficient approaches +1 or -1, the relationship approaches one-to-one, i.e., a straight line on a scatter plot.

Statistical relationships can also be characterized by the coefficient of variation (CV), which is defined as the ratio of the standard deviation (σ) to the mean (μ). Together, the Pearson correlation coefficient and the coefficient of variation summarize a scatter plot.
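Both statistics are quick to compute. A minimal sketch follows; the paired measurements are made-up illustrative values, not data from the article:

```python
import numpy as np

# Hypothetical paired batch analyses for two side products, in wt%.
x = np.array([0.42, 0.55, 0.61, 0.48, 0.70, 0.66, 0.52, 0.59])   # component C
y = np.array([0.51, 0.60, 0.72, 0.55, 0.80, 0.74, 0.58, 0.69])   # component D

# Pearson correlation coefficient:
#   r = sum((x - x_avg)(y - y_avg)) / sqrt(sum((x - x_avg)^2) * sum((y - y_avg)^2))
r = np.corrcoef(x, y)[0, 1]

# Coefficient of variation: spread relative to the level of the data.
cv_x = x.std(ddof=1) / x.mean()

print(f"Pearson r = {r:.2f},  CV of component C = {cv_x:.2f}")
```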
In addition to plotting a variable over time, it is also useful to plot output variables against each other to see how they are related. For example, a chemical plant had trouble reducing the amount of two side products; one always seemed to be an issue. Component C must be held below 0.6%, and component D must be held below 0.75%; failure to simultaneously meet both criteria has resulted in expensive product blending. An engineer at the plant suspected that the actions taken to reduce one side product would increase the likelihood of the other being formed. To find evidence to support this theory, she obtained sample analyses and created the time series plot and scatter plot in Figure 6. Based on these graphs, she looked at the reaction stoichiometry and explored ways to manipulate the reaction more advantageously.

[Figure 6. A scatter plot (right) is useful for identifying correlation between two output variables that is not revealed on a time series plot (left). The left panel shows components C and D plotted against time; the right panel plots component D against component C. The annotated correlation coefficient, Σ(x − x_avg)(y − y_avg) / √[Σ(x − x_avg)² Σ(y − y_avg)²], is 0.71.]

It is also common to plot inputs vs. outputs to search for cause. A chemical facility made its product in campaigns that typically ran about nine hours. Before the start of each campaign, all the process equipment was cleaned; when possible, this cleaning was scheduled for the night shift. Experienced operators on the day shift would start up the equipment. At the beginning of a campaign, product quality was excellent, but the quality dropped off as the campaign progressed. The campaigns were so short that intermediate cleanings were rarely performed. A new engineer was not aware of this history. When told about the problem, he became curious about the effect of ambient temperature on product quality and created the plot in Figure 7. There appeared to be a correlation between the ambient temperature and product quality, and an investigation was initiated to determine whether better insulation might eliminate the degradation occurring throughout the day.

[Figure 7. A scatter plot can also reveal correlation between an input and an output. Product quality is plotted against ambient temperature.]

Keep in mind that correlation does not necessarily indicate causation.

Fundamental multivariate statistics and statistical methods
When both x and y are vectors obtained by taking samples repeatedly over time, multivariate methods are needed to analyze the data. The simplest multivariate methods are interaction plots, disaggregation, and dimensionality-reducing statistics derived from principal component analysis (PCA).

Tool 4: Multivariate analysis
When hundreds of variables are being recorded, it is virtually impossible to monitor or evaluate each variable individually. But if two variables are perfectly correlated, it is not necessary to report both, because knowledge of one provides information about the other. Multivariate methods use mathematical techniques to identify redundant or partially redundant information, reducing the number of variables and highlighting the most important data. Multivariate methods are used for two primary reasons:
• to summarize information contained in a large number of variables
• to consider the ratio between variables.
Many of the above-mentioned techniques are incorporated into spreadsheet programs, but the multivariate techniques discussed here may require specialized software and training.

The measurements that best reflect the dynamics of a process are not always readily apparent, so it is common to measure whatever can be measured, often more than necessary. This produces data sets that are large and confusing. PCA is a technique for extracting information from these data and identifying the most meaningful basis to re-express the data set (4).

PCA involves linear combining of many possibly correlated variables into a smaller number of uncorrelated orthogonal variables called principal components, and computing eigenvectors and eigenvalues for the correlation matrix (or some equivalent matrix). The eigenvectors are ordered so that the first, or principal, vector best explains the variance in the data, the second component best explains the remaining variation, and so on. A small number of these artificial variables will account for most of the variance in the observed variables. In this way, the new basis filters out the noise and reveals the true internal structure of the data. Plotting just the first two or three principal components against one another and over time produces a manageable number of plots that contain the most important information.
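The eigen-decomposition itself is compact, although, as noted above, practical use usually relies on specialized software. The following is a minimal NumPy sketch under the assumption of a complete, numeric data matrix; the array here is random placeholder data, not the plant data described below:

```python
import numpy as np

def pca_scores(X, n_components=2):
    """PCA via the correlation matrix: standardize the columns, then project
    onto the leading eigenvectors. X is an (observations x variables) array."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    corr = np.corrcoef(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)      # eigh: for symmetric matrices
    order = np.argsort(eigvals)[::-1]            # largest eigenvalue (variance) first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Z @ eigvecs[:, :n_components]       # principal-component scores
    explained = eigvals[:n_components] / eigvals.sum()
    return scores, explained

# Placeholder data shaped like one day of the example that follows: 1,440 minutes x 56 variables.
rng = np.random.default_rng(1)
X = rng.normal(size=(1440, 56))
scores, explained = pca_scores(X)
print("fraction of variance explained by the first two components:", np.round(explained, 3))
# Plot scores[:, 0] and scores[:, 1] against time to produce run plots like Figure 8,
# and against each other to look for changes in the relationships among variables.
```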

A medium-size chemical plant records minute-by-minute data from multiple valves, steam pressure gages, and temperature sensors; each day, more than 80,000 measurements (56 variables per minute over 1,440 minutes) are made. Three days' worth of data are summarized using PCA and used to develop the run plot in Figure 8. Toward the end of the second day, the statistic drifted up with seemingly increasing frequency. This was eventually traced to a valve that was sticking, which caused a temperature spike that propagated through numerous variables. Fixing the valve improved the output and might have warded off a very expensive rebuild.

[Figure 8. A plot of a principal component over time can be used to hone in on the most important information. The statistic (labeled T2) is plotted minute by minute over three days.]

An important caution is that multivariate methods consider variation in all variables equally and typically do not weight the variables according to their importance. In addition to multivariate methods, consider prioritizing the variables.

Plotting the relationships among the principal components can show when a relationship changes. For example, a batch process employed an agitator that changed speed throughout the course of a batch and across different batches depending on batch size. Thus, the changes in the measured agitation rate and in the power to the agitator depicted in Figure 9 were expected, but the plot of the second principal component in Figure 10, which is the relationship between the variables, reveals a clear change in the relationship. This led an engineer to evaluate various relationships, including the one between power and agitation rate in Figure 11, which indicates that the process requires more power to achieve the same agitation rate. The subsequent investigation considered various causes, including a worn agitator motor, incorrect feeds, and sludge buildup in the tank.

[Figure 9. Separate plots show variation in agitation rate and power over time.]

[Figure 10. The behavior in Figure 9 is clarified by plotting the second principal component over time.]

[Figure 11. Plotting the agitation rate vs. power relationship reveals not only that power and rate vary, but also that the power consumption is increasing.]

In conclusion
When you consider data to validate suspicions or look for clues, the principles discussed in this article will improve your ability to find and trust information. Table 1 summarizes some statistical tools that support these principles. These analysis techniques often get lost in the hectic day-to-day business, but using data appropriately is critical. CEP

Literature Cited
1. Wheeler, D., "EMP (Evaluating the Measurement Process) III: Using Imperfect Data," SPC Press, Knoxville, TN (2006).
2. Shewhart, W., "Economic Control of Quality of Manufactured Product," Macmillan, New York, NY (1931).
3. Deming, W. E., "Out of the Crisis," MIT Press, Cambridge, MA (1986).
4. MacGregor, J. F., "Using On-Line Process Data to Improve Quality," ASQC Statistics Div. Newsletter, 16 (2), pp. 6–13 (Winter 1996).

TONY COOPER, PhD, (tony.cooper@mac.com) provides manufacturing and design project consulting, with an emphasis on the efficient use of data to support appropriate decisions in process management, continuous improvement, engineering, and design. He was a founder of Six Sigma Associates and is a faculty member of the Center for Executive Education at the Univ. of Tennessee. He has taught and supported projects at Cytec Industries, Whirlpool, Allied Signal Aerospace, PPG, and Remington. He received a BS in chemical engineering from Rensselaer Polytechnic Institute and a PhD in management science with a concentration in statistical applications from the Univ. of Tennessee in 1996. He is a member of AIChE.

Table 1. Various statistical tools are available for characterizing data.

Average
• The average estimates the center of the data
• Time series is hidden (process should be stable)

Standard Deviation
• Easy to calculate; good for summarizing
• Averages should be accompanied by a measure of variation or spread (standard deviation)
• Not sensitive when aggregated across multiple sources

Correlation
• Scatter plots are more powerful than summary correlation coefficients
• Correlation is not necessarily causation

Time Series or Run Plot
• Usually the best way to take a first look at any data
• Individual and moving range control limits can provide additional information

Mean and Range Charts
• Assess the consistency of the sources of variation
• Data must be in subgroups
• Focus attention on the key sources of variation

Multivariate Techniques
• Can sort through large databases
• Consider the ratios between variables
• Use math (rather than engineering criteria) to identify important information

