Abstract— Autoassociative neural networks (ANNs) have been proposed as a nonlinear extension of principal component analysis (PCA), which is commonly used to identify linear variation patterns in high-dimensional data. While principal component scores represent uncorrelated features, standard backpropagation methods for training ANNs provide no guarantee of producing distinct features, which is important for interpretability and for discovering the nature of the variation patterns in the data. Here, we present an alternating nonlinear PCA method, which encourages learning of distinct features in ANNs. A new measure motivated by the condition of orthogonal loadings in PCA is proposed for measuring the extent to which the nonlinear principal components represent distinct variation patterns. We demonstrate the effectiveness of our method using a simulated point cloud data set as well as a subset of the MNIST handwritten digits data. The results show that standard ANNs consistently mix the true variation sources in the low-dimensional representation learned by the model, whereas our alternating method produces solutions where the patterns are better separated in the low-dimensional space.

Index Terms— Autoassociative neural network (ANN), distinct features, nonlinear principal component analysis (NLPCA), tangent vector, variation pattern.

Manuscript received December 30, 2015; revised June 27, 2016; accepted September 26, 2016. This work was supported by the National Science Foundation under Grant 1265713.
P. Howard and G. Runger are with the School of Computing, Informatics, and Decision Systems Engineering, Arizona State University, Tempe, AZ 85281 USA (e-mail: prhoward@asu.edu; George.Runger@asu.edu).
D. W. Apley is with the Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, IL 60208 USA (e-mail: apley@northwestern.edu).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TNNLS.2016.2616145
2162-237X © 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

I. INTRODUCTION

DISCOVERING nonlinear variation patterns is an important problem in many applications involving high-dimensional data. For example, a manufacturing engineer may wish to understand the nature of part-to-part variation using profile quality control data for manufactured components. Such data often arise from optical coordinate measuring machines, where a laser beam is fanned into a plane, projected onto a part as a stripe, and scanned across the part to produce a 3-D point cloud. The number of measurements in the resulting profile is much larger than the number of variation patterns affecting the manufacturing process that produced it. Thus, reducing the dimensionality of the observed data to the number of underlying variation patterns can potentially shed light on the nature of the variation patterns themselves by analyzing the relationship between the reduced feature space and the data.

PCA is frequently used to extract low-dimensional features from high-dimensional data sources. Because the principal components are orthogonal, they can be used to identify uncorrelated linear variation patterns in high-dimensional data by interpreting each principal component as a source of variation. The order of the principal components further facilitates variation pattern analysis, because it conveys information on the amount of variation in the original data explained by each principal component. However, PCA is a linear method that fails to accurately characterize the underlying variation patterns when they are nonlinear in nature.

Our objective is to identify and visualize nonlinear variation sources in multivariate data. The examples presented in this paper involve spatially arranged multivariate data, but the methodology can be applied more generally. The nonlinear multivariate data have the functional form x_i = f(v_i) + w_i, where x_i is the ith instance for i = 1, 2, ..., N. In a manufacturing context, each data instance x_i could represent a profile of M discrete height measurements taken along the surface of a part, which is denoted by x_i = [x_{i,1}, x_{i,2}, ..., x_{i,M}]^T. Underlying each instance x_i are P variation patterns represented by the vector v_i = [v_{i,1}, v_{i,2}, ..., v_{i,P}]^T. These P variation patterns are mapped to the M-dimensional space of the height measurements by the vector of M common nonlinear functions f(v_i) = [f_1(v_i), f_2(v_i), ..., f_M(v_i)]^T : ℝ^P → ℝ^M. The ith instance x_i is also corrupted by a noise component w_i = [w_{i,1}, w_{i,2}, ..., w_{i,M}]^T. Learning the underlying nonlinear variation patterns corresponds to identifying v_i and the functions f(v_i) using only the observed instances x_i.

The model that we use to identify and visualize the nonlinear variation sources is an autoassociative neural network (ANN), which has been proposed as a form of nonlinear principal component analysis (NLPCA) [1]. These models are neural networks that aim to reconstruct their input after passing hidden node activations through a bottleneck layer with far fewer nodes than the dimensionality of the original data, thereby forcing data compression. The hidden nodes in the bottleneck layer are referred to as bottleneck nodes, or alternatively as the (nonlinear) components of the model. An important issue with NLPCA methods is that often multiple bottleneck nodes align with the same variation pattern early in
training [1], resulting in a solution where each of the extracted features represents a combination of the true variation sources rather than distinct sources of variation. Existing approaches for NLPCA models fail to address the task of learning a solution where each of the nonlinear components represents a single distinct source of variation in the data.

We present a new ANN learning methodology called alternating NLPCA (ANLPCA), which aims to address this deficiency. Unlike NLPCA, ANLPCA uses an alternating penalty term during backpropagation training to guide the model toward a solution where the extracted patterns are distinct and well separated among the bottleneck nodes. Experimental results indicate that the nonlinear features extracted from the data using our ANLPCA methodology are more interpretable and better facilitate visualizing each of the true variation sources individually.

The remainder of this paper is organized as follows. In Section II, we review literature related to our task of nonlinear dimensionality reduction. Our ANLPCA method is detailed in Section III, along with a new metric that we propose for measuring the separation of variation sources among learned features. Section IV presents experimental results from applying ANLPCA to simulated point cloud data and the MNIST handwritten digits data set. Concluding remarks and a discussion of future work are provided in Section V.

II. RELATED WORK

Kernel PCA (KPCA) is an extension of linear PCA, where the data vectors x_i existing in M-dimensional space are mapped to some N-dimensional feature space (N ≫ M typically, and N may be infinite) via a map φ(x) = [φ_1(x), φ_2(x), ..., φ_N(x)]^T, where each φ_n(x) is some scalar-valued nonlinear function of x. PCA is then performed on the set of feature vectors {φ(x_1), φ(x_2), ..., φ(x_N)}. The feature map is defined implicitly in practice by some kernel function for computational efficiency [2]. Because KPCA can remove some noise in data, it can be useful for detecting underlying variation patterns, and it has been applied to variation pattern analysis previously [3], [4]. However, it does not provide a parametric representation, nor even an explicit representation, of the patterns (only a collection of denoised data points), and therefore, it is not ideally suited for our task of visualizing the nature of the variation patterns.

Blind source separation (BSS) methods, such as independent component analysis (ICA), are related to our task of identifying distinct variation patterns. Whereas PCA seeks uncorrelated components that are efficient for representing the original data, ICA searches for a solution that minimizes the statistical dependence between the components [5]. However, traditional ICA methods only discover linear patterns, making them unsuitable for our task of blindly identifying nonlinear variation patterns. BSS methods for the case of linear patterns have been proposed with applications to manufacturing variability analysis in [6] and [7]. Nonlinear methods for BSS have received less attention; unlike the linear case, nonlinear ICA solutions are highly nonunique and differ greatly from the solution to the much harder nonlinear BSS problem, because arbitrary functions of a pair of independent random variables are also independent [8]. Because our objective is the visualization of the true variation sources underlying the observed data, a nonlinear ICA solution is insufficient for extracting the features we seek, because the original variation patterns can be mixed among the extracted features while still being statistically independent.

NLPCA is a nonlinear extension of PCA, which uses an ANN to learn a compact representation of high-dimensional data [1]. The neural network takes the original data as input and is trained to reproduce this input at its output layer, which has the same number of nodes as the dimensionality of the original data. The network typically has a total of four layers (excluding the input), consisting of three hidden layers and the output layer. An example of this network structure is shown in Fig. 1. The second hidden layer is referred to as the bottleneck layer and has fewer hidden nodes than the dimensionality of the original data, forcing the network to learn a more compact representation of the data by encoding it to the units in the bottleneck layer. The network is trained to minimize the sum of squared errors (SSE) in reconstructing the original data at the output layer. Unlike linear PCA, the training process for NLPCA does not guarantee that the nonlinear components represented by the bottleneck layer will be uncorrelated (or distinct in any manner), and the nonlinear components do not have any intrinsic ordering with respect to the amount of variation they characterize.

Several variants of NLPCA have been proposed. Circular PCA implements NLPCA where the units in the bottleneck layer are constrained to lie on the unit circle so that a pair of bottleneck nodes can be described by a single angle parameter [9]. Inverse NLPCA learns only the decoding portion of the network (i.e., the bottleneck layer through the output layer) by treating the bottleneck node values themselves as model parameters to be learned during backpropagation, in addition to the weights [10]. Deep autoencoders are ANNs that have more hidden layers than the standard NLPCA model, but that still have a single bottleneck layer in the middle of the network, which forces compression of the data [11]. Unsupervised pretraining of the autoencoder is used prior to backpropagation to increase the speed of learning for deep network architectures. Pretrained deep autoencoders have been successfully applied to learning efficient low-dimensional representations of nonlinear profile data, outperforming alternative methods based on reindexing and manifold learning [12]. However, none of these methods specifically considers the problem of ensuring that the learned components each represent distinct variation patterns in the original data.

The hierarchical nonlinear principal component network (HNPCN) is another variant of NLPCA, which seeks a solution in which a hierarchical ordering of the components exists, similar to the eigenvalue ordering of principal components in linear PCA [13], [14]. The method proposes training multiple subnetworks sequentially, where each additional subnetwork contains one more unit in the bottleneck layer than the previous networks, and all but one of the bottleneck layer nodes obtains its value from a previous subnetwork. While this provides some order to the learned features, it requires the training of P networks (where P is
HOWARD et al.: DISTINCT VARIATION PATTERN DISCOVERY USING ALTERNATING NONLINEAR PCA 3
depending on the current epoch of training, where the training epoch is denoted by r. We define

$$
J_r(\mathbf{W}) =
\begin{cases}
\displaystyle\sum_{j=1}^{n_3} (w_{3,1j} - \bar{w}_{3,1j})^2 + \sum_{j=1}^{n_1} (w_{2,j1} - \bar{w}_{2,j1})^2, & \text{if } r \in \Omega_1 \\[2ex]
\displaystyle\sum_{j=1}^{n_3} (w_{3,2j} - \bar{w}_{3,2j})^2 + \sum_{j=1}^{n_1} (w_{2,j2} - \bar{w}_{2,j2})^2, & \text{if } r \in \Omega_2
\end{cases}
\tag{2}
$$

where

$$
\Omega_1 = \{r \mid r \le R \ \vee\ [r > 2R \wedge r\ (\mathrm{mod}\ 2) = 0]\}
$$
$$
\Omega_2 = \{r \mid R < r \le 2R \ \vee\ [r > 2R \wedge r\ (\mathrm{mod}\ 2) = 1]\}. \tag{3}
$$

In (3), R is a user-specified parameter for which we provide selection guidelines in Section IV. The value of w̄_{k,ij} is set equal to the updated weight w_{k,ij} obtained from the previous iteration of backpropagation. That is, if w_{k,ij}^{(r)} denotes the new value of w_{k,ij} after the rth backpropagation iteration, then in iteration r + 1, we set w̄_{k,ij} = w_{k,ij}^{(r)}. The parameter λ in (1) is set equal to a large value, which can be tuned using cross validation.

We use conjugate gradient descent to iteratively minimize the objective function in (1). The first term in (1) is simply the squared error loss function that is commonly used when training ANNs. The other term J_r(W) represents a constraint imposed on the objective function by our ANLPCA method to arrive at a solution where each node in the bottleneck layer represents a distinct variation pattern, as explained in the following.

The objective function term J_r(W) only affects the weights entering into and leaving the bottleneck layer, w_{2,ij} and w_{3,ij}. Thus, the weight updates for w_{1,ij} and w_{4,ij} in the first and fourth layers are the same as in regular NLPCA. The first definition of J_r(W) in (2), for r ∈ Ω_1, only involves the weights coming into and out of v_1 in the bottleneck layer, and the relevant partial derivatives for this term are

$$
\frac{\partial J_r(\mathbf{W})}{\partial w_{3,1j}} = 2(w_{3,1j} - \bar{w}_{3,1j})
$$
$$
\frac{\partial J_r(\mathbf{W})}{\partial w_{2,j1}} = 2(w_{2,j1} - \bar{w}_{2,j1}). \tag{4}
$$

The second definition of J_r(W) in (2), for r ∈ Ω_2, similarly involves only the weights entering and leaving v_2. Thus, the relevant partial derivatives for the second definition are

$$
\frac{\partial J_r(\mathbf{W})}{\partial w_{3,2j}} = 2(w_{3,2j} - \bar{w}_{3,2j})
$$
$$
\frac{\partial J_r(\mathbf{W})}{\partial w_{2,j2}} = 2(w_{2,j2} - \bar{w}_{2,j2}). \tag{5}
$$

The effect of J_r(W) on the weight updates is, therefore, to add the terms in (4) and (5) to the squared error loss gradients for the relevant weights entering and leaving the bottleneck layer. Alternating the definition of the J_r(W) term based on the training epoch results in only updating the weights connected to a single bottleneck node in any given training epoch. The weights connected to the other bottleneck node remain approximately constant due to the active penalty on deviations from the prior iteration's weights for that node. Allowing only the weights connected to v_1 to be updated for the first R epochs allows the error associated with the largest variation in the data to be learned primarily by v_1. After the largest variation source has been learned, the penalty alternates and encourages the v_1 weights to remain constant, while v_2 learns the next largest variation source. Alternating the penalty between the nodes after 2R epochs have elapsed allows the model to continue updating the weights on each node independently to fine-tune the model, which is otherwise not possible if we try to learn the weights for a given bottleneck node entirely before moving on to train the next node.

B. Tangent Vector Cosine Similarity Metric

After training an ANN, the SSE represented by the first term in (1) is typically calculated on a test data set in order to provide a generalized estimate of the quality of fit produced by the model. This reconstruction error is important for our task of visualizing the variation patterns, because it indicates how well the actual data are modeled. However, we also need to evaluate a solution based on the distinctness of the variation patterns modeled by each of the bottleneck nodes, which requires another metric in addition to the standard SSE.

To measure the degree to which the bottleneck nodes v_1 and v_2 capture distinct variation patterns in the original data, we introduce a metric called the tangent vector cosine similarity (TVCS). TVCS has similarities with the concept of tangent distance, which is a transformation-invariant distance measure used in image recognition [17]; however, the tangents are used in a different role here.

After training the neural network, each of the i = 1, ..., N instances in our data has distinct bottleneck node values v_i = [v_{i,1}, v_{i,2}]^T in addition to a reconstructed vector f(v_i) = x̂_i, where f(v_i) : ℝ^P → ℝ^M denotes the nonlinear functional mapping of v_i to the original M-dimensional space produced by the decoder portion of the network (i.e., the weights and activation functions from the bottleneck layer to the output layer). The functional mapping f(v_i) for i ∈ {1, ..., N} defines a P-dimensional manifold in the M-dimensional space of the original data describing the possible reconstructions that can be produced by the model.

Let v̄_1 and v̄_2 denote the midpoints of the ranges of these bottleneck node values evaluated across the N instances after training. If we fix v_2 = v̄_2, the collection of points f([v_{i,1}, v̄_2]^T) for i = 1, ..., N represents a path along the manifold defined by v_1. Similarly, the points f([v̄_1, v_{i,2}]^T) for i = 1, ..., N represent a different path along the manifold, which characterizes the effect the second bottleneck node v_2 exerts on the reconstructions. For the purpose of measuring the distinctness of the patterns represented by v_1 and v_2, we approximate these paths by their tangent vectors T_1 and T_2 to the manifold at the midpoint bottleneck values v̄ = [v̄_1, v̄_2]^T
using the method of finite differences. Letting δ denote a small change in the bottleneck node values, T_1 and T_2 are specifically calculated as

$$
T_1 = \frac{f\left(\left[\bar{v}_1 + \tfrac{1}{2}\delta,\ \bar{v}_2\right]^T\right) - f\left(\left[\bar{v}_1 - \tfrac{1}{2}\delta,\ \bar{v}_2\right]^T\right)}{\delta}
$$
$$
T_2 = \frac{f\left(\left[\bar{v}_1,\ \bar{v}_2 + \tfrac{1}{2}\delta\right]^T\right) - f\left(\left[\bar{v}_1,\ \bar{v}_2 - \tfrac{1}{2}\delta\right]^T\right)}{\delta}. \tag{6}
$$

We approximate T_1 and T_2 at the midpoint bottleneck node values to characterize the variation patterns in the middle of their observed ranges, which is where the visualization is expected to be of greatest interest. While T_1 and T_2 could be calculated across all points on the manifold if visualization of the patterns in other regions is desired, our objective here is just to illustrate the usefulness of TVCS for identifying distinct feature solutions at the point of visualization.

Because the tangent vectors T_1 and T_2 represent linear approximations of the directions in the manifold defined by the bottleneck nodes v_1 and v_2, the orthogonality of these vectors indicates that the bottleneck nodes have learned variation patterns in different directions of the original M-dimensional space. This is desirable, since we want the nodes to characterize different information about the variation sources in the data. It is also similar to the orthogonality of loadings used in PCA to describe different sources of variation. Thus, we measure the cosine of the angle between the tangent vectors T_1 and T_2, which is close to zero when the tangent vectors have this desirable characteristic of being nearly orthogonal and increases in absolute value toward one as they become less orthogonal. We define

$$
\mathrm{TVCS}_1 = \frac{T_1 \cdot T_2}{\|T_1\|\,\|T_2\|} \tag{7}
$$

where T_1 · T_2 denotes the vector dot product of T_1 and T_2, and ‖T‖ denotes the Euclidean norm of vector T.

C. Enhancements to TVCS

The orthogonality of tangent vectors on the manifold is analogous to the condition of orthogonal loadings in PCA. ANN models with a small TVCS, therefore, share the intuitive property of having components that characterize orthogonal directions in the original feature space, as in PCA. However, the orthogonality of the tangent vectors T_1 and T_2 is not a sufficient condition for the variation patterns to be well separated between the bottleneck nodes v_1 and v_2. This is because the true variation patterns can be completely mixed between v_1 and v_2 while still producing patterns that are independent in the M-dimensional space of the observed data [8], as discussed in Section II.

However, we can improve the usefulness of TVCS for measuring source separation if we have prior information about one of the nonlinear variation patterns present in the data. Specifically, if we know the direction in the original M-dimensional space primarily impacted by one of the true variation patterns, reflecting one of the tangent vectors about this direction reveals if the variation sources have been mixed between the bottleneck nodes. We therefore calculate the following second TVCS measure:

$$
\mathrm{TVCS}_2 = \frac{T_1 \cdot T_2^R}{\|T_1\|\,\|T_2^R\|} \tag{8}
$$

where T_2^R is the tangent vector to the manifold defined by the second bottleneck node v_2 after each of the reconstructed instances f([v̄_1, v_{i,2}]^T) has been reflected about the direction in the M-dimensional space impacted by the known variation source. The final measure incorporating the prior information about one of the known variation sources is defined as TVCS = max(TVCS_1, TVCS_2).

To illustrate, consider the reconstructed instances of our simulated point cloud data shown in Fig. 2. There are two distinct variation patterns simulated into the data, with additional small Gaussian noise added to each instance. Each instance is a 2500-D point cloud representing a bowl-shaped object in the 3-D space. One of the variation patterns acting on this object is a flattening pattern, which expands the object symmetrically along the x- and y-axes while flattening its height in the z dimension. The second variation pattern, independent of the first, shifts the object along the horizontal axis.

Fig. 2 shows the shaded surface plots of the object viewed from above (i.e., looking down from the top of the z-axis, as in a contour plot). These plots were obtained from an ANLPCA solution in which v_1 accurately characterizes the first variation pattern and v_2 characterizes the second pattern. The top three plots (from left to right) in Fig. 2 illustrate how the object flattens as the value of v_1 decreases and v_2 remains constant at its midpoint. The bottom three plots illustrate the shifting of the object from left to right along the horizontal axis as the value of v_2 increases and v_1 remains fixed at its midpoint. The tangent vectors T_1, T_2, and T_2^R for this ANLPCA solution are shown in Fig. 3.

Now consider the regular NLPCA solution with random starting weights for this same data shown in Fig. 4. These plots represent a mapping of the ℝ^2500 space of the model's reconstructions to a 50×50 grid corresponding to the physical space in which the object resides (hereafter referred to as the 'visualization space'). For a given model reconstruction f(v) produced by the bottleneck node values v = [v_1, v_2], we denote the (i, j) coordinate pair of the reconstruction in the visualization space as g_v(i, j) for i ∈ {1, ..., 50}, j ∈ {1, ..., 50}. Note that in the NLPCA solution shown in Fig. 4, v_1 represents both the flattening pattern and shifting the object to the right along the horizontal axis of the visualization space, as depicted by the top three plots of Fig. 4. The bottom three plots show that v_2 has also learned a combination of these two variation patterns, where the reconstructions are produced over a different region of the visualization space.

The fact that each bottleneck node has learned a combination of the two true variation patterns is revealed by how the reconstructions are identical after reflecting the patterns produced by one of the bottleneck nodes around the vertical axis in the reconstruction space. Specifically, let v = [v̄_1 + δ, v̄_2] and v′ = [v̄_1, v̄_2 − δ] represent two sets of bottleneck node values, where δ is a small deviation from one of the elements
Fig. 2. Example of ANLPCA solution on simulated point cloud data where v 1 and v 2 represent distinct patterns.
in each set. Fig. 4 shows how g_v(i, j) ≈ g_{v′}(50 − i, j), which defines approximate equivalence after reflecting the reconstruction associated with v_2 around the vertical axis of the visualization space at i = 25 on the horizontal axis.

Fig. 4. Example of regular NLPCA solution on simulated point cloud data, where v_1 and v_2 each represent a mix of the two distinct patterns.

The tangent vectors for this NLPCA solution are shown in Fig. 5 and illustrate that, despite the variation patterns being mixed among v_1 and v_2, the tangent vectors T_1 and T_2 are still nearly orthogonal, because they point in different directions of the 2500-D feature space, resulting in a small value of TVCS_1. However, when the reconstructed values for one of the bottleneck nodes are reflected around the vertical axis of the visualization space, the reflected tangent vector T_2^R is approximately equal to the negative of T_1, yielding a value of TVCS_2 that is near 1. Thus, taking the maximum of TVCS_1 and TVCS_2 yields a large value near 1 when the bottleneck nodes model a combination of the variation patterns, and it produces a small value near 0 when they each represent distinct patterns.

This formulation of the TVCS measure relies on prior knowledge that one of the variation patterns primarily affects the object in the direction of the horizontal axis of the visualization space. However, different rotations of the visualization space could be used in other applications where similar prior knowledge exists, or TVCS_1 could simply be used to approximate the measure if no prior knowledge is available. Note that the prior knowledge is only used for calculating the TVCS metric to numerically compare our ANLPCA method to regular NLPCA with respect to the separation of the variation patterns; this information is not used by the ANLPCA method, making it applicable to scenarios where no prior information about the variation sources is known.

D. Strategy for P > 2 Bottleneck Nodes

have elapsed, the weight penalties would alternate across the bottleneck nodes after each subsequent epoch. The TVCS metric could be evaluated on each pairwise combination of bottleneck node tangent vectors and averaged to produce a single measure of feature distinctiveness.

$$
\text{Average SSE} = \frac{1}{N} \sum_{i=1}^{N} \sum_{m=1}^{M} (x_{i,m} - \hat{x}_{i,m})^2. \tag{9}
$$
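The average SSE in (9) amounts to summing the squared reconstruction error over the M dimensions of each instance and averaging over the N instances. A minimal sketch (the array names are illustrative, not from the paper), assuming the instances and their reconstructions are stacked row-wise:

```python
import numpy as np

def average_sse(X, X_hat):
    """Average SSE of (9): squared reconstruction error summed over
    the M dimensions of each instance, then averaged over N instances."""
    X = np.asarray(X, dtype=float)
    X_hat = np.asarray(X_hat, dtype=float)
    # axis=1 sums over m = 1..M; np.mean then averages over i = 1..N.
    return np.mean(np.sum((X - X_hat) ** 2, axis=1))

# Toy check with N = 2 instances and M = 3 dimensions.
X = np.array([[1.0, 2.0, 3.0],
              [0.0, 0.0, 0.0]])
X_hat = np.array([[1.0, 2.0, 2.0],   # one unit of error in one dimension
                  [0.0, 1.0, 0.0]])  # one unit of error in one dimension
print(average_sse(X, X_hat))  # → 1.0
```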
Fig. 9. Example of ANLPCA solution on MNIST data where v 1 and v 2 each represent the distinct variation patterns of rotation and bottom curvature.
Fig. 10. Example of NLPCA solution on MNIST data where v 1 and v 2 each represent a mix of the two distinct variation patterns.
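The finite-difference tangent vectors in (6) and the cosine measure in (7) require only a trained decoder. The sketch below substitutes a hypothetical toy decoder for the network's decoding layers, and the names `tangent`, `tvcs1`, and `decoder` are illustrative, not from the paper:

```python
import numpy as np

def tangent(f, v_bar, axis, delta=1e-3):
    """Central finite difference of the decoder f along one bottleneck
    node, as in (6): T = [f(v + e) - f(v - e)] / delta, e = (delta/2) unit step."""
    e = np.zeros_like(v_bar)
    e[axis] = delta / 2.0
    return (f(v_bar + e) - f(v_bar - e)) / delta

def tvcs1(f, v_bar, delta=1e-3):
    """TVCS_1 of (7): cosine of the angle between T1 and T2 at v_bar."""
    T1 = tangent(f, v_bar, axis=0, delta=delta)
    T2 = tangent(f, v_bar, axis=1, delta=delta)
    return np.dot(T1, T2) / (np.linalg.norm(T1) * np.linalg.norm(T2))

# Toy decoder with well-separated patterns: v1 and v2 drive disjoint
# output dimensions, so their tangent vectors are orthogonal.
def decoder(v):
    return np.array([np.sin(v[0]), v[0] ** 2, np.cos(v[1]), v[1]])

v_bar = np.array([0.3, -0.2])        # midpoint bottleneck values
print(round(tvcs1(decoder, v_bar), 6))  # → 0.0 (distinct patterns)
```

A decoder in which both bottleneck nodes drive the same output direction would instead give |TVCS_1| near 1, mirroring the mixed NLPCA solutions discussed in the text.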
chosen by monitoring this reduction in the reconstruction MSE during training and alternating the penalty after the 'elbow' point in the curve is reached. Alternatively, cross validation could be used to choose the value of R that minimizes the TVCS of the resulting model.

The TVCS values for the models trained on the simulated point cloud data are shown in Fig. 7. Each boxplot shows the distribution of TVCS measured across the 30 models trained using each learning method. ANLPCA with both 200 and 500 epochs produced substantially smaller TVCS values than regular NLPCA, and visualization of each of the trained ANLPCA models revealed that the v_1 and v_2 bottleneck nodes each modeled distinct variation patterns similar to the solution shown in Fig. 2. Conversely, many of the models trained with regular NLPCA yielded solutions similar to that shown in Fig. 4, where v_1 and v_2 each modeled a combination of the patterns. This is reflected in the much larger values of TVCS for regular NLPCA.

Fig. 8 shows the distribution of average SSE values across the 30 models trained on each method. Based only on average SSE, regular NLPCA with only 100 epochs outperforms ANLPCA with 200 epochs, although both solutions were considered acceptable from the perspective of having the reconstructed pattern resemble the originally observed pattern. Increasing the number of ANLPCA epochs to 500 provides average SSE values that are as good or better than the NLPCA
Our results using simulated point cloud data and the MNIST handwritten digits data show that the ANLPCA method consistently produces results with the desired separation of distinct variation patterns. We introduced the TVCS metric as a measure that can be used to quantify this distinctness characteristic of the solution. This will facilitate the comparison of our ANLPCA method to future work on learning distinct nonlinear variation patterns. In future research, we would like to explore different regularization methods that can be introduced to encourage distinctness of the variation patterns in ANN models.

ACKNOWLEDGMENT

The authors would like to thank the anonymous reviewers for their constructive feedback.

REFERENCES

[1] M. A. Kramer, "Nonlinear principal component analysis using autoassociative neural networks," AIChE J., vol. 37, no. 2, pp. 233–243, Feb. 1991.
[2] B. Schölkopf, A. Smola, and K.-R. Müller, "Nonlinear component analysis as a kernel eigenvalue problem," Neural Comput., vol. 10, no. 5, pp. 1299–1319, Jul. 1998.
[3] A. Sahu, D. W. Apley, and G. C. Runger, "Feature selection for noisy variation patterns using kernel principal component analysis," Knowledge-Based Syst., vol. 72, pp. 37–47, Dec. 2014.
[4] J.-K. Im, D. W. Apley, and G. C. Runger, "Tangent hyperplane kernel principal component analysis for denoising," IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 4, pp. 644–656, Apr. 2012.
[5] P. Comon, "Independent component analysis, a new concept?" Signal Process., vol. 36, no. 3, pp. 287–314, Apr. 1994.
[6] X. Shan and D. W. Apley, "Blind identification of manufacturing variation patterns by combining source separation criteria," Technometrics, vol. 50, no. 3, pp. 332–343, Jan. 2008.
[7] D. W. Apley and H. Y. Lee, "Identifying spatial variation patterns in multivariate manufacturing processes," Technometrics, vol. 45, no. 3, pp. 220–234, Jan. 2003.
[8] C. Jutten and J. Karhunen, "Advances in nonlinear blind source separation," in Proc. 4th Int. Symp. Independent Compon. Anal. Blind Signal Separat., Apr. 2003, pp. 245–256.
[9] M. J. Kirby and R. Miranda, "Circular nodes in neural networks," Neural Comput., vol. 8, no. 2, pp. 390–402, Feb. 1996.
[10] M. Scholz, M. Fraunholz, and J. Selbig, "Nonlinear principal component analysis: Neural network models and applications," in Principal Manifolds for Data Visualization and Dimension Reduction. New York, NY, USA: Springer, 2008, pp. 44–67.
[11] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of

Phillip Howard received the Ph.D. degree in industrial engineering from Arizona State University, Tempe, AZ, USA, in 2016, with a focus on machine learning.
He is currently a Data Scientist with Intel Corporation, Chandler, AZ, USA, where he is involved in enabling supply chain intelligence and analytics capabilities using statistics and machine learning methods. His current research interests include applications of deep learning for nonlinear dimensionality reduction, cognitive computing solutions for supply chain monitoring, and interpretable feature learning from heterogeneous data sources in health care applications.

Daniel W. Apley received the B.S., M.S., and Ph.D. degrees in mechanical engineering and the M.S. degree in electrical engineering from the University of Michigan, Ann Arbor, MI, USA.
He is currently a Professor of Industrial Engineering and Management Sciences with Northwestern University, Evanston, IL, USA. His current research interests include the interface of engineering modeling, statistical analysis, and predictive analytics, with a focus on enterprise process modeling and manufacturing variation reduction applications in which large amounts of data are available. His research has been supported by numerous industries and government agencies.
Dr. Apley received the NSF CAREER Award in 2001, the IIE Transactions Best Paper Award in 2003, and the Wilcoxon Prize for best practical application paper appearing in Technometrics in 2008. He is also the Editor-in-Chief of Technometrics and has served as Editor-in-Chief for the Journal of Quality Technology, the Chair of the Quality, Statistics and Reliability Section of INFORMS, and the Director of the Manufacturing and Design Engineering Program at Northwestern.

George Runger received the Ph.D. degree in statis-
data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, tics from the University of Minnesota in 1982.
2006. He was a Senior Engineer and a Technical Leader
[12] P. Howard, D. Apley, and G. Runger, “Identifying nonlinear variation for data analytics projects with IBM. He is currently
patterns with deep autoencoders,” IIE Trans., 2016. a Professor with the School of Computing, Infor-
[13] R. Saegusa, H. Sakano, and S. Hashimoto, “Nonlinear principal matics, and Decision Systems Engineering, Arizona
component analysis to preserve the order of principal components,” State University, Tempe, AZ, USA. He has authored
Neurocomputing, vol. 61, pp. 57–70, Oct. 2004. over 100 publications in research journals with fund-
[14] M. Scholz and R. Vigário, “Nonlinear PCA: A new hierarchical ing from federal and corporate sponsors. His current
approach,” in Proc. ESANN, Apr. 2002, pp. 439–444. research interests include analytical methods for
[15] A. Gifi, Nonlinear Multivariate Analysis. Hoboken, NJ, USA: Wiley, knowledge generation and data-driven improvements
1990. in organizations, and machine learning for large, complex data, and real-time
[16] M. Linting, J. J. Meulman, P. J. Groenen, and A. J. van der Koojj, analysis, with applications to surveillance, decision support, and population
“Nonlinear principal components analysis: Introduction and application,” health.
Psychol. Methods, vol. 12, no. 3, pp. 336–358, Sep. 2007. Dr. Runger was an affiliated Faculty Member with BMI and the related
[17] P. Simard, Y. LeCun, and J. S. Denker, “Efficient pattern recognition Center for Health Information and Research for several years. He is also a
using a new transformation distance,” in Proc. Adv. Neural Inf. Process. Reviewer for many journals in the area of machine learning and statistics
Syst., 1993, pp. 50–58. and the Department Editor for healthcare informatics for IIE Transactions on
[18] Y. LeCun and C. Cortes, The MNIST Database of Handwritten Digits, Healthcare Systems Engineering. He is also the Chair of the Department of
1998. Biomedical Informatics, Arizona State University.