

Distinct Variation Pattern Discovery Using Alternating Nonlinear Principal Component Analysis

Phillip Howard, Daniel W. Apley, and George Runger

Abstract— Autoassociative neural networks (ANNs) have been proposed as a nonlinear extension of principal component analysis (PCA), which is commonly used to identify linear variation patterns in high-dimensional data. While principal component scores represent uncorrelated features, standard backpropagation methods for training ANNs provide no guarantee of producing distinct features, which is important for interpretability and for discovering the nature of the variation patterns in the data. Here, we present an alternating nonlinear PCA method, which encourages learning of distinct features in ANNs. A new measure motivated by the condition of orthogonal loadings in PCA is proposed for measuring the extent to which the nonlinear principal components represent distinct variation patterns. We demonstrate the effectiveness of our method using a simulated point cloud data set as well as a subset of the MNIST handwritten digits data. The results show that standard ANNs consistently mix the true variation sources in the low-dimensional representation learned by the model, whereas our alternating method produces solutions where the patterns are better separated in the low-dimensional space.

Index Terms— Autoassociative neural network (ANN), distinct features, nonlinear principal component analysis (NLPCA), tangent vector, variation pattern.

Manuscript received December 30, 2015; revised June 27, 2016; accepted September 26, 2016. This work was supported by the National Science Foundation under Grant 1265713.
P. Howard and G. Runger are with the School of Computing, Informatics, and Decision Systems Engineering, Arizona State University, Tempe, AZ 85281 USA (e-mail: prhoward@asu.edu; George.Runger@asu.edu).
D. W. Apley is with the Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, IL 60208 USA (e-mail: apley@northwestern.edu).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TNNLS.2016.2616145

I. INTRODUCTION

Discovering nonlinear variation patterns is an important problem in many applications involving high-dimensional data. For example, a manufacturing engineer may wish to understand the nature of part-to-part variation using profile quality control data for manufactured components. Such data often arises from optical coordinate measuring machines, where a laser beam is fanned into a plane, projected onto a part as a stripe, and scanned across the part to produce a 3-D point cloud. The number of measurements in the resulting profile is much larger than the number of variation patterns affecting the manufacturing process that produced it. Thus, reducing the dimensionality of the observed data to the number of underlying variation patterns can potentially shed light on the nature of the variation patterns themselves by analyzing the relationship between the reduced feature space and the data.

PCA is frequently used to extract low-dimensional features from high-dimensional data sources. Because the principal components are orthogonal, they can be used to identify uncorrelated linear variation patterns in high-dimensional data by interpreting each principal component as a source of variation. The order of the principal components further facilitates variation pattern analysis, because it conveys information on the amount of variation in the original data explained by each principal component. However, PCA is a linear method that fails to accurately characterize the underlying variation patterns when they are nonlinear in nature.

Our objective is to identify and visualize nonlinear variation sources in multivariate data. The examples presented in this paper involve spatially arranged multivariate data, but the methodology can be applied more generally. The nonlinear multivariate data have the functional form x_i = f(v_i) + w_i, where x_i is the ith instance for i = 1, 2, ..., N. In a manufacturing context, each data instance x_i could represent a profile of M discrete height measurements taken along the surface of a part, which is denoted by x_i = [x_{i,1}, x_{i,2}, ..., x_{i,M}]^T. Underlying each instance x_i are P variation patterns represented by the vector v_i = [v_{i,1}, v_{i,2}, ..., v_{i,P}]^T. These P variation patterns are mapped to the M-dimensional space of the height measurements by the vector of M common nonlinear functions f(v_i) = [f_1(v_i), f_2(v_i), ..., f_M(v_i)]^T: R^P → R^M. The ith instance x_i is also corrupted by a noise component w_i = [w_{i,1}, w_{i,2}, ..., w_{i,M}]^T. Learning the underlying nonlinear variation patterns corresponds to identifying v_i and the functions f(v_i) using only the observed instances x_i.
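As a small illustration of this data model (and not the point cloud simulation used later in the paper), the sketch below draws instances x_i = f(v_i) + w_i from P = 2 latent sources through an arbitrary nonlinear map; the particular map, dimensions, and noise level are hypothetical choices made only for illustration.

```python
import numpy as np

def simulate_instances(N=1000, M=100, noise_sd=0.05, seed=0):
    """Draw N instances x_i = f(v_i) + w_i with P = 2 underlying variation sources."""
    rng = np.random.default_rng(seed)
    V = rng.uniform(-1.0, 1.0, size=(N, 2))       # latent sources v_i = [v_{i,1}, v_{i,2}]
    t = np.linspace(0.0, 1.0, M)                  # index of the M measurement locations
    # A hypothetical nonlinear map f: R^2 -> R^M mixing the two sources
    F = np.sin(2 * np.pi * np.outer(V[:, 0], t)) + np.outer(V[:, 1] ** 2, np.cos(2 * np.pi * t))
    W = rng.normal(0.0, noise_sd, size=(N, M))    # additive noise w_i
    return F + W, V
```

The goal of the methods discussed next is to recover something like V and the map f from the observed instances alone.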

The model that we use to identify and visualize the nonlinear variation sources is an autoassociative neural network (ANN), which has been proposed as a form of nonlinear principal component analysis (NLPCA) [1]. These models are neural networks that aim to reconstruct their input after passing hidden node activations through a bottleneck layer with far fewer nodes than the dimensionality of the original data, thereby forcing data compression. The hidden nodes in the bottleneck layer are referred to as bottleneck nodes, or alternatively as the (nonlinear) components of the model. An important issue with NLPCA methods is that often multiple bottleneck nodes align with the same variation pattern early in training [1], resulting in a solution where each of the extracted features represents a combination of the true variation sources rather than distinct sources of variation. Existing approaches for NLPCA models fail to address the task of learning a solution, where each of the nonlinear components represents a single distinct source of variation in the data.

We present a new ANN learning methodology called alternating NLPCA (ANLPCA), which aims to address this deficiency. Unlike NLPCA, ANLPCA uses an alternating penalty term during backpropagation training to guide the model toward a solution, where the extracted patterns are distinct and well separated among the bottleneck nodes. Experimental results indicate that the nonlinear features extracted from the data using our ANLPCA methodology are more interpretable and better facilitate visualizing each of the true variation sources individually.

The remainder of this paper is organized as follows. In Section II, we review literature related to our task of nonlinear dimensionality reduction. Our ANLPCA method is detailed in Section III, along with a new metric that we propose for measuring the separation of variation sources among learned features. Section IV presents experimental results from applying ANLPCA to simulated point cloud data and the MNIST handwritten digits data set. Concluding remarks and a discussion of future work are provided in Section V.

II. RELATED WORK

Kernel PCA (KPCA) is an extension of linear PCA, where the data vectors x_i existing in M-dimensional space are mapped to some N-dimensional feature space (N ≫ M typically, and N may be infinite) via a map φ(x) = [φ_1(x), φ_2(x), ..., φ_N(x)]^T, where each φ(x) is some scalar-valued nonlinear function of x. PCA is then performed on the set of feature vectors {φ(x_1), φ(x_2), ..., φ(x_N)}. The feature map is defined implicitly in practice by some kernel function for computational efficiency [2]. Because KPCA can remove some noise in data, it can be useful for detecting underlying variation patterns and it has been applied to variation pattern analysis previously [3], [4]. However, it does not provide a parametric representation, nor even an explicit representation, of the patterns (only a collection of denoised data points), and therefore, it is not ideally suited for our task of visualizing the nature of the variation patterns.

Blind source separation (BSS) methods, such as independent component analysis (ICA), are related to our task of identifying distinct variation patterns. Whereas PCA seeks uncorrelated components that are efficient for representing the original data, ICA searches for a solution, which minimizes the statistical dependence between the components [5]. However, traditional ICA methods only discover linear patterns, making them unsuitable for our task of blindly identifying nonlinear variation patterns. BSS methods for the case of linear patterns have been proposed with applications to manufacturing variability analysis in [6] and [7]. Nonlinear methods for BSS have received less attention; unlike the linear case, nonlinear ICA solutions are highly nonunique and differ greatly from the solution to the much harder nonlinear BSS problem, because arbitrary functions of a pair of independent random variables are also independent [8]. Because our objective is the visualization of the true variation sources underlying the observed data, a nonlinear ICA solution is insufficient for extracting the features we seek, because the original variation patterns can be mixed among the extracted features while still being statistically independent.

NLPCA is a nonlinear extension of PCA, which uses an ANN to learn a compact representation of high-dimensional data [1]. The neural network takes the original data as input and is trained to reproduce this input at its output layer, which has the same number of nodes as the dimensionality of the original data. The network typically has a total of four layers (excluding the input), which consists of three hidden layers and the output layer. An example of this network structure is shown in Fig. 1. The second hidden layer is referred to as the bottleneck layer and has fewer hidden nodes than the dimensionality of the original data, forcing the network to learn a more compact representation of the data by encoding it to the units in the bottleneck layer. The network is trained to minimize the sum of squared errors (SSE) in reconstructing the original data at the output layer. Unlike linear PCA, the training process for NLPCA does not guarantee that the nonlinear components represented by the bottleneck layer will be uncorrelated (or distinct in any manner), and the nonlinear components do not have any intrinsic ordering with respect to the amount of variation they characterize.
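To make the four-layer structure concrete, the sketch below gives a minimal NumPy forward pass for such an autoassociative network together with its SSE training criterion. The tanh activations, the linear bottleneck units, the omission of bias terms, and the layer sizes are illustrative assumptions on our part rather than choices prescribed by the methods discussed here.

```python
import numpy as np

def nlpca_forward(X, W1, W2, W3, W4):
    """Forward pass of a four-layer autoassociative network (input layer excluded).

    X  : (N, M) data matrix.
    W1 : (M, n1) input -> first hidden layer.   W2 : (n1, P) -> bottleneck layer.
    W3 : (P, n3) bottleneck -> third hidden layer.   W4 : (n3, M) -> output layer.
    Returns the bottleneck values (the nonlinear components) and the reconstruction.
    """
    H1 = np.tanh(X @ W1)      # first hidden layer
    V = H1 @ W2               # bottleneck layer (linear units assumed)
    H3 = np.tanh(V @ W3)      # third hidden layer
    X_hat = H3 @ W4           # output layer reproduces the input
    return V, X_hat

def reconstruction_sse(X, X_hat):
    # Training criterion: sum of squared errors between the data and its reconstruction
    return np.sum((X - X_hat) ** 2)
```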

Several variants of NLPCA have been proposed. Circular PCA implements NLPCA where the units in the bottleneck layer are constrained to lie on the unit circle so that a pair of bottleneck nodes can be described by a single angle parameter [9]. Inverse NLPCA learns only the decoding portion of the network (i.e., the bottleneck layer through the output layer) by treating the bottleneck node values themselves as model parameters to be learned during backpropagation, in addition to the weights [10]. Deep autoencoders are ANNs that have more hidden layers than the standard NLPCA model, but that still have a single bottleneck layer in the middle of the network, which forces compression of the data [11]. Unsupervised pretraining of the autoencoder is used prior to backpropagation to increase the speed of learning for deep network architectures. Pretrained deep autoencoders have been successfully applied to learning efficient low-dimensional representations of nonlinear profile data, outperforming alternative methods based on reindexing and manifold learning [12]. However, none of these methods specifically consider the problem of ensuring that the learned components each represent distinct variation patterns in the original data.

Hierarchical nonlinear principal component networks (HNPCN) is another variant of NLPCA, which seeks a solution in which a hierarchical ordering of the components exists, similar to the eigenvalue ordering of principal components in linear PCA [13], [14]. The method proposes training multiple subnetworks sequentially, where each additional subnetwork contains one more unit in the bottleneck layer than the previous networks, and all but one of the bottleneck layer nodes obtains its value from a previous subnetwork. While this provides some order to the learned features, it requires the training of P networks (where P is the number of desired components) and does not allow the encoding portion of a network to be further trained once training has proceeded to the next subnetwork. In addition, the primary concern of the above referenced papers is learning a hierarchical ordering of the nonlinear components, which differs from our objective of identifying distinct and interpretable components. These papers did not provide experimental results showing the performance of their methods with respect to separating the nonlinear variation sources over multiple different random initializations of the network weights.

Although the term NLPCA is frequently used to describe the use of ANNs in dimensionality reduction, there is a different methodology which is also referred to as NLPCA in the statistical literature. For the purpose of this discussion, we refer to this different nonlinear PCA method as NLPCA2. The NLPCA2 method was developed to address the inability of regular PCA to handle categorical data attributes, but it can also be used to model nonlinear relationships in numerical data [15], [16]. Given that P nonlinear components are desired, NLPCA2 identifies an optimal scaling of each attribute, such that the sum of the first P eigenvalues of the correlation matrix evaluated on the scaled data is maximized. Different types of nonlinear transformations allowed under the optimal scaling process can be specified. Linear PCA is performed on the transformed data attributes to obtain a nonlinear dimensionality reduction solution.

While NLPCA2 can be used to model nonlinear relationships between numerical variables, there are several differences from our NLPCA approach, which limit the suitability of NLPCA2 for our task of visualizing nonlinear variation patterns. The objective in NLPCA is to minimize the error in reconstructing the original data from the low-dimensional representation learned by the model. In contrast, NLPCA2 only minimizes the SSE in reconstructing the transformed data produced through optimal scaling. Optimizing the reconstruction error of the original data is desirable for visualization purposes, because the ability to discern nonlinear patterns learned by the model is limited if the reconstructed data instances poorly represent the original instances. Perhaps a greater concern with NLPCA2 is that reconstructing the original data instances may not even be possible due to the manner in which the data are transformed through optimal scaling. For example, the reconstruction of the transformed attributes produced by linear PCA could fall outside the domain of the inverse transformation function, or the inverse function could not exist at the reconstructed point (Linting et al. [16] provide several transformation plots which illustrate these problems with the inverse transformation). Most previous applications of NLPCA2 involve lower dimensional data where understanding relationships between attributes (e.g., analyzing PCA loading vectors) is the primary concern, rather than our objective of visualizing nonlinear variation sources in high-dimensional data.

ANNs are ideal for our task of visualizing the underlying variation patterns, because they can learn both the original variation sources and a functional mapping (i.e., the function f(·) defined in Section I) of those sources back to the dimensionality of the observed data. Unlike the existing NLPCA methods based on ANNs, our ANLPCA model encourages the bottleneck layer nodes to each learn distinct variation patterns in the data while requiring the training of only a single network architecture. It also provides a hierarchical ordering of the learned components similar to PCA while allowing all model parameters to be fine-tuned during later stages of training, unlike the existing HNPCN method.

Fig. 1. Network architecture for ANLPCA with P = 2 components.

III. METHODOLOGY

A. Alternating NLPCA

To simplify notation, we present our methodology for ANLPCA with P = 2 nonlinear components in the bottleneck layer. However, the method applies directly to cases in which P > 2 components are desired by adding more terms to the objective function, analogous to the method presented here. Further discussion is provided in Section III-D.

The network architecture for ANLPCA with P = 2 components is shown in Fig. 1. In Fig. 1, h_{k,j} denotes the jth hidden node in layer k, and n_k denotes the number of nodes in hidden layer k. The two nodes in the bottleneck layer representing the nonlinear components are denoted by v_1 and v_2, since they are used to represent the two unknown variation patterns comprising the vector v. The arcs represent the weights connecting nodes in one layer to the net input of nodes in the next layer; we denote w_{k,ij} as the weight connecting the ith node of layer k − 1 to the jth node of layer k.

To train the model on a data set consisting of N instances, we minimize the following objective function with respect to the weights:

J(W) = (1/2) Σ_{i=1}^{N} Σ_{m=1}^{M} (x_{i,m} − x̂_{i,m})² + λ J_r(W).    (1)

In (1), x_{i,m} denotes the mth component of instance i in the training set, x̂_{i,m} is the value of x_{i,m} reconstructed at the output layer of the neural network, and W denotes the set of all weights used in the network.
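Written as code, (1) is simply the squared reconstruction error plus the weighted penalty term. The sketch below assumes the data and reconstructions are held in NumPy arrays and that the value of the alternating term J_r(W) is supplied by a separate routine; one possible form of that routine is sketched after (2)-(5) below.

```python
import numpy as np

def anlpca_objective(X, X_hat, penalty_value, lam):
    """Objective (1): half the summed squared reconstruction error plus lambda * J_r(W).

    X, X_hat      : arrays of shape (N, M) with the training instances and their reconstructions.
    penalty_value : value of the alternating term J_r(W) for the current epoch.
    lam           : the weight lambda placed on the penalty.
    """
    sse = 0.5 * np.sum((X - X_hat) ** 2)
    return sse + lam * penalty_value
```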

The λJ_r(W) term in (1) is a relaxation of a constraint on learning which alternates depending on the current epoch of training, where the training epoch is denoted by r. We define

J_r(W) = Σ_{j=1}^{n_3} (w_{3,1j} − w̄_{3,1j})² + Σ_{j=1}^{n_1} (w_{2,j1} − w̄_{2,j1})²,  if r ∈ Ω_1
J_r(W) = Σ_{j=1}^{n_3} (w_{3,2j} − w̄_{3,2j})² + Σ_{j=1}^{n_1} (w_{2,j2} − w̄_{2,j2})²,  if r ∈ Ω_2    (2)

where

Ω_1 = {r | r ≤ R ∨ [r > 2R ∧ r (mod 2) = 0]}
Ω_2 = {r | R < r ≤ 2R ∨ [r > 2R ∧ r (mod 2) = 1]}.    (3)

In (3), R is a user-specified parameter for which we provide selection guidelines in Section IV. The value of w̄_{k,ij} is set equal to the updated weight w_{k,ij} obtained from the previous iteration of backpropagation. That is, if w_{k,ij}^{(r)} denotes the new value of w_{k,ij} after the rth backpropagation iteration, then in iteration r + 1, we set w̄_{k,ij} = w_{k,ij}^{(r)}. The parameter λ in (1) is set equal to a large value, which can be tuned using cross validation.

We use conjugate gradient descent to iteratively minimize the objective function in (1). The first term in (1) is simply the squared error loss function that is commonly used when training ANNs. The other term J_r(W) represents a constraint imposed on the objective function by our ANLPCA method to arrive at a solution, where each node in the bottleneck layer represents a distinct variation pattern, as explained in the following.

The objective function term J_r(W) only affects the weights entering into and leaving the bottleneck layer, w_{2,ij} and w_{3,ij}. Thus, the weight updates for w_{1,ij} and w_{4,ij} in the first and fourth layers are the same as in regular NLPCA. The first definition of J_r(W) in (3) for r ∈ Ω_1 only involves the weights coming into and out of v_1 in the bottleneck layer, and the relevant partial derivatives for this term are

∂J_r(W)/∂w_{3,1j} = 2(w_{3,1j} − w̄_{3,1j})
∂J_r(W)/∂w_{2,j1} = 2(w_{2,j1} − w̄_{2,j1}).    (4)

The second definition of J_r(W) in (3) for r ∈ Ω_2 similarly involves only the weights entering and leaving v_2. Thus, the relevant partial derivatives for the second definition are

∂J_r(W)/∂w_{3,2j} = 2(w_{3,2j} − w̄_{3,2j})
∂J_r(W)/∂w_{2,j2} = 2(w_{2,j2} − w̄_{2,j2}).    (5)

The effect of J_r(W) on the weight updates is, therefore, to add the terms in (4) and (5) to the squared error loss gradients for the relevant weights entering and leaving the bottleneck layer. Alternating the definition of the J_r(W) term based on the training epoch results in only updating the weights connected to a single bottleneck node in any given training epoch. The weights connected to the other bottleneck node remain approximately constant due to the active penalty on deviations from the prior iteration's weights for that node. Allowing only the weights connected to v_1 to be updated for the first R epochs allows the error associated with the largest variation in the data to be learned primarily by v_1. After the largest variation source has been learned, the penalty alternates and encourages the v_1 weights to remain constant, while v_2 learns the next largest variation source. Alternating the penalty between the nodes after 2R epochs have elapsed allows the model to continue updating the weights on each node independently to fine-tune the model, which is otherwise not possible if we try to learn the weights for a given bottleneck node entirely before moving on to train the next node.
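As a concrete transcription of (2)-(5), the following sketch computes the alternating penalty and its gradient contribution for the weights attached to the bottleneck nodes. The NumPy array layout of the weight matrices and the function names are our own assumptions and are not taken from the original implementation.

```python
import numpy as np

def in_omega_1(r, R):
    # Membership in the first epoch set of (3): the first R epochs, then every even epoch after 2R
    return r <= R or (r > 2 * R and r % 2 == 0)

def alternating_penalty(W2, W3, W2_prev, W3_prev, r, R):
    """Penalty J_r(W) of (2) for a two-component bottleneck.

    W2 : weights into the bottleneck layer, shape (n1, 2); W2[j, k] corresponds to w_{2,j(k+1)}.
    W3 : weights out of the bottleneck layer, shape (2, n3); W3[k, j] corresponds to w_{3,(k+1)j}.
    *_prev : the corresponding weights from the previous backpropagation iteration.
    """
    k = 0 if in_omega_1(r, R) else 1   # which bottleneck node's weights are penalized this epoch
    return np.sum((W3[k, :] - W3_prev[k, :]) ** 2) + np.sum((W2[:, k] - W2_prev[:, k]) ** 2)

def alternating_penalty_grad(W2, W3, W2_prev, W3_prev, r, R):
    # Gradient terms (4)-(5): nonzero only for the weights attached to the penalized node
    g2, g3 = np.zeros_like(W2), np.zeros_like(W3)
    k = 0 if in_omega_1(r, R) else 1
    g3[k, :] = 2.0 * (W3[k, :] - W3_prev[k, :])
    g2[:, k] = 2.0 * (W2[:, k] - W2_prev[:, k])
    return g2, g3
```

These gradient contributions are simply added to the squared error loss gradients of the affected weights during each backpropagation iteration, as described above.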

B. Tangent Vector Cosine Similarity Metric

After training an ANN, the SSE represented by the first term in (1) is typically calculated on a test data set in order to provide a generalized estimate of the quality of fit produced by the model. This reconstruction error is important for our task of visualizing the variation patterns, because it indicates how well the actual data are modeled. However, we also need to evaluate a solution based on the distinctness of the variation patterns modeled by each of the bottleneck nodes, which requires another metric in addition to the standard SSE.

To measure the degree to which the bottleneck nodes v_1 and v_2 capture distinct variation patterns in the original data, we introduce a metric called the tangent vector cosine similarity (TVCS). TVCS has similarities with the concept of tangent distance, which is a transformation-invariant distance measure used in image recognition [17]; however, the tangents are used for a different role here. After training the neural network, each of the i = 1, ..., N instances in our data has distinct bottleneck node values v_i = [v_{i,1}, v_{i,2}]^T in addition to a reconstructed vector f(v_i) = x̂_i, where f(v_i): R^P → R^M denotes the nonlinear functional mapping of v_i to the original M-dimensional space produced by the decoder portion of the network (i.e., the weights and activation functions from the bottleneck layer to the output layer). The functional mapping f(v_i) for i ∈ {1, ..., N} defines a P-dimensional manifold in the M-dimensional space of the original data describing the possible reconstructions that can be produced by the model.

Let v̄_1 and v̄_2 denote the midpoints of the range of these bottleneck node values evaluated across the N instances after training. If we fix v_2 = v̄_2, the collection of points f([v_{i,1}, v̄_2]^T) for i = 1, ..., N represents a path along the manifold defined by v_1. Similarly, the points f([v̄_1, v_{i,2}]^T) for i = 1, ..., N represent a different path along the manifold, which characterizes the effect the second bottleneck node v_2 exerts on the reconstructions. For the purpose of measuring the distinctness of the patterns represented by v_1 and v_2, we approximate these paths by their tangent vectors T_1 and T_2 to the manifold at the midpoint bottleneck values v̄ = [v̄_1, v̄_2]^T using the method of finite differences. Letting δ denote a small change in the bottleneck node values, T_1 and T_2 are specifically calculated as

T_1 = [f(v̄_1 + δ/2, v̄_2) − f(v̄_1 − δ/2, v̄_2)] / δ
T_2 = [f(v̄_1, v̄_2 + δ/2) − f(v̄_1, v̄_2 − δ/2)] / δ.    (6)

We approximate T_1 and T_2 at the midpoint bottleneck node values to characterize the variation patterns in the middle of their observed ranges, which is where the visualization is expected to be of greatest interest. While T_1 and T_2 could be calculated across all points on the manifold if visualization of the patterns in other regions is desired, our objective here is just to illustrate the usefulness of TVCS for identifying distinct feature solutions at the point of visualization.

Because the tangent vectors T_1 and T_2 represent linear approximations of the directions in the manifold defined by the bottleneck nodes v_1 and v_2, the orthogonality of these vectors indicates that the bottleneck nodes have learned variation patterns in different directions of the original M-dimensional space. This is desirable since we want the nodes to characterize different information about the variation sources in the data. It is also similar to the orthogonality of loadings used in PCA to describe different sources of variation. Thus, we measure the cosine of the angle between the tangent vectors T_1 and T_2, which is close to zero when the tangent vectors have this desirable characteristic of being nearly orthogonal and increases in absolute value toward one as they become less orthogonal. We define

TVCS_1 = |T_1 · T_2| / (‖T_1‖ ‖T_2‖)    (7)

where T_1 · T_2 denotes the vector dot product of T_1 and T_2, and ‖T‖ denotes the Euclidean norm of vector T.

C. Enhancements to TVCS

The orthogonality of tangent vectors on the manifold is analogous to the condition of orthogonal loadings in PCA. ANN models with a small TVCS, therefore, share the intuitive property of having components that characterize orthogonal directions in the original feature space, as in PCA. However, the orthogonality of the tangent vectors T_1 and T_2 is not a sufficient condition for the variation patterns to be well separated between the bottleneck nodes v_1 and v_2. This is because the true variation patterns can be completely mixed between v_1 and v_2 while still producing patterns that are independent in the M-dimensional space of the observed data [8], as discussed in Section II.

However, we can improve the usefulness of TVCS for measuring source separation if we have prior information about one of the nonlinear variation patterns present in the data. Specifically, if we know the direction in the original M-dimensional space primarily impacted by one of the true variation patterns, reflecting one of the tangent vectors about this direction reveals if the variation sources have been mixed between the bottleneck nodes. We therefore calculate the following second TVCS measure:

TVCS_2 = |T_1 · T_2^R| / (‖T_1‖ ‖T_2^R‖)    (8)

where T_2^R is the tangent vector to the manifold defined by the second bottleneck node v_2 after each of the reconstructed instances f([v̄_1, v_{i,2}]^T) has been reflected about the direction in the M-dimensional space impacted by the known variation source. The final measure incorporating the prior information about one of the known variation sources is defined as TVCS = max(TVCS_1, TVCS_2).
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

6 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS

Fig. 2. Example of ANLPCA solution on simulated point cloud data where v_1 and v_2 represent distinct patterns.

Fig. 3. Tangent vectors for ANLPCA solution.

To illustrate, consider the reconstructed instances of our simulated point cloud data shown in Fig. 2. There are two distinct variation patterns that are simulated into the data, with additional small Gaussian noise added to each instance. Each instance is a 2500-D point cloud representing a bowl-shaped object in 3-D space. One of the variation patterns acting on this object is a flattening pattern, which expands the object symmetrically along the x- and y-axes while flattening its height in the z dimension. The second variation pattern, independent of the first, shifts the object along the horizontal axis.

Fig. 2 shows the shaded surface plots of the object viewed from above (i.e., looking down from the top of the z-axis, as in a contour plot). These plots were obtained from an ANLPCA solution in which v_1 accurately characterizes the first variation pattern and v_2 characterizes the second pattern. The top three plots (from left to right) in Fig. 2 illustrate how the object flattens as the value of v_1 decreases and v_2 remains constant at its midpoint. The bottom three plots illustrate the shifting of the object from left to right along the horizontal axis as the value of v_2 increases and v_1 remains fixed at its midpoint. The tangent vectors T_1, T_2, and T_2^R for this ANLPCA solution are shown in Fig. 3.

Now consider the regular NLPCA solution with random starting weights for this same data shown in Fig. 4. These plots represent a mapping of the R^2500 space of the model's reconstructions to a 50 × 50 grid corresponding to the physical space in which the object resides (hereafter referred to as the 'visualization space'). For a given model reconstruction f(v) produced by the bottleneck node values v = [v_1, v_2], we denote the (i, j) coordinate pair of the reconstruction in the visualization space as g_v(i, j) for i ∈ {1, ..., 50}, j ∈ {1, ..., 50}. Note that in the NLPCA solution shown in Fig. 4, v_1 represents both the flattening pattern and shifting the object to the right along the horizontal axis of the visualization space, as depicted by the top three plots of Fig. 4. The bottom three plots show that v_2 has also learned a combination of these two variation patterns, where the reconstructions are produced over a different region of the visualization space.

The fact that each bottleneck node has learned a combination of the two true variation patterns is revealed by how the reconstructions are identical after reflecting the patterns produced by one of the bottleneck nodes around the vertical axis in the reconstruction space. Specifically, let v = [v̄_1 + δ, v̄_2] and v′ = [v̄_1, v̄_2 − δ] represent two sets of bottleneck node values, where δ is a small deviation from one of the elements in each set. Fig. 4 shows how g_v(i, j) ≈ g_{v′}(50 − i, j), which defines approximate equivalence after reflecting the reconstruction associated with v_2 around the vertical axis of the visualization space at i = 25 on the horizontal axis.

The tangent vectors for this NLPCA solution are shown in Fig. 5 and illustrate that, despite the variation patterns being mixed among v_1 and v_2, the tangent vectors T_1 and T_2 are still nearly orthogonal, because they point in different directions of the 2500-D feature space, resulting in a small value of TVCS_1. However, when the reconstructed values for one of the bottleneck nodes are reflected around the vertical axis of the visualization space, the reflected tangent vector T_2^R is approximately equal to the negative of T_1, yielding a value of TVCS_2 that is near 1. Thus, taking the maximum of TVCS_1 and TVCS_2 yields a large value near 1 when the bottleneck nodes model a combination of the variation patterns, and it produces a small value near 0 when they each represent distinct patterns.

This formulation of the TVCS measure relies on prior knowledge that one of the variation patterns primarily affects the object in the direction of the horizontal axis of the visualization space. However, different rotations of the visualization space could be used in other applications, where similar prior knowledge exists, or TVCS_1 could simply be used to approximate the measure if no prior knowledge is available.
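For the 50 × 50 visualization grid used here, the reflection g_v(i, j) → g_v(50 − i, j) about the vertical axis amounts to a single array flip. The sketch assumes the reconstruction grid is stored with the horizontal index i along the first array axis; that storage convention is our assumption, not something specified in the text.

```python
import numpy as np

def reflect_about_vertical_axis(grid):
    """Reflect a (50, 50) reconstruction grid so that index i maps to 50 - i."""
    return np.flip(grid, axis=0)

# For a mixed-pattern solution, grid_v is approximately reflect_about_vertical_axis(grid_v_prime).
```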

Fig. 4. Example of regular NLPCA solution on simulated point cloud data, where v_1 and v_2 each represent a mix of the two distinct patterns.

Fig. 5. Tangent vectors for regular NLPCA solution.

Note that the prior knowledge is only used for calculating the TVCS metric to numerically compare our ANLPCA method to regular NLPCA with respect to the separation of the variation patterns; this information is not used by the ANLPCA method, making it applicable to scenarios where no prior information about the variation sources is known.

D. Strategy for P > 2 Bottleneck Nodes

To extend ANLPCA for use with P > 2 bottleneck nodes, we use an analogous alternating definition of the J_r(W) term in (2) for each of the k = 1, ..., P bottleneck nodes. This J_r(W) term would penalize updates to the weight connections on all but one of the bottleneck nodes during any given epoch of training. The bottleneck node weights would be learned sequentially with each node having its weights free of penalties for R consecutive training epochs. After P × R epochs have elapsed, the weight penalties would alternate across the bottleneck nodes after each subsequent epoch. The TVCS metric could be evaluated on each pairwise combination of bottleneck node tangent vectors and averaged to produce a single measure of feature distinctiveness.
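One way to read this schedule is as a simple rule for which bottleneck node is left unpenalized at a given epoch. The sketch below is our interpretation of that rule, using 1-indexed epochs and node labels; it is not code from the paper.

```python
def unpenalized_node(r, R, P):
    """Bottleneck node whose weights are free of the penalty in epoch r (1-indexed).

    Nodes are trained sequentially for R consecutive epochs each; after P * R epochs
    the penalty alternates across the nodes one epoch at a time.
    """
    if r <= P * R:
        return (r - 1) // R + 1        # each node trained for R consecutive epochs
    return (r - P * R - 1) % P + 1     # afterwards, rotate one node per epoch
```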

IV. EXPERIMENTAL RESULTS

A. Simulated Point Cloud Data

The simulated point cloud data were described in Section III and the examples of the data are shown in Figs. 2 and 4. The data consisted of 12 000 training instances and 4200 testing instances, with both sets of data simulated using the same two variation patterns with additive Gaussian noise.

During backpropagation training, 30% of the training instances were reserved for a validation set and the model was trained on the remaining 70% of the 12 000 observations. The 4200 test instances were only used to calculate the TVCS measure and the average SSE after training was complete. The average SSE is defined as

Average SSE = (1/N) Σ_{i=1}^{N} Σ_{m=1}^{M} (x_{i,m} − x̂_{i,m})².    (9)
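Computationally, (9) is the summed squared reconstruction error averaged over the N instances; one direct NumPy transcription is shown below.

```python
import numpy as np

def average_sse(X, X_hat):
    """Average SSE of (9): mean over the N instances of the per-instance summed squared error."""
    return np.sum((X - X_hat) ** 2) / X.shape[0]
```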

Fig. 2 shows a typical ANLPCA solution obtained on this data set, which has the desired separation of the variation patterns between v_1 and v_2. The model depicted in this figure had an average SSE of 1.167 and a TVCS value of 0.095. It was trained for 200 epochs using batch backpropagation with a batch size of 500 observations. The model shown in Fig. 4 was trained using the same parameters and randomly initiated starting weights, but with regular NLPCA for 100 epochs rather than ANLPCA. This resulted in an average SSE of 0.944 and a TVCS value of 0.911. Although NLPCA yielded a lower SSE solution with less training, it completely mixes the variation patterns between the bottleneck nodes and therefore fails at the objective of identifying the distinct variation patterns.

To further compare ANLPCA and regular NLPCA, we trained models using each method starting from 30 different randomly initiated starting weights. As in the previous example, we trained the regular NLPCA models for 100 epochs and the ANLPCA models for 200 epochs, since the alternating method typically yields slower learning. We also trained a third set of 30 models using 500 epochs of ANLPCA to obtain results that have comparable average SSE values to the regular NLPCA solution. In practice, the number of training epochs could be determined empirically using stopping criteria (such as increasing reconstruction error on a withheld set of validation data) or by manually terminating training once a desired quality of fit has been achieved by the model. Both sets of ANLPCA models were trained with an alternating frequency of R = 25 for the extra objective function term J_r(W). Other than the training method and the number of epochs, all other parameters were the same across the three sets of trained models. We use n_1 = n_3 = 25 nodes in the hidden layers on either side of the bottleneck layer. The standard neural network model parameters, including the number of hidden nodes in the first and third layers, were determined by performing a search over different possible combinations and selecting the set which minimized the average SSE of reconstruction on a validation data set.

Fig. 6. Example of the reduction in reconstruction error by training epoch for an ANLPCA model with R = 10.

The value of the alternating frequency parameter was chosen by monitoring the reduction in the average SSE by training epoch. This typically reveals an ‘elbow’ in the curve indicating where it is appropriate to alternate the penalty and begin training the other bottleneck node. An example where an alternating frequency of R = 10 epochs was used is shown in Fig. 6. A steep reduction in the mean squared error (MSE) of reconstruction early in training is followed by smaller reductions in subsequent epochs after the model has learned as much information as possible using only a single bottleneck node, which is clearly identified by the ‘elbow’ in the plot. Another steep reduction in the error occurs when the penalty has alternated after epoch 10, corresponding to learning the second variation pattern while updating the weights for the other bottleneck node. The alternating frequency parameter R can be chosen by monitoring this reduction in the reconstruction MSE during training and alternating the penalty after the ‘elbow’ point in the curve is reached. Alternatively, cross validation could be used to choose the value of R, which minimizes the TVCS of the resulting model.
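The ‘elbow’ rule described above can also be approximated programmatically by watching the per-epoch drop in the validation error and switching the penalty once the improvement falls below a tolerance. The following heuristic sketch, including its default threshold, is our own illustration rather than a procedure prescribed in the paper.

```python
def choose_R_from_elbow(validation_mse, tol=0.01):
    """Estimate the 'elbow' epoch: the first epoch whose relative improvement over the
    previous epoch falls below tol. validation_mse[e] is the validation MSE after epoch e + 1.
    """
    for e in range(1, len(validation_mse)):
        prev, curr = validation_mse[e - 1], validation_mse[e]
        if prev > 0 and (prev - curr) / prev < tol:
            return e + 1          # epochs are 1-indexed
    return len(validation_mse)    # no elbow detected within the monitored epochs
```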

Fig. 9. Example of ANLPCA solution on MNIST data where v_1 and v_2 each represent the distinct variation patterns of rotation and bottom curvature.

Fig. 10. Example of NLPCA solution on MNIST data where v_1 and v_2 each represent a mix of the two distinct variation patterns.

Fig. 7. TVCS measure evaluated across 30 different randomly initiated weights for NLPCA, ANLPCA trained for 200 epochs, and ANLPCA trained for 500 epochs.

The TVCS values for the models trained on the simulated point cloud data are shown in Fig. 7. Each boxplot shows the distribution of TVCS measured across the 30 models trained using each learning method. ANLPCA with both 200 and 500 epochs produced substantially smaller TVCS values than regular NLPCA, and visualization of each of the trained ANLPCA models revealed that the v_1 and v_2 bottleneck nodes each modeled distinct variation patterns similar to the solution shown in Fig. 2. Conversely, many of the models trained with regular NLPCA yielded solutions similar to that shown in Fig. 4, where v_1 and v_2 each modeled a combination of the patterns. This is reflected in the much larger values of TVCS for regular NLPCA.

Fig. 8. Average SSE evaluated across 30 different randomly initiated weights for NLPCA, ANLPCA trained for 200 epochs, and ANLPCA trained for 500 epochs.

Fig. 8 shows the distribution of average SSE values across the 30 models trained on each method. Based only on average SSE, regular NLPCA with only 100 epochs outperforms ANLPCA with 200 epochs, although both solutions were considered acceptable from the perspective of having the reconstructed pattern resemble the originally observed pattern.

Increasing the number of ANLPCA epochs to 500 provides average SSE values that are as good or better than the NLPCA 100 epoch solution with only a modest increase in the TVCS measure with respect to the ANLPCA 200 epoch solution. Figs. 7 and 8 show how ANLPCA avoids undesirable solutions where the variation patterns are mixed between the bottleneck nodes, but also requires more training epochs to achieve the same level of average SSE as regular NLPCA. Thus, for a fixed amount of training epochs, there is a tradeoff between interpretability and the accuracy of the reconstruction.

B. MNIST Handwritten Digits Database

The second set of data that we used to compare ANLPCA to NLPCA is a subset of the MNIST handwritten digits database [18]. Specifically, we use the 6131 images of the digit 3 contained in this database, of which 20% of the observations are reserved for testing. The remaining 80% of the images are used for training data, with a portion (10%) of the training data used as a validation set. Learning for both ANLPCA and NLPCA was done via batch backpropagation with 750 observations in each batch, with the model parameters selected by performing a parameter search over different combinations and choosing the set which minimized the reconstruction error on validation data. NLPCA models were trained for 100 epochs and ANLPCA models were trained for 200 epochs, since only the weights for one of the two bottleneck nodes are primarily learned during a given epoch of ANLPCA training. We used an alternating frequency of R = 50 for the ANLPCA training, which was again chosen by monitoring the change in the reconstruction SSE by training epoch and identifying the ‘elbow’ in the curve.
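For reference, the MNIST experiment settings reported in this subsection can be collected into a single configuration; the dictionary below simply restates those reported values, and the field names themselves are arbitrary.

```python
mnist_anlpca_config = {
    "digit": 3,                     # only images of the digit 3 are used (6131 images)
    "test_fraction": 0.20,          # 20% of the observations reserved for testing
    "validation_fraction": 0.10,    # portion of the training data used for validation
    "batch_size": 750,              # observations per batch of batch backpropagation
    "epochs_nlpca": 100,
    "epochs_anlpca": 200,
    "alternating_frequency_R": 50,
    "input_output_nodes_M": 784,
    "hidden_nodes_n1_n3": 150,
    "bottleneck_nodes_P": 2,
}
```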
Two nodes were used in the bottleneck layer for all of the trained models. To measure the extent to which each bottleneck node characterizes a distinct variation pattern, we use only TVCS_1 rather than taking the maximum of TVCS_1 and TVCS_2 as with the previous example, because we do not have any prior knowledge on the nature of the variation patterns contained in this data. The input and output layers have M = 784 nodes corresponding to the dimensionality of the MNIST data, and we use n_1 = n_3 = 150 nodes in the hidden layers on either side of the bottleneck layer, where this number of hidden nodes was determined by performing a parameter search using validation data. An example of the solution obtained by ANLPCA on the MNIST data is presented in Fig. 9. The first bottleneck node v_1 appears to control the rotation of the digit, while the second bottleneck node v_2 characterizes the curvature of the bottom of the digit. This solution had an average SSE of 33.98 and a TVCS_1 value of 0.039. The NLPCA solution for the same set of initial starting weights is shown in Fig. 10. This model had an average SSE of 31.87 and a TVCS_1 value of 0.67. In contrast to the ANLPCA model, the NLPCA model appears to characterize both the rotation of the digit and the bottom curvature with both v_1 and v_2, which is reflected in its large TVCS_1 score.

Fig. 11. TVCS_1 measure evaluated across 30 different randomly initiated weights for NLPCA and ANLPCA using the MNIST handwritten digits.

Fig. 12. Average SSE of 30 randomly initiated NLPCA and ANLPCA models evaluated on the MNIST data.

As with the simulated point cloud data, we trained models with both NLPCA and ANLPCA using 30 different sets of randomly initiated starting weights for the network. A boxplot comparing their TVCS_1 values across the 30 trials is shown in Fig. 11. The ANLPCA models consistently produced lower TVCS_1 scores, indicating that the ANLPCA method results in solutions which better separate the directions represented by the bottleneck nodes. A comparison of the average SSE is shown in Fig. 12. Similar to the previous example, ANLPCA provides lower TVCS_1 while yielding solutions with slightly higher average SSE. However, the examples shown in Fig. 9 show that the digit is still easily recognizable at this level of reconstruction error. In order to obtain more distinct patterns, ANLPCA might be considered to be a constrained solution, so that a slight degradation of the fit can be expected relative to NLPCA for the same amount of training.

V. CONCLUSION

We addressed the difficult problem of learning distinct nonlinear sources of variation in high-dimensional data. Our strategy to promote the learning of distinct sources of variation is conceptually simple and easily extends from more traditional methods. The previous literature has not provided a methodology for this important problem, and we expect this initial work to be enhanced with future approaches. This paper facilitates efforts to visualize independent variation patterns, which can be applied to a wide range of exploratory data analysis problems.

Our results using simulated point cloud data and the MNIST handwritten digits data show that the ANLPCA method consistently produces results with the desired separation of distinct variation patterns. We introduced the TVCS metric as a measure that can be used to quantify this distinctness characteristic of the solution. This will facilitate the comparison of our ANLPCA method to future work on learning distinct nonlinear variation patterns. In future research, we would like to explore different regularization methods that can be introduced to encourage distinctness of the variation patterns in ANN models.

ACKNOWLEDGMENT

The authors would like to thank the anonymous reviewers for their constructive feedback.

REFERENCES

[1] M. A. Kramer, "Nonlinear principal component analysis using autoassociative neural networks," AIChE J., vol. 37, no. 2, pp. 233–243, Feb. 1991.
[2] B. Schölkopf, A. Smola, and K.-R. Müller, "Nonlinear component analysis as a kernel eigenvalue problem," Neural Comput., vol. 10, no. 5, pp. 1299–1319, Jul. 1998.
[3] A. Sahu, D. W. Apley, and G. C. Runger, "Feature selection for noisy variation patterns using kernel principal component analysis," Knowledge-Based Syst., vol. 72, pp. 37–47, Dec. 2014.
[4] J.-K. Im, D. W. Apley, and G. C. Runger, "Tangent hyperplane kernel principal component analysis for denoising," IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 4, pp. 644–656, Apr. 2012.
[5] P. Comon, "Independent component analysis, a new concept?" Signal Process., vol. 36, no. 3, pp. 287–314, Apr. 1994.
[6] X. Shan and D. W. Apley, "Blind identification of manufacturing variation patterns by combining source separation criteria," Technometrics, vol. 50, no. 3, pp. 332–343, Jan. 2008.
[7] D. W. Apley and H. Y. Lee, "Identifying spatial variation patterns in multivariate manufacturing processes," Technometrics, vol. 45, no. 3, pp. 220–234, Jan. 2003.
[8] C. Jutten and J. Karhunen, "Advances in nonlinear blind source separation," in Proc. 4th Int. Symp. Independent Compon. Anal. Blind Signal Separat., Apr. 2003, pp. 245–256.
[9] M. J. Kirby and R. Miranda, "Circular nodes in neural networks," Neural Comput., vol. 8, no. 2, pp. 390–402, Feb. 1996.
[10] M. Scholz, M. Fraunholz, and J. Selbig, "Nonlinear principal component analysis: Neural network models and applications," in Principal Manifolds for Data Visualization and Dimension Reduction. New York, NY, USA: Springer, 2008, pp. 44–67.
[11] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504–507, 2006.
[12] P. Howard, D. Apley, and G. Runger, "Identifying nonlinear variation patterns with deep autoencoders," IIE Trans., 2016.
[13] R. Saegusa, H. Sakano, and S. Hashimoto, "Nonlinear principal component analysis to preserve the order of principal components," Neurocomputing, vol. 61, pp. 57–70, Oct. 2004.
[14] M. Scholz and R. Vigário, "Nonlinear PCA: A new hierarchical approach," in Proc. ESANN, Apr. 2002, pp. 439–444.
[15] A. Gifi, Nonlinear Multivariate Analysis. Hoboken, NJ, USA: Wiley, 1990.
[16] M. Linting, J. J. Meulman, P. J. Groenen, and A. J. van der Koojj, "Nonlinear principal components analysis: Introduction and application," Psychol. Methods, vol. 12, no. 3, pp. 336–358, Sep. 2007.
[17] P. Simard, Y. LeCun, and J. S. Denker, "Efficient pattern recognition using a new transformation distance," in Proc. Adv. Neural Inf. Process. Syst., 1993, pp. 50–58.
[18] Y. LeCun and C. Cortes, The MNIST Database of Handwritten Digits, 1998.

Phillip Howard received the Ph.D. degree in industrial engineering from Arizona State University, Tempe, AZ, USA, in 2016, with a focus on machine learning. He is currently a Data Scientist with Intel Corporation, Chandler, AZ, USA, where he is involved in enabling supply chain intelligence and analytics capabilities using statistics and machine learning methods. His current research interests include applications of deep learning for nonlinear dimensionality reduction, cognitive computing solutions for supply chain monitoring, and interpretable feature learning from heterogeneous data sources in health care applications.

Daniel W. Apley received the B.S., M.S., and Ph.D. degrees in mechanical engineering and the M.S. degree in electrical engineering from the University of Michigan, Ann Arbor, MI, USA. He is currently a Professor of Industrial Engineering and Management Sciences with Northwestern University, Evanston, IL, USA. His current research interests include the interface of engineering modeling, statistical analysis, and predictive analytics, with a focus on enterprise process modeling and manufacturing variation reduction applications in which large amounts of data are available. His research has been supported by numerous industries and government agencies. Dr. Apley received the NSF CAREER Award in 2001, the IIE Transactions Best Paper Award in 2003, and the Wilcoxon Prize for best practical application paper appearing in Technometrics in 2008. He is also the Editor-in-Chief of Technometrics and has served as Editor-in-Chief for the Journal of Quality Technology, the Chair of the Quality, Statistics and Reliability Section of INFORMS, and the Director of the Manufacturing and Design Engineering Program at Northwestern.

George Runger received the Ph.D. degree in statistics from the University of Minnesota in 1982. He was a Senior Engineer and a Technical Leader for data analytics projects with IBM. He is currently a Professor with the School of Computing, Informatics, and Decision Systems Engineering, Arizona State University, Tempe, AZ, USA. He has authored over 100 publications in research journals with funding from federal and corporate sponsors. His current research interests include analytical methods for knowledge generation and data-driven improvements in organizations, and machine learning for large, complex data, and real-time analysis, with applications to surveillance, decision support, and population health. Dr. Runger was an affiliated Faculty Member with BMI and the related Center for Health Information and Research for several years. He is also a Reviewer for many journals in the area of machine learning and statistics and the Department Editor for healthcare informatics for IIE Transactions on Healthcare Systems Engineering. He is also the Chair of the Department of Biomedical Informatics, Arizona State University.
