
University of California

Los Angeles

Least Squares Method for Factor Analysis

A thesis submitted in partial satisfaction


of the requirements for the degree
Master of Science in Statistics

by

Jia Chen

2010

© Copyright by

Jia Chen
2010

The thesis of Jia Chen is approved.

Hongquan Xu

Yingnian Wu

Jan de Leeuw, Committee Chair

University of California, Los Angeles


2010


To my parents, for their enduring love,
and to my friends and teachers,
who have given me precious memories and tremendous encouragement.


Table of Contents

List of Tables

1 Introduction

2 Factor Analysis
  2.1 Factor Analysis Models
    2.1.1 Random Factor Model
    2.1.2 Fixed Factor Model
  2.2 Estimation Methods
    2.2.1 Principal Component Method
    2.2.2 Maximum Likelihood Method
    2.2.3 Least Squares Method
  2.3 Determining the Number of Factors
    2.3.1 Mathematical Approaches
    2.3.2 Statistical Approach
    2.3.3 The Third Approach

3 Algorithms of Least Squares Methods
  3.1 Least Squares on the Covariance Matrix
    3.1.1 Loss Function
    3.1.2 Projections
    3.1.3 Algorithms
  3.2 Least Squares on the Data Matrix
    3.2.1 Loss Function
    3.2.2 Projection
    3.2.3 Algorithms

4 Examples
  4.1 9 Mental Tests from Holzinger-Swineford
  4.2 9 Mental Tests from Thurstone
  4.3 17 Mental Tests from Thurstone/Bechtoldt
  4.4 16 Health Satisfaction Items from Reise
  4.5 9 Emotional Variables from Burt

5 Discussion and Conclusion

A Augmented Procrustes

B Implementation Code

C Program

Bibliography

List of Tables

4.1  LSFA methods Summary - 9 Mental Tests from Holzinger-Swineford
4.2  Loss function Summary - 9 Mental Tests from Holzinger-Swineford
4.3  Loading Matrices Summary - 9 Mental Tests from Holzinger-Swineford
4.4  LSFA methods Summary - 9 Mental Tests from Thurstone
4.5  Loss function Summary - 9 Mental Tests from Thurstone
4.6  Loading Matrices Summary - 9 Mental Tests from Thurstone
4.7  LSFA methods Summary - 17 Mental Tests from Thurstone/Bechtoldt
4.8  Loss function Summary - 17 Mental Tests from Thurstone/Bechtoldt
4.9  Loading Matrices Summary - 17 Mental Tests from Thurstone
4.10 LSFA methods Summary - 16 Health Satisfaction Items from Reise
4.11 Loss function Summary - 16 Health Satisfaction Items from Reise
4.12 Loading Matrices Summary - 16 Health Satisfaction Items from Reise
4.13 LSFA methods Summary - 9 Emotional Variables from Burt
4.14 Loading Matrices Summary - 9 Emotional Variables from Burt

Abstract of the Thesis

Least Squares Method for Factor Analysis


by

Jia Chen
Master of Science in Statistics
University of California, Los Angeles, 2010
Professor Jan de Leeuw, Chair

This paper demonstrates the implementation of alternating least squares for solving the common factor analysis problem. The algorithm converges, and the accumulation points of the sequences it generates are stationary points. In addition, implementing the Procrustes algorithm provides a means of verifying that the solution obtained is at least a local minimum of the loss function.


CHAPTER 1
Introduction
A major objective of scientific or social inquiry is to summarize, by theoretical formulations, the empirical relationships among a given set of events and to discover the natural laws behind thousands of random events. The events that can be investigated are almost infinite, so it is difficult to make any general statement about phenomena. It can be said, however, that scientists analyze the relationships among a set of variables, where these relationships are evaluated across a set of individuals under specified conditions. The variables are the characteristics being measured, and they can be anything that can be objectively identified or scored.

Factor analysis can be used for theory and instrument development and for assessing the construct validity of an established instrument when it is administered to a specific population. Through factor analysis, the original set of variables is reduced to a few factors with minimum loss of information. Each factor represents an area of generalization that is qualitatively distinct from that represented by any other factor. Within an area where data can be summarized, factor analysis first represents that area by a factor and then seeks to make the degree of generalization between each variable and the factor explicit [6].

There are many methods available to estimate a factor model. The purpose of this paper is to present and implement a new least squares algorithm, and then to compare its speed of convergence and model accuracy with those of some existing approaches. To begin, we provide some matrix background and the assumptions underlying the factor analysis model.

CHAPTER 2
Factor Analysis
Many statistical methods are used to study the relation between independent and dependent variables. Factor analysis is different: its purpose is data reduction and summarization, with the goal of understanding causation. It aims to describe the covariance relationships among a large set of observed variables in terms of a few underlying, but unobservable, random quantities called factors.

Factor analysis is a branch of multivariate analysis that was invented by the psychologist Charles Spearman. He discovered that school children's scores on a wide variety of seemingly unrelated subjects were positively correlated, which led him to postulate that a general mental ability, or g, underlies and shapes human cognitive performance. Raymond Cattell expanded on Spearman's idea of a two-factor theory of intelligence after performing his own tests and factor analyses, and used a multi-factor theory to explain intelligence. Factor analysis was developed to analyze test scores so as to determine whether intelligence is made up of a single underlying general factor or of several more limited factors measuring attributes like mathematical ability. Today factor analysis is the most widely used branch of multivariate analysis in psychology and, helped by the advent of electronic computers, it has spread quickly to economics, botany, biology and the social sciences.

There are two main motivations for studying factor analysis. The first is to reduce the number of variables. In multivariate analysis one often has data on a large number of variables Y1, Y2, ..., Ym, and it is reasonable to believe that there is a reduced list of unobserved factors that determines the full dataset. The second, and primary, motivation is to detect patterns of relationship among many dependent variables with the goal of discovering the independent variables that affect them, even though those independent variables cannot be measured directly. The object of a factor problem is to account for the tests with the smallest possible number of factors that is consistent with acceptable residual errors [21].

2.1 Factor Analysis Models

Factor analysis is generally presented within the framework of the multivariate linear model for data analysis. Two classical linear common factor models are briefly reviewed below; for a more comprehensive discussion, refer to Anderson and Rubin [1956] and to Anderson [1984] [1] [2].

Common factor analysis (CFA) starts with the assumption that the variance in a given variable can be explained by a small number of underlying common factors. In the common factor model the factor score matrix is divided into two parts, a common factor part and a unique factor part. In matrix algebra form the model is

    Y = F + U,    F = HA',    U = ED,

where Y, F and U are n × m, H is n × p, A is m × p, E is n × m, and D is m × m. The model can be rewritten as

    Y = HA' + ED,    (2.1)

where the common part is a linear combination of the p common factor scores (H) weighted by the factor loadings (A), and the unique part is a linear combination of the m unique factor scores (E) weighted by the unique loadings in the diagonal matrix D.

In the common factor model the common and unique factors are assumed to be orthogonal, to follow a multivariate normal distribution with mean zero, and to be scaled to unit length. The normality assumption means they are statistically independent random variables, and the common factors are assumed to be independent of the unique factors. Therefore E(H) = 0, H'H = I_p, E(E) = 0, E'E = I_m, E'H = 0, and D is a diagonal matrix.

The common factor model (2.1) and its assumptions imply the following covariance structure for the observed variables:

    Σ = AA' + D².    (2.2)
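As an illustration of (2.1) and (2.2), the following R fragment builds the covariance structure implied by a small set of hypothetical loadings; the numbers in A and the two-factor layout are invented for the example and are not taken from any data set analyzed later.

    # Hypothetical two-factor loading matrix for six variables (illustration only)
    A <- matrix(c(0.8, 0.7, 0.6, 0.0, 0.0, 0.0,
                  0.0, 0.0, 0.0, 0.7, 0.6, 0.5), ncol = 2)
    d2 <- 1 - rowSums(A^2)              # unique variances so that Sigma has a unit diagonal
    Sigma <- tcrossprod(A) + diag(d2)   # Sigma = AA' + D^2, equation (2.2)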

2.1.1 Random Factor Model

The matrix Y is assumed to be a realization of a matrix-valued random variable, which has a random common part F and a random unique part U. Thus

    Y = F + U,    F = HA',    U = ED,

with the same dimensions as before. Each row of Y corresponds to an individual observation, and these observations are assumed to be independent. Moreover, the specific parts are assumed to be uncorrelated with the common factors and with the other specific parts.

2.1.2 Fixed Factor Model

The random factor model described above was criticized soon after it was formally introduced by Lawley. The point is that in factor analysis different individuals are regarded as drawing their scores from different k-way distributions, and in these distributions the mean for each test is the true score of the individual on that test. Nothing is implied about the distribution of observed scores over a population of individuals, and one makes assumptions only about the error distributions [25].

The fixed factor model assumes

    Y = F + U,

where the common part is a bilinear combination of common factor loadings and common factor scores,

    F = HA'.

In the fixed model we merely assume that the specific parts are uncorrelated with the other specific parts.

2.2 Estimation Methods

In common factor analysis the population covariance matrix Σ of m variables with p common factors can be decomposed as

    Σ = AA' + D²,

where A is the loading matrix of order m × p and D² is the matrix of unique variances of order m, which is diagonal and non-negative definite. These parameters are nearly always unknown and need to be estimated from sample data. Estimation is a relatively straightforward matter of breaking down a covariance or correlation matrix into a set of orthogonal components or axes equal in number to the number of variates. The sample covariance matrix is occasionally used, but it is much more common to work with the sample correlation matrix.

Many different methods have been developed for estimation. The best known is the principal factor method, which extracts the maximum amount of variance that can possibly be extracted by a given number of factors. This method chooses the first factor so as to account for as much as possible of the variance in the correlation matrix, the second factor to account for as much as possible of the remaining variance, and so on.

In 1940 a major step forward was made by D. N. Lawley, who developed the maximum likelihood equations. These are fairly complicated and difficult to solve, but computational advances, particularly by K. G. Jöreskog, have made maximum likelihood estimation a practical proposition, and computer programs are widely available [4]. Since then, maximum likelihood has become a dominant estimation method in factor analysis.

2.2.1 Principal Component Method

Let the variance-covariance matrix Σ have eigenvalues λ1 ≥ λ2 ≥ ... ≥ λm ≥ 0 with corresponding eigenvectors e1, e2, ..., em. The spectral decomposition of Σ [13] says that the variance-covariance matrix can be expressed as the sum of the m eigenvalues multiplied by their eigenvectors and their transposes. The idea behind the principal component method is to approximate this expression:

    Σ = λ1 e1 e1' + λ2 e2 e2' + ... + λm em em'
      = [√λ1 e1, √λ2 e2, ..., √λm em] [√λ1 e1, √λ2 e2, ..., √λm em]' = AA'.    (2.3)

Instead of summing equation (2.3) from 1 to m, we sum only from 1 to p to estimate the variance-covariance matrix:

    Σ ≈ λ1 e1 e1' + ... + λp ep ep'
      = [√λ1 e1, √λ2 e2, ..., √λp ep] [√λ1 e1, √λ2 e2, ..., √λp ep]' = ÂÂ'.    (2.4)

Equation (2.4) yields the estimator for the factor loadings:

    Â = [√λ1 e1, √λ2 e2, ..., √λp ep].    (2.5)

Recalling equation (2.2), the unique variances D² are estimated by subtracting ÂÂ' from the variance-covariance matrix and keeping only the diagonal:

    D̂² = diag(Σ − ÂÂ').    (2.6)
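A minimal R sketch of the principal component estimates (2.5) and (2.6); it assumes a covariance or correlation matrix C and a chosen number of factors p, and the function name pc_factor is illustrative only, not part of the program in Appendix C.

    pc_factor <- function(C, p) {
      e <- eigen(C, symmetric = TRUE)
      # Loadings from the p largest eigenvalues and eigenvectors, equation (2.5)
      A <- e$vectors[, 1:p, drop = FALSE] %*% diag(sqrt(e$values[1:p]), p)
      d2 <- diag(C - tcrossprod(A))      # unique variances, equation (2.6)
      list(loadings = A, uniquenesses = d2)
    }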

2.2.2 Maximum Likelihood Method

The maximum likelihood method was first applied to factor analysis by Lawley (1940, 1941, 1943), but its routine use had to await the development of computers and suitable numerical optimization procedures [19]. The method finds the values of one or more parameters that make the likelihood of the observed data a maximum. It consists in finding factor loadings which maximize the likelihood function for a specified set of unique variances. The maximum likelihood method assumes that the data are independently sampled from a multivariate normal distribution. As the common factors (H) and unique factors (E) are assumed to be multivariate normal, Y = HA' + ED is then multivariate normal with mean vector 0 and variance-covariance matrix Σ. The maximum likelihood estimates of the factor loadings A and the unique variances D are obtained by finding the Â and D̂ that maximize the log-likelihood, which is given by

    ℓ(A, D) = −(nm/2) log 2π − (n/2) log|AA' + D²| − (1/2) tr[(AA' + D²)⁻¹ Y'Y].    (2.7)

There are two types of maximum likelihood methods. The first is the covariance matrix method. It was proposed by Lawley [14] and then popularized and programmed by Jöreskog [12]; since then multinormal maximum likelihood for the random factor model has been the dominant estimation method in factor analysis. The maximum likelihood principle is applied to the likelihood function of the covariance matrix, assuming multivariate normality. The negative log-likelihood measures the distance between the sample and the population covariance model, and A and D are chosen to minimize

    ℓ(A, D) = n log|Σ| + n tr(Σ⁻¹ S),    (2.8)

where S is the sample covariance matrix of Y and Σ = AA' + D².
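A small R sketch of the discrepancy (2.8), assuming a sample covariance matrix S, trial loadings A, unique standard deviations d, and sample size n; it is written for illustration only, not as the estimation routine itself.

    ml_discrepancy <- function(A, d, S, n) {
      Sigma <- tcrossprod(A) + diag(d^2)                  # model covariance AA' + D^2
      n * (log(det(Sigma)) + sum(diag(solve(Sigma, S))))  # n log|Sigma| + n tr(Sigma^{-1} S)
    }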


In Anderson and Rubin the impressive machinery developed by the Cowles Commission was applied to both the fixed and the random factor analysis models. Maximum likelihood was applied to the likelihood function of the covariance matrix, assuming multivariate normality.
The other type is the data matrix method. The maximum likelihood procedures proposed by Lawley [14] were criticized soon after they appeared by Young, who wrote: "Such a distribution is specified by the means and variance of each test and the covariance of the tests in pairs; it has no parameters distinguishing different individuals. Such a formulation is therefore inappropriate for factor analysis, where factor loadings of the tests and of the individuals enter in a symmetric fashion in a bilinear form." [25]

Young proposed to minimize the negative log-likelihood of the data,

    ℓ(H, A, D) = n log|D| + tr[(Y − HA') D⁻¹ (Y − HA')'],    (2.9)

where D is a known diagonal matrix of column (variable) weights. The solution is given by a weighted singular value decomposition of Y.
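The following R sketch illustrates this weighted SVD solution under stated assumptions: a centered data matrix Y, a vector d of known variable weights (the diagonal of D), and p factors. The columns are rescaled by D^{-1/2}, the SVD is truncated, and the loadings are scaled back; the function name young_wsvd is illustrative only.

    young_wsvd <- function(Y, d, p) {
      Z <- Y %*% diag(1 / sqrt(d))                       # columns rescaled by D^{-1/2}
      s <- svd(Z, nu = p, nv = p)
      H <- s$u                                           # common factor scores, H'H = I
      A <- diag(sqrt(d)) %*% s$v %*% diag(s$d[1:p], p)   # loadings on the original scale
      list(H = H, A = A)
    }
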
The basic problem with Young's method is that it assumes the weights to be known. One solution, suggested by Lawley, is to estimate them along with the loadings and uniquenesses [15]. If there are no person weights, Lawley suggests alternating minimization over (H, A), which is done by a weighted singular value decomposition, and minimization over the diagonal D, which simply amounts to computing the average sum of squares of the residuals for each variable. However, iterating the two minimizations produces a block relaxation algorithm intended to minimize the negative log-likelihood that does not work, even though the algorithm produces a decreasing sequence of loss function values. "A rather disconcerting feature of the new method is, however, that iterative numerical solutions of the estimation equations either fail to converge, or else converge to unacceptable solutions in which one or more of the measurements have zero error variance. It is apparently impossible to estimate scale as well as location parameters when so many unknowns are involved" [24].

In fact, if we look at the loss function we can see that it is unbounded below: we can choose scores to fit one variable perfectly, and then let the corresponding variance term approach zero [1].

In 1952 Whittle suggested taking D proportional to the variances of the variables, which amounts to a singular value decomposition of the standardized variables. Jöreskog makes the more reasonable choice of setting D proportional to the reciprocals of the diagonal of the inverse of the covariance matrix of the variables [11].

2.2.3 Least Squares Method

The least squares method is another important estimation method. It seeks values of the factor loadings A and the unique variances D² that minimize a different loss function, defined either on the residuals of the covariance matrix or on the residuals of the data matrix.

The least squares loss function on the covariance matrix is

    σ(A, D) = ½ SSQ(C − AA' − D²).

We minimize over all m × p matrices A and all diagonal matrices D of order m. There have been four major approaches to minimizing this loss function.

The least squares loss function on the data matrix is

    σ(H, A, E, D) = ½ SSQ(Y − HA' − ED).

We minimize over all n × p matrices H, n × m matrices E, m × p matrices A, and diagonal D, under the conditions that H'H = I, E'E = I and H'E = 0.

2.3 Determining the Number of Factors

A factor model cannot be solved without first determining the number of factors p. How many common factors should be included in the model? This requires determining how many parameters are going to be involved. There are statistical and mathematical approaches to determining the number of factors.

2.3.1 Mathematical Approaches

The mathematical approach to the number of factors is concerned with the number of factors for a particular sample of variables in the population, and its theory is based on a population correlation matrix. Mathematically, the number of factors underlying any given correlation matrix is a function of its rank, and estimating the minimum rank of the correlation matrix is the same as estimating the number of factors [6].
1. The percentage of variance criterion. This method applies particularly to the principal component method. The percentage of common variance extracted is computed by using the sum of the eigenvalues of the variance-covariance matrix Σ as the divisor. Usually investigators compute the cumulative percentage of variance after each factor is removed from the matrix, and then stop the factoring process when 75, 80 or 85% of the total variance is accounted for.

2. The latent root criterion. This is a commonly used criterion of long standing that performs well in practice. It uses the variance-covariance matrix and chooses the number of factors equal to the number of eigenvalues greater than one. (A short code sketch of this and the previous criterion is given after this list.)

3. The scree test criterion. The scree test is named after the geological term scree, and it also works well in practice. The latent roots are plotted against the number of factors in their order of extraction, and the shape of the resulting curve is used to evaluate the cutoff point: if the graph drops sharply and is then followed by a straight line with much smaller slope, choose the number of factors equal to the number of eigenvalues before the straight line begins. Because the test involves subjective judgement, it cannot be programmed into the computer run.
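A short R sketch of the first two criteria (plus the scree plot), assuming a correlation matrix R; the 80% cutoff is just one of the conventional values mentioned above.

    ev <- eigen(R, symmetric = TRUE, only.values = TRUE)$values
    cumvar <- cumsum(ev) / sum(ev)            # cumulative percentage of variance
    p_var <- which(cumvar >= 0.80)[1]         # percentage-of-variance criterion
    p_root <- sum(ev > 1)                     # latent root criterion: eigenvalues > 1
    plot(ev, type = "b", xlab = "factor", ylab = "eigenvalue")   # scree plot, judged by eye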

2.3.2 Statistical Approach

In the statistical procedure for determining the number of factors to extract, the following question is asked: is the residual matrix, after p factors have been extracted, statistically significant? A hypothesis test can be stated to answer this question: H0: Σ = AA' + D² versus H1: Σ ≠ AA' + D², where A is an m × p matrix. If the statistic is significant beyond the 0.05 level, then the number of factors is insufficient to totally explain the reliable variance. If the statistic is nonsignificant, then the hypothesized number of factors is correct.

Bartlett presented a chi-square test of the significance of a correlation matrix; the test statistic is

    χ² = −(n − 1 − (2m + 5)/6) ln|R|,    (2.10)

with ν = ½[(m − p)² − m − p] degrees of freedom, where |R| is the determinant of the correlation matrix [3].
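A small R sketch of the statistic (2.10), assuming a correlation matrix R, sample size n, and p hypothesized factors; the degrees of freedom follow the expression for ν given above, and the function name bartlett_test is illustrative.

    bartlett_test <- function(R, n, p) {
      m <- nrow(R)
      chi2 <- -(n - 1 - (2 * m + 5) / 6) * log(det(R))
      df <- ((m - p)^2 - m - p) / 2
      c(statistic = chi2, df = df,
        p.value = pchisq(chi2, df, lower.tail = FALSE))
    }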

2.3.3 The Third Approach

There is no single way to determine the number of factors to extract. A third approach that is sometimes used is to look at the theory within the field of study for indications of how many factors to expect. In many respects this is a better approach, because it lets the science drive the statistics rather than the statistics drive the science.

CHAPTER 3
Algorithms of Least Squares Methods
Suppose we have a multivariate sample of n independent observations on each of m variables, collected in an n × m matrix Y. The common factor analysis model can be written as

    Y = HA' + ED,    (3.1)

where H is n × p, A is m × p, E is n × m, D is m × m and diagonal, H'H = I, E'E = I, H'E = 0, and p is the number of common factors.

Minimizing the least squares loss function is a form of factor analysis, but it is not the familiar one. In classical least squares factor analysis, as described by Young [1941], Whittle [1952] and Jöreskog [1962], the unique factors E are not parameters in the loss function [25] [24] [11]. Instead, the unique variances are used to weight the residuals of each observed variable. Two different loss functions will be illustrated.

3.1 Least Squares on the Covariance Matrix

3.1.1 Loss Function

The least squares loss function used in LSFAC is

    σ(A, D) = ½ SSQ(C − AA' − D²).    (3.2)

We minimize over all m × p matrices A and all diagonal matrices D of order m.


3.1.2 Projections

We also define two projected, or concentrated, loss functions, in which one set of parameters is minimized out:

    σ(A) = min over D of σ(A, D) = Σ_{1 ≤ j < l ≤ m} (c_jl − a_j'a_l)²,    (3.3)

and

    σ(D) = min over A of σ(A, D) = ½ Σ_{s = p+1}^{m} λ_s²(C − D²).    (3.4)

Note that σ is used as a generic symbol for these LSFAC loss functions, because it will be clear from the context which one we are using.
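A small R sketch of the LSFAC loss (3.2) and the projected loss (3.3), assuming a covariance matrix C, loadings A, and a vector d2 of unique variances; in the projected version only the off-diagonal residuals count, since the diagonal can always be fitted exactly by D².

    lsfac_loss <- function(C, A, d2) {
      sum((C - tcrossprod(A) - diag(d2))^2) / 2     # equation (3.2)
    }
    lsfac_loss_A <- function(C, A) {                # equation (3.3): D minimized out
      Res <- C - tcrossprod(A)
      sum(Res[upper.tri(Res)]^2)
    }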

3.1.3 Algorithms

There are a variety of approaches to the problem, each yielding an appropriate solution but with varying degrees of efficiency. We briefly review four major approaches to minimizing the loss function.

1. Thomson's Principal Factor Analysis. PFA [Thomson, 1934] is an alternating least squares method [De Leeuw, 1994] that alternates minimizing over A for D fixed at its current value and minimizing over D for A fixed at its current value.

The basic procedure starts with the choice of the number of factors p and the selection of an arbitrary set of unique variance estimates D² for the m variables. If C − D² = KΛK' is the eigen-decomposition of C − D², and we write Λ_p and K_p for the p largest eigenvalues and corresponding eigenvectors, then the minimum over A for fixed D is A = K_p Λ_p^{1/2}. If fewer than p eigenvalues are positive, the negative elements of Λ_p are replaced by zeroes. The minimum over D for fixed A is attained at D² = diag(C − AA'). Because D² = diag(C − AA') we always have D² ≤ diag(C), but there is no guarantee that convergence is to a D for which both D² ≥ 0 and C − D² ≥ 0. The algorithm may be expressed in the following form (an R sketch of this iteration is given at the end of this subsection).

Step 1. Start with the observed covariance matrix with an arbitrary diagonal: C − D².

Step 2. Compute the eigen-decomposition C − D² = KΛK', where Λ is the diagonal matrix of eigenvalues and the columns of K are the associated eigenvectors.

Step 3. Determine the first p principal factors: A = K_p Λ_p^{1/2}, where Λ_p is the p × p submatrix of Λ containing the p largest eigenvalues, and K_p is the corresponding m × p submatrix of K.

Step 4. Determine the reproduced unique variances: D² = diag(C − AA').

Step 5. Repeat Steps 2-4 until the convergence criterion is met.

2. Comrey's Minimum Residual Factor Analysis. Comrey proposed minimum residual factor analysis (MRFA), which takes into account only the off-diagonal elements of the variance-covariance or correlation matrix [5]. The method was put on a more solid footing after being modified by Zegers and Ten Berge [26]. The object of MRFA is to determine a matrix A of order m × p which, given some covariance matrix C of order m × m, minimizes the function

    σ(A) = tr[(C0 − AA' + diag(AA')) (C0 − AA' + diag(AA'))'],    (3.5)

where C0 = C − diag(C).

Zegers and Ten Berge [1983] show that equation (3.5) is minimized by taking

    a_ik = (Σ_{j ≠ i} a_jk²)⁻¹ Σ_{j ≠ i} c_ij^{(k)} a_jk,    (3.6)

where c_ij^{(k)} is the (i, j)-th element of the matrix of residuals resulting from partialling all factors except the k-th from C.

The basic procedure is as follows. Given some starting matrix A, the elements of the first column are, each in turn, replaced once according to equation (3.6); then, in the same way, the elements of the second column are replaced, and so on, until all elements of A have been replaced once. This constitutes one iteration cycle. The iterative procedure is terminated when, in one iteration cycle, the value of the function in equation (3.5) decreases by less than some specified small value [26].
3. Harman's MINRES. In minimum residual factor analysis [Harman and Jones, 1966; Harman and Fukuda, 1966] we project out D [8] [7]. Thus we define

    σ(A) = min over D of ½ SSQ(C − AA' − D²) = Σ_{1 ≤ j < l ≤ m} (c_jl − a_j'a_l)²,

and then use alternating least squares to minimize the projected loss function over A, using the m rows as blocks.

4. Gradient and Newton Methods. Gradient methods [De Leeuw, 2010] can be applied by projecting out A. Thus we define

    σ(D) = min over A of ½ SSQ(C − AA' − D²) = ½ Σ_{s = p+1}^{m} λ_s²(C − D²).

Now use

    D_j λ_s(D) = −z_js²,
    D_jl λ_s(D) = −2 z_js z_ls (C − D² − λ_s I)⁺_jl,

where z_s is the normalized eigenvector corresponding to the eigenvalue λ_s, and (C − D² − λ_s I)⁺_jl is the (j, l) element of the Moore-Penrose inverse of C − D² − λ_s I. This directly gives formulas for the first and second derivatives of the loss function:

    D_j σ(D) = −Σ_{s = p+1}^{m} λ_s z_js²,
    D_jl σ(D) = Σ_{s = p+1}^{m} [z_js² z_ls² − 2 λ_s z_js z_ls (C − D² − λ_s I)⁺_jl].
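The following R fragment sketches Thomson's principal factor iteration described in approach 1 above; it assumes a covariance matrix C, the number of factors p, and a starting vector d2 of unique variances. It is a simplified sketch, not the full lsfa program of Appendix C, and the default starting value diag(C)/2 is just one convenient choice.

    pfa_sketch <- function(C, p, d2 = diag(C) / 2, itmax = 500, eps = 1e-6) {
      for (it in 1:itmax) {
        e <- eigen(C - diag(d2), symmetric = TRUE)
        A <- e$vectors[, 1:p, drop = FALSE] %*%
             diag(sqrt(pmax(e$values[1:p], 0)), p)   # A = K_p Lambda_p^{1/2}
        d2_new <- diag(C - tcrossprod(A))            # D^2 = diag(C - AA')
        if (max(abs(d2_new - d2)) < eps) break
        d2 <- d2_new
      }
      list(loadings = A, uniquenesses = d2_new, iterations = it)
    }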

3.2 Least Squares on the Data Matrix

3.2.1 Loss Function

The loss function used in LSFAY is

    σ(H, E, A, D) = ½ SSQ(Y − HA' − ED).    (3.7)

We minimize over all n × p matrices H, n × m matrices E, m × p matrices A, and diagonal D, under the conditions that H'H = I, E'E = I and H'E = 0.

3.2.2 Projection

The result in Appendix A can be used to define a projected version of the LSFAY loss function:

    σ(A, D) = min over H, E of ½ SSQ(Y − HA' − ED)
            = ½ SSQ(Y) + ½ SSQ(A | D) − Σ_{s=1}^{m} σ_s(YA | YD),

where the σ_s(YA | YD) are the ordered singular values of the block matrix (YA | YD). Note that (YA | YD) is n × (m + p), but its rank is at most m, so at least p of the singular values are zero.

The singular values are the square roots of the ordered eigenvalues λ_s of

    U = [ A'CA   A'CD ]
        [ DCA    DCD  ].

Thus we can write

    σ(A, D) = ½ tr(C) + ½ SSQ(A) + ½ SSQ(D) − Σ_{s=1}^{m} √λ_s(U).
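A small R sketch of this projected loss, assuming a covariance matrix C, loadings A, and a vector d holding the diagonal of D; it forms the matrix U above and evaluates the last display. The function name lsfay_projected is illustrative only.

    lsfay_projected <- function(C, A, d) {
      AD <- cbind(A, diag(d))                      # the block matrix (A | D)
      U <- t(AD) %*% C %*% AD                      # U = (A | D)' C (A | D)
      lam <- pmax(eigen(U, symmetric = TRUE, only.values = TRUE)$values, 0)
      sum(diag(C)) / 2 + sum(A^2) / 2 + sum(d^2) / 2 - sum(sqrt(lam))
    }
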
3.2.3 Algorithms

Our approach may seem quite similar to the approach proposed by Paul Horst in his book [9]. Where we differ from Horst is in the additional assumptions that D is diagonal and that E has the same size as the data Y. This puts us solidly in the common factor analysis framework. Horst, on the contrary, only assumes that there is a small number of common and residual factors, and he then finds them by truncating the singular value decomposition. Separating common and unique factors is done later by using rotation techniques. For Horst, factor analysis is just principal component analysis with some additional interpretational tools.

1. Alternating Least Squares. This approach to simultaneous least squares estimation of both loadings and unique factors was first introduced by Jan de Leeuw [2004] and has since been used by Unkel and Trendafilov [23] and Trendafilov and Unkel [22] [16].

The algorithm to minimize the loss function (3.7) is of the alternating least squares type. It starts with initial estimates A^(0) and D^(0) and then alternates

    [H^(k) | E^(k)] = Procrustes[YA^(k) | YD^(k)],    (3.8)
    A^(k+1) = Y'H^(k),    (3.9)
    D^(k+1) = diag(Y'E^(k)).    (3.10)

(An R sketch of the Procrustes step and these updates is given at the end of this subsection.)

The Procrustes transformation of a matrix is defined in terms of its singular value decomposition. If the n × m matrix X has rank m and singular value decomposition X = KΦL', then we define Procrustes(X) = KL'. More generally, as shown below, the Procrustes transformation is defined as a (closed) set of matrices. Defining the block matrices B = [H : E] and U = [A : D] of dimensions n × (p + m) and m × (p + m), the loss function (3.7) can be rewritten as

    tr(Y − BU')'(Y − BU') = tr(Y'Y) + tr(B'BU'U) − 2 tr(B'YU),    (3.11)

which is optimized subject to the constraint B'B = I. The term tr(B'BU'U) in (3.11) can be written as

    tr(B'BU'U) = tr( [H : E]'[H : E] [A : D]'[A : D] )
               = tr( I_{p+m} [A'A  A'D; DA  D²] )
               = tr(A'A) + tr(D²),

showing that tr(B'BU'U) does not depend on H and E. Hence, as with the standard Procrustes problem, minimizing (3.11) over B is equivalent to maximizing tr(B'YU). For this problem a closed-form solution computed from the singular value decomposition (SVD) of YU exists [10].

After solving the Procrustes problem for B = [H : E], one updates the values of A and D by A = Y'H and D = diag(Y'E), using the identities

    H'Y = H'(HA' + ED) = H'HA' + H'ED = A',    (3.12)
    E'Y = E'(HA' + ED) = E'HA' + E'ED = D,    (3.13)

which follow from the model (3.1). The alternating least squares process is continued until the loss function (3.7) cannot be reduced further.

Alternatively, one can use the Moore-Penrose inverse and the symmetric matrix square root, because

    Procrustes(X) = √((XX')⁺) X = X √((X'X)⁺).

If rank(X) = r < min(n, m), then we define the Procrustes transformation as a (closed) set of matrices. See Appendix A for details.

It is important that we can use the symmetric square root to construct a version of the algorithm that does not depend on the number of observations n, and that can be applied to examples that are only given as covariance or correlation matrices [18]. We can combine the equations (3.8), (3.9) and (3.10) into

    A^(k+1) = C A^(k) √((U^(k))⁺),    (3.14)
    D^(k+1) = diag(C D^(k) √((U^(k))⁺)).    (3.15)

This version of the algorithm no longer uses Y, only C. It can be thought of as an adapted version of Bauer-Rutishauser simultaneous iteration [20].

2. Gradient and Newton Methods. Suppose the eigenvector z_s of U corresponding to λ_s is partitioned by putting the first p elements in v_s and the last m elements in w_s. Then [De Leeuw, 2007]

    ∂√λ_s(U) / ∂a_jr = (1/√λ_s(U)) v_rs c_j'(A v_s + D w_s),
    ∂√λ_s(U) / ∂d_jj = (1/√λ_s(U)) w_js c_j'(A v_s + D w_s),

where c_j is column j of C. Collecting terms gives

    D₁σ(A, D) = A − C A √(U⁺),
    D₂σ(A, D) = D − diag(C D √(U⁺)),

which shows that the alternating least squares algorithm (3.14) and (3.15) can be written as a gradient algorithm with constant step size:

    A^(k+1) = A^(k) − D₁σ(A^(k), D^(k)),
    D^(k+1) = D^(k) − D₂σ(A^(k), D^(k)).

Note that the matrix on the right is √(U⁺), the symmetric square root of the Moore-Penrose inverse of U [17].

Convergence may not be immediately obvious, but in fact the iterations generated are exactly the same as those of the Procrustes algorithm, and thus we obtain convergence from general ALS theory. Now, of course, the computations no longer depend on n, and we only need the covariance matrix in order to compute the optimal A and D.
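A minimal R sketch of the Procrustes step and one cycle of the updates (3.8)-(3.10), under stated assumptions: a data matrix Y whose columns are scaled so that Y'Y = C, current loadings A, a vector d with the diagonal of D, and a block matrix (YA | YD) of full column rank so that the SVD form of the transformation applies. The function names are illustrative, not the thesis's lsfa program.

    procrustes <- function(X) {             # K L' from the SVD X = K Phi L'
      s <- svd(X)
      s$u %*% t(s$v)
    }
    als_step <- function(Y, A, d) {
      p <- ncol(A)
      HE <- procrustes(cbind(Y %*% A, Y %*% diag(d)))   # [H | E], update (3.8)
      H <- HE[, 1:p, drop = FALSE]
      E <- HE[, -(1:p), drop = FALSE]
      list(A = crossprod(Y, H),             # A = Y'H, update (3.9)
           d = diag(crossprod(Y, E)))       # D = diag(Y'E), update (3.10)
    }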


CHAPTER 4
Examples
The algorithms derived in the earlier chapters have been programmed and applied to a number of problems. In all programs the same criterion was used to stop the iterations: either the loss function decreases by less than 1e-6 or the number of iterations reaches 1000. In this chapter we compare the solutions obtained by applying the least squares methods to some classic data sets. Five sets are considered: a) 9 mental tests from Holzinger and Swineford (1939); b) 9 mental tests from Thurstone (McDonald, 1999; Thurstone & Thurstone, 1941); c) 17 mental tests from Thurstone and Bechtoldt (Bechtoldt, 1961); d) 16 health satisfaction items from Reise (Reise, Morizot, & Hays, 2007); e) 9 emotional variables from Burt (Burt, 1939; Harman, 1967). The first data set is included in the HolzingerSwineford1939 data set in the lavaan package. The last four data sets are included in the bifactor data set in the psych package.

4.1 9 Mental Tests from Holzinger-Swineford

This is a small subset with 9 variables of the classic Holzinger and Swineford (1939) dataset, which is discussed in detail by Jöreskog (1969). The dataset consists of mental ability test scores of seventh- and eighth-grade children from two different schools (Pasteur and Grant-White). The nine tests were grouped into three factors. Ten different solutions were computed; six of them were iterated until the loss function decreased by less than 1e-6.


Table 4.1 shows the value of the loss function, the number of iterations, and the CPU times¹ of applying LSFAC, LSFAY and ML to the Holzinger-Swineford example.

For the least squares factor analysis method on the covariance matrix (LSFAC), Table 4.1 shows that principal factor analysis (PFA) beats BFGS² and CG³ by a factor of 10 in user time. In addition, going from PFA to the Comrey algorithm makes convergence 3 times faster, and going from PFA to the Harman algorithm gains 24% in convergence speed. Going from the PFA algorithm to the Newton algorithm again makes convergence 2.6 times as fast (observe that the Newton algorithm starts with a small number of Comrey iterations to get into an area where the quadratic approximation is safe).

For the least squares factor analysis method on the data matrix (LSFAY), the direct alternating least squares beats BFGS and CG by 50 times. Also, going from lsfaySVD to lsfayInvSqrt gains another 10% in speed, plus a huge advantage in storage because the problem no longer depends on n, and the two converge in the same number of iteration steps. The PFA algorithm is 49% faster than the direct alternating least squares.

In Table 4.2 we compute the LSFAC loss function (LSFACf) of the solutions from the LSFAC and LSFAY algorithms, and we do the same for the LSFAY loss function (LSFAYf). The values of the LSFAC loss function for the different algorithms are almost identical, and similarly for the LSFAY loss function. In addition, each entry in the comparison of all pairs of loading matrices (Table 4.3) is 1.00, which verifies that all algorithms give essentially the same solution.

¹ The user time is the CPU time charged for the execution of user instructions of the calling process; the system time is the CPU time charged for execution by the system on behalf of the calling process.
² The BFGS method approximates Newton's method, a class of optimization techniques that seeks a stationary point of a function.
³ Method CG is a conjugate gradient method based on that of Fletcher and Reeves (1964).
Table 4.1: LSFA methods Summary - 9 Mental Tests from Holzinger-Swineford

                               LSFAC                                        LSFAY                     ML
             Newton   PFA      Comrey   Harman   BFGS     CG       InvSqrt  SVD      BFGS     CG       ML
  loss       0.01307  0.01307  0.01307  0.01307  0.01307  0.01307  0.0074   0.0074   0.0074   0.0074
  iteration  3        57       17       16                         94       94
  user.self  0.020    0.072    0.017    0.058    0.590    0.727    0.107    0.118    6.212    6.730    0.085
  sys.self   0.000    0.000    0.000    0.001    0.003    0.004    0.003    0.006    0.025    0.029    0.002

Table 4.2: Loss function Summary - 9 Mental Tests from Holzinger-Swineford

            LSFACf       LSFAYf
  Newton    0.01306885   0.007427571
  PFA       0.01306885   0.007427550
  Comrey    0.01306885   0.007427556
  Harman    0.01306885   0.007427547
  InvSqrt   0.01313887   0.007396346
  ML        0.01333244   0.007454523

Table 4.3: Loading Matrices Summary - 9 Mental Tests from Holzinger-Swineford

            LSFAC                                      LSFAY
            Newton PFA  Comrey Harman BFGS CG    InvSqrt SVD  BFGS CG
  Newton    1.00   1.00 1.00   1.00   1.00 1.00  1.00    1.00 1.00 1.00
  PFA       1.00   1.00 1.00   1.00   1.00 1.00  1.00    1.00 1.00 1.00
  Comrey    1.00   1.00 1.00   1.00   1.00 1.00  1.00    1.00 1.00 1.00
  Harman    1.00   1.00 1.00   1.00   1.00 1.00  1.00    1.00 1.00 1.00
  BFGS      1.00   1.00 1.00   1.00   1.00 1.00  1.00    1.00 1.00 1.00
  CG        1.00   1.00 1.00   1.00   1.00 1.00  1.00    1.00 1.00 1.00
  InvSqrt   1.00   1.00 1.00   1.00   1.00 1.00  1.00    1.00 1.00 1.00
  SVD       1.00   1.00 1.00   1.00   1.00 1.00  1.00    1.00 1.00 1.00
  BFGS      1.00   1.00 1.00   1.00   1.00 1.00  1.00    1.00 1.00 1.00
  CG        1.00   1.00 1.00   1.00   1.00 1.00  1.00    1.00 1.00 1.00

4.2 9 Mental Tests from Thurstone

A classic data set is the 9-variable Thurstone problem, which is discussed in detail by R. P. McDonald (1985, 1999). These nine tests were grouped by Thurstone and Thurstone (1941) into three factors: Verbal Comprehension, Word Fluency, and Reasoning. The original data came from Thurstone and Thurstone (1941) but were reanalyzed by Bechtoldt (1961), who broke the data set into two. McDonald, in turn, selected these nine variables from a larger set of 17. Nine different solutions were computed; five of them were iterated until the loss function decreased by less than 1e-6.

For the LSFAC case, Table 4.4 shows that PFA beats BFGS and CG by a factor of 10. In addition, going from PFA to the Comrey algorithm makes convergence 1.2 times faster, and going from PFA to the Harman algorithm makes convergence 18% faster. Going from the Comrey algorithm to the Newton algorithm again makes convergence 50% faster (the Newton algorithm starts with a small number of Comrey iterations to get into an area where the quadratic approximation is safe). For the LSFAY case, direct alternating least squares beats BFGS and CG by 50 times. The PFA algorithm is twice as fast as the direct alternating least squares.

In Table 4.5 we compute the LSFAC loss function (LSFACf) of the solutions from the LSFAC and LSFAY algorithms, and we do the same for the LSFAY loss function (LSFAYf). The values of the LSFAC loss function for the different algorithms are almost identical, and similarly for the LSFAY loss function. In addition, each entry in the comparison of all pairs of loading matrices (Table 4.6) is 1.00, which verifies that all algorithms give essentially the same solution.
Table 4.4: LSFA methods Summary - 9 Mental Tests from Thurstone

                               LSFAC                                        LSFAY            ML
             Newton   PFA      Comrey   Harman   BFGS     CG       InvSqrt  BFGS     CG       ML
  loss       0.00123  0.00123  0.00123  0.00123  0.00123  0.00123  0.00098  0.00098  0.00098
  iteration  4        66       28       22                         144
  user.self  0.024    0.077    0.035    0.065    0.642    0.815    0.146    6.723    7.237    0.037
  sys.self   0.001    0.001    0.000    0.000    0.003    0.004    0.000    0.031    0.046    0.000

Table 4.5: Loss function Summary - 9 Mental Tests from Thurstone

            LSFACf         LSFAYf
  Newton    0.001228405    0.0009998733
  PFA       0.001228405    0.0009998909
  Comrey    0.001228405    0.0009998581
  Harman    0.001228405    0.0009998984
  InvSqrt   0.001266686    0.0009804552
  ML        0.0013511019   0.0009924787

Table 4.6: Loading Matrices Summary - 9 Mental Tests from Thurstone

            LSFAC                                      LSFAY
            Newton PFA  Comrey Harman BFGS CG    InvSqrt BFGS CG
  Newton    1.00   1.00 1.00   1.00   1.00 1.00  1.00    1.00 1.00
  PFA       1.00   1.00 1.00   1.00   1.00 1.00  1.00    1.00 1.00
  Comrey    1.00   1.00 1.00   1.00   1.00 1.00  1.00    1.00 1.00
  Harman    1.00   1.00 1.00   1.00   1.00 1.00  1.00    1.00 1.00
  BFGS      1.00   1.00 1.00   1.00   1.00 1.00  1.00    1.00 1.00
  CG        1.00   1.00 1.00   1.00   1.00 1.00  1.00    1.00 1.00
  InvSqrt   1.00   1.00 1.00   1.00   1.00 1.00  1.00    1.00 1.00
  BFGS      1.00   1.00 1.00   1.00   1.00 1.00  1.00    1.00 1.00
  CG        1.00   1.00 1.00   1.00   1.00 1.00  1.00    1.00 1.00

4.3 17 Mental Tests from Thurstone/Bechtoldt

This set contains the 17 variables from which the clear 3-factor solution used by McDonald (1999) is abstracted. Nine different solutions were computed; five of them were iterated until the loss function decreased by less than 1e-6.

For the LSFAC case, Table 4.7 shows that PFA beats BFGS and CG by a factor of 57. In addition, going from Comrey to the PFA algorithm makes convergence 30% faster, and going from Harman to the PFA algorithm makes convergence 2.8 times faster. Going from the Comrey algorithm to the Newton algorithm again makes convergence 30% faster (the Newton algorithm starts with a small number of Comrey iterations to get into an area where the quadratic approximation is safe). For the LSFAY case, the direct alternating least squares beats BFGS and CG by 23 times. The PFA algorithm is twice as fast as the direct alternating least squares.

In Table 4.8 we compute the LSFAC loss function (LSFACf) of the solutions from the LSFAC and LSFAY algorithms, and we do the same for the LSFAY loss function (LSFAYf). The values of the LSFAC loss function for the different algorithms are almost identical, and similarly for the LSFAY loss function. In addition, each entry in the comparison of all pairs of loading matrices (Table 4.9) is 1.00, which verifies that all algorithms give essentially the same solution.
Table 4.7: LSFA methods Summary - 17 Mental Tests from Thurstone/Bechtoldt

                           LSFAC                              LSFAY          ML
             Newton  PFA    Comrey  Harman  BFGS   CG     InvSqrt BFGS   CG      ML
  loss       0.496   0.496  0.496   0.496   0.496  0.496  0.187   0.187  0.187
  iteration  3       24     18      17                    43
  user.self  0.030   0.030  0.039   0.115   1.500  1.700  0.068   1.608  2.194   0.042
  sys.self   0.001   0.000  0.001   0.000   0.006  0.008  0.000   0.007  0.001   0.001

Table 4.8: Loss function Summary - 17 Mental Tests from Thurstone/Bechtoldt

            LSFACf      LSFAYf
  Newton    0.4959954   0.1916967
  PFA       0.4959954   0.1916968
  Comrey    0.4959954   0.1916968
  Harman    0.4959954   0.1916968
  InvSqrt   0.5071997   0.1873351
  ML        0.5300513   0.1929767

Table 4.9: Loading Matrices Summary - 17 Mental Tests from Thurstone

            LSFAC                                      LSFAY
            Newton PFA  Comrey Harman BFGS CG    InvSqrt BFGS CG
  Newton    1.00   1.00 1.00   1.00   1.00 1.00  1.00    1.00 1.00
  PFA       1.00   1.00 1.00   1.00   1.00 1.00  1.00    1.00 1.00
  Comrey    1.00   1.00 1.00   1.00   1.00 1.00  1.00    1.00 1.00
  Harman    1.00   1.00 1.00   1.00   1.00 1.00  1.00    1.00 1.00
  BFGS      1.00   1.00 1.00   1.00   1.00 1.00  1.00    1.00 1.00
  CG        1.00   1.00 1.00   1.00   1.00 1.00  1.00    1.00 1.00
  InvSqrt   1.00   1.00 1.00   1.00   1.00 1.00  1.00    1.00 1.00
  BFGS      1.00   1.00 1.00   1.00   1.00 1.00  1.00    1.00 1.00
  CG        1.00   1.00 1.00   1.00   1.00 1.00  1.00    1.00 1.00

4.4 16 Health Satisfaction Items from Reise

The Reise data set is a correlation matrix based upon more than 35,000 responses to the Consumer Assessment of Health Care Providers and Systems survey instrument. Reise, Morizot, and Hays (2007) describe a bifactor solution based upon 1,000 cases. The five factors from Reise et al. reflect Getting care quickly (items 1-3), Doctor communicates well (4-7), Courteous and helpful staff (8, 9), Getting needed care (10-13), and Health plan customer service (14-16). In all LSFAC cases we iterated until the loss function decreased by less than 1e-6, and in the LSFAY cases we iterated until 1000 iterations.

For the LSFAC case, Table 4.10 shows that PFA beats BFGS and CG by a factor of 5. In addition, going from Comrey to the PFA algorithm makes convergence 13% faster, and going from Harman to the PFA algorithm makes convergence 7% faster. Going from the Comrey algorithm to the Newton algorithm again makes convergence 2.7 times as fast (the Newton algorithm starts with a small number of Comrey iterations to get into an area where the quadratic approximation is safe). For the LSFAY case, the direct alternating least squares beats BFGS and CG by 20 times. The PFA algorithm is 44% faster than the direct alternating least squares.

In Table 4.11 we compute the LSFAC loss function (LSFACf) of the solutions from the LSFAC and LSFAY algorithms, and we do the same for the LSFAY loss function (LSFAYf). The values of the LSFAC loss function for the different algorithms are almost identical, and similarly for the LSFAY loss function. In addition, each entry in the comparison of all pairs of loading matrices (Table 4.12) is 1.00, which verifies that all algorithms give essentially the same solution.

Table 4.10: LSFA methods Summary - 16 Health Satisfaction Items from Reise

                               LSFAC                                        LSFAY            ML
             Newton   PFA      Comrey   Harman   BFGS     CG       InvSqrt  BFGS     CG       ML
  loss       0.00329  0.00329  0.00329  0.00329  0.00329  0.00329  0.00179  0.00179  0.00179
  iteration  6        734      326      168                        1000
  user.self  0.284    0.931    1.052    1.000    2.223    4.157    1.340    14.620   21.740   0.063
  sys.self   0.002    0.005    0.004    0.004    0.008    0.014    0.005    0.136    0.129    0.000

Table 4.11: Loss function Summary - 16 Health Satisfaction Items from Reise

            LSFACf        LSFAYf
  Newton    0.003287921   0.001813492
  PFA       0.003287921   0.001813506
  Comrey    0.003287921   0.001813501
  Harman    0.003287921   0.001813467
  InvSqrt   0.003339447   0.001789870
  ML        0.003621737   0.001834006

Table 4.12: Loading Matrices Summary - 16 Health Satisfaction Items from Reise

            LSFAC                                      LSFAY
            Newton PFA  Comrey Harman BFGS CG    InvSqrt BFGS CG
  Newton    1.00   1.00 1.00   1.00   1.00 1.00  1.00    1.00 1.00
  PFA       1.00   1.00 1.00   1.00   1.00 1.00  1.00    1.00 1.00
  Comrey    1.00   1.00 1.00   1.00   1.00 1.00  1.00    1.00 1.00
  Harman    1.00   1.00 1.00   1.00   1.00 1.00  1.00    1.00 1.00
  BFGS      1.00   1.00 1.00   1.00   1.00 1.00  1.00    1.00 1.00
  CG        1.00   1.00 1.00   1.00   1.00 1.00  1.00    1.00 1.00
  InvSqrt   1.00   1.00 1.00   1.00   1.00 1.00  1.00    1.00 1.00
  BFGS      1.00   1.00 1.00   1.00   1.00 1.00  1.00    1.00 1.00
  CG        1.00   1.00 1.00   1.00   1.00 1.00  1.00    1.00 1.00

4.5 9 Emotional Variables from Burt

The Burt nine emotional variables are taken from Harman (1967, p. 164), who in turn adapted them from Burt (1939). They are said to be from 172 normal children aged nine to twelve. As pointed out by Harman, this correlation matrix is singular and has squared multiple correlations greater than 1. Note that this correlation matrix has a negative eigenvalue, but LSFAY still works. In all LSFAC cases we iterated until the loss function decreased by less than 1e-6, and in the LSFAY cases we iterated until 1000 iterations.

For the LSFAC case, Table 4.13 shows that PFA beats BFGS and CG by a factor of 8. In addition, going from PFA to the Comrey algorithm makes convergence 1.6 times faster, and going from Harman to the PFA algorithm makes convergence 0.02 times faster. Going from the Comrey algorithm to the Newton algorithm again makes convergence 1.4 times faster (the Newton algorithm starts with a small number of Comrey iterations to get into an area where the quadratic approximation is safe). For the LSFAY case, the direct alternating least squares beats BFGS and CG by 13 times. The PFA algorithm is 6 times faster than the direct alternating least squares.

In Table 4.14, each entry in the comparison of all pairs of loading matrices is at least 0.98, which verifies that all algorithms give very similar solutions.

Table 4.13: LSFA methods Summary - 9 Emotional Variables from Burt

                           LSFAC                                LSFAY
             Newton  PFA     Comrey  Harman  BFGS    CG      InvSqrt BFGS    CG
  loss       0.1696  0.1696  0.1696  0.1696  0.1696  0.1696  0.1287  0.1287  0.1287
  iteration  5       134     48      39                      1000
  user.self  0.025   0.156   0.060   0.153   0.824   1.267   1.073   10.481  14.244
  sys.self   0.000   0.000   0.000   0.000   0.003   0.002   0.003   0.045   0.042

Table 4.14: Loading Matrices Summary - 9 Emotional Variables from Burt

            LSFAC                                      LSFAY
            Newton PFA  Comrey Harman BFGS CG    InvSqrt BFGS CG
  Newton    1.00   1.00 1.00   1.00   1.00 1.00  0.98    0.98 0.98
  PFA       1.00   1.00 1.00   1.00   1.00 1.00  0.98    0.98 0.98
  Comrey    1.00   1.00 1.00   1.00   1.00 1.00  0.98    0.98 0.98
  Harman    1.00   1.00 1.00   1.00   1.00 1.00  0.98    0.98 0.98
  BFGS      1.00   1.00 1.00   1.00   1.00 1.00  0.98    0.98 0.98
  CG        1.00   1.00 1.00   1.00   1.00 1.00  0.98    0.98 0.98
  InvSqrt   0.98   0.98 0.98   0.98   0.98 0.98  1.00    1.00 1.00
  BFGS      0.98   0.98 0.98   0.98   0.98 0.98  1.00    1.00 1.00
  CG        0.98   0.98 0.98   0.98   0.98 0.98  1.00    1.00 1.00

CHAPTER 5
Discussion and Conclusion
This paper demonstrates the feasibility of using an alternating least squares algorithm to minimize the least squares loss function from a common factor analysis perspective. The algorithm leads to clean and solid convergence, and the accumulation points of the sequences it generates are stationary points. In addition, the Procrustes algorithm provides a means of verifying that the solution obtained is at least a local minimum of the loss function.

The Procrustes algorithm was applied to the classical Holzinger-Swineford mental tests problem, and a simple structure of the loadings was achieved. The illustrative results verify that the iterations generated by the projection algorithm are exactly the same as those of the Procrustes algorithm. The optimization was easily carried out using the gradient projection algorithm. The Procrustes algorithm still has to be compared to other least squares methods.

Current algorithms for fitting common factor analysis models to data are quite good. In almost every case the least squares method on the covariance matrix (LSFAC) converges faster than the least squares method on the data matrix (LSFAY), and the results from the two methods are almost identical. The Newton algorithm (started with a small number of Comrey iterations to get into an area where the quadratic approximation is safe) is the fastest of the least squares algorithms. However, LSFAY has two advantages over LSFAC: first, its independence assumptions are more realistic and optimal scaling is easy (FACTALS); second, the square roots of the estimated unique variances always come out positive.

The illustrative results also verify that a common factor analysis solution can be surprisingly similar to the classical maximum likelihood and least squares solutions on the data sets analyzed earlier, suggesting that further research into its properties may be of interest in the future.

Although we showed that the proposed factor analysis methodology can yield results that are equivalent to those from standard methods, an important question to consider is whether any variant of this methodology can actually yield improvements over existing methods. If not, the results will be of interest mainly in providing a new theoretical perspective on the relations between components and factor analysis.

APPENDIX A
Augmented Procrustes
Suppose X is an n × m matrix of rank r. Consider the problem of maximizing tr(U'X) over the n × m matrices U satisfying U'U = I. This is known as the Procrustes problem, and it is usually studied for the case n ≥ m = r. We want to generalize it to n ≥ m ≥ r. For this we use the singular value decomposition

    X = [ K1  K0 ] [ Λ  0 ] [ L1' ]
                   [ 0  0 ] [ L0' ],

where K1 is n × r, K0 is n × (n − r), Λ is r × r, L1' is r × m, and L0' is (m − r) × m.

Theorem 1. The maximum of tr(U'X) over n × m matrices U satisfying U'U = I is tr Λ, and it is attained for any U of the form U = K1 L1' + K0 V L0', where V is any (n − r) × (m − r) matrix satisfying V'V = I.

Proof. Using a symmetric matrix M of Lagrange multipliers leads to the stationary equations X = UM, which imply X'X = M², or M = ±(X'X)^{1/2}. They also imply that at a solution of the stationary equations tr(U'X) = ±tr Λ. The negative sign corresponds with the minimum, the positive sign with the maximum. Now

    M = [ L1  L0 ] [ Λ  0 ] [ L1' ]
                   [ 0  0 ] [ L0' ],

where L1 is m × r and L0 is m × (m − r). If we write U in the form

    U = [ K1  K0 ] [ U1 ]
                   [ U0 ],

where U1 is r × m and U0 is (n − r) × m, then X = UM can be simplified to

    U1 L1 = I,
    U0 L1 = 0,

with, in addition, of course, U1'U1 + U0'U0 = I. It follows that U1 = L1' and U0 = V L0', with V'V = I. Thus U = K1 L1' + K0 V L0'.

APPENDIX B
Implementation Code
B.1. Examples Dataset
library(MASS)
library(psych)
library(optimx)
library(seriation)

data(Harman)
data(bifactor)
data(Psych24)

haho <- Harman.Holzinger
# haho are Holzinger's 9 psychological tests

burt <- matrix(0, 11, 11)
burt[2, 1:1] <- .83
burt[3, 1:2] <- c(.81, .87)
burt[4, 1:3] <- c(.80, .62, .63)
burt[5, 1:4] <- c(.71, .59, .37, .49)
burt[6, 1:5] <- c(.70, .44, .31, .54, .54)
burt[7, 1:6] <- c(.54, .58, .30, .30, .34, .50)
burt[8, 1:7] <- c(.53, .44, .12, .28, .55, .51, .38)
burt[9, 1:8] <- c(.59, .23, .33, .42, .40, .31, .29, .53)
burt[10, 1:9] <- c(.24, .45, .33, .29, .19, .11, .21, .10, .09)
burt[11, 1:10] <- c(.13, .21, .36, -.06, -.10, .10, .08, -.16, -.10, .41)
burt <- burt + t(burt) + diag(11)
# burt are the Burt (1915) emotional variables;
# note this correlation matrix has a negative eigenvalue,
# but lsfay(burt, 3) still works
rownames(burt) <- colnames(burt) <- c("Sociality", "Sorrow", "Tenderness",
  "Joy", "Wonder", "Elation", "Disgust", "Anger", "Sex", "Fear", "Subjection")

reise <- as.matrix(Reise)
# reise is Reise et al's (2007) health satisfaction items

beall <- structure(list(
  gender = c(rep(1, 32), rep(2, 32)),
  `pictorial absurdities` = c(15, 17, 15, 13, 20, 15, 15, 13, 14, 17, 17, 17,
    15, 18, 18, 15, 18, 10, 18, 18, 13, 16, 11, 16, 16, 18, 16, 15, 18, 18,
    17, 19, 13, 14, 12, 12, 11, 12, 10, 10, 12, 11, 12, 14, 14, 13, 14, 13,
    16, 14, 16, 13, 2, 14, 17, 16, 15, 12, 14, 13, 11, 7, 12, 6),
  `paper form boards` = c(17, 15, 14, 12, 17, 21, 13, 5, 7, 15, 17, 20, 15,
    19, 18, 14, 17, 14, 21, 21, 17, 16, 15, 13, 13, 18, 15, 16, 19, 16, 20,
    19, 14, 12, 19, 13, 20, 9, 13, 8, 20, 10, 18, 18, 10, 16, 8, 16, 21, 17,
    16, 16, 6, 16, 17, 13, 14, 10, 17, 15, 16, 7, 15, 5),
  `tool recognition` = c(24, 32, 29, 10, 26, 26, 26, 22, 30, 30, 26, 28, 29,
    32, 31, 26, 33, 19, 30, 34, 30, 16, 25, 26, 23, 34, 28, 29, 32, 33, 21,
    30, 12, 14, 21, 10, 16, 14, 18, 13, 19, 11, 25, 13, 25, 8, 13, 23, 26,
    14, 15, 23, 16, 22, 22, 16, 20, 12, 24, 18, 18, 19, 7, 6),
  vocabulary = c(14, 26, 23, 16, 28, 21, 22, 22, 17, 27, 20, 24, 24, 28, 27,
    21, 26, 17, 29, 26, 24, 16, 23, 16, 21, 24, 27, 24, 23, 23, 21, 28, 21,
    26, 21, 16, 16, 18, 24, 23, 23, 27, 25, 26, 28, 14, 25, 28, 26, 14, 23,
    24, 21, 26, 28, 14, 26, 9, 23, 20, 28, 18, 28, 13)),
  .Names = c("gender", "pictorial absurdities", "paper form boards",
    "tool recognition", "vocabulary"),
  row.names = c(NA, 64L), class = "data.frame")
# beall are scores of 32 men and 32 women on 4 tests,
# taken from Beall (Psychometrika, 1945)

data(HolzingerSwineford1939, package = "lavaan")
HS.name <- c("vis perc", "cubes", "lozenges", "par comp", "sen comp",
             "wordmean", "addition", "cont dot", "s c caps")
HS <- HolzingerSwineford1939[, 7:15]
names(HS) <- HS.name
RHS <- cor(HS)
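As a short sanity check of my own on these example objects (not part of the thesis code), the indefiniteness of the Burt matrix mentioned in the comment above can be verified directly, and the remaining objects inspected:

min(eigen(burt, only.values = TRUE)$values)   # negative, so burt is not positive semi-definite
dim(RHS)                                      # a 9 x 9 correlation matrix
round(RHS[1:3, 1:3], 2)
dim(beall)                                    # 64 observations on gender and 4 tests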

APPENDIX C
Program
C.1. Main.
lsfa <- function(cov = NULL, y = NULL, p, itmax = 1000, eps = 1e-6,
                 initit = 5, fm = "lsfacNewton", method = "BFGS",
                 verbose = FALSE) {
  if (is.null(y)) {
    cov <- as.matrix(cov)
    m <- nrow(cov)
    h <- initialAD(cov, p)
    olda <- h$a
    itel <- 1
  }
  if (fm == "lsfacPFA") {
    oldd <- h$d
    repeat {
      newd <- diag(cov - tcrossprod(olda))
      edec <- eigen(cov - diag(newd))
      ev <- edec$vectors[, 1:p]
      ea <- edec$values[1:p]
      newa <- ev * matrix(sqrt(pmax(0, ea)), m, p, byrow = TRUE)
      chg <- max(abs(oldd - newd))
      if (verbose) {
        cat("Iteration: ", formatC(itel, digits = 4, width = 6),
            "Change: ", formatC(chg, digits = 6, width = 10, format = "f"), "\n")
        cat("oldd: ", formatC(oldd, digits = 6, width = 10, format = "f"), "\n")
        cat("newd: ", formatC(newd, digits = 6, width = 10, format = "f"), "\n\n")
      }
      if ((chg < eps) || (itel == itmax)) {
        break
      }
      itel <- itel + 1
      oldd <- newd
      olda <- newa
    }
    loss <- sum((cov - diag(newd) - tcrossprod(newa))^2) / 2
    result <- list(itel = itel, loss = loss, chg = chg, a = olda, d = oldd)
  }
  if (fm == "lsfacHarman") {
    repeat {
      newa <- olda
      for (i in 1:m) {
        newa[i, ] <- qr.solve(newa[-i, ], cov[i, ][-i])
      }
      chg <- max(abs(olda - newa))
      if (verbose) {
        cat("Iteration: ", formatC(itel, digits = 4, width = 6),
            "Change: ", formatC(chg, digits = 6, width = 10, format = "f"), "\n")
      }
      if ((chg < eps) || (itel == itmax)) {
        break
      }
      itel <- itel + 1
      olda <- newa
    }
    d <- diag(cov - tcrossprod(newa))
    loss <- sum((cov - diag(d) - tcrossprod(newa))^2) / 2
    result <- list(itel = itel, loss = loss, chg = chg, a = olda, d = d)
  }
  if (fm == "lsfacComrey") {
    repeat {
      newa <- olda
      for (s in 1:p) {
        for (i in 1:m) {
          cis <- newa[, s][-i]
          ris <- newa[i, ][-s]
          dis <- cov[i, ][-i]
          mis <- newa[-i, -s]
          newa[i, s] <- sum(cis * (dis - mis %*% ris)) / sum(cis * cis)
        }
      }
      chg <- max(abs(olda - newa))
      if (verbose) {
        cat("Iteration: ", formatC(itel, digits = 4, width = 6),
            "Change: ", formatC(chg, digits = 6, width = 10, format = "f"), "\n")
      }
      if ((chg < eps) || (itel == itmax)) {
        break
      }
      itel <- itel + 1
      olda <- newa
    }
    d <- diag(cov - tcrossprod(newa))
    loss <- sum((cov - diag(d) - tcrossprod(newa))^2) / 2
    result <- list(itel = itel, loss = loss, chg = chg, a = olda, d = d)
  }
  if (fm == "lsfacOptim") {
    loss <- function(d) {
      eval <- eigen(cov - diag(d), only.values = TRUE)$values
      sum(eval[-(1:p)]^2) / 2
    }
    lsfacPFA <- function(cov, p, itmax = 1000, eps = 1e-6, verbose = FALSE) {
      cov <- as.matrix(cov)
      m <- nrow(cov)
      h <- initialAD(cov, p)
      olda <- h$a
      oldd <- h$d
      itel <- 1
      repeat {
        newd <- diag(cov - tcrossprod(olda))
        edec <- eigen(cov - diag(newd))
        ev <- edec$vectors[, 1:p]
        ea <- edec$values[1:p]
        newa <- ev * matrix(sqrt(pmax(0, ea)), m, p, byrow = TRUE)
        chg <- max(abs(oldd - newd))
        if ((chg < eps) || (itel == itmax)) {
          break
        }
        itel <- itel + 1
        oldd <- newd
        olda <- newa
      }
      loss <- sum((cov - diag(newd) - tcrossprod(newa))^2) / 2
      return(list(itel = itel, loss = loss, chg = chg, a = olda, d = oldd))
    }
    d <- lsfacPFA(cov, p, itmax = initit)$d
    res <- optimx(d, loss, method = method)
    d <- unlist(res$par)
    loss <- as.numeric(res$fvalues)
    ec <- eigen(cov - diag(d))
    ed <- ec$values[1:p]
    ev <- ec$vectors[, 1:p]
    a <- ev * matrix(sqrt(ed), m, p, byrow = TRUE)
    result <- list(loss = loss, a = a, d = d)
  }
  if (fm == "lsfacNewton") {
    lsfacComrey <- function(cov, p, itmax = 1000, eps = 1e-6, verbose = FALSE) {
      cov <- as.matrix(cov)
      m <- nrow(cov)
      olda <- initialAD(cov, p)$a
      itel <- 1
      repeat {
        newa <- olda
        for (s in 1:p) {
          for (i in 1:m) {
            cis <- newa[, s][-i]
            ris <- newa[i, ][-s]
            dis <- cov[i, ][-i]
            mis <- newa[-i, -s]
            newa[i, s] <- sum(cis * (dis - mis %*% ris)) / sum(cis * cis)
          }
        }
        chg <- max(abs(olda - newa))
        if ((chg < eps) || (itel == itmax)) {
          break
        }
        itel <- itel + 1
        olda <- newa
      }
      d <- diag(cov - tcrossprod(newa))
      loss <- sum((cov - diag(d) - tcrossprod(newa))^2) / 2
      return(list(itel = itel, loss = loss, chg = chg, a = olda, d = d))
    }
    pp <- p + (1:(m - p))
    oldd <- lsfacComrey(cov, p, itmax = initit)$d
    itel <- 1
    repeat {
      edec <- eigen(cov - diag(oldd))
      ev <- edec$vectors
      ea <- edec$values
      eu <- ev[, pp]
      gg <- drop((eu^2) %*% ea[pp])
      h1 <- tcrossprod(eu^2)
      h2 <- matrix(0, m, m)
      for (s in pp) {
        ee <- ea - ea[s]
        ee <- ifelse(ee == 0, 0, 1 / ee)
        ew <- ev[, s]
        h2 <- h2 + ea[s] * outer(ew, ew) * (ev %*% (ee * t(ev)))
      }
      newd <- oldd - solve(h1 - (2 * h2), gg)
      chg <- max(abs(oldd - newd))
      if (verbose) {
        cat("Iteration: ", formatC(itel, digits = 4, width = 6),
            "Change: ", formatC(chg, digits = 6, width = 10, format = "f"), "\n")
        cat("oldd: ", formatC(oldd, digits = 6, width = 10, format = "f"), "\n")
        cat("newd: ", formatC(newd, digits = 6, width = 10, format = "f"), "\n\n")
      }
      if ((chg < eps) || (itel == itmax)) {
        break
      }
      itel <- itel + 1
      oldd <- newd
    }
    a <- ev[, 1:p] * matrix(sqrt(pmax(0, ea[1:p])), m, p, byrow = TRUE)
    loss <- sum((cov - diag(newd) - tcrossprod(a))^2) / 2
    result <- list(itel = itel, loss = loss, chg = chg, a = a, d = newd)
  }
  if (fm == "lsfayInvSqrt") {
    oldd <- sqrt(h$d)
    itel <- 1
    repeat {
      ad <- cbind(olda, diag(oldd))
      e <- crossprod(ad, crossprod(cov, ad))
      u <- myInvSqrt(e)
      # alternative, but slower: u <- denman.beavers(e)$ysqinv
      g <- cov %*% ad %*% u
      newa <- g[, 1:p]
      newd <- diag(g[, p + (1:m)])
      chg <- max(abs(oldd - newd))
      if (verbose) {
        cat("Iteration: ", formatC(itel, digits = 4, width = 6),
            "Change: ", formatC(chg, digits = 6, width = 10, format = "f"), "\n")
        cat("oldd: ", formatC(oldd, digits = 6, width = 10, format = "f"), "\n")
        cat("newd: ", formatC(newd, digits = 6, width = 10, format = "f"), "\n\n")
      }
      if ((chg < eps) || (itel == itmax)) {
        break
      }
      itel <- itel + 1
      oldd <- newd
      olda <- newa
    }
    sval <- sqrt(pmax(0, eigen(e, only.values = TRUE)$values))
    loss <- ((sum(diag(cov)) + sum(ad^2)) / 2) - sum(sval)
    result <- list(itel = itel, loss = loss, chg = chg, sval = sval,
                   a = olda, d = oldd)
  }
  if (fm == "lsfaySVD") {
    y <- as.matrix(y)
    y <- apply(y, 2, function(x) x - mean(x))
    y <- apply(y, 2, function(x) x / sqrt(sum(x^2)))
    h <- initialAD(crossprod(y), p)
    olda <- h$a
    oldd <- sqrt(h$d)
    itel <- 1
    repeat {
      emat <- y %*% cbind(olda, diag(oldd))
      esvd <- svd(emat)
      xxuu <- tcrossprod(esvd$u, esvd$v)
      aadd <- crossprod(y, xxuu)
      newa <- aadd[, 1:p]
      newd <- diag(aadd[, -(1:p)])
      chg <- max(abs(oldd - newd))
      if (verbose) {
        cat("Iteration: ", formatC(itel, digits = 4, width = 6),
            "Change: ", formatC(chg, digits = 6, width = 10, format = "f"), "\n")
        cat("oldd: ", formatC(oldd, digits = 6, width = 10, format = "f"), "\n")
        cat("newd: ", formatC(newd, digits = 6, width = 10, format = "f"), "\n\n")
      }
      if ((chg < eps) || (itel == itmax)) {
        break
      }
      itel <- itel + 1
      oldd <- newd
      olda <- newa
    }
    loss <- sum((y - tcrossprod(xxuu, cbind(newa, diag(newd))))^2) / 2
    result <- list(itel = itel, loss = loss, chg = chg, a = olda, d = oldd)
  }
  if (fm == "lsfayOptim") {
    tc <- sum(diag(cov))
    m <- nrow(cov)
    loss <- function(x) {
      sx <- sum(x^2)
      a <- matrix(x[1:(m * p)], m, p)
      d <- x[-(1:(m * p))]
      ad <- cbind(a, diag(d))
      e <- crossprod(ad, crossprod(cov, ad))
      lbd <- sqrt(pmax(0, eigen(e, only.values = TRUE)$values))
      return(((sx + tc) / 2) - sum(lbd))
    }
    lsfayInvSqrt <- function(cov, p, itmax = 1000, eps = 1e-6, verbose = FALSE) {
      cov <- as.matrix(cov)
      m <- nrow(cov)
      h <- initialAD(cov, p)
      olda <- h$a
      oldd <- sqrt(h$d)
      itel <- 1
      repeat {
        ad <- cbind(olda, diag(oldd))
        e <- crossprod(ad, crossprod(cov, ad))
        u <- myInvSqrt(e)
        # alternative, but slower: u <- denman.beavers(e)$ysqinv
        g <- cov %*% ad %*% u
        newa <- g[, 1:p]
        newd <- diag(g[, p + (1:m)])
        chg <- max(abs(oldd - newd))
        if ((chg < eps) || (itel == itmax)) {
          break
        }
        itel <- itel + 1
        oldd <- newd
        olda <- newa
      }
      sval <- sqrt(pmax(0, eigen(e, only.values = TRUE)$values))
      loss <- ((sum(diag(cov)) + sum(ad^2)) / 2) - sum(sval)
      return(list(itel = itel, loss = loss, chg = chg, sval = sval,
                  a = olda, d = oldd))
    }
    ad <- lsfayInvSqrt(cov, p, itmax = initit)
    x <- c(as.vector(ad$a), ad$d)
    res <- optimx(x, loss, method = method)
    rp <- unlist(res$par)
    a <- matrix(rp[1:(m * p)], m, p)
    d <- rp[-(1:(m * p))]
    names(d) <- rownames(cov)
    rownames(a) <- rownames(cov)
    result <- list(loss = as.numeric(res$fvalues), a = a, d = d)
  }
  return(result)
}
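A minimal usage sketch (my own illustration; it assumes the example objects of Appendix B are in the workspace): the covariance-based forms of lsfa take a correlation or covariance matrix through the cov argument, while the data-based form lsfaySVD works directly on the raw scores through y.

fitC <- lsfa(cov = RHS, p = 3, fm = "lsfacPFA")   # LSFAC loss on the correlation matrix
fitY <- lsfa(y = HS, p = 3, fm = "lsfaySVD")      # LSFAY loss on the centered, scaled data
round(fitC$a, 2)                                  # 9 x 3 loading matrix
round(fitC$d, 2)                                  # uniquenesses
c(fitC$loss, fitY$loss)                           # the two criteria are on different scales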

C.2. Auxiliaries.
initialAD <- function(cov, p) {
  m <- nrow(cov)
  ec <- eigen(cov)
  ed <- ec$values
  ev <- ec$vectors
  a <- ev[, 1:p] * matrix(sqrt(ed[1:p]), m, p, byrow = TRUE)
  d <- rep(sum(ed[-(1:p)]) / (m - p), m)
  return(list(a = a, d = d))
}

denman.beavers <- function(mat, itmax = 50, eps = 1e-6, verbose = FALSE) {
  stopifnot(nrow(mat) == ncol(mat))
  itel <- 1
  yold <- mat
  z <- mat %*% ginv(mat)
  repeat {
    ynew <- 0.5 * (yold + ginv(z))
    z <- 0.5 * (z + ginv(yold))
    chg <- max(abs(yold - ynew))
    if (verbose) {
      cat("Iteration: ", formatC(itel, digits = 6, width = 6),
          "Change: ", formatC(chg, digits = 10, width = 15, format = "f"), "\n")
    }
    if ((chg < eps) || (itel == itmax)) {
      break
    }
    itel <- itel + 1
    yold <- ynew
  }
  return(list(ysqrt = ynew, ysqinv = z))
}

myInvSqrt <- function(a) {
  e <- eigen(a)
  ev <- e$vectors
  ea <- abs(e$values)
  ea <- ifelse(ea == 0, 0, 1 / sqrt(ea))
  return(ev %*% (ea * t(ev)))
}
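A small check of mine on the two inverse square root routines above (it assumes MASS, loaded in Appendix B, for ginv): on a positive definite matrix both should return approximately the same matrix B, with B %*% x %*% B equal to the identity.

set.seed(1)
x <- crossprod(matrix(rnorm(50), 10, 5))      # a random 5 x 5 positive definite matrix
b1 <- myInvSqrt(x)
b2 <- denman.beavers(x)$ysqinv
max(abs(b1 %*% x %*% b1 - diag(5)))           # close to zero
max(abs(b1 - b2))                             # small, up to the iteration tolerance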

# Myloss evaluates a fitted solution under both the LSFAC and the LSFAY criteria
Myloss <- function(cov, result, fm = "LSFAC") {
  if (fm == "LSFAC") {
    LSFACf <- result$loss
    a <- result$a
    d <- sqrt(result$d)
    ad <- cbind(a, diag(d))
    e <- crossprod(ad, crossprod(cov, ad))
    sval <- sqrt(pmax(0, eigen(e, only.values = TRUE)$values))
    LSFAYf <- ((sum(diag(cov)) + sum(ad^2)) / 2) - sum(sval)
  }
  else {
    if (fm == "LSFAY") {
      a <- result$a
      d <- (result$d)^2
      LSFACf <- sum((cov - diag(d) - tcrossprod(a))^2) / 2
      LSFAYf <- result$loss
    }
    else {
      if (fm == "ML") {
        a <- result$loadings
        d <- sqrt(result$uniquenesses)
        ad <- cbind(a, diag(d))
        e <- crossprod(ad, crossprod(cov, ad))
        sval <- sqrt(pmax(0, eigen(e, only.values = TRUE)$values))
        LSFACf <- sum((cov - diag(d^2) - tcrossprod(a))^2) / 2
        LSFAYf <- ((sum(diag(cov)) + sum(ad^2)) / 2) - sum(sval)
      }
    }
  }
  return(c(LSFACf, LSFAYf))
}

# MyLMC compares the p-column blocks of A pairwise
MyLMC <- function(A, p) {
  k <- ncol(A) / p
  LMC <- matrix(0, nrow = k, ncol = k)
  for (j in 1:k) {
    lmc <- NULL
    g <- 1 + (j - 1) * p
    X <- A[, g:(g + (p - 1))]
    for (i in 1:k) {
      t <- 1 + (i - 1) * p
      Y <- A[, t:(t + (p - 1))]
      lmc1 <- sum(svd(crossprod(X, Y))$d) / sqrt(sum(X * X) * sum(Y * Y))
      lmc <- c(lmc, lmc1)
    }
    LMC[, j] <- lmc
  }
  return(LMC)
}
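For completeness, a sketch of how these helpers can be combined (my own illustration; it assumes the objects of Appendices B and C are available, and uses factanal for the maximum likelihood solution): Myloss evaluates a fitted solution under both the LSFAC and the LSFAY criteria, and MyLMC computes congruence-type coefficients between blocks of loadings placed side by side.

fitC <- lsfa(cov = RHS, p = 3, fm = "lsfacNewton")
fitM <- factanal(covmat = RHS, factors = 3, rotation = "none")
Myloss(RHS, fitC, fm = "LSFAC")               # c(LSFAC loss, LSFAY loss) for the LSFAC fit
Myloss(RHS, fitM, fm = "ML")                  # the same two losses for the ML fit
A <- cbind(fitC$a, loadings(fitM))            # 9 x 6: two 3-column loading blocks
MyLMC(A, p = 3)                               # 2 x 2 matrix of congruence-type coefficients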

Bibliography

[1] T. W. Anderson and Herman Rubin. Statistical inference in factor analysis. Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, 5:111–150, 1956.
[2] T. W. Anderson. Estimating linear statistical relationships. The Annals of Statistics, 12(1):1–45, 1984.
[3] M. S. Bartlett. Tests of significance in factor analysis. British Journal of Psychology, Statistical Section, 8:77–85, 1950.
[4] C. Chatfield and A. J. Collins. Introduction to Multivariate Analysis. Chapman and Hall, New Jersey, 1980.
[5] A. L. Comrey. The minimum residual method of factor analysis. Psychological Reports, 11:15–18, 1962.
[6] Richard L. Gorsuch. Factor Analysis. Lawrence Erlbaum Associates, New Jersey, 1983.
[7] H. H. Harman and Y. Fukuda. Resolution of the Heywood case in the minres solution. Psychometrika, 31:563–571, 1966.
[8] H. H. Harman and W. H. Jones. Factor analysis by minimizing residuals (minres). Psychometrika, 31:351–368, 1966.
[9] P. Horst. Factor Analysis of Data Matrices. Holt, Rinehart and Winston, 1965.
[10] J. C. Gower and G. B. Dijksterhuis. Procrustes Problems. Oxford University Press, 2004.
[11] K. G. Jöreskog. On the statistical treatment of residuals in factor analysis. Psychometrika, 27:435–454, 1962.
[12] K. G. Jöreskog. Some contributions to maximum likelihood factor analysis. Psychometrika, 32:443–482, 1967.
[13] Richard A. Johnson and Dean W. Wichern. Applied Multivariate Statistical Analysis. Prentice Hall, New Jersey, 2002.
[14] D. N. Lawley. The estimation of factor loadings by the method of maximum likelihood. Proceedings of the Royal Society of Edinburgh, 60:64–82, 1940.
[15] D. N. Lawley. Further investigations in factor estimation. Proceedings of the Royal Society of Edinburgh, 62:176–185, 1942.
[16] J. De Leeuw. Least squares optimal scaling of partially observed linear systems. In Recent Developments on Structural Equation Models: Theory and Applications, pages 121–134, 2004.
[17] J. De Leeuw. Derivatives of generalized eigen systems with applications. Preprint 528, Department of Statistics, UCLA, 2007.
[18] J. De Leeuw. Least squares methods for factor analysis. Preprint, Department of Statistics, UCLA, 2010.
[19] Eni Pan Gao and Zhiwei Ren. Introduction to Multivariate Analysis for the Social Sciences. W. H. Freeman and Company, New Jersey, 1971.
[20] H. Rutishauser. Computational aspects of F. L. Bauer's simultaneous iteration method. Numerische Mathematik, 13:4–13, 1969.
[21] L. L. Thurstone. Multiple-Factor Analysis. University of Chicago Press, 1947.
[22] Steffen Unkel and Nickolay T. Trendafilov. Noisy independent component analysis as a method of rotating the factor scores. Springer-Verlag, Berlin Heidelberg, 2007.
[23] Steffen Unkel and Nickolay T. Trendafilov. Factor analysis as data matrix decomposition: A new approach for quasi-sphering in noisy ICA. Springer-Verlag, Berlin Heidelberg, pages 163–170, 2009.
[24] P. Whittle. On principal components and least squares methods in factor analysis. Skandinavisk Aktuarietidskrift, 35:223–239, 1952.
[25] Gale Young. Maximum likelihood estimation and factor analysis. Psychometrika, 6:49–53, 1941.
[26] F. E. Zegers and J. M. F. Ten Berge. A fast and simple computational method of minimum residual factor analysis. Multivariate Behavioral Research, 18:331–340, 1983.