
Least Squares Adjustment:

Linear and Nonlinear Weighted Regression Analysis

Allan Aasbjerg Nielsen

Technical University of Denmark


Applied Mathematics and Computer Science/National Space Institute
Building 321, DK-2800 Kgs. Lyngby, Denmark
phone +45 4525 3425, fax +45 4588 1397
http://www.imm.dtu.dk/alan
e-mail alan@dtu.dk

19 September 2013

Preface

This note primarily describes the mathematics of least squares regression analysis as it is often used in
geodesy including land surveying and satellite based positioning applications. In these fields regression is
often termed adjustment¹. The note also contains a couple of typical land surveying and satellite positioning
application examples. In these application areas we are typically interested in the parameters of the model
(often 2- or 3-D positions) and their uncertainties and not in predictive modelling which is often the main
concern in other regression analysis applications.

Adjustment is often used to obtain estimates of relevant parameters in an over-determined system of equations
which may arise from deliberately carrying out more measurements than actually needed to determine the set
of desired parameters. An example may be the determination of a geographical position based on information
from a number of Global Navigation Satellite System (GNSS) satellites also known as space vehicles (SV).
It takes at least four SVs to determine the position (and the clock error) of a GNSS receiver. Often more than
four SVs are used and we use adjustment to obtain a better estimate of the geographical position (and the
clock error) and to obtain estimates of the uncertainty with which the position is determined.

Regression analysis is used in many other fields of application both in the natural, the technical and the social
sciences. Examples may be curve fitting, calibration, establishing relationships between different variables
in an experiment or in a survey, etc. Regression analysis is probably one of the most used statistical techniques
around.

Dr. Anna B. O. Jensen provided insight and data for the Global Positioning System (GPS) example.

Matlab code and sections that are considered as either traditional land surveying material or as advanced
material are typeset with smaller fonts.

Comments in general or on for example unavoidable typos, shortcomings and errors are most welcome.
¹ In Danish: udjævning.

Contents

Preface

Contents

1 Linear Least Squares
  1.1 Ordinary Least Squares, OLS
    1.1.1 Linear Constraints
    1.1.2 Parameter Estimates
    1.1.3 Regularization
    1.1.4 Dispersion and Significance of Estimates
    1.1.5 Residual and Influence Analysis
    1.1.6 Singular Value Decomposition, SVD
    1.1.7 QR Decomposition
    1.1.8 Cholesky Decomposition
  1.2 Weighted Least Squares, WLS
    1.2.1 Parameter Estimates
    1.2.2 Weight Assignment
    1.2.3 Dispersion and Significance of Estimates
    1.2.4 WLS as OLS
  1.3 General Least Squares, GLS
    1.3.1 Regularization
2 Nonlinear Least Squares
  2.1 Nonlinear WLS by Linearization
    2.1.1 Parameter Estimates
    2.1.2 Iterative Solution
    2.1.3 Dispersion and Significance of Estimates
    2.1.4 Confidence Ellipsoids
    2.1.5 Dispersion of a Function of Estimated Parameters
    2.1.6 The Derivative Matrix
  2.2 Nonlinear WLS by other Methods
    2.2.1 The Gradient or Steepest Descent Method
    2.2.2 Newton's Method
    2.2.3 The Gauss-Newton Method
    2.2.4 The Levenberg-Marquardt Method
3 Final Comments

Literature

Index

1 Linear Least Squares

Example 1 (from Conradsen, 1984, 1B p. 5.58) Figure 1 shows a plot of clock error as a function of time
passed since a calibration of the clock. The relationship between time passed and the clock error seems to
be linear (or affine) and it would be interesting to estimate a straight line through the points in the plot, i.e.,
estimate the slope of the line and the intercept with the axis time = 0. This is a typical regression analysis
task (see also Example 2). [end of example]

[Figure 1 about here: scatter plot of clock error [seconds] against time [days] with the fitted straight line.]

Figure 1: Example with clock error as a function of time.

Let's start by studying a situation where we want to predict one (response) variable $y$ (as clock error in Example 1) as a linear function of one (predictor) variable $x$ (as time in Example 1). When we have one predictor variable only we talk about simple regression. We have $n$ joint observations of $x$ ($x_1,\dots,x_n$) and $y$ ($y_1,\dots,y_n$) and we write the model, where the parameter $\theta_1$ is the slope of the line, as
$$y_1 = \theta_1 x_1 + e_1 \qquad (1)$$
$$y_2 = \theta_1 x_2 + e_2 \qquad (2)$$
$$\vdots \qquad (3)$$
$$y_n = \theta_1 x_n + e_n. \qquad (4)$$
The $e_i$s are termed the residuals; they are the differences between the data $y_i$ and the model $\theta_1 x_i$. Rewrite to get
$$e_1 = y_1 - \theta_1 x_1 \qquad (5)$$
$$e_2 = y_2 - \theta_1 x_2 \qquad (6)$$
$$\vdots \qquad (7)$$
$$e_n = y_n - \theta_1 x_n. \qquad (8)$$
In order to find the best line through (the origin and) the point cloud $\{x_i, y_i\}_{i=1}^n$ by means of the least squares principle write

$$\phi = \frac{1}{2}\sum_{i=1}^n e_i^2 = \frac{1}{2}\sum_{i=1}^n (y_i - \theta_1 x_i)^2 \qquad (9)$$
and find the derivative of $\phi$ with respect to the slope $\theta_1$
$$\frac{d\phi}{d\theta_1} = \sum_{i=1}^n (y_i - \theta_1 x_i)(-x_i) = \sum_{i=1}^n (\theta_1 x_i^2 - x_i y_i). \qquad (10)$$

Setting the derivative equal to zero and denoting the solution $\hat\theta_1$ we get
$$\hat\theta_1\sum_{i=1}^n x_i^2 = \sum_{i=1}^n x_i y_i \qquad (11)$$
or (omitting the summation indices for clarity)
$$\hat\theta_1 = \frac{\sum x_i y_i}{\sum x_i^2}. \qquad (12)$$
Since
$$\frac{d^2\phi}{d\theta_1^2} = \sum_{i=1}^n x_i^2 > 0 \qquad (13)$$
for non-trivial cases $\hat\theta_1$ gives a minimum for $\phi$. This $\hat\theta_1$ gives the best straight line through the origin and the point cloud, best in the sense that it minimizes (half) the sum of the squared residuals measured along the $y$-axis, i.e., perpendicular to the $x$-axis. In other words: the $x_i$s are considered as uncertainty- or error-free constants, all the uncertainty or error is associated with the $y_i$s.

Let's look at another situation where we want to predict one (response) variable $y$ as an affine function of one (predictor) variable $x$. We have $n$ joint observations of $x$ and $y$ and write the model, where the parameter $\theta_0$ is the intercept of the line with the $y$-axis and the parameter $\theta_1$ is the slope of the line, as
$$y_1 = \theta_0 + \theta_1 x_1 + e_1 \qquad (14)$$
$$y_2 = \theta_0 + \theta_1 x_2 + e_2 \qquad (15)$$
$$\vdots \qquad (16)$$
$$y_n = \theta_0 + \theta_1 x_n + e_n. \qquad (17)$$
Rewrite to get
$$e_1 = y_1 - (\theta_0 + \theta_1 x_1) \qquad (18)$$
$$e_2 = y_2 - (\theta_0 + \theta_1 x_2) \qquad (19)$$
$$\vdots \qquad (20)$$
$$e_n = y_n - (\theta_0 + \theta_1 x_n). \qquad (21)$$
In order to find the best line through the point cloud $\{x_i, y_i\}_{i=1}^n$ (and this time not necessarily through the origin) by means of the least squares principle write
$$\phi = \frac{1}{2}\sum_{i=1}^n e_i^2 = \frac{1}{2}\sum_{i=1}^n (y_i - (\theta_0 + \theta_1 x_i))^2 \qquad (22)$$

and find the partial derivatives of $\phi$ with respect to the intercept $\theta_0$ and the slope $\theta_1$
$$\frac{\partial\phi}{\partial\theta_0} = \sum_{i=1}^n (y_i - (\theta_0 + \theta_1 x_i))(-1) = -\sum_{i=1}^n y_i + n\theta_0 + \theta_1\sum_{i=1}^n x_i \qquad (23)$$
$$\frac{\partial\phi}{\partial\theta_1} = \sum_{i=1}^n (y_i - (\theta_0 + \theta_1 x_i))(-x_i) = -\sum_{i=1}^n x_i y_i + \theta_0\sum_{i=1}^n x_i + \theta_1\sum_{i=1}^n x_i^2. \qquad (24)$$
Setting the partial derivatives equal to zero and denoting the solutions $\hat\theta_0$ and $\hat\theta_1$ we get (omitting the summation indices for clarity)
$$\hat\theta_0 = \frac{\sum x_i^2\sum y_i - \sum x_i\sum x_i y_i}{n\sum x_i^2 - (\sum x_i)^2} \qquad (25)$$
$$\hat\theta_1 = \frac{n\sum x_i y_i - \sum x_i\sum y_i}{n\sum x_i^2 - (\sum x_i)^2}. \qquad (26)$$
We see that $\hat\theta_1\sum x_i + n\hat\theta_0 = \sum y_i$ or $\bar y = \hat\theta_0 + \hat\theta_1\bar x$ (leading to $\sum\hat e_i = \sum[y_i - (\hat\theta_0 + \hat\theta_1 x_i)] = 0$) where $\bar x = \sum x_i/n$ is the mean value of $x$ and $\bar y = \sum y_i/n$ is the mean value of $y$. Another way of writing this is
$$\hat\theta_0 = \bar y - \hat\theta_1\bar x \qquad (27)$$
$$\hat\theta_1 = \frac{\sum(x_i - \bar x)(y_i - \bar y)}{\sum(x_i - \bar x)^2} = \frac{\hat\sigma_{xy}}{\hat\sigma_x^2} \qquad (28)$$
where $\hat\sigma_{xy} = \sum(x_i - \bar x)(y_i - \bar y)/(n-1)$ is the covariance between $x$ and $y$, and $\hat\sigma_x^2 = \sum(x_i - \bar x)^2/(n-1)$ is the variance of $x$. Also in this case $\hat\theta_0$ and $\hat\theta_1$ give a minimum for $\phi$, see the second order derivative argument in Section 1.1.

Example 2 (continuing Example 1) With time points ($x_i$) [3 6 7 9 11 12 14 16 18 19 23 24 33 35 39 41 42 44 45 49]$^T$ days and clock errors ($y_i$) [0.435 0.706 0.729 0.975 1.063 1.228 1.342 1.491 1.671 1.696 2.122 2.181 2.938 3.135 3.419 3.724 3.705 3.820 3.945 4.320]$^T$ seconds we get $\hat\theta_0 = 0.1689$ seconds and $\hat\theta_1 = 0.08422$ seconds/day. This line is plotted in Figure 1. Judged visually the line seems to model the data fairly well.
[end of example]
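As a small illustration, Equations 26 and 27 can be evaluated directly in Matlab for the clock error data of Example 2; the variable names below are chosen for this sketch only.

% simple (affine) regression for the clock error data in Example 2
x = [3 6 7 9 11 12 14 16 18 19 23 24 33 35 39 41 42 44 45 49]'; % time [days]
y = [0.435 0.706 0.729 0.975 1.063 1.228 1.342 1.491 1.671 1.696 2.122 ...
     2.181 2.938 3.135 3.419 3.724 3.705 3.820 3.945 4.320]';   % clock error [s]
n = length(x);
theta1 = (n*sum(x.*y) - sum(x)*sum(y))/(n*sum(x.^2) - sum(x)^2); % slope, Equation 26
theta0 = mean(y) - theta1*mean(x);                               % intercept, Equation 27
% the same estimates via the design matrix and the backslash operator
theta = [ones(n,1) x]\y;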

More generally let us consider $n$ observations of one dependent (or response) variable $y$ and $p$ independent (or explanatory or predictor) variables $x_j$, $j = 1,\dots,p$. The $x_j$s are also called the regressors. When we have more than one regressor we talk about multiple regression analysis. The words dependent and independent are not used in their probabilistic meaning here but are merely meant to indicate that $x_j$ in principle may vary freely and that $y$ varies depending on $x_j$. Our task is to 1) estimate the parameters $\theta_j$ in the model below, and 2) predict the expectation value of $y$ where we consider $y$ as a function of the $\theta_j$s and not of the $x_j$s which are considered as constants. For the $i$th set of observations we have
$$y_i = y_i(\theta_0,\theta_1,\dots,\theta_p; x_1,\dots,x_p) + e_i \qquad (29)$$
$$= y_i(\theta; x) + e_i \qquad (30)$$
$$= y_i(\theta) + e_i \qquad (31)$$
$$= (\theta_0 +)\ \theta_1 x_{i1} + \dots + \theta_p x_{ip} + e_i, \quad i = 1,\dots,n \qquad (32)$$
where $\theta = [\theta_0\ \theta_1\ \dots\ \theta_p]^T$, $x = [x_1\ \dots\ x_p]^T$, and $e_i$ is the difference between the data and the model for observation $i$ with expectation value $E\{e_i\} = 0$. $e_i$ is termed the residual or the error. The last equation above is written with the constant or the intercept $\theta_0$ in parenthesis since we may or may not want to include $\theta_0$ in the model, see also Examples 3-5. Write all $n$ equations in matrix form
$$\begin{bmatrix} y_1\\ y_2\\ \vdots\\ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1p}\\ 1 & x_{21} & \cdots & x_{2p}\\ \vdots & \vdots & & \vdots\\ 1 & x_{n1} & \cdots & x_{np} \end{bmatrix}\begin{bmatrix} \theta_0\\ \theta_1\\ \vdots\\ \theta_p \end{bmatrix} + \begin{bmatrix} e_1\\ e_2\\ \vdots\\ e_n \end{bmatrix} \qquad (33)$$

or
$$y = X\theta + e \qquad (34)$$
where

$y$ is $n\times 1$,
$X$ is $n\times p$ (here $p$ denotes the number of columns of $X$: the number of predictors plus one if an intercept $\theta_0$ is estimated, the number of predictors if not),
$\theta$ is $p\times 1$, and
$e$ is $n\times 1$ with expectation $E\{e\} = \mathbf{0}$.

If we don't want to include $\theta_0$ in the model, $\theta_0$ is omitted from $\theta$ and so is the first column of ones in $X$.

Equations 33 and 34 are termed the observation equations². The columns in $X$ must be linearly independent, i.e., $X$ has full column rank. Here we study the situation where the system of equations is over-determined, i.e., we have more observations than parameters, $n > p$. $f = n - p$ is termed the number of degrees of freedom³.

The model is linear in the parameters $\theta$ but not necessarily linear in $y$ and $x_j$ (for instance $y$ could be replaced by $\ln y$ or $1/y$, or $x_j$ could be replaced by $\sqrt{x_j}$; extra columns with products $x_k x_l$ called interactions could be added to $X$ or similarly). Transformations of $y$ have implications for the nature of the residual.

Finding an optimal $\theta$ given a set of observed data (the $y$s and the $x_j$s) and an objective function (or a cost or a merit function, see below) is referred to as regression analysis in statistics. The elements of the vector $\theta$ are also called the regression coefficients. In some application sciences such as geodesy including land surveying, regression analysis is termed adjustment⁴.

All uncertainty (or error) is associated with $y$; the $x_j$s are considered as constants, which may or may not be reasonable depending on (the genesis of) the data to be analyzed.

1.1 Ordinary Least Squares, OLS

In OLS we assume that the variance-covariance matrix, also known as the dispersion matrix, of $y$ is proportional to the identity matrix, $D\{y\} = D\{e\} = \sigma^2 I$, i.e., all residuals have the same variance and they are uncorrelated. We minimize the objective function $\phi = \frac{1}{2}\sum_{i=1}^n e_i^2 = e^T e/2$ (hence the name least squares: we minimize (half) the sum of squared differences between the data and the model, i.e., (half) the sum of the squared residuals)
$$\phi = 1/2\,(y - X\theta)^T(y - X\theta) \qquad (35)$$
$$= 1/2\,(y^T y - y^T X\theta - \theta^T X^T y + \theta^T X^T X\theta) \qquad (36)$$
$$= 1/2\,(y^T y - 2\theta^T X^T y + \theta^T X^T X\theta). \qquad (37)$$
The derivative with respect to $\theta$ is
$$\frac{\partial\phi}{\partial\theta} = -X^T y + X^T X\theta. \qquad (38)$$

² In Danish: observationsligningerne.
³ In Danish: antal frihedsgrader or antal overbestemmelser.
⁴ In Danish: udjævning.

When the columns of $X$ are linearly independent the second order derivative $\partial^2\phi/\partial\theta\partial\theta^T = X^T X$ is positive definite. Therefore we have a minimum for $\phi$. Note that the $p\times p$ matrix $X^T X$ is symmetric, $(X^T X)^T = X^T X$.

We find the OLS estimate for $\theta$, termed $\hat\theta_{OLS}$ (pronounced theta-hat), by setting $\partial\phi/\partial\theta = \mathbf{0}$ to obtain the normal equations⁵
$$X^T X\,\hat\theta_{OLS} = X^T y. \qquad (39)$$

1.1.1 Linear Constraints

Linear constraints can be built into the normal equations by defining
$$K^T\theta = c \qquad (40)$$
where the vector $c$ and the columns of matrix $K$ define the constraints, one constraint per column of $K$ and per element of $c$. If for example $\theta = [\theta_1\ \theta_2\ \theta_3\ \theta_4\ \theta_5]^T$ and $\theta_2$, $\theta_3$ and $\theta_5$ are the three angles in a triangle which must sum to $\pi$, also known as 200 gon in land surveying (with no constraints on $\theta_1$ and $\theta_4$), use $K^T = [0\ 1\ 1\ 0\ 1]$ and $c = 200$ gon.

Also, we must add a term to the expression for $\phi$ in Equation 35 above setting the constraints to zero
$$L = \phi + \lambda^T(K^T\theta - c) \qquad (41)$$
where $\lambda$ is a vector of so-called Lagrangian multipliers.

Setting the partial derivatives of Equation 41 with respect to $\theta$ and $\lambda$ to zero (the latter reproducing Equation 40) leads to
$$\begin{bmatrix} X^T X & K\\ K^T & 0 \end{bmatrix}\begin{bmatrix} \hat\theta_{OLS}\\ \hat\lambda \end{bmatrix} = \begin{bmatrix} X^T y\\ c \end{bmatrix}. \qquad (42)$$
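As an illustration of Equation 42, the augmented system can be solved directly; in this sketch K and c encode the triangle constraint mentioned above, while X and y are assumed to be a given design matrix and observation vector.

% constrained least squares via the augmented normal equations, Equation 42
K = [0 1 1 0 1]';                 % one constraint: theta2 + theta3 + theta5 = c
c = 200;                          % [gon]
M   = [X'*X K; K' 0];
rhs = [X'*y; c];
sol = M\rhs;
thetahat  = sol(1:end-1);         % constrained parameter estimates
lambdahat = sol(end);             % Lagrangian multiplier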

1.1.2 Parameter Estimates

If the symmetric matrix $X^T X$ is well behaved, i.e., it has full rank (equal to $p$) corresponding to linearly independent columns in $X$, a formal solution is
$$\hat\theta_{OLS} = (X^T X)^{-1}X^T y. \qquad (43)$$
For reasons of numerical stability, especially in situations with nearly linear dependencies between the columns of $X$ (causing slight alterations to the observed values in $X$ to lead to substantial changes in the estimated $\hat\theta$; this problem is known as multicollinearity), the system of normal equations should not be solved by inverting $X^T X$ but rather by means of SVD, QR or Cholesky decomposition, see Sections 1.1.6, 1.1.7 and 1.1.8.

If we apply Equation 43 to the simple regression problem in Equations 14-17 we of course get the same solution as in Equations 25 and 26 (as an exercise you may want to check this).

When we apply regression analysis in other application areas we are often interested in predicting the response variable based on new data not used in the estimation of the parameters or the regression coefficients $\theta$. In land surveying and GNSS applications we are typically interested in $\theta$ and not in this predictive modelling.

(In the linear case $\hat\theta_{OLS}$ can be found in one go because $e^T e$ is quadratic in $\theta$; unlike in the nonlinear case dealt with in Section 2 we don't need an initial value for $\theta$ and an iterative procedure.)
⁵ In Danish: normalligningerne.






Figure 2: $\hat y$ is the projection of $y$ onto the hyperplane spanned by the vectors $x_i$ in the columns of matrix $X$ (modified from Hastie, Tibshirani and Friedman (2009) by Jacob S. Vestergaard).

The estimate for $y$, termed $\hat y$ (pronounced y-hat), is
$$\hat y = X\hat\theta_{OLS} = X(X^T X)^{-1}X^T y = Hy \qquad (44)$$
where $H = X(X^T X)^{-1}X^T$ is the so-called hat matrix since it transforms or projects $y$ into $\hat y$ ($H$ puts the hat on $y$). In geodesy (and land surveying) these equations are termed the fundamental equations⁶. $H$ is a projection matrix: it is symmetric, $H = H^T$, and idempotent, $HH = H$. We also have $HX = X$ and that the trace of $H$, $\mathrm{tr}\,H = \mathrm{tr}(X(X^T X)^{-1}X^T) = \mathrm{tr}(X^T X(X^T X)^{-1}) = \mathrm{tr}\,I_p = p$.

The estimate of the error term $e$ (also known as the residual), termed $\hat e$ (pronounced e-hat), is
$$\hat e = y - \hat y = y - Hy = (I - H)y. \qquad (45)$$
Also $I - H$ is symmetric, $I - H = (I - H)^T$, and idempotent, $(I - H)(I - H) = I - H$. We also have $(I - H)X = \mathbf{0}$ and $\mathrm{tr}(I - H) = n - p$.

$X$ and $\hat e$, and $\hat y$ and $\hat e$ are orthogonal: $X^T\hat e = \mathbf{0}$ and $\hat y^T\hat e = 0$. Geometrically this means that our analysis finds the orthogonal projection $\hat y$ of $y$ onto the hyperplane spanned by the linearly independent columns of $X$. This gives the shortest distance between $y$ and $\hat y$, see Figure 2.

⁶ In Danish: fundamentalligningerne.

Since the expectation of $\hat\theta_{OLS}$ is
$$E\{\hat\theta_{OLS}\} = E\{(X^T X)^{-1}X^T y\} \qquad (46)$$
$$= (X^T X)^{-1}X^T E\{y\} \qquad (47)$$
$$= (X^T X)^{-1}X^T E\{X\theta + e\} \qquad (48)$$
$$= \theta, \qquad (49)$$
$\hat\theta_{OLS}$ is unbiased or a central estimator.

1.1.3 Regularization

If $X^T X$ is near singular (also known as ill-conditioned) we may use so-called regularization. In the regularized case we penalize some characteristic of $\theta$, for example size, by introducing an extra term into Equation 35 (typically with $X^T X$ normalized to correlation form), namely $\lambda\theta^T\Omega\theta$ where $\Omega$ describes some characteristic of $\theta$ and the small positive scalar $\lambda$ determines the amount of regularization. If we wish to penalize large $\theta_i$, i.e., we wish to penalize size, $\Omega$ is the unit matrix. In this case we use the term ridge regression. In the regularized case the normal equations become
$$(X^T X + \lambda\Omega)\,\hat\theta_{OLS} = X^T y, \qquad (50)$$
with formal solution
$$\hat\theta_{OLS} = (X^T X + \lambda\Omega)^{-1}X^T y. \qquad (51)$$
For ridge regression this becomes
$$\hat\theta_{OLS} = (X^T X + \lambda I)^{-1}X^T y = (I + \lambda(X^T X)^{-1})^{-1}\hat\theta_{OLS}. \qquad (52)$$
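A minimal sketch of ridge regression as in Equation 52; the value of the regularization parameter lambda is an arbitrary placeholder and X and y are assumed given.

% ridge regression, Equation 52
lambda = 0.01;                                         % regularization parameter
thetah_ridge = (X'*X + lambda*eye(size(X,2)))\(X'*y);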

Example 3 (from Strang and Borre, 1997, p. 306) Between four points A, B, C and D situated on a straight line we have measured all pairwise distances AB, BC, CD, AC, AD and BD. The six measurements are $y = [3.17\ 1.12\ 2.25\ 4.31\ 6.51\ 3.36]^T$ m. We wish to determine the distances $\theta_1 = AB$, $\theta_2 = BC$ and $\theta_3 = CD$ by means of linear least squares adjustment. We have $n = 6$, $p = 3$ and $f = 3$. The six observation equations are
$$y_1 = \theta_1 + e_1 \qquad (53)$$
$$y_2 = \theta_2 + e_2 \qquad (54)$$
$$y_3 = \theta_3 + e_3 \qquad (55)$$
$$y_4 = \theta_1 + \theta_2 + e_4 \qquad (56)$$
$$y_5 = \theta_1 + \theta_2 + \theta_3 + e_5 \qquad (57)$$
$$y_6 = \theta_2 + \theta_3 + e_6. \qquad (58)$$
In matrix form we get (this is $y = X\theta + e$; units are m)
$$\begin{bmatrix} 3.17\\ 1.12\\ 2.25\\ 4.31\\ 6.51\\ 3.36 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 0 & 1\\ 1 & 1 & 0\\ 1 & 1 & 1\\ 0 & 1 & 1 \end{bmatrix}\begin{bmatrix} \theta_1\\ \theta_2\\ \theta_3 \end{bmatrix} + \begin{bmatrix} e_1\\ e_2\\ e_3\\ e_4\\ e_5\\ e_6 \end{bmatrix}. \qquad (59)$$
The normal equations are (this is $X^T X\,\hat\theta = X^T y$; units are m)
$$\begin{bmatrix} 3 & 2 & 1\\ 2 & 4 & 2\\ 1 & 2 & 3 \end{bmatrix}\begin{bmatrix} \hat\theta_1\\ \hat\theta_2\\ \hat\theta_3 \end{bmatrix} = \begin{bmatrix} 13.99\\ 15.30\\ 12.12 \end{bmatrix}. \qquad (60)$$
The hat matrix is
$$H = \begin{bmatrix} 1/2 & -1/4 & 0 & 1/4 & 1/4 & -1/4\\ -1/4 & 1/2 & -1/4 & 1/4 & 0 & 1/4\\ 0 & -1/4 & 1/2 & -1/4 & 1/4 & 1/4\\ 1/4 & 1/4 & -1/4 & 1/2 & 1/4 & 0\\ 1/4 & 0 & 1/4 & 1/4 & 1/2 & 1/4\\ -1/4 & 1/4 & 1/4 & 0 & 1/4 & 1/2 \end{bmatrix}. \qquad (61)$$
The solution is $\hat\theta = [3.1700\ 1.1225\ 2.2350]^T$ m, see the Matlab code for Examples 3 to 5 below.

Now, let us also estimate an intercept $\theta_0$ corresponding to an imprecise zero mark of the distance measuring device used. In this case we have $n = 6$, $p = 4$ and $f = 2$ and we get (in m)
$$\begin{bmatrix} 3.17\\ 1.12\\ 2.25\\ 4.31\\ 6.51\\ 3.36 \end{bmatrix} = \begin{bmatrix} 1 & 1 & 0 & 0\\ 1 & 0 & 1 & 0\\ 1 & 0 & 0 & 1\\ 1 & 1 & 1 & 0\\ 1 & 1 & 1 & 1\\ 1 & 0 & 1 & 1 \end{bmatrix}\begin{bmatrix} \theta_0\\ \theta_1\\ \theta_2\\ \theta_3 \end{bmatrix} + \begin{bmatrix} e_1\\ e_2\\ e_3\\ e_4\\ e_5\\ e_6 \end{bmatrix}. \qquad (62)$$
The normal equations in this case are (in m)
$$\begin{bmatrix} 6 & 3 & 4 & 3\\ 3 & 3 & 2 & 1\\ 4 & 2 & 4 & 2\\ 3 & 1 & 2 & 3 \end{bmatrix}\begin{bmatrix} \hat\theta_0\\ \hat\theta_1\\ \hat\theta_2\\ \hat\theta_3 \end{bmatrix} = \begin{bmatrix} 20.72\\ 13.99\\ 15.30\\ 12.12 \end{bmatrix}. \qquad (63)$$
The hat matrix is
$$H = \begin{bmatrix} 3/4 & 0 & 1/4 & 1/4 & 0 & -1/4\\ 0 & 3/4 & 0 & 1/4 & -1/4 & 1/4\\ 1/4 & 0 & 3/4 & -1/4 & 0 & 1/4\\ 1/4 & 1/4 & -1/4 & 1/2 & 1/4 & 0\\ 0 & -1/4 & 0 & 1/4 & 3/4 & 1/4\\ -1/4 & 1/4 & 1/4 & 0 & 1/4 & 1/2 \end{bmatrix}. \qquad (64)$$
The solution is $\hat\theta = [0.0150\ 3.1625\ 1.1150\ 2.2275]^T$ m, see the Matlab code for Examples 3 to 5 below. [end of example]

1.1.4 Dispersion and Significance of Estimates

Dispersion or variance-covariance matrices for $y$, $\hat\theta_{OLS}$, $\hat y$ and $\hat e$ are
$$D\{y\} = \sigma^2 I \qquad (65)$$
$$D\{\hat\theta_{OLS}\} = D\{(X^T X)^{-1}X^T y\} \qquad (66)$$
$$= (X^T X)^{-1}X^T D\{y\}X(X^T X)^{-1} \qquad (67)$$
$$= \sigma^2(X^T X)^{-1} \qquad (68)$$
$$D\{\hat y\} = D\{X\hat\theta_{OLS}\} \qquad (69)$$
$$= XD\{\hat\theta_{OLS}\}X^T \qquad (70)$$
$$= \sigma^2 H, \quad V\{\hat y_i\} = \sigma^2 H_{ii} \qquad (71)$$
$$D\{\hat e\} = D\{(I - H)y\} \qquad (72)$$
$$= (I - H)D\{y\}(I - H)^T \qquad (73)$$
$$= \sigma^2(I - H) = D\{y\} - D\{\hat y\}, \quad V\{\hat e_i\} = \sigma^2(1 - H_{ii}). \qquad (74)$$

The $i$th diagonal element of $H$, $H_{ii}$, is called the leverage⁷ for observation $i$. We see that a high leverage gives a high variance for $\hat y_i$ indicating that observation $i$ is poorly predicted by the regression model. This again indicates that observation $i$ may be an outlier, see also Section 1.1.5 on residual and influence analysis.
⁷ In Danish: potentialet.

For the sum of squared errors (SSE, also called RSS for the residual sum of squares) we get
$$\hat e^T\hat e = y^T(I - H)y \qquad (75)$$
with expectation $E\{\hat e^T\hat e\} = \sigma^2(n - p)$. The mean squared error MSE is
$$\hat\sigma^2 = \hat e^T\hat e/(n - p) \qquad (76)$$
and the root mean squared error RMSE is $\hat\sigma$, also known as $s$. $\hat\sigma = s$ has the same unit as $e_i$ and $y_i$.

The square roots of the diagonal elements of the dispersion matrices in Equations 65, 68, 71 and 74 are the standard errors of the quantities in question. For example, the standard error of $\hat\theta_i$, denoted $\hat\sigma_{\hat\theta_i}$, is the square root of the $i$th diagonal element of $\hat\sigma^2(X^T X)^{-1}$.

Example 4 (continuing Example 3) The estimated residuals in the case with no intercept are $\hat e = [0.0000\ -0.0025\ 0.0150\ 0.0175\ -0.0175\ 0.0025]^T$ m. Therefore the RMSE or $\hat\sigma = s = \sqrt{\hat e^T\hat e/3}$ m $= 0.0168$ m. The inverse of $X^T X$ is
$$\begin{bmatrix} 3 & 2 & 1\\ 2 & 4 & 2\\ 1 & 2 & 3 \end{bmatrix}^{-1} = \begin{bmatrix} 1/2 & -1/4 & 0\\ -1/4 & 1/2 & -1/4\\ 0 & -1/4 & 1/2 \end{bmatrix}. \qquad (77)$$
This gives standard deviations for $\hat\theta$, $\hat\sigma_{\hat\theta} = [0.0119\ 0.0119\ 0.0119]^T$ m. The case with an intercept gives $\hat\sigma = s = 0.0177$ m and standard deviations for $\hat\theta$, $\hat\sigma_{\hat\theta} = [0.0177\ 0.0153\ 0.0153\ 0.0153]^T$ m. [end of example]
So far we have assumed only that $E\{e\} = \mathbf{0}$ and that $D\{e\} = \sigma^2 I$, i.e., we have made no assumptions about the distribution of $e$. Let us further assume that the $e_i$s are independent and identically distributed (written as iid) following a normal distribution. Then $\hat\theta_{OLS}$ (which in this case corresponds to a maximum likelihood estimate) follows a multivariate normal distribution with mean $\theta$ and dispersion $\sigma^2(X^T X)^{-1}$. Assuming that $\theta_i = c_i$ where $c_i$ is a constant it can be shown that the ratio
$$z_i = \frac{\hat\theta_i - c_i}{\hat\sigma_{\hat\theta_i}} \qquad (78)$$
follows a $t$ distribution with $n - p$ degrees of freedom. This can be used to test whether $\theta_i - c_i$ is significantly different from 0. If for example $z_i$ with $c_i = 0$ has a small absolute value then $\theta_i$ is not significantly different from 0 and $x_i$ should be removed from the model.

Example 5 (continuing Example 4) The $t$-test statistics $z_i$ with $c_i = 0$ in the case with no intercept are $[266.3\ 94.31\ 187.8]^T$ which are all very large compared to 95% or 99% percentiles in a two-sided $t$-test with three degrees of freedom, 3.182 and 5.841 respectively. The probabilities of finding larger values of $|z_i|$ are $[0.0000\ 0.0000\ 0.0000]^T$. Hence all parameter estimates are significantly different from zero. The $t$-test statistics $z_i$ with $c_i = 0$ in the case with an intercept are $[0.8485\ 206.6\ 72.83\ 145.5]^T$; all but the first value are very large compared to 95% and 99% percentiles in a two-sided $t$-test with two degrees of freedom, 4.303 and 9.925 respectively. The probabilities of finding larger values of $|z_i|$ are $[0.4855\ 0.0000\ 0.0002\ 0.0000]^T$. Therefore the estimate of $\theta_0$ is insignificant (i.e., it is not significantly different from zero) and the intercept corresponding to an imprecise zero mark of the distance measuring device used should not be included in the model. [end of example]

Often a measure of variance reduction termed the coefficient of determination, denoted $R^2$, and a version that adjusts for the number of parameters, denoted $R^2_{adj}$, are defined in the statistical literature:

$SST_0 = y^T y$ (if no intercept $\theta_0$ is estimated)
$SST_1 = (y - \bar y)^T(y - \bar y)$ (if an intercept $\theta_0$ is estimated)
$SSE = \hat e^T\hat e$
$R^2 = 1 - SSE/SST_i$
$R^2_{adj} = 1 - (1 - R^2)(n - i)/(n - p)$ where $i$ is 0 or 1 as indicated by $SST_i$.

Both $R^2$ and $R^2_{adj}$ lie in the interval [0,1]. For a good model with a good fit to the data both $R^2$ and $R^2_{adj}$ should be close to 1.

Matlab code for Examples 3 to 5

% (C) Copyright 2003


% Allan Aasbjerg Nielsen
% aa@imm.dtu.dk, www.imm.dtu.dk/aa

% model without intercept

y = [3.17 1.12 2.25 4.31 6.51 3.36]';

X = [1 0 0; 0 1 0; 0 0 1; 1 1 0; 1 1 1; 0 1 1];
[n,p] = size(X);
f = n-p;

thetah = X\y;
yh = X*thetah;
eh = y-yh;
s2 = eh'*eh/f;
s = sqrt(s2);
iXX = inv(X'*X);
Dthetah = s2.*iXX;
stdthetah = sqrt(diag(Dthetah));
t = thetah./stdthetah;
pt = betainc(f./(f+t.^2),0.5*f,0.5);

H = X*iXX*X';
Hii = diag(H);

% model with intercept

X = [ones(n,1) X];
[n,p] = size(X);
f = n-p;

thetah = X\y;
yh = X*thetah;
eh = y-yh;
s2 = eh'*eh/f;
s = sqrt(s2);
iXX = inv(X'*X);
Dthetah = s2.*iXX;
stdthetah = sqrt(diag(Dthetah));
t = thetah./stdthetah;
pt = betainc(f./(f+t.^2),0.5*f,0.5);

H = X*iXX*X';
Hii = diag(H);
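For illustration, the coefficients of determination defined above can be appended to the listing (they are not computed in the original code); eh, y, n and p are the quantities already in the workspace.

SST1 = (y-mean(y))'*(y-mean(y));   % total sum of squares around the mean
R2 = 1 - (eh'*eh)/SST1;            % coefficient of determination
R2adj = 1 - (1-R2)*(n-1)/(n-p);    % adjusted for the number of parameters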

The Matlab backslash operator \ or mldivide, left matrix divide, in this case with X non-square computes the QR factor-
ization (see Section 1.1.7) of X and finds the least squares solution by back-substitution.

Probabilities in the t distribution are calculated by means of the incomplete beta function evaluated in Matlab by the betainc
function.

1.1.5 Residual and Influence Analysis

Residual analysis is performed to check the model and to find possible outliers or gross errors in the data.
Often inspection of listings or plots of $\hat e$ against $\hat y$ and of $\hat e$ against the columns in $X$ (the explanatory variables or the regressors) is useful. No systematic tendencies should be observable in these listings or plots.

Standardized residuals
$$r_i = \frac{\hat e_i}{\hat\sigma\sqrt{1 - H_{ii}}} \qquad (79)$$
which have unit variance (see Equation 74) are often used.

Studentized or jackknifed residuals (from a regression omitting observation $i$ to obtain a prediction $\hat y_{(i)}$ for the omitted observation and an estimate $\hat\sigma_{(i)}^2$ of the corresponding error variance)
$$t_{(i)} = \frac{y_i - \hat y_{(i)}}{\sqrt{V\{y_i - \hat y_{(i)}\}}} \qquad (80)$$
are also often used. We don't have to redo the adjustment each time an observation is left out since it can be shown that
$$t_{(i)} = r_i\sqrt{\frac{n - p - 1}{n - p - r_i^2}}. \qquad (81)$$


For the sum of the diagonal elements $H_{ii}$ of the hat matrix we have $\mathrm{tr}\,H = \sum_{i=1}^n H_{ii} = p$ which means that the average value is $\bar H_{ii} = p/n$. Therefore an alarm for very influential observations which may be outliers could be set if $H_{ii} > 2p/n$ (or maybe if $H_{ii} > 3p/n$). As mentioned above $H_{ii}$ is termed the leverage for observation $i$. None of the observations in Example 3 have high leverages.

Another often used measure of influence of the individual observations is called Cook's distance, also known as Cook's D. Cook's D for observation $i$ measures the distance between the vectors of estimated parameters with and without observation $i$ (often skipping the intercept $\theta_0$ if estimated). Other influence statistics exist.
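As a sketch of these diagnostics, the standardized and studentized residuals and a leverage alarm can be computed from the quantities in the Matlab listing for Examples 3 to 5; eh, s, H, n and p are assumed to be in the workspace.

% leverages with a 2p/n alarm, standardized and studentized residuals
Hii = diag(H);
alarm = Hii > 2*p/n;                   % possible high-leverage observations
r = eh./(s*sqrt(1-Hii));               % standardized residuals, Equation 79
rjack = r.*sqrt((n-p-1)./(n-p-r.^2));  % studentized residuals, Equation 81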

Example 6 In this example two data sets are simulated. The first data set contains 100 observations with one outlier. This outlier is detected by means of its residual; the leverage of the outlier is low since the observation does not influence the regression line, see Figure 3. In the top-left panel the dashed line is from a regression with an insignificant intercept and the solid line is from a regression without the intercept. The outlier has a huge residual, see the bottom-left panel. The mean leverage is $p/n = 0.01$. Only a few leverages are greater than 0.02, see the top-right panel. No leverages are greater than 0.03.

The second data set contains four observations with one outlier, see Figure 3 bottom-right panel. This outlier (observation 4 with coordinates (100,10)) is detected by means of its leverage; the residual of the outlier is low, see Table 1. The mean leverage is $p/n = 0.5$. The leverage of the outlier is by far the greatest, $H_{44}\approx 2p/n$. [end of example]

Table 1: Residuals and leverages for simulated example with one outlier (observation 4) detected by the leverage.

Obs    x    y   Residual   Leverage
  1    1    1   0.9119     0.3402
  2    2    2   0.0062     0.3333
  3    3    3   0.9244     0.3266
  4  100   10   0.0187     0.9998

[Figure 3 about here.]

Figure 3: Simulated examples with 1) one outlier detected by the residual (top-left and bottom-left) and 2) one outlier (observation 4) detected by the leverage (bottom-right).

1.1.6 Singular Value Decomposition, SVD

In general the data matrix $X$ can be factorized as
$$X = V\Sigma U^T, \qquad (82)$$

where $V$ is $n\times p$, $\Sigma$ is $p\times p$ diagonal with the singular values of $X$ on the diagonal, and $U$ is $p\times p$ with $U^T U = UU^T = V^T V = I_p$. This leads to the following solution to the normal equations
$$X^T X\,\hat\theta_{OLS} = X^T y \qquad (83)$$
$$(V\Sigma U^T)^T(V\Sigma U^T)\,\hat\theta_{OLS} = (V\Sigma U^T)^T y \qquad (84)$$
$$U\Sigma V^T V\Sigma U^T\,\hat\theta_{OLS} = U\Sigma V^T y \qquad (85)$$
$$U\Sigma^2 U^T\,\hat\theta_{OLS} = U\Sigma V^T y \qquad (86)$$
$$\Sigma U^T\,\hat\theta_{OLS} = V^T y \qquad (87)$$
and therefore
$$\hat\theta_{OLS} = U\Sigma^{-1}V^T y. \qquad (88)$$
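A minimal sketch of Equation 88 for the no-intercept problem of Example 3; note that Matlab's svd returns X = U*S*V', i.e., the roles of U and V are swapped relative to Equation 82.

% OLS via the singular value decomposition, Equation 88
X = [1 0 0; 0 1 0; 0 0 1; 1 1 0; 1 1 1; 0 1 1];
y = [3.17 1.12 2.25 4.31 6.51 3.36]';
[U,S,V] = svd(X,0);        % economy-size SVD
thetah = V*(S\(U'*y));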

1.1.7 QR Decomposition

An alternative factorization of $X$ is
$$X = QR, \qquad (89)$$
where $Q$ is $n\times p$ with $Q^T Q = I_p$ and $R$ is $p\times p$ upper triangular. This leads to
$$X^T X\,\hat\theta_{OLS} = X^T y \qquad (90)$$

$$(QR)^T QR\,\hat\theta_{OLS} = (QR)^T y \qquad (91)$$
$$R^T Q^T QR\,\hat\theta_{OLS} = R^T Q^T y \qquad (92)$$
$$R\,\hat\theta_{OLS} = Q^T y. \qquad (93)$$
This system of equations can be solved by back-substitution.
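The corresponding sketch for Equation 93, with X and y as in the SVD example above.

% OLS via the QR decomposition and back-substitution, Equation 93
[Q,R] = qr(X,0);           % economy-size QR, X = Q*R
thetah = R\(Q'*y);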

1.1.8 Cholesky Decomposition

Both the SVD and the QR factorizations work on $X$. Here we factorize $X^T X$
$$X^T X = CC^T, \qquad (94)$$
where $C$ is $p\times p$ lower triangular. This leads to
$$X^T X\,\hat\theta_{OLS} = X^T y \qquad (95)$$
$$CC^T\hat\theta_{OLS} = X^T y. \qquad (96)$$
This system of equations can be solved by two times back-substitution.


A Trick to Obtain $\hat e^T\hat e$ with the Cholesky Decomposition. With $X^T X = CC^T$, $C$ being $p\times p$ lower triangular,
$$CC^T\hat\theta_{OLS} = X^T y \qquad (97)$$
$$C(C^T\hat\theta_{OLS}) = X^T y \qquad (98)$$
so $Cz = X^T y$ with $C^T\hat\theta_{OLS} = z$. Expand the $p\times p$ matrix $X^T X$ with one more row and column to the $(p+1)\times(p+1)$ matrix
$$\bar C\bar C^T = \begin{bmatrix} X^T X & X^T y\\ (X^T y)^T & y^T y \end{bmatrix}. \qquad (99)$$
With
$$\bar C = \begin{bmatrix} C & \mathbf{0}\\ z^T & s \end{bmatrix} \quad\text{and}\quad \bar C^T = \begin{bmatrix} C^T & z\\ \mathbf{0}^T & s \end{bmatrix} \qquad (100)$$
we get
$$\bar C\bar C^T = \begin{bmatrix} CC^T & Cz\\ z^T C^T & z^T z + s^2 \end{bmatrix}. \qquad (101)$$
We see that
$$s^2 = y^T y - z^T z \qquad (102)$$
$$= y^T y - \hat\theta_{OLS}^T CC^T\hat\theta_{OLS} \qquad (103)$$
$$= y^T y - \hat\theta_{OLS}^T X^T y \qquad (104)$$
$$= y^T y - y^T X\hat\theta_{OLS} \qquad (105)$$
$$= y^T y - y^T X(X^T X)^{-1}X^T y \qquad (106)$$
$$= y^T y - y^T Hy \qquad (107)$$
$$= y^T(I - H)y \qquad (108)$$
$$= \hat e^T\hat e. \qquad (109)$$
Hence, after Cholesky decomposition of the expanded matrix, the square of the lower right element of $\bar C$ is $\hat e^T\hat e$. The last column in $\bar C^T$ (skipping $s$ in the last row) is $C^T\hat\theta_{OLS} = z$, hence $\hat\theta_{OLS}$ can be found by back-substitution.
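A sketch of the expanded-matrix trick, again with X and y from Example 3; Matlab's chol returns an upper triangular factor, so it is transposed to obtain the lower triangular factor used here.

% thetahat and e'e in one go from the expanded Cholesky factor, Equations 97-109
M = [X'*X X'*y; y'*X y'*y];     % expanded (p+1) x (p+1) matrix, Equation 99
Cbar = chol(M)';                % lower triangular factor
s = Cbar(end,end);              % s^2 equals e'e
z = Cbar(end,1:end-1)';         % z = C'*thetahat
C = Cbar(1:end-1,1:end-1);
thetah = C'\z;                  % back-substitution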

1.2 Weighted Least Squares, WLS

In WLS we allow the uncorrelated residuals to have different variances and assume that $D\{y\} = D\{e\} = \mathrm{diag}[\sigma_1^2,\dots,\sigma_n^2]$. We assign a weight $p_i$ ($p$ for pondus which is Latin for weight) to each observation so that $p_1\sigma_1^2 = \dots = p_i\sigma_i^2 = \dots = p_n\sigma_n^2 = 1\cdot\sigma_0^2$, or $\sigma_i^2 = \sigma_0^2/p_i$ with $p_i > 0$. $\sigma_0$ is termed the standard deviation of unit weight⁸. Therefore $D\{y\} = D\{e\} = \sigma_0^2\,\mathrm{diag}[1/p_1,\dots,1/p_n] = \sigma_0^2 P^{-1}$ and we minimize the objective function $\phi = \frac{1}{2}\sum_{i=1}^n p_i e_i^2 = e^T Pe/2$ where
$$P = \begin{bmatrix} p_1 & 0 & \cdots & 0\\ 0 & p_2 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & p_n \end{bmatrix}. \qquad (110)$$
We get
$$\phi = 1/2\,(y - X\theta)^T P(y - X\theta) \qquad (111)$$
$$= 1/2\,(y^T Py - y^T PX\theta - \theta^T X^T Py + \theta^T X^T PX\theta) \qquad (112)$$
$$= 1/2\,(y^T Py - 2\theta^T X^T Py + \theta^T X^T PX\theta). \qquad (113)$$
The derivative with respect to $\theta$ is
$$\frac{\partial\phi}{\partial\theta} = -X^T Py + X^T PX\theta. \qquad (114)$$
When the columns of $X$ are linearly independent the second order derivative $\partial^2\phi/\partial\theta\partial\theta^T = X^T PX$ is positive definite. Therefore we have a minimum for $\phi$. Note that $X^T PX$ is symmetric, $(X^T PX)^T = X^T PX$.

We find the WLS estimate for $\theta$, termed $\hat\theta_{WLS}$ (pronounced theta-hat), by setting $\partial\phi/\partial\theta = \mathbf{0}$ to obtain the normal equations
$$X^T PX\,\hat\theta_{WLS} = X^T Py \qquad (115)$$
or $N\hat\theta_{WLS} = c$ with $N = X^T PX$ and $c = X^T Py$.

1.2.1 Parameter Estimates

If the symmetric matrix $N = X^T PX$ is well behaved, i.e., it has full rank (equal to $p$) corresponding to linearly independent columns in $X$, a formal solution is
$$\hat\theta_{WLS} = (X^T PX)^{-1}X^T Py = N^{-1}c. \qquad (116)$$
For reasons of numerical stability, especially in situations with nearly linear dependencies between the columns of $X$ (causing slight alterations to the observed values in $X$ to lead to substantial changes in the estimated $\hat\theta$; this problem is known as multicollinearity), the system of normal equations should not be solved by inverting $X^T PX$ but rather by means of SVD, QR or Cholesky decomposition, see Sections 1.1.6, 1.1.7 and 1.1.8.

When we apply regression analysis in other application areas we are often interested in predicting the response variable based on new data not used in the estimation of the parameters or the regression coefficients $\theta$. In land surveying and GNSS applications we are typically interested in $\theta$ and not in this predictive modelling.
⁸ In Danish: spredningen på vægtenheden.

(In the linear case $\hat\theta_{WLS}$ can be found in one go because $e^T Pe$ is quadratic in $\theta$; unlike in the nonlinear case dealt with in Section 2 we don't need an initial value for $\theta$ and an iterative procedure.)

The estimate for $y$, termed $\hat y$ (pronounced y-hat), is
$$\hat y = X\hat\theta_{WLS} = X(X^T PX)^{-1}X^T Py = Hy = XN^{-1}c \qquad (117)$$
where $H = X(X^T PX)^{-1}X^T P$ is the so-called hat matrix since it transforms $y$ into $\hat y$. In geodesy (and land surveying) these equations are termed the fundamental equations. In WLS regression $H$ is not symmetric, $H \neq H^T$. $H$ is idempotent, $HH = H$. We also have $HX = X$ and that the trace of $H$, $\mathrm{tr}\,H = \mathrm{tr}(X(X^T PX)^{-1}X^T P) = \mathrm{tr}(X^T PX(X^T PX)^{-1}) = \mathrm{tr}\,I_p = p$. Also $PH = H^T P = H^T PH$ which is symmetric.

The estimate of the error term $e$ (also known as the residual), termed $\hat e$ (pronounced e-hat), is
$$\hat e = y - \hat y = y - Hy = (I - H)y. \qquad (118)$$
In WLS regression $I - H$ is not symmetric, $I - H \neq (I - H)^T$. $I - H$ is idempotent, $(I - H)(I - H) = I - H$. We also have $(I - H)X = \mathbf{0}$ and $\mathrm{tr}(I - H) = n - p$. Also $P(I - H) = (I - H)^T P = (I - H)^T P(I - H)$ which is symmetric.

$X$ and $\hat e$, and $\hat y$ and $\hat e$ are orthogonal (with respect to $P$): $X^T P\hat e = \mathbf{0}$ and $\hat y^T P\hat e = 0$. Geometrically this means that our analysis finds the orthogonal projection (with respect to $P$) $\hat y$ of $y$ onto the hyperplane spanned by the linearly independent columns of $X$. This gives the shortest distance between $y$ and $\hat y$ in the norm defined by $P$.

Since the expectation of $\hat\theta_{WLS}$ is
$$E\{\hat\theta_{WLS}\} = E\{(X^T PX)^{-1}X^T Py\} \qquad (119)$$
$$= (X^T PX)^{-1}X^T P\,E\{y\} \qquad (120)$$
$$= (X^T PX)^{-1}X^T P\,E\{X\theta + e\} \qquad (121)$$
$$= \theta, \qquad (122)$$
$\hat\theta_{WLS}$ is unbiased or a central estimator.

1.2.2 Weight Assignment

In general we assign weights to observations so that the weight of an observation is proportional to the inverse expected (prior) variance of that observation, $p_i \propto 1/\sigma_{i,prior}^2$.

In traditional land surveying and GNSS we deal with observations of distances, directions and heights. In WLS we minimize half the weighted sum of squared residuals $\phi = \frac{1}{2}\sum_{i=1}^n p_i e_i^2$. For this sum to make sense all terms must have the same unit. This can be obtained by demanding that $p_i e_i^2$ has no unit. This means that $p_i$ has units of $1/e_i^2$ or $1/y_i^2$. If we consider the weight definition $\sigma_0^2 = p_1\sigma_1^2 = \dots = p_i\sigma_i^2 = \dots = p_n\sigma_n^2$ we see that $\sigma_0^2$ has no unit. Choosing $p_i = 1/\sigma_{i,prior}^2$ we obtain that $\sigma_0 = 1$ if measurements are carried out with the expected (prior) variances (and the regression model is correct). $\sigma_{i,prior}$ depends on the quality of the instruments applied and on how measurements are performed. Below, formulas for weights are given, see Jacobi (1977).

Distance Measurements Here we use
$$p_i = \frac{n}{s_G^2 + a^2 s_a^2} \qquad (123)$$
where

$n$ is the number of observations,

$s_G$ is the combined expected standard deviation of the distance measuring instrument itself and of centering of the device,

$s_a$ is the expected distance dependent standard deviation of the distance measuring instrument, and

$a$ is the distance between the two points in question.

Directional Measurements Here we use
$$p_i = \frac{n}{\dfrac{n\,s_c^2}{a^2} + s_t^2} \qquad (124)$$
where

$n$ is the number of observations,

$s_c$ is the expected standard deviation on centering of the device, and

$s_t$ is the expected standard deviation of one observed direction.

Levelling or Height Measurements Here we traditionally choose weights $p_i$ equal to the number of measurements divided by the distance between the points in question measured in units of km, i.e., a weight of 1 is assigned to one measured height difference if that height difference is measured over a distance of 1 km. Since here in general $p_i \neq 1/\sigma_{i,prior}^2$ this choice of weights does not ensure $\sigma_0 = 1$. In this case the units for the weights are not those of the inverse prior variances so $\sigma_0$ is not unit-free, and also this tradition makes it impossible to carry out adjustment of combined height, direction and distance observations.

In conclusion we see that the weights for distances and directions change if the distance $a$ between points changes. The weights chosen for height measurements are generally not equal to the inverse of the expected (prior) variance of the observations. Therefore they do not lead to $\sigma_0 = 1$. Both distance and directional measurements lead to nonlinear least squares problems, see Section 2.
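A small sketch of Equations 123 and 124; all numerical values below are arbitrary placeholders for the instrument and centering standard deviations, not values from the note.

% weights for the mean of nobs repeated distance and direction measurements
nobs = 2;                 % number of repeated measurements
a  = 150;                 % distance between the points
sG = 0.003; sa = 5e-6;    % distance instrument: constant and distance dependent parts
sc = 0.001; st = 0.0005;  % centering and single-direction standard deviations
p_dist = nobs/(sG^2 + a^2*sa^2);        % Equation 123
p_dir  = nobs/(nobs*sc^2/a^2 + st^2);   % Equation 124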

Example 7 (from Mærsk-Møller and Frederiksen, 1984, p. 74) From four points Q, A, B and C we have measured all possible pairwise height differences, see Figure 4. All measurements are carried out twice. Q has a known height $K_Q = 34.294$ m which is considered as fixed. We wish to determine the heights in points A, B and C by means of weighted least squares adjustment. These heights are called $\theta_1$, $\theta_2$ and $\theta_3$ respectively. The means of the two height measurements are (with the distance $d_i$ between points in parentheses)

from Q to A 0.905 m (0.300 km),
from A to B 1.675 m (0.450 km),
from C to B 8.445 m (0.350 km),
from C to Q 5.864 m (0.300 km),
from Q to B 2.578 m (0.500 km), and
from C to A 6.765 m (0.450 km).

Figure 4: From four points Q, A, B and C we measure all possible pairwise height differences (from Mærsk-Møller and Frederiksen, 1984).

The weight for each observation is $p_i = 2/d_i$, see immediately above, resulting in (units are km$^{-1}$)
$$P = \begin{bmatrix} 6.6667 & 0 & 0 & 0 & 0 & 0\\ 0 & 4.4444 & 0 & 0 & 0 & 0\\ 0 & 0 & 5.7143 & 0 & 0 & 0\\ 0 & 0 & 0 & 6.6667 & 0 & 0\\ 0 & 0 & 0 & 0 & 4.0000 & 0\\ 0 & 0 & 0 & 0 & 0 & 4.4444 \end{bmatrix}. \qquad (125)$$
The six observation equations are
$$y_1 = \theta_1 - K_Q + e_1 \qquad (126)$$
$$y_2 = \theta_2 - \theta_1 + e_2 \qquad (127)$$
$$y_3 = \theta_2 - \theta_3 + e_3 \qquad (128)$$
$$y_4 = K_Q - \theta_3 + e_4 \qquad (129)$$
$$y_5 = \theta_2 - K_Q + e_5 \qquad (130)$$
$$y_6 = \theta_1 - \theta_3 + e_6. \qquad (131)$$
In matrix form we get (units are m)
$$\begin{bmatrix} 0.905\\ 1.675\\ 8.445\\ 5.864\\ 2.578\\ 6.765 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0\\ -1 & 1 & 0\\ 0 & 1 & -1\\ 0 & 0 & -1\\ 0 & 1 & 0\\ 1 & 0 & -1 \end{bmatrix}\begin{bmatrix} \theta_1\\ \theta_2\\ \theta_3 \end{bmatrix} + \begin{bmatrix} -34.294\\ 0.000\\ 0.000\\ 34.294\\ -34.294\\ 0.000 \end{bmatrix} + \begin{bmatrix} e_1\\ e_2\\ e_3\\ e_4\\ e_5\\ e_6 \end{bmatrix} \qquad (132)$$
or (with a slight misuse of notation since we reuse the $\theta_i$s and the $e_i$s; this is $y = X\theta + e$; units are mm)
$$\begin{bmatrix} 35{,}199\\ 1{,}675\\ 8{,}445\\ -28{,}430\\ 36{,}872\\ 6{,}765 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0\\ -1 & 1 & 0\\ 0 & 1 & -1\\ 0 & 0 & -1\\ 0 & 1 & 0\\ 1 & 0 & -1 \end{bmatrix}\begin{bmatrix} \theta_1\\ \theta_2\\ \theta_3 \end{bmatrix} + \begin{bmatrix} e_1\\ e_2\\ e_3\\ e_4\\ e_5\\ e_6 \end{bmatrix}. \qquad (133)$$
The normal equations are (this is $X^T PX\,\hat\theta = X^T Py$; units are mm)
$$\begin{bmatrix} 15.5556 & -4.4444 & -4.4444\\ -4.4444 & 14.1587 & -5.7143\\ -4.4444 & -5.7143 & 16.8254 \end{bmatrix}\begin{bmatrix} \hat\theta_1\\ \hat\theta_2\\ \hat\theta_3 \end{bmatrix} = \begin{bmatrix} 257{,}282.22\\ 203{,}189.59\\ 111{,}209.52 \end{bmatrix}. \qquad (134)$$
The solution is $\hat\theta = [35{,}197.8\ \ 36{,}873.6\ \ 28{,}430.3]^T$ mm, see the Matlab code for Examples 7 to 9 below. [end of example]

1.2.3 Dispersion and Significance of Estimates

Dispersion or variance-covariance matrices for $y$, $\hat\theta_{WLS}$, $\hat y$ and $\hat e$ are
$$D\{y\} = \sigma_0^2 P^{-1} \qquad (135)$$
$$D\{\hat\theta_{WLS}\} = \sigma_0^2(X^T PX)^{-1} = \sigma_0^2 N^{-1} \qquad (136)$$
$$D\{\hat y\} = \sigma_0^2 XN^{-1}X^T = \sigma_0^2 HP^{-1} \qquad (137)$$
$$D\{\hat e\} = \sigma_0^2(P^{-1} - XN^{-1}X^T) \qquad (138)$$
$$= \sigma_0^2(I - H)P^{-1} = D\{y\} - D\{\hat y\}. \qquad (139)$$
We see that since the dispersion of $\hat y$ is not proportional to the hat matrix, an alternative measure of leverage in this case is $XN^{-1}X^T$ (although with units, as opposed to the elements of the hat matrix which are unit free). This alternative measure may be useful only if measurements have the same units and are on the same scale.

For the weighted sum of squared errors (SSE, also called RSS for the residual sum of squares) we get
$$\hat e^T P\hat e = y^T(I - H)^T P(I - H)y \qquad (140)$$
$$= y^T(I - H)^T(P - PH)y \qquad (141)$$
$$= y^T(P - PH - H^T P + H^T PH)y \qquad (142)$$
$$= y^T P(I - H)y \qquad (143)$$
with expectation $E\{\hat e^T P\hat e\} = \sigma_0^2(n - p)$. The mean squared error MSE is
$$\hat\sigma_0^2 = \hat e^T P\hat e/(n - p) \qquad (144)$$
and the root mean squared error RMSE is $\hat\sigma_0$, also known as $s_0$. $\hat\sigma_0 = s_0$ has no unit. For well performed measurements (with no outliers or gross errors), a good model, and properly chosen weights (see Section 1.2.2), $s_0 \simeq 1$. This is due to the fact that, assuming that the $e_i$s with variance $\sigma_0^2/p_i$ are independent and normally distributed, $\hat e^T P\hat e$ (with well chosen $p_i$, see Section 1.2.2) follows a $\chi^2$ distribution with $n - p$ degrees of freedom which has expectation $n - p$. Therefore $\hat e^T P\hat e/(n - p)$ has expectation 1 and its square root is approximately 1.

What if $s_0$ is larger than 1? How much larger than 1 is too large? If we assume that the $e_i$s are independent and follow a normal distribution, $\hat e^T P\hat e = (n - p)s_0^2$ follows a $\chi^2$ distribution with $n - p$ degrees of freedom. If the probability of finding $(n - p)s_0^2$ larger than the observed value is much smaller than the traditionally used 0.05 (5%) or 0.01 (1%), then $s_0$ is too large.

The square roots of the diagonal elements of the dispersion matrices in Equations 135, 136, 137 and 138 are the standard errors of the quantities in question. For example, the standard error of $\hat\theta_i$, denoted $\hat\sigma_{\hat\theta_i}$, is the square root of the $i$th diagonal element of $\hat\sigma_0^2(X^T PX)^{-1}$.

As in the OLS case we can define standardized residuals
$$r_i = \frac{\hat e_i\sqrt{p_i}}{\hat\sigma_0\sqrt{1 - H_{ii}}} \qquad (145)$$
with unit variance.

Example 8 (continuing Example 7) The hat matrix is
$$H = \begin{bmatrix} 0.5807 & -0.1985 & 0.0287 & -0.2495 & 0.1698 & 0.2208\\ -0.2977 & 0.4655 & 0.2941 & -0.0574 & 0.2403 & -0.2367\\ 0.0335 & 0.2288 & 0.5452 & 0.2595 & 0.2260 & 0.1953\\ -0.2495 & -0.0383 & 0.2224 & 0.5664 & -0.1841 & 0.2112\\ 0.2830 & 0.2670 & 0.3228 & -0.3069 & 0.4101 & -0.0159\\ 0.3312 & -0.2367 & 0.2511 & 0.3169 & -0.0143 & 0.4320 \end{bmatrix} \qquad (146)$$
and $p/n = 3/6 = 0.5$; no diagonal element is higher than two (or three) times 0.5. The diagonal of
$$XN^{-1}X^T = \begin{bmatrix} 0.0871 & -0.0447 & 0.0050 & -0.0374 & 0.0424 & 0.0497\\ -0.0447 & 0.1047 & 0.0515 & -0.0086 & 0.0601 & -0.0533\\ 0.0050 & 0.0515 & 0.0954 & 0.0389 & 0.0565 & 0.0439\\ -0.0374 & -0.0086 & 0.0389 & 0.0850 & -0.0460 & 0.0475\\ 0.0424 & 0.0601 & 0.0565 & -0.0460 & 0.1025 & -0.0036\\ 0.0497 & -0.0533 & 0.0439 & 0.0475 & -0.0036 & 0.0972 \end{bmatrix} \text{mm}^2 \qquad (147)$$
is an alternative measure of leverage and no (diagonal) element is larger than two or three times the average, i.e., no observations have high leverages. In this case where all measurements have the same units and are on the same scale both measures of leverage appear sensible. (See also Example 10 where this is not the case.) Checking the diagonal of $XN^{-1}X^T$ of course corresponds to checking the variances (or standard deviations) of the predicted observations $\hat y$.

The estimated residuals are $\hat e = [1.1941\ -0.7605\ 1.6879\ 0.2543\ -1.5664\ -2.5516]^T$ mm. Therefore the RMSE or $\hat\sigma_0 = s_0 = \sqrt{\hat e^T P\hat e/3}$ mm/km$^{1/2}$ = 4.7448 mm/km$^{1/2}$. The inverse of $X^T PX$ is
$$\begin{bmatrix} 15.556 & -4.4444 & -4.4444\\ -4.4444 & 14.159 & -5.7143\\ -4.4444 & -5.7143 & 16.825 \end{bmatrix}^{-1} = \begin{bmatrix} 0.087106 & 0.042447 & 0.037425\\ 0.042447 & 0.10253 & 0.046034\\ 0.037425 & 0.046034 & 0.084954 \end{bmatrix}. \qquad (148)$$
This gives standard deviations for $\hat\theta$, $\hat\sigma_{\hat\theta} = [1.40\ 1.52\ 1.38]^T$ mm.

Although the weighting scheme for levelling is not designed to give $s_0 = 1$ (with no unit) we look into the magnitude of $s_0$ for illustration. $s_0$ is larger than 1. Had the weighting scheme been designed to obtain $s_0 = 1$ (with no unit), would $s_0 = 4.7448$ be too large? If the $e_i$s are independent and follow a normal distribution, $\hat e^T P\hat e = (n - p)s_0^2$ follows a $\chi^2$ distribution with three degrees of freedom. The probability of finding $(n - p)s_0^2$ larger than the observed $3\cdot 4.7448^2 = 67.5382$ is smaller than $10^{-13}$ which is much smaller than the traditionally used 0.05 or 0.01. So $s_0$ is too large. Judged from the residuals, the standard deviations and the $t$-test statistics (see Example 9) the fit to the model is excellent. Again for illustration: had the weights been one tenth of the values used above, $s_0$ would be $4.7448/\sqrt{10} = 1.5004$, again larger than 1. The probability of finding $(n - p)s_0^2 > 3\cdot 1.5004^2 = 6.7538$ is 0.0802. Therefore this value of $s_0$ would be suitably small. [end of example]

If we assume that the $e_i$s are independent and follow a normal distribution, $\hat\theta_{WLS}$ follows a multivariate normal distribution with mean $\theta$ and dispersion $\sigma_0^2(X^T PX)^{-1}$. Assuming that $\theta_i = c_i$ where $c_i$ is a constant it can be shown that the ratio
$$z_i = \frac{\hat\theta_i - c_i}{\hat\sigma_{\hat\theta_i}} \qquad (149)$$
follows a $t$ distribution with $n - p$ degrees of freedom. This can be used to test whether $\theta_i - c_i$ is significantly different from 0. If for example $z_i$ with $c_i = 0$ has a small absolute value then $\theta_i$ is not significantly different from 0 and $x_i$ should be removed from the model.

Example 9 (continuing Example 8) The $t$-test statistics $z_i$ with $c_i = 0$ are $[25{,}135\ 24{,}270\ 20{,}558]^T$ which are all extremely large compared to 95% or 99% percentiles in a two-sided $t$-test with three degrees of freedom, 3.182 and 5.841 respectively. To double precision the probabilities of finding larger values of $|z_i|$ are $[0\ 0\ 0]^T$. All parameter estimates are significantly different from zero. [end of example]
Matlab code for Examples 7 to 9

% (C) Copyright 2003


% Allan Aasbjerg Nielsen
% aa@imm.dtu.dk, www.imm.dtu.dk/aa

Kq = 34.294;
X = [1 0 0;-1 1 0;0 1 -1;0 0 -1;0 1 0;1 0 -1];
[n p] = size(X);
%number of degrees of freedom
f = n-p;
dist = [0.30 0.45 0.35 0.30 0.50 0.45];
P = diag(2./dist); % units [km^(-1)]
%P = 0.1*P; % This gives a better s0
%OLS
%P = eye(size(X,1));
y = [.905 1.675 8.445 5.864 2.578 6.765]';

%units are mm
y = 1000*y;
Kq = 1000*Kq;

cst = Kq.*[1 0 0 -1 1 0]';

y = y+cst;
%OLS by "\" operator: mldivide
%thetahat = X\y
N = X'*P;
c = N*y;
N = N*X;
%WLS
thetahat = N\c;
yhat = X*thetahat;
ehat = y-yhat;
yhat = yhat-cst;
%MSE
SSE = ehat'*P*ehat;
s02 = SSE/f;
%RMSE
s0 = sqrt(s02);

%Variance/covariance matrix of the observations, y
Dy = s02.*inv(P);
%Standard deviations
stdy = sqrt(diag(Dy));

%Variance/covariance matrix of the adjusted elements, thetahat
Ninv = inv(N);
Dthetahat = s02.*Ninv;
%Standard deviations
stdthetahat = sqrt(diag(Dthetahat));

%Variance/covariance matrix of the adjusted observations, yhat
Dyhat = s02.*X*Ninv*X';
%Standard deviations
stdyhat = sqrt(diag(Dyhat));

%Variance/covariance matrix of the adjusted residuals, ehat
Dehat = Dy-Dyhat;
%Standard deviations
stdehat = sqrt(diag(Dehat));

%Correlations between adjusted elements, thetahat
aux = diag(1./stdthetahat);
corthetahat = aux*Dthetahat*aux;

% tests

% t-values and probabilities of finding larger |t|
% pt should be smaller than, say, (5% or) 1%
t = thetahat./stdthetahat;
pt = betainc(f./(f+t.^2),0.5*f,0.5);

% probability of finding larger s02
% should be greater than, say, 5% (or 1%)
pchi2 = 1-gammainc(0.5*SSE,0.5*f);

Probabilities in the $\chi^2$ distribution are calculated by means of the incomplete gamma function, evaluated in Matlab by the gammainc function.

A Trick to Obtain $\hat e^T P\hat e$ with the Cholesky Decomposition. With $X^T PX = CC^T$, $C$ being $p\times p$ lower triangular,
$$CC^T\hat\theta_{WLS} = X^T Py \qquad (150)$$
$$C(C^T\hat\theta_{WLS}) = X^T Py \qquad (151)$$
so $Cz = X^T Py$ with $C^T\hat\theta_{WLS} = z$. Expand the $p\times p$ matrix $X^T PX$ with one more row and column to the $(p+1)\times(p+1)$ matrix
$$\bar C\bar C^T = \begin{bmatrix} X^T PX & X^T Py\\ (X^T Py)^T & y^T Py \end{bmatrix}. \qquad (152)$$
With
$$\bar C = \begin{bmatrix} C & \mathbf{0}\\ z^T & s \end{bmatrix} \quad\text{and}\quad \bar C^T = \begin{bmatrix} C^T & z\\ \mathbf{0}^T & s \end{bmatrix} \qquad (153)$$
we get
$$\bar C\bar C^T = \begin{bmatrix} CC^T & Cz\\ z^T C^T & z^T z + s^2 \end{bmatrix}. \qquad (154)$$
We see that
$$s^2 = y^T Py - z^T z \qquad (155)$$
$$= y^T Py - \hat\theta_{WLS}^T CC^T\hat\theta_{WLS} \qquad (156)$$
$$= y^T Py - \hat\theta_{WLS}^T X^T Py \qquad (157)$$
$$= y^T Py - y^T PX\hat\theta_{WLS} \qquad (158)$$
$$= y^T Py - y^T PX(X^T PX)^{-1}X^T Py \qquad (159)$$
$$= y^T P(I - X(X^T PX)^{-1}X^T P)y \qquad (160)$$
$$= \hat e^T P\hat e. \qquad (161)$$
Hence, after Cholesky decomposition of the expanded matrix, the square of the lower right element of $\bar C$ is $\hat e^T P\hat e$. The last column in $\bar C^T$ (skipping $s$ in the last row) is $C^T\hat\theta_{WLS} = z$, hence $\hat\theta_{WLS}$ can be found by back-substitution.

1.2.4 WLS as OLS

The WLS problem can be turned into an OLS problem by replacing $X$ by $\tilde X = P^{1/2}X$ and $y$ by $\tilde y = P^{1/2}y$ with $P^{1/2} = \mathrm{diag}[\sqrt{p_1},\dots,\sqrt{p_n}]$ to get the OLS normal equations
$$X^T PX\,\hat\theta_{WLS} = X^T Py \qquad (162)$$
$$(P^{1/2}X)^T(P^{1/2}X)\,\hat\theta_{WLS} = (P^{1/2}X)^T(P^{1/2}y) \qquad (163)$$
$$\tilde X^T\tilde X\,\hat\theta_{WLS} = \tilde X^T\tilde y. \qquad (164)$$
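As an illustration of Equations 162-164, the WLS solution of Example 7 can be obtained with the backslash operator on row-scaled data; X, y and P are the quantities from the Matlab code for Examples 7 to 9.

% WLS as OLS: scale X and y by the square roots of the weights, Equation 163
Phalf = diag(sqrt(diag(P)));
thetahat = (Phalf*X)\(Phalf*y);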

1.3 General Least Squares, GLS

In GLS the residuals may be correlated and we assume that $D\{y\} = D\{e\} = \sigma_0^2\Sigma$. So $\sigma_0^2\Sigma$ is the dispersion or variance-covariance matrix of the residuals, possibly with off-diagonal elements. This may be the case for instance when we work on differenced data and not directly on observed data. We minimize the objective function $\phi = e^T\Sigma^{-1}e/2$
$$\phi = 1/2\,(y - X\theta)^T\Sigma^{-1}(y - X\theta) \qquad (165)$$
$$= 1/2\,(y^T\Sigma^{-1}y - y^T\Sigma^{-1}X\theta - \theta^T X^T\Sigma^{-1}y + \theta^T X^T\Sigma^{-1}X\theta) \qquad (166)$$
$$= 1/2\,(y^T\Sigma^{-1}y - 2\theta^T X^T\Sigma^{-1}y + \theta^T X^T\Sigma^{-1}X\theta). \qquad (167)$$
Just as in the WLS case we obtain the normal equations
$$X^T\Sigma^{-1}X\,\hat\theta_{GLS} = X^T\Sigma^{-1}y. \qquad (168)$$
If the symmetric matrix $X^T\Sigma^{-1}X$ is well behaved, i.e., it has full rank (equal to $p$) corresponding to linearly independent columns in $X$, a formal solution is
$$\hat\theta_{GLS} = (X^T\Sigma^{-1}X)^{-1}X^T\Sigma^{-1}y \qquad (169)$$
with dispersion
$$D\{\hat\theta_{GLS}\} = \sigma_0^2(X^T\Sigma^{-1}X)^{-1}. \qquad (170)$$
As with OLS and WLS
$$\hat e = y - \hat y = y - X\hat\theta_{GLS}. \qquad (171)$$
For the dispersion of $\hat y$ we get
$$D\{\hat y\} = \sigma_0^2 X(X^T\Sigma^{-1}X)^{-1}X^T. \qquad (172)$$
The mean squared error MSE is
$$\hat\sigma_0^2 = \hat e^T\Sigma^{-1}\hat e/(n - p) \qquad (173)$$
and the root mean squared error RMSE is $\hat\sigma_0$, also known as $s_0$.
The GLS problem can be turned into an OLS problem by means of the Cholesky decomposition of $\Sigma = CC^T$ (or of $\Sigma^{-1}$)
$$X^T\Sigma^{-1}X\,\hat\theta_{GLS} = X^T\Sigma^{-1}y \qquad (174)$$
$$X^T C^{-T}C^{-1}X\,\hat\theta_{GLS} = X^T C^{-T}C^{-1}y \qquad (175)$$
$$(C^{-1}X)^T(C^{-1}X)\,\hat\theta_{GLS} = (C^{-1}X)^T(C^{-1}y) \qquad (176)$$
$$\tilde X^T\tilde X\,\hat\theta_{GLS} = \tilde X^T\tilde y, \qquad (177)$$
i.e., replace $X$ by $\tilde X = C^{-1}X$ and $y$ by $\tilde y = C^{-1}y$.
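A minimal sketch of Equations 174-177; Sigma is assumed to be a given (positive definite) residual covariance matrix, and X and y a given design matrix and observation vector.

% GLS as OLS via the Cholesky factor of Sigma
C = chol(Sigma)';          % lower triangular, Sigma = C*C'
thetahat = (C\X)\(C\y);    % OLS on the transformed data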

1.3.1 Regularization

In the so-called regularized or penalized case we penalize some characteristic of $\theta$, for example size, by introducing an extra term into Equation 165, namely $\lambda\theta^T\Omega\theta$ where $\Omega$ describes some characteristic of $\theta$ and the small positive scalar $\lambda$ determines the amount of regularization. If we wish to penalize large $\theta_i$, i.e., we wish to penalize size, $\Omega$ is the unit matrix. In the regularized case the normal equations become
$$(X^T\Sigma^{-1}X + \lambda\Omega)\,\hat\theta_{GLS} = X^T\Sigma^{-1}y, \qquad (178)$$
the dispersion of $\hat\theta_{GLS}$ becomes
$$D\{\hat\theta_{GLS}\} = \sigma_0^2(X^T\Sigma^{-1}X + \lambda\Omega)^{-1}X^T\Sigma^{-1}X(X^T\Sigma^{-1}X + \lambda\Omega)^{-1} \qquad (179)$$
leading to this dispersion of $\hat y$
$$D\{\hat y\} = \sigma_0^2 X(X^T\Sigma^{-1}X + \lambda\Omega)^{-1}X^T\Sigma^{-1}X(X^T\Sigma^{-1}X + \lambda\Omega)^{-1}X^T. \qquad (180)$$

2 Nonlinear Least Squares

Consider $y$ as a general, nonlinear function of the $\theta_j$s where $f$ can subsume a constant term if present
$$y_i = f_i(\theta_1,\dots,\theta_p) + e_i, \quad i = 1,\dots,n. \qquad (181)$$
In the traditional land surveying notation of Mærsk-Møller and Frederiksen (1984) we have ($y_i \sim \ell_i$, $f_i \sim F_i$, $\theta_j \sim x_j$, and $e_i \sim v_i$)
$$\ell_i = F_i(x_1,\dots,x_p) + v_i, \quad i = 1,\dots,n. \qquad (182)$$
(Mærsk-Møller and Frederiksen (1984) use $-v_i$; whether we use $+v_i$ or $-v_i$ is irrelevant for LS methods.) Several methods are available to solve this problem, see Sections 2.2.1, 2.2.2, 2.2.3 and 2.2.4. Here we use a linearization method.

If we have one parameter $x$ only we get (we omit the observation index $i$)
$$\ell = F(x) + v. \qquad (183)$$
In geodesy and land surveying the parameters are often called elements. We perform a Taylor expansion of $F$ around a chosen initial value⁹ $x^*$
$$\ell = F(x^*) + F'(x^*)(x - x^*) + \frac{1}{2!}F''(x^*)(x - x^*)^2 + \frac{1}{3!}F'''(x^*)(x - x^*)^3 + \dots + v \qquad (184)$$
and retain up till the first order term only (i.e., we linearize $F$ near $x^*$ to approximate $\sum v^2$ by a quadratic near $x^*$; a single prime denotes the first order derivative, two primes denote the second order derivative etc.)
$$\ell \simeq F(x^*) + F'(x^*)(x - x^*) + v. \qquad (185)$$
Geometrically speaking we work on the tangent of $F(x)$ at $x^*$.

If we have $p$ parameters or elements $x = [x_1,\dots,x_p]^T$ we get
$$\ell = F(x_1,\dots,x_p) + v = F(x) + v \qquad (186)$$

⁹ In Danish: foreløbig værdi or foreløbigt element.

and from a Taylor expansion we retain the first order terms only
$$\ell \simeq F(x_1^*,\dots,x_p^*) + \frac{\partial F}{\partial x_1}\Big|_{x=x^*}(x_1 - x_1^*) + \dots + \frac{\partial F}{\partial x_p}\Big|_{x=x^*}(x_p - x_p^*) + v \qquad (187)$$
or
$$\ell \simeq F(x^*) + [\nabla F(x^*)]^T(x - x^*) + v \qquad (188)$$
where $\nabla F(x^*)$ is the gradient of $F$, $[\nabla F(x^*)]^T = [\partial F/\partial x_1\ \dots\ \partial F/\partial x_p]_{x=x^*}$, evaluated at $x = x^* = [x_1^*,\dots,x_p^*]^T$. Geometrically speaking we work in the tangent hyperplane of $F(x)$ at $x^*$.

Write all $n$ equations in vector notation
$$\begin{bmatrix} \ell_1\\ \ell_2\\ \vdots\\ \ell_n \end{bmatrix} = \begin{bmatrix} F_1(x)\\ F_2(x)\\ \vdots\\ F_n(x) \end{bmatrix} + \begin{bmatrix} v_1\\ v_2\\ \vdots\\ v_n \end{bmatrix} \qquad (189)$$
or
$$\ell = F(x) + v \qquad (190)$$
and get
$$\ell \simeq F(x^*) + A(x - x^*) + v \qquad (191)$$
where the $n\times p$ derivative matrix $A$ is
$$A = \frac{\partial F}{\partial x} = \left[\frac{\partial F}{\partial x_1}\ \cdots\ \frac{\partial F}{\partial x_p}\right] = \begin{bmatrix} \frac{\partial F_1}{\partial x_1} & \cdots & \frac{\partial F_1}{\partial x_p}\\ \vdots & & \vdots\\ \frac{\partial F_n}{\partial x_1} & \cdots & \frac{\partial F_n}{\partial x_p} \end{bmatrix} \qquad (192)$$
with all $A_{ij} = \partial F_i/\partial x_j$ evaluated at $x_j = x_j^*$. Therefore we get (here we use $=$ instead of the correct $\simeq$)
$$k = A\Delta + v \qquad (193)$$
where $k = \ell - F(x^*)$ and $\Delta = x - x^*$ (Mærsk-Møller and Frederiksen (1984) use $k = F(x^*) - \ell$).

$\ell = F(x)$ are termed the fundamental equations in geodesy and land surveying. Equations 190 and 193 are termed the observation equations. Equation 193 is a linearized version.

2.1 Nonlinear WLS by Linearization

If we compare $k = A\Delta + v$ in Equation 193 with the linear expression $y = X\theta + e$ in Equation 34 and the normal equations for the linear WLS problem in Equation 115, we get the normal equations for the WLS estimate $\hat\Delta$ of the increment
$$A^T PA\,\hat\Delta = A^T Pk \qquad (194)$$
or $N\hat\Delta = c$ with $N = A^T PA$ and $c = A^T Pk$ (Mærsk-Møller and Frederiksen (1984) use $-k$ and therefore also $-c$).

2.1.1 Parameter Estimates

If the symmetric matrix $N = A^T PA$ is well behaved, i.e., it has full rank (equal to $p$) corresponding to linearly independent columns in $A$, a formal solution is
$$\hat\Delta = (A^T PA)^{-1}A^T Pk = N^{-1}c. \qquad (195)$$
For reasons of numerical stability, especially in situations with nearly linear dependencies between the columns of $A$ (causing slight alterations to the values in $A$ to lead to substantial changes in the estimated $\hat\Delta$; this problem is known as multicollinearity), the system of normal equations should not be solved by inverting $A^T PA$ but rather by means of SVD, QR or Cholesky decomposition, see Sections 1.1.6, 1.1.7 and 1.1.8.

When we apply regression analysis in other application areas we are often interested in predicting the response variable based on new data not used in the estimation of the parameters or the regression coefficients $x$. In land surveying and GNSS applications we are typically interested in $x$ and not in this predictive modelling.

2.1.2 Iterative Solution

To find the solution we update $x^*$ to $x^* + \hat\Delta$ and go again. For how long do we go again or iterate? Until the elements in $\hat\Delta$ become small, or based on a consideration in terms of the sum of weighted squared residuals
$$\hat v^T P\hat v = (k - A\hat\Delta)^T P(k - A\hat\Delta) \qquad (196)$$
$$= k^T Pk - k^T PA\hat\Delta - \hat\Delta^T A^T Pk + \hat\Delta^T A^T PA\hat\Delta \qquad (197)$$
$$= k^T Pk - k^T PA\hat\Delta - \hat\Delta^T A^T Pk + \hat\Delta^T A^T PA(A^T PA)^{-1}A^T Pk \qquad (198)$$
$$= k^T Pk - k^T PA\hat\Delta - \hat\Delta^T A^T Pk + \hat\Delta^T A^T Pk \qquad (199)$$
$$= k^T Pk - k^T PA\hat\Delta \qquad (200)$$
$$= k^T Pk - \hat\Delta^T A^T Pk \qquad (201)$$
$$= k^T Pk - \hat\Delta^T c \qquad (202)$$
$$= k^T Pk - c^T N^{-1}c. \qquad (203)$$
Hence
$$\frac{k^T Pk}{\hat v^T P\hat v} = 1 + \frac{c^T N^{-1}c}{\hat v^T P\hat v} \geq 1. \qquad (204)$$
Therefore we iterate until the ratio of the two quadratic forms on the right hand side is small compared to 1.
The method described here is identical to the Gauss-Newton method sketched in Section 2.2.3 with A as the Jacobian.
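A schematic Matlab sketch of this iteration; Ffun and Afun are hypothetical user-supplied functions returning F(x) and the derivative matrix A at the current elements, and x0, ell, P, maxit and tol are assumed given.

% nonlinear WLS by linearization (Gauss-Newton), schematic
x = x0;                              % initial elements
for iter = 1:maxit
  k = ell - Ffun(x);                 % observed minus computed, k = l - F(x*)
  A = Afun(x);                       % derivative matrix, Equation 192
  N = A'*P*A;  c = A'*P*k;
  Delta = N\c;                       % normal equations, Equation 194
  x = x + Delta;                     % update the elements
  vPv = (k - A*Delta)'*P*(k - A*Delta);
  if c'*Delta < tol*vPv, break, end  % ratio in Equation 204 small compared to 1
end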

2.1.3 Dispersion and Significance of Estimates

When iterations are over and we have a solution we find dispersion or variance-covariance matrices for $\ell$, $\hat x$, $\hat\ell$ and $\hat v$ (again by analogy with the linear WLS case; the $Q$s are (nearly) Mærsk-Møller and Frederiksen (1984) notation, and again we use $=$ instead of the correct $\simeq$)
$$Q_\ell = D\{\ell\} = \sigma_0^2 P^{-1} \qquad (205)$$
$$Q_x = D\{\hat x_{WLS}\} = \sigma_0^2(A^T PA)^{-1} = \sigma_0^2 N^{-1} \qquad (206)$$
$$Q_{\hat\ell} = D\{\hat\ell\} = \sigma_0^2 AN^{-1}A^T \qquad (207)$$
$$Q_v = D\{\hat v\} = \sigma_0^2(P^{-1} - AN^{-1}A^T) = D\{\ell\} - D\{\hat\ell\} = Q_\ell - Q_{\hat\ell}. \qquad (208)$$
For the weighted sum of squared errors (SSE, also called RSS for the residual sum of squares) we get
$$\hat v^T P\hat v = k^T P(I - AN^{-1}A^T P)k = k^T P(I - H)k \qquad (209)$$
with expectation $E\{\hat v^T P\hat v\} = \sigma_0^2(n - p)$. Here $H = AN^{-1}A^T P$. The mean squared error MSE is
$$\hat\sigma_0^2 = \hat v^T P\hat v/(n - p) \qquad (210)$$
and the root mean squared error RMSE is $\hat\sigma_0$, also known as $s_0$. $\hat\sigma_0 = s_0$ has no unit. For well performed measurements (with no outliers or gross errors), a good model, and properly chosen weights (see Section 1.2.2), $s_0 \simeq 1$. This is due to the fact that, assuming that the $v_i$s with variance $\sigma_0^2/p_i$ are independent and normally distributed, $\hat v^T P\hat v$ (with well chosen $p_i$, see Section 1.2.2) follows a $\chi^2$ distribution with $n - p$ degrees of freedom which has expectation $n - p$. Therefore $\hat v^T P\hat v/(n - p)$ has expectation 1 and its square root is approximately 1.

The square roots of the diagonal elements of the dispersion matrices in Equations 205, 206, 207 and 208 are the standard errors of the quantities in question. For example, the standard error of $\hat x_i$, denoted $\hat\sigma_{\hat x_i}$, is the square root of the $i$th diagonal element of $\hat\sigma_0^2(A^T PA)^{-1}$.
The remarks on 1) the distribution and significance of $\hat x$ in Section 1.2.3, and 2) on influence and leverage in Section 1.1.5, are valid here also.

$A$ and $\hat v$, and $\hat k$ and $\hat v$ are orthogonal (with respect to $P$): $A^T P\hat v = \mathbf{0}$ and $\hat k^T P\hat v = 0$. Geometrically this means that our analysis finds the orthogonal projection (with respect to $P$) $\hat k$ of $k$ onto the hyperplane spanned by the linearly independent columns of $A$. This gives the shortest distance between $k$ and $\hat k$ in the norm defined by $P$.

2.1.4 Confidence Ellipsoids

We already described the quality of the estimates in $\hat{x}$ by means of their standard deviations, i.e., the square roots of the diagonal
elements of $D\{\hat{x}\} = Q_{\hat{x}}$. Another description which allows for the covariances between the elements of $\hat{x}$ is based on confidence
ellipsoids. A confidence ellipsoid or error ellipsoid is described by the equation

$(x - \hat{x})^T Q_{\hat{x}}^{-1} (x - \hat{x}) = q$  (211)
$y^T (V \Lambda V^T)^{-1} y = q$  (212)
$y^T V \Lambda^{-1} V^T y = q$  (213)
$(\Lambda^{-1/2} V^T y)^T (\Lambda^{-1/2} V^T y) = q$  (214)
$(\Lambda^{-1/2} z)^T (\Lambda^{-1/2} z) = q$  (215)
$(z_1/\sqrt{\lambda_1})^2 + \cdots + (z_p/\sqrt{\lambda_p})^2 = q$  (216)
$(z_1/\sqrt{q \lambda_1})^2 + \cdots + (z_p/\sqrt{q \lambda_p})^2 = 1$  (217)

($q \geq 0$) where V is a matrix with the eigenvectors of $Q_{\hat{x}} = V \Lambda V^T$ in the columns (hence $V^T V = V V^T = I$) and $\Lambda$ is a
diagonal matrix of eigenvalues of $Q_{\hat{x}}$; $y = x - \hat{x}$ and $z = V^T y$, $y = V z$. This shows that the ellipsoid has semi axes in the
directions of the eigenvectors and that their lengths are proportional to the square roots of the eigenvalues. The constant q depends
on the confidence level and the distribution of the left hand side, see below. Since $Q_{\hat{x}} = \hat{\sigma}_0^2 (A^T P A)^{-1}$ with known A and P we
have two situations 1) $\sigma_0^2$ known and 2) $\sigma_0^2$ unknown.

$\sigma_0^2$ known  In practice $\sigma_0^2$ is unknown so this case does not occur in the real world. If, however, $\sigma_0^2$ were known, $(x - \hat{x})^T Q_{\hat{x}}^{-1} (x - \hat{x})$
would follow a $\chi^2$ distribution with p degrees of freedom, $(x - \hat{x})^T Q_{\hat{x}}^{-1} (x - \hat{x}) \sim \chi^2(p)$, and the semi axes of a, say, 95%
confidence ellipsoid would be $\sqrt{q \lambda_i}$ where q is the 95% fractile of the $\chi^2(p)$ distribution and $\lambda_i$ are the eigenvalues of $Q_{\hat{x}}$.

$\sigma_0^2$ unknown  In this case we estimate $\sigma_0^2$ as $\hat{\sigma}_0^2 = \hat{v}^T P \hat{v}/(n - p)$ which means that $(n - p)\hat{\sigma}_0^2 \sim \sigma_0^2 \chi^2(n - p)$. Also, $(x - \hat{x})^T (A^T P A)(x - \hat{x}) \sim \sigma_0^2 \chi^2(p)$. This means that

$\frac{(x - \hat{x})^T (A^T P A)(x - \hat{x})/p}{\hat{\sigma}_0^2} \sim F(p, n - p)$  (218)

(since the independent numerator and denominator above follow $\sigma_0^2 \chi^2(p)/p$ and $\sigma_0^2 \chi^2(n - p)/(n - p)$ distributions, respectively. As
n goes to infinity the above quantity multiplied by p approaches a $\chi^2(p)$ distribution so the above case with $\sigma_0^2$ known serves as
a limiting case.) The semi axes of a, say, 95% confidence ellipsoid are $\sqrt{q\, p\, \lambda_i}$ where q is the 95% fractile of the $F(p, n - p)$
distribution, p is the number of parameters and $\lambda_i$ are the eigenvalues of $Q_{\hat{x}}$. If a subset of $m < p$ parameters are studied the
semi axes of a, say, 95% confidence ellipsoid of the appropriate submatrix of $Q_{\hat{x}}$ are $\sqrt{q\, m\, \lambda_i}$ where q is the 95% fractile of
the $F(m, n - p)$ distribution, m is the number of parameters studied and $\lambda_i$ are the eigenvalues of that submatrix, see also Examples 10
(page 32) and 11 (page 40) with Matlab code.
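
A minimal Matlab sketch of the $\sigma_0^2$ unknown case (Qx is assumed to hold the dispersion matrix of the m parameters studied; the fractile q of the $F(m, n-p)$ distribution is assumed given, e.g. looked up in a table or computed with finv from the Statistics Toolbox):

[V,D] = eig(Qx);                 % eigenvectors (columns of V) and eigenvalues of Qx
lambda = diag(D);
semiaxes = sqrt(q*m*lambda);     % semi axis lengths of the confidence ellipsoid
% the ith semi axis points in the direction of the eigenvector V(:,i)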

2.1.5 Dispersion of a Function of Estimated Parameters

To estimate the dispersion of some function f of the estimated parameters/elements (e.g. a distance deter-
mined by estimated coordinates) we perform a first order Taylor expansion around $\hat{x}$

$f(x) \simeq f(\hat{x}) + [\nabla f(\hat{x})]^T (x - \hat{x}).$  (219)

With $g = \nabla f(\hat{x})$ we get (again we use = instead of the correct $\simeq$)

$D\{f\} = \hat{\sigma}_0^2\, g^T (A^T P A)^{-1} g,$  (220)

see also Example 10 (page 34, example starts on page 32) with Matlab code.
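
A minimal Matlab sketch of Equation 220 where the gradient g is found by numerical differentiation (fun is an assumed name for a function of the p estimated elements in xhat; N = A'*P*A and s02 are taken from the adjustment):

delta = 1e-3;                            % step for numerical differentiation
g = zeros(p,1);
for j = 1:p
    xj = xhat; xj(j) = xj(j) + delta;
    g(j) = (fun(xj) - fun(xhat))/delta;  % jth element of the gradient of f
end
Df = s02*(g'*(N\g));                     % dispersion of f, Equation 220
stdf = sqrt(Df);                         % standard error of f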

2.1.6 The Derivative Matrix

The elements of the derivative matrix A, $A_{ij} = \partial F_i/\partial x_j$, can be evaluated analytically or numerically.

Analytical partial derivatives for height or levelling observations are ($z_A$ is the height in point A, $z_B$ is
the height in point B)

$F = z_B - z_A$  (221)
$\frac{\partial F}{\partial z_A} = -1$  (222)
$\frac{\partial F}{\partial z_B} = 1.$  (223)

Equation 221 is obviously linear. If we do levelling only and don't combine with distance or directional
observations we can do linear adjustment and we don't need the iterative procedure and the initial values for
the elements (a minimal sketch is given after this paragraph). There are very few other geodesy, land surveying and GNSS related problems which can be
solved by linear adjustment.
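
As such a minimal sketch, consider a hypothetical levelling network with two unknown heights z1 and z2, one known height z0 = 10.000 m, and three levelled height differences z1 - z0, z2 - z1 and z0 - z2 (all numbers are made up for illustration only). With A built directly from Equations 221-223 the adjustment is ordinary linear WLS and needs no iterations:

% hypothetical levelling example: observations are height differences [m]
l  = [1.234; 0.512; -1.750];         % z1-z0, z2-z1, z0-z2
z0 = 10.000;                         % known height [m]
A  = [ 1  0;                         % z1-z0
      -1  1;                         % z2-z1
       0 -1];                        % z0-z2
b  = l - [-z0; 0; z0];               % move the known height to the left hand side
P  = eye(3);                         % equal weights
xhat = (A'*P*A)\(A'*P*b);            % estimated heights z1 and z2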

Analytical partial derivatives for 3-D distance observations are (remember that $d(\sqrt{u})/du = 1/(2\sqrt{u})$ and
use the chain rule for differentiation)

$F = \sqrt{(x_B - x_A)^2 + (y_B - y_A)^2 + (z_B - z_A)^2}$  (224)
$= d_{AB}$  (225)
$\frac{\partial F}{\partial x_A} = \frac{1}{2 d_{AB}}\, 2 (x_B - x_A)(-1)$  (226)
$= -\frac{x_B - x_A}{d_{AB}}$  (227)
$\frac{\partial F}{\partial x_B} = -\frac{\partial F}{\partial x_A}$  (228)

and similarly for $y_A$, $y_B$, $z_A$ and $z_B$.

Analytical partial derivatives for 2-D distance observations are

$F = \sqrt{(x_B - x_A)^2 + (y_B - y_A)^2}$  (229)
$= a_{AB}$  (230)
$\frac{\partial F}{\partial x_A} = \frac{1}{2 a_{AB}}\, 2 (x_B - x_A)(-1)$  (231)
$= -\frac{x_B - x_A}{a_{AB}}$  (232)
$\frac{\partial F}{\partial x_B} = -\frac{\partial F}{\partial x_A}$  (233)

and similarly for $y_A$ and $y_B$.

Analytical partial derivatives for horizontal direction observations are (remember that $d(\arctan u)/du =$
$1/(1 + u^2)$ and again use the chain rule; arctan gives radians, $r_A$ is in gon, $\rho = 200/\pi$ gon; $r_A$ is related to
the arbitrary zero for the horizontal direction measurement termed the orientation unknown10)

$F = \rho \arctan\frac{y_B - y_A}{x_B - x_A} - r_A$  (234)
$\frac{\partial F}{\partial x_A} = \rho\, \frac{1}{1 + \left(\frac{y_B - y_A}{x_B - x_A}\right)^2}\, \frac{y_B - y_A}{(x_B - x_A)^2}$  (235)
$= \rho\, \frac{y_B - y_A}{a_{AB}^2}$  (236)
$\frac{\partial F}{\partial x_B} = -\frac{\partial F}{\partial x_A}$  (237)
$\frac{\partial F}{\partial y_A} = \rho\, \frac{1}{1 + \left(\frac{y_B - y_A}{x_B - x_A}\right)^2}\, \frac{-1}{x_B - x_A}$  (238)
$= -\rho\, \frac{x_B - x_A}{a_{AB}^2}$  (239)
$\frac{\partial F}{\partial y_B} = -\frac{\partial F}{\partial y_A}$  (240)
$\frac{\partial F}{\partial r_A} = -1.$  (241)
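
The Example 10 code below only implements numerical derivatives; as a minimal sketch, the row of A for one horizontal direction observation from point A = (xA, yA) to a fix point B = (xB, yB), with the elements ordered [xA yA rA], could be built analytically from Equations 235-241 as follows (all variable names are assumptions):

rho = 200/pi;                          % radians to gon
a2  = (xB - xA)^2 + (yB - yA)^2;       % squared horizontal distance
Arow = [ rho*(yB - yA)/a2, ...         % dF/dxA, Equation 236
        -rho*(xB - xA)/a2, ...         % dF/dyA, Equation 239
        -1               ];            % dF/drA, Equation 241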

Numerical partial derivatives can be calculated as

$\frac{\partial F(x_1, x_2, \ldots, x_p)}{\partial x_1} \simeq \frac{F(x_1 + \delta, x_2, \ldots, x_p) - F(x_1, x_2, \ldots, x_p)}{\delta}$  (242)
$\frac{\partial F(x_1, x_2, \ldots, x_p)}{\partial x_2} \simeq \frac{F(x_1, x_2 + \delta, \ldots, x_p) - F(x_1, x_2, \ldots, x_p)}{\delta}$  (243)
$\vdots$

or we could use a symmetrized form

$\frac{\partial F(x_1, x_2, \ldots, x_p)}{\partial x_1} \simeq \frac{F(x_1 + \delta, x_2, \ldots, x_p) - F(x_1 - \delta, x_2, \ldots, x_p)}{2\delta}$  (244)
$\frac{\partial F(x_1, x_2, \ldots, x_p)}{\partial x_2} \simeq \frac{F(x_1, x_2 + \delta, \ldots, x_p) - F(x_1, x_2 - \delta, \ldots, x_p)}{2\delta}$  (245)
$\vdots$

both with $\delta$ appropriately small. Generally, one should be careful with numerical derivatives. There are two
sources of error in the above equations, roundoff error that has to do with exact representation in the computer,
and truncation error having to do with the magnitude of $\delta$. In relation to Global Navigation Satellite System
(GNSS) distance observations we are dealing with Fs with values larger than 20,000,000 meters (this is the
approximate nadir distance from the GNSS space vehicles to the surface of the earth). In this connection a $\delta$
of 1 meter is small compared to F, it has an exact representation in the computer, and we don't have to do
the division by $\delta$ (since it equals one). Note that when we use numerical partial derivatives we need p + 1
function evaluations (2p for the symmetrized form) for each iteration rather than one.

10 in Danish kredsdrejningselement
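
A minimal Matlab sketch of Equations 242-245 for a general observation function F of the p elements in x (F is assumed to be a function handle; the step delta and the other names are assumptions):

delta = 1;                                  % appropriate for GNSS ranges in metres
dFdx  = zeros(1,p);
for j = 1:p
    xp = x; xp(j) = xp(j) + delta;          % forward difference, Equations 242-243
    dFdx(j) = (F(xp) - F(x))/delta;
    % symmetrized form, Equations 244-245:
    % xm = x; xm(j) = xm(j) - delta;
    % dFdx(j) = (F(xp) - F(xm))/(2*delta);
end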
Example 10 (from Mærsk-Møller and Frederiksen, 1984, p. 86) This is a traditional land surveying example. From point 103
with unknown (2-D) coordinates we measure horizontal directions and distances to four points 016, 020, 015 and 013 (no distance
is measured to point 020), see Figure 5. We wish to determine the coordinates of point 103 and the orientation unknown by means
of nonlinear weighted least squares adjustment. The number of parameters is p = 3.

Points 016, 020, 015 and 013 are considered as error free fix points. Their coordinates are

Point x [m] y [m]


016 3725.10 3980.17
020 3465.74 4268.33 .
015 3155.96 4050.70
013 3130.55 3452.06

We measure four horizontal directions and three distances so we have seven observations, n = 7. Therefore we have f = 7 - 3 = 4
degrees of freedom. We determine the (2-D) coordinates $[x\ y]^T$ of point 103 and the orientation unknown r, so $[x_1\ x_2\ x_3]^T =$
$[x\ y\ r]^T$. The observation equations are (assuming that arctan gives radians and we want gon, $\rho = 200/\pi$ gon)

$\ell_1 = \rho \arctan\frac{3980.17 - y}{3725.10 - x} - r + v_1$  (246)
$\ell_2 = \rho \arctan\frac{4268.33 - y}{3465.74 - x} - r + v_2$  (247)
$\ell_3 = \rho \arctan\frac{4050.70 - y}{3155.96 - x} - r + v_3$  (248)
$\ell_4 = \rho \arctan\frac{3452.06 - y}{3130.55 - x} - r + v_4$  (249)
$\ell_5 = \sqrt{(3725.10 - x)^2 + (3980.17 - y)^2} + v_5$  (250)
$\ell_6 = \sqrt{(3155.96 - x)^2 + (4050.70 - y)^2} + v_6$  (251)
$\ell_7 = \sqrt{(3130.55 - x)^2 + (3452.06 - y)^2} + v_7.$  (252)

We obtain the following observations ($\ell_i$)

From   To     Horizontal        Horizontal
point  point  direction [gon]   distance [m]
103    016      0.000           706.260
103    020     30.013
103    015     56.555           614.208
103    013    142.445           132.745

where the directional observations are means of two measurements. As the initial value $[x^\circ\ y^\circ]^T$ for the coordinates $[x\ y]^T$ of
point 103 we choose the mean values for the coordinates of the four fix points. As the initial value $r^\circ$ for the orientation unknown r

Figure 5: From point 103 with unknown coordinates we measure horizontal directions and distances (no distance is measured to
point 020) to four points 016, 020, 015 and 013 (from Mærsk-Møller and Frederiksen, 1984; lefthand coordinate system).

we choose zero. First order Taylor expansions of the observation equations near the initial values give (assuming that arctan gives
radians and we want gon; units for the first four equations are gon, for the last three units are m)

$\ell_1 = \rho \arctan\frac{3980.17 - y^\circ}{3725.10 - x^\circ} - r^\circ + \rho\frac{3980.17 - y^\circ}{a_1^2}\Delta x - \rho\frac{3725.10 - x^\circ}{a_1^2}\Delta y - \Delta r + v_1$  (253)
$\ell_2 = \rho \arctan\frac{4268.33 - y^\circ}{3465.74 - x^\circ} - r^\circ + \rho\frac{4268.33 - y^\circ}{a_2^2}\Delta x - \rho\frac{3465.74 - x^\circ}{a_2^2}\Delta y - \Delta r + v_2$  (254)
$\ell_3 = \rho \arctan\frac{4050.70 - y^\circ}{3155.96 - x^\circ} - r^\circ + \rho\frac{4050.70 - y^\circ}{a_3^2}\Delta x - \rho\frac{3155.96 - x^\circ}{a_3^2}\Delta y - \Delta r + v_3$  (255)
$\ell_4 = \rho \arctan\frac{3452.06 - y^\circ}{3130.55 - x^\circ} - r^\circ + \rho\frac{3452.06 - y^\circ}{a_4^2}\Delta x - \rho\frac{3130.55 - x^\circ}{a_4^2}\Delta y - \Delta r + v_4$  (256)
$\ell_5 = a_1 - \frac{3725.10 - x^\circ}{a_1}\Delta x - \frac{3980.17 - y^\circ}{a_1}\Delta y + v_5$  (257)
$\ell_6 = a_3 - \frac{3155.96 - x^\circ}{a_3}\Delta x - \frac{4050.70 - y^\circ}{a_3}\Delta y + v_6$  (258)
$\ell_7 = a_4 - \frac{3130.55 - x^\circ}{a_4}\Delta x - \frac{3452.06 - y^\circ}{a_4}\Delta y + v_7$  (259)

where (units are m)

$a_1 = \sqrt{(3725.10 - x^\circ)^2 + (3980.17 - y^\circ)^2}$  (260)
$a_2 = \sqrt{(3465.74 - x^\circ)^2 + (4268.33 - y^\circ)^2}$  (261)
$a_3 = \sqrt{(3155.96 - x^\circ)^2 + (4050.70 - y^\circ)^2}$  (262)
$a_4 = \sqrt{(3130.55 - x^\circ)^2 + (3452.06 - y^\circ)^2}.$  (263)

In matrix form we get ($k = A\delta$; as above units for the first four equations are gon, for the last three units are m)

$$\begin{bmatrix}
0.000 - (\rho\arctan\frac{3980.17 - y^\circ}{3725.10 - x^\circ} - r^\circ) \\
30.013 - (\rho\arctan\frac{4268.33 - y^\circ}{3465.74 - x^\circ} - r^\circ) \\
56.555 - (\rho\arctan\frac{4050.70 - y^\circ}{3155.96 - x^\circ} - r^\circ) \\
142.445 - (\rho\arctan\frac{3452.06 - y^\circ}{3130.55 - x^\circ} - r^\circ) \\
706.260 - a_1 \\
614.208 - a_3 \\
132.745 - a_4
\end{bmatrix}
=
\begin{bmatrix}
\rho\frac{3980.17 - y^\circ}{a_1^2} & -\rho\frac{3725.10 - x^\circ}{a_1^2} & -1 \\
\rho\frac{4268.33 - y^\circ}{a_2^2} & -\rho\frac{3465.74 - x^\circ}{a_2^2} & -1 \\
\rho\frac{4050.70 - y^\circ}{a_3^2} & -\rho\frac{3155.96 - x^\circ}{a_3^2} & -1 \\
\rho\frac{3452.06 - y^\circ}{a_4^2} & -\rho\frac{3130.55 - x^\circ}{a_4^2} & -1 \\
-\frac{3725.10 - x^\circ}{a_1} & -\frac{3980.17 - y^\circ}{a_1} & 0 \\
-\frac{3155.96 - x^\circ}{a_3} & -\frac{4050.70 - y^\circ}{a_3} & 0 \\
-\frac{3130.55 - x^\circ}{a_4} & -\frac{3452.06 - y^\circ}{a_4} & 0
\end{bmatrix}
\begin{bmatrix} \Delta x \\ \Delta y \\ \Delta r \end{bmatrix}.$$  (264)

The starting weight matrix is (for directions: n = 2, $s_c$ = 0.002 m, and $s_t$ = 0.0015 gon; for distances: n = 1, $s_G$ = 0.005 m, and
$s_a$ = 0.005 m/1000 m = 0.000005), see Section 1.2.2 (units for the first four weights are mgon$^{-2}$, for the last three mm$^{-2}$)

$P^\circ = \mathrm{diag}(0.7992,\ 0.7925,\ 0.7127,\ 0.8472,\ 0.03545,\ 0.03780,\ 0.03094)$  (265)

and after eleven iterations with the Matlab code below we end with (again, units for the first four weights are mgon$^{-2}$, for the last
three mm$^{-2}$)

$P = \mathrm{diag}(0.8639,\ 0.8714,\ 0.8562,\ 0.4890,\ 0.02669,\ 0.02904,\ 0.03931).$  (266)

After the eleven iterations we get $[\hat{x}\ \hat{y}\ \hat{r}]^T$ = [3263.155 m  3445.925 m  54.612 gon]$^T$ with standard deviations [4.14 mm  2.49 mm
0.641 mgon]$^T$. The diagonal elements of the hat matrix H are [0.3629 0.3181 0.3014 0.7511 0.3322 0.2010 0.7332] and p/n =
3/7 = 0.4286. The diagonal elements of $A N^{-1} A^T$ are [0.4200 mgon$^2$ 0.3650 mgon$^2$ 0.3521 mgon$^2$ 1.5360 mgon$^2$ 12.4495 mm$^2$
6.9222 mm$^2$ 18.6528 mm$^2$]. In this case where the observations have different units and therefore are on completely different
scales, we need to look at the first four diagonal elements (representing the direction measurements) and the last three (representing
the distance measurements) separately. The average of the first four is 0.6683 mgon$^2$, the average of the last three is 12.6748 mm$^2$.
No observations have high leverages. Checking the diagonal of $A N^{-1} A^T$ of course corresponds to checking the variances (or
standard deviations) of the predicted observations $\hat{\ell}$. The estimated residuals are $\hat{v}$ = [0.2352 mgon 0.9301 mgon 0.9171 mgon
0.3638 mgon 5.2262 mm 6.2309 mm 2.3408 mm]$^T$. The resulting RMSE is $s_0$ = 0.9563. The probability of finding a larger
value for RSS = $\hat{v}^T P \hat{v}$ is 0.4542 so $s_0$ is suitably small.

As an example on application of Equation 220 we calculate the distance between fix point 020 and point 103 and the standard
deviation of the distance. From the Matlab code below we get the distance 846.989 m with a standard deviation of 2.66 mm.

The plots made in the code below allow us to study the iteration pattern of the Gauss-Newton method applied. The last plot
produced, see Figure 6, shows the four fix points as triangles, the initial coordinates for point 103 as a plus, and the iterated
solutions as circles marked with iteration number. The final solution is marked by both a plus and a circle. We see that since there
are eleven iterations the last 3-4 iterations overlap in the plot.

A 95% confidence ellipsoid for $[x\ y\ r]^T$ with semi axes 18.47, 11.05 and 2.41 ($\sqrt{p\,6.591\,\lambda_i}$ where p = 3 is the number of
parameters, 6.591 is the 95% fractile in the F(3, 4) distribution, and $\lambda_i$ are the eigenvalues of $Q_{\hat{x}} = \hat{\sigma}_0^2(A^T P A)^{-1}$) is shown in
Figure 7. Since the ellipsoid in the Matlab code in the notation of Section 2.1.4 on page 29 is generated in the z-space we rotate by
V to get to y-space. [end of example]
Matlab code for Example 10

% (C) Copyright 2003-2004


% Allan Aasbjerg Nielsen

Figure 6: Development of x and y coordinates of point 103 over iterations with first seven iterations annotated; righthand
coordinate system.

% aa@imm.dtu.dk, www.imm.dtu.dk/aa

% analytical or numerical partial derivatives?


%partial = 'analytical';
partial = 'n';

cst = 200/pi; % radian to gon


eps = 0.001; % for numerical differentiation

% positions of points 016, 020, 015 and 013 in network, [m]


xx = [3725.10 3465.74 3155.96 3130.55]';
yy = [3980.17 4268.33 4050.70 3452.06]';

% observations: 1-4 are directions [gon], 5-7 are distances [m]


l = [0 30.013 56.555 142.445 706.260 614.208 132.745]'; % l is \ell (not one)
n = size(l,1);

% initial values for elements: x- and y-coordinates for point 103 [m], and
% the direction unknown [gon]
x = [3263.150 3445.920 54.6122]';
% play with initial values to check robustness of method
x = [0 0 -200]';
x = [0 0 -100]';
x = [0 0 100]';
x = [0 0 200]';
x = [0 0 40000]';
x = [0 0 0]';
x = [100000 100000 0]';
x = [mean(xx) mean(yy) 0]';
%x = [mean(xx) 3452.06 0]'; % approx. same y as 013
p = size(x,1);

% desired units: mm and mgon


xx = 1000*xx;
yy = 1000*yy;

Figure 7: 95% ellipsoid for [x y r]T with projection on xy-plane.

l = 1000*l;
x = 1000*x;
cst = 1000*cst;

%number of degrees of freedom


f = n-p;

x0 = x;

sc = 0.002*1000;%[mm]
st = 0.0015*1000;%[mgon]
sG = 0.005*1000;%[mm]
sa = 0.000005;%[m/m], no unit
%a [mm]

idx = [];
e2 = [];
dc = [];
X = [];

for iter = 1:50 % iter ---------------------------------------------------------

% output from atan2 is in radian, convert to gon

F1 = cst.*atan2(yy-x(2),xx-x(1))-x(3);
a = (x(1)-xx).^2+(x(2)-yy).^2;
F2 = sqrt(a);
F = [F1; F2([1 3:end])]; % skip distance from 103 to 020

% weight matrix
%a [mm]
P = diag([2./(2*(cst*sc).^2./a+st^2); 1./(sG^2+a([1 3:end])*sa^2)]);
diag(P)

k = l-F; % l is \ell (not one)

A1 = [];
A2 = [];

if strcmp(partial,'analytical')
% A is matrix of analytical partial derivatives
error('not implemented yet');
else
% A is matrix of numerical partial derivatives
%directions
dF = (cst.*atan2(yy- x(2)     ,xx-(x(1)+eps))- x(3)     -F1)/eps;
A1 = [A1 dF];
dF = (cst.*atan2(yy-(x(2)+eps),xx- x(1)     )- x(3)     -F1)/eps;
A1 = [A1 dF];
dF = (cst.*atan2(yy- x(2)     ,xx- x(1)     )-(x(3)+eps)-F1)/eps;
A1 = [A1 dF];
%distances
dF = (sqrt((x(1)+eps-xx).^2+(x(2)    -yy).^2)-F2)/eps;
A2 = [A2 dF];
dF = (sqrt((x(1)    -xx).^2+(x(2)+eps-yy).^2)-F2)/eps;
A2 = [A2 dF];
dF = (sqrt((x(1)    -xx).^2+(x(2)    -yy).^2)-F2)/eps;
A2 = [A2 dF];
A2 = A2([1 3:4],:); % skip derivatives of distance from 103 to 020

A = [A1; A2];
end

N = A'*P;
c = N*k;
N = N*A;
%WLS
deltahat = N\c;
khat = A*deltahat;
vhat = k-khat;
e2 = [e2 vhat'*P*vhat];
dc = [dc deltahat'*c];
%update for iterations
x = x+deltahat;
X = [X x];

idx = [idx iter];

% stop iterations
itertst = (k'*P*k)/e2(end);
if itertst < 1.000001
break;
end

end % iter -------------------------------------------------------------------

%x-x0
% number of iterations
iter

%MSE
s02 = e2(end)/f;
%RMSE
s0 = sqrt(s02)

%Variance/covariance matrix of the observations, l


Dl = s02.*inv(P);
%Standard deviations
stdl = sqrt(diag(Dl))

%Variance/covariance matrix of the adjusted elements, xhat


Ninv = inv(N);
Dxhat = s02.*Ninv;
%Standard deviations
stdxhat = sqrt(diag(Dxhat))

%Variance/covariance matrix of the adjusted observations, lhat


Dlhat = s02.*A*Ninv*A';
%Standard deviations
stdlhat = sqrt(diag(Dlhat))

%Variance/covariance matrix of the adjusted residuals, vhat


Dvhat = Dl-Dlhat;

%Standard deviations
stdvhat = sqrt(diag(Dvhat))

%Correlations between adjusted elements, xhat


aux = diag(1./stdxhat);
corrxhat = aux*Dxhat*aux

% Standard deviation of estimated distance from 103 to 020


d020 = sqrt((xx(2)-x(1))^2+(yy(2)-x(2))^2);
%numerical partial derivatives of d020, i.e. gradient of d020
grad = [];
dF = (sqrt((xx(2)-(x(1)+eps))^2+(yy(2)- x(2)     )^2)-d020)/eps;
grad = [grad; dF];
dF = (sqrt((xx(2)- x(1)     )^2+(yy(2)-(x(2)+eps))^2)-d020)/eps;
grad = [grad; dF; 0];
stdd020 = s0*sqrt(grad'*Ninv*grad)

% plots to illustrate progress of iterations


figure
%plot(idx,e2);
semilogy(idx,e2);
title('v^TPv');
figure
%plot(idx,dc);
semilogy(idx,dc);
title('c^TN^{-1}c');
figure
%plot(idx,dc./e2);
semilogy(idx,dc./e2);
title('(c^TN^{-1}c)/(v^TPv)');
for i = 1:p
figure
plot(idx,X(i,:));
%semilogy(idx,X(i,:));
title('X(i,:) vs. iteration index');
end
figure
%loglog(x0(1),x0(2),'k+')
plot(x0(1),x0(2),'k+') % initial values for x and y
text(x0(1)+30000,x0(2)+30000,'103 start')
hold
% positions of points 016, 020, 015 and 013 in network
%loglog(xx,yy,'k^')
plot(xx,yy,'k^')
txt = ['016'; '020'; '015'; '013'];
for i = 1:4
text(xx(i)+30000,yy(i)+30000,txt(i,:));
end
for i = 1:iter
% loglog(X(1,i),X(2,i),'ko');
plot(X(1,i),X(2,i),'ko');
end
for i = 1:7
text(X(1,i)+30000,X(2,i)-30000,num2str(i));
end
plot(X(1,end),X(2,end),'k+');
%loglog(X(1,end),X(2,end),'k+');
text(X(1,end)+30000,X(2,end)+30000,'103 stop')
title('x and y over iterations');
%title('X(1,:) vs. X(2,:) over iterations');
axis equal
axis([2.6e6 4e6 2.8e6 4.4e6])

% t-values and probabilities of finding larger |t|


% pt should be smaller than, say, (5% or) 1%
t = x./stdxhat;
pt = betainc(f./(f+t.^2),0.5*f,0.5);

% probability of finding larger v'Pv

% should be greater than, say, 5% (or 1%)
pchi2 = 1-gammainc(0.5*e2(end),0.5*f);

% semi-axes in confidence ellipsoid


% 95% fractile for 3 dfs is 7.815 = 2.796^2
% 99% fractile for 3 dfs is 11.342 = 3.368^2
%[vDxhat dDxhat] = eigsort(Dxhat(1:2,1:2));
[vDxhat dDxhat] = eigsort(Dxhat);
%semiaxes = sqrt(diag(dDxhat));
% 95% fractile for 2 dfs is 5.991 = 2.448^2
% 99% fractile for 2 dfs is 9.210 = 3.035^2
% df F(3,df).95 F(3,df).99
% 1 215.71 5403.1
% 2 19.164 99.166
% 3 9.277 29.456
% 4 6.591 16.694
% 5 5.409 12.060
% 10 3.708 6.552
% 100 2.696 3.984
% inf 2.605 3.781
% chi2 approximation, 95% fractile
semiaxes = sqrt(7.815*diag(dDxhat))
figure
ellipsoidrot(0,0,0,semiaxes(1),semiaxes(2),semiaxes(3),vDxhat);
axis equal
xlabel('x [mm]'); ylabel('y [mm]'); zlabel('r [mgon]');
title('95% confidence ellipsoid, \chi^2 approx.')
% F approximation, 95% fractile. NB the fractile depends on df
semiaxes = sqrt(3*6.591*diag(dDxhat))
figure
ellipsoidrot(0,0,0,semiaxes(1),semiaxes(2),semiaxes(3),vDxhat);
axis equal
xlabel('x [mm]'); ylabel('y [mm]'); zlabel('r [mgon]');
title('95% confidence ellipsoid, F approx.')
view(37.5,15)
print -depsc2 confxyr.eps
%clear
%close all

function [v,d] = eigsort(a)


[v1,d1] = eig(a);
d2 = diag(d1);
[dum,idx] = sort(d2);
v = v1(:,idx);
d = diag(d2(idx));

function [xx,yy,zz] = ellipsoidrot(xc,yc,zc,xr,yr,zr,Q,n)


%ELLIPSOID Generate ellipsoid.
%
% [X,Y,Z] = ELLIPSOID(XC,YC,ZC,XR,YR,ZR,Q,N) generates three
% (N+1)-by-(N+1) matrices so that SURF(X,Y,Z) produces a rotated
% ellipsoid with center (XC,YC,ZC) and radii XR, YR, ZR.
%
% [X,Y,Z] = ELLIPSOID(XC,YC,ZC,XR,YR,ZR,Q) uses N = 20.
%
% ELLIPSOID(...) and ELLIPSOID(...,N) with no output arguments
% graph the ellipsoid as a SURFACE and do not return anything.
%
% The ellipsoidal data is generated using the equation after rotation with
% orthogonal matrix Q:
%
% (X-XC)^2   (Y-YC)^2   (Z-ZC)^2
% -------- + -------- + -------- = 1
%   XR^2       YR^2       ZR^2
%
% See also SPHERE, CYLINDER.

% Modified by Allan Aasbjerg Nielsen (2004) after


% Laurens Schalekamp and Damian T. Packer
% Copyright 1984-2002 The MathWorks, Inc.
% $Revision: 1.7 $ $Date: 2002/06/14 20:33:49 $

error(nargchk(7,8,nargin));

if nargin == 7
n = 20;
end

[x,y,z] = sphere(n);

x = xr*x;
y = yr*y;
z = zr*z;
xvec = Q*[reshape(x,1,(n+1)^2); reshape(y,1,(n+1)^2); reshape(z,1,(n+1)^2)];
x = reshape(xvec(1,:),n+1,n+1)+xc;
y = reshape(xvec(2,:),n+1,n+1)+yc;
z = reshape(xvec(3,:),n+1,n+1)+zc;
if(nargout == 0)
surf(x,y,z)
% surfl(x,y,z)
% surfc(x,y,z)
axis equal
%shading interp
%colormap gray
else
xx = x;
yy = y;
zz = z;
end

Example 11 In this example we have data on the positions of Navstar Global Positioning System (GPS)
space vehicles (SV) 1, 4, 7, 13, 20, 24 and 25 and pseudoranges from our position to the SVs. We want
to determine the (3-D) coordinates [X Y Z]T of our position and the clock error in our GPS receiver, cdT ,
[x1 x2 x3 x4 ]T = [X Y Z cdT ]T , so the number of parameters is p = 4. The positions of and pseudoranges
($\ell$) to the SVs given in a data file from the GPS receiver are

SV    X [m]             Y [m]             Z [m]             $\ell$ [m]

 1    16,577,402.072     5,640,460.750    20,151,933.185    20,432,524.0
 4    11,793,840.229   -10,611,621.371    21,372,809.480    21,434,024.4
 7    20,141,014.004   -17,040,472.264     2,512,131.115    24,556,171.0
13    22,622,494.101    -4,288,365.463    13,137,555.567    21,315,100.2
20    12,867,750.433    15,820,032.908    16,952,442.746    21,255,217.0
24    -3,189,257.131   -17,447,568.373    20,051,400.790    24,441,547.2
25    -7,437,756.358    13,957,664.984    21,692,377.935    23,768,678.3

The true position is (that of the no longer existing GPS station at Landmålervej in Hjortekær) $[X\ Y\ Z]^T$ =
[3,507,884.948  780,492.718  5,251,780.403]$^T$ m. We have seven observations, n = 7. Therefore we have
f = n - p = 7 - 4 = 3 degrees of freedom. The observation equations are (in m)

$\ell_1 = \sqrt{(16577402.072 - X)^2 + (5640460.750 - Y)^2 + (20151933.185 - Z)^2} + cdT + v_1$  (267)
$\ell_2 = \sqrt{(11793840.229 - X)^2 + (-10611621.371 - Y)^2 + (21372809.480 - Z)^2} + cdT + v_2$  (268)
$\ell_3 = \sqrt{(20141014.004 - X)^2 + (-17040472.264 - Y)^2 + (2512131.115 - Z)^2} + cdT + v_3$  (269)
$\ell_4 = \sqrt{(22622494.101 - X)^2 + (-4288365.463 - Y)^2 + (13137555.567 - Z)^2} + cdT + v_4$  (270)
$\ell_5 = \sqrt{(12867750.433 - X)^2 + (15820032.908 - Y)^2 + (16952442.746 - Z)^2} + cdT + v_5$  (271)
$\ell_6 = \sqrt{(-3189257.131 - X)^2 + (-17447568.373 - Y)^2 + (20051400.790 - Z)^2} + cdT + v_6$  (272)
$\ell_7 = \sqrt{(-7437756.358 - X)^2 + (13957664.984 - Y)^2 + (21692377.935 - Z)^2} + cdT + v_7.$  (273)

As the initial values $[X^\circ\ Y^\circ\ Z^\circ\ cdT^\circ]^T$ we choose $[0\ 0\ 0\ 0]^T$, center of the Earth, no clock error. First order
Taylor expansions of the observation equations near the initial values give (in m)

$\ell_i = d_i^\circ + cdT^\circ - \frac{X_i - X^\circ}{d_i^\circ}\,\Delta X - \frac{Y_i - Y^\circ}{d_i^\circ}\,\Delta Y - \frac{Z_i - Z^\circ}{d_i^\circ}\,\Delta Z + \Delta cdT + v_i, \quad i = 1, \ldots, 7$  (274)-(280)

where $(X_i, Y_i, Z_i)$ are the coordinates of the ith SV in the table above and where (in m)

$d_i^\circ = \sqrt{(X_i - X^\circ)^2 + (Y_i - Y^\circ)^2 + (Z_i - Z^\circ)^2}, \quad i = 1, \ldots, 7.$  (281)-(287)

In matrix form we get ($k = A\delta$; as above units are m)

$$\begin{bmatrix}
20432524.0 - d_1^\circ - cdT^\circ \\
21434024.4 - d_2^\circ - cdT^\circ \\
24556171.0 - d_3^\circ - cdT^\circ \\
21315100.2 - d_4^\circ - cdT^\circ \\
21255217.0 - d_5^\circ - cdT^\circ \\
24441547.2 - d_6^\circ - cdT^\circ \\
23768678.3 - d_7^\circ - cdT^\circ
\end{bmatrix}
=
\begin{bmatrix}
-\frac{X_1 - X^\circ}{d_1^\circ} & -\frac{Y_1 - Y^\circ}{d_1^\circ} & -\frac{Z_1 - Z^\circ}{d_1^\circ} & 1 \\
\vdots & \vdots & \vdots & \vdots \\
-\frac{X_7 - X^\circ}{d_7^\circ} & -\frac{Y_7 - Y^\circ}{d_7^\circ} & -\frac{Z_7 - Z^\circ}{d_7^\circ} & 1
\end{bmatrix}
\begin{bmatrix} \Delta X \\ \Delta Y \\ \Delta Z \\ \Delta cdT \end{bmatrix}$$  (288)

where the ith row of A uses the coordinates $(X_i, Y_i, Z_i)$ of the ith SV listed in the table above.

After five iterations with the Matlab code below (with all observations weighted equally, $p_i = 1/(10\,\mathrm{m})^2$)
we get $[\hat{X}\ \hat{Y}\ \hat{Z}\ \widehat{cdT}]^T$ = [3,507,889.1  780,490.0  5,251,783.8  25,511.1]$^T$ m with standard deviations
[6.42  5.31  11.69  7.86]$^T$ m. 25,511.1 m corresponds to a clock error of 0.085 ms. The difference between the
true position and the solution found is [4.18  2.70  3.35]$^T$ m, all well within one standard deviation. The
corresponding distance is 6.00 m. Figure 8 shows the four parameters over the iterations including the starting
guess. The diagonal elements of the hat matrix H are [0.4144 0.5200 0.8572 0.3528 0.4900 0.6437 0.7218]
and p/n = 4/7 = 0.5714 so no observations have high leverages. The estimated residuals are $\hat{v}$ =
[5.80 5.10 0.74 5.03 3.20 5.56 5.17]$^T$ m. With prior variances of 10$^2$ m$^2$, the resulting RMSE is
$s_0$ = 0.7149.

Figure 8: Parameters [X Y Z cdT ]T over iterations including the starting guess.

The probability of finding a larger value for RSS = $\hat{v}^T P \hat{v}$ is 0.6747 so $s_0$ is suitably small. With prior variances of 5$^2$ m$^2$ instead
of 10$^2$ m$^2$, $s_0$ is 1.4297 and the probability of finding a larger value for RSS = $\hat{v}^T P \hat{v}$ is 0.1054 so also in this situation $s_0$ is
suitably small. With prior variances of 3$^2$ m$^2$ instead of 10$^2$ m$^2$, $s_0$ is 2.3828 and the probability of finding a larger value for
RSS = $\hat{v}^T P \hat{v}$ is 0.0007 so in this situation $s_0$ is too large.

95% confidence ellipsoids for $[X\ Y\ Z]^T$ in an earth-centered-earth-fixed (ECEF) coordinate system and in a local Easting-
Northing-Up (ENU) coordinate system are shown in Figure 9. The semi axes in both the ECEF and the ENU systems are 64.92,
30.76 and 23.96 (this is $\sqrt{m\,9.277\,\lambda_i}$ where m = 3 is the number of parameters, 9.277 is the 95% fractile in the F(3, 3) distribu-
tion, and $\lambda_i$ are the eigenvalues of $Q_{XYZ}$, the upper-left 3 x 3 submatrix of $Q_{\hat{x}} = \hat{\sigma}_0^2(A^T P A)^{-1}$); units are metres. The rotation
to the local ENU system is performed by means of the orthogonal matrix ($F^T F = F F^T = I$)

$$F^T = \begin{bmatrix} -\sin\lambda & \cos\lambda & 0 \\ -\sin\phi\cos\lambda & -\sin\phi\sin\lambda & \cos\phi \\ \cos\phi\cos\lambda & \cos\phi\sin\lambda & \sin\phi \end{bmatrix}$$  (289)

where $\phi$ is the latitude and $\lambda$ is the longitude. The variance-covariance matrix of the position estimates in the ENU coordinate
system is $Q_{ENU} = F^T Q_{XYZ} F$. Since

$Q_{XYZ}\, a = \lambda a$  (290)
$F^T Q_{XYZ} F F^T a = \lambda F^T a$  (291)
$Q_{ENU} (F^T a) = \lambda (F^T a)$  (292)

we see that $Q_{XYZ}$ and $Q_{ENU}$ have the same eigenvalues and their eigenvectors are related as indicated. Since the ellipsoid in the
Matlab code in the notation of Section 2.1.4 on page 29 is generated in the z-space we rotate by V to get to y-space.

Dilution of Precision, DOP  Satellite positioning works best when there is a good angular separation between the space vehicles.
A measure of this separation is termed the dilution of precision, DOP. Low values of the DOP correspond to a good angular
separation, high values to a bad angular separation, i.e., a high degree of clustering of the SVs. There are several versions of DOP.
From Equation 206 the dispersion of the parameters is

$Q_{\hat{x}} = D\{\hat{x}_{WLS}\} = \hat{\sigma}_0^2 (A^T P A)^{-1}.$  (293)

This matrix has contributions from our prior expectations to the precision of the measurements (P), the actual precision of the
measurements ($\hat{\sigma}_0^2$) and the geometry of the problem (A). Let's look at the geometry alone and define the symmetric matrix

$$Q_{DOP} = Q_{\hat{x}}/(\hat{\sigma}_0^2 \sigma_{prior}^2) = (A^T P A)^{-1}/\sigma_{prior}^2 = \begin{bmatrix} q_X^2 & q_{XY} & q_{XZ} & q_{XcdT} \\ q_{XY} & q_Y^2 & q_{YZ} & q_{YcdT} \\ q_{XZ} & q_{YZ} & q_Z^2 & q_{ZcdT} \\ q_{XcdT} & q_{YcdT} & q_{ZcdT} & q_{cdT}^2 \end{bmatrix}$$  (294)

where $\sigma_{prior}^2 = \sigma_{i,prior}^2$, i.e., all prior variances are equal, see Section 1.2.2. In WLS (with equal weights on all observations) this
corresponds to $Q_{DOP} = (A^T A)^{-1}$.

We are now ready to define the position DOP

$PDOP = \sqrt{q_X^2 + q_Y^2 + q_Z^2},$  (295)

the time DOP

$TDOP = \sqrt{q_{cdT}^2} = q_{cdT}$  (296)

and the geometric DOP

$GDOP = \sqrt{q_X^2 + q_Y^2 + q_Z^2 + q_{cdT}^2}$  (297)

which is the square root of the trace of $Q_{DOP}$. It is easily seen that $GDOP^2 = PDOP^2 + TDOP^2$.

In practice PDOP values less than 2 are considered excellent, between 2 and 4 good, up to 6 acceptable. PDOP values greater than
around 6 are considered suspect.

DOP is a measure of size of the matrix $Q_{DOP}$ (or sub-matrices thereof, for PDOP for example the upper left three by three matrix).
As an alternative measure of this size we could use the determinant. Such a DOP measure would allow for off-diagonal elements
of $Q_{DOP}$, i.e., for covariances between the final estimates of the position. Determinant based DOP measures are not used in the
GNSS literature.
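
A minimal Matlab sketch of Equations 294-297 (A, P and sprior2 as in the Example 11 code below):

Qdop = inv(A'*P*A)/sprior2;          % Equation 294
PDOP = sqrt(trace(Qdop(1:3,1:3)));   % Equation 295
TDOP = sqrt(Qdop(4,4));              % Equation 296
GDOP = sqrt(trace(Qdop));            % Equation 297, GDOP^2 = PDOP^2 + TDOP^2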

After rotation from the ECEF to the ENU coordinate system which transforms the upper-left 3 x 3 submatrix $Q_{XYZ}$ of $Q_{\hat{x}}$ into
$Q_{ENU}$, we can define

$$Q_{DOP,ENU} = Q_{ENU}/(\hat{\sigma}_0^2 \sigma_{prior}^2) = \begin{bmatrix} q_E^2 & q_{EN} & q_{EU} \\ q_{EN} & q_N^2 & q_{NU} \\ q_{EU} & q_{NU} & q_U^2 \end{bmatrix},$$  (298)

the horizontal DOP

$HDOP = \sqrt{q_E^2 + q_N^2}$  (299)

and the vertical DOP

$VDOP = \sqrt{q_U^2} = q_U.$  (300)

We see that $PDOP^2 = HDOP^2 + VDOP^2$ which is the trace of $Q_{DOP,ENU}$. [end of example]

Matlab code for Example 11


Code for the functions eigsort and ellipsoidrot is listed under Example 10.

% (C) Copyright 2004


% Allan Aasbjerg Nielsen
% aa@imm.dtu.dk, www.imm.dtu.dk/aa

format short g

% use analytical partial derivatives


partial = 'analytical';
%partial = 'n';
% speed of light, [m/s]
%clight = 300000000;
clight = 299792458;
% length of C/A code, [m]
%L = 300000;

% true position (Landmaalervej, Hjortekaer)


xtrue = [3507884.948 780492.718 5251780.403 0]';
% positions of satellites 1, 4, 7, 13, 20, 24 and 25 in ECEF coordinate system, [m]
xxyyzz = [16577402.072 5640460.750 20151933.185;
11793840.229 -10611621.371 21372809.480;
20141014.004 -17040472.264 2512131.115;
22622494.101 -4288365.463 13137555.567;
12867750.433 15820032.908 16952442.746;
-3189257.131 -17447568.373 20051400.790;

Figure 9: 95% ellipsoid for [X Y Z]T in ECEF (left) and ENU (middle) coordinate systems with projection on EN-plane (right).

-7437756.358 13957664.984 21692377.935];


pseudorange = [20432524.0 21434024.4 24556171.0 21315100.2 21255217.0 ...
24441547.2 23768678.3]'; % [m]
l = pseudorange; % l is \ell (not one)
xx = xxyyzz(:,1);
yy = xxyyzz(:,2);
zz = xxyyzz(:,3);
n = size(xx,1); % number of observations

% weight matrix
sprior2 = 10^2; %5^2; %prior variance [m^2]
P = eye(n)/sprior2; % weight = 1/"prior variance" [m(-2)]

% preliminary position, [m]


x = [0 0 0 0]';
p = size(x,1); % number of elements/parameters
f = n-p; % number of degrees of freedom
x0 = x;

for iter = 1:20 % iter --------------------------------------------------------------

range = sqrt((x(1)-xx).^2+(x(2)-yy).^2+(x(3)-zz).^2);
prange = range+x(4);
F = prange;

A = [];
if strcmp(partial,'analytical')
% A is matrix of analytical partial derivatives
irange = 1./range;
dF = irange.*(x(1)-xx);
A = [A dF];
dF = irange.*(x(2)-yy);
A = [A dF];
dF = irange.*(x(3)-zz);
A = [A dF];
dF = ones(n,1);
A = [A dF];
else
% A is matrix of numerical partial derivatives
dF = sqrt((x(1)+1-xx).^2+(x(2)  -yy).^2+(x(3)  -zz).^2)+ x(4)   -prange;
A = [A dF];
dF = sqrt((x(1)  -xx).^2+(x(2)+1-yy).^2+(x(3)  -zz).^2)+ x(4)   -prange;
A = [A dF];
dF = sqrt((x(1)  -xx).^2+(x(2)  -yy).^2+(x(3)+1-zz).^2)+ x(4)   -prange;
A = [A dF];
dF = sqrt((x(1)  -xx).^2+(x(2)  -yy).^2+(x(3)  -zz).^2)+(x(4)+1)-prange;
A = [A dF];
end

k = l-F; % l is \ell (not one)

%k = -l+F;
N = A'*P;
c = N*k;
N = N*A;
deltahat = N\c;
% OLS solution
%deltahat = A\k;
% WLS-as-OLS solution
%sqrtP = sqrt(P);
%deltahat = (sqrtP*A)\(sqrtP*k)

khat = A*deltahat;
vhat = k-khat;

% prepare for iterations

x = x+deltahat;

% stop iterations
if max(abs(deltahat))<0.001
break
end
%itertst = (k'*P*k)/(vhat'*P*vhat);
%if itertst < 1.000001
% break
%end

end % iter --------------------------------------------------------------

% DOP
SSE = vhat'*P*vhat; %RSS or SSE
s02 = SSE/f; % MSE
s0 = sqrt(s02); %RMSE
Qdop = inv(A'*P*A);
Qx = s02.*Qdop;
Qdop = Qdop/sprior2;
PDOP = sqrt(trace(Qdop(1:3,1:3)));
% must be in local Easting-Northing-Up coordinates
%HDOP = sqrt(trace(Qdop(1:2,1:2)));
% must be in local Easting-Northing-Up coordinates
%VDOP = sqrt(Qdop(3,3));
TDOP = sqrt(Qdop(4,4));
GDOP = sqrt(trace(Qdop));

% Dispersion etc of elements

%Qx = s02.*inv(A'*P*A);
sigmas = sqrt(diag(Qx));
sigma = diag(sigmas);
isigma = inv(sigma);
% correlations between estimates
Rx = isigma*Qx*isigma;

% Standardised residuals
%Qv = s02.*(inv(P)-A*inv(A'*P*A)*A');
Qv = s02.*inv(P)-A*Qx*A';
sigmares = sqrt(diag(Qv));
stdres = vhat./sigmares;

disp('----------------------------------------------------------')
disp('estimated parameters/elements [m]')
x
disp('estimated clock error [s]')
x(4)/clight
disp('number of iterations')
iter
disp('standard errors of elements [m]')
sigmas
%tval = x./sigmas
disp('s0')
s0
disp('PDOP')
PDOP
%stdres
disp('difference between estimated elements and initial guess')
deltaori = x-x0
disp('difference between true values and estimated elements')
deltaori = xtrue-x
disp('----------------------------------------------------------')

% t-values and probabilities of finding larger |t|


% pt should be smaller than, say, (5% or) 1%
t = x./sigmas;
pt = betainc(f./(f+t.^2),0.5*f,0.5);

% probability of finding larger s02

% should be greater than, say, 5% (or 1%)
pchi2 = 1-gammainc(0.5*SSE,0.5*f);

% semi-axes in confidence ellipsoid for position estimates


% 95% fractile for 3 dfs is 7.815 = 2.796^2
% 99% fractile for 3 dfs is 11.342 = 3.368^2
[vQx dQx] = eigsort(Qx(1:3,1:3));
semiaxes = sqrt(diag(dQx));
% 95% fractile for 2 dfs is 5.991 = 2.448^2
% 99% fractile for 2 dfs is 9.210 = 3.035^2

% df F(3,df).95 F(3,df).99
% 1 215.71 5403.1
% 2 19.164 99.166
% 3 9.277 29.456
% 4 6.591 16.694
% 5 5.409 12.060
% 10 3.708 6.552
% 100 2.696 3.984
% inf 2.605 3.781
% chi2 approximation, 95% fractile
figure
ellipsoidrot(0,0,0,semiaxes(1)*sqrt(7.815),semiaxes(2)*sqrt(7.815),...
semiaxes(3)*sqrt(7.815),vQx);
axis equal
xlabel('X [m]'); ylabel('Y [m]'); zlabel('Z [m]');
title('95% confidence ellipsoid, ECEF, \chi^2 approx.')
% F approximation, 95% fractile. NB the fractile depends on df
figure
ellipsoidrot(0,0,0,semiaxes(1)*sqrt(3*9.277),semiaxes(2)*sqrt(3*9.277),...
semiaxes(3)*sqrt(3*9.277),vQx);
axis equal
xlabel('X [m]'); ylabel('Y [m]'); zlabel('Z [m]');
title('95% confidence ellipsoid, ECEF, F approx.')
print -depsc2 confXYZ.eps
%% F approximation; number of obs goes to infinity
%figure
%ellipsoidrot(0,0,0,semiaxes(1)*sqrt(3*2.605),semiaxes(2)*sqrt(3*2.605),...
% semiaxes(3)*sqrt(3*2.605),vQx);
%axis equal
%xlabel('X [m]'); ylabel('Y [m]'); zlabel('Z [m]');
%title('95% confidence ellipsoid, ECEF, F approx., nobs -> inf')

% To geographical coordinates, from Strang & Borre (1997)


[bb,ll,hh,phi,lambda] = c2gwgs84(x(1),x(2),x(3))

% Convert Qx (ECEF) to Qenu (ENU)


sp = sin(phi);
cp = cos(phi);
sl = sin(lambda);
cl = cos(lambda);
Ft = [-sl cl 0; -sp*cl -sp*sl cp; cp*cl cp*sl sp]; % ECEF -> ENU
Qenu = Ft*Qx(1:3,1:3)*Ft';
% std.err. of ENU
sigmasenu = sqrt(diag(Qenu));
[vQenu dQenu] = eigsort(Qenu(1:3,1:3));
semiaxes = sqrt(diag(dQenu));

% F approximation, 95% fractile. NB the fractile depends on df


figure
ellipsoidrot(0,0,0,semiaxes(1)*sqrt(3*9.277),semiaxes(2)*sqrt(3*9.277),...
semiaxes(3)*sqrt(3*9.277),vQenu);
axis equal
xlabel('E [m]'); ylabel('N [m]'); zlabel('U [m]');
title('95% confidence ellipsoid, ENU, F approx.')
print -depsc2 confENU.eps
% Same thing, only more elegant
figure
ellipsoidrot(0,0,0,semiaxes(1)*sqrt(3*9.277),semiaxes(2)*sqrt(3*9.277),...
semiaxes(3)*sqrt(3*9.277),Ft*vQx);
axis equal
xlabel('E [m]'); ylabel('N [m]'); zlabel('U [m]');
title('95% confidence ellipsoid, ENU, F approx.')

%PDOP = sqrt(trace(Qenu)/sprior2)/s0;
HDOP = sqrt(trace(Qenu(1:2,1:2))/sprior2)/s0;
VDOP = sqrt(Qenu(3,3)/sprior2)/s0;

% Studentized/jackknifed residuals
if f>1, studres = stdres./sqrt((f-stdres.^2)/(f-1)); end

function [bb,ll,h,phi,lambda] = c2gwgs84(x,y,z)

% C2GWGS84
% Conversion of cartesian coordinates (X,Y,Z) to geographical
% coordinates (phi,lambda,h) on the WGS 1984 reference ellipsoid
%
% phi and lambda are output as vectors: [degrees minutes seconds]

% Modified by Allan Aasbjerg Nielsen (2004) after


% Kai Borre 02-19-94
% Copyright (c) by Kai Borre
% $Revision: 1.0 $ $Date: 1997/10/15 %

a = 6378137;
f = 1/298.257223563;

lambda = atan2(y,x);
ex2 = (2-f)*f/((1-f)^2);
c = a*sqrt(1+ex2);
phi = atan(z/((sqrt(x^2+y^2)*(1-(2-f)*f))));

h = 0.1; oldh = 0;
while abs(h-oldh) > 1.e-12
oldh = h;
N = c/sqrt(1+ex2*cos(phi)^2);
phi = atan(z/((sqrt(x^2+y^2)*(1-(2-f)*f*N/(N+h)))));
h = sqrt(x^2+y^2)/cos(phi)-N;
end
end

phi1 = phi*180/pi;
b = zeros(1,3);
b(1) = fix(phi1);
b(2) = fix(rem(phi1,b(1))*60);
b(3) = (phi1-b(1)-b(2)/60)*3600;
bb = [b(1) b(2) b(3)];
lambda1 = lambda*180/pi;
l = zeros(1,3);
l(1) = fix(lambda1);
l(2) = fix(rem(lambda1,l(1))*60);
l(3) = (lambda1-l(1)-l(2)/60)*3600;
ll = [l(1) l(2) l(3)];

2.2 Nonlinear WLS by other Methods

The remaining sections describe a few other methods often used for solving the nonlinear (weighted) least squares regression
problem.

2.2.1 The Gradient or Steepest Descent Method

Let us go back to Equation 181: $y_i = f_i(\theta) + e_i$, $i = 1, \ldots, n$ and consider the nonlinear WLS case

$e^2(\theta) = e^T P e/2 = \frac{1}{2}\sum_{i=1}^{n} p_i [y_i - f_i(\theta)]^2.$  (301)

The components of the gradient $\nabla e^2$ are

$\frac{\partial e^2}{\partial \theta_k} = -\sum_{i=1}^{n} p_i [y_i - f_i(\theta)] \frac{\partial f_i(\theta)}{\partial \theta_k}.$  (302)

From an initial value (an educated guess) we can update $\theta$ by taking a step in the direction in which $e^2$ decreases most rapidly,
namely in the direction of the negative gradient

$\theta_{new} = \theta_{old} - \alpha\, \nabla e^2(\theta_{old})$  (303)

where $\alpha > 0$ determines the step size. This is done iteratively until convergence.
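
A minimal Matlab sketch of Equations 302-303 (f is assumed to return the model values $f_i(\theta)$ for all n observations and gradf the n x p matrix of partial derivatives; the step size alpha and the stopping rule are also assumptions):

alpha = 1e-3;                        % step size, problem dependent
theta = theta0;                      % initial value (an educated guess)
for iter = 1:1000
    r    = y - f(theta);             % residuals y_i - f_i(theta)
    grad = -gradf(theta)'*(P*r);     % gradient of e^2, Equation 302
    theta = theta - alpha*grad;      % steepest descent step, Equation 303
    if norm(alpha*grad) < 1e-8, break; end
end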

2.2.2 Newtons Method

Let us now expand $e^2(\theta)$ to second order around an initial value $\theta_0$

$e^2(\theta) \simeq e^2(\theta_0) + [\nabla e^2(\theta_0)]^T (\theta - \theta_0) + \frac{1}{2} [\theta - \theta_0]^T H(\theta_0) [\theta - \theta_0]$  (304)

where $H = \partial^2 e^2(\theta)/(\partial\theta\, \partial\theta^T)$ is the second order derivative of $e^2(\theta)$ also known as the Hessian matrix not to be confused
with the hat matrix in Equations 44, 117 and 209. The gradient of the above expansion is

$\nabla e^2(\theta) \simeq \nabla e^2(\theta_0) + H(\theta_0)[\theta - \theta_0].$  (305)

At the minimum $\nabla e^2(\theta) = 0$ and therefore we can find that minimum by updating

$\theta_{new} = \theta_{old} - H_{\theta_{old}}^{-1} \nabla e^2(\theta_{old})$  (306)

until convergence.

From Equation 302 the elements of the Hessian H are

$H_{kl} = \frac{\partial^2 e^2}{\partial\theta_k \partial\theta_l} = \sum_{i=1}^{n} p_i \left[ \frac{\partial f_i(\theta)}{\partial\theta_k} \frac{\partial f_i(\theta)}{\partial\theta_l} - [y_i - f_i(\theta)] \frac{\partial^2 f_i(\theta)}{\partial\theta_k \partial\theta_l} \right].$  (307)

We see that the Hessian is symmetric. The second term in $H_{kl}$ depends on the sum of the residuals between model and data,
which is supposedly small both since our model is assumed to be good and since its terms can have opposite signs. It is therefore
customary to omit this term. If the Hessian is positive definite we have a local minimizer and if it's negative definite we have a local
maximizer (if it's indefinite, i.e., it has both positive and negative eigenvalues, we have a saddle point). H is sometimes termed the
curvature matrix.

2.2.3 The Gauss-Newton Method

The basis of the Gauss-Newton method is a linear Taylor expansion of e

$e(\theta) \simeq e(\theta_0) + J(\theta_0)[\theta - \theta_0]$  (308)

where J is the so-called Jacobian matrix containing the partial derivatives of e (like A containing the partial derivatives of F in
Equation 192). In the WLS case this leads to

$\frac{1}{2} e^T(\theta) P e(\theta) \simeq \frac{1}{2} e^T(\theta_0) P e(\theta_0) + [\theta - \theta_0]^T J^T(\theta_0) P e(\theta_0) + \frac{1}{2} [\theta - \theta_0]^T J^T(\theta_0) P J(\theta_0) [\theta - \theta_0].$  (309)

The gradient of this expression is $J^T(\theta_0) P e(\theta_0) + J^T(\theta_0) P J(\theta_0)[\theta - \theta_0]$ and its Hessian is $J^T(\theta_0) P J(\theta_0)$. The gradient
evaluated at $\theta_0$ is $J^T(\theta_0) P e(\theta_0)$. We see that the Hessian is independent of $\theta_0$, it is symmetric and it is positive definite if
$J(\theta_0)$ is full rank corresponding to linearly independent columns. Since the Hessian is positive definite we have a minimizer and
since $\nabla e^2(\theta_{old}) = J^T(\theta_{old}) P e(\theta_{old})$ we get from Equation 306

$\theta_{new} = \theta_{old} - [J^T(\theta_{old}) P J(\theta_{old})]^{-1} J^T(\theta_{old}) P e(\theta_{old}).$  (310)

This corresponds to the normal equations for $\theta_{new} - \theta_{old}$

$[J^T(\theta_{old}) P J(\theta_{old})](\theta_{new} - \theta_{old}) = -J^T(\theta_{old}) P e(\theta_{old}).$  (311)

This is equivalent to Equation 194 so the linearization method described in Section 2.1 is actually the Gauss-Newton method with
A as the Jacobian.

It can be shown that if the function to be minimized is twice continuously differentiable in a neighbourhood around the solution
$\theta^*$, if $J(\theta)$ over iterations is nonsingular, and if the initial solution $\theta_0$ is close enough to $\theta^*$, then the Gauss-Newton method
converges. It can also be shown that the convergence is quadratic, i.e., the length of the increment vector ($\hat{\delta}$ in Section 2.1 and h
in the Matlab function below) decreases quadratically over iterations.

Below is an example of a Matlab function implementation of the unweighted version of the Gauss-Newton algorithm. Note, that
the code to solve the positioning problem is a while loop the body of which is three (or four) statements only. This is easily
extended with the weighting and the statistics part given in the previous example (do this as an exercise). Note also, that in the call
to function gaussnewton we need to call the function fJ with the at symbol (@) to create a Matlab function handle.

Matlab code for Example 11

function x = gaussnewton(fJ, x0, tol, itermax)

% x = gaussnewton(@fJ, x0, tol, itermax)


%
% gaussnewton solves a system of nonlinear equations
% by the Gauss-Newton method.
%
% fJ - gives f(x) and the Jacobian J(x) by [f, J] = fJ(x)
% x0 - initial solution
% tol - tolerance, iterate until maximum absolute value of correction
% is smaller than tol
% itermax - maximum number of iterations
%
% x - final solution
%
% fJ is written for the occasion

% (c) Copyright 2005


% Allan Aasbjerg Nielsen
% aa@imm.dtu.dk, www.imm.dtu.dk/aa
% 8 Nov 2005

% Modified after
% L. Elden, L. Wittmeyer-Koch and H.B. Nielsen (2004).

if nargin < 2, error('too few input arguments'); end


if nargin < 3, tol = 1e-2; end
if nargin < 4, itermax = 100; end

iter = 0;
x = x0;
h = realmax*ones(size(x0));

while (max(abs(h)) > tol) && (iter < itermax)


[f, J] = feval(fJ, x);
h = J\f;
x = x - h;
iter = iter + 1;
end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

function [f, J] = fJ(x)



xxyyzz = [16577402.072 5640460.750 20151933.185;


11793840.229 -10611621.371 21372809.480;
20141014.004 -17040472.264 2512131.115;
22622494.101 -4288365.463 13137555.567;
12867750.433 15820032.908 16952442.746;
-3189257.131 -17447568.373 20051400.790;
-7437756.358 13957664.984 21692377.935]; % [m]
l = [20432524.0 21434024.4 24556171.0 21315100.2 21255217.0 24441547.2 ...
23768678.3]'; % [m]

xx = xxyyzz(:,1);
yy = xxyyzz(:,2);
zz = xxyyzz(:,3);

range = sqrt((x(1)-xx).^2 + (x(2)-yy).^2 + (x(3)-zz).^2);


prange = range + x(4);
f = l - prange;

if nargout < 2, return; end

n = size(f,1); % # obs
p = 4; % # parameters
J = zeros(n,p);

% analytical derivatives
J(:,1) = -(x(1)-xx)./range;
J(:,2) = -(x(2)-yy)./range;
J(:,3) = -(x(3)-zz)./range;
J(:,4) = -ones(n,1);

return

% numerical derivatives
delta = 1;
for i = 1:p
y = x;
y(i) = x(i) + delta;
J(:,i) = (fJ(y) - f); %./delta;
end

return

% or symmetrized
delta = 0.5;
for i = 1:p
y = x;
z = x;
y(i) = x(i) + delta;
z(i) = x(i) - delta;
J(:,i) = (fJ(y) - fJ(z)); %./(2*delta);
end

2.2.4 The Levenberg-Marquardt Method

The Gauss-Newton method may cause the $\theta_{new}$ to wander off further from the minimum than the $\theta_{old}$ because of nonlinear com-
ponents in e which are not modelled. Near the minimum the Gauss-Newton method converges very rapidly whereas the gradient
method is slow because the gradient vanishes at the minimum. In the Levenberg-Marquardt method we modify Equation 311 to

$[J^T(\theta_{old}) P J(\theta_{old}) + \lambda I](\theta_{new} - \theta_{old}) = -J^T(\theta_{old}) P e(\theta_{old})$  (312)

where $\lambda \geq 0$ is termed the damping factor. The Levenberg-Marquardt method is a hybrid of the gradient method far from the
minimum and the Gauss-Newton method near the minimum: if $\lambda$ is large we step in the direction of the steepest descent, if $\lambda = 0$
we have the Gauss-Newton method.
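
A minimal Matlab sketch of a Levenberg-Marquardt iteration based on Equation 312 (fJ returns the residual vector e and its Jacobian J as in the fJ function listed above; the weight matrix P, the number of parameters p, the starting value of lambda, the factor 10 and the acceptance rule are assumptions):

lambda = 1e-3;
[e, J] = fJ(theta);
for iter = 1:100
    h = -((J'*P*J + lambda*eye(p))\(J'*P*e));   % Equation 312
    [enew, Jnew] = fJ(theta + h);
    if enew'*P*enew < e'*P*e                    % the step decreased the objective
        theta = theta + h; e = enew; J = Jnew;
        lambda = lambda/10;                     % move towards Gauss-Newton
    else
        lambda = lambda*10;                     % move towards steepest descent
    end
    if norm(h) < 1e-8, break; end
end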

Also Newton's method may cause the $\theta_{new}$ to wander off further from the minimum than the $\theta_{old}$ since the Hessian may be
indefinite or even negative definite (this is not the case for $J^T P J$). In a Levenberg-Marquardt-like extension to Newton's method
we could modify Equation 306 to

$\theta_{new} = \theta_{old} - (H_{\theta_{old}} + \lambda I)^{-1} \nabla e^2(\theta_{old}).$  (313)



3 Final Comments

In geodesy (and land surveying and GNSS) applications of regression analysis we are often interested in
the estimates of the regression coefficients also known as the parameters or the elements which are often
2- or 3-D geographical positions, and their estimation accuracies. In many other application areas we are
(also) interested in the ability of the model to predict values of the response variable from new values of the
explanatory variables not used to build the model.

Unlike the Gauss-Newton method both the gradient method and Newton's method are general and not re-
stricted to least squares problems, i.e., the functions to be optimized are not restricted to the form $e^T e$ or
$e^T P e$. Many other methods than the ones described and sketched here, both general and least squares meth-
ods such as quasi-Newton methods, conjugate gradients and simplex search methods, exist.

Solving the problem of finding a global optimum in general is very difficult. The methods described and
sketched here (and many others) find a minimum that depends on the set of initial values chosen for the
parameters to estimate. This minimum may be local. It is often wise to use several sets of initial values to
check the robustness of the solution offered by the method chosen.

Literature

P.R. Bevington (1969). Data Reduction and Error Analysis for the Physical Sciences. McGraw-Hill.

K. Borre (1990). Landmåling. Institut for Samfundsudvikling og Planlægning, Aalborg. In Danish.

K. Borre (1992). Mindste Kvadraters Princip Anvendt i Landmålingen. Aalborg. In Danish.

M. Canty, A.A. Nielsen and M. Schmidt (2004). Automatic radiometric normalization of multitemporal
satellite imagery. Remote Sensing of Environment 41(1), 4-19.

P. Cederholm (2000). Udjævning. Aalborg Universitet. In Danish.

R.D. Cook and S. Weisberg (1982). Residuals and Influence in Regression. Chapman & Hall.

K. Conradsen (1984). En Introduktion til Statistik, vol. 1A-2B. Informatik og Matematisk Modellering, Dan-
marks Tekniske Universitet. In Danish.

K. Dueholm, M. Laurentzius and A.B.O. Jensen (2005). GPS. 3rd Edition. Nyt Teknisk Forlag. In Danish.

L. Eldén, L. Wittmeyer-Koch and H.B. Nielsen (2004). Introduction to Numerical Computation - analysis
and MATLAB illustrations. Studentlitteratur.

N. Gershenfeld (1999). The Nature of Mathematical Modeling. Cambridge University Press.

G.H. Golub and C.F. van Loan (1996). Matrix Computations, Third Edition. Johns Hopkins University Press.

P.S. Hansen, M.P. Bendsøe and H.B. Nielsen (1987). Lineær Algebra - Datamatorienteret. Informatik og
Matematisk Modellering, Matematisk Institut, Danmarks Tekniske Universitet. In Danish.

T. Hastie, R. Tibshirani and J. Friedman (2009). The Elements of Statistical Learning: Data Mining, Infer-
ence, and Prediction. Second Edition. Springer.

O. Jacobi (1977). Landmåling 2. del. Hovedpunktsnet. Den private Ingeniørfond, Danmarks Tekniske Uni-
versitet. In Danish.

A.B.O. Jensen (2002). Numerical Weather Predictions for Network RTK. Publication Series 4, volume 10.
National Survey and Cadastre, Denmark.

N. Kousgaard (1986). Anvendt Regressionsanalyse for Samfundsvidenskaberne. Akademisk Forlag. In Dan-
ish.

K. Madsen, H.B. Nielsen and O. Tingleff (1999). Methods for Non-Linear Least Squares Problems. Infor-
matics and Mathematical Modelling, Technical University of Denmark.

P. McCullagh and J. Nelder (1989). Generalized Linear Models. Chapman & Hall. London, U.K.

E.M. Mikhail, J.S. Bethel and J.C. McGlone (2001). Introduction to Modern Photogrammetry. John Wiley
and Sons.

E. Mærsk-Møller and P. Frederiksen (1984). Landmåling: Elementudjævning. Den private Ingeniørfond,
Danmarks Tekniske Universitet. In Danish.

A.A. Nielsen (2001). Spectral mixture analysis: linear and semi-parametric full and partial unmixing in
multi- and hyperspectral image data. International Journal of Computer Vision 42(1-2), 17-37 and Journal
of Mathematical Imaging and Vision 15(1-2), 17-37.

W.H. Press, S.A. Teukolsky, W.T. Vetterling and B.P. Flannery (1992). Numerical Recipes in C: The Art of
Scientific Computing. Second Edition. Cambridge University Press.

J.A. Rice (1995). Mathematical Statistics and Data Analysis. Second Edition. Duxbury Press.

G. Strang (1980). Linear Algebra and its Applications. Second Edition. Academic Press.

G. Strang and K. Borre (1997). Linear Algebra, Geodesy, and GPS. Wellesley-Cambridge Press.

P. Thyregod (1998). En Introduktion til Statistik, vol. 3A-3D. Informatik og Matematisk Modellering, Dan-
marks Tekniske Universitet. In Danish.

W.N. Venables and B.D. Ripley (1999). Modern Applied Statistics with S-PLUS. Third Edition. Springer.
Index

F distribution, 30
R2, 12
chi2 distribution, 21, 24, 29
sigma0, 17, 18
s0, 21, 25, 29
t distribution, 12, 13, 23
t-test, two-sided, 12, 23
adjustment, 7
chain rule, 30, 31
Cholesky, 8, 16, 17, 24, 28
coefficient of determination, 12
confidence ellipsoid, 29, 34, 42
Cook's distance, 14
coordinate system: ECEF, 42; ENU, 42
damping factor, 50
decomposition: Cholesky, 16; QR, 15; singular value, 14
degrees of freedom, 7
derivative matrix, 27, 30
dilution of precision, 42
dispersion matrix, 11, 21, 28
distribution: F, 30; chi2, 21, 24, 29; t, 12, 13, 23; normal, 12, 22, 23
DOP, 42
ECEF coordinate system, 42
ENU coordinate system, 42
error: ellipsoid, 29; gross, 13; or residual, 6, 7
estimator: central, 9, 18; unbiased, 9, 18
fundamental equations, 9, 18, 27
Gauss-Newton method, 28, 34, 48, 49, 51
Global Navigation Satellite System, 30, 32
Global Positioning System, 40
GNSS, 30, 32
GPS, 40
gradient method, 48
hat matrix, 9, 18
Hessian, 48-50
idempotent, 9, 18
iid, 12
influence, 11, 14, 29
initial value, 26, 30, 48, 51
iterations, 28
iterative solution, 28
Jacobian, 28, 48, 49
least squares: general (GLS), 25; nonlinear (NLS), 26; ordinary (OLS), 7; weighted (WLS), 17; WLS as OLS, 25
levelling, 19
Levenberg-Marquardt method, 50
leverage, 11, 14, 21, 22, 29, 34, 42
linear constraints, 8
linearization, 26, 27
Matlab command mldivide, 13
minimum: global, 51; local, 51
MSE, 12, 21, 25, 29
multicollinearity, 8, 17, 28
multiple regression, 6
Navstar, 40
Newton's method, 48, 51
normal distribution, 12, 22, 23
normal equations, 8, 17, 25
objective function, 7, 17
observation equations, 7, 27
optimum: global, 51; local, 51
orientation unknown, 31, 32
outlier, 13, 14
partial derivatives: analytical, 30; numerical, 31
precision, dilution of, 42
pseudorange, 40
QR, 8, 13, 15-17, 28
regression, 7: multiple, 6; ridge, 10; simple, 4
regressors, 6
regularization, 9, 26
residual: jackknifed, 14; or error, 6; standardized, 13, 22; studentized, 14
ridge regression, 10
RMSE, 12, 21, 25, 29
RSS, 12, 21, 29
significance, 12, 23, 29
simple regression, 4
space vehicle, 32, 40
SSE, 12, 21, 29
standard deviation of unit weight, 17
steepest descent method, 48
SVD, 8, 14, 16, 17, 28
Taylor expansion, 26, 30, 48
uncertainty, 7
variable: dependent, 6; explanatory, 6; independent, 6; predictor, 6; response, 6
variance-covariance matrix, 11, 21, 28
weights, 18
