
Geophysical Inverse Theory

Notes by German A. Prieto


Universidad de los Andes
March 11, 2011
© 2009

Contents

1 Introduction to inverse theory
  1.1 Why is the inverse problem more difficult?
      1.1.1 Example: Non-uniqueness
  1.2 So, what can we do?
      1.2.1 Example: Instability
      1.2.2 Example: Null space
  1.3 Some terms

2 Review of Linear Algebra
  2.1 Matrix operations
      2.1.1 The condition number
      2.1.2 Matrix inverses
  2.2 Solving systems of equations
      2.2.1 Some notes on Gaussian elimination
      2.2.2 Some examples
  2.3 Linear vector spaces
  2.4 Functionals
      2.4.1 Linear functionals
  2.5 Norms
      2.5.1 Norms and the inverse problem
      2.5.2 Matrix norms and the condition number

3 Least Squares & Normal Equations
  3.1 Linear regression
  3.2 The simple least squares problem
      3.2.1 General LS solution
      3.2.2 Geometrical interpretation of the normal equations
      3.2.3 Maximum likelihood
  3.3 Why LS and the effect of the norm
  3.4 The L2 problem from 3 perspectives
  3.5 Full example: Line fit

4 Tikhonov Regularization
  4.1 Tikhonov regularization
  4.2 SVD implementation
  4.3 Resolution vs variance, the choice of α or p
      4.3.1 Example 1: Shaw's problem
  4.4 Smoothing norms or higher-order Tikhonov
      4.4.1 The discrete case
  4.5 Fitting within tolerance
      4.5.1 Example 2

Chapter 1

Introduction to inverse theory
In geophysics we are often faced with the following situation: We have measurements at the surface of the Earth of some quantity (magnetic field, seismic
waveforms) and we want to know some property of the ground under the place
where we made the measurements. Inverse theory is a method to infer the
unknown physical properties (model) from these measurements (data).
This class is called Geophysical Inverse Theory (GIT) because it is assumed
we understand the physics of the system. That is, if we knew the properties
accurately, we would be able to reconstruct the observations that we have taken.
First, we need to be able to solve the forward problem
d_i = G_i(m)    (1.1)

where from a known field m(x, t, . . .) we can predict the observations d_i. We assume there is a finite number M of observations, so that d_i is an M-dimensional data vector.
G is the theory that predicts the data from the model m. This theory is based
on physics. Mathematically, G(m) is a functional, a rule that unambiguously
assigns a single real number to an element of a vector space.
As its name suggests, the inverse problem reverses the process of predicting
the values of the measurements. It tries to invert the operator G to get an
estimate of the model
m = F(d)    (1.2)
Some examples of properties inside the Earth (model) and the surface observations used to make inferences about them are shown in Table 1.1.
The inverse problem is usually more difficult than the forward problem. To
start, we assume that the physics are completely under control, before even
thinking about the inverse problem. There are plenty of geophysical systems
where the forward problem is still incompletely understood, such as the geodynamo problem or earthquake fault dynamics.


Table 1.1: Example properties and measurements for inverse problems

Model                   Data
Topography              Altitude/Bathymetry measurements
Magnetic field at CMB   Magnetic field at the surface
Mass distribution       Gravity measurements
Fault slip              Waveforms / Geodetic motion
Seismic velocity        Arrival times / Waveforms

1.1 Why is the inverse problem more difficult?

A simple reason is that we have a finite number of measurements (and of limited


precision). The unknown property we are after is a function of position or time
and requires in principle infinitely many parameters to describe it. This leads
to the problem that in many cases the inverse problem is non-unique. Non-uniqueness means that more than one solution can reproduce the data at hand. A finite set of data d_i, where i = 1, . . . , M, does not allow us to estimate a function that would take an infinite number of coefficients to describe.

1.1.1 Example: Non-uniqueness

Imagine we want to describe the Earth's velocity structure. The forward problem could be described as follows:

\alpha(\theta, \phi, r) = \sum_{l=0}^{\infty} \sum_{m=-l}^{l} \sum_{n=0}^{\infty} Y_{lm}(\theta, \phi) \, Z_n(r) \, a_{lmn}    (1.3)

where \alpha is the P-wave velocity as measured at position (\theta, \phi, r), Z_n(r) are the basis functions that control radial dependence, Y_{lm} are the basis functions that describe angular dependence (lat, lon) and a_{lmn} are the unknown model coefficients.
Note that even if we had 1000s of exact measurements of velocity \alpha_i(\theta_i, \phi_i, r_i), the discretized forward problem is

\alpha_i(\theta_i, \phi_i, r_i) = \sum_{l=0}^{\infty} \sum_{m=-l}^{l} \sum_{n=0}^{\infty} Y_{lm}(\theta_i, \phi_i) \, Z_n(r_i) \, a_{lmn}    (1.4)

where i = 1, . . . , M. We have an infinite number of parameters a_{lmn} to determine, leading to the non-uniqueness problem.
A commonly used strategy is to drastically oversimplify the model,

\alpha_i(\theta_i, \phi_i, r_i) = \sum_{l=0}^{6} \sum_{m=-l}^{l} \sum_{n=0}^{6} Y_{lm}(\theta_i, \phi_i) \, Z_n(r_i) \, a_{lmn}    (1.5)

or to use a 1D velocity assumption with radial dependence only,

\alpha_i(\theta_i, \phi_i, r_i) = \sum_{n=0}^{20} Z_n(r_i) \, a_n    (1.6)

In these cases the number of data points is larger than the number of model parameters, M > N, so the problem is overdetermined. If the oversimplification (i.e., radial dependence only) is justified by observations this may be a fine approach, but when there is no evidence for this arrangement, even if the data are fit, we will be uncertain of the significance of the result. Another problem is that this may unreasonably limit the solution.

1.2 So, what can we do?

Imagine we could interpolate between measurements to have complete data. In a few cases that would be enough, but in most cases geophysical inverse problems are ill-posed. In this sense they are unstable: an infinitesimal perturbation in the data can result in a finite change in the model. So, how you interpolate may control the features of the predicted model. The forward problem, on the other hand, is unique (remember the term functional), and it is stable too.

1.2.1 Example: Instability

Consider the anti-plane problem for an infinitely long strike-slip fault.

Figure 1.1: Anti-plane slip for an infinitely long strike-slip fault (coordinate axes x_1, x_2, x_3).

The displacement at the Earth's surface u(x_1, x_2, x_3) is in the x_1 direction, due to slip S(\xi) as a function of depth \xi,

u_1(x_2, x_3 = 0) = \frac{1}{\pi} \int S(\xi) \, \frac{x_2}{x_2^2 + \xi^2} \, d\xi    (1.7)


where S(\xi) is the slip along x_1 and varies only with depth x_3. If we had only discrete measurements,

d_i = u_1(x_2^{(i)}) = \int S(\xi) \, g_i(\xi) \, d\xi    (1.8)

where

g_i(\xi) = \frac{1}{\pi} \, \frac{x_2^{(i)}}{(x_2^{(i)})^2 + \xi^2}    (1.9)

Now, let's assume that slip occurs only at some depth c, so that S(\xi) = \delta(\xi - c). Then

d(x_2) = \frac{1}{\pi} \int \delta(\xi - c) \, \frac{x_2}{x_2^2 + \xi^2} \, d\xi = \frac{1}{\pi} \, \frac{x_2}{x_2^2 + c^2}    (1.10)

Figure 1.2: Observations u_1(x_2) at the surface due to concentrated slip at depth c.
The results (Figure 1.2) show that:

1. the effect of the concentrated slip is spread widely;
2. this will lead to trouble (instability) in the inverse problem,

so that even if we did have data at every point on the surface of the Earth, the inverse problem would be unstable. The kernel of the functional, g(\xi), smooths the focused deformation. The problem lies in the physical model, not really in how you solve it.


1.2.2 Example: Null space

We consider data for a vertical gravity anomaly observed at some height h, used to estimate the unknown buried line mass density distribution m(x) = \rho(x). The forward problem is described by

d(s) = \int \frac{h}{\left[(x - s)^2 + h^2\right]^{3/2}} \, m(x) \, dx    (1.11)
     = \int g(x - s) \, m(x) \, dx    (1.12)

Suppose now we can find a smooth function m^+(x) such that the integral in (1.12) vanishes, i.e. d(s) = 0. Because of the symmetry of the kernel g(x - s), if we choose m^+(x) to be a line with a given slope, the observed anomaly d(s) will be zero. The consequence of this is that we can add such a function m^+ to the true anomaly,

m = m_{true} + m^+    (1.13)

and the new gravity anomaly profile will match the data just as well as m_{true}:

d(s) = \int g(x - s) \left[ m_{true}(x) + m^+(x) \right] dx    (1.14)
     = \int g(x - s) \, m_{true}(x) \, dx + \int g(x - s) \, m^+(x) \, dx
     = \int g(x - s) \, m_{true}(x) \, dx + 0    (1.15)

From the field observations, even if error free and infinitely sampled, there is no way to distinguish between the real anomaly and any member of an infinitely large family of alternatives.
Models m^+(x) that lie in the null space of g(x - s) are solutions to

\int g(x - s) \, m(x) \, dx = 0

By superposition, any linear combination of these null-space models can be added to a particular model without changing the fit to the data. These kinds of problems do not have a unique answer even with perfect data.


Table 1.2: Examples of inverse problems

Model        Theory      Determinacy        Examples
Discrete     Linear      Overdetermined     Line fit
Discrete     Linear      Underdetermined    Interpolation
Discrete     Nonlinear   Overdetermined     Earthquake location
Continuous   Linear      Underdetermined    Fault slip
Continuous   Nonlinear   Underdetermined    Tomography

1.3 Some terms

The inverse problem is not just simple linear algebra.

1. For the continuous case, you don't invert a matrix with an infinite number of rows.

2. Even in the discrete case,

   d = Gm

   you could try to simply multiply by the inverse of the matrix,

   G^{-1} d = G^{-1} G m = m

   but this is only possible for square matrices, so it would not work for the under/overdetermined cases.

Overdetermined
- More observations than unknowns, M > N.
- Due to errors, you are never able to fit all data points.
- Getting rid of data is not ideal (why?).
- Find a compromise by fitting all data simultaneously (in a least-squares sense).

Underdetermined
- More unknowns than equations, M < N.
- The data could be fit exactly, but we could vary some components of the model arbitrarily.
- Add additional constraints, such as smoothness or positivity.

Chapter 2

Review of Linear Algebra


A matrix is a rectangular array of real (or complex) numbers arranged in sets of m rows with n entries each. The set of such m by n matrices is called R^{m×n} (or C^{m×n} for complex ones). A vector is simply a matrix consisting of a single column. Notice we will use the notation R^m rather than R^{m×1} or R^{1×m}. Also, be careful, since Matlab does distinguish between a row vector and a column vector.

Notation is important. We will use boldface capital letters (A, B, . . .) for matrices, lowercase bold letters (a, b, . . .) for vectors and lowercase roman and Greek letters (m, n, \alpha, \beta, . . .) to denote scalars.

When referring to specific entries of the array A ∈ R^{m×n} I use the indices a_{ij}, which means the entry on the ith row and the jth column. If we have a vector x, x_j refers to its jth entry.

A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & & & \vdots \\ \vdots & & \ddots & \\ a_{m1} & & \cdots & a_{mn} \end{bmatrix}, \qquad
x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{bmatrix}

We can also think of a matrix as an ordered collection of column vectors,

A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & & & \vdots \\ \vdots & & \ddots & \\ a_{m1} & & \cdots & a_{mn} \end{bmatrix}
  = \begin{bmatrix} a_1 & a_2 & \cdots & a_n \end{bmatrix}
There are a number of special matrices to keep in mind. These are useful since some of them are used to get matrix inverses.

Square matrix:            m = n
Diagonal matrix:          a_{ij} = 0 whenever i ≠ j
Tridiagonal matrix:       a_{ij} = 0 whenever |i − j| > 1
Upper triangular matrix:  a_{ij} = 0 whenever i > j
Lower triangular matrix:  a_{ij} = 0 whenever i < j
Sparse matrix:            most entries zero

Note that the definition of upper and lower triangular matrices may apply to non-square matrices as well as square ones.

A zero matrix is a matrix composed of all zero elements. It plays the same role in matrix algebra as the scalar 0:

A + 0 = A = 0 + A

The unit matrix is the square, diagonal matrix with ones on the diagonal and zeros elsewhere, and is usually denoted I. Assuming the matrix sizes are compatible,

AI = A = IA

2.1 Matrix operations

Having a set of matrices in R^{m×n}, addition

A = B + C    means    a_{ij} = b_{ij} + c_{ij}

and scalar multiplication

A = \alpha B    means    a_{ij} = \alpha \, b_{ij}

where \alpha is a scalar.

Another basic manipulation is transposition,

B = A^T    means    b_{ij} = a_{ji}

More important is matrix multiplication, where

R^{m×n} × R^{n×p} → R^{m×p}

C = AB    means    c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}

Notice we can only multiply two matrices when the number of columns (n) of the first one equals the number of rows of the second. The other dimensions are not important, so non-square matrices can be multiplied.

Other standard arithmetic rules are valid, such as distributivity, A(B + C) = AB + AC. Less obviously, associativity of multiplication holds, A(BC) = (AB)C, as long as the matrix sizes permit. But multiplication is not commutative,

AB ≠ BA


unless some special properties exist.

When one multiplies a matrix into a vector, there are a number of useful ways of interpreting the operation

y = Ax    (2.1)

1. If x and y are in the same space R^m, A is providing a linear mapping or linear transformation of one vector into another.

   Example 1: m = 3, A represents the components of a tensor: x is the angular velocity, A the inertia tensor, y the angular momentum.

   Example 2: Rigid body rotation, used in plate tectonics reconstructions.

2. We can think of A as a collection of column vectors; then

   y = Ax = [a_1, a_2, . . . , a_n] x = x_1 a_1 + x_2 a_2 + \cdots + x_n a_n

   so that the new vector y is simply a linear combination of the column vectors of A, with expansion coefficients given by the elements of x. Note, this is the way we think about matrix multiplication when fitting a model: y contains data values, A contains the predictions of the theory that includes some unknown weights (the model) given by the entries of x,

   d = Gm
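As a quick numerical check of this second interpretation, the short NumPy sketch below (the array values are invented, chosen only for illustration) verifies that A @ x equals the sum of the columns of A weighted by the entries of x.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [4.0, 1.0],
              [-2.0, 2.0]])    # columns a1, a2
x = np.array([3.0, -1.0])      # expansion coefficients

y_matvec = A @ x                                 # matrix-vector product
y_columns = x[0] * A[:, 0] + x[1] * A[:, 1]      # linear combination of the columns

print(np.allclose(y_matvec, y_columns))          # True
```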
There are two ways of multiplying two vectors. For two vectors x ∈ R^p and y ∈ R^q the outer product is

x y^T = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_p \end{bmatrix}
\begin{bmatrix} y_1 & y_2 & \cdots & y_q \end{bmatrix}
= \begin{bmatrix} x_1 y_1 & \cdots & x_1 y_q \\ \vdots & \ddots & \vdots \\ x_p y_1 & \cdots & x_p y_q \end{bmatrix}

and the inner product of two vectors of the same length is

x^T y = \begin{bmatrix} x_1 & x_2 & \cdots & x_p \end{bmatrix}
\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_p \end{bmatrix}
= x_1 y_1 + x_2 y_2 + \cdots + x_p y_p

The inner product is just the vector dot product of vector analysis.


If A is a square matrix and there is a matrix B such that

AB = I

the matrix B is called the inverse of A and is written A^{-1}. Square matrices that possess no inverse are called singular; when the inverse exists, A is called nonsingular.

The inverse of the transpose is the transpose of the inverse,

(A^T)^{-1} = (A^{-1})^T

The inverse is useful for solving linear systems of algebraic equations. Starting with equation (2.1),

y = Ax
A^{-1} y = A^{-1} A x = I x = x

so if we know y and A, and A is square and has an inverse, we can recover the unknown vector x. As you will see later, calculating the inverse and then multiplying it into y is a poor way to solve for x numerically.

A final rule about transposes and inverses:

(AB)^T = B^T A^T
(AB)^{-1} = B^{-1} A^{-1}

2.1.1 The condition number

The key to understanding the accuracy of the solution of

y = Ax

is to look at the condition number of the matrix A,

\kappa(A) = \|A\| \, \|A^{-1}\|

which estimates the factor by which small errors in y or A are magnified in the solution x. This can sometimes be very large (> 10^{10}). It can be shown that the condition number in solving the normal equations (to be studied later) is the square of the condition number using a QR decomposition, which can sometimes lead to catastrophic error build-up.
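A minimal NumPy sketch of this idea (the matrix and perturbation are made up for illustration): we compute the condition number and compare it with the observed error amplification for a small perturbation of y.

```python
import numpy as np

# A nearly singular (ill-conditioned) 2x2 matrix, chosen only for illustration
A = np.array([[1.0, 1.0],
              [1.0, 1.0001]])
x_true = np.array([1.0, 1.0])
y = A @ x_true

kappa = np.linalg.cond(A)            # equals ||A|| * ||A^-1|| in the 2-norm
dy = 1e-6 * np.random.randn(2)       # small perturbation of the data
x_pert = np.linalg.solve(A, y + dy)

rel_data_err = np.linalg.norm(dy) / np.linalg.norm(y)
rel_model_err = np.linalg.norm(x_pert - x_true) / np.linalg.norm(x_true)

print(f"condition number     : {kappa:.2e}")
print(f"relative data error  : {rel_data_err:.2e}")
print(f"relative model error : {rel_model_err:.2e}  (bounded by kappa * data error)")
```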

2.1.2 Matrix inverses

Remember our definition: a matrix A ∈ R^{n×n} is invertible if there exists A^{-1} such that

A^{-1} A = I    and    A A^{-1} = I


Some examples of inverses:

D = \begin{bmatrix} d_1 & 0 & 0 \\ 0 & d_2 & 0 \\ 0 & 0 & d_3 \end{bmatrix}, \qquad
D^{-1} = \begin{bmatrix} 1/d_1 & 0 & 0 \\ 0 & 1/d_2 & 0 \\ 0 & 0 & 1/d_3 \end{bmatrix}

The inverse of a diagonal matrix is a diagonal matrix with the diagonal elements raised to the power −1.

P = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix}, \qquad
P^{-1} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix}

If one exchanges rows 2 and 3 of a diagonal matrix, the resulting permutation matrix P has a simple inverse (here P is its own inverse).

E = \begin{bmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}

In this case the matrix is not diagonal, but we can use Gaussian elimination, which we will go through next.

2.2 Solving systems of equations

Consider a system of equations


(1)   2x +  y + z =  1
(2)   4x +  y     = -2
(3)  -2x + 2y + z =  7

and solve using Gaussian elimination.


The first step is to end up with zeros in first column for all rows (except first
one)
Subtract 2 (1) from (2).

the factor 2 is called pivot

Subtract 1 (1) from (3).

the -1 is called pivot

1
2
3

2x +

y
y
3y

+ z
2z
+ 2z

=
1
= 4
=
8

The next step is


Subtract 3 (2) from (3).
1
2
3

2x +

y
y

z
2z
4z

=
1
= 4
= 4


and now solve each equation from bottom to top by the process called back substitution:

(3)  -4z = -4         ⇒  z = 1
(2)  -y - 2(1) = -4   ⇒  y = 2
(1)  2x + 2 + 1 = 1   ⇒  x = -1

In solving this system of equations we have used elementary row operations, namely adding a multiple of one equation to another, multiplying by a constant, or swapping two equations. This process can be extended to solve systems with an arbitrary number of equations.
Another way to think of Gaussian Elimination is as a matrix factorization
(triangular factorization). Rewrite the system of equations in matrix form

Ax = b    or    A_{ij} x_j = b_i

\begin{bmatrix} 2 & 1 & 1 \\ 4 & 1 & 0 \\ -2 & 2 & 1 \end{bmatrix}
\begin{bmatrix} x \\ y \\ z \end{bmatrix} =
\begin{bmatrix} 1 \\ -2 \\ 7 \end{bmatrix}

We are going to try and get A = LU, where L is a lower triangular matrix and
U is upper triangular with the same Gaussian Elimination steps.
1. Subtract two times the first equation from the second:

\begin{bmatrix} 1 & 0 & 0 \\ -2 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} 2 & 1 & 1 \\ 4 & 1 & 0 \\ -2 & 2 & 1 \end{bmatrix}
\begin{bmatrix} x \\ y \\ z \end{bmatrix} =
\begin{bmatrix} 1 \\ -2 \\ 7 \end{bmatrix}
\quad \Rightarrow \quad
\begin{bmatrix} 2 & 1 & 1 \\ 0 & -1 & -2 \\ -2 & 2 & 1 \end{bmatrix}
\begin{bmatrix} x \\ y \\ z \end{bmatrix} =
\begin{bmatrix} 1 \\ -4 \\ 7 \end{bmatrix}

or, for short, E_1 A x = E_1 b, i.e. A_1 x = b_1.
2. Subtract −1 times the first equation from the third:

\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 1 & 0 & 1 \end{bmatrix}
\begin{bmatrix} 2 & 1 & 1 \\ 0 & -1 & -2 \\ -2 & 2 & 1 \end{bmatrix}
\begin{bmatrix} x \\ y \\ z \end{bmatrix} =
\begin{bmatrix} 1 \\ -4 \\ 7 \end{bmatrix}
\quad \Rightarrow \quad
\begin{bmatrix} 2 & 1 & 1 \\ 0 & -1 & -2 \\ 0 & 3 & 2 \end{bmatrix}
\begin{bmatrix} x \\ y \\ z \end{bmatrix} =
\begin{bmatrix} 1 \\ -4 \\ 8 \end{bmatrix}

or, for short, E_2 A_1 x = E_2 b_1, i.e. A_2 x = b_2.


3. Subtract −3 times the second equation from the third:

\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 3 & 1 \end{bmatrix}
\begin{bmatrix} 2 & 1 & 1 \\ 0 & -1 & -2 \\ 0 & 3 & 2 \end{bmatrix}
\begin{bmatrix} x \\ y \\ z \end{bmatrix} =
\begin{bmatrix} 1 \\ -4 \\ 8 \end{bmatrix}
\quad \Rightarrow \quad
\begin{bmatrix} 2 & 1 & 1 \\ 0 & -1 & -2 \\ 0 & 0 & -4 \end{bmatrix}
\begin{bmatrix} x \\ y \\ z \end{bmatrix} =
\begin{bmatrix} 1 \\ -4 \\ -4 \end{bmatrix}

or, for short, E_3 A_2 x = E_3 b_2, i.e. A_3 x = b_3.

This new matrix will be assigned a new name, so the system now looks like

E_3 E_2 E_1 A x = E_3 E_2 E_1 b
U x = c

and since

U = E_3 E_2 E_1 A
A = E_1^{-1} E_2^{-1} E_3^{-1} U = LU

where our matrix L is

L = \begin{bmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ -1 & -3 & 1 \end{bmatrix}

which is a lower triangular matrix. Notice that the non-diagonal components of L are the elimination multipliers.

This result suggests that if we find A = LU we only need to transform Ax = b into Ux = c (a forward substitution on L) and back-substitute. Easy, right?
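As a sketch of this recipe applied to the 3×3 system above, we can let SciPy compute the LU factorization and then do the two triangular solves; the result reproduces x = (−1, 2, 1).

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

# The 3x3 system solved by hand above
A = np.array([[ 2.0, 1.0, 1.0],
              [ 4.0, 1.0, 0.0],
              [-2.0, 2.0, 1.0]])
b = np.array([1.0, -2.0, 7.0])

lu, piv = lu_factor(A)       # PA = LU factorization (with partial pivoting)
x = lu_solve((lu, piv), b)   # forward substitution on L, back substitution on U
print(x)                     # [-1.  2.  1.]
```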

2.2.1 Some notes on Gaussian elimination

Basic steps:
- Use multiples of the first equation to eliminate the first coefficient of subsequent equations.
- Repeat for the remaining columns (up to n − 1 elimination steps).
- Back-substitute in reverse order.


Problems:
- a zero in the first column,
- linearly dependent equations,
- inconsistent equations.

Efficiency. If we count a division, multiplication or sum as one operation and assume we have a matrix A ∈ R^{n×n}:
- n operations to zero the first coefficient of one row,
- n − 1 rows to do,
- so roughly n² − n operations so far,
- N = (1² + · · · + n²) − (1 + · · · + n) = (n³ − n)/3 operations to do all remaining coefficients.

For large n, N ≈ n³/3. The back-substitution part takes N ≈ n²/2 operations. There are other, more efficient ways to solve systems of equations.

2.2.2 Some examples

We want to solve systems of equations with m equations and n unknowns.

Square matrices

There are three possible outcomes for square matrices, with A ∈ R^{m×m}:

1. det A ≠ 0: x = A^{-1}b. This is the non-singular case where A is an invertible matrix and the solution x is unique.
2. det A = 0, b = 0: 0x = 0 and x could be anything. This is the underdetermined case; the solution x is non-unique.
3. det A = 0, b ≠ 0: 0x = b. This is an inconsistent case for which there is no solution.


Non-square matrices

An example of a system with 3 equations and 4 unknowns (overdetermined or underdetermined?) is

\begin{bmatrix} 1 & 3 & 3 & 2 \\ 2 & 6 & 9 & 5 \\ -1 & -3 & 3 & 0 \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} =
\begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}

We can use Gaussian elimination by first zeroing the first-column coefficients of rows 2 and 3,

\begin{bmatrix} 1 & 0 & 0 \\ -2 & 1 & 0 \\ 1 & 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 3 & 3 & 2 \\ 2 & 6 & 9 & 5 \\ -1 & -3 & 3 & 0 \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} =
\begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}
\quad \Rightarrow \quad
\begin{bmatrix} 1 & 3 & 3 & 2 \\ 0 & 0 & 3 & 1 \\ 0 & 0 & 6 & 2 \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} =
\begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}

and then the third coefficient of the last row,

\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & -2 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 3 & 3 & 2 \\ 0 & 0 & 3 & 1 \\ 0 & 0 & 6 & 2 \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} =
\begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}
\quad \Rightarrow \quad
\begin{bmatrix} 1 & 3 & 3 & 2 \\ 0 & 0 & 3 & 1 \\ 0 & 0 & 0 & 0 \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} =
\begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}

The pivots are the 1 in column 1 and the 3 in column 3. Each pivot has a column of zeros below it and lies to the right of and below the previous pivot.

Now we can try to solve the equations, but note that the last row carries no information; the x_j could take any value there. The remaining equations are

0 = x_1 + 3x_2 + 3x_3 + 2x_4
0 = 3x_3 + x_4

and solving step by step,

x_3 = -x_4/3,    x_1 = -3x_2 - x_4

we have the solution

x = \begin{bmatrix} -3x_2 - x_4 \\ x_2 \\ -x_4/3 \\ x_4 \end{bmatrix}
  = x_2 \begin{bmatrix} -3 \\ 1 \\ 0 \\ 0 \end{bmatrix}
  + x_4 \begin{bmatrix} -1 \\ 0 \\ -1/3 \\ 1 \end{bmatrix}

which means that all solutions to our initial problem Ax = b are combinations of these two vectors and form an infinite set of possible solutions. You can choose ANY value of x_2 or x_4 and you will always get a correct answer.

2.3 Linear Vector Spaces

A vector space is an abstraction of ordinary space and its members can be loosely regarded as ordinary vectors. Defining a linear vector space (LVS) involves two types of objects: the elements of the space (f, g) and scalars (\alpha, \beta ∈ R, although sometimes C is useful). A real linear vector space is a set V containing elements which can be related by two operations,

f + g          (addition)
\alpha f       (scalar multiplication)

where f, g ∈ V and \alpha ∈ R. In addition, for any f, g, h ∈ V and any scalars \alpha, \beta, the following set of nine relations must be valid:

f + g ∈ V                                   (2.2)
\alpha f ∈ V                                (2.3)
f + g = g + f                               (2.4)
f + (g + h) = (f + g) + h                   (2.5)
f + g = f + h  if and only if  g = h        (2.6)
\alpha(f + g) = \alpha f + \alpha g         (2.7)
(\alpha + \beta) f = \alpha f + \beta f     (2.8)
\alpha(\beta f) = (\alpha\beta) f           (2.9)
1 f = f                                     (2.10)

An important consequence of these laws is that every vector space contains a unique zero element 0,

f + 0 = f    for all f ∈ V

and whenever

\alpha f = 0,    either \alpha = 0 or f = 0.

Some examples

The most obvious space is R^n, so

x = [x_1, x_2, . . . , x_n]

is an element of R^n.

Perhaps less familiar are spaces whose elements are functions, not just a finite set of numbers. One could define a vector space C^N[a, b], the space of all N-times differentiable functions on the interval [a, b], or the space of solutions to PDEs (∇²u = 0) with homogeneous boundary conditions.

You can check some of the laws. For example, in the vector space C^N[a, b] it should be easy to prove that when adding two N-times differentiable functions, the resulting function is also N-times differentiable.
Linear combinations

In a linear vector space you can add together a collection of elements to form a linear combination,

g = \alpha_1 f_1 + \alpha_2 f_2 + . . .

where f_j ∈ V, \alpha_j ∈ R and obviously g ∈ V.

Now, a set of elements a_1, a_2, . . . , a_n in a linear vector space is said to be linearly independent if

\sum_{j=1}^{n} \alpha_j a_j = 0    only if \alpha_1 = \alpha_2 = · · · = \alpha_n = 0;

in words, the only linear combination of the elements that equals zero is the one in which all the scalars vanish.
Subspaces

A subspace of a linear vector space V is a subset of V that is itself a LVS, meaning all the laws apply. For example,

R^n is a subspace of R^{n+1}

or

C^{N+1}[a, b] is a subspace of C^N[a, b]

since all (N + 1)-times differentiable functions are themselves N-times differentiable.

Other terms

span: the spanning set of a collection of vectors is the LVS that can be built from linear combinations of the vectors.

basis: a set of linearly independent vectors that form or span the LVS.

range: written R(A) for a matrix A ∈ R^{m×n}, it is simply the linear vector space that can be formed by taking linear combinations of the column vectors, Ax ∈ R(A); that is, R(A) is the set of ALL vectors b = Ax that can be built from linear combinations of the columns of A using all possible coefficient vectors x.



rank: the rank represents the number of linearly independent rows (equivalently, columns) in A,

rank(A) = dim[R(A)]

A matrix is said to be full rank if

rank(A ∈ R^{m×n}) = min(m, n)

and to be rank deficient otherwise.

Null space: this is the other side of the coin of the rank. It is the set of vectors x that cause

Ax = 0

and it can be shown (rank–nullity theorem) that

dim[N(A)] = n − rank(A)

2.4 Functionals

In geophysics we usually have a collection of real numbers (they could be complex numbers, for example in EM) as our observations. An observation or measurement will be a single real number.

The forward problem is

d_j = \int g_j(x) \, m(x) \, dx    (2.11)

where g_j(x) is the mathematical model and will be treated as an element in the vector space V.

We thus need something, a rule that unambiguously assigns a real number to an element g_j(x), and this is where the term functional comes in.

A functional is a rule that unambiguously assigns a single real number to an element in V.

Note that every element in V will not necessarily be connected with a real number (remember the terms range and null space). Some examples of functionals include

I_i[m] = \int_a^b g_i(x) \, m(x) \, dx,    m ∈ C^0[a, b]

D_2[f] = \left. \frac{d^2 f}{dx^2} \right|_{x=0},    f ∈ C^2[a, b]

N_1[x] = |x_1| + |x_2| + · · · + |x_n|,    x ∈ R^n

There are two kinds of functionals that will be relevant to our work: linear functionals and norms. We will devote a section to the second one later.

2.4.1 Linear functionals

For f, g ∈ D and \alpha, \beta ∈ R, a linear functional L obeys

L[\alpha f + \beta g] = \alpha L[f] + \beta L[g]

and in general

\alpha f + \beta g ∈ D

so that a linear combination of elements in the space D lies in the space D (it is a subspace of D).

The most general linear functional in R^N is the dot product,

Y[x] = x_1 y_1 + x_2 y_2 + · · · + x_N y_N = \sum_i x_i y_i

which is an example of an inner product. For finite models and data, the general relationship is

d = \sum_j g_j m_j

or, for multiple data,

d_i = G_{ij} m_j

and in some way our forward problem is an inner product between the model and the mathematical theory used to generate the data.

2.5 Norms

The norm provides a means of attributing sizes to elements of a vector space. It should be recognized that there are many ways to define the size of an element. This leads to some level of arbitrariness, but it turns out that one can choose a norm with the right behavior to suit a particular problem.

A norm, denoted ‖·‖, is a real-valued functional that satisfies the following conditions:

‖f‖ ≥ 0                                        (2.12)
‖\alpha f‖ = |\alpha| ‖f‖                      (2.13)
‖f + g‖ ≤ ‖f‖ + ‖g‖    (the triangle inequality)    (2.14)
‖f‖ = 0 only if f = 0                          (2.15)

If we omit the last condition, the functional is called a seminorm.

In a linear vector space equipped with such a norm, the distance between two elements is

d(f, g) = ‖f − g‖


Some norms in finite dimensional space

Here we define some of the commonly used norms for x ∈ R^N:

L1:   ‖x‖_1 = |x_1| + |x_2| + · · · + |x_N|
L2:   ‖x‖_2 = (x_1² + x_2² + · · · + x_N²)^{1/2}    (the Euclidean norm)
L∞:   ‖x‖_∞ = max_i |x_i|
Lp:   ‖x‖_p = (|x_1|^p + |x_2|^p + · · · + |x_N|^p)^{1/p},    p ≥ 1

The regions for which the so-called p-norms are less than unity (‖x‖_p ≤ 1) are shown in Figure 2.1.

Figure 2.1: The unit circles of the p-norms, for p = 1, 2, 3 and ∞.

For the Euclidean norm this region is called the unit ball. Note that for large values of p, the larger vector components will tend to dominate the norm.
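A short NumPy check of these definitions (the test vector is arbitrary): np.linalg.norm accepts the order p directly.

```python
import numpy as np

x = np.array([3.0, -4.0, 1.0])   # an arbitrary test vector

for p in (1, 2, 3, np.inf):
    print(f"p = {p}: ||x||_p = {np.linalg.norm(x, ord=p):.4f}")
```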
Some norms in infinite dimensional space

For an infinite dimensional space we work with functions rather than vectors:

‖f‖_1 = \int_a^b |f(x)| \, dx

‖f‖_2 = \left( \int_a^b |f(x)|^2 \, dx \right)^{1/2}

‖f‖_∞ = \max_{a \le x \le b} |f(x)|

and other norms can be designed to measure some aspect of the roughness of the functions,

‖f‖ = \left( f^2(a) + [f'(a)]^2 + \int_a^b [f''(x)]^2 \, dx \right)^{1/2}

‖f‖_S = \left( \int_a^b \left( w_0(x) f^2(x) + w_1(x) f'(x)^2 \right) dx \right)^{1/2}    (Sobolev norm)

This last set of norms is going to be useful when we try to solve underdetermined problems. They are typically applied to the model rather than the data.

2.5.1 Norms and the inverse problem

Remembering our simple inverse problem,

d = Gm    (2.16)

we form the residual

r = d − \hat{d} = d − G\hat{m}

where from our physics we can make data predictions \hat{d}, and we want our predictions to be as close as possible to the acquired data.

What do we mean by small? We use a norm to define how small is small, by making the length of r, namely the norm ‖r‖, as small as possible: minimizing the 1-norm,

L1:   ‖d − \hat{d}‖_1

or minimizing the Euclidean or 2-norm,

L2:   ‖d − \hat{d}‖_2

leading in the second case to the least squares solution.

2.5.2 Matrix Norms and the Condition Number

We return to the question of the condition number. Imagine we have a discrete inverse problem for the unperturbed system,

y = Ax    (2.17)

and the perturbed case is

y' = Ax'    (2.18)


Here, assume the perturbation is small. Note that in real life we have uncertainties in our observations, and we wish to know whether these small errors in the observations severely affect our end result.

Using a norm, we wish to know what the effect of the small perturbations is, so using the relations above,

A(x − x') = y − y'
(x − x') = A^{-1}(y − y')
‖x − x'‖ ≤ ‖A^{-1}‖ ‖y − y'‖

where in the last step we used the matrix-norm inequality ‖Ab‖ ≤ ‖A‖‖b‖.

To get an idea of the relative effect of the perturbations on our result, note that ‖y‖ = ‖Ax‖ ≤ ‖A‖‖x‖, so that

\frac{‖x − x'‖}{‖x‖} ≤ ‖A‖ \, ‖A^{-1}‖ \, \frac{‖y − y'‖}{‖y‖}

where we defined the condition number

\kappa(A) = ‖A‖ \, ‖A^{-1}‖    (2.19)

which shows the amount by which a small relative perturbation in the observations y is reflected in a relative perturbation of the resulting estimated model x. For the L2 norm, the condition number of a matrix is \lambda_{max}/\lambda_{min}, where the \lambda_i are the singular values (eigenvalues, for a symmetric matrix) of the matrix in question.

Chapter 3

Linear regression, least squares and normal equations

3.1 Linear Regression

Sometimes we will talk about the term inverse problem, while other people will prefer the term regression. What is the difference? In practice, none. In the case where we are dealing with a function-fitting procedure that can be cast as an inverse problem, the procedure is often referred to as a regression. In fact, economists use regressions quite extensively.

Finding a parameterized curve that approximately fits a set of data points is referred to as regression. For example, the parabolic trajectory problem is defined by

y(t) = m_1 + m_2 t - \frac{1}{2} m_3 t^2

where y(t) represents the altitude of the object at time t, and the three (N = 3) model parameters m_i are associated with the constant, slope and quadratic terms. Note that even if the function is quadratic, the problem in question is linear in the three parameters.

If we have M discrete measurements y_i at times t_i, the linear regression problem or inverse problem can be written in the form

\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_M \end{bmatrix} =
\begin{bmatrix} 1 & t_1 & -\frac{1}{2} t_1^2 \\ 1 & t_2 & -\frac{1}{2} t_2^2 \\ \vdots & \vdots & \vdots \\ 1 & t_M & -\frac{1}{2} t_M^2 \end{bmatrix}
\begin{bmatrix} m_1 \\ m_2 \\ m_3 \end{bmatrix}

When the regression model is linear in the unknown parameters, we call this a linear regression or linear inverse problem.

3.2 The simple least squares problem

We start the application of all the terms we have learned above by looking at an overdetermined linear problem (more equations than unknowns) involving the simplest of norms, the L2 or Euclidean norm.

Suppose we are given a collection of M measurements of a property to form a vector d ∈ R^M. From our geophysics we know the forward problem, such that we can predict the data from a known model m ∈ R^N. That is, we know the N vectors g_k ∈ R^M such that

d = \sum_{k=1}^{N} g_k m_k = Gm    (3.1)

where

G = [g_1, g_2, . . . , g_N]

We are looking for a model \hat{m} that minimizes the size of the residual vector, defined as

r = d - \sum_{k=1}^{N} g_k \hat{m}_k

We do not expect to have an exact fit, so there will be some error, and we use a norm to measure the size of the residual,

‖r‖ = ‖d − G\hat{m}‖

For the least squares problem we use the L2 or Euclidean norm,

‖r‖_2 = \left( \sum_{k=1}^{M} r_k^2 \right)^{1/2}

Example 1: the mean value

Suppose we have M measurements of the same quantity, so we have our data vector

d = [d_1, d_2, . . . , d_M]^T

The residual is defined as the distance between each individual measurement and the predicted value \hat{m}:

r_i = d_i - \hat{m}

Using the L2 norm,

‖r‖_2^2 = \sum_{k=1}^{M} r_k^2 = \sum_{k=1}^{M} (d_k - \hat{m})^2 = \sum_{k=1}^{M} d_k^2 - 2\hat{m} \sum_{k=1}^{M} d_k + M \hat{m}^2

Now, to minimize the residual, we take the derivative with respect to the model \hat{m} and set it to zero,

\frac{d}{d\hat{m}} ‖r‖_2^2 = -2 \sum_{k=1}^{M} d_k + 2 M \hat{m} = 0

and by solving for \hat{m} we have

\hat{m} = \frac{1}{M} \sum_{k=1}^{M} d_k

which shows that the sample mean is the result of a least squares solution for the measurements.

The corresponding estimate that minimizes the L1 norm is the median. Note that the median is not found by a linear operation on the data, which is a general feature of L1-norm estimates.
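A small numerical illustration of this point (the data values are invented): with a single outlier the L2 estimate (the mean) is pulled away, while the L1 estimate (the median) barely moves.

```python
import numpy as np

d_clean = np.array([9.8, 10.1, 10.0, 9.9, 10.2])
d_outlier = np.append(d_clean, 50.0)   # one bad measurement

print("mean   (L2 estimate):", d_clean.mean(), "->", d_outlier.mean())
print("median (L1 estimate):", np.median(d_clean), "->", np.median(d_outlier))
```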

3.2.1 General LS Solution

Going back to our general problem, we have

\hat{d} = G\hat{m}

and the predicted data are a linear combination of the g_k's. Using linear vector space theory, we can show that the predicted data \hat{d} must lie in the estimation space, the set of ALL possible results that G can produce (its range).

Setting up the L2 norm of the residuals between the data and the prediction,

‖r‖_2^2 = ‖d - \hat{d}‖_2^2 = r^T r = (d - G\hat{m})^T (d - G\hat{m}) = d^T d - 2\hat{m}^T G^T d + \hat{m}^T G^T G \hat{m}

now we take the derivative with respect to \hat{m} and set it to zero,

\frac{d}{d\hat{m}} ‖r‖_2^2 = \frac{d}{d\hat{m}} \left[ d^T d - 2\hat{m}^T G^T d + \hat{m}^T G^T G \hat{m} \right] = 0

0 = -2 G^T d + 2 G^T G \hat{m}


It is worth pointing out that the derivative here is of a scalar with respect to a vector. We will show below that this works as simply as it appears, by writing out all the components. Simplifying a bit more,

G^T d = G^T G \hat{m}    (3.2)

which are called the normal equations.

Assuming the inverse of (G^T G) exists, we can isolate \hat{m} to end up with

\hat{m} = (G^T G)^{-1} G^T d

Note that the matrix (G^T G) is a square N × N matrix and G^T d is an N-dimensional column vector.
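A minimal NumPy sketch of the normal equations, using an invented G and d: in practice np.linalg.lstsq (based on an orthogonal factorization/SVD) is preferred over forming GᵀG explicitly, for the conditioning reasons mentioned in Section 2.1.1, but for a well-conditioned problem both give the same answer.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 20, 3
G = rng.normal(size=(M, N))          # invented forward operator
m_true = np.array([1.0, -2.0, 0.5])
d = G @ m_true + 0.01 * rng.normal(size=M)

# Normal equations:  (G^T G) m = G^T d
m_ne = np.linalg.solve(G.T @ G, G.T @ d)

# Preferred numerical route
m_ls, *_ = np.linalg.lstsq(G, d, rcond=None)

print(np.allclose(m_ne, m_ls))       # True for this well-conditioned example
```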
Derivation with another notation

Starting with the L2 norm of the residuals,

‖r‖_2^2 = \sum_{j=1}^{M} r_j^2 = \sum_{j=1}^{M} \left( d_j - \sum_{i=1}^{N} g_{ji} m_i \right)^2

we take the derivative with respect to m_k and set it to zero,

0 = \frac{d}{dm_k} ‖r‖_2^2
  = \frac{d}{dm_k} \sum_{j=1}^{M} \left( d_j - \sum_{i=1}^{N} g_{ji} m_i \right) \left( d_j - \sum_{l=1}^{N} g_{jl} m_l \right)
  = \frac{d}{dm_k} \sum_{j=1}^{M} \left[ d_j d_j - d_j \sum_{l=1}^{N} g_{jl} m_l - d_j \sum_{i=1}^{N} g_{ji} m_i + \sum_{i=1}^{N} \sum_{l=1}^{N} g_{ji} g_{jl} m_i m_l \right]

We may look at each of these terms independently. The first term is

\frac{d}{dm_k} \sum_{j=1}^{M} d_j d_j = 0

The second and third terms are similar; together they give

\frac{d}{dm_k} \sum_{j=1}^{M} \left[ -2 d_j \sum_{l=1}^{N} g_{jl} m_l \right] = -2 \sum_{j=1}^{M} d_j \sum_{l=1}^{N} g_{jl} \delta_{lk} = -2 \sum_{j=1}^{M} d_j g_{jk} \;\;\longleftrightarrow\;\; -2\, G^T d

and the last term gives

\frac{d}{dm_k} \sum_{j=1}^{M} \sum_{i=1}^{N} \sum_{l=1}^{N} g_{ji} g_{jl} m_i m_l
  = \sum_{j=1}^{M} \sum_{i=1}^{N} \sum_{l=1}^{N} \left( \delta_{ik} g_{ji} g_{jl} m_l + \delta_{lk} g_{ji} g_{jl} m_i \right)
  = \sum_{j=1}^{M} \left[ \sum_{l=1}^{N} g_{jk} g_{jl} m_l + \sum_{i=1}^{N} g_{ji} g_{jk} m_i \right]

Now note that the two sums inside the brackets are identical (only the dummy index differs), so in the end we have

2 \sum_{j=1}^{M} \sum_{i=1}^{N} g_{jk} g_{ji} m_i \;\;\longleftrightarrow\;\; 2\, G^T G m

Setting the sum of these terms to zero, we have derived the same result as before for the normal equations.

3.2.2 Geometrical Interpretation of the normal equations

The normal equations may seem to have no intuitive content:

\hat{m} = (G^T G)^{-1} G^T d

which was derived from

(G^T G)\hat{m} = G^T d    (3.3)

Let's consider the data prediction \hat{d} as a linear combination of the g_k vectors, and assume they are linearly independent:

\hat{d} = G\hat{m} = g_1 \hat{m}_1 + g_2 \hat{m}_2 + \cdots + g_N \hat{m}_N

where g_k is the kth column vector of the G matrix.

Recall that the set of g_k's forms a subspace of the entire R^M data space, sometimes called the estimation space or model space. Starting with (3.3) we have

(G^T G)\hat{m} = G^T d
G^T (G\hat{m}) = G^T d
G^T (G\hat{m} - d) = 0

and recalling the definition of the residual, r = d - G\hat{m}, this is

G^T r = 0

So, in other words, the normal equations in the least squares sense mean that

G^T r = \begin{bmatrix} g_1 \cdot r \\ g_2 \cdot r \\ \vdots \\ g_N \cdot r \end{bmatrix}
      = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}

suggesting that the residual vector is orthogonal to every one of the column vectors of the G matrix. The key point is that making the residual perpendicular to the estimation subspace minimizes the length of r.

Figure 3.1: Geometrical interpretation of the LS & normal equations. We are basically projecting the data d ∈ R^M onto the column space of G (the subspace containing G\hat{m}).

This concept is called the orthogonal projection of d onto the subspace R(G), such that the actual measurements d can be expressed as

d = \hat{d} + r

We have created a vector G\hat{m} = \hat{d}, where \hat{d} is called the orthogonal projection of d onto the subspace of G. The idea of this projection relies on the Projection Theorem for Hilbert spaces, but we are not going to go too deeply into this.

The theorem says that, given a subspace of G, every vector can be written uniquely as the sum of two parts: one part lies in the subspace of G and the other part is orthogonal to the first (see Figure 3.1). The part lying in this subspace of G is the orthogonal projection of the vector d onto G,

d = \hat{d} + r

There is a linear operator P_G, the projection matrix, that acts on d to generate \hat{d}:

P_G = G (G^T G)^{-1} G^T


This projection matrix has particularly interesting properties. For example, P² = P, meaning that if we apply the projection matrix twice to a vector d, we get the same result as if we apply it only once, namely we get \hat{d}. One can also check from the definition that P is a symmetric matrix.
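A short numerical check of these properties (the G below is invented): we build P_G = G(GᵀG)⁻¹Gᵀ and verify idempotence, symmetry, and that Gᵀr = 0 for the residual r = d − P_G d.

```python
import numpy as np

rng = np.random.default_rng(1)
G = rng.normal(size=(5, 2))            # invented 5x2 design matrix
d = rng.normal(size=5)

P = G @ np.linalg.inv(G.T @ G) @ G.T   # projection onto the column space of G
d_hat = P @ d
r = d - d_hat

print(np.allclose(P @ P, P))     # idempotent: P^2 = P
print(np.allclose(P, P.T))       # symmetric
print(np.allclose(G.T @ r, 0))   # residual orthogonal to the columns of G
```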
Example: Straight line fit

Assume we have 3 measurements, d ∈ R^M with M = 3. For a straight line we only need 2 coefficients, the intercept and the slope, thus m ∈ R^N with N = 2. The data predictions are then

\hat{d} = G\hat{m}

\begin{bmatrix} \hat{d}_1 \\ \hat{d}_2 \\ \hat{d}_3 \end{bmatrix} =
\begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \end{bmatrix}
\begin{bmatrix} \hat{m}_1 \\ \hat{m}_2 \end{bmatrix}

or

\hat{d}_1 = g_{11}\hat{m}_1 + g_{12}\hat{m}_2
\hat{d}_2 = g_{21}\hat{m}_1 + g_{22}\hat{m}_2
\hat{d}_3 = g_{31}\hat{m}_1 + g_{32}\hat{m}_2

and, as we have said, the residual vector would be

r = d - \hat{d}

Reorganizing, we have

d = \hat{d} + r,    with r ⊥ \hat{d}

which is described in the figure below.

3.2.3 Maximum Likelihood

We can also use the Maximum Likelihood method to interpret the Least Squares method and the normal equations. This technique was developed by R.A. Fisher in the 1920s and has dominated the field of statistical inference since then. Its power is that it can (in principle) be applied to any type of estimation problem, provided that one can write down the joint probability distribution of the random variables which we assume model the observations.

Maximum likelihood looks for the optimum values of the unknown model parameters as those that maximize the probability that the observed data are due to the model, from a probabilistic point of view.

Suppose we have a random sample of M observations x = x_1, x_2, . . . , x_M drawn from a probability distribution (PDF) f(x_i, \theta) where the parameter \theta is unknown. We can extend this to a set of model parameters, f(x_i, m). The joint probability for all M observations is

f(x, m) = f(x_1, m) f(x_2, m) \cdots f(x_M, m) = L(x, m)

Figure 3.2: The LS fit for a straight line. The estimation space is the straight line given by G\hat{m}; this is where all predictions will lie. The real measurements d_k lie above or below this line and are, in a sense, projected onto the line via the residuals r_k.

We call L(x, m) = f(x, m) the likelihood function of m. If L(x, m_0) > L(x, m_1) we can say that m_0 is a more plausible value for the model vector m than m_1, because m_0 ascribes a larger probability to the observed values in the vector x than m_1 does.

In practice we are given a particular data vector and we wish to find the most plausible model that generated these data, by finding the model that gives the largest likelihood.

Example 1: The mean value

Assume we are given M measurements of the same quantity and that the data contain normally distributed errors, so d ~ N(\mu, \sigma^2), where \mu is the mean value and \sigma^2 is the variance. The probability function for a single datum is

f(d_i, \mu) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{(d_i - \mu)^2}{2\sigma^2} \right]

and the joint distribution or likelihood function is

L(d, \mu) = (2\pi)^{-M/2} \sigma^{-M} \exp\left[ -\frac{\sum_{i=1}^{M} (d_i - \mu)^2}{2\sigma^2} \right]


Maximizing the likelihood function is the same as maximizing its logarithm, so

max L(d, \mu)    ⇔    max ln L(d, \mu)

where we let \mathcal{L} be our log-likelihood function to maximize,

\mathcal{L} = -\frac{M}{2} \ln(2\pi) - M \ln(\sigma) - \frac{1}{2\sigma^2} \sum_{i=1}^{M} (d_i - \mu)^2

Taking the derivative with respect to \mu and setting it to zero,

0 = \frac{\partial \mathcal{L}}{\partial \mu} = \frac{1}{\sigma^2} \sum_{i=1}^{M} (d_i - \mu)

so that

\sum_{i=1}^{M} d_i - M\mu = 0

and as expected we obtain the arithmetic mean,

\mu = \frac{1}{M} \sum_{i=1}^{M} d_i

We can also look for the maximum likelihood estimate of the variance \sigma^2:

0 = \frac{\partial \mathcal{L}}{\partial \sigma} = -\frac{M}{\sigma} + \frac{1}{\sigma^3} \sum_{i=1}^{M} (d_i - \mu)^2

and we get

\sigma^2 = \frac{1}{M} \sum_{i=1}^{M} (d_i - \mu)^2

The least squares problem with maximum likelihood

We return to the linear inverse problem we had before,

d = Gm + \epsilon

where we assume the errors are normally distributed, \epsilon_i ~ N(0, \sigma_i^2). The joint probability distribution or likelihood function in this case is

L(d, m) = \frac{1}{(2\pi)^{M/2} \prod_{i=1}^{M} \sigma_i} \prod_{i=1}^{M} \exp\left[ -\frac{\left(d_i - (Gm)_i\right)^2}{2\sigma_i^2} \right]


We want to maximize the function above; the constant term has no effect, leading to

\max_m L = \max_m \exp\left[ -\sum_{i=1}^{M} \frac{\left(d_i - (Gm)_i\right)^2}{2\sigma_i^2} \right]

Take the logarithm of this likelihood function,

\max_m \mathcal{L} = \max_m \left[ -\sum_{i=1}^{M} \frac{\left(d_i - (Gm)_i\right)^2}{2\sigma_i^2} \right]

and switch to a minimization instead,

\min_m \left[ \sum_{i=1}^{M} \frac{\left(d_i - (Gm)_i\right)^2}{2\sigma_i^2} \right]

In matrix form this can be expressed as

\min_m \; \frac{1}{2} (d - Gm)^T \Sigma^{-1} (d - Gm)

where \Sigma is the data covariance matrix.

So, to minimize, we take the derivative with respect to the model parameter vector and set it to zero,

0 = \frac{\partial}{\partial m} \left[ (d - Gm)^T \Sigma^{-1} (d - Gm) \right]
  = \frac{\partial}{\partial m} \left[ d^T \Sigma^{-1} d - 2 m^T G^T \Sigma^{-1} d + m^T G^T \Sigma^{-1} G m \right]
  = -2 G^T \Sigma^{-1} d + 2 G^T \Sigma^{-1} G m

finally leading to

\hat{m} = (G^T \Sigma^{-1} G)^{-1} G^T \Sigma^{-1} d

which comes from what are sometimes called the generalized normal equations,

(G^T \Sigma^{-1} G)\hat{m} = G^T \Sigma^{-1} d    (3.4)

or the weighted least squares solution for the overdetermined case.
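A sketch of the weighted (generalized) normal equations with invented data: with independent errors, Σ⁻¹ = diag(1/σ_i²), which is equivalent to whitening each row of the system by 1/σ_i and then solving an ordinary least squares problem.

```python
import numpy as np

rng = np.random.default_rng(2)
M = 30
G = np.column_stack([np.ones(M), np.linspace(0.0, 1.0, M)])   # line-fit design matrix
m_true = np.array([1.0, 2.0])
sigma = rng.uniform(0.01, 0.2, size=M)                         # per-datum standard errors
d = G @ m_true + sigma * rng.normal(size=M)

Sigma_inv = np.diag(1.0 / sigma**2)
m_wls = np.linalg.solve(G.T @ Sigma_inv @ G, G.T @ Sigma_inv @ d)

# Equivalent: whiten the rows and use ordinary least squares
Gw, dw = G / sigma[:, None], d / sigma
m_whitened, *_ = np.linalg.lstsq(Gw, dw, rcond=None)

print(np.allclose(m_wls, m_whitened))   # True
```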

3.3 Why LS and the effect of the norm

As you might have expected, the choice of norm is somewhat arbitrary. So why is the use of least squares so popular?

1. Least squares estimates are linear in the data and easy to program.


Figure 3.3: Schematic of a straight line fit for (x, d) data points under the L1, L2, and L∞ norms. The L1 fit is not as affected by the single outlier.

2. It corresponds to the maximum likelihood estimate for normally distributed errors. The normal distribution comes from the central limit theorem: add up random effects and you get a Gaussian.

3. The error propagation is linear: the statistics of the estimate follow from a linear mapping of the input (data) statistics.

4. Well-known statistical tests and confidence intervals can be obtained.

It has some disadvantages too. The main one is that the result is sensitive to outliers (see Figure 3.3).

Another popular norm is the L1 norm. Some characteristics include:

1. It is non-linear in the data, and is solved by linear programming (to be seen later).
2. It is less sensitive to outliers.
3. Confidence intervals and hypothesis testing are somewhat more difficult, but can be done.

3.4 The L2 problem from 3 Perspectives

1. Geometry: orthogonality of the residual and the predicted data,

\hat{d} \cdot r = 0
(G\hat{m}) \cdot (d - G\hat{m}) = 0
G^T (d - G\hat{m}) = G^T r = 0

which leads to

\hat{m} = (G^T G)^{-1} G^T d



2. Calculus where we want to minimize krk2
rT r =

T (d Gm)

(d Gm)

(rT r) = 0
m

GT (d Gm)
= 0
leading to
= (GT G)1 GT d
m
3. Maximum likelihood for a multivariate normal distribution
Maximize:


T 1 (d Gm)

exp (d Gm)
Minimize:
T 1 (d Gm)

(d Gm)
Led to:
= (GT 1 G)1 GT 1 d
m
which comes from the generalized normal equations.

3.5 Full Example: Line fit

We come back to the general line fit problem, where we have two unknowns, the intercept m_1 and the slope m_2, and M observations d_i. The inverse problem is

d = Gm

and, written out in full,

\begin{bmatrix} d_1 \\ d_2 \\ \vdots \\ d_M \end{bmatrix} =
\begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_M \end{bmatrix}
\begin{bmatrix} m_1 \\ m_2 \end{bmatrix}

As you are already aware, the least squares solution of this problem is

\hat{m} = (G^T G)^{-1} G^T d

which we now evaluate explicitly.


The last term is

G^T d = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_M \end{bmatrix}
\begin{bmatrix} d_1 \\ d_2 \\ \vdots \\ d_M \end{bmatrix}
= \begin{bmatrix} \sum_{i=1}^{M} d_i \\ \sum_{i=1}^{M} x_i d_i \end{bmatrix}

The first term is (note the typo in the book)

G^T G = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_M \end{bmatrix}
\begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_M \end{bmatrix}
= \begin{bmatrix} M & \sum_{i=1}^{M} x_i \\ \sum_{i=1}^{M} x_i & \sum_{i=1}^{M} x_i^2 \end{bmatrix}

so that

(G^T G)^{-1} = \frac{1}{M \sum_{i=1}^{M} x_i^2 - \left( \sum_{i=1}^{M} x_i \right)^2}
\begin{bmatrix} \sum_{i=1}^{M} x_i^2 & -\sum_{i=1}^{M} x_i \\ -\sum_{i=1}^{M} x_i & M \end{bmatrix}

leading to our final result,

\hat{m} = \frac{1}{M \sum_{i=1}^{M} x_i^2 - \left( \sum_{i=1}^{M} x_i \right)^2}
\begin{bmatrix} \sum_{i=1}^{M} x_i^2 \sum_{i=1}^{M} d_i - \sum_{i=1}^{M} x_i \sum_{i=1}^{M} x_i d_i \\ M \sum_{i=1}^{M} x_i d_i - \sum_{i=1}^{M} x_i \sum_{i=1}^{M} d_i \end{bmatrix}

Using the concept of the covariance of the model parameters,

cov(\hat{m}) = \sigma^2 (G^T G)^{-1}
= \frac{\sigma^2}{M \sum_{i=1}^{M} x_i^2 - \left( \sum_{i=1}^{M} x_i \right)^2}
\begin{bmatrix} \sum_{i=1}^{M} x_i^2 & -\sum_{i=1}^{M} x_i \\ -\sum_{i=1}^{M} x_i & M \end{bmatrix}


where \sigma^2 is the variance of the individual data. This equation shows that even if the data d_i are uncorrelated, the model parameters can be correlated:

cov(m_1, m_2) \propto -\sum_{i=1}^{M} x_i

A number of important observations:

- There is a negative correlation between intercept and slope (for positive x_i).
- The magnitude of the correlation depends on the spread of the x axis.

How can we reduce the covariance between the model parameters? We define a new axis,

y_i = x_i - \frac{1}{M} \sum_{i=1}^{M} x_i

which is basically equivalent to shifting the origin of the x axis. The covariance is now

cov(\hat{m}) = \frac{\sigma^2}{M \sum_{i=1}^{M} y_i^2}
\begin{bmatrix} \sum_{i=1}^{M} y_i^2 & 0 \\ 0 & M \end{bmatrix}
= \sigma^2 \begin{bmatrix} 1/M & 0 \\ 0 & 1/\sum_{i=1}^{M} y_i^2 \end{bmatrix}

This new relation shows independent intercept and slope estimates, and if \sigma is the standard error in the observed data then:

- the standard error of the intercept is \sigma/\sqrt{M}: with more data you reduce the variance of the intercept;
- the standard error of the slope is \sigma/\sqrt{\sum_{i=1}^{M} y_i^2}: if the observation points on the x axis are clustered closely together, the uncertainty in the slope estimate is greater.
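A sketch of the full line fit with invented data, comparing the parameter covariance before and after centering the x axis; the off-diagonal term essentially vanishes once the axis is shifted to its mean.

```python
import numpy as np

rng = np.random.default_rng(3)
M = 15
x = np.linspace(5.0, 10.0, M)            # offset x axis -> correlated parameters
sigma = 0.1
d = 1.0 + 0.5 * x + sigma * rng.normal(size=M)

def line_fit(xv, dv):
    G = np.column_stack([np.ones_like(xv), xv])
    cov = sigma**2 * np.linalg.inv(G.T @ G)
    m = np.linalg.solve(G.T @ G, G.T @ dv)
    return m, cov

m1, cov1 = line_fit(x, d)                # original axis
m2, cov2 = line_fit(x - x.mean(), d)     # centered axis

print("cov(m1, m2) original:", cov1[0, 1])   # negative correlation
print("cov(m1, m2) centered:", cov2[0, 1])   # ~0: independent intercept and slope
```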

Chapter 4

Tikhonov Regularization, variance and resolution

4.1 Tikhonov Regularization

Tikhonov regularization is one of the most common methods used for regularizing an inverse problem. The reason to do this is that in many cases the inverse problem is ill-posed and small errors in the data will give very large errors in the resultant model.

Another possible reason for using this method is if we have a mixed-determined problem, where for example we might have a model null space. For the overdetermined part, we would like to minimize the residual vector,

min ‖r‖    ⇒    \hat{m} = (G^T G)^{-1} G^T d

while for the underdetermined part we actually minimize the model norm,

min ‖m‖    ⇒    \hat{m} = G^T (G G^T)^{-1} d

and of course, for the mixed-determined case, we will be trying something in between,

\Phi(m) = ‖d - Gm‖_2^2 + \alpha^2 ‖m‖_2^2

and, as we have seen before, we want to minimize

\min_m \Phi(m) = \min_m \left\| \begin{bmatrix} G \\ \alpha I \end{bmatrix} m - \begin{bmatrix} d \\ 0 \end{bmatrix} \right\|_2^2

or equivalently

\hat{m} = (G^T G + \alpha^2 I)^{-1} G^T d

So the question is, what do we choose for \alpha? If we choose \alpha very large, we are focusing our attention on minimizing the model norm ‖m‖ while neglecting the residual norm. If we choose \alpha too small, we are doing the complete opposite and trying to fit the data perfectly, which is probably not what we want.

A graphical way to see how the two norms interact depending on the choice of \alpha is shown in Figure 5.1 of our book. The idea is that as the residual norm increases, the model norm decreases, leading to the so-called L-curve. This is because ‖m‖_2 is a strictly decreasing function of \alpha, while ‖d - Gm‖_2 is a strictly increasing function of \alpha.

Our job now is to find an optimal value of \alpha. There are a few methods we are going to see for getting the optimal \alpha: the discrepancy criterion, the L-curve criterion and cross-validation. Before going there, we want to understand the effect of the choice of \alpha on the resolution of the estimate as well as on the covariance of the model parameters. Similarly, we want to understand the choice of the number of singular values used in solving the generalized inverse using the SVD, and the SVD implementation of Tikhonov regularization. Finally, we will see how other norms can be chosen in order to penalize models with excessive roughness or curvature.
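A minimal sketch of zeroth-order Tikhonov (damped least squares) on an invented ill-conditioned operator: we sweep α and record the two norms that make up the L-curve, confirming that the residual norm grows while the model norm shrinks as α increases.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 20
U, _ = np.linalg.qr(rng.normal(size=(N, N)))
V, _ = np.linalg.qr(rng.normal(size=(N, N)))
s = np.logspace(0, -8, N)                 # rapidly decaying singular values
G = U @ np.diag(s) @ V.T                  # invented ill-conditioned operator
m_true = np.zeros(N); m_true[9] = 1.0
d = G @ m_true + 1e-6 * rng.normal(size=N)

for alpha in np.logspace(-8, -1, 15):
    m_alpha = np.linalg.solve(G.T @ G + alpha**2 * np.eye(N), G.T @ d)
    res_norm = np.linalg.norm(d - G @ m_alpha)
    mod_norm = np.linalg.norm(m_alpha)
    print(f"alpha = {alpha:.1e}  ||d - Gm|| = {res_norm:.2e}  ||m|| = {mod_norm:.2e}")
```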

4.2 SVD Implementation

Using our previous expression, but inserting the SVD of the G matrix, namely

G = U S V^T

into

(G^T G + \alpha^2 I)\hat{m} = G^T d

we can write

(V S U^T U S V^T + \alpha^2 I)\hat{m} = V S U^T d
(V S^2 V^T + \alpha^2 I)\hat{m} = V S U^T d

and the solution is

\hat{m} = \sum_i \frac{s_i^2}{s_i^2 + \alpha^2} \, \frac{u_i^T d}{s_i} \, v_i

where

f_i = \frac{s_i^2}{s_i^2 + \alpha^2}

are called the filter factors.

The filter factors have an obvious effect on the resultant model: for s_i ≫ \alpha, the factor f_i ≈ 1 and the result would be like

\hat{m} = V_p S_p^{-1} U_p^T d

where we had chosen the value of p to include all singular values that are large. In contrast, for s_i ≪ \alpha, the factor f_i ≈ 0 and this part of the solution will be damped out, or down-weighted.


In matrix form we can write the expression as

\hat{m} = V F S^{-1} U^T d

where F is diagonal with

F_{ii} = \frac{s_i^2}{s_i^2 + \alpha^2}

and zeros elsewhere.

Unlike the truncation seen earlier, achieved by choosing an integer value p for the number of singular values and singular vectors to use, here there is a smooth transition between the included and excluded singular values.
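A sketch of the filter-factor route and, for comparison, the truncated SVD, written as small helper functions (the function names are my own):

```python
import numpy as np

def tikhonov_svd(G, d, alpha):
    """Zeroth-order Tikhonov solution via SVD filter factors."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    f = s**2 / (s**2 + alpha**2)                 # filter factors
    return Vt.T @ (f * (U.T @ d) / s)

def tsvd(G, d, p):
    """Truncated SVD solution keeping the p largest singular values."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    return Vt.T[:, :p] @ ((U.T[:p] @ d) / s[:p])
```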
Other filter factors have been suggested, for example

f_i = \frac{s_i}{s_i + \alpha}

4.3 Resolution vs variance, the choice of α or p

From previous lectures, we can now discuss the resolution and variance of our resultant model using the generalized inverse, in this case for Tikhonov regularization. We had

\hat{m} = (G^T G + \alpha^2 I)^{-1} G^T d = G^{\#} d = V F S^{-1} U^T d = V_p S_p^{-1} U_p^T d

where the first two expressions use the general Tikhonov regularization, the third is the SVD with filter factors and the last one is the result if we choose a number p of singular values.

The model resolution matrix R_m was defined via

\hat{m} = G^{\#} d = G^{\#} G m_{true} = R_m m_{true}

and is then, for the three cases,

R_{m,\alpha} = G^{\#} G
R_{m,f} = V F V^T
R_{m,p} = V_p V_p^T

In all regularizations R ≠ I, the estimate will be biased and \hat{m} ≠ m_{true}. The bias introduced by regularizing is

\hat{m} - m_{true} = [R - I] \, m_{true}

but since we don't know m_{true}, we don't know the sense of the bias. We can't even bound the bias, since it depends on the true m as well.


Finally, we also have to deal with uncertainties, so, as we have seen before, the model covariance matrix is

\Sigma_m = \left\langle (\hat{m} - \langle\hat{m}\rangle)(\hat{m} - \langle\hat{m}\rangle)^T \right\rangle = G^{\#} \Sigma_d (G^{\#})^T

and, assuming \Sigma_d = \sigma_d^2 I, our three cases lead to

\Sigma_{m,\alpha} = \sigma^2 \, G^{\#} (G^{\#})^T
\Sigma_{m,f} = \sigma^2 \, V F^2 S^{-2} V^T
\Sigma_{m,p} = \sigma^2 \, V_p S_p^{-2} V_p^T

We could use this to evaluate confidence intervals or ellipses on the model, but since the model is biased by an unknown amount, the confidence intervals might not be representative of the true deviation of the estimated model.
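A sketch of how these diagnostics can be assembled for the damped-least-squares case (the helper name is my own); as α → 0 the resolution matrix approaches the identity while the covariance grows.

```python
import numpy as np

def dls_diagnostics(G, alpha, sigma_d=1.0):
    """Generalized inverse, model resolution and covariance for damped LS."""
    N = G.shape[1]
    Gsharp = np.linalg.solve(G.T @ G + alpha**2 * np.eye(N), G.T)  # generalized inverse
    R_m = Gsharp @ G                         # model resolution matrix
    Cov_m = sigma_d**2 * Gsharp @ Gsharp.T   # model covariance for uncorrelated data
    return Gsharp, R_m, Cov_m
```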

4.3.1 Example 1: Shaw's problem

In this example I would like to show a practical application of Tikhonov regularization, using both the general approach (building the generalized inverse explicitly) and using the SVD.

I take the examples from Aster's book directly. In the Shaw problem, the measured data are the diffracted light intensity as a function of outgoing angle, d(s), where the angle is −π/2 ≤ s ≤ π/2. We use the discretized version of the problem as outlined in the book, namely the mathematical model relating the observed data d and the model vector m is

d = Gm

where d ∈ R^M and m ∈ R^N, but in our example we will have M = N. The G matrix is defined for the discrete case as

G_{ij} = \frac{\pi}{N} \left( \cos(s_i) + \cos(\theta_j) \right)^2
\left[ \frac{\sin\!\left( \pi (\sin(s_i) + \sin(\theta_j)) \right)}{\pi (\sin(s_i) + \sin(\theta_j))} \right]^2

Note that the part inside the large brackets is the sinc function.

We discretize the model and data vectors at the same angles,

s_i = \theta_i = \frac{(i - 0.5)\pi}{N} - \frac{\pi}{2},    i = 1, 2, . . . , N,

which in theory would give us an even-determined linear inverse problem, but, as we will see, the problem is very ill-conditioned.
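A sketch of the discrete Shaw operator as written above (following the formula quoted from Aster's book; the implementation details and function name are my own):

```python
import numpy as np

def shaw_matrix(N=20):
    """Discrete Shaw forward operator G (N x N) and the angle vector."""
    i = np.arange(1, N + 1)
    theta = (i - 0.5) * np.pi / N - np.pi / 2.0    # model angles
    s = theta.copy()                               # data angles (same discretization)
    S, T = np.meshgrid(s, theta, indexing="ij")    # S[i, j] = s_i, T[i, j] = theta_j
    arg = np.pi * (np.sin(S) + np.sin(T))
    # np.sinc(x) = sin(pi x)/(pi x), so pass arg/pi to obtain sin(arg)/arg
    G = (np.pi / N) * (np.cos(S) + np.cos(T))**2 * np.sinc(arg / np.pi)**2
    return G, theta

G, theta = shaw_matrix(20)
print(np.linalg.cond(G))   # huge: the problem is severely ill-conditioned
```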
Similar to what was done in the book, we use a simple delta function for the true model,

m_i = 1 for i = 10,    m_i = 0 otherwise,

and generate synthetic data by using

d = Gm + \eta

where the errors are \eta_i ~ N(0, \sigma^2) with \sigma = 10^{-6}. Note that the errors are quite small, but nevertheless, due to the ill-posed nature of the inverse problem, they will have a significant effect on the resultant models.

In this section we will focus on two main ways to estimate an appropriate model \hat{m},

\hat{m} = (G^T G + \alpha^2 I)^{-1} G^T d    or    \hat{m} = V_p S_p^{-1} U_p^T d

where in the first case we need to choose a value of \alpha, while in the second case (SVD) we need to choose a value of p, the number of singular values and vectors to use. Since the singular values are rarely exactly zero, the choice is not so easy to make. In addition to making a particular choice, we need to understand what effect our choice has on the model resolution and model covariance. In the next figures I present the results graphically in order to get an intuitive understanding of our choices.
Figure 4.1: Some models using the generalized (damped least squares) inverse. Top-left: the L-curve for the residual norm and model norm; various choices of α are used, and the colored dots are three choices made. Top-right: true model (circles) and the estimated models for the three choices on the left. Bottom-left: the synthetic data (circles) and the three predicted data. Bottom-right: the resolution (top panels) and covariance (bottom panels) matrices for the three choices. White represents large amplitudes, black represents lower amplitudes.



Figure 4.2: Some models using the truncated SVD. Top-left: the residual norm as a function of the number p of singular values retained, with three choices (p = 14, 8, 2) marked; also shown are the singular values and the Picard ratios |u_i^T d| and |u_i^T d|/s_i. Top-right: true model (circles) and the estimated models for the three choices of p. Bottom-left: the synthetic data (circles) and the three predicted data. Bottom-right: the resolution (top panels) and covariance (bottom panels) matrices for the three choices. White represents large amplitudes, black represents lower amplitudes.

4.4 Smoothing Norms or Higher-Order Tikhonov

Very often we seek solutions that minimize the misfit, but also some measure of the roughness of the solution. In some cases, when we minimize the minimum-norm solution with

‖f‖^2 = \int_a^b f(x)^2 \, dx

we may get the unwanted consequence of putting the estimated model only where you happen to have data. Instead, our geophysical intuition might suggest that the solution should not be very rough, so we minimize instead

‖f‖^2 = \int_a^b f'(x)^2 \, dx,    with f(a) = 0,

where we need to add a boundary condition (right-hand side). The boundary condition is needed since the derivative norm is insensitive to constants, that is, the norm of f + b is equal to the norm of f. This means we really have a semi-norm.

4.4.1 The discrete case

Assuming the model parameters are ordered in physical space (e.g., with depth, or lateral distance), we can define differential operators of the form

D_1 = \begin{bmatrix} -1 & 1 & 0 & 0 & \cdots \\ 0 & -1 & 1 & 0 & \cdots \\ 0 & 0 & -1 & 1 & \cdots \\ & & \vdots & & \ddots \end{bmatrix}

and, for the second derivative,

D_2 = \begin{bmatrix} -2 & 1 & 0 & 0 & \cdots \\ 1 & -2 & 1 & 0 & \cdots \\ 0 & 1 & -2 & 1 & \cdots \\ & & \vdots & & \ddots \end{bmatrix}

There are a few ways to implement this in the discrete case:

1. Minimize a functional of the form

   \Phi(m) = ‖d - Gm‖_2^2 + \alpha^2 ‖Dm‖_2^2

   which leads to

   \hat{m} = \left( G^T G + \alpha^2 D^T D \right)^{-1} G^T d

   Note the similarity with our previous results, where instead of the matrix D^T D we had the identity matrix I.

2. Alternatively, we can solve the coupled (stacked) system of equations

   \begin{bmatrix} d \\ 0 \end{bmatrix} = \begin{bmatrix} G \\ \alpha D \end{bmatrix} m + \eta

   which we can rewrite in a simplified way as

   d' = Hm + \eta

   so that we now have the standard expression for the inverse problem to be solved. Due to the effect of the D matrix, the ill-posedness of the original expression can be significantly reduced (depending on the chosen value of \alpha). The advantage of this approach is that one can impose additional constraints, like non-negativity.

3. We can also transform the system of equations in a similar way:

   d = Gm + \eta
   d = G D^{-1} D m + \eta
   d = G' m' + \eta

   where

   G' = G D^{-1},    m' = Dm

   As you can see, we have not changed the condition of fitting the data, so that

   ‖d - G'm'‖^2 = ‖d - Gm‖^2

   but we have also added a model norm of the form

   ‖m'‖^2 = ‖Dm‖^2

   Note that for this to actually work, the matrix D needs to be invertible. Sometimes it is possible to do this analytically. We can also use the SVD at this stage.

As a cautionary note, it is important to keep in mind that Tikhonov regularization will recover the true model only to the extent that the assumption behind the additional norm (be it ‖m‖ or ‖Dm‖) is correct. We would not expect to get the right answer in the previous examples, since the true model m_{true} is a delta function.
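A sketch of the discrete roughening operator and the corresponding higher-order Tikhonov solution (first-difference D₁ here; the stacked-system route of item 2 is shown as an alternative). The operators are generic finite-difference versions, not tied to any particular grid spacing, and the function names are my own.

```python
import numpy as np

def first_difference(N):
    """(N-1) x N first-difference operator D1."""
    D = np.zeros((N - 1, N))
    idx = np.arange(N - 1)
    D[idx, idx] = -1.0
    D[idx, idx + 1] = 1.0
    return D

def higher_order_tikhonov(G, d, D, alpha):
    """Minimize ||d - Gm||^2 + alpha^2 ||Dm||^2 via the normal equations."""
    return np.linalg.solve(G.T @ G + alpha**2 * (D.T @ D), G.T @ d)

def stacked_solution(G, d, D, alpha):
    """Same solution via the stacked system [G; alpha D] m = [d; 0]."""
    H = np.vstack([G, alpha * D])
    rhs = np.concatenate([d, np.zeros(D.shape[0])])
    m, *_ = np.linalg.lstsq(H, rhs, rcond=None)
    return m
```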

4.5 Fitting within tolerance

In real life, the data that we have acquired have some level of uncertainty. This means there is some random error \eta which we do not know, but we think we know its statistical distribution (e.g., normally distributed with zero mean and variance \sigma^2). So, in this respect we should not try to fit the data exactly, but rather fit them to within the error bars.

This method is sometimes called the discrepancy principle, but I prefer the term fitting within tolerance. In our inverse problem we want to minimize a functional with two norms,

min ‖Dm‖    and    min ‖d - Gm‖

and to do that we were looking at the L-curve, using the damped least squares or the SVD approach, that is, choosing an \alpha or a number p of non-zero singular values.

In fact, for data with uncertainties we should actually be looking at a system of the form

min ‖Dm‖    subject to    ‖d - Gm‖ ≤ T

where we arrive at the value of the tolerance T by a subjective decision about what we regard as acceptable odds of being wrong. We will almost always use the 2-norm on the data space, and thus the chi-squared statistic will be our guide.

In contrast to our previous case, we no longer have an equality. Under certain assumptions, when the minimum-norm solution (for the model norm, D = I) with Dm = 0 does not satisfy

‖d - Gm‖ ≤ T

we can try to solve the equality-constrained problem via

\Phi(m) = ‖Dm‖_2^2 + \lambda \left[ ‖d - Gm‖_2^2 - T^2 \right]

From a simple point of view, for a fixed value of T, minimization of the two terms can be regarded as seeking a compromise between two undesirable properties of the solution: the first term represents model complexity, which we wish to keep small; the second measures model misfit, also a quantity to be suppressed as far as possible. By making \lambda > 0 but small we pay attention to the penalty function at the expense of data misfit, while making \lambda large works in the other direction and allows large penalty values to secure a good match to observation.

From a more quantitative perspective, when the residual norm ‖d - Gm‖_2 is just above the tolerance T, we are not fitting the data to the level needed. But we also don't want to over-fit the data. As can be seen from the figure, if we know the threshold value, the problem is simpler, because we just need to figure out the value of the Lagrange multiplier \lambda such that the residual-norm tolerance is satisfied.

Choosing a value of the regularization parameter to the left of this threshold will fit the data better and will result in a model with a larger norm (or a rougher model) than what is required by the data. Choosing a value to the right will instead give a poor fit to the data, even within the uncertainties.

4.5.1 Example 2

First, we need to figure out the value of T. In our example, we said that the errors were

\eta ~ N(0, \sigma^2),    \sigma = 10^{-6}

Since we have M = 20 points, we need to find a solution whose residual norm is

T = ‖\eta‖_2 = \left( \sum_{i=1}^{20} \eta_i^2 \right)^{1/2} \approx \sqrt{20 \times 10^{-12}} = 4.47 \times 10^{-6}

Now that we have our value of the tolerance T, we can go back to our initial problem,

\Phi(m) = ‖m‖_2^2 + \lambda \left[ ‖d - Gm‖_2^2 - T^2 \right]

and find the value of \alpha, or the ideal value of p, that satisfies our new functional.

In this example I will use the same graphical presentation as in the previous example. Now, in addition to the L-curves obtained for the SVD and the damped least squares methods, we have our threshold value T represented by a vertical dashed line. We pick the value on the L-curve that is closest to T. In the SVD approach, since we have discrete singular values, we choose the one that is closest, while for the DLS we could in fact get really close. In both cases, I just show approximate values, using my discretization of the α values used for plotting the figure.
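A sketch of the discrepancy-principle search for the damped-least-squares case (function names are my own; a simple bisection in log α suffices because the residual norm increases monotonically with α, assuming the tolerance T lies between the residual norms at the bracketing values).

```python
import numpy as np

def dls_residual_norm(G, d, alpha):
    N = G.shape[1]
    m = np.linalg.solve(G.T @ G + alpha**2 * np.eye(N), G.T @ d)
    return np.linalg.norm(d - G @ m), m

def discrepancy_alpha(G, d, T, lo=1e-12, hi=1e2, tol=1e-3, maxit=100):
    """Bisection in log(alpha) for ||d - G m_alpha|| = T."""
    for _ in range(maxit):
        mid = np.sqrt(lo * hi)              # geometric midpoint
        r, _ = dls_residual_norm(G, d, mid)
        if r > T:
            hi = mid                        # over-regularized: decrease alpha
        else:
            lo = mid                        # over-fitting: increase alpha
        if np.log10(hi / lo) < tol:
            break
    return np.sqrt(lo * hi)
```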
Figure 4.3: Fitting within tolerance with the DLS and SVD approaches. Our preferred model is the blue one. Top-left: the L-curve for the residual norm and model norm; the SVD curve has been shifted upwards for clarity, the value of T is shown as a vertical dashed line, and various choices of α around T are marked. Top-right: true model (circles) and estimated models for the choices on the left. Bottom-left: the synthetic data (circles) and predicted data. Bottom-right: for the SVD, the singular values and Picard criteria are shown.

Figure 4.4: Resolution and covariance matrices for the DLS (top two rows of panels) and SVD (bottom two rows) approaches, while fitting within tolerance. Note that since the SVD approach is discrete in nature, we might not get an ideal selection, hence the repeated value of p; using the filter-factor approach might lead to better results. Our preferred value is the column in the middle.
