
J. Harring
EDMS 779
Nonlinear Least Squares

1. Estimation and Optimization


Differential calculus has widespread application in statistics. In our discussions, we will eventually need derivatives of complex functions (e.g., a likelihood function or a least squares criterion function). The complexity of these functions is that they are functions of multiple parameters (some fundamental and others functions of parameters), and estimating the values of these parameters from measured/empirical data is our chief concern. The parameters describe an underlying physical setting in such a way that their values affect the distribution of the measured data. An estimator, the analytic derivation of parameter values, attempts to approximate the unknown parameters using the measurements. Estimation, then, is the process of determining values for the parameters of a physical model based on measured data. There are many estimators in practice today, each characterized by a specific criterion to be satisfied (e.g., least squares, maximum likelihood, Bayes, maximum a posteriori, method of moments, etc.). Optimization, on the other hand, works within a chosen estimation scheme to find the particular parameter values that satisfy that criterion. This optimization is typically carried out using numerical methods, and many methods can be employed within an estimation framework (e.g., Gauss-Newton, Newton-Raphson, EM, the Gibbs sampler, the Nelder-Mead simplex, etc.). Our task is to integrate estimation and optimization; to that end, we will need some analytic skills, namely finding derivatives with respect to vectors and matrices.

1.1 Least Squares Estimation

Many optimization procedures require derivatives of matrices and vectors to facilitate estimation of model parameters. While it is possible to differentiate matrices, vectors, or scalars with respect to other matrices, vectors, or scalars, the most common case involves differentiating a scalar quantity (a log-likelihood or a residual sum of squares) with respect to a vector (often a parameter vector). Instead of laying out the definitions (and examples) for all possible combinations of differentiation, we will introduce specific differentiation results as they arise while working through the optimization procedures.

2. Gauss-Newton Optimization for Least Squares Problems


The Gauss-Newton algorithm is a method used to solve nonlinear least squares problems; it performs well and is easy to implement. It can be seen as a modification of Newton's method for finding a minimum of a function. Unlike Newton's method, the Gauss-Newton algorithm can only be used to minimize a sum of squared function values, but it has the distinct advantage that second derivatives, which can be challenging to compute, are not required. On the other hand, the method does suffer from occasional non-convergence, depending on, among other things, whether the Jacobian matrix is ill-conditioned.

The nonlinear least squares problem arises most commonly from data fitting applications, where one is attempting to fit data $(x_i, y_i)$, $i = 1, \ldots, m$, with a model $f(\boldsymbol\theta, x)$ that is nonlinear in $\boldsymbol\theta$. In this case, the residuals are $r_i = y_i - f(\boldsymbol\theta, x_i)$, and the nonlinear least squares problem consists of choosing $\boldsymbol\theta$ so that the fit is as close as possible in the sense that the sum of the squares of the residuals is minimized. That is,

$$S(\boldsymbol\theta) = \sum_{i=1}^{m} r_i^2$$

is minimized. The information needed is the model and the partial derivatives of the model with respect to the parameters. The minimum value of $S$ occurs when the gradient is zero. Since the model $f$ contains $p$ parameters, there are $p$ gradient equations to solve simultaneously:
$$\frac{\partial S}{\partial \theta_j} = 2\sum_{i=1}^{m} r_i\,\frac{\partial r_i}{\partial \theta_j} = 0, \qquad j = 1, \ldots, p \tag{1}$$

In a nonlinear system, the derivatives $\partial r_i/\partial\theta_j$ are functions of both the independent variable and the parameters, so these gradient equations do not have an analytic (closed-form) solution as they do in linear least squares problems, as we will see shortly. This is one way to tell whether the solution to a problem (like repeated measures or factor analysis) has a closed form: in that case the design matrix is a function of the data only.

The method uses a first-order Taylor series expansion of the model function about the current parameter values $\boldsymbol\theta^k$ to facilitate updating the parameters:

$$f(\boldsymbol\theta, x_i) \approx f(\boldsymbol\theta^k, x_i) + \sum_{j=1}^{p} \frac{\partial f(\boldsymbol\theta^k, x_i)}{\partial\theta_j}\,(\theta_j - \theta_j^k) = f(\boldsymbol\theta^k, x_i) + \sum_{j=1}^{p} J_{ij}\,\Delta\theta_j$$

Here the Jacobian matrix $\mathbf{J}$, with elements $J_{ij} = \partial f(\boldsymbol\theta^k, x_i)/\partial\theta_j$, is a function of constants, the independent variable, and the model parameters, so it changes from one iteration to the next. In terms of this linearization of the model, $\partial r_i/\partial\theta_j = -J_{ij}$, and the residuals are given by

$$r_i = \bigl(y_i - f(\boldsymbol\theta^k, x_i)\bigr) - \sum_{j=1}^{p} J_{ij}\,\Delta\theta_j$$

Substituting these expressions into the gradient equations in (1), we get

$$-2\sum_{i=1}^{m} J_{ij}\left( y_i - f(\boldsymbol\theta^k, x_i) - \sum_{s=1}^{p} J_{is}\,\Delta\theta_s \right) = 0, \qquad j = 1, \ldots, p$$

Rearranging terms, we obtain $p$ simultaneous linear equations, the normal equations:

$$\sum_{i=1}^{m}\sum_{k=1}^{p} J_{ij} J_{ik}\,\Delta\theta_k = \sum_{i=1}^{m} J_{ij}\bigl(y_i - f(\boldsymbol\theta^k, x_i)\bigr), \qquad j = 1, \ldots, p$$

or in matrix notation

$$(\mathbf{J}'\mathbf{J})\,\Delta\boldsymbol\theta = \mathbf{J}'\bigl(\mathbf{y} - \mathbf{f}(\boldsymbol\theta^k, \mathbf{x})\bigr)$$

It is clear, then, that the update vector $\Delta\boldsymbol\theta$ can be found by pre-multiplying both sides of the equation by $(\mathbf{J}'\mathbf{J})^{-1}$:

$$\Delta\boldsymbol\theta = (\mathbf{J}'\mathbf{J})^{-1}\mathbf{J}'\bigl(\mathbf{y} - \mathbf{f}(\boldsymbol\theta^k, \mathbf{x})\bigr)$$

The iterative scheme below updates the parameters until some convergence criterion is met:

$$\boldsymbol\theta^{k+1} = \boldsymbol\theta^{k} + \Delta\boldsymbol\theta$$

To reiterate, the Gauss-Newton algorithm requires only the residuals and the Jacobian matrix (the first derivatives of the model with respect to the parameters, evaluated at each data point).
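To make the recipe concrete, here is a minimal sketch, not part of the original notes' code, of one way the Gauss-Newton iteration could be written in R. The names (gauss_newton, f, jac) are illustrative only, and it assumes the user supplies the model function and its Jacobian.

# Minimal Gauss-Newton sketch (illustrative; function names are hypothetical)
# f(theta, x) returns fitted values; jac(theta, x) returns the m x p Jacobian
gauss_newton <- function(theta, x, y, f, jac, maxiter = 50, tol = 1e-8) {
  for (k in 1:maxiter) {
    r <- y - f(theta, x)                            # current residuals
    J <- jac(theta, x)                              # m x p Jacobian at current theta
    delta <- solve(crossprod(J), crossprod(J, r))   # solves (J'J) delta = J'r
    theta <- theta + as.vector(delta)               # theta_{k+1} = theta_k + delta
    if (max(abs(delta)) < tol) break                # stop when the step is negligible
  }
  theta
}

Solving the normal equations with solve(crossprod(J), crossprod(J, r)) avoids forming the explicit inverse $(\mathbf{J}'\mathbf{J})^{-1}$, which is generally the more stable numerical choice. Applied to the exponential growth model in Section 4 (using fun for f and firder for jac), this should converge to essentially the same least squares estimates as the Newton-Raphson run reported there.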

3. Newton-Raphson Algorithm for Nonlinear Least Squares


Besides the Gauss-Newton method, several other approaches have been applied to nonlinear least squares problems. These alternatives generally are no faster than Gauss-Newton in terms of time to solution, they may be more difficult to implement, and in some cases they may be both slower and more trouble to set up. The appeal of these methods is that they are appropriate to a wider class of problems than is Gauss-Newton. For example, although Gauss-Newton is attractive for least squares, it does not translate easily to other kinds of maximum likelihood problems. For this reason it's valuable to introduce other optimization strategies.

One approach that works well with a wide class of maximum likelihood models is the Newton-Raphson method. With nonlinear least squares problems, Newton-Raphson generally gets to the solution in about the same amount of time as Gauss-Newton. It is more difficult to implement than Gauss-Newton, and the amount of calculation per iteration is greater. However, many people actually prefer Newton-Raphson because it is more widely useful in statistical estimation than Gauss-Newton. No single approach always works satisfactorily, but if you want a general purpose algorithm, Newton-Raphson may be the best of a bad lot. Here is the way it is applied in nonlinear regression.

Let $f(\boldsymbol\theta, x_i)$, with parameters $\boldsymbol\theta = (\theta_1, \ldots, \theta_p)'$ and independent variable $x_i$, denote a model for response $y_i$. The least squares criterion from above is

$$S(\boldsymbol\theta) = \sum_{i=1}^{m}\bigl(y_i - f(\boldsymbol\theta, x_i)\bigr)^2$$

The partial derivative of the criterion function (the least squares criterion in this case) with respect to the parameters is

$$g_j = \frac{\partial S}{\partial\theta_j} = -2\sum_{i=1}^{m}\bigl(y_i - f(\boldsymbol\theta, x_i)\bigr)\,\frac{\partial f(\boldsymbol\theta, x_i)}{\partial\theta_j} \tag{2}$$

Note that the partial derivative of the least squares criterion utilizes the partial derivatives of the response function, just as in the Gauss-Newton algorithm. Specifically, $\partial f(\boldsymbol\theta, x_i)/\partial\theta_j$ is the partial derivative of the model with respect to the jth parameter, evaluated at the ith observation. The gradient vector of $S$ is the collection of partial derivatives of the criterion function

$$\mathbf{g} = \frac{\partial S}{\partial\boldsymbol\theta} = \begin{bmatrix} g_1 \\ \vdots \\ g_p \end{bmatrix} \tag{3}$$

The symmetric matrix of second partial derivatives of the criterion function with respect to the parameters comprises the Hessian matrix of the function

$$\mathbf{H} = \frac{\partial^2 S}{\partial\boldsymbol\theta\,\partial\boldsymbol\theta'} = \begin{bmatrix} h_{11} & & & \\ h_{21} & h_{22} & & \\ \vdots & \vdots & \ddots & \\ h_{p1} & h_{p2} & \cdots & h_{pp} \end{bmatrix} \tag{4}$$

where the jk-th element of H is


$$
\begin{aligned}
h_{jk} = \frac{\partial^2 S}{\partial\theta_j\,\partial\theta_k}
&= \frac{\partial}{\partial\theta_j}\left[-2\sum_{i=1}^{m}\bigl(y_i - f(\boldsymbol\theta, x_i)\bigr)\frac{\partial f(\boldsymbol\theta, x_i)}{\partial\theta_k}\right] \\
&= 2\sum_{i=1}^{m}\frac{\partial f(\boldsymbol\theta, x_i)}{\partial\theta_j}\,\frac{\partial f(\boldsymbol\theta, x_i)}{\partial\theta_k}
 - 2\sum_{i=1}^{m}\bigl(y_i - f(\boldsymbol\theta, x_i)\bigr)\frac{\partial^2 f(\boldsymbol\theta, x_i)}{\partial\theta_j\,\partial\theta_k}
\end{aligned} \tag{5}
$$

In words, the second partial derivatives of the criterion function depend on the residuals, $r_i = y_i - f(\boldsymbol\theta, x_i)$, plus the first and second derivatives of the response function with respect to the parameters.

Let $\boldsymbol\theta^{(n)}$ denote the values of the model parameters at the nth iteration. The Newton-Raphson update is

$$\boldsymbol\theta^{(n+1)} = \boldsymbol\theta^{(n)} - \bigl(\mathbf{H}^{(n)}\bigr)^{-1}\mathbf{g}^{(n)}$$
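In R, assuming the gradient g and the Hessian H have already been evaluated at the current parameter values theta (as they are in the implementation of Section 4.1), the update is a single linear solve; this fragment is only an illustration of the formula above.

# Newton-Raphson step (illustrative; theta, H, g assumed already computed)
theta_new <- theta - solve(H, g)   # solve(H, g) avoids forming H^{-1} explicitly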

To take advantage of the linear algebra functions of many computer languages, it is convenient to express g and H as simple vector products. Let the m x 1 vector of residuals be

$$\mathbf{r} = \mathbf{y} - \mathbf{f}(\boldsymbol\theta, \mathbf{x})$$
Define the m x p Jacobian matrix of the model with respect to the parameters as

$$\mathbf{F} = \left[\frac{\partial \mathbf{f}(\boldsymbol\theta, \mathbf{x})}{\partial\theta_1}, \ldots, \frac{\partial \mathbf{f}(\boldsymbol\theta, \mathbf{x})}{\partial\theta_p}\right]$$

The number of non-duplicated elements of a Hessian matrix of order $p$ is $p^* = p(p+1)/2$. Define the m x $p^*$ matrix $\ddot{\mathbf{F}}$ as the matrix whose columns are the second partial derivatives of the model with respect to the jk-th elements of $\boldsymbol\theta$. For example, if $p = 3$, then the $p^* = 6$ columns of $\ddot{\mathbf{F}}$ are

$$\ddot{\mathbf{F}} = \left[\frac{\partial^2 \mathbf{f}(\boldsymbol\theta,\mathbf{x})}{\partial\theta_1^2},\; \frac{\partial^2 \mathbf{f}(\boldsymbol\theta,\mathbf{x})}{\partial\theta_2\,\partial\theta_1},\; \frac{\partial^2 \mathbf{f}(\boldsymbol\theta,\mathbf{x})}{\partial\theta_2^2},\; \frac{\partial^2 \mathbf{f}(\boldsymbol\theta,\mathbf{x})}{\partial\theta_3\,\partial\theta_1},\; \frac{\partial^2 \mathbf{f}(\boldsymbol\theta,\mathbf{x})}{\partial\theta_3\,\partial\theta_2},\; \frac{\partial^2 \mathbf{f}(\boldsymbol\theta,\mathbf{x})}{\partial\theta_3^2}\right]$$

Then (3) can be written as the product $\mathbf{g} = -2\mathbf{F}'\mathbf{r}$, and the elements of $\mathbf{H}$ are

$$h_{jk} = h_{kj} = 2\bigl[\mathbf{F}'\mathbf{F}\bigr]_{jk} - 2\bigl[\mathbf{r}'\ddot{\mathbf{F}}\bigr]_{L}$$

where the index function is

$$L(j,k) = k + j(j-1)/2, \qquad j \ge k$$

The notation $[\mathbf{A}]_L$ denotes taking the Lth column of $\mathbf{A}$. For example, in computing $h_{32}$ the appropriate column of $\mathbf{r}'\ddot{\mathbf{F}}$ is $L = 2 + 3(2)/2 = 5$. Note that the orders of $\mathbf{F}'\mathbf{F}$ and $\mathbf{r}'\ddot{\mathbf{F}}$ are p x p and 1 x $p^*$, respectively.
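A small R sketch of the index function (not part of the notes' code; included only to make the column-selection rule concrete):

# Index into the columns of the second-derivative matrix, for j >= k
L <- function(j, k) k + j * (j - 1) / 2
L(3, 2)   # returns 5, matching the h_32 example above
L(2, 2)   # returns 3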

4. A Growth Example
The table below has the weight in grams of one mouse, measured weekly from birth (week 0) through week 8.

weeks:   0   1   2   3   4   5   6   7   8
weight: 13  21  28  36  36  39  39  41  42

For this experiment, a version of the exponential growth model is considered:

$$f(\boldsymbol\theta, x) = \theta_F + (\theta_0 - \theta_F)\exp(-\lambda x)$$

This form is attractive because its parameters have a nice interpretation:

$\theta_0$: weight at birth
$\theta_F$: weight at maturity
$\lambda$: parameter governing the rate of growth

To implement the Newton-Raphson method for nonlinear regression, gobs of derivatives are used. For the gradient vector and the Hessian matrix of $S$, derivatives of $f(\boldsymbol\theta, x_i)$ are needed. Specifically, there are $p = 3$ first-order derivatives,

$$\frac{\partial f}{\partial\theta_0} = \exp(-\lambda x), \qquad \frac{\partial f}{\partial\theta_F} = 1 - \exp(-\lambda x), \qquad \frac{\partial f}{\partial\lambda} = -x(\theta_0 - \theta_F)\exp(-\lambda x),$$

and $p^* = 6$ second-order derivatives,

$$\frac{\partial^2 f}{\partial\theta_0^2} = \frac{\partial^2 f}{\partial\theta_F\,\partial\theta_0} = \frac{\partial^2 f}{\partial\theta_F^2} = 0, \qquad \frac{\partial^2 f}{\partial\lambda\,\partial\theta_0} = -x\exp(-\lambda x),$$

$$\frac{\partial^2 f}{\partial\lambda\,\partial\theta_F} = x\exp(-\lambda x), \qquad \frac{\partial^2 f}{\partial\lambda^2} = x^2(\theta_0 - \theta_F)\exp(-\lambda x).$$

4.1 Implementation in R

#Newton-Raphson Method for Estimating Parameters
#in a Nonlinear Least Squares Application
x = seq(9)-1
y = c(13, 21, 28, 36, 36, 39, 39, 41, 42)
x; y    #look at the data

plot(x,y,xlim=c(0,9),ylim=c(0,50),xlab="Weeks",
     ylab="Weight (in grams)", pch=20,
     main="Weekly Weight of One Mouse", bty="l")

th=c(10,47,.5)    #Starting values for parameter vector

#Exponential model
fun=function(th,x){
  y = th[2]+((th[1]-th[2])*exp(-th[3]*x))
  return(y)
}
fun(th,x)    #try it out -- should give you fitted values

#Calculate the residual sum of squares
resid=function(th,x,y){
  yhat=fun(th,x)
  rss=t(y-yhat)%*%(y-yhat)
  return(rss)
}
resid(th,x,y)    #try it out -- should give a scalar for output

#Calculate the Jacobian matrix
firder=function(th,x){
  j1=exp(-th[3]*x)
  j2=1-j1
  j3=-x*(th[1]-th[2])*exp(-th[3]*x)
  j=cbind(j1,j2,j3)
  return(j)
}
firder(th,x)

#Calculate the Second Derivatives
secder=function(th,x){
  h11=rep(0,times=length(x))
  h21=h11
  h22=h11
  h31=-x*exp(-th[3]*x)
  h32=-h31
  h33=(x**2)*((th[1]-th[2])*exp(-th[3]*x))
  hmat=cbind(h11,h21,h22,h31,h32,h33)
  return(hmat)
}
secder(th,x)

#Gradient evaluated at "th"
grad=function(th,x,y){
  fd=firder(th,x)
  hm=secder(th,x)
  yhat=fun(th,x)
  r=y-yhat
  g=-2*t(fd)%*%r    #Gradient vector
  return(g)
}
grad(th,x,y)

#Hessian matrix evaluated at "th"
hess=function(th,x,y){
  np=length(th)
  H=matrix(nrow=np, ncol=np)
  fd=firder(th,x)
  hm=secder(th,x)
  yhat=fun(th,x)
  r=y-yhat
  j=1; m=0
  for(j in 1:np){
    k=1
    while(k <= j){
      m = m + 1
      t1=2*(t(fd[,j])%*%fd[,k])
      t2=2*(t(r)%*%hm[,m])
      H[j,k]=t1-t2
      H[k,j]=H[j,k]
      k = k + 1
    }
    j = j + 1
  }
  return(H)
}
hess(th,x,y)
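As an optional sanity check that is not part of the original program, the analytic Jacobian returned by firder() can be compared against a simple forward-difference approximation; a large discrepancy would point to an algebra slip in the derivative formulas. The helper name numjac and the step size eps are my own choices.

# Forward-difference check of the analytic Jacobian (illustrative)
numjac <- function(th, x, eps = 1e-6) {
  J <- matrix(0, nrow = length(x), ncol = length(th))
  f0 <- fun(th, x)
  for (j in seq_along(th)) {
    thp <- th
    thp[j] <- thp[j] + eps
    J[, j] <- (fun(thp, x) - f0) / eps   # column j approximates df/d(theta_j)
  }
  J
}
max(abs(numjac(th, x) - firder(th, x)))   # should be small (on the order of eps)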

##################################
# Main Newton-Raphson Loop
##################################
final=function(maxiter=200,tol=1E-5){
  cat("Iteration","  ","RSS","  ","Max_Gradient","  ","Parameters .................","\n")
  cat("  ",0,"  ",resid(th,x,y),"  ","            ",th[1],"  ",th[2],"  ",th[3],"\n")
  it=1
  while (it<=maxiter){
    g=grad(th,x,y)
    H=hess(th,x,y)
    newth=th - (solve(H)%*%g)
    res=resid(newth,x,y)
    maxg=max(abs(g))
    cat("  ",it,"  ",res,"  ",maxg,"  ",newth[1],"  ",newth[2],"  ",newth[3],"\n")
    if(maxg<tol) break
    th=newth
    it=it+1
  }
  plot(x,y,xlim=c(0,9),ylim=c(0,50),xlab="Weeks",
       ylab="Weight (in grams)", pch=20,
       main="Weekly Weight of One Mouse", bty="l")
  a=seq(0,9,.01)
  lines(a,fun(newth,a),lwd=3)
}
final()

R Output:
final()
Iteration        RSS    Max_Gradient     Parameters .................
    0       199.2118                     10        47        0.5
    1        16.0327    1239.178         11.21155  41.58776  0.4780019
    2        13.4939     116.7324        12.51038  43.28126  0.3548456
    3        10.9301     229.2958        12.50574  43.43303  0.3723246
    4        10.8385      18.30822       12.40910  43.15041  0.3837244
    5        10.8380       2.761367      12.40831  43.15154  0.3840416
    6        10.8380       0.004395168   12.40829  43.15148  0.3840443
    7        10.8380       1.466024e-07  12.40829  43.15148  0.3840443

[Figure: Weekly Weight of One Mouse. Observed weights (in grams) plotted against weeks, with the fitted exponential growth curve overlaid.]
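As a cross-check that is not part of the original notes, the same model can be fit with base R's nls() function, which should arrive at essentially the same least squares estimates as the Newton-Raphson run above. The parameter names th0, thF, and lam below are my own labels for the model parameters.

# Cross-check with base R's nonlinear least squares routine
fit <- nls(y ~ thF + (th0 - thF) * exp(-lam * x),
           start = list(th0 = 10, thF = 47, lam = 0.5))
summary(fit)   # estimates should be close to 12.41, 43.15, and 0.384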

5. Least Squares Analytically


In subsection 3.3 and sub-subsection 3.3.1 of the last set of notes, we derived an analytic solution to the simple linear regression problem using least squares estimation. In that case it was fairly easy to write down the gradient, a (2 x 1) vector of first partial derivatives of the criterion function, and even the Hessian, a (2 x 2) matrix of second partial derivatives of the least squares function with respect to the parameters. For linear regression models with more than two variables, the idea is exactly the same, except that the numerical work becomes a little more tedious. In this case, we can circumvent the actual numerical work by employing matrix algebra and some differential calculus. If $y$ is a function of 3 or more variables, the same procedure applies as before: $y$ will have slopes (partial derivatives) parallel to (in the direction of) each constituent variable $(b_0, b_1, b_2, \ldots)$. Set all derivatives equal to zero and then solve for the unknowns.

Let's start by looking at multiple linear regression from a matrix point of view. For a linear regression model, let $\mathbf{y}$ and $\mathbf{e}$ be m x 1 vectors whose elements are given by the $y_i$'s and the $e_i$'s, for example,

$$\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix}, \qquad \mathbf{e} = \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_m \end{bmatrix}$$

Also define $\mathbf{b}$ to be the parameter vector of length (p + 1) x 1, including the intercept $b_0$,

$$\mathbf{b} = \begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_p \end{bmatrix}$$

Next, define X to be an m x (p + 1) matrix given by,

$$\mathbf{X} = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{m1} & x_{m2} & \cdots & x_{mp} \end{bmatrix}$$

The matrix X gives all the observed values of the predictors, appended to a column of 1's as the leftmost column. The ith row of X corresponds to the values for the ith case in the data; the columns of X correspond to the different predictors. Using these quantities, the multiple regression equation can be written in matrix terms as

$$\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{e}$$

The goal is to find the vector $\mathbf{b}$ that minimizes the residual sum of squared deviations. As we have seen plenty of times now, this means differentiating the least squares criterion function, setting the derivative vector (gradient) equal to zero, and solving for the vector $\mathbf{b}$.

$$
\begin{aligned}
S(\mathbf{b}) &= \mathbf{e}'\mathbf{e} \\
&= (\mathbf{y} - \hat{\mathbf{y}})'(\mathbf{y} - \hat{\mathbf{y}}) \\
&= (\mathbf{y} - \mathbf{X}\mathbf{b})'(\mathbf{y} - \mathbf{X}\mathbf{b}) \\
&= (\mathbf{y}' - \mathbf{b}'\mathbf{X}')(\mathbf{y} - \mathbf{X}\mathbf{b}) \\
&= \mathbf{y}'\mathbf{y} - \mathbf{y}'\mathbf{X}\mathbf{b} - \mathbf{b}'\mathbf{X}'\mathbf{y} + \mathbf{b}'\mathbf{X}'\mathbf{X}\mathbf{b} \\
&= \mathbf{y}'\mathbf{y} - 2\mathbf{b}'\mathbf{X}'\mathbf{y} + \mathbf{b}'\mathbf{X}'\mathbf{X}\mathbf{b}
\end{aligned} \tag{6}
$$

Why can the middle two terms in line 5 of (6) be combined? (Because $\mathbf{y}'\mathbf{X}\mathbf{b}$ is a scalar, it equals its own transpose, $\mathbf{b}'\mathbf{X}'\mathbf{y}$.) Find the vector $\mathbf{b}$ such that

$$\frac{\partial S}{\partial\mathbf{b}} = \mathbf{0}$$
Using some of the rules of differentiation learned thus far, we can write the gradient as

$$\frac{\partial S}{\partial\mathbf{b}} = \frac{\partial(\mathbf{y}'\mathbf{y})}{\partial\mathbf{b}} + \frac{\partial(-2\mathbf{b}'\mathbf{X}'\mathbf{y})}{\partial\mathbf{b}} + \frac{\partial(\mathbf{b}'\mathbf{X}'\mathbf{X}\mathbf{b})}{\partial\mathbf{b}} = \mathbf{0} - 2\mathbf{X}'\mathbf{y} + 2\mathbf{X}'\mathbf{X}\mathbf{b}$$


We set the gradient equal to zero and solve. At this point, there is nothing in the rule books that says we could not find the least squares solution for the vector $\mathbf{b}$ by using the Gauss-Newton or Newton-Raphson optimization process. The difference is that we don't have to here, because the system of equations making up the gradient is linear in $\mathbf{b}$, which is not the case in nonlinear least squares. Because this is a linear system, a closed form solution is obtainable.
$$
\begin{aligned}
\mathbf{0} &= -2\mathbf{X}'\mathbf{y} + 2\mathbf{X}'\mathbf{X}\mathbf{b} \\
2\mathbf{X}'\mathbf{X}\mathbf{b} &= 2\mathbf{X}'\mathbf{y} \\
\mathbf{X}'\mathbf{X}\mathbf{b} &= \mathbf{X}'\mathbf{y} \\
(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}\mathbf{b} &= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} \\
\hat{\mathbf{b}} &= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}
\end{aligned}
$$

The fitted values from the regression model are simply

$$\hat{\mathbf{y}} = \mathbf{X}\hat{\mathbf{b}} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$$
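A brief sketch, not from the notes, showing the closed-form solution in R on a small simulated data set; the variable names (z1, z2, w) and the toy data are my own, and lm() is used only to verify that the matrix algebra reproduces R's built-in least squares fit.

# Closed-form least squares on a toy design (illustrative)
set.seed(1)
z1 <- rnorm(20); z2 <- rnorm(20)
w  <- 2 + 1.5 * z1 - 0.5 * z2 + rnorm(20, sd = 0.3)
X  <- cbind(1, z1, z2)                         # design matrix with leading column of 1's
bhat <- solve(crossprod(X), crossprod(X, w))   # (X'X)^{-1} X'y
yhat <- X %*% bhat                             # fitted values X(X'X)^{-1}X'y
cbind(bhat, coef(lm(w ~ z1 + z2)))             # the two columns should agree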

The variance/covariance matrix of the regression coefficients can be found, assuming $\mathrm{var}(\mathbf{y}) = \sigma^2\mathbf{I}$, as

$$
\begin{aligned}
\mathrm{var}(\hat{\mathbf{b}}) &= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\,\mathrm{var}(\mathbf{y})\,\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1} \\
&= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}\,\mathrm{var}(\mathbf{y})\,(\mathbf{X}'\mathbf{X})^{-1} \\
&= \sigma^2(\mathbf{X}'\mathbf{X})^{-1}
\end{aligned}
$$

How do we know that the parameter vector $\hat{\mathbf{b}}$ minimizes the residual sum of squares? Right, we need the Hessian matrix to be positive definite (eigenvalues > 0).

$$\mathbf{H} = \frac{\partial^2 S}{\partial\mathbf{b}\,\partial\mathbf{b}'} = 2\mathbf{X}'\mathbf{X}$$

As long as X is of full column rank, $\mathbf{X}'\mathbf{X}$ will have positive eigenvalues, and the parameter vector $\hat{\mathbf{b}}$ minimizes the residual sum of squares criterion function.
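Continuing the toy design matrix X from the sketch above (again, not part of the notes), the positive-definiteness condition can be inspected directly:

# Eigenvalues of the Hessian 2X'X; all should be strictly positive
eigen(2 * crossprod(X))$values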
