problem consists of choosing θ so that the fit is as close as possible in the sense that the sum of the squares of the residuals is minimized. That is,

S = Σ_{i=1}^m r_i^2

is minimized. The information needed is the model and the partial derivatives of the model with respect to the parameters. The minimum value of S occurs when the gradient is zero. Since the model f contains p parameters, there are p gradient equations to solve simultaneously.
∂S/∂θ_j = 2 Σ_{i=1}^m r_i (∂r_i/∂θ_j) = 0,    j = 1, ..., p        (1)
The r_i are functions of both the independent variable and the parameters θ_j, so these gradient equations do not have an analytic (or closed-form) solution like they do in linear least squares problems, as we will see shortly. This is one of the ways to tell whether the solution to a problem (like repeated measures or factor analysis) has a closed-form solution: the design matrix will be a function of the data only.
The method uses a Taylor series expansion of the model function about θ^k to facilitate updating the parameters.

f(θ, x_i) ≈ f(θ^k, x_i) + Σ_{j=1}^p [∂f(θ^k, x_i)/∂θ_j] (θ_j − θ_j^k) = f(θ^k, x_i) + Σ_{j=1}^p J_ij Δθ_j
Here, the Jacobian matrix, J, is a function of constants, the independent variable, and the model parameters, so it will change from one iteration to the next. In terms of the linearization of the model, ∂r_i/∂θ_j = −J_ij, and the residuals are given by

r_i = (y_i − f(θ^k, x_i)) − Σ_{j=1}^p J_ij Δθ_j
Substituting these expressions into the gradient equations in (1) and rearranging gives the normal equations

Σ_{i=1}^m Σ_{k=1}^p J_ij J_ik Δθ_k = Σ_{i=1}^m J_ij (y_i − f(θ, x_i)),    j = 1, ..., p
or in matrix notation
(J'J) Δθ = J'(y − f(θ, x))
It is clear then that the update vector, Δθ, can be found by pre-multiplying both sides of the equation by (J'J)^(-1).
Δθ = (J'J)^(-1) J'(y − f(θ, x))
The iterative scheme below will update the parameters until some convergence criterion is met.

θ^(k+1) = θ^k + Δθ

To reiterate, the Gauss-Newton algorithm requires the residuals and the Jacobian matrix (the first derivatives of the model with respect to the parameters, evaluated at each data point).
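To make the recipe concrete, here is a minimal R sketch of the Gauss-Newton iteration (not part of the original notes). It assumes a model function fun(th,x) returning fitted values and a function jac(th,x) returning the m x p Jacobian of the model; both names are placeholders. With the functions fun() and firder() defined in the growth example later, a call like gauss.newton(th, x, y, fun, firder) would carry out the iteration.

#Minimal Gauss-Newton sketch (hypothetical helpers fun() and jac())
gauss.newton=function(th,x,y,fun,jac,maxiter=100,tol=1e-8){
  for(it in 1:maxiter){
    r=y-fun(th,x)                        #current residuals
    J=jac(th,x)                          #m x p Jacobian of the model
    delta=solve(t(J)%*%J, t(J)%*%r)      #solve (J'J) delta = J'r
    th=th+as.vector(delta)               #th^(k+1) = th^k + delta
    if(max(abs(delta))<tol) break        #stop when the update is negligible
  }
  return(th)
}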
The Newton-Raphson method works with the criterion function directly. The least squares criterion is

S(θ) = Σ_{i=1}^m (y_i − f(θ, x_i))^2
The partial derivative of the criterion function (the least squares criterion in this case) with respect to the parameters is
g_j = ∂S/∂θ_j = −2 Σ_{i=1}^m (y_i − f(θ, x_i)) ∂f(θ, x_i)/∂θ_j        (2)
Note that the partial derivative of the least squares criterion utilizes the partial derivatives of the response function, just as in the Gauss-Newton algorithm. Specifically, ∂f(θ, x_i)/∂θ_j is the partial derivative of the model with respect to the jth parameter, evaluated at the ith observation. The gradient vector of S is the collection of partial derivatives of the criterion function
g = ∂S/∂θ = (g_1, g_2, ..., g_p)'        (3)

The symmetric matrix of second partial derivatives of the criterion function with respect to the parameters comprises the Hessian matrix of the function

H = ∂^2 S/∂θ∂θ' =
    [ h_11  h_12  ...  h_1p ]
    [ h_21  h_22  ...  h_2p ]
    [  ...                  ]
    [ h_p1  h_p2  ...  h_pp ]        (4)

with typical element
h_jk = ∂^2 S/∂θ_j∂θ_k = ∂/∂θ_k [ −2 Σ_{i=1}^m (y_i − f(θ, x_i)) ∂f(θ, x_i)/∂θ_j ]
     = 2 Σ_{i=1}^m [∂f(θ, x_i)/∂θ_j][∂f(θ, x_i)/∂θ_k] − 2 Σ_{i=1}^m (y_i − f(θ, x_i)) ∂^2 f(θ, x_i)/∂θ_j∂θ_k        (5)
In words, the second partial derivatives of the criterion function depend on the residuals, r_i = y_i − f(θ, x_i), plus first and second derivatives of the response function with respect to the parameters. Let θ^(n) denote the values of the model parameters at the nth iteration. The Newton-Raphson update is

θ^(n+1) = θ^(n) − (H^(n))^(-1) g^(n)
To take advantage of the linear algebra functions of many computer languages, it is convenient to express g and H as simple vector products. Let the m x 1 vector of residuals be
r = y − f(θ, x)
Define the m x p Jacobian matrix of the model with respect to the parameters as
F = [ ∂f(θ, x)/∂θ_1 , ∂f(θ, x)/∂θ_2 , ... , ∂f(θ, x)/∂θ_p ]

The number of non-duplicated elements of a Hessian matrix of order p is p* = p(p + 1)/2. Define the m x p* matrix F̈ as the matrix whose columns are the second partial derivatives of the model with respect to the jk-th pair of elements of θ. For example, if p = 3 then the p* = 6 columns of F̈ are

F̈ = [ ∂^2 f/∂θ_1^2 , ∂^2 f/∂θ_2∂θ_1 , ∂^2 f/∂θ_2^2 , ∂^2 f/∂θ_3∂θ_1 , ∂^2 f/∂θ_3∂θ_2 , ∂^2 f/∂θ_3^2 ]
The column of F̈ holding ∂^2 f/∂θ_j∂θ_k (with j ≥ k) is column

L(j, k) = k + j(j − 1)/2
The notation [A]_L denotes taking the Lth column of A. In these terms, the gradient and Hessian of S can be computed as g = −2 F'r and h_jk = 2 [F'F]_jk − 2 [r'F̈]_L(j,k). For example, in computing h_32 the appropriate column of r'F̈ is L = 2 + 3(2)/2 = 5. Note that the orders of F'F and r'F̈ are p x p and 1 x p*, respectively.
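A quick R illustration (not part of the original notes) of the column-index rule; Lindex is a hypothetical helper name.

#Column of the second-derivative matrix holding the (j,k) second derivative, j >= k
Lindex=function(j,k){ k + j*(j-1)/2 }
Lindex(3,2)    #returns 5, matching the h32 example above
Lindex(3,3)    #returns 6, the last column when p = 3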
4. A Growth Example
The table below has the weight in grams of one mouse taken weekly from birth (week 0) through week 8.

weeks:    0   1   2   3   4   5   6   7   8
weight:  13  21  28  36  36  39  39  41  42

For this experiment, a version of the exponential growth model is considered
f(θ, x) = θ_0F + (θ_00 − θ_0F) exp(−γx)

Here θ_00 is the weight at birth, θ_0F is the final (asymptotic) weight, and γ governs the rate of approach to the asymptote; in the R code below these correspond to th[1], th[2], and th[3].
To implement the Newton-Raphson method for nonlinear regression, gobs of derivatives are used. For the gradient vector and the Hessian matrix of S, derivatives of f(θ, x_i) are needed. Specifically, there are p = 3 first-order derivatives

∂f/∂θ_00 = exp(−γx_i)
∂f/∂θ_0F = 1 − exp(−γx_i)
∂f/∂γ = −x_i (θ_00 − θ_0F) exp(−γx_i)

and p* = 6 second-order derivatives, of which only three are non-zero:

∂^2 f/∂γ∂θ_00 = −x_i exp(−γx_i)
∂^2 f/∂γ∂θ_0F = x_i exp(−γx_i)
∂^2 f/∂γ^2 = x_i^2 (θ_00 − θ_0F) exp(−γx_i)
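As a quick numerical sanity check (not in the original notes), any of these derivatives can be verified with a central finite difference. The snippet below checks ∂f/∂γ at the starting values used later; the function f and the variable names are hypothetical helpers for this check only.

#Finite-difference check of df/dgamma for the exponential growth model
f=function(th0,thF,g,x){ thF + (th0-thF)*exp(-g*x) }
th0=10; thF=47; g=0.5; xi=3; eps=1e-6
analytic=-xi*(th0-thF)*exp(-g*xi)                            #formula from above
numeric=(f(th0,thF,g+eps,xi)-f(th0,thF,g-eps,xi))/(2*eps)    #central difference
c(analytic,numeric)                                          #the two values should agree closely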
#Data: weekly weight (grams) of one mouse, weeks 0 through 8 (see table above)
x=0:8
y=c(13,21,28,36,36,39,39,41,42)

th=c(10,47,.5)     #Starting values for parameter vector
#Exponential model
fun=function(th,x){
  y = th[2]+((th[1]-th[2])*exp(-th[3]*x))
  return(y)
}
fun(th,x)          #try it out -- should give you fitted values

#Calculate the residual sum of squares
resid=function(th,x,y){
  yhat=fun(th,x)
  rss=t(y-yhat)%*%(y-yhat)
  return(rss)
}
resid(th,x,y)      #try it out -- should give a scalar for output
#Calculate the Jacobian matrix
firder=function(th,x){
  j1=exp(-th[3]*x)
  j2=1-j1
  j3=-x*(th[1]-th[2])*exp(-th[3]*x)
  j=cbind(j1,j2,j3)
  return(j)
}
firder(th,x)

#Calculate the Second Derivatives
secder=function(th,x){
  h11=rep(0,times=length(x))
  h21=h11
  h22=h11
  h31=-x*exp(-th[3]*x)
  h32=-h31
  h33=(x**2)*((th[1]-th[2])*exp(-th[3]*x))
  hmat=cbind(h11,h21,h22,h31,h32,h33)
  return(hmat)
}
secder(th,x)

#Gradient evaluated at "th"
grad=function(th,x,y){
  fd=firder(th,x)
  yhat=fun(th,x)
  r=y-yhat
  g=-2*t(fd)%*%r        #Gradient vector
  return(g)
}
grad(th,x,y)

#Hessian matrix evaluated at "th"
hess=function(th,x,y){
  np=length(th)
  H=matrix(nrow=np, ncol=np)
  fd=firder(th,x)
  hm=secder(th,x)
  yhat=fun(th,x)
  r=y-yhat
  m=0                   #column index into the second-derivative matrix
  for(j in 1:np){
    k=1
    while(k <= j){
      m = m + 1
      t1=2*(t(fd[,j])%*%fd[,k])
      t2=2*(t(r)%*%hm[,m])
      H[j,k]=t1-t2
      H[k,j]=H[j,k]
      k = k + 1
    }
  }
  return(H)
}
hess(th,x,y)
##################################
# Main Newton-Raphson Loop
##################################
final=function(maxiter=200,tol=1E-5){
  cat("Iteration","  ","RSS","  ","Max_Gradient","  ","Parameters .................","\n")
  cat("  ",0,"  ",resid(th,x,y),"  ","            ",th[1],"  ",th[2],"  ",th[3],"\n")
  it=1
  while (it<=maxiter){
    g=grad(th,x,y)
    H=hess(th,x,y)
    newth=th - (solve(H)%*%g)
    res=resid(newth,x,y)
    maxg=max(abs(g))
    cat("  ",it,"  ",res,"  ",maxg,"  ",newth[1],"  ",newth[2],"  ",newth[3],"\n")
    if(maxg<tol) break
    th=newth
    it=it+1
  }
  plot(x,y,xlim=c(0,9),ylim=c(0,50),xlab="Weeks", ylab="Weight (in grams)",
       pch=20, main="Weekly Weight of One Mouse", bty="l")
  a=seq(0,9,.01)
  lines(a,fun(newth,a),lwd=3)
}
final()
R Output:
final()
Iteration   RSS        Max_Gradient    Parameters .................
    0       199.2118                   10         47         0.5
    1       16.0327    1239.178        11.21155   41.58776   0.4780019
    2       13.4939    116.7324        12.51038   43.28126   0.3548456
    3       10.9301    229.2958        12.50574   43.43303   0.3723246
    4       10.8385    18.30822        12.40910   43.15041   0.3837244
    5       10.8380    2.761367        12.40831   43.15154   0.3840416
    6       10.8380    0.004395168     12.40829   43.15148   0.3840443
    7       10.8380    1.466024e-07    12.40829   43.15148   0.3840443
[Figure: "Weekly Weight of One Mouse" -- scatterplot of weight (in grams) by week with the fitted exponential growth curve overlaid]
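As an optional cross-check (not in the original notes), the same model can be fit with R's built-in nls() routine, which uses a Gauss-Newton type algorithm; its estimates should essentially match the Newton-Raphson solution above. The names th0, thF, and g are just labels for the three parameters.

#Cross-check with R's built-in nonlinear least squares routine
x=0:8
y=c(13,21,28,36,36,39,39,41,42)
fit=nls(y ~ thF + (th0-thF)*exp(-g*x), start=list(th0=10, thF=47, g=0.5))
summary(fit)    #estimates should be close to 12.408, 43.151, 0.384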
    [ 1   x_11   x_12   ...   x_1p ]
X = [ 1   x_21   x_22   ...   x_2p ]
    [ ...                          ]
    [ 1   x_n1   x_n2   ...   x_np ]
The matrix X gives all the observed values of the predictors, with a column of 1's appended as the leftmost column. The ith row of X corresponds to the values for the ith case in the data; the columns of X correspond to the different predictors. Using these quantities, the multiple regression equation can be written in matrix terms as
y = Xb + e
Find the vector b that minimizes the residual sum of squared deviations. As we have seen plenty of times now, this means differentiating the least squares criterion function, setting the derivative vector (the gradient) equal to zero, and solving for the vector b.
S(b) = e'e
     = (y − ŷ)'(y − ŷ)
     = (y − Xb)'(y − Xb)
     = (y' − b'X')(y − Xb)
     = y'y − y'Xb − b'X'y + b'X'Xb
     = y'y − 2b'X'y + b'X'Xb        (6)

Why can the middle two terms in line 5 of (6) be combined? Find the vector b such that
∂S(b)/∂b = 0
Using some of the rules of differentiation learned thus far, we can write the gradient as

∂S(b)/∂b = −2X'y + 2X'Xb

Setting this gradient equal to zero yields the normal equations X'Xb = X'y, and hence the least squares estimate b̂ = (X'X)^(-1)X'y.
How do we know the parameter vector b̂ minimizes the residual sum of squares? Right, we need the Hessian matrix to be positive definite (eigenvalues > 0).
H = ∂^2 S(b)/∂b∂b' = 2X'X
As long as X is of full rank, X'X will have positive eigenvalues, and the parameter vector b̂ minimizes the residual sum of squares criterion function.
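A brief R illustration (not part of the original notes) with a small simulated data set: the normal-equations solution (X'X)^(-1)X'y reproduces the coefficients returned by lm(). The data here are made up purely for the illustration.

#Closed-form least squares versus lm() on simulated data (illustration only)
set.seed(1)
n=20
x1=rnorm(n); x2=rnorm(n)
y=2 + 1.5*x1 - 0.7*x2 + rnorm(n,sd=0.3)    #made-up response for the illustration
X=cbind(1,x1,x2)                           #design matrix with leading column of 1's
b=solve(t(X)%*%X)%*%t(X)%*%y               #b = (X'X)^(-1) X'y
cbind(b, coef(lm(y ~ x1 + x2)))            #the two columns should agree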