
Introduction

Linear regression is a useful technique for representing observed data by a
mathematical equation. In linear regression, the dependent variable (the observed
data) is assumed to be a linear function of various independent variables. Normally,
the regression problem is formulated as a least squares minimization problem. For
the case of linear least squares, the resulting analysis requires the solution of a set
of simultaneous equations that can be easily solved using Gaussian Elimination.
Linear least squares and Gaussian Elimination are both well-known techniques with
many applications.

Unfortunately, there are many times when one knows that the dependent variable is
not a linear function, but that a transformed variable might be. In addition, in many
cases it is necessary not only to compute the best formula to represent the data, but
also to estimate the accuracy of the parameters. This article presents a C#
implementation of a weighted linear regression, using an efficient symmetric matrix
inversion algorithm, to overcome the problem of nonlinearity of the dependent
variable and to compute the complete variance-covariance matrix, allowing
estimation of confidence intervals for the estimated regression coefficients.

The files included with this article contain the source code for the linear regression
class, as well as an example program.

Weighted Linear Regression

The standard linear regression problem can be stated mathematically as follows:
find the coefficients C_i that minimize

$$S = \sum_{j=1}^{M} \left( y_j - \sum_{i=1}^{N} C_i \, x_{i,j} \right)^2$$

where y_j represents the jth measured or observed dependent variable value,
x_i,j represents the jth measured independent variable value for the ith variable, and
C_i is the regression coefficient to be determined. M represents the number of data
points, and N represents the number of linear terms in the regression equation.
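For concreteness, the sum S can be evaluated with a simple double loop over data
points and terms. The snippet below is only a minimal illustration of the formula
above; the method name and the x[i, j] array layout are assumptions for this sketch,
not part of the article's class.

// Minimal sketch: evaluate the least squares objective S for candidate
// coefficients C, given observations y[j] and independent variables x[i, j].
static double ResidualSumOfSquares(double[] y, double[,] x, double[] C)
{
    int M = y.Length;   // number of data points
    int N = C.Length;   // number of linear terms
    double S = 0.0;
    for (int j = 0; j < M; j++)
    {
        double predicted = 0.0;
        for (int i = 0; i < N; i++)
            predicted += C[i] * x[i, j];

        double residual = y[j] - predicted;
        S += residual * residual;
    }
    return S;
}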

It is important to note that this statement assumes that all errors have the same
significance, but this is often not true. If there is a drift in the precision of the
measurements, for example, some errors may be more or less important than
others. Another important case arises if we need to transform the dependent
variable, y, in order to get a linear representation. For example, in many practical
cases, the logarithm of y (or some other function) may be much better at
representing the data. In the case of a logarithmic relationship, we can fit Log(y) =
C_1x_1 + C_2x_2 + ... to represent the relationship y = e^(C_1x_1 + C_2x_2 + ...). In the case where a
transformation is needed, however, the errors in the transformed variable are no
longer necessarily all of the same significance. As a simple example using the Log
transformation, note that Log(1) +/- 0.1 represents a much different range than
Log(1000) +/- 0.1. In such cases, it is also possible to approximately represent the
variation in error significance by using a weighted regression, as shown in the
following modified equation:

$$S_w = \sum_{j=1}^{M} w_j \left( y_j - \sum_{i=1}^{N} C_i \, x_{i,j} \right)^2$$

In this formulation, the squared difference between each observed and predicted
value is multiplied by a weighting factor, w_j, to account for the variation in
significance of the errors. If the difference is due to variations in measurement
precision, the weight factors will need to be determined based on the precision drift.
If the differences in significance are due to a variable transformation, however, we
can often estimate them based on the functional form of the transformation.

In the case of a variable transformation, we can approximate the error in terms of
the derivative of the function. Assuming that we are measuring y and transforming
to f(y), the following relationship represents a first order approximation:

$$\Delta f(y) \approx \frac{df}{dy}\,\Delta y \quad\Longrightarrow\quad w_j = \left( \frac{df}{dy} \bigg|_{y=y_j} \right)^{-2}$$

Essentially, the weight factor is used as a first order correction, and if the regression
yields small residual errors, the approximation is very close. As can be seen, for the
case where a Log(y) transformation is used, the weights for each data point would
be (dLog(y)/dy)^(-2) = y^2.
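As a concrete illustration, the weights for a natural-log transformation can be
computed directly from the measured y values. This is only a sketch; the method
name is an assumption and it is not part of the article's class.

// Sketch: weights for a Log(y) transformation. Since dLog(y)/dy = 1/y,
// the weight for each point is w_j = (1/y_j)^(-2) = y_j^2.
// Assumes all y values are positive (required for the Log transform anyway).
static double[] LogTransformWeights(double[] y)
{
    double[] w = new double[y.Length];
    for (int j = 0; j < y.Length; j++)
        w[j] = y[j] * y[j];
    return w;
}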

It might seem more reasonable to define the weights as multiplying the difference in
the computed and measured values, rather than the difference squared. The reason
that they are defined as multiplying the difference squared is due to the relationship
between the statistical variance and the least squares terms. In the case of a
changing measurement precision, it makes sense to adjust the variance, which is
related to the square of the difference, rather than the difference itself.

For more information on regression analysis, including weighted regressions, please
refer to the book by Draper and Smith (1966) listed in the references. This book
should be considered one of the classical texts on practical regression analysis.

Solving the Weighted Regression

Solving the linear regression equation is straightforward. Since the terms are linear
and the objective is to compute the minimum with respect to all of the coefficients,
the standard derivation is to take the derivative of the least squares sum with
respect to each coefficient, C_i, and require that the derivatives are all exactly zero.
This yields a set of simultaneous equations with the coefficients, C_i, as unknowns,
which can be solved using standard linear algebra. In the weighted least squares
case, the equations are the same as the standard, unweighted case, except the
weights are included in each of the sums. For reference, the equations are:

$$\sum_{i=1}^{N} C_i \left( \sum_{j=1}^{M} w_j \, x_{i,j} \, x_{k,j} \right) = \sum_{j=1}^{M} w_j \, x_{k,j} \, y_j, \qquad k = 1, \dots, N$$
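To show how these sums might be assembled in code, the following sketch
accumulates the coefficient matrix X and right-hand side vector B of the normal
equations directly from the data and weights. The method name and array layout are
assumptions for illustration, not the article's class.

// Sketch: build the normal equations X * C = B for weighted least squares.
// x[i, j] is the jth value of the ith independent variable, y[j] is the jth
// observed value, and w[j] is its weight. X is N x N (and symmetric); B has length N.
static void BuildNormalEquations(double[,] x, double[] y, double[] w,
                                 double[,] X, double[] B)
{
    int N = X.GetLength(0);   // number of linear terms
    int M = y.Length;         // number of data points

    for (int k = 0; k < N; k++)
    {
        B[k] = 0.0;
        for (int i = 0; i < N; i++)
            X[k, i] = 0.0;

        for (int j = 0; j < M; j++)
        {
            B[k] += w[j] * x[k, j] * y[j];
            for (int i = 0; i < N; i++)
                X[k, i] += w[j] * x[k, j] * x[i, j];
        }
    }
}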

Most simple least squares algorithms use Gaussian Elimination to solve the
simultaneous equations, since it is fast and easy to program. In fact, if all you need
is the best set of coefficients, it's probably best to use Gaussian elimination. If,
however, you want to do some additional analyses, then Gaussian Elimination may
not be the best option.

An alternate method for solving the equations is to represent the simultaneous
equations as a matrix equation:

$$\mathbf{X}\,\mathbf{C} = \mathbf{B}, \qquad X_{k,i} = \sum_{j=1}^{M} w_j \, x_{i,j} \, x_{k,j}, \qquad B_k = \sum_{j=1}^{M} w_j \, x_{k,j} \, y_j$$

Solving the matrix equation can be accomplished by inverting the X matrix, then
multiplying by the B vector to determine the values of C. The reason that this is an
option worth considering is twofold:

1. the inverted X matrix is directly proportional to the variance-covariance
   matrix, which contains almost all of the information about the accuracy of the
   coefficient estimates (see the sketch after this list), and
2. X happens to be a symmetric matrix that can be inverted very efficiently.
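To make these two points concrete, the sketch below shows how the coefficients
are recovered once X has been inverted; scaling the same inverse by the residual
variance gives the variance-covariance matrix mentioned in point 1. The helper
name is an assumption for illustration, not the article's API.

// Sketch: once X has been inverted (in place), the coefficients follow from
// C = X^-1 * B. The inverse itself, scaled by the residual variance
// s^2 = sum(w_j * r_j^2) / (M - N), estimates the variance-covariance matrix
// of the coefficients, from which confidence intervals can be computed.
static double[] SolveFromInverse(double[,] Xinv, double[] B)
{
    int N = B.Length;
    double[] C = new double[N];
    for (int k = 0; k < N; k++)
        for (int i = 0; i < N; i++)
            C[k] += Xinv[k, i] * B[i];
    return C;
}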

The heart of the algorithm for the weighted linear regression is actually the method
that inverts the symmetric matrix. The source code for this method is shown below.
Note that I can't claim credit for this algorithm. I was actually given it while in
graduate school, back in about 1972, by an office mate by the name of Eric Griggs. I
don't know where Eric got it, but it has been floating around in various forms for a
long, long time. The original code that I obtained was written in Fortran 1, and has
since been translated into many other languages. Since it was published in at least
one report done for the State of Texas, it is public domain. If anyone knows who
originally came up with the algorithm, I'd be pleased to give them due credit.

The method bool SymmetricMatrixInvert(double[,] V) takes a square,
symmetric matrix, V, as its argument, and inverts the matrix in place. In other
words, when the method is called, V contains a symmetric matrix. If the inversion
fails due to the matrix being singular, the method returns false; if the inversion
succeeds, it returns true. After the call, V contains the inverted matrix.
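The original routine is included in the article's download. Purely as an illustration of
the contract described above, here is a minimal sketch with the same signature that
inverts via Gauss-Jordan elimination with partial pivoting on an augmented copy; it
does not exploit the symmetry the way the original Fortran-derived routine does,
and it is not the author's code.

// Sketch only (not the article's routine): invert the square matrix V in place
// using Gauss-Jordan elimination on an augmented copy [V | I].
// Returns false if the matrix is singular (to within an arbitrary tolerance).
public static bool SymmetricMatrixInvert(double[,] V)
{
    int n = V.GetLength(0);
    double[,] a = new double[n, 2 * n];

    // Build the augmented matrix [V | I].
    for (int i = 0; i < n; i++)
    {
        for (int j = 0; j < n; j++)
            a[i, j] = V[i, j];
        a[i, n + i] = 1.0;
    }

    for (int col = 0; col < n; col++)
    {
        // Partial pivoting: pick the row with the largest magnitude in this column.
        int pivot = col;
        for (int r = col + 1; r < n; r++)
            if (System.Math.Abs(a[r, col]) > System.Math.Abs(a[pivot, col]))
                pivot = r;

        if (System.Math.Abs(a[pivot, col]) < 1.0e-12)
            return false;   // singular, or numerically very close to it

        // Swap the pivot row into position.
        if (pivot != col)
            for (int j = 0; j < 2 * n; j++)
            {
                double t = a[col, j];
                a[col, j] = a[pivot, j];
                a[pivot, j] = t;
            }

        // Normalize the pivot row.
        double d = a[col, col];
        for (int j = 0; j < 2 * n; j++)
            a[col, j] /= d;

        // Eliminate this column from every other row.
        for (int r = 0; r < n; r++)
        {
            if (r == col) continue;
            double factor = a[r, col];
            for (int j = 0; j < 2 * n; j++)
                a[r, j] -= factor * a[col, j];
        }
    }

    // Copy the right-hand block (the inverse) back into V.
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            V[i, j] = a[i, n + j];

    return true;
}

A routine that exploits symmetry, like the one shipped with the article, can avoid the
augmented copy and reduce the arithmetic, which is the efficiency advantage
referred to above.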
