
Bus. 500
Todd Easton

Curvilinear Regression Models & Understanding Natural Logarithms


Curvilinear Regression Models
When one estimates a linear regression model like the following:
Y = b 0 b 1 X,
the presumption is that there is a linear relationship between X and Y.

If the relationship is instead curvilinear, then one can transform the X variable to take into account
the non-linearity of the relationship. However, there are many possible, non-linear ways that Y could
depend on X. How does one select the right transformation? One possibility is theory; in some
cases your knowledge of the nature of the relationship between X and Y will suggest that you should
square X or take its logarithm. Another possibility is inspection. If you look at a scattergram of X
and Y, the shape of the scatter might suggest a particular transformation.

Figure 1 illustrates how this might happen with a Y variable that depends on an X variable,
transformed in a logarithmic way (plus a normal disturbance term). [If you are interested in how
Figure 1 was created, see Appendix 1.] If you saw that scattergram, and especially if you fit a linear
function to it, you might decide that the appropriate relationship to estimate was between a
transformed Y and the X variable. Clearly, a linear function does not describe the
relationship captured by the scattergram well.

Figure 1, Y as a Function of X
[Scattergram: X values (0 to 14) on the horizontal axis, Y values (0 to 40,000) on the vertical axis.]

Natural logarithms
The next question would be, which transformation would best capture this curvilinear relationship?
One transformation frequently used by statisticians is the natural log transformation. It describes
well relationships where Y increases (or decreases) faster and faster as X gets bigger. But what is
a natural logarithm? The first thing to say is that a natural logarithm is a cousin of the base 10
logarithm.

The base 10 logarithm of a number is the power to which 10 would need to be raised to equal that
number. For example, the logarithm to the base 10 of 100 is 2, because:
$10^2 = 100$.

The numerical sentence that says, "the logarithm to the base 10 of 100 is 2," is:
$\log_{10} 100 = 2$.

A natural logarithm is like the base 10 logarithm, except the number being raised to a power is e
instead of 10. The number e is a very special number (like pi). It's equal to approximately 2.7183.
So, for example, the logarithm to the base e of 100 is (roughly) 4.61, because:

$e^{4.61} \approx 100$

The numerical sentence that says, "the logarithm to the base e of 100 is roughly 4.61," is:
$\log_e 100 \approx 4.61$. We can also write:
$\ln 100 \approx 4.61$, because "ln" is an abbreviation for "natural logarithm."
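
The two definitions above can be checked numerically. The following is a minimal sketch using Python's standard math module; the variable names are illustrative and are not part of the original worksheet.

```python
import math

# Base 10 logarithm: the power to which 10 must be raised to equal 100.
log10_of_100 = math.log10(100)

# Natural logarithm: the power to which e must be raised to equal 100.
ln_of_100 = math.log(100)

# Raising e to the natural log recovers the original number.
recovered = math.exp(ln_of_100)

# The two logarithms differ only by a constant factor, ln(10).
ratio = ln_of_100 / log10_of_100

print(f"log10(100)         = {log10_of_100:.2f}")
print(f"ln(100)            = {ln_of_100:.2f}")
print(f"e ** ln(100)       = {recovered:.2f}")
print(f"ln(100)/log10(100) = {ratio:.4f}")
```

Because the ratio is a constant, switching between base 10 and base e only rescales the transformed variable; it does not change the shape of the relationship.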

At this point, you might well ask why statisticians prefer natural logarithms to base 10 logarithms
when they transform independent variables. Imagine that you first transform 100 with a base 10
logarithm, and then with a natural logarithm. Most of us would find it easier to think about the
meaning of 2 in that context than the meaning of 4.61.
X      log X    ln X
100    2        4.61

There is more than one answer to that question, but one answer goes back to the idea, mentioned
on the previous worksheet, that the transformation you choose might be dictated by theory, by
your understanding of the phenomenon you are trying to model. There is a whole class of
phenomena connected to growth over time for which the number e and natural logarithms have
an organic connection. One such phenomenon is compound interest. It turns out that, if you invest
one dollar and compound interest continuously at an annual rate of 100%, you will have e dollars
(about 2.7183 dollars) at the end of a year!
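
The compound interest claim can be illustrated by compounding more and more often within the year. This is a sketch under the assumption of a one-dollar principal and a 100% annual rate; the function name value_after_one_year is hypothetical and chosen only for this example.

```python
import math

def value_after_one_year(periods_per_year, rate=1.0, principal=1.0):
    """Value after one year when interest is compounded periods_per_year times."""
    return principal * (1 + rate / periods_per_year) ** periods_per_year

# Annual, monthly, daily, and very frequent compounding.
for n in (1, 12, 365, 1_000_000):
    print(f"{n:>9} periods per year: {value_after_one_year(n):.5f} dollars")

# Continuous compounding is the limit as the number of periods grows without bound.
continuous_limit = math.exp(1.0)
print(f"continuous compounding: {continuous_limit:.5f} dollars")
```

As the number of compounding periods grows, the year-end value approaches e.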

A second answer is connected to the first. It turns out that changes in natural logarithms are
very close to percentage changes, so numbers that have been transformed with a natural log
function are easy to interpret in that context. The next worksheet provides an example of this.
Natural logarithms and proportional change
To see the connection between changes in natural logarithms and proportional change,
think about a variable X that grows from year to year (in Figure 2 it doubles twice, then grows by half). Look at the relationship between the
change in X's logarithm from year to year and the proportional change in X from year to
year. Notice that the proportional change in question isn't the conventional one, but is
calculated using the mid-point formula, where the base is the average of the first value
and the second value being compared.
Figure 2, Proportional Change in X, Midpoint Formula

               Absolute    Absolute            Proportional
               change      change              change
X      ln X    in ln X     in X        Base    in X
1.0    0.000
2.0    0.693   0.693       1.0         1.5     0.667
4.0    1.386   0.693       2.0         3.0     0.667
6.0    1.792   0.405       2.0         5.0     0.400

Comparing the "Absolute change in ln X" and "Proportional change in X" columns (shaded blue in the original worksheet), you can see that the absolute change in the natural log of X (ln X)
is very close to the proportional change in X, measured using the mid-point formula. This provides
a quick way to find approximate proportional changes in a variable that has been transformed
using natural logarithms (or, if you multiply by 100, approximate percentage changes).
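
The comparison in Figure 2 can be reproduced with a few lines of code. This is a minimal sketch; the helper names midpoint_proportional_change and log_change are illustrative, not taken from the worksheet.

```python
import math

def midpoint_proportional_change(old, new):
    """Change in X divided by the average of the old and new values."""
    return (new - old) / ((old + new) / 2)

def log_change(old, new):
    """Absolute change in the natural log of X."""
    return math.log(new) - math.log(old)

x_values = [1.0, 2.0, 4.0, 6.0]  # the X column of Figure 2

# Compare each value with the one before it.
for old, new in zip(x_values, x_values[1:]):
    print(f"X: {old:.1f} -> {new:.1f}   "
          f"change in ln X = {log_change(old, new):.3f}   "
          f"midpoint proportional change = {midpoint_proportional_change(old, new):.3f}")
```

The two measures agree more closely for the smaller step (4 to 6) than for the doublings, which is typical: the approximation is best when the proportional change is modest.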

Natural logarithmic transformation and regression


There's a neat application of this connection between changes in the logs and proportional change
if we transform a dependent variable and then try to predict it with another, untransformed,
variable. Suppose, for example, that we are trying to predict growth in sales for a new
product which has caught on with the public. In these kinds of settings, where one purchase
leads to multiple subsequent purchases (as positive word of mouth leads to new purchases), a
logarithmic model might be appropriate.

As an example, let's use the data from Figure 1. We'll make the X variable the number of years
since the product was introduced and the Y variable the number of units sold in that year.
We know already, from Figure 1, that a line will not fit these data very well. [You can see a line
fit to the data if you look at Appendix 1.] However, transforming Sales with a natural log
transformation might help matters. You can see the transformed data in the third column of
Figure 3. In the next worksheet you can see the regression on the transformed data.

Figure 3

X        Y
Years    Sales    ln Sales
1        4640     8.44
2        3468     8.15
3        5297     8.57
4        6543     8.79
5        6468     8.77
6        7160     8.88
7        2599     7.86
8        5315     8.58
9        7935     8.98
10       8101     9.00
11       16147    9.69
12       35522    10.48
Regression with a logarithmically transformed dependent variable
If we transform the Y variable using natural logarithms and leave the X
variable untransformed, in linear form, the resulting expression is often called log-linear.
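
Written out with the coefficients $b_0$ and $b_1$ used in the linear model earlier, the log-linear form and its equivalent in the original units of Y look as follows. This is a sketch of the standard algebra and does not appear in the worksheet itself.

```latex
\ln Y = b_0 + b_1 X
\quad\Longleftrightarrow\quad
Y = e^{b_0} \, e^{b_1 X}
```

Under this form, each one-unit increase in X multiplies Y by the constant factor $e^{b_1}$, which is why the model suits growth that compounds over time.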

Years ln Sales
1 8.44
2 8.15
3 8.57
4 8.79
5 8.77
6 8.88
7 7.86
8 8.58
9 8.98
10 9.00
11 9.69
12 10.48
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.71
R Square             0.50
Adjusted R Square    0.45
Standard Error       0.51
Observations         12

ANOVA
              df    SS      MS      F        Significance F
Regression    1     2.60    2.60    10.07    0.01
Residual      10    2.58    0.26
Total         11    5.18

             Coefficients    Standard Error    t Stat    P-value    Lower 95%    Upper 95%
Intercept    7.97            0.31              25.50     0.00       7.28         8.67
Years        0.13            0.04              3.17      0.01       0.04         0.23
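
The intercept, slope, and R Square in this output can be reproduced outside the spreadsheet. The sketch below assumes ordinary least squares on the natural log of the Sales values listed in the worksheet tables; the function fit_line is a hypothetical helper written for this example, not the spreadsheet's own routine.

```python
import math

years = list(range(1, 13))
sales = [4640.27, 3468.26, 5297.13, 6542.69, 6467.70, 7160.45,
         2599.02, 5315.17, 7934.64, 8101.25, 16146.58, 35522.44]

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x.

    Returns (intercept, slope, r_squared).
    """
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    syy = sum((yi - y_mean) ** 2 for yi in y)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    r_squared = sxy ** 2 / (sxx * syy)
    return intercept, slope, r_squared

# Transform the dependent variable, leave Years untransformed.
ln_sales = [math.log(s) for s in sales]
intercept, slope, r_squared = fit_line(years, ln_sales)

print(f"Intercept: {intercept:.2f}")
print(f"Years:     {slope:.2f}")
print(f"R Square:  {r_squared:.2f}")
```

The helper uses only the standard library so the arithmetic behind the spreadsheet output stays visible.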

If this regression were in a linear format, the 0.13 coefficient on Years would indicate that, with each
additional year after the release of the product, sales increase by 0.13 units.

However, since the dependent variable is logarithmically transformed, the interpretation is
different. In the context of a transformed dependent variable, the 0.13 tells us that, on average,
sales increase by 0.13 log units with each additional year. And that, as Figure 2 indicated, is
very close to saying that sales grew about 13% with each additional year.
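
To see how close "very close" is, the coefficient can be converted to an exact growth rate. This sketch relies on the standard result that, in a log-linear model, a one-unit increase in X multiplies Y by e raised to the coefficient; the 0.13 is the rounded Years coefficient from the regression output, and the variable names are illustrative.

```python
import math

slope = 0.13  # rounded coefficient on Years from the regression output

# Quick reading: treat the change in log units as a proportional change.
approx_pct = 100 * slope

# Exact conventional percentage change implied by the same coefficient.
exact_pct = 100 * (math.exp(slope) - 1)

print(f"Approximate growth per year: {approx_pct:.1f}%")
print(f"Exact growth per year:       {exact_pct:.1f}%")
```

The gap between the two readings widens as the coefficient grows, so the shortcut is most reliable for small coefficients.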

The table below and Figure 5 illustrate the relationship between actual and
predicted sales, logarithmically transformed. One can see that linear regression provides a
better fit after the dependent variable has been transformed.

Years    ln Sales    ln Sales_Hat
1        8.44        8.11
2        8.15        8.24
3        8.57        8.38
4        8.79        8.51
5        8.77        8.65
6        8.88        8.78
7        7.86        8.92
8        8.58        9.05
9        8.98        9.19
10       9.00        9.32
11       9.69        9.46
12       10.48       9.59

Figure 5, Pred & Actual ln Sales
[Chart: Years (0 to 14) on the horizontal axis, ln Sales (0.00 to 12.00) on the vertical axis.]

One can also look at the relationship in years, sales space rather than
years, logged sales space. See Figure 6, below.


Years    ln Sales    ln Sales_Hat    Sales       Sales_Hat
1        8.44        8.11            4640.27     3320.07
2        8.15        8.24            3468.26     3799.39
3        8.57        8.38            5297.13     4347.90
4        8.79        8.51            6542.69     4975.60
5        8.77        8.65            6467.70     5693.93
6        8.88        8.78            7160.45     6515.95
7        7.86        8.92            2599.02     7456.65
8        8.58        9.05            5315.17     8533.16
9        8.98        9.19            7934.64     9765.09
10       9.00        9.32            8101.25     11174.86
11       9.69        9.46            16146.58    12788.16
12       10.48       9.59            35522.44    14634.38
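
The worksheet does not say how the Sales_Hat column was computed. The values are consistent with exponentiating the fitted ln Sales_Hat values, so the sketch below makes that assumption. It refits the log-linear regression with ordinary least squares and then converts the predictions back to sales units; all variable names are illustrative.

```python
import math

years = list(range(1, 13))
sales = [4640.27, 3468.26, 5297.13, 6542.69, 6467.70, 7160.45,
         2599.02, 5315.17, 7934.64, 8101.25, 16146.58, 35522.44]
ln_sales = [math.log(s) for s in sales]

# Ordinary least squares of ln Sales on Years.
n = len(years)
x_mean = sum(years) / n
y_mean = sum(ln_sales) / n
sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(years, ln_sales))
sxx = sum((x - x_mean) ** 2 for x in years)
slope = sxy / sxx
intercept = y_mean - slope * x_mean

# Predictions in log units, then back-transformed with the exponential function.
ln_sales_hat = [intercept + slope * x for x in years]
sales_hat = [math.exp(v) for v in ln_sales_hat]

for year, actual, predicted in zip(years, sales, sales_hat):
    print(f"{year:>2}  {actual:>10.2f}  {predicted:>10.2f}")
```

Because the exponential of a straight line is a curve, the predictions trace a path that bends upward in sales units even though the fit is linear in log units.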
Figure 6, Actual & Predicted Sales
[Chart: Years (0 to 14) on the horizontal axis, Sales (0 to 40,000) on the vertical axis.]
Appendix
Creating a Y that's a logarithmic            X & Y restated
(but probabilistic) function of X            for scattergram

X     Disturbance    Y                       X     Y
1     -360.28        4640.27                 1     4640.27
2     -1533.22       3468.26                 2     3468.26
3     293.11         5297.13                 3     5297.13
4     1531.77        6542.69                 4     6542.69
5     1438.02        6467.70                 5     6467.70
6     2079.76        7160.45                 6     7160.45
7     -2620.31       2599.02                 7     2599.02
8     -281.02        5315.17                 8     5315.17
9     1314.03        7934.64                 9     7934.64
10    -1304.04       8101.25                 10    8101.25
11    -828.24        16146.58                11    16146.58
12    -2028.52       35522.44                12    35522.44

Y as a Function of X
[Scattergram with fitted linear trendline: X values (0 to 14) on the horizontal axis, Y values (0 to 40,000) on the vertical axis.]
Trendline: f(x) = 1651.67x - 1636.24, R² = 0.44
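
The linear trendline reported for this chart can be checked against the appendix data. This is a minimal sketch assuming the trendline is an ordinary least squares fit of the untransformed Y on X; the variable names are illustrative.

```python
x_values = list(range(1, 13))
y_values = [4640.27, 3468.26, 5297.13, 6542.69, 6467.70, 7160.45,
            2599.02, 5315.17, 7934.64, 8101.25, 16146.58, 35522.44]

n = len(x_values)
x_mean = sum(x_values) / n
y_mean = sum(y_values) / n

# Sums of squares and cross-products around the means.
sxx = sum((x - x_mean) ** 2 for x in x_values)
syy = sum((y - y_mean) ** 2 for y in y_values)
sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(x_values, y_values))

slope = sxy / sxx
intercept = y_mean - slope * x_mean
r_squared = sxy ** 2 / (sxx * syy)

print(f"f(x) = {slope:.2f}x + ({intercept:.2f})")
print(f"R squared = {r_squared:.2f}")
```

The fitted line has a negative intercept, so it predicts negative sales near X = 0, one sign that a straight line is a poor description of these data.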
