Todd Easton
If the relationship is instead curvilinear, then one can transform the X variable to take into account
the non-linearity of the relationship. However, there are many possible, non-linear ways that Y could
depend on X. How does one select the right transformation? One possibility is theory; in some
cases your knowledge of the nature of the relationship between X and Y will suggest that you should
square X or take its logarithm. Another possibility is inspection. If you look at a scattergram of X
and Y, the shape of the scatter might suggest a particular transformation.
Figure 1 illustrates how this might happen with a Y variable that depends on an X variable,
transformed in a logarithmic way (plus a normal disturbance term). [If you are interested in how
Figure 1 was created, see Appendix 1.] If you saw that scattergram, and especially if you fit a linear
function to it, you might decide that the appropriate relationship to estimate was between a
transformed Y and the X variable. Clearly, a linear function does not describe well the
relationship captured by the scattergram.
Figure 1, Y as a Function of X
[Scattergram: Y values (vertical axis, 0 to 40,000) plotted against X values (horizontal axis, 0 to 14)]
Natural logarithms
The next question would be, which transformation would best capture this curvilinear relationship?
One transformation frequently used by statisticians is the natural log transformation. It describes
well relationships where Y increases (or decreases) faster and faster as X gets bigger. But what is
a natural logarithm? The first thing to say is that a natural logarithm is a cousin of the base 10
logarithm.
The base 10 logarithm of a number is the power to which 10 would need to be raised to equal that
number. For example, the logarithm to the base 10 of 100 is 2, because:
$10^2 = 100$.
The numerical sentence that says, "the logarithm to the base 10 of 100 is 2," is:
$\log_{10} 100 = 2$.
A natural logarithm is like the base 10 logarithm, except the number being raised to a power is e
instead of 10. The number e is a very special number (like pi). It's equal to approximately 2.7183.
So, for example, the logarithm to the base e of 100 is (roughly) 4.61, because:
$e^{4.61} \approx 100$
The numerical sentence that says, "the logarithm to the base e of 100 is 4.61," is:
$\log_e 100 \approx 4.61$. We can also write:
$\ln 100 \approx 4.61$, because "ln" is an abbreviation for "natural logarithm."
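The two definitions above can be checked numerically. What follows is a minimal sketch using Python's standard math module; the variable names are illustrative and are not part of the worksheet.

```python
import math

number = 100

# Base 10 logarithm: the power to which 10 must be raised to equal 100.
log10_of_number = math.log10(number)

# Natural logarithm: the power to which e must be raised to equal 100.
ln_of_number = math.log(number)

print(round(log10_of_number, 2))  # 2.0
print(round(ln_of_number, 2))     # 4.61

# Raising e to the natural logarithm recovers the original number.
recovered = math.e ** ln_of_number
print(round(recovered, 2))        # 100.0
```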
At this point, you might well ask why statisticians prefer natural logarithms to base 10 logarithms
when they transform independent variables. Imagine that you first transform 100 with a base 10
logarithm, and then with a natural logarithm. Most of us would find it easier to think about the
meaning of 2 in that context than the meaning of 4.61.
X      log X        X      ln X
100    2            100    4.61
There is more than one answer to that question, but one answer goes back to the idea, mentioned
on the previous worksheet, that the transformation you choose might be dictated by theory, by
your understanding of the phenomenon you are trying to model. There is a whole class of
phenomena connected to growth over time, for which the number e and natural logarithms have
an organic connection. One phenomenon is compound interest. It turns out that, if you start with
one dollar and compound interest continuously at a rate of 100%, you will have e dollars (about
2.7183 dollars) at the end of a year!
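The compound interest claim can be illustrated by compounding more and more often and watching the balance approach e. This is only a sketch: the function name and its parameters are hypothetical, and the starting balance of one dollar is an assumption made for the illustration.

```python
import math

def balance_after_one_year(periods_per_year, principal=1.0, annual_rate=1.0):
    """Balance after one year when interest is compounded periods_per_year times."""
    return principal * (1 + annual_rate / periods_per_year) ** periods_per_year

# Compounding yearly, monthly, daily, and a million times a year.
for periods in (1, 12, 365, 1_000_000):
    print(periods, round(balance_after_one_year(periods), 4))

# Continuous compounding is the limit of this process, which is e itself.
print("continuous:", round(math.e, 4))  # continuous: 2.7183
```

The more frequent the compounding, the closer the year-end balance gets to e dollars.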
A second answer is connected to the first. It turns out that changes in natural logarithms are
very close to percentage changes, so numbers that have been transformed with a natural log
function are easy to interpret in that context. The next worksheet provides an example of this.
Natural logarithms and proportional change
To see the connection between changes in natural logarithms and proportional change,
think about a variable X that is growing from year to year (in Figure 2 it doubles twice
and then rises from 4 to 6). Look at the relationship between the change in X's logarithm
from year to year and the proportional change in X from year to year. Notice that the
proportional change in question isn't the conventional one, but is calculated using the
mid-point formula, where the base is the average of the first value and the second value
being compared.
Figure 2, Proportional Change in X, Midpoint Formula

                Absolute   Absolute           Proportional
                change     change             change
X      ln X     in ln X    in X       Base    in X
1.0    0.000
2.0    0.693    0.693      1.0        1.5     0.667
4.0    1.386    0.693      2.0        3.0     0.667
6.0    1.792    0.405      2.0        5.0     0.400
Comparing the "absolute change in ln X" column with the "proportional change in X" column, you can
see that the absolute change in the natural log of X (ln X) is very close to the proportional change
in X, measured using the mid-point formula. This provides a quick way to find approximate
proportional changes in a variable that has been transformed using natural logarithms (or, if you
multiply by 100, approximate percentage changes).
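The comparison in Figure 2 can be reproduced with a few lines of code. The sketch below assumes the mid-point formula as described in the text (change divided by the average of the two values); the function name is hypothetical.

```python
import math

def compare_changes(previous, current):
    """Return (change in ln X, mid-point proportional change) for one step."""
    change_in_ln = math.log(current) - math.log(previous)
    midpoint_base = (previous + current) / 2
    proportional_change = (current - previous) / midpoint_base
    return change_in_ln, proportional_change

x_values = [1.0, 2.0, 4.0, 6.0]  # the X column of Figure 2

for previous, current in zip(x_values, x_values[1:]):
    change_in_ln, proportional_change = compare_changes(previous, current)
    print(f"{previous} -> {current}: "
          f"change in ln X = {change_in_ln:.3f}, "
          f"proportional change = {proportional_change:.3f}")

# A smaller step (100 -> 105) brings the two measures even closer together.
small_step = compare_changes(100, 105)
print(small_step)
```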
As an example, let's use the data from Figure 1. We'll make the X variable the number of years
since the product was introduced and the Y variable the number of units sold in that year.
We know already, from Figure 1, that a line will not fit these data very well. [You can see a line
fit to the data if you look at Appendix 1.] However, transforming Sales with a natural log
transformation might help matters. You can see the transformed data in the third column of
Figure 3. In the next worksheet you can see the regression on the transformed data.

Figure 3
X        Y
Years    Sales    ln Sales
1        4640     8.44
2        3468     8.15
3        5297     8.57
4        6543     8.79
5        6468     8.77
6        7160     8.88
7        2599     7.86
8        5315     8.58
9        7935     8.98
10       8101     9.00
11       16147    9.69
12       35522    10.48
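The ln Sales column can be recomputed from the Sales column directly. This is a short sketch using the rounded Sales figures shown in Figure 3; the list names are illustrative.

```python
import math

# Sales for years 1 through 12, as listed in Figure 3.
sales = [4640, 3468, 5297, 6543, 6468, 7160,
         2599, 5315, 7935, 8101, 16147, 35522]

# Natural log of each year's sales, rounded to two decimals as in the figure.
ln_sales = [round(math.log(units), 2) for units in sales]

for year, (units, ln_units) in enumerate(zip(sales, ln_sales), start=1):
    print(year, units, ln_units)
```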
Regression with a logarithmically transformed dependent variable
If we transform the Y variable using natural logarithms and leave the X
variable untransformed, in linear form, the resulting expression is often called log-linear.
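Written out, the log-linear form and its equivalent in the original units of Y look like this. The symbols a and b are assumed names for the intercept and the slope; the worksheet itself does not name them.

```latex
\ln Y = a + bX \quad\Longleftrightarrow\quad Y = e^{a + bX}
```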
Years ln Sales
1 8.44
2 8.15
3 8.57
4 8.79
5 8.77
6 8.88
7 7.86
8 8.58
9 8.98
10 9.00
11 9.69
12 10.48
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.71
R Square             0.50
Adjusted R Square    0.45
Standard Error       0.51
Observations         12

ANOVA
              df    SS      MS      F        Significance F
Regression    1     2.60    2.60    10.07    0.01
Residual      10    2.58    0.26
Total         11    5.18
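If this regression were in a linear format, the .13 coefficient on Years would indicate that, with each
additional year after the release of the product, sales increase by .13 units. Because the dependent
variable is ln Sales, the coefficient instead indicates that ln Sales rises by about .13 per year, which
corresponds to sales growing by roughly 13 percent per year.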
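For readers who want to reproduce the regression outside a spreadsheet, here is a minimal sketch of ordinary least squares on Years and ln Sales using only the Python standard library. It uses the rounded Sales figures from Figure 3, and the function and variable names are hypothetical, so results can differ slightly in the later decimals from the worksheet's output.

```python
import math

years = list(range(1, 13))
sales = [4640, 3468, 5297, 6543, 6468, 7160,
         2599, 5315, 7935, 8101, 16147, 35522]
ln_sales = [math.log(units) for units in sales]

def fit_line(x, y):
    """Least squares fit of y = intercept + slope * x.

    Returns (intercept, slope, r_squared).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    syy = sum((yi - mean_y) ** 2 for yi in y)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return intercept, slope, r_squared

intercept, slope, r_squared = fit_line(years, ln_sales)
print(f"slope = {slope:.2f}, R squared = {r_squared:.2f}")
```

The slope is the change in ln Sales per additional year, which is why it can be read as an approximate proportional change in sales.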
[Chart: ln Sales (vertical axis, 0.00 to 10.00) plotted against Years (horizontal axis, 0 to 14)]
[Chart: vertical axis from 0 to 20,000, plotted against Years (horizontal axis, 0 to 14)]
Sales_Hat
3320.07
3799.39
4347.90
4975.60
5693.93
6515.95
7456.65
8533.16
9765.09
11174.86
12788.16
14634.38
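One way to see what the Sales_Hat column represents is to look at how it grows from year to year. The sketch below uses only the twelve values listed above. It assumes, as the column name suggests, that they are predicted sales for years 1 through 12. The variable names are illustrative.

```python
import math

# The Sales_Hat values, copied in order from the column above.
sales_hat = [3320.07, 3799.39, 4347.90, 4975.60, 5693.93, 6515.95,
             7456.65, 8533.16, 9765.09, 11174.86, 12788.16, 14634.38]

# Ratio of each predicted value to the one before it.
ratios = [later / earlier for earlier, later in zip(sales_hat, sales_hat[1:])]

# Change in the natural log from one year to the next.
log_changes = [math.log(ratio) for ratio in ratios]
mean_log_change = sum(log_changes) / len(log_changes)

print([round(ratio, 3) for ratio in ratios])
print(round(mean_log_change, 3))
```

The ratios are essentially constant, and the average change in the natural log is close to the .13 coefficient on Years, which is what a log-linear fit implies: a constant proportional increase each year.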
Appendix
Creating a Y that's a logarithmic          X & Y restated for
(but probabilistic) function of X          scattergram

X     Disturbance    Y                     X     Y
1     -360.28        4640.27               1     4640.27
2     -1533.22       3468.26               2     3468.26
3     293.11         5297.13               3     5297.13
4     1531.77        6542.69               4     6542.69
5     1438.02        6467.70               5     6467.70
6     2079.76        7160.45               6     7160.45
7     -2620.31       2599.02               7     2599.02
8     -281.02        5315.17               8     5315.17
9     1314.03        7934.64               9     7934.64
10    -1304.04       8101.25               10    8101.25
11    -828.24        16146.58              11    16146.58
12    -2028.52       35522.44              12    35522.44
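The appendix does not state the formula used to build Y, but the table allows a check: subtracting each Disturbance from its Y leaves the systematic part of Y. The sketch below does that subtraction and compares the result with one candidate form, 5000 + 0.2 * e^X. That form is an inference from the numbers in the table, not something the worksheet states, and it is labeled as an assumption in the code.

```python
import math

# (X, Disturbance, Y) rows copied from the appendix table.
rows = [
    (1, -360.28, 4640.27), (2, -1533.22, 3468.26), (3, 293.11, 5297.13),
    (4, 1531.77, 6542.69), (5, 1438.02, 6467.70), (6, 2079.76, 7160.45),
    (7, -2620.31, 2599.02), (8, -281.02, 5315.17), (9, 1314.03, 7934.64),
    (10, -1304.04, 8101.25), (11, -828.24, 16146.58), (12, -2028.52, 35522.44),
]

differences = []
for x, disturbance, y in rows:
    systematic = y - disturbance            # Y with the disturbance removed
    candidate = 5000 + 0.2 * math.exp(x)    # ASSUMED form, inferred from the table
    differences.append(systematic - candidate)
    print(x, round(systematic, 2), round(candidate, 2))

print(max(abs(difference) for difference in differences))
```

If the largest difference is no bigger than the rounding in the table, the candidate form is consistent with the data, although other forms cannot be ruled out from twelve rounded rows.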
Y as a Function of X
[Scattergram: Y values (vertical axis, 0 to 40,000) plotted against X values (horizontal axis, 0 to 14)]