
Estimating a demand function — it’s about time

Our earlier look at estimating a demand function demonstrated how multiple regression
could be used to estimate the demand for gasoline as a function of various predictors,
including its price. The simple model described at the end of the case was based on the
price index of gasoline (logPG), per capita income (logI) and Year² (YRSQ):

Regression Analysis

The regression equation is


logGpc = - 3.73 - 0.164 logPG + 1.21 logI -0.000139 YRSQ

Predictor Coef SE Coef T P VIF


Constant -3.7258 0.2082 -17.90 0.000
logPG -0.16397 0.01874 -8.75 0.000 5.8
logI 1.21090 0.05384 22.49 0.000 5.5
YRSQ -0.00013895 0.00001835 -7.57 0.000 1.5

S = 0.01207 R-Sq = 96.9% R-Sq(adj) = 96.6%

Analysis of Variance

Source DF SS MS F P
Regression 3 0.145410 0.048470 332.44 0.000
Residual Error 32 0.004666 0.000146
Total 35 0.150076

Although this model fits the data reasonably well, it does suffer from a difficulty —
it does not address the time ordering of the data. In fact, the residuals from this model
exhibit autocorrelation, as can be seen from this time series plot:

© 2014, Jeffrey S. Simonoff


[Figure: time series plot of the standardized residuals (SRES1) versus Index]

The Durbin–Watson statistic supports this, as it equals 0.50, well below the value of 2
expected in the absence of autocorrelation; so does the runs test (although its evidence
is a bit weaker):

Runs Test: SRES1

SRES1

K = 0.0100

The observed number of runs = 13


The expected number of runs = 18.7778
20 Observations above K 16 below
The test is significant at 0.0478
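
The figures in the runs test output can be checked by hand. The sketch below uses the usual normal approximation for the number of runs above and below a cutoff; that this is the calculation behind the output is an assumption, although it appears consistent with the reported expected number of runs and significance level. The function name `runs_test` is introduced here only for illustration.

```python
import math

def runs_test(n_above, n_below, observed_runs):
    """Normal-approximation runs test; returns (expected runs, two-sided p-value)."""
    n = n_above + n_below
    expected = 2.0 * n_above * n_below / n + 1.0
    variance = (expected - 1.0) * (expected - 2.0) / (n - 1.0)
    z = (observed_runs - expected) / math.sqrt(variance)
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)), the two-sided tail area
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return expected, p_value

# Counts taken from the SRES1 output above: 20 above K, 16 below, 13 runs
expected, p_value = runs_test(n_above=20, n_below=16, observed_runs=13)
print(f"expected runs = {expected:.4f}, p-value = {p_value:.4f}")
```

Fewer runs than expected, as here, is what positive autocorrelation produces: residuals tend to stay on the same side of the cutoff for several years in a row.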

The ACF plot of the standardized residuals also indicates autocorrelation:

[Figure: autocorrelation function for SRES1, lags 1 to 9]

Lag   Corr      T    LBQ
  1   0.68   4.06  17.93
  2   0.34   1.47  22.55
  3   0.13   0.52  23.23
  4  -0.10  -0.40  23.64
  5  -0.24  -0.97  26.18
  6  -0.25  -0.98  29.01
  7  -0.29  -1.11  32.96
  8  -0.30  -1.13  37.47
  9  -0.35  -1.24  43.54
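
The LBQ column is a cumulative Ljung-Box statistic, and it can be rebuilt approximately from the Corr column. The sketch below assumes the standard Ljung-Box formula and a series length of 36; because the autocorrelations are entered rounded to two decimals, the results agree with the printed values only to within a few tenths.

```python
n = 36  # number of residuals in the series (assumed from the 36 years of data)
acf = [0.68, 0.34, 0.13, -0.10, -0.24, -0.25, -0.29, -0.30, -0.35]  # Corr, lags 1-9

lbq = []
running = 0.0
for lag, r in enumerate(acf, start=1):
    # Each lag contributes r_k^2 / (n - k); the total is scaled by n (n + 2)
    running += r * r / (n - lag)
    lbq.append(n * (n + 2) * running)

for lag, q in enumerate(lbq, start=1):
    print(f"lag {lag}: LBQ is approximately {q:.2f}")
```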

As we’ve discussed, one approach for handling autocorrelation is to use a lagged version
of the target variable as a predictor (Lagged logGpc, saying that the previous year’s gas
consumption goes a long way to predicting this year’s consumption, due to basic stability
in the process). Also, in thinking about the dynamics of how people decide to use their
automobiles, it seems reasonable to consider also using a lagged version of the price index
of gasoline, Lagged logPG (saying that consumption might be affected not only by current
price, but previous price, because of the perception of people that prices are increasing
or decreasing). Generally speaking, lagged versions of predictors are not used
specifically to address autocorrelation (as the lagged target often is); rather, they
are used when doing so makes sense in context.
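
Constructing a lagged variable amounts to shifting the series down by one year. The sketch below uses the first four logGpc values (1960 to 1963) from the case listing later in this handout; the variable names are illustrative only. The first year has no predecessor, which is why the regressions that follow report one case with missing values.

```python
# logGpc for 1960-1963, as listed in the case table later in this handout
log_gpc = [0.85772, 0.85583, 0.86806, 0.87579]

# Shift the series by one year; the first year has no previous value
lagged_log_gpc = [None] + log_gpc[:-1]

# Only rows with a lagged value available can enter the regression
usable = [(y, y_prev) for y, y_prev in zip(log_gpc, lagged_log_gpc)
          if y_prev is not None]
print(usable)
```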
Here is a scatter plot of logged per capita consumption on the previous year’s logged
per capita consumption. We can see that there is a strong relationship, although it is
apparently weaker for the higher values. I haven’t bothered to give the plot of logged per
capita consumption versus previous year’s price index, since it looks very similar to the
one for current year’s price index that we saw earlier.

Here is output for a regression using these variables, along with logPG, as predictors
(I could have used a best subsets regression here, but it’s clear that all three variables
provide a lot of predictive power):

Regression Analysis

The regression equation is


logGpc = - 0.0529 + 1.08 Lagged logGpc - 0.328 logPG
+ 0.290 Lagged logPG

35 cases used 1 cases contain missing values

Predictor Coef SE Coef T P VIF


Constant -0.05288 0.03063 -1.73 0.094
Lagged l 1.07507 0.03256 33.01 0.000 2.3
logPG -0.32784 0.03625 -9.04 0.000 45.4
Lagged l 0.29023 0.03358 8.64 0.000 39.4

S = 0.008158 R-Sq = 98.4% R-Sq(adj) = 98.3%

Analysis of Variance

Source DF SS MS F P
Regression 3 0.127284 0.042428 637.51 0.000
Residual Error 31 0.002063 0.000067
Total 34 0.129347

The model fits very well, and the autocorrelation has apparently been removed. We
might also consider a further simplification of the model. Note that the estimated slopes
for logged price and lagged logged price are very similar in magnitude and of opposite sign;
that suggests that replacing the two variables with their difference could provide similar
fit, and would be easily interpretable as implying that it is simply the change in price,
along with the previous year’s consumption, that is related to current consumption. A
partial F-test of this hypothesis ($\beta_{\text{logPG}} = -\beta_{\text{Lagged logPG}}$), however, does not support this
simplification (F = 22.8, p < .0001), so we will not pursue this further.
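
As a reminder of how a partial F-statistic of this kind is formed, it compares the residual sum of squares of the full model with that of the restricted model in which the two price slopes are constrained to sum to zero. The restricted model's residual sum of squares is not reported in this handout, so it is left symbolic; the symbols $\mathrm{SSE}_{R}$ and $\mathrm{SSE}_{F}$ are notation introduced only for this sketch.

```latex
F = \frac{(\mathrm{SSE}_{R} - \mathrm{SSE}_{F}) / 1}{\mathrm{SSE}_{F} / 31},
\qquad \mathrm{SSE}_{F} = 0.002063 \ \text{on 31 degrees of freedom}
```

The single numerator degree of freedom reflects the one linear restriction, so the statistic is referred to an F distribution on (1, 31) degrees of freedom.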
A time series plot of the residuals, however, shows that there is a clear outlier:

[Figure: time series plot of the standardized residuals (SRES2) versus Index]

This outlier corresponds to 1991:

Row Year logGpc SRES2 HI2 COOK2

1 1960 0.85772 * * *
2 1961 0.85583 -2.19803 0.166530 0.241328
3 1962 0.86806 0.02270 0.174651 0.000027
4 1963 0.87579 -0.80749 0.145953 0.027857
5 1964 0.89125 0.07544 0.130924 0.000214
6 1965 0.90611 0.61353 0.113090 0.011999
7 1966 0.92591 0.88899 0.089974 0.019534
8 1967 0.93752 -0.14895 0.073089 0.000437
9 1968 0.96368 1.35005 0.068043 0.033268
10 1969 0.98779 1.19743 0.067586 0.025983
11 1970 1.00721 0.01578 0.094891 0.000007
12 1971 1.02345 -0.61246 0.127361 0.013687
13 1972 1.03491 -1.30374 0.157625 0.079513
14 1973 1.05138 0.80589 0.135470 0.025443
15 1974 1.02465 -0.97127 0.237654 0.073521
16 1975 1.03286 0.15422 0.051697 0.000324
17 1976 1.04569 0.34341 0.060323 0.001893
18 1977 1.05460 0.08992 0.069092 0.000150
19 1978 1.07060 0.77236 0.081818 0.013289
20 1979 1.04468 0.09022 0.221465 0.000579
21 1980 0.99919 -1.22083 0.316834 0.172805
22 1981 0.99262 1.04975 0.148019 0.047863
23 1982 0.99460 -0.55284 0.118739 0.010295
24 1983 1.01066 1.50700 0.103501 0.065548
25 1984 1.01604 0.23891 0.080059 0.001242
26 1985 1.01414 -0.34561 0.073825 0.002380
27 1986 1.04995 -0.15029 0.292040 0.002329
28 1987 1.05783 0.63400 0.049303 0.005211
29 1988 1.05873 -0.78952 0.063589 0.010582
30 1989 1.06109 0.86471 0.058482 0.011611
31 1990 1.05335 0.55444 0.084320 0.007077
32 1991 1.03261 -3.51351 0.076899 0.257095
33 1992 1.04080 0.58789 0.068258 0.006330
34 1993 1.04596 0.00685 0.068665 0.000001
35 1994 1.04710 -0.29816 0.065460 0.001557
36 1995 1.05415 0.63187 0.064769 0.006913
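
The three diagnostic columns in this listing are linked: Cook's distance can be recovered from the standardized residual and the leverage value. The sketch below assumes the standard formula with four estimated coefficients (the constant plus three slopes); the function name is illustrative only.

```python
def cooks_distance(std_resid, leverage, n_coefs):
    """Cook's D from a standardized residual and a leverage (hat) value."""
    return std_resid ** 2 * leverage / ((1.0 - leverage) * n_coefs)

# Values copied from the rows for 1991 and 1961 in the listing above
d_1991 = cooks_distance(std_resid=-3.51351, leverage=0.076899, n_coefs=4)
d_1961 = cooks_distance(std_resid=-2.19803, leverage=0.166530, n_coefs=4)
print(f"1991: {d_1991:.6f}   1961: {d_1961:.6f}")
```

The 1991 case has unremarkable leverage, so its comparatively large Cook's distance comes almost entirely from its large standardized residual.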

The year 1991 saw a serious recession and the first Gulf War (Operation Desert
Storm), so apparently gasoline consumption decreased during this time period. Since this
case is an outlier, we could contemplate removing it and reanalyzing the data. Unfortunately,
if we do that, we will disturb the natural time ordering in the data. An alternative approach
is to substitute a “reasonable” value, such as the average of the two neighboring values,
for the outlying value, and then reanalyze the entire adjusted data set. This is admittedly
an ad hoc solution, and more complex (and theoretically justified) substitution methods
are possible. Still, very simple techniques like this can work quite adequately.
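
The neighbor-averaging substitution can be sketched as follows, using the logGpc values for 1990 to 1992 from the listing above. The dictionary layout is illustrative only. It is an assumption here that the lagged predictor for the following year is rebuilt from the adjusted series, so that 1992 uses the filled-in 1991 value as its previous-year consumption.

```python
# logGpc values from the case listing (1991 is the outlying year)
log_gpc = {1990: 1.05335, 1991: 1.03261, 1992: 1.04080}

# Replace the outlying value with the average of its two neighbors
adjusted = dict(log_gpc)
adjusted[1991] = (log_gpc[1990] + log_gpc[1992]) / 2.0

# Assumption: the lagged predictor is rebuilt from the adjusted series,
# so the 1992 case sees the filled-in 1991 value as last year's consumption
lagged_for_1992 = adjusted[1991]
print(f"filled-in 1991 value: {adjusted[1991]:.6f}")
```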

For these data, the 1991 logged per capita gas consumption of 1.03261 is too low relative
to the values of 1.05335 for 1990 and 1.04080 for 1992, so the averaged value of 1.04708
is substituted (of course, when discussing our results, we must note that they no longer
apply to 1991, or to future years that might be like 1991, such as recession years). Here
is the resultant regression output:

Regression Analysis

The regression equation is


logGpc = - 0.0565 + 1.08 Lagged logGpc - 0.334 logPG
+ 0.298 Lagged logPG

35 cases used 1 cases contain missing values

Predictor Coef SE Coef T P VIF


Constant -0.05651 0.02576 -2.19 0.036
Lagged l 1.07892 0.02739 39.39 0.000 2.3
logPG -0.33409 0.03049 -10.96 0.000 45.4
Lagged l 0.29759 0.02824 10.54 0.000 39.4

S = 0.006862 R-Sq = 98.9% R-Sq(adj) = 98.8%

Analysis of Variance

Source DF SS MS F P
Regression 3 0.128926 0.042975 912.79 0.000
Residual Error 31 0.001460 0.000047
Total 34 0.130386

The model fits slightly better, but the coefficients have changed little. More importantly,
there is no evidence of autocorrelation, and no outliers are apparent:

[Figure: normal probability plot of the residuals (response is logGpc); Normal Score versus Standardized Residual]

[Figure: time series plot of the standardized residuals (SRES3) versus Index]

[Figure: autocorrelation function for SRES2, lags 1 to 8]

Lag   Corr      T   LBQ
  1  -0.19  -1.12  1.36
  2   0.07   0.41  1.56
  3   0.06   0.35  1.72
  4  -0.22  -1.27  3.82
  5  -0.10  -0.56  4.28
  6   0.04   0.20  4.34
  7  -0.29  -1.58  8.34
  8   0.01   0.07  8.35

Runs Test: SRES2

SRES2

K = -0.0148

The observed number of runs = 22


The expected number of runs = 17.8000
21 Observations above K 14 below
The test is significant at 0.1328
Cannot reject at alpha = 0.05

Row Year logGpc SRES3 HI3 COOK3

1 1960 0.85772 * * *
2 1961 0.85583 -2.55989 0.166530 0.327330
3 1962 0.86806 0.09034 0.174651 0.000432
4 1963 0.87579 -0.90839 0.145953 0.035255
5 1964 0.89125 0.13495 0.130924 0.000686
6 1965 0.90611 0.78300 0.113090 0.019544
7 1966 0.92591 1.09185 0.089974 0.029467
8 1967 0.93752 -0.15210 0.073089 0.000456
9 1968 0.96368 1.61432 0.068043 0.047567
10 1969 0.98779 1.42412 0.067586 0.036752
11 1970 1.00721 -0.00708 0.094891 0.000001
12 1971 1.02345 -0.76761 0.127361 0.021499
13 1972 1.03491 -1.59821 0.157625 0.119489
14 1973 1.05138 0.93727 0.135470 0.034414
15 1974 1.02465 -1.09989 0.237654 0.094282
16 1975 1.03286 0.12996 0.051697 0.000230
17 1976 1.04569 0.33488 0.060323 0.001800
18 1977 1.05460 0.02913 0.069092 0.000016
19 1978 1.07060 0.82485 0.081818 0.015157
20 1979 1.04468 0.10898 0.221465 0.000845
21 1980 0.99919 -1.44476 0.316834 0.242012
22 1981 0.99262 1.16183 0.148019 0.058629
23 1982 0.99460 -0.81402 0.118739 0.022320
24 1983 1.01066 1.64740 0.103501 0.078331
25 1984 1.01604 0.14236 0.080059 0.000441
26 1985 1.01414 -0.54445 0.073825 0.005907
27 1986 1.04995 -0.45074 0.292040 0.020952
28 1987 1.05783 0.63208 0.049303 0.005180
29 1988 1.05873 -1.08116 0.063589 0.019844
30 1989 1.06109 0.91787 0.058482 0.013083
31 1990 1.05335 0.55781 0.084320 0.007163
32 1991 1.04708 -2.15160 0.076899 0.096413
33 1992 1.04080 0.55002 0.068258 0.005541
34 1993 1.04596 -0.14783 0.068665 0.000403
35 1994 1.04710 -0.50620 0.065460 0.004487
36 1995 1.05415 0.60268 0.064769 0.006289

The residuals versus fitted values plot gives a slight indication of structure, but given
the very high R² here, it is unlikely that any corrective action would make much of a
difference.

[Figure: residuals versus the fitted values (response is logGpc); Standardized Residual versus Fitted Value]

This new gas demand function has an appealing intuitive justification. Given the last
two years’ prices, gasoline demand is directly related to last year’s demand (1% higher
demand last year is associated with a 1.08% estimated expected increase this year). Given last year’s
demand and price, this year’s demand is inversely related to this year’s price, which is the
inverse demand / price relationship expected from economic theory (1% higher price is
associated with .33% estimated expected decrease in demand). Further, given this year’s
price and last year’s demand, this year’s demand is directly related to last year’s price
(1% higher price last year is associated with .30% estimated expected increase in demand
this year). This also makes sense, since a higher value of last year’s price, given this year’s
price is fixed, is consistent with a decreasing price trend, which would encourage additional
consumption. The standard error of the estimate implies that per capita gas demand can
be predicted to within 3% (two standard errors on the log scale give
$10^{2 \times .006862} = 10^{.013724} = 1.03$).
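
The arithmetic behind the "within 3%" statement can be sketched as follows. It takes S = 0.006862 from the fill-in regression output and assumes, as the power of 10 in the text indicates, that the logs are base 10; the variable names are illustrative.

```python
s = 0.006862  # standard error of the estimate from the fill-in regression

# Two standard errors on the log10 scale, converted to a multiplicative factor
factor = 10 ** (2 * s)
percent = (factor - 1.0) * 100.0
print(f"multiplicative factor = {factor:.4f} (about {percent:.1f}%)")
```

Because the response is logged, the prediction error is multiplicative: demand is predicted to within a factor of roughly 1.03 rather than to within a fixed number of units.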
The fill–in method for handling an outlier used here has two limitations that are
worth noting. First, adjusting the target (y) value will not fix leverage points, since they
are characterized by unusual predictor values, not unusual target values. Second, unusual
observations often occur in “patches” in time series data, reflecting a temporary change in
the underlying structure of the process; a constant fill–in for four or five (say) consecutive
time periods obviously does not accurately reflect what we think the series really should
be.

An alternative that addresses both of these points (and is thus the only alternative for
handling leverage points) is to create an indicator variable that defines the outlier or patch
of outliers, equaling one for all observations in the patch, and zero otherwise (isolated
outliers that are not in a consecutive patch of years have a 0/1 variable defined for each
of them). Including this variable in the regression will effectively remove the influence of
the unusual values from the regression fit. Here is how this works for these data (with
Year1991 defining only 1991).

Regression Analysis

The regression equation is


logGpc = - 0.0604 + 1.08 Lagged logGpc - 0.341 logPG
+ 0.305 Lagged logPG - 0.0298 Year1991

35 cases used 1 cases contain missing values

Predictor Coef SE Coef T P


Constant -0.06036 0.02421 -2.49 0.018
Lagged l 1.08300 0.02574 42.07 0.000
logPG -0.34073 0.02873 -11.86 0.000
Lagged l 0.30539 0.02670 11.44 0.000
Year1991 -0.029833 0.006696 -4.46 0.000

S = 0.006433 R-Sq = 99.0% R-Sq(adj) = 98.9%

Analysis of Variance

Source DF SS MS F P
Regression 4 0.128105 0.032026 773.86 0.000
Residual Error 30 0.001242 0.000041
Total 34 0.129347

The fitted coefficients are virtually the same as when the fill–in method is used. One
additional piece of information from this approach is the coefficient for Year1991: given
previous year’s gasoline consumption, and this and last year’s gasoline price index, the
observed logged per capita consumption for 1991 is seen to have been .0298 lower than
expected (translating to a demand roughly 6.6% lower than expected that year), and this
amount is significantly different from zero (p < .001). Thus, the t–test for the indicator
variable is a formal test of whether the point is an outlier (but remember that it would
not necessarily be significant for a leverage point).
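
The two numbers quoted for the Year1991 indicator follow directly from the regression output. The sketch below takes the coefficient and its standard error from the output above and assumes base 10 logs, consistent with the earlier conversion of the standard error of the estimate; the variable names are illustrative.

```python
coef = -0.029833   # Year1991 coefficient from the output above
se = 0.006696      # its standard error

# t-statistic for the indicator, which serves as the outlier test
t_stat = coef / se

# Convert the shift on the log10 scale to a percentage change in demand
ratio = 10 ** coef
pct_lower = (1.0 - ratio) * 100.0
print(f"t = {t_stat:.2f}, demand about {pct_lower:.1f}% lower than expected")
```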
Two software issues, in general and for Minitab in particular, arise if you use this
method:
(1) You should set the standardized residual for any observations identified by a single
0/1 variable equal to 0 (Minitab sets them equal to *, since they are technically 0/0).
(2) If you are doing model selection (using best subsets, for example), you must be careful
to force the indicator variables that define the unusual observations into all models,
to make sure that those points are effectively omitted from the sample. This is
particularly important for a leverage point, since its corresponding indicator will not
necessarily be identified as an important predictor by best subsets, even if its
inclusion could greatly change the fitted regression.
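
Both points rest on the fact that a 0/1 variable defining a single case reproduces the fit obtained with that case left out, and leaves that case with a residual of zero. The sketch below checks this numerically on a small made-up data set (not the gasoline data), using a hand-written least squares routine so that it needs no outside libraries; all names and values are hypothetical.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def ols(rows, y):
    """Least squares coefficients from the normal equations X'X b = X'y."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)

# Hypothetical data: y is roughly 1 + 2x, with case 5 pushed far below the line
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
y = [1.1, 2.9, 5.2, 6.8, 9.1, 4.0, 13.2, 14.9]
flag = 5

# Fit 1: all cases, with a 0/1 indicator for the flagged case
rows_ind = [[1.0, xi, 1.0 if i == flag else 0.0] for i, xi in enumerate(x)]
b_ind = ols(rows_ind, y)

# Fit 2: the flagged case deleted from the sample
rows_del = [[1.0, xi] for i, xi in enumerate(x) if i != flag]
y_del = [yi for i, yi in enumerate(y) if i != flag]
b_del = ols(rows_del, y_del)

# Residual of the flagged case under Fit 1 (zero up to rounding error)
resid_flag = y[flag] - sum(b * v for b, v in zip(b_ind, rows_ind[flag]))
print(b_ind[:2], b_del, resid_flag)
```

The intercept and slope agree between the two fits, and the indicator's coefficient is the amount by which the flagged case falls below the line fitted to the other cases.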
