Linear trend model coefficients:
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)  1530.181     105.041    14.57    <2e-16 ***
t              66.828       3.206    20.84    <2e-16 ***
Observations can be related to each other in time: this quarter’s sales are in proximity to last quarter’s sales, so treating the observations as independent is inappropriate. Correlation across time is called autocorrelation.

[Figure: quarterly sales over time]
Examples:
This quarter’s sales are close (in time) to last quarter’s sales – temporal association.
A home’s value is close (in space) to the values of homes nearby – spatial association.
If neighboring observations carry important information, then you should consider spatial
and/or temporal model components.
In addition to classical regression components.
The Statistical Challenge
How to incorporate the effect of neighboring (or close) observations?
Unlike multivariate regression, we are not predicting future values of Y based on independent
variables X, but based on previous values of the same variable Y.
Temporal dependency:
If this quarter’s sales depend on last quarter’s sales, then sales_t = f(sales_(t-1)).
Sales can be modeled with a lag variable, alongside time (t) as a predictor.
Spatial dependency:
If a home’s value depends on home values nearby, then the average value of all nearby homes
can be used as a predictor.
We can also incorporate the variance (volatility), the maximum value, the minimum value, etc., of nearby homes.
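As a sketch, a lag variable can be added to a regression in base R. The `sales` series below is simulated for illustration only, not the course data:

```r
# Simulate a quarterly sales series with a linear trend and autocorrelated noise
set.seed(1)
n <- 56
t <- 1:n
sales <- 1500 + 65 * t + as.numeric(arima.sim(list(ar = 0.7), n = n, sd = 100))

# Lag variable: last quarter's sales as an extra predictor of this quarter's sales
sales_lag1 <- c(NA, head(sales, -1))
lag_model <- lm(sales ~ t + sales_lag1)
coef(lag_model)
```

Note that `lm()` silently drops the first quarter, whose lagged value is NA.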
Other Challenges of Spatial Models
Spatial models are usually more challenging than temporal models, because:
We have to define an appropriate distance metric, such as Euclidean distance.
Example: Home A is 5 miles from home B and 10 miles from home C.
Or more generally, define a similarity metric.
Example: Product A is closer to product B in the associated feature space.
We must also specify the reach of the spatial dependency:
Should we only include homes no further than 5 miles away? Or no further than 10 miles away?
Which is the better model? Why?

Multiple R-squared: 0.9438, Adjusted R-squared: 0.9394
F-statistic: 214 on 4 and 51 DF, p-value: < 2.2e-16
Add seasonality model vs. linear trend model:
[Figure: fitted values and residual-vs-time plots for both models, t = 0 to 50]

Which Model Fits the Data Better
Residual distribution:
[Figure: histogram of Mod2 residuals (frequency vs. residuals, -500 to 1000)]
Predictive Accuracy On Holdout Set
How to measure predictive performance:
Create a training data set and a holdout (test) data set.
Training: Q1-86 to Q4-99.
Estimate the model using the training data set and use the estimated model to predict sales for the test data.
Compute the root mean square error for the test data:
RMSE (Linear Trend) = $729 million.

[Figure: actual sales vs. trend-model and seasonal-model predictions for the holdout quarters, t = 56 to 62]
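The holdout RMSE calculation can be sketched in a few lines of R (toy numbers below, not the actual holdout sales):

```r
# Toy holdout comparison: actual sales vs. a model's forecasts for four quarters
actual    <- c(5200, 5350, 5100, 5600)
predicted <- c(5000, 5400, 5250, 5500)

# Root mean square error over the holdout set
rmse <- sqrt(mean((actual - predicted)^2))
rmse  # about 137
```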
Patterns (exponential decay, positive/negative swings, etc.) in the residual ACF are bad.

[Figure: two ACF plots of residuals, lags 0 to 15]

Two ways of modeling autocorrelation:
Explicitly, as a lag variable.
Moving average (MA) models: model the response variable as a function of lagged error terms.
Other problems: non-zero mean.
“Lag” Variable
Interpreting Lag Model Results
Questions:
The coefficient 0.74 implies that:
A. Last quarter’s sales have no impact.
B. Sales decrease by 0.74 every quarter.
C. Seasonally adjusted and detrended sales increase by 0.74 every quarter.
D. None of the above.
How do the results compare against the linear trend and seasonal models?

Lag model fit: Mult R-sq: 0.9761, Adj R-sq: 0.9736
Trend + Lag + Seasonal:
[Figure: residual-vs-time plots for the trend, seasonal, and trend + lag + seasonal models, plus actual-vs-fitted values, t = 0 to 50]

RMSE (Trend) = $729 million
RMSE (Trend + Seasonal) = $498 million

[Figure: actual sales vs. trend-model and seasonal-model predictions for the holdout quarters, t = 56 to 62]
ACF Plots:
[Figure: ACF and PACF plots of the residuals from each model, lags 0 to 15]
Modeling:
Run an ARIMA model using the chosen p, d, and q values.
Diagnostic checking:
Compare model statistics (AIC, BIC, SBIC) to choose the best model.
Plot residual ACF: should be random (no pattern), i.e., white noise.
ACF and PACF
Autocorrelation Function (ACF):
Measures the correlation between observation Y_t and observation Y_(t-k) located k periods apart:
ρ_k = Corr(Y_t, Y_(t-k)) = Cov(Y_t, Y_(t-k)) / (√Var(Y_t) · √Var(Y_(t-k))) = γ_k / γ_0
The lag at which the ACF cuts off estimates the order q in MA(q) models.
Partial autocorrelation function (PACF):
The autocorrelation of a signal with itself k periods apart, with the linear dependency of the signal at shorter lags removed.
The lag at which the PACF cuts off estimates the order p in AR(p) models.
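As a quick illustration of these two functions, consider a simulated AR(2) series (illustrative only, not the course data):

```r
# Simulate 500 observations from an AR(2) process
set.seed(42)
y <- arima.sim(list(ar = c(0.5, 0.3)), n = 500)

# ACF: includes lag 0 (always 1) and decays gradually for an AR process
a <- acf(y, lag.max = 15, plot = FALSE)

# PACF: starts at lag 1 and should cut off after lag 2 for an AR(2) process
p <- pacf(y, lag.max = 15, plot = FALSE)
round(as.numeric(p$acf[1:4]), 2)
```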
AirPassengers Data
Monthly ticket sales counts (in thousands) for 1949-1960.
We wish to predict ticket sales for the next 5 years.
data(AirPassengers)
ts <- AirPassengers
ts
start(ts)
end(ts)
class(ts)
frequency(ts)
cycle(ts)
plot(ts)
abline(lm(ts ~ time(ts)), col="red")
boxplot(ts ~ cycle(ts))
Questions:
What are we trying to depict in this boxplot?
What inferences can we draw from this boxplot?
AirPassengers: Stationarity
plot(ts)
abline(lm(ts ~ time(ts)), col="red")
plot(log(ts))
plot(diff(log(ts)))

Questions:
What did the log transform do?
What did the differencing do?
Do we have a stationary time series?

[Figure: ACF and PACF of diff(log(ts)), used to read off p = 2 and q = 1]
AirPassengers: ARIMA with Cross-Validation
model <- arima(log(ts), c(2,1,1), seasonal=list(order=c(2,1,1), period=12))
predicted <- predict(model, n.ahead=5*12)
predicted <- exp(predicted$pred)  # back-transform from the log scale
predicted <- round(predicted, 0)
predicted
ts.plot(ts, predicted, lty=c(1,3))

# Cross-validation: training data 1949-1958, test data 1959-1960
train <- ts(ts, frequency=12, start=c(1949,1), end=c(1958,12))
model <- arima(log(train), c(2,1,1), seasonal=list(order=c(2,1,1), period=12))
predicted <- predict(model, n.ahead=2*12)
predicted <- exp(predicted$pred)
predicted <- round(predicted, 0)
original <- tail(ts, 24)
original - predicted
RMSE <- sqrt(1/24 * sum((original - predicted)^2))
Spatial Models
Basic idea is similar to that of temporal models:
We want to account for the information due to neighboring observations.
More complex than temporal models, since spatial dependency evolves continuously in a 2-dimensional (or sometimes higher-dimensional) space.
Example: Baltimore House Prices
Goal: To estimate factors that determine price of real estate
We have data on house characteristics (bedrooms, bathrooms, patio, fireplace, AC, etc.) and
geographical location (longitude, latitude, coded as X and Y) for a sample of Baltimore homes.
We also have the selling price (in thousands of dollars).
[Figure: Baltimore home locations, longitude vs. latitude]
The colors and diamond size are proportional to the price of a house.
What can we learn from this graph?
How can we model this effect?
How to Measure ‘Space’?
We must define space in order to measure its effects.
Naive method: Regional dummy variables, e.g., for zip codes.
Weight matrix: n x n neighborhood structure, where: 0 = not neighbor, 1 = neighbor.
Sample Region and Units:
1 2 3
4 5 6
7 8 9

Simple Neighborhood Matrix:
    1 2 3 4 5 6 7 8 9
1   0 1 0 1 0 0 0 0 0
2   1 0 1 0 1 0 0 0 0
3   0 1 0 0 0 1 0 0 0
4   1 0 0 0 1 0 1 0 0
5   0 1 0 1 0 1 0 1 0
6   0 0 1 0 1 0 0 0 1
7   0 0 0 1 0 0 0 1 0
8   0 0 0 0 1 0 1 0 1
9   0 0 0 0 0 1 0 1 0
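The 3x3 grid of units above can be turned into a weight matrix in base R; row-standardizing it makes W %*% y return the average value of each unit's neighbors (the y values below are hypothetical):

```r
# Build the rook-adjacency neighborhood matrix for the 3x3 grid of units
grid <- matrix(1:9, nrow = 3, byrow = TRUE)
W <- matrix(0, 9, 9)
for (i in 1:3) for (j in 1:3) {
  id <- grid[i, j]
  if (i > 1) W[id, grid[i - 1, j]] <- 1  # neighbor above
  if (i < 3) W[id, grid[i + 1, j]] <- 1  # neighbor below
  if (j > 1) W[id, grid[i, j - 1]] <- 1  # neighbor to the left
  if (j < 3) W[id, grid[i, j + 1]] <- 1  # neighbor to the right
}

# Row-standardize: each row sums to 1, so W %*% y averages the neighbors
Ws <- W / rowSums(W)
y  <- c(100, 120, 90, 110, 130, 95, 105, 125, 80)  # hypothetical home values
neighbor_avg <- as.numeric(Ws %*% y)
neighbor_avg[1]  # average of units 2 and 4: 115
```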
Spatial Lag Model
Spatial autocorrelation in the response variable:
Y = ρWY + βX + ε
W: spatial weight matrix; ρ: spatial coefficient.
Incorporates spatial effects by including a spatially lagged dependent variable as an additional predictor.

OLS vs. spatial lag results:
Initial OLS model: AIC = 1793. Spatial lag model: AIC = 1739.
Some of the estimates are smaller in the lag model.
The intercept term switched signs and is no longer significant.
What happened? Which model is better?

Spatial lag estimates:
             Estimate  Std. Error  P-value
(Intercept)   -2.6764      4.8670   0.5824
NBROOM         1.2673      1.0487   0.2269
NBATH          7.6529      1.8711   0.0000
PATIO         11.9579      2.9539   0.0001
FIREPL        11.1740      2.6072   0.0000
AC             8.4183      2.4014   0.0005
Spatial coefficient: Rho = 0.49961; p-value = 6.6502e-14
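A minimal sketch of how the ρWY term enters the data-generating process (simulated data, not the Baltimore sample): solving Y = ρWY + Xβ + ε for Y gives the reduced form Y = (I - ρW)^(-1)(Xβ + ε).

```r
# Simulate from a spatial lag model on a chain of 5 units
set.seed(7)
n   <- 5
W   <- toeplitz(c(0, 1, 0, 0, 0))  # chain adjacency: each unit borders its neighbors
W   <- W / rowSums(W)              # row-standardized weights
rho <- 0.5

X    <- cbind(1, rnorm(n))         # intercept plus one hypothetical predictor
beta <- c(2, 3)
eps  <- rnorm(n, sd = 0.1)

# Reduced form: Y = (I - rho*W)^{-1} (X beta + eps)
Y <- solve(diag(n) - rho * W, X %*% beta + eps)

# Y satisfies the spatial lag equation Y = rho*W*Y + X*beta + eps
max(abs(Y - (rho * W %*% Y + X %*% beta + eps)))  # effectively zero
```

In practice, ρ and β would be estimated by maximum likelihood, e.g., with the `lagsarlm` function in the `spatialreg` package (a tooling assumption; the slides do not specify the software used).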
OLS vs. Spatial Lag
Certain predictors (e.g., presence of patio or fireplace) lost their importance in predicting home
prices when neighboring homes are included (using spatial lag ρWy).
Why?
Houses located in the same area tend to have similar features, e.g., fireplaces and patios in
wealthy neighborhoods, no central AC in poorer neighborhoods.
Hence, prices of neighboring houses already factor in the price effect of these “expected” features.
Lacking these features may change the price, but not by much.
Implication:
Better to buy a low-end house in an expensive neighborhood rather than a high-end house in an
inexpensive neighborhood?
Key Takeaways
Modeling temporal and spatial dependencies in data presents unique challenges such as
autocorrelation and location correlation.
Statistical models are available to account for these dependencies:
Additive seasonality model.
Lag model.
AR, MA, ARMA, ARIMA models.
Spatial lag model.
Assessing model quality:
Use estimates from training set to predict values in test set (predictive accuracy).
Alternative measures of model fit such as AIC and visual examination of residuals are needed.
Such analyses provide insights into relationships among the data that are not available from OLS models.