(Smith, Kohn, 2000) Nonparametric Seemingly Unrelated Regression

* Corresponding author. Tel.: +612-9351-2787; fax: +612-9351-6409.
E-mail address: mikes@econ.usyd.edu.au (M. Smith).

Journal of Econometrics 98 (2000) 257-281
Nonparametric seemingly unrelated regression
Michael Smith (a,*), Robert Kohn (b)
(a) Econometrics and Business Statistics, University of Sydney, Sydney, Australia
(b) Australian Graduate School of Management, University of New South Wales, Kensington, New South Wales, Australia
Received 19 March 1998; received in revised form 9 February 2000; accepted 13 March 2000
Abstract
A method is presented for simultaneously estimating a system of nonparametric regressions which may seem unrelated, but where the errors are potentially correlated between equations. We show that the advantage of estimating such a 'seemingly unrelated' system of nonparametric regressions is that fewer observations can be required to obtain reliable function estimates than if each of the regression equations is estimated separately and the correlation ignored. This increase in efficiency is investigated empirically using both simulated and real data. The method uses a Bayesian hierarchical framework where each regression function is represented as a linear combination of a large number of basis terms. All the regression coefficients, and the variance matrix of the errors, are estimated simultaneously by their posterior means. The computation is carried out using a Markov chain Monte Carlo sampling scheme that employs a 'focused sampling' step to combat the high-dimensional representation of the unknown regression functions. The methodology extends easily to other nonparametric multivariate regression models. © 2000 Elsevier Science S.A. All rights reserved.
JEL classification: C11; C14; C15; C31
Keywords: Nonparametric multivariate regression; Bayesian hierarchical SUR model;
Multivariate subset selection; Markov Chain Monte Carlo
0304-4076/00/$ - see front matter © 2000 Elsevier Science S.A. All rights reserved.
PII: S0304-4076(00)00018-X
1. Introduction
The aim of nonparametric regression is to estimate regression functions without assuming a priori knowledge of their functional forms. The price for this flexibility is that appreciably larger sample sizes are required to obtain reliable nonparametric estimators than for parametric estimators. In this paper, we consider a system of regression equations that can seem unrelated, but are in fact related because their errors are correlated. Such a system of equations is called a set of 'seemingly unrelated' regressions, or a SUR model (Zellner, 1962). This paper provides a Bayesian framework for reliably estimating the regression functions in a nonparametric manner, even for moderate sample sizes, by taking advantage of the correlation structure in the errors. The most important consequence of this work is to show that if the errors are correlated, better nonparametric estimators are obtained by taking advantage of this correlation structure than by ignoring the correlation and estimating the equations one at a time.
Specifically, we consider the system of $m$ regression equations

$$y^i = f^i(x^i) + e^i, \quad \text{for } i = 1, 2, \ldots, m. \qquad (1.1)$$

Here, the superscript denotes that this is the $i$th of $m$ possible regressions, $y^i$ is the dependent variable, $x^i$ is a vector of independent variables and $f^1, \ldots, f^m$ are functions that require estimating in a nonparametric manner. As in the linear Gaussian SUR model, the regressions are related through the correlation structure of the Gaussian errors $e^i$. That is,

$$e \sim N(0, \Sigma \otimes I_n), \qquad (1.2)$$

where $e' = (e^{1\prime}, e^{2\prime}, \ldots, e^{m\prime})$, $e^i$ is the vector of errors for the $n$ observations of the $i$th regression and $\Sigma$ is a positive-definite $(m \times m)$ matrix that also requires estimation. This paper provides a data-driven procedure for estimating the unknown functions $f^i$ (for $i = 1, \ldots, m$) and covariance matrix $\Sigma$ in this model.
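To make the model at (1.1) and (1.2) concrete, the following is a minimal simulation sketch in plain Python. The regression functions, the covariance matrix and the helper names `cholesky` and `simulate_sur` are illustrative assumptions and are not taken from the paper; the sketch only shows how errors that are independent across observations but correlated across equations can be generated.

```python
import math
import random

def cholesky(a):
    """Lower-triangular L with L L' = a, for a small positive-definite matrix."""
    m = len(a)
    low = [[0.0] * m for _ in range(m)]
    for r in range(m):
        for c in range(r + 1):
            s = a[r][c] - sum(low[r][k] * low[c][k] for k in range(c))
            low[r][c] = math.sqrt(s) if r == c else s / low[c][c]
    return low

def simulate_sur(fs, xs, sigma, rng):
    """Return m lists y^i with y^i_t = f^i(x^i_t) + e^i_t and Cov(e_t) = sigma."""
    m, n = len(fs), len(xs[0])
    low = cholesky(sigma)
    ys = [[0.0] * n for _ in range(m)]
    for t in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(m)]   # independent N(0, 1)
        for i in range(m):
            e_it = sum(low[i][k] * z[k] for k in range(i + 1))
            ys[i][t] = fs[i](xs[i][t]) + e_it
    return ys

rng = random.Random(0)
n = 500
xs = [[rng.random() for _ in range(n)] for _ in range(2)]
fs = [math.sin, lambda x: 1.5 * x]      # illustrative functions only
sigma = [[1.0, 0.8], [0.8, 1.0]]        # illustrative covariance only
ys = simulate_sur(fs, xs, sigma, rng)
```

Each observation receives one $m$-vector of errors with covariance `sigma`, which corresponds to the $\Sigma \otimes I_n$ structure once the equations are stacked.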
Such systems of regressions are frequently used in econometric, financial and sociological modeling because taking into account the correlation structure in the errors results in more efficient estimates than ignoring the correlation and estimating the equations one at a time. Most of the literature on estimating a system of equations assumes that the $f^i$ are linear functions. For recent examples, see Bartels et al. (1996), Min and Zellner (1993) and Mandy and Martins-Filho (1993). However, in practice the functional forms of the $f^i$ in many regression applications are unknown a priori, so that an approach that estimates their form is preferable. We examine two such cases here. The first concerns print advertisements in a women's magazine and estimates the relationship between three measures of advertising exposure and the physical positioning of advertisements in the magazine. The second involves estimating an intra-day model for average electricity load in two adjacent Australian states. In this example, we estimate the daily and weekly periodic components of load,
along with a temperature effect. In both examples, significant nonlinear relationships are identified that are difficult to discern using a parametric SUR approach. In addition, substantial correlation is estimated between the regressions and the function estimates differ substantially from those obtained by estimating each of the nonparametric regressions separately and ignoring the correlation between the equations.
Our approach for estimating the system of equations defined at (1.1) and (1.2) models each of the functions $f^i$ as a linear combination of basis terms. We develop a Bayesian hierarchical model to explicitly parameterize the possibility that these terms may be superfluous and have corresponding coefficient values that are exactly zero. A wide variety of bases can be used, including many with a desired structure, such as periodicity or additivity, a point which is demonstrated in the empirical examples. The unknown regression functions are estimated by their posterior means which attach the proper posterior probability to each subset of the basis elements, providing a nonparametric estimate that is both flexible and smooth. We develop a Markov chain Monte Carlo (MCMC) sampling scheme to calculate the posterior means because direct evaluation is intractable. This sampling scheme is a correction of the 'focused sampler' discussed in Wong et al. (1997) and our empirical work shows it to be reliable and much more efficient than the Gibbs sampling alternative. We prove that the iterates of the focused sampler converge to the correct posterior distribution. The performance of the new estimator is investigated empirically with a set of simulation experiments that cover a range of potential regression curves. These demonstrate the improvement that can be obtained by exploiting the correlation structure in a system of regressions. We note that the solution to the nonparametric SUR model discussed in this paper is easily extended to other nonparametric multivariate (or vector) regression models.
Zellner (1962, 1963) provides the seminal analysis of a system of regressions when the unknown functions $f^i$ are assumed linear in the coefficients. Srivastava and Giles (1987) summarize much of the literature dealing with this linear SUR model. However, recent advances in Markov chain Monte Carlo methods enable Bayesian analyses of more complex variations of the SUR model. For example, Chib and Greenberg (1995a) develop sampling schemes that estimate a hierarchical linear SUR model with first-order vector autoregressive or vector moving average errors and extend the analysis to a time varying parameter model. Markov chain Monte Carlo methods also provide a way of reliably estimating nonparametric regressions in a variety of hitherto difficult situations. For example, Smith and Kohn (1996) develop nonparametric regression estimators for regression models where a data transformation may be required and/or outliers may exist in the data. Yee and Wild (1996) use smoothing splines to estimate a system of equations in a nonparametric manner, but they do not have data-driven estimators for the smoothing parameters. In the example in their paper they use values of the smoothing parameters based on the
independent variables, but not the dependent variable. Such an approach is an unsatisfactory way of estimating the smoothing parameters because it does not take into account the curvature exhibited by the dependent variable. Nor is it fully automatic because the effective degrees of freedom is required as an input from the user.
The paper is organized as follows. Section 2.1 discusses how to model the unknown functions and why they are estimated using a hierarchical model. The rest of Section 2 introduces the Bayesian hierarchical SUR model and develops an efficient MCMC sampling scheme to enable its estimation. Section 3 uses the methodology to fit the print advertising and electricity load datasets. Section 4 contains simulation examples which investigate the improvements that can be made using this estimation procedure over a series of separate nonparametric regressions. Appendix A provides the conditional posterior distributions employed in the sampling scheme, while Appendix B proves that the focused sampling step provides an iterate from the correct invariant distribution.
2. Methodology
2.1. Basis representation of functions
Each regression function is modeled as a linear combination of basis functions, so that for a function $f$,

$$f(x) = \sum_{i=1}^{p} \beta_i b_i(x). \qquad (2.1)$$

Here, $B = \{b_1, \ldots, b_p\}$ is a basis of $p$ functions, while the $\beta_i$'s are regression parameters.
A large number of authors use such linear decompositions with a variety of
univariate and higher dimensional bases in the single equation case. For
example, Friedman (1991), Smith and Kohn (1996) and Denison et al. (1998) use
regression splines, Luo and Wahba (1997) use several reproducing kernel bases
and Wahba (1990) uses natural splines. In particular, orthonormal bases, such as
wavelet (Donoho and Johnstone, 1994) or Fourier bases have been used.
However, the computational advantage provided by such orthonormal bases
does not easily extend to the case where the errors are correlated, such as in the
SUR model. In the case of multiple regressors in an equation, additive models of
univariate bases or radial bases (Powell, 1987; Holmes and Mallick, 1998) can be
used.
Given a choice of a particular basis for the approximation at (2.1), the $i$th regression at (1.1) can be written as the linear model

$$y^i = X^i \beta^i + e^i. \qquad (2.2)$$

Here, $y^i$ is the vector of the $n$ observations of the dependent variable, the design matrix $X^i = [\mathbf{b}_1 \,|\, \mathbf{b}_2 \,|\, \cdots \,|\, \mathbf{b}_{p^i}]$, $\mathbf{b}_j$ is a vector of the values of the basis function $b_j$ evaluated at the $n$ observations and $\beta^i$ are the regression coefficients. The errors $e^i$ are correlated with those from the other regressions, as specified in (1.2), and we denote the number of basis terms in the $i$th equation as $p^i$. Note that many basis expansions employ $p^i \geq n$ basis terms and it is inappropriate to estimate the regression coefficients using existing SUR methodology because the function estimates $\hat f^i$, $i = 1, \ldots, m$, would interpolate the data (rather than produce smooth estimates that account for the existence of noise in the regression). Therefore, we estimate the regression parameters using a Bayesian hierarchical SUR model described below that explicitly accounts for the possibility that many of these terms may be redundant.
2.2. A Bayesian hierarchical SUR model
Consider the $i$th regression of a linear SUR model given at Eq. (2.2), where the design matrix $X^i$ is $(n \times p^i)$ and the coefficient vector $\beta^i$ is of length $p^i$. To account explicitly for the notion that variables in this regression can be redundant, we introduce a vector of binary indicator variables $\gamma^i = (\gamma^i_1, \gamma^i_2, \ldots, \gamma^i_{p^i})'$. Here, $\gamma^i_k$ corresponds to the $k$th element of the coefficient vector of the $i$th regression, say $\beta^i_k$, with $\gamma^i_k = 0$ if $\beta^i_k = 0$ and $\gamma^i_k = 1$ if $\beta^i_k \neq 0$. By dropping the redundant terms with zero coefficients, the $i$th regression can be rewritten, conditional on $\gamma^i$, as

$$y^i = X^i_{\gamma^i} \beta^i_{\gamma^i} + e^i. \qquad (2.3)$$

If $q^i_\gamma = \sum_{j=1}^{p^i} \gamma^i_j$, then the design matrix $X^i_{\gamma^i}$ is of size $(n \times q^i_\gamma)$ and $\beta^i_{\gamma^i}$ is a vector of $q^i_\gamma$ elements.
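By stacking the linear models for the $m$ regressions, the SUR model can be written, conditional on $\gamma' = (\gamma^{1\prime}, \gamma^{2\prime}, \ldots, \gamma^{m\prime})$, as

$$y = X_\gamma \beta_\gamma + e. \qquad (2.4)$$

Here, $y' = (y^{1\prime}, y^{2\prime}, \ldots, y^{m\prime})$, $X_\gamma = \mathrm{diag}(X^1_{\gamma^1}, X^2_{\gamma^2}, \ldots, X^m_{\gamma^m})$ and $\beta_\gamma' = (\beta^{1\prime}_{\gamma^1}, \ldots, \beta^{m\prime}_{\gamma^m})$. If $q_\gamma = \sum_{i=1}^{m} q^i_\gamma$, then $X_\gamma$ is an $(nm \times q_\gamma)$ matrix and $\beta_\gamma$ a vector of $q_\gamma$ elements. To complete this Bayesian hierarchical model, we introduce the following priors on the parameters.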
(i) Following O'Hagan (1995) we construct a conditional prior for $\beta_\gamma$ by setting $p(\beta_\gamma \mid \gamma, \Sigma) \propto p(y \mid \beta_\gamma, \gamma, \Sigma)^{1/n}$ so that $\beta_\gamma \mid \gamma, \Sigma \sim N(\hat\beta_\gamma, n(X_\gamma' A X_\gamma)^{-1})$, where $A = \Sigma^{-1} \otimes I_n$ and $\hat\beta_\gamma = (X_\gamma' A X_\gamma)^{-1} X_\gamma' A y$. This data-based fractional prior contains much less information about $\beta_\gamma$ than the likelihood. The results obtained using this prior are similar to those obtained using the prior $\beta_\gamma \mid \gamma, \Sigma \sim N(0, nI)$, which does not depend on the data. However, we prefer the data-based prior because the conditional posterior mean of $\beta_\gamma$ is unbiased.
(ii) The prior for $\Sigma$ is taken as independent of $\gamma$ and is the commonly used non-informative prior discussed in Zellner (1971), where $p(\Sigma) \propto |\Sigma|^{-(m+1)/2}$.
(iii) The indicator variables $\gamma^i_k$, $k = 1, \ldots, p^i$, $i = 1, \ldots, m$, are taken a priori independently distributed, with $p(\gamma^i_k = 1 \mid \pi^i) = \pi^i$.
(iv) The hyperparameters $\pi^i$, $i = 1, \ldots, m$, are taken as independent and given a non-informative uniform prior on $(0, 1)$.
We integrate the hyperparameters $\pi = (\pi^1, \ldots, \pi^m)$ out of our analysis, so that

$$p(\gamma) = \int p(\gamma \mid \pi)\, p(\pi)\, d\pi = \prod_{i=1}^{m} \int_0^1 (\pi^i)^{q^i_\gamma} (1 - \pi^i)^{p^i - q^i_\gamma}\, d\pi^i = \prod_{i=1}^{m} B(q^i_\gamma + 1,\; p^i - q^i_\gamma + 1),$$

where $B$ is the beta function. This is a different prior on $\gamma$ than that suggested in Smith and Kohn (1996). Note that the model here is a hierarchical SUR model as, conditional on $\gamma$, it is simply a linear SUR model.
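The integrated prior on the indicators can be evaluated on the log scale with the log-gamma function. The sketch below is an illustration only: the function names are hypothetical, and the conditional prior probability in `prior_prob_one` is derived here by normalizing the two beta-function values for a single equation, which gives $(q + 1)/(p^i + 1)$ with $q$ the number of other indicators equal to one.

```python
import math

def log_beta(a, b):
    """log B(a, b) via log-gamma, to avoid overflow for large p."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_prior_gamma(gammas):
    """log p(gamma) = sum_i log B(q^i + 1, p^i - q^i + 1).

    gammas : list of m lists of 0/1 indicators, one list per equation
    """
    total = 0.0
    for g in gammas:
        p_i, q_i = len(g), sum(g)
        total += log_beta(q_i + 1, p_i - q_i + 1)
    return total

def prior_prob_one(g, j):
    """Conditional prior probability that indicator j equals 1, given the
    other indicators of the same equation (hyperparameter integrated out)."""
    g1 = g[:j] + [1] + g[j + 1:]
    g0 = g[:j] + [0] + g[j + 1:]
    l1, l0 = log_prior_gamma([g1]), log_prior_gamma([g0])
    return 1.0 / (1.0 + math.exp(l0 - l1))
```

Because the prior factorizes over equations, only the indicators of the equation containing element `j` enter the conditional probability.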
2.3. Markov chain Monte Carlo sampling
To estimate this model we use the following Markov chain Monte Carlo sampling scheme with the hyperparameter $\pi$ integrated out.
(1) Generate from $\beta_\gamma \mid \gamma, \Sigma, y$
(2) Generate from $\Sigma \mid \beta, \gamma, y = \Sigma \mid \beta_\gamma, \gamma, y$
(3) For $i = 1, 2, \ldots, m$
For $j = 1, 2, \ldots, p^i$
Generate from $\gamma^i_j \mid \Sigma, \gamma \setminus \gamma^i_j, y$ using the sampling step described below.
In this sampling scheme $\beta_\gamma$ is generated from a multivariate normal. Generation of the matrix $\Sigma$ directly from the posterior at step (2) is difficult because the fractional prior $\beta_\gamma \mid \gamma, \Sigma$ is centered at $\hat\beta_\gamma$, which is a function of $\Sigma$. Consequently, we use a Metropolis-Hastings step where the proposal Wishart density is the posterior under a flat conditional prior for $\beta_\gamma$. This works well with between 60% and 90% of those iterates that are generated being accepted. Details of how to generate from the distributions at steps (1) and (2) are given in Appendix A. It is important to note that care has been taken to generate $\gamma^i_j$ without conditioning on $\beta^i_j$ at step (3), otherwise the sampling scheme would be reducible because $\gamma^i_j$ is known exactly given $\beta^i_j$.
Step (3) generates an iterate of $\gamma$ one element at a time. As discussed in Wong et al. (1997), using a Gibbs sampler is computationally demanding because $\gamma$ has $\sum_{i=1}^{m} p^i$ elements, with generation from each conditional posterior density a computationally intensive exercise. To speed up the generation we use the following 'sampling' step, which is an application of the Metropolis-Hastings algorithm. Let $\rho(\gamma^i_j) = p(\gamma^i_j \mid \Sigma, \gamma \setminus \gamma^i_j, y)$ be the conditional posterior density of $\gamma^i_j$ and $s(\gamma^i_j) = p(\gamma^i_j \mid \gamma \setminus \gamma^i_j)$ be its conditional prior density, with $\pi$ integrated out in both cases. Then, if $\gamma^c$ is the previous value of $\gamma^i_j$, a new value $\gamma^p$ can be generated as follows.
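Sampling Step
(a) Proposal
If $\gamma^c = 1$ then generate $\gamma^p$ from the proposal density
$$Q(\gamma^c = 1 \to \gamma^p = 0) = s(\gamma^i_j = 0)\, \min\left\{1, \frac{\rho(\gamma^i_j = 0)}{s(\gamma^i_j = 0)}\right\}.$$
If $\gamma^c = 0$ then generate $\gamma^p$ from the proposal density
$$Q(\gamma^c = 0 \to \gamma^p = 1) = s(\gamma^i_j = 1)\, \min\left\{1, \frac{\rho(\gamma^i_j = 1)}{s(\gamma^i_j = 1)}\right\}.$$
(b) Metropolis-Hastings acceptance probabilities
If $\gamma^c = 0$ and $\gamma^p = 1$, then accept $\gamma^p$ with probability $\alpha = \min\{1, s(\gamma^i_j = 0)/\rho(\gamma^i_j = 0)\}$; otherwise set $\gamma^p = 0$.
If $\gamma^c = 1$ and $\gamma^p = 0$, then accept $\gamma^p$ with probability $\alpha = \min\{1, s(\gamma^i_j = 1)/\rho(\gamma^i_j = 1)\}$; otherwise set $\gamma^p = 1$.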
Appendix A calculates the posterior $\rho$ and prior $s$, with the latter being a trivial calculation. Generating from the proposal in part (a) can be undertaken efficiently as follows.
First generate $u$ from a uniform distribution on $(0, 1)$, then
(i) if $\gamma^c = 1$ and $u < s(\gamma^i_j = 1)$, set $\gamma^p = 1$.
(ii) if $\gamma^c = 1$ and $u > s(\gamma^i_j = 1)$, generate $\gamma^p$ from the density $p(\gamma^p = 0) = \min(1, \rho(\gamma^i_j = 0)/s(\gamma^i_j = 0))$.
(iii) if $\gamma^c = 0$ and $u < s(\gamma^i_j = 0)$, then set $\gamma^p = 0$.
(iv) if $\gamma^c = 0$ and $u > s(\gamma^i_j = 0)$, generate $\gamma^p$ from the density $p(\gamma^p = 1) = \min(1, \rho(\gamma^i_j = 1)/s(\gamma^i_j = 1))$.
Generation is efficient because in the nonparametric regression problem most of the indicators will be zero and $s(\gamma^i_j = 0)$ will be close to one. Hence, most of the time $\gamma^c = 0$ and $u < s(\gamma^i_j = 0)$, so that case (iii) is undertaken most frequently and $\rho$ is calculated infrequently.
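The steps above can be sketched in a few lines of Python for a single binary indicator. This is an illustration under stated assumptions, not the authors' implementation: the names `sampling_step`, `move_probability` and `capped_ratio` are hypothetical, the conditional posterior is supplied as a callable so that it is only evaluated when the cheap prior screen fails, and a zero denominator is treated as an infinite ratio.

```python
def capped_ratio(a, b):
    """min(1, a / b), treating a / 0 as +infinity."""
    return 1.0 if b <= a else a / b

def sampling_step(gamma_c, s1, posterior_one, rng):
    """One update of a binary indicator.

    gamma_c       : current value (0 or 1)
    s1            : conditional prior probability that the indicator is 1
    posterior_one : zero-argument callable returning the conditional
                    posterior probability that the indicator is 1
    rng           : object with a random() method returning U(0, 1) draws
    """
    s = (1.0 - s1, s1)                       # s[g] is the prior at value g
    if rng.random() < s[gamma_c]:
        return gamma_c                       # cases (i) and (iii): no posterior needed
    rho1 = posterior_one()                   # cases (ii) and (iv): expensive evaluation
    rho = (1.0 - rho1, rho1)
    new = 1 - gamma_c
    if rng.random() >= capped_ratio(rho[new], s[new]):
        return gamma_c                       # proposal keeps the current value
    if rng.random() < capped_ratio(s[gamma_c], rho[gamma_c]):
        return new                           # part (b): proposed move accepted
    return gamma_c

def move_probability(gamma_c, s1, rho1):
    """Overall probability that the step changes the indicator."""
    s, rho = (1.0 - s1, s1), (1.0 - rho1, rho1)
    new = 1 - gamma_c
    return min(s[new], rho[new]) * capped_ratio(s[gamma_c], rho[gamma_c])
```

With these definitions the product of the posterior probability of the current value and `move_probability` is the same in both directions, which is the detailed balance condition that makes the conditional posterior invariant. Passing `random.Random(seed)` as `rng` gives reproducible draws.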
Appendix B proves that the sampling method described above is a correct application of the Metropolis-Hastings method and therefore that $\rho(\gamma^i_j) = p(\gamma^i_j \mid \Sigma, \gamma \setminus \gamma^i_j, y)$ is the invariant distribution of each step. The appendix also includes a lemma demonstrating that the Metropolis-Hastings ratios at part (b) of the sampling step will usually be one, or close to one. Because steps (1)-(3) of the sampling scheme either generate directly from the conditional posterior distributions, or use a Metropolis-Hastings step, the scheme converges to its invariant distribution, which is the posterior distribution $\beta, \Sigma, \gamma \mid y$ (Tierney, 1994).
The sampling scheme is much faster because step (3) involves a much smaller number of complex calculations than the full Gibbs sampler. Moreover, focusing the generations is especially important in nonparametric SUR models compared to a single equation regression, because there are more basis terms. We have found this sampler to have strong empirical convergence properties, a point that is demonstrated in the examples in Sections 3 and 4. These sections also compare the sampler to the Gibbs alternative and demonstrate that it is more computationally efficient.
A sampler that generates solely from the parameter space of $\gamma$ is not considered as it is difficult to calculate the posterior distribution $\gamma^i_j \mid \gamma \setminus \gamma^i_j, y$. Similarly, samplers that generate from either the parameter space of $(\gamma, \Sigma)$ or $(\gamma, \beta)$ are not considered because it is difficult to recognize the conditional posterior distribution $\Sigma \mid \gamma, y$, or calculate $\gamma^i_j \mid \gamma \setminus \gamma^i_j, \beta, y$.
2.4. Parameter estimation
Given an initial state for the Markov chain and a 'warmup period', after which the sampler is assumed to have converged to the joint posterior distribution, we can collect iterates $(\beta^{[1]}, \Sigma^{[1]}, \gamma^{[1]}), \ldots, (\beta^{[J]}, \Sigma^{[J]}, \gamma^{[J]})$ which form a Monte Carlo sample from the joint posterior distribution. It is this sample that is used for inference.
The posterior mean $E[\Sigma \mid y]$ is estimated by the histogram estimate $\hat\Sigma = (1/J) \sum_{j=1}^{J} \Sigma^{[j]}$. We do not use a mixture estimate because the distribution of $\Sigma \mid \beta, \gamma, y$ is difficult to identify (which is also the reason a Metropolis-Hastings step is used at step (2) of the sampler).
The posterior mean of the regression parameters, $E[\beta \mid y]$, is estimated using the mixture estimate

$$\hat\beta = \frac{1}{J} \sum_{j=1}^{J} E[\beta \mid \gamma^{[j]}, \Sigma^{[j]}, y]. \qquad (2.5)$$

Each of the conditional expectations in the sum is simple to calculate because $E[\beta_\gamma \mid \gamma, \Sigma, y] = \hat\beta_\gamma$, while elements of $\beta$ that are not common to $\beta_\gamma$ are set to zero.
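The averaging in (2.5) amounts to padding each conditional mean with zeros for the excluded terms and then taking the mean over iterates. A minimal sketch follows; the function name and the input layout (one pair of indicator vector and conditional mean per retained iterate) are assumptions made for illustration.

```python
def posterior_mean_beta(draws, p):
    """Average E[beta | gamma^[j], Sigma^[j], y] over the retained iterates.

    draws : list of (gamma, beta_hat_gamma) pairs; gamma is a list of p 0/1
            indicators and beta_hat_gamma holds one conditional mean per
            indicator equal to 1, in the same order
    p     : total number of basis terms
    """
    total = [0.0] * p
    for gamma, beta_hat_gamma in draws:
        selected = iter(beta_hat_gamma)
        for k in range(p):
            if gamma[k] == 1:
                total[k] += next(selected)   # excluded terms contribute zero
    return [t / len(draws) for t in total]
```

For example, with two iterates that select terms {1, 3} and {3} of three basis terms, the estimate for the second coefficient is exactly zero while the third averages both conditional means.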
The posterior mean $E[f^i(z) \mid y]$ of the functions at Eq. (1.1) at any point $z$ in the domain of $x^i$ is estimated using the mixture estimate

$$\hat f^i(z) = \frac{1}{J} \sum_{j=1}^{J} E[f^i(z) \mid \gamma^{[j]}, \Sigma^{[j]}, y] = \mathbf{b}(z)' \frac{1}{J} \sum_{j=1}^{J} E[\beta^i \mid \gamma^{[j]}, \Sigma^{[j]}, y] = \mathbf{b}(z)' \hat\beta^i.$$

Here, $\mathbf{b}(z) = (b_1(z), \ldots, b_{p^i}(z))'$ is a vector containing the basis function expansion of $f^i$ evaluated at the point $z$. The vector $\hat\beta^i$ is made up of the elements of $\hat\beta$ that correspond to $\beta^i$. If the function is univariate, so that $x^i$ is a scalar, then $\hat f^i$ is an estimate of a curve, while for higher dimensions it is a surface. For additive nonparametric models the component function estimates can easily be calculated separately by identifying the basis terms and regression coefficient estimates that correspond to each function and forming the inner product of these.
2.5. Standardized residuals
A common diagnostic in parametric models is an estimate of the standardized errors $r = (R \otimes I_n) e$, with $R'R = \Sigma^{-1}$, which can be shown to be distributed $N(0, I_{nm})$. We estimate the posterior mean $E[r \mid y]$ of these using the histogram estimate

$$\hat r = \frac{1}{J} \sum_{j=1}^{J} (R^{[j]} \otimes I_n)\, e^{[j]}. \qquad (2.6)$$

In the above, $R^{[j]}$ is a Cholesky factor, such that $R^{[j]\prime} R^{[j]} = (\Sigma^{[j]})^{-1}$, while $e^{[j]} = y - X\beta^{[j]}$. After calculation, these standardized residuals can be separated into $m$ sets, one for each regression, so that $\hat r' = (\hat r^{1\prime}, \ldots, \hat r^{m\prime})$. Such residuals should be approximately $N(0, I_{nm})$.
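For a single draw of the covariance matrix, the standardization can be sketched as follows. The sketch assumes the factor satisfies $R'R = \Sigma^{-1}$ and uses one convenient choice, $R = L^{-1}$ with $L$ the lower Cholesky factor of $\Sigma$, applied by forward substitution; the function names are hypothetical and the errors are stored as one $m$-vector per observation rather than stacked by equation.

```python
import math

def cholesky(a):
    """Lower-triangular L with L L' = a, for a small positive-definite matrix."""
    m = len(a)
    low = [[0.0] * m for _ in range(m)]
    for r in range(m):
        for c in range(r + 1):
            s = a[r][c] - sum(low[r][k] * low[c][k] for k in range(c))
            low[r][c] = math.sqrt(s) if r == c else s / low[c][c]
    return low

def standardize(errors, sigma):
    """Map each observation's error vector e_t to r_t = R e_t with R = L^{-1},
    so that R'R = sigma^{-1} and Cov(r_t) is the identity."""
    low = cholesky(sigma)
    out = []
    for e_t in errors:
        r_t = []
        for i in range(len(e_t)):            # forward substitution: L r_t = e_t
            acc = e_t[i] - sum(low[i][k] * r_t[k] for k in range(i))
            r_t.append(acc / low[i][i])
        out.append(r_t)
    return out
```

Averaging the output of `standardize` over the retained iterates, each with its own covariance draw and residual vector, gives a histogram estimate of the form (2.6).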
3. Real data examples
In this section we apply the approach to two systems where the errors are likely to be correlated across equations and the functional forms of the relationships $f^i$ are unknown a priori and require estimation. In such cases, we show that the function estimates differ substantially from those obtained using the equation-by-equation equivalent estimator. We also show that the sampling scheme results in a major decrease in the computational burden of generating $\gamma$ over the full one-at-a-time Gibbs sampler.
3.1. Print advertising data
We demonstrate our procedure using $n = 457$ observations of data from six issues of an Australian monthly women's magazine collected by Starch INRA Hooper. Each observation corresponds to an advertisement placed in the magazine and the following three advertisement exposure scores, which are recorded from an experimental audience, are used as measures of the various levels of effectiveness of the print advertisement.
$y^1$ (noted score): Proportion of respondents who claim to recognize the ad as having been seen by them in that issue.
$y^2$ (associated score): Proportion of the respondents who claim to have noticed the advertiser's brand or company name or logo.
$y^3$ (read-most score): Proportion of respondents who claim to have read half or more of the copy.
These scores are considered to measure advertisement exposure at increasing levels of depth. It has long been thought that the positioning of an advertisement within an issue affects its exposure to an audience (Hanssens and Weitz, 1980). To quantify this we constructed the variable $P$ = (page number)/(number of pages in issue) to represent the position in the issue in which each advertisement appeared.
To estimate how the exposure of a print advertisement is affected by its position in the magazine, we considered the three nonparametric regressions

$$y^i = f^i(P) + e^i \quad \text{for } i = 1, 2, 3. \qquad (3.1)$$

To model each unknown function we use a thin plate spline basis (Powell, 1987), with basis terms $b_j(x) = |x - k_j|^2 \log(|x - k_j|)$, $j = 1, \ldots, n$, where $k_1, \ldots, k_n$ are the so-called 'knots' which we set equal to the $n$ observations of the independent variable. Following previous authors (Wahba, 1990) we also augment these basis terms with an intercept and linear term, so that $p^1 = p^2 = p^3 = 459$. Expected features in the functions $f^i$ include high casual attention to advertisements placed in the front (and to a lesser extent back) of the magazine, while the pre-editorial slots (where $P$ is about 0.7) are thought to attract more in-depth attention. The three scores $y^1$, $y^2$ and $y^3$ are highly positively correlated and it is likely that the errors are correlated, so a SUR model appears appropriate.
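As an illustration of how such a design matrix is assembled, the sketch below builds the intercept, linear term and one thin plate spline term per knot, with knots at the observations. The function names are hypothetical, and the value of the basis term at zero distance is set to its limit of zero.

```python
import math

def thin_plate(x, knot):
    """|x - knot|^2 log|x - knot|, with the limiting value 0 at x == knot."""
    r = abs(x - knot)
    return 0.0 if r == 0.0 else r * r * math.log(r)

def design_matrix(x):
    """Rows are observations; columns are an intercept, a linear term and
    one thin plate spline term per knot (knots placed at the observations)."""
    return [[1.0, xt] + [thin_plate(xt, k) for k in x] for xt in x]
```

With $n = 457$ observations this gives $457 + 2 = 459$ columns per equation, matching the count quoted in the text.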
In a parametric SUR model with the same independent variables the generalized and ordinary least-squares estimates of the regression coefficients are the same. However, this result does not extend to the case where variable selection is undertaken on the terms in the regressions. In particular, this applies to the nonparametric function estimators used in this paper as they are based on variable selection and model averaging applied to linear basis decompositions.
Fig. 1. (a)-(c) Function estimates of $f^1$, $f^2$ and $f^3$, respectively. Bold lines are the nonparametric SUR estimates, while the dashed lines are the single equation estimates. Panels (d)-(f) contain scatter plots of $P$ versus the standardized uncorrelated residuals resulting from the nonparametric SUR fit.
The equations at (3.1) were estimated both as a SUR system and individually with the analogous single equation estimator that ignores any correlation between the equations. The resulting function estimates are plotted in Figs. 1(a)-(c) and it can be seen that the SUR and equation-by-equation estimates differ substantially. The estimate $\hat\Sigma$ from the SUR model is given below (correlations, shown below the diagonal, in italics)

$$\hat\Sigma = \begin{pmatrix} 2.166 \times 10^{-2} & 2.103 \times 10^{-2} & 1.205 \times 10^{-2} \\ 0.956 & 2.237 \times 10^{-2} & 1.275 \times 10^{-2} \\ 0.822 & 0.856 & 0.992 \times 10^{-2} \end{pmatrix}$$

confirming the existence of high correlation in the errors. The SUR function estimates suggest that the front (and to a lesser extent the back) of the magazine are areas in which advertisements achieve higher average exposure, though this is more prominent for the noted and associated scores, $y^1$ and $y^2$, than for the read-most score $y^3$. The pre-editorial slots also result in increased exposure, with a particularly positive effect on in-depth exposure, as measured by $y^3$. The relationships are distinctly nonlinear and would be hard to discern a priori using parametric SUR estimation. To help confirm that the nonparametric SUR (NSUR) estimates correctly capture the nonlinear relationships between $y^1$, $y^2$, $y^3$ and $P$, we calculated the standardized uncorrelated residuals $\hat r^i$, $i = 1, 2, 3$. Figs. 1(d)-(f) plot these residuals against $P$, and these confirm that there is no significant nonlinear relationship between the residuals and $P$.

Table 1
Ratio of variances for the proposed sampler over the Gibbs sampler for the function estimates of $f^1$, $f^2$ and $f^3$ calculated at four points on the domain of $P$

            Eq. (1)   Eq. (2)   Eq. (3)
P = 0.2     2.028     3.027     2.195
P = 0.4     6.564     4.684     6.484
P = 0.6     2.546     3.407     7.102
P = 0.8     4.174     7.634     6.660
We also estimated the NSUR model using the sampling scheme outlined in Section 2.3, but using a Gibbs sampler where the elements $\gamma^i_j$ were generated at step (3) directly from their posteriors. With both sampling schemes, we used a warmup period of 2000 iterations and a further 1000 iterates for inference, resulting in $459 \times 3 \times 3000$ generations from the conditional posterior at (A.1) for the Gibbs sampler. In comparison, the proposed sampler took only 91,511 computationally equivalent generations, 2.2% of the number required by the Gibbs sampler. Because the Gibbs sampler involves more generations, it would be expected to result in more random iterates than the proposed sampler. To measure this, we calculated the variances of the function estimates. Table 1 provides the ratio of the variance for the proposed sampler over that of the Gibbs sampler. It does so for function estimates calculated at $P = 0.2, 0.4, 0.6, 0.8$ for all three regression equations. It reveals that the proposed sampler needs to be run for two to seven times as many iterations as the Gibbs sampler to obtain function estimates with the same variances. However, this increased number of iterations still takes only 4-14% of the time required by the Gibbs sampler and is therefore substantially more efficient overall. We note that function estimates are averages of terms in a stationary sequence whose variances are computed as in Box et al. (1994, p. 34).
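The 2.2% figure quoted above follows directly from the counts in the text, as this short worked calculation shows:

$$459 \times 3 \times 3000 = 4{,}131{,}000, \qquad \frac{91{,}511}{4{,}131{,}000} \approx 0.0222.$$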
3.2. Intra-day electricity load model
Short-term forecasting of regional electricity load based on intra-day data is an important problem for electricity utilities and electricity regulatory authorities. Load forecasts are required both to determine electricity dispatch schedules and to price electricity in wholesale markets. Harvey and Koopman (1993) and Smith (2000) consider nonparametric regressions with daily, weekly and possible temperature effects as models for electricity load. We consider estimating such a model for three weeks of half-hourly electricity load data from the adjacent Australian states of New South Wales (NSW) and Victoria (VIC). Let $L^1$ and $L^2$ be the half-hourly average electricity load of NSW and VIC, respectively, $t_D$ and $t_W$ be the time of day and time of week of the half-hourly observations normalized to $[0, 1)$ and $T$ be the temperature in the NSW state capital of Sydney. We estimate the additive nonparametric SUR:

$$L^1 = f^1_1(t_D) + f^1_2(t_W) + f^1_3(T) + e^1,$$
$$L^2 = f^2_1(t_D) + f^2_2(t_W) + e^2, \qquad (3.2)$$

where the periodic profile of electricity load has been decomposed into periodic daily ($f^i_1$) and weekly ($f^i_2$) effects for both states. An additive temperature effect is considered for NSW, but we do not include one for VIC because half-hourly temperature data were not available to us for this state.
We model both $f^i_1$ and $f^i_2$ with a periodic quadratic reproducing basis (Luo and Wahba, 1997) where $b_i(x) = ((|x - k_i| - \tfrac{1}{2})^2 - 1/12)/2$. We place the knots $k_i$ at all the possible observations, so that for the time of day effect $k_i = i/48$, $i = 1, \ldots, 48$, and for the weekly effect $k_i = i/336$, $i = 1, \ldots, 336$. This basis was chosen because it ensures that the functions $f^1_1$, $f^1_2$, $f^2_1$ and $f^2_2$ are all periodic on $[0, 1)$. The temperature effect was modeled using a thin plate spline with knots at all observations.
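The periodicity of this basis is easy to check numerically. The following sketch evaluates the basis term and constructs the two knot sets described above; the function and variable names are illustrative assumptions.

```python
def periodic_quadratic(x, knot):
    """((|x - knot| - 1/2)^2 - 1/12) / 2 for x and knot in [0, 1]."""
    d = abs(x - knot)
    return ((d - 0.5) ** 2 - 1.0 / 12.0) / 2.0

# Knots at every possible half-hourly observation time
daily_knots = [i / 48 for i in range(1, 49)]      # time-of-day effect
weekly_knots = [i / 336 for i in range(1, 337)]   # time-of-week effect
```

Each basis term takes the same value at $x = 0$ and $x = 1$ for any knot, because $|0 - k| - \tfrac{1}{2}$ and $|1 - k| - \tfrac{1}{2}$ differ only in sign, so any linear combination of the terms is periodic on $[0, 1)$.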
The estimated covariance is (correlation, shown below the diagonal, in italics)

$$\hat\Sigma = \begin{pmatrix} 29088.13 & 26093.55 \\ 0.5208 & 86299.88 \end{pmatrix},$$

where $\mathrm{var}(e^1) < \mathrm{var}(e^2)$ because we control for the variation in temperature in the NSW equation. The positive correlation between the states is likely to be due, in part, to common weather effects and television programming. Figs. 2(a) and (b) plot the sum of the estimated daily and weekly effects for both states against $t_W$, while the estimated additive Sydney temperature effect is given in Fig. 3. Because the regressions are additive, we present the functions normalized so that $\hat f^i_j(0) = 0$. The data were collected during March 1998 (late summer) and higher temperatures result in increased load due to increased usage of air conditioning. Estimates of the periodic profile are more detailed for NSW as the model controls for temperature, while the periodic profiles are smoother for VIC. The functions are highly nonlinear and, as Harvey and Koopman (1993) highlight, are extremely difficult to model using parametric nonlinear models.
The two equations at (3.2) were also estimated using the analogous single equation estimator. Fig. 4 plots the difference between the SUR and single equation estimates of the sum of the daily and weekly periodic effects. The range of the difference is about 250 and 400 megawatts for VIC and NSW, respectively, which is substantial compared to the estimates of $\sqrt{\mathrm{var}(e^1)}$ and $\sqrt{\mathrm{var}(e^2)}$. Because the load profiles evolve slowly during the year and are effectively only static over periods of around two to three weeks, larger sample sizes cannot be used and it is useful to exploit the correlation between equations.

Fig. 2. Sum of the estimated daily and weekly periodic effects for VIC (panel (a)) and NSW (panel (b)). The dashed curves are those resulting from estimation using the proposed sampler and the bold curves are those resulting from estimation using a full Gibbs sampler. Both estimates are plotted together for comparison, although they are almost identical and are hard to visually distinguish.
To investigate the performance of the sampling scheme in Section 2.3 we also estimated the nonparametric SUR model for the electricity load data using a full Gibbs sampler which generated directly from each of the $\sum_{i=1}^{m} p^i$ posterior distributions of the binary indicators $\gamma^i_j$ at step (3) of the sampling scheme. The full Gibbs sampler was run using the same number of iterations for the warmup and sampling periods as the proposed sampler. Figs. 2 and 3 compare the resulting curve estimates for the Gibbs sampler (bold lines) and proposed sampler (dashed lines). The curve estimates are virtually identical and sometimes the two curve estimates are so similar that the dashed lines cannot be made out. It is difficult to see how this could occur if the proposed sampler was not converging to the same distribution as the Gibbs sampler and the sampling scheme was not mixing well with respect to the indicator variables $\gamma$. The results are identical regardless of initial starting state for either sampler.

Fig. 3. Same as in Fig. 2, but for the estimated Sydney additive temperature effect $\hat f^1_3$.
4. Simulation experiments
4.1. Positively correlated univariate regressions
This simulation is concerned with the case where the errors are highly
correlated between regressions. It highlights the improvement in the quality of
the estimates that can be obtained when such correlation is modeled rather than
ignored. There are $m = 4$ univariate regressions, with the covariance matrix
$\Sigma$ specified below at (4.1). Note that the standard deviation of the errors,
$(\mathrm{var}(e^i))^{1/2} = 1$, is high compared to the range of the functions.
Fig. 4. The difference between the SUR and single equation estimates of the sum of the daily and
weekly periodic effects. Panel (a) is for the Victorian equation and panel (b) is for the New South
Wales equation.
Four true functions were carefully chosen to represent a wide variety of
possible relationships. These are $f^1(x) = \sin(8x)$ (which is highly oscillatory),
$f^2(x) = (\phi(x, 0.2, 0.25) + \phi(x, 0.6, 0.2))/4$, with $\phi(x, a, b)$ being a
normal density of mean $a$ and standard deviation $b$ (which requires a locally
adaptive estimator as there are different degrees of smoothness on the left and
right of the function), $f^3(x) = 1.5x$ (which was chosen as many relationships are
often thought to be linear) and $f^4(x) = \cos(2x)$ (which is a smooth nonlinear
function). The independent variables for the four univariate regressions were
$x^1 \sim U(0, 1)$, $x^2 \sim U(0, 1)$ and
$$
\begin{pmatrix} x^3 \\ x^4 \end{pmatrix} \sim
N\left( \begin{pmatrix} 0.5 \\ 0.5 \end{pmatrix},\;
0.3 \begin{pmatrix} 1 & 0.6 \\ 0.6 & 1 \end{pmatrix} \right).
$$
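One replication of this design could be simulated along the following lines. This is a sketch under stated assumptions rather than the code used for the paper: the functions and covariate distributions are taken as printed above, with 0.3 read as a scalar multiplying the 2 x 2 matrix; the error covariance `SIGMA` (unit variances, common correlation 0.9) is a placeholder and not the matrix at (4.1); and all function names are hypothetical.

```python
import math
import random

def phi(x, a, b):
    """Normal density with mean a and standard deviation b."""
    return math.exp(-0.5 * ((x - a) / b) ** 2) / (b * math.sqrt(2.0 * math.pi))

# The four true functions, as printed in the text.
TRUE_FUNCS = [
    lambda x: math.sin(8.0 * x),
    lambda x: (phi(x, 0.2, 0.25) + phi(x, 0.6, 0.2)) / 4.0,
    lambda x: 1.5 * x,
    lambda x: math.cos(2.0 * x),
]

def cholesky(mat):
    """Lower-triangular Cholesky factor of a small symmetric positive definite matrix."""
    m = len(mat)
    low = [[0.0] * m for _ in range(m)]
    for r in range(m):
        for c in range(r + 1):
            s = sum(low[r][k] * low[c][k] for k in range(c))
            if r == c:
                low[r][c] = math.sqrt(mat[r][r] - s)
            else:
                low[r][c] = (mat[r][c] - s) / low[c][c]
    return low

def correlated_normals(low, rng):
    """One draw from N(0, L L') given the Cholesky factor L."""
    z = [rng.gauss(0.0, 1.0) for _ in low]
    return [sum(low[r][k] * z[k] for k in range(r + 1)) for r in range(len(low))]

def simulate(n, sigma, rng):
    """Return covariates xs[i][t] and responses ys[i][t] for the m = 4 equations."""
    err_chol = cholesky(sigma)
    x_chol = cholesky([[0.3, 0.18], [0.18, 0.3]])  # 0.3 * [[1, 0.6], [0.6, 1]]
    xs = [[] for _ in range(4)]
    ys = [[] for _ in range(4)]
    for _ in range(n):
        x34 = correlated_normals(x_chol, rng)
        x_t = [rng.random(), rng.random(), 0.5 + x34[0], 0.5 + x34[1]]
        e_t = correlated_normals(err_chol, rng)
        for i in range(4):
            xs[i].append(x_t[i])
            ys[i].append(TRUE_FUNCS[i](x_t[i]) + e_t[i])
    return xs, ys

# Placeholder error covariance (an assumption): unit variances, correlation 0.9.
SIGMA = [[1.0 if r == c else 0.9 for c in range(4)] for r in range(4)]
xs, ys = simulate(100, SIGMA, random.Random(0))
```

Each equation has its own covariate, so the only link between the four regressions in this sketch is the correlation of the errors drawn jointly in `e_t`.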

We generated $n = 100$ data points from this true SUR model and applied the
nonparametric SUR estimator to this data. We use a quartic reproducing kernel
basis (Luo and Wahba, 1997) with basis terms
$$
b_i(x) = \frac{1}{24}\left[ \left( |x - x_i| - \frac{1}{2} \right)^4
- \frac{1}{2}\left( |x - x_i| - \frac{1}{2} \right)^2 + \frac{7}{240} \right]
\quad \text{for } i = 1, \ldots, n,
$$
augmented with an intercept and linear term.
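As an illustration, the basis terms can be evaluated as in the short Python sketch below. It assumes the kernel is a function of the absolute distance |x - x_i| and that one knot is placed at every observed x; the function names are hypothetical and are not taken from the authors' FORTRAN implementation.

```python
def quartic_rk_basis(x, knot):
    """b_i(x) = (u**4 - u**2 / 2 + 7/240) / 24 with u = |x - x_i| - 1/2."""
    u = abs(x - knot) - 0.5
    return (u ** 4 - 0.5 * u ** 2 + 7.0 / 240.0) / 24.0

def design_matrix(xs):
    """Rows [1, x, b_1(x), ..., b_n(x)], using every observed x as a knot."""
    return [[1.0, x] + [quartic_rk_basis(x, knot) for knot in xs] for x in xs]

X = design_matrix([0.1, 0.4, 0.7])
```

With n observations this gives n + 2 candidate columns per equation, which is why the binary indicators are needed to select a small subset of basis terms.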
To assess the quality of the resulting estimates of the four functions, we
calculated the mean squared difference between the function estimates and the
true functions. This measure of distance between the two is defined as
$$
\mathrm{MSD}_i = \frac{1}{200} \sum_{k=1}^{200} \left( \hat{f}^i(z_k) - f^i(z_k) \right)^2,
$$
where $\min(x^i) = z_1 < z_2 < \cdots < z_{200} = \max(x^i)$ is an evenly spaced grid
over the domain of $x^i$. For the same data we also fit an analogous single equation
univariate nonparametric estimator to the four regressions. The mean squared
difference was also calculated for each of these four function estimates.
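The distance measure is simple to compute. A minimal sketch follows, in which `f_hat` and `f_true` are hypothetical callables standing in for an estimated and a true function, and the grid runs from the smallest to the largest observed covariate value.

```python
def msd(f_hat, f_true, x_obs, grid_size=200):
    """Mean squared difference over an evenly spaced grid on [min(x), max(x)]."""
    lo, hi = min(x_obs), max(x_obs)
    step = (hi - lo) / (grid_size - 1)
    grid = [lo + k * step for k in range(grid_size)]
    return sum((f_hat(z) - f_true(z)) ** 2 for z in grid) / grid_size
```

Because the grid is defined by the observed range of each covariate, the measure is evaluated only where the data can inform the estimate.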
The entire process was repeated one hundred times. Figs. 5(a)-(d) give boxplots
of the one hundred resulting values of $\log(\mathrm{MSD}_i)$ for each of the four
functions ($i = 1, 2, 3, 4$) and for both the nonparametric SUR estimator
(NSUR) and individual nonparametric estimators (NR). Fig. 5 shows that
taking into account the correlation between the errors has substantially and
consistently improved the resulting estimates of all the regression functions.
To examine the qualitative improvement that occurs, we focus on the single
data set corresponding to the 50th sorted value of $\sum_{i=1}^{4} \mathrm{MSD}_i$
for the nonparametric SUR estimator. This data set can be regarded as providing a
'typical' example of the procedure and is plotted as four scatter plots in
Figs. 5(e)-(h) and again in Figs. 5(i)-(l). The nonparametric SUR estimates of the
four functions for this data appear in Figs. 5(e)-(h) and the estimates for the
separate nonparametric regressions appear in Figs. 5(i)-(l). These figures show
that the nonparametric SUR estimator significantly outperforms the separate
nonparametric estimators which ignore the correlation between the separate
regressions. The variance of the errors $\Sigma$ and its estimate $\hat{\Sigma}$ for
this data set are given below.
$$
\Sigma = \begin{pmatrix}
1 & 0.96 & 0.64 & 0.93 \\
  & 1    & 0.98 & 0.90 \\
  &      & 1    & 0.85 \\
  &      &      & 1
\end{pmatrix}, \qquad
\hat{\Sigma} = \begin{pmatrix}
0.824 & 0.707 & 0.489 & 0.741 \\
0.845 & 0.849 & 0.674 & 0.755 \\
0.570 & 0.773 & 0.895 & 0.656 \\
0.850 & 0.854 & 0.722 & 0.922
\end{pmatrix}
\tag{4.1}
$$
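The printed estimate is not symmetric. The check sketched below suggests one reading of the layout: the diagonal and upper triangle hold variances and covariances, while each entry below the diagonal is the correlation implied by them. This reading is inferred from the numbers alone and is an assumption, not something stated in this section.

```python
import math

# The estimate at (4.1), row by row as printed.
SIGMA_HAT = [
    [0.824, 0.707, 0.489, 0.741],
    [0.845, 0.849, 0.674, 0.755],
    [0.570, 0.773, 0.895, 0.656],
    [0.850, 0.854, 0.722, 0.922],
]

def implied_correlations(mat):
    """Correlation for each pair (r, c), r > c, from the diagonal and upper triangle."""
    out = {}
    for r in range(len(mat)):
        for c in range(r):
            out[(r, c)] = mat[c][r] / math.sqrt(mat[c][c] * mat[r][r])
    return out

# Print each implied correlation next to the printed lower-triangle entry.
for (r, c), rho in implied_correlations(SIGMA_HAT).items():
    print(r + 1, c + 1, round(rho, 3), SIGMA_HAT[r][c])
```

Under this reading the implied and printed values agree to within rounding of the three-decimal entries.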
Fig. 5. (a)-(d) Boxplots of the $\log(\mathrm{MSD}_i)$ for $i = 1, 2, 3$ and 4, respectively. The
left-hand boxplot is for the NSUR estimator, while the right-hand boxplot is for the NR estimation
procedure. Panels (e)-(h) contain scatter plots of $x^i$ against $y^i$, along with the function
estimates $\hat{f}^i$ (bold line) and true functions $f^i$ (dotted line) for $i = 1, 2, 3, 4$ that
result from applying the NSUR estimator. Panels (i)-(l) plot the function estimates $\hat{f}^i$
(bold line) and true functions $f^i$ (dotted line) for $i = 1, 2, 3, 4$ that result from applying
the NR procedure to the same data.
Table 2
Time taken by the NSUR procedure to complete a fit to data generated from the model in Section
4.1 for four different sample sizes. In the sampling procedure 2000 iterates were used for the warmup
and the Monte Carlo sample consisted of a subsequent 1000 iterates. The machine used was a low-end
workstation.
Sample size    n = 100    n = 200    n = 400    n = 800
Time (s)       43         58         213        3850
The estimate compares favorably to the 'best possible' estimate that arises
from the sample variance of the true errors themselves, which are known
because this is a simulated example:
$$
\widehat{\mathrm{var}}\left( (e^1, e^2, e^3, e^4)' \right) = \begin{pmatrix}
0.842 & 0.702 & 0.449 & 0.698 \\
0.821 & 0.869 & 0.673 & 0.732 \\
0.522 & 0.770 & 0.879 & 0.603 \\
0.823 & 0.850 & 0.697 & 0.854
\end{pmatrix}.
$$
To demonstrate that the NSUR estimator is practical to implement, we report
the time required to fit this four equation system for a variety of sample sizes in
Table 2. The computer used was a low-end modern workstation while the code
was written efficiently in FORTRAN. Although these timings are implementation
dependent, they do indicate that this computationally intensive Markov
chain Monte Carlo procedure runs in a reasonable time.
4.2. Unrelated regressions
The previous simulation considers a related set of regressions and demonstrates
the improvements in the regression function estimates that can occur
when correlation between the regressions is modeled and estimated, rather than
ignored. However, is there a risk of degrading the function estimates by
modeling correlation that does not exist?
To investigate this case, we repeat the simulation undertaken above, except
that the regressions are unrelated, with $\Sigma = 0.5 I_4$. Fig. 6 provides the
equivalent output for this case as was produced in Fig. 5. It can be seen from the
boxplots in Figs. 6(a)-(d) that, in general, there is a slight deterioration in the
$\log(\mathrm{MSD}_i)$ for the NSUR estimator compared to the NR estimation
procedure. This is expected as the regressions are actually not related and the
NSUR procedure also estimates $\Sigma$. However, the loss in the efficiency of the
function estimates is very small and the function estimates from the NSUR
estimator (Figs. 6(e)-(h)) are almost identical to those from the NR procedure
(Figs. 6(i)-(l)) for the median dataset.
Fig. 6. As in Fig. 5, but for the uncorrelated model in Section 4.2.
Nevertheless, in examples when $m$ is large, reliably estimating the $m(m+1)/2$
free parameters of $\Sigma$ in addition to the unknown functions may prove difficult.
In such cases, a more parsimonious representation for $\Sigma$ can be used and
estimated in combination with the unknown functions using MCMC. For
example, Shephard and Pitt (1998) discuss the MCMC estimation of factor
decompositions for $\Sigma$, while Smith and Kohn (1999) examine parsimonious
representations for $\Sigma$ based on the Cholesky decomposition of $\Sigma$.
Acknowledgements
The authors would like to thank Denzil Fiebig, Paul Kofman, Steve Marron,
Gael Martin and Tom Smith for useful comments. They would also like to
thank two anonymous referees for improving the clarity of the presentation and
for pointing out an error in a previous implementation of the sampling scheme.
Both Michael Smith and Robert Kohn are grateful for the support of Australian
Research Council grants.
Appendix A
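A.1. Generating from $\beta_\gamma \mid \Sigma, \gamma, y$
This conditional distribution can be calculated exactly, as
$$
p(\beta_\gamma \mid \Sigma, \gamma, y) \propto p(y \mid \beta_\gamma, \Sigma, \gamma)\,
p(\beta_\gamma \mid \Sigma, \gamma)
\propto \exp\left\{ -\frac{1}{2}\,\frac{n+1}{n}\,
(\beta_\gamma - \hat{\beta}_\gamma)' X_\gamma' A X_\gamma (\beta_\gamma - \hat{\beta}_\gamma) \right\},
$$
so that
$$
\beta_\gamma \mid \Sigma, \gamma, y \sim
N\left( \hat{\beta}_\gamma,\; \frac{n}{n+1} (X_\gamma' A X_\gamma)^{-1} \right).
$$
Here, $\hat{\beta}_\gamma$ and $A$ are defined in Section 2.2.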
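A draw from this normal distribution can be generated from a Cholesky factor of the precision matrix. The sketch below is illustrative only: it takes the mean and the matrix X'AX for the selected columns as given inputs (both are defined in Section 2.2 and are not reconstructed here), and the function name and the use of NumPy are assumptions.

```python
import numpy as np

def draw_beta(beta_hat, xtax, n, rng):
    """Draw from N(beta_hat, n / (n + 1) * inv(xtax)), where xtax stands for X' A X."""
    # If xtax = L L', then v solving L' v = z has covariance inv(xtax).
    chol = np.linalg.cholesky(xtax)
    z = rng.standard_normal(len(beta_hat))
    v = np.linalg.solve(chol.T, z)
    return beta_hat + np.sqrt(n / (n + 1.0)) * v
```

Solving against the Cholesky factor avoids forming the inverse explicitly, which matters when the number of selected basis terms changes at every iteration.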
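A.2. Generating from $\Sigma \mid \beta_\gamma, \gamma, y$
This conditional distribution is difficult to recognize as $\Sigma$ is embedded
in the conditional prior for $\beta_\gamma$. Therefore, to obtain an iterate we use a
Metropolis-Hastings step; see Chib and Greenberg (1995b) for an introduction
to this method. The proposal density from which we generate a candidate iterate
is given by
$$
q(\Sigma) \propto p(y \mid \beta_\gamma, \Sigma, \gamma)\, p(\Sigma)
\propto |\Sigma|^{-(n+m+1)/2} \exp\left\{ -\frac{1}{2}\, \mathrm{tr}(\Sigma^{-1} C) \right\},
$$
which is a Wishart($C$, $n$, $m$) density. Here, $C$ is an $(m \times m)$ matrix with $ij$th
element $c_{ij} = (y^i - X^i_{\gamma^i} \beta^i_{\gamma^i})'(y^j - X^j_{\gamma^j} \beta^j_{\gamma^j})$.
A newly generated iterate $\Sigma^{\mathrm{new}}$ is accepted over the old value
$\Sigma^{\mathrm{old}}$ with probability
$$
\alpha = \min\left\{
\frac{p(\Sigma^{\mathrm{new}} \mid \beta_\gamma, \gamma, y)\, q(\Sigma^{\mathrm{old}})}
     {p(\Sigma^{\mathrm{old}} \mid \beta_\gamma, \gamma, y)\, q(\Sigma^{\mathrm{new}})},\; 1 \right\}
= \min\left\{
\frac{p(\beta_\gamma \mid \Sigma^{\mathrm{new}}, \gamma)}
     {p(\beta_\gamma \mid \Sigma^{\mathrm{old}}, \gamma)},\; 1 \right\}.
$$
High acceptance rates of 60-90% are obtained because the proposal density $q(\cdot)$
is equal to the correct conditional density except for the factor
$p(\beta_\gamma \mid \Sigma, \gamma)$.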
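The step can be sketched in a few lines. The sketch rests on two assumptions that should be checked against Section 2.2: the proposal kernel is read as the inverse Wishart form |Sigma|^(-(n+m+1)/2) exp(-tr(inv(Sigma) C)/2), and the conditional prior of the coefficients is supplied by the caller through the hypothetical function `log_prior_beta(sigma)`. The function names and the use of NumPy are likewise illustrative.

```python
import numpy as np

def propose_sigma(c_mat, n, rng):
    """Draw Sigma with density proportional to
    |Sigma|^(-(n+m+1)/2) * exp(-tr(inv(Sigma) C) / 2)."""
    m = c_mat.shape[0]
    chol_inv_c = np.linalg.cholesky(np.linalg.inv(c_mat))
    z = rng.standard_normal((n, m)) @ chol_inv_c.T  # rows are N(0, inv(C))
    return np.linalg.inv(z.T @ z)  # inverse of a Wishart(n, inv(C)) draw

def mh_update_sigma(sigma_old, c_mat, n, log_prior_beta, rng):
    """One Metropolis-Hastings update; returns the new state and an acceptance flag."""
    sigma_new = propose_sigma(c_mat, n, rng)
    # Only the conditional prior of the coefficients remains in the ratio.
    log_ratio = log_prior_beta(sigma_new) - log_prior_beta(sigma_old)
    if np.log(rng.uniform()) < min(0.0, log_ratio):
        return sigma_new, True
    return sigma_old, False
```

Because the likelihood and the prior for the covariance matrix cancel against the proposal, each update costs one Wishart draw and two evaluations of the coefficient prior.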
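A.3. Calculating $\gamma^i_j \mid \Sigma, \gamma_{\setminus \gamma^i_j}, y$
This density requires calculation to enable generation of $\gamma^i_j$ in the sampling
scheme.
$$
p(\gamma^i_j \mid \Sigma, \gamma_{\setminus \gamma^i_j}, y)
\propto \int p(y \mid \Sigma, \gamma, \beta_\gamma)\, p(\beta_\gamma \mid \Sigma, \gamma)\, d\beta_\gamma \; p(\gamma)
\propto (n+1)^{-q_\gamma/2} \exp\left\{ -\frac{1}{2} S(\gamma, \Sigma) \right\}
p(\gamma^i_j \mid \gamma_{\setminus \gamma^i_j})
\tag{A.1}
$$
where $S(\gamma, \Sigma) = y'Ay - y'AX_\gamma (X_\gamma' A X_\gamma)^{-1} X_\gamma' A y$ and in
Eq. (A.1) the regression coefficient is integrated out using
$$
\beta_\gamma \sim N\left( \hat{\beta}_\gamma,\; \frac{n}{n+1} (X_\gamma' A X_\gamma)^{-1} \right).
$$
The conditional posterior can then be calculated by evaluating (A.1) for
$\gamma^i_j = 1$ and $\gamma^i_j = 0$ and normalizing.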
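A.4. Calculating $p(\gamma^i_j \mid \gamma_{\setminus \gamma^i_j})$
The conditional prior can be calculated by integrating out the hyperparameter $c^i$,
with
$$
p(\gamma^i_j \mid \gamma_{\setminus \gamma^i_j}) \propto
\int_0^1 (c^i)^{q^i_\gamma} (1 - c^i)^{p^i - q^i_\gamma}\, dc^i
= B(q^i_\gamma + 1,\; p^i - q^i_\gamma + 1),
$$
so that $p(\gamma^i_j = 1 \mid \gamma_{\setminus \gamma^i_j}) = 1/(1 + h)$, where
$h = (p^i - a^i)/(a^i + 1)$ and $a^i = \sum_{k \neq j} \gamma^i_k$ is the
number of elements of $\gamma^i_{\setminus \gamma^i_j}$ that are one.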
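The closed form for the conditional prior, and the way it enters the normalization of (A.1), can be illustrated as follows. The sketch takes the two values of S as given inputs; `q_zero` denotes the number of selected terms when the indicator is zero, and all function names are hypothetical.

```python
import math

def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def prior_prob_one(p_i, a_i):
    """p(gamma_j = 1 | other indicators) = 1 / (1 + h), h = (p_i - a_i) / (a_i + 1)."""
    h = (p_i - a_i) / (a_i + 1.0)
    return 1.0 / (1.0 + h)

def prior_prob_one_via_beta(p_i, a_i):
    """The same probability from the two Beta-function values B(q + 1, p - q + 1)."""
    log_b1 = log_beta(a_i + 2, p_i - a_i)      # q = a_i + 1 when gamma_j = 1
    log_b0 = log_beta(a_i + 1, p_i - a_i + 1)  # q = a_i when gamma_j = 0
    return 1.0 / (1.0 + math.exp(log_b0 - log_b1))

def posterior_prob_one(s_one, s_zero, q_zero, n, p_i, a_i):
    """Normalize (n + 1)^(-q/2) * exp(-S/2) * prior over gamma_j = 1 and gamma_j = 0."""
    prior1 = prior_prob_one(p_i, a_i)
    log_k1 = -0.5 * (q_zero + 1) * math.log(n + 1.0) - 0.5 * s_one + math.log(prior1)
    log_k0 = -0.5 * q_zero * math.log(n + 1.0) - 0.5 * s_zero + math.log(1.0 - prior1)
    diff = log_k0 - log_k1
    if diff > 700.0:  # exp would overflow; the probability is numerically zero
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))
```

Working on the log scale keeps the normalization stable when the two values of S differ by a large amount.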
Appendix B
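This appendix provides a proof that the invariant distribution of the proposed
sampling step in Section 2.3 is
$\pi(\gamma^i_j) = p(\gamma^i_j \mid \Sigma, \gamma_{\setminus \gamma^i_j}, y)$.
The proof is as follows. As $\gamma^{\mathrm{new}}$ is a binary variable, the proof is
undertaken by calculating the Metropolis-Hastings ratio for the two transitions
$(\gamma^{\mathrm{old}} = 0 \to \gamma^{\mathrm{new}} = 1)$ and
$(\gamma^{\mathrm{old}} = 1 \to \gamma^{\mathrm{new}} = 0)$ and showing that it is applied
correctly in step (b) of the sampling step. Let
$Q(\gamma^{\mathrm{old}} \to \gamma^{\mathrm{new}})$ be the proposal density, then
$$
Q(1 \to 0) = s(0) \min\{1, \pi(0)/s(0)\} \quad \text{and} \quad
Q(0 \to 1) = s(1) \min\{1, \pi(1)/s(1)\}.
\tag{B.1}
$$
The Metropolis-Hastings ratios for the transitions $0 \to 1$ and $1 \to 0$ are,
respectively,
$$
\alpha_{01} = \min\left\{ 1, \frac{\pi(1) Q(1 \to 0)}{\pi(0) Q(0 \to 1)} \right\}, \qquad
\alpha_{10} = \min\left\{ 1, \frac{\pi(0) Q(0 \to 1)}{\pi(1) Q(1 \to 0)} \right\}.
$$
Because $\pi(1) = 1 - \pi(0)$ and $s(1) = 1 - s(0)$, $\pi(1)/s(1) > 1$ iff
$\pi(0)/s(0) < 1$ and $\pi(1)/s(1) < 1$ iff $\pi(0)/s(0) > 1$. Using these inequalities
and substituting the proposal densities at (B.1) into the ratios, it can be seen that
$\alpha_{01} = \min(1, s(0)/\pi(0))$ and $\alpha_{10} = \min(1, s(1)/\pi(1))$. These
correspond to the ratios applied in part (b) of the sampling step. Hence the
sampling step is reversible and its invariant distribution is $\pi(\gamma^i_j)$.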
Lemma. Let $l(0) = p(y \mid \gamma^i_j = 0, \gamma_{\setminus \gamma^i_j}, \Sigma)$ and
$l(1) = p(y \mid \gamma^i_j = 1, \gamma_{\setminus \gamma^i_j}, \Sigma)$ be the
likelihoods for $\gamma^i_j = 0$ and $\gamma^i_j = 1$. Then,
$$
\alpha_{01} = s(0) + s(1) \min(1, l(1)/l(0)) \quad \text{and} \quad
\alpha_{10} = s(1) + s(0) \min(1, l(0)/l(1)).
\tag{B.2}
$$
Proof. Note that
$$
\pi(1) = \frac{s(1) l(1)}{s(1) l(1) + s(0) l(0)} \quad \text{and} \quad
\pi(0) = \frac{s(0) l(0)}{s(1) l(1) + s(0) l(0)}.
$$
Substituting these into the expressions for $\alpha_{01}$ and $\alpha_{10}$ found in
part (b) results in the expressions at (B.2).
The importance of this lemma is that it reveals that the Metropolis-Hastings
ratios will usually be one, or close to one. Because $s(1) + s(0) = 1$,
$\alpha_{01} = 1$ if $l(1)/l(0) \geq 1$, while $\alpha_{10} = 1$ if $l(0)/l(1) \geq 1$.
Also, even if $l(1)/l(0) < 1$, because in the nonparametric regression problem
$s(0)$ is usually very close to 1, $\alpha_{01} > s(0)$ is close to 1.
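The lemma can be checked numerically. In the sketch below, `s1` is the prior probability s(1), and `l1` and `l0` are arbitrary positive likelihood values; the subscripts 01 and 10 label the transitions 0 to 1 and 1 to 0. The function names are hypothetical and the code only verifies the algebra, it is not the sampler itself.

```python
def posterior(s1, l1, l0):
    """pi(1) and pi(0) from the prior s(1), s(0) = 1 - s(1) and likelihoods l(1), l(0)."""
    s0 = 1.0 - s1
    norm = s1 * l1 + s0 * l0
    return s1 * l1 / norm, s0 * l0 / norm

def proposal(s, pi):
    """Q(. -> k) = s(k) * min(1, pi(k) / s(k)), as in (B.1)."""
    return s * min(1.0, pi / s)

def mh_ratios(s1, l1, l0):
    """Metropolis-Hastings ratios for 0 -> 1 and 1 -> 0 computed from (B.1)."""
    s0 = 1.0 - s1
    pi1, pi0 = posterior(s1, l1, l0)
    q10, q01 = proposal(s0, pi0), proposal(s1, pi1)
    alpha_01 = min(1.0, pi1 * q10 / (pi0 * q01))
    alpha_10 = min(1.0, pi0 * q01 / (pi1 * q10))
    return alpha_01, alpha_10

def lemma_ratios(s1, l1, l0):
    """The same two ratios written as in (B.2)."""
    s0 = 1.0 - s1
    return s0 + s1 * min(1.0, l1 / l0), s1 + s0 * min(1.0, l0 / l1)
```

For a small s(1), as is typical when most basis terms are excluded, both forms give a ratio for the transition 0 to 1 that is at least s(0), which illustrates why the step rarely rejects.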
References
Bartels, R., Fiebig, D., Plumb, M., 1996. Gas or Electricity, which is cheaper?: an econometric
approach with an application to Australian expenditure data. Energy Journal 17, 33-58.
Box, G., Jenkins, G., Reinsel, G., 1994. Time Series Analysis, 3rd Edition. Prentice-Hall, New Jersey.
Chib, S., Greenberg, E., 1995a. Hierarchical analysis of SUR models with extensions to correlated
serial errors and time varying parameter models. Journal of Econometrics 68, 339-360.
Chib, S., Greenberg, E., 1995b. Understanding the Metropolis-Hastings algorithm. The American
Statistician 49, 327-335.
Denison, D., Mallick, B., Smith, A., 1998. Automatic Bayesian curve fitting. Journal of the Royal
Statistical Society, Series B 60, 333-350.
Donoho, D., Johnstone, I., 1994. Ideal spatial adaptation by wavelet shrinkage. Biometrika 81,
425-455.
Friedman, J., 1991. Multivariate adaptive regression splines (with discussion). The Annals of
Statistics 19, 1-141.
Hanssens, D., Weitz, B., 1980. The effectiveness of industrial print advertisements across product
categories. Journal of Marketing Research 17, 294-306.
Harvey, A., Koopman, S., 1993. Forecasting hourly electricity demand using time-varying splines.
Journal of the American Statistical Association 88 (424), 1228-1253.
O'Hagan, A., 1995. Fractional Bayes factors for model comparison (with discussion). Journal of the
Royal Statistical Society, Series B 57, 99-138.
Holmes, C., Mallick, B., 1998. Bayesian radial basis functions of variable dimension. Neural
Computation 10, 1217-1233.
Luo, Z., Wahba, G., 1997. Hybrid adaptive splines. Journal of the American Statistical Association
92, 107-116.
Mandy, D., Martins-Filho, C., 1993. Seemingly unrelated regressions under additive
heteroskedasticity: theory and share equation applications. Journal of Econometrics 58, 315-346.
Min, C., Zellner, A., 1993. Bayesian and non-Bayesian methods for combining models and forecasts
with applications to forecasting international growth rates. Journal of Econometrics 56, 89-118.
Powell, M., 1987. Radial basis functions for multivariate interpolation: a review. In: Mason, J., Cox,
M. (Eds.), Algorithms for Approximation.
Shephard, N., Pitt, M., 1998. Analysis of time varying covariances: a factor stochastic volatility
approach. Preprint.
Smith, M., 2000. Modeling and short term forecasting of New South Wales electricity system load.
Journal of Business and Economic Statistics, to appear.
Smith, M., Kohn, R., 1996. Nonparametric regression via Bayesian variable selection. Journal of
Econometrics 75 (2), 317-344.
Smith, M., Kohn, R., 1999. Bayesian parsimonious covariance matrix estimation. Preprint.
Srivastava, V., Giles, D., 1987. Seemingly Unrelated Regression Equations Models. Marcel Dekker,
New York.
Tierney, L., 1994. Markov chains for exploring posterior distributions. The Annals of Statistics 22,
1701-1762.
Wahba, G., 1990. Spline Models for Observational Data. SIAM, Philadelphia.
Wong, F., Hansen, M., Kohn, R., Smith, M., 1997. Focused sampling and its application to
nonparametric and robust regression. Preprint.
Yee, T., Wild, C., 1996. Vector generalised additive models. Journal of the Royal Statistical Society,
Series B 58, 481-493.
Zellner, A., 1962. An efficient method of estimating seemingly unrelated regression equations and
tests for aggregation bias. Journal of the American Statistical Association 57, 500-509.
Zellner, A., 1963. Estimators for seemingly unrelated regression equations: some exact finite sample
results. Journal of the American Statistical Association 58, 977-992.
Zellner, A., 1971. An Introduction to Bayesian Inference in Econometrics. Wiley, New York.
