$\sum_{i=1}^{p} \beta_i b_i(x)$. (2.1)
Here, $B = \{b_1, \ldots, b_p\}$ is a basis of $p$ functions, while the $\beta_i$'s are regression
parameters.
A large number of authors use such linear decompositions with a variety of
univariate and higher dimensional bases in the single equation case. For
example, Friedman (1991), Smith and Kohn (1996) and Denison et al. (1998) use
regression splines, Luo and Wahba (1997) use several reproducing kernel bases
and Wahba (1990) uses natural splines. In particular, orthonormal bases, such as
wavelet (Donoho and Johnstone, 1994) or Fourier bases have been used.
However, the computational advantage provided by such orthonormal bases
does not easily extend to the case where the errors are correlated, such as in the
SUR model. In the case of multiple regressors in an equation, additive models of
univariate bases or radial bases (Powell, 1987; Holmes and Mallick, 1998) can be
used.
Given a choice of a particular basis for the approximation at (2.1), the ith
regression at (1.1) can be written as the linear model
$y^i = X^i \beta^i + e^i$. (2.2)
M. Smith, R. Kohn / Journal of Econometrics 98 (2000) 257-281
Here, $y^i$ is the vector of the $n$ observations of the dependent variable, the design
matrix $X^i = [\mathbf{b}_1 | \mathbf{b}_2 | \cdots | \mathbf{b}_{p^i}]$, $\mathbf{b}_j$ is a vector of the values of the basis function
$b_j$ evaluated at the $n$ observations and $\beta^i$ are the regression coefficients. The
errors $e^i$ are correlated with those from the other regressions, as specified in (1.2),
and we denote the number of basis terms in the $i$th equation as $p^i$. Note that
many basis expansions employ $p^i \ge n$ basis terms and it is inappropriate to
estimate the regression coefficients using existing SUR methodology because the
function estimates $\hat{f}^i$, $i = 1, \ldots, m$, would interpolate the data (rather than
produce smooth estimates that account for the existence of noise in the regression).
Therefore, we estimate the regression parameters using a Bayesian hierarchical
SUR model described below that explicitly accounts for the possibility that
many of these terms may be redundant.
2.2. A Bayesian hierarchical SUR model
Consider the $i$th regression of a linear SUR model given at Eq. (2.2), where the
design matrix $X^i$ is $(n \times p^i)$ and the coefficient vector $\beta^i$ is of length $p^i$. To account
explicitly for the notion that variables in this regression can be redundant, we
introduce a vector of binary indicator variables $\gamma^i = (\gamma^i_1, \gamma^i_2, \ldots, \gamma^i_{p^i})'$. Here,
$\gamma^i_k$ corresponds to the $k$th element of the coefficient vector of the $i$th regression,
say $\beta^i_k$, with $\gamma^i_k = 0$ if $\beta^i_k = 0$ and $\gamma^i_k = 1$ if $\beta^i_k \neq 0$. By dropping the redundant
terms with zero coefficients, the $i$th regression can be rewritten, conditional on $\gamma^i$,
as
$y^i = X^i_{\gamma^i} \beta^i_{\gamma^i} + e^i$. (2.3)
If $q^i_\gamma = \sum_{j=1}^{p^i} \gamma^i_j$, then the design matrix $X^i_{\gamma^i}$ is of size $(n \times q^i_\gamma)$ and $\beta^i_{\gamma^i}$ is a vector of
$q^i_\gamma$ elements.
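By stacking the linear models for the $m$ regressions, the SUR model can be
written, conditional on $\gamma' = (\gamma^{1\prime}, \gamma^{2\prime}, \ldots, \gamma^{m\prime})$, as
$y = X_\gamma \beta_\gamma + e$. (2.4)
Here, $y' = (y^{1\prime}, y^{2\prime}, \ldots, y^{m\prime})$, $X_\gamma = \mathrm{diag}(X^1_{\gamma^1}, X^2_{\gamma^2}, \ldots, X^m_{\gamma^m})$ and
$\beta_\gamma' = (\beta^{1\prime}_{\gamma^1}, \ldots, \beta^{m\prime}_{\gamma^m})$.
If $q_\gamma = \sum_{i=1}^{m} q^i_\gamma$, then $X_\gamma$ is an $(nm \times q_\gamma)$ matrix, since each of the $m$ stacked
regressions contributes $n$ rows, and $\beta_\gamma$ is a vector of $q_\gamma$ elements. To
complete this Bayesian hierarchical model, we introduce the following priors on
the parameters.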
(i) Following O'Hagan (1995) we construct a conditional prior for $\beta_\gamma$ by setting
$p(\beta_\gamma \mid \gamma, \Sigma) \propto p(y \mid \beta_\gamma, \gamma, \Sigma)^{1/n}$
so that $\beta_\gamma \mid \gamma, \Sigma \sim N(\hat\beta_\gamma, n(X_\gamma' A X_\gamma)^{-1})$, where $A = \Sigma^{-1} \otimes I_n$ and
$\hat\beta_\gamma = (X_\gamma' A X_\gamma)^{-1} X_\gamma' A y$. This data-based fractional prior contains much less
information about $\beta_\gamma$ than the likelihood. The results obtained using this
prior are similar to those obtained using the prior $\beta_\gamma \mid \gamma, \Sigma \sim N(0, nI)$, which
does not depend on the data. However, we prefer the data-based prior
because the conditional posterior mean of $\beta_\gamma$ is unbiased.
(ii) The prior for $\Sigma$ is taken as independent of $\gamma$ and is the commonly used
non-informative prior discussed in Zellner (1971), where
$p(\Sigma) \propto |\Sigma|^{-(m+1)/2}$.
(iii) The indicator variables $\gamma^i_k$, $k = 1, \ldots, p^i$, $i = 1, \ldots, m$, are taken a priori
independently distributed, with $p(\gamma^i_k = 1 \mid \pi^i) = \pi^i$.
(iv) The hyperparameters $\pi^i$, $i = 1, \ldots, m$, are taken as independent and given
a non-informative uniform prior on (0, 1).
We integrate the hyperparameters $\pi = (\pi^1, \ldots, \pi^m)$ out of our analysis, so that
$p(\gamma) = \int p(\gamma \mid \pi) p(\pi) \, d\pi = \prod_{i=1}^{m} \int_0^1 (\pi^i)^{q^i_\gamma} (1 - \pi^i)^{(p^i - q^i_\gamma)} \, d\pi^i = \prod_{i=1}^{m} B(q^i_\gamma + 1, p^i - q^i_\gamma + 1)$,
where $B$ is the beta function. This is a different prior on $\gamma$ than that
suggested in Smith and Kohn (1996). Note that the model here is a hierarchical
SUR model as, conditional on $\gamma$, it is simply a linear SUR model.
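The integrated prior on the indicators is a product of beta functions, one per equation, so it can be evaluated on the log scale with nothing more than the log-gamma function. The following is a minimal sketch, not the authors' implementation; the function names `log_beta` and `log_prior_gamma` and the list-of-lists layout for the indicators are assumptions made for illustration.

```python
import math


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_prior_gamma(gammas):
    """log p(gamma) with the inclusion probabilities integrated out.

    gammas is a list of m lists of 0/1 indicators, one list per equation
    (a hypothetical layout chosen for this sketch).
    """
    total = 0.0
    for g in gammas:
        p_i = len(g)  # number of basis terms in equation i
        q_i = sum(g)  # number of terms currently included
        total += log_beta(q_i + 1, p_i - q_i + 1)
    return total
```

Because the prior depends on each equation only through the count of included terms, all models of the same size in an equation receive the same prior mass.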
2.3. Markov chain Monte Carlo sampling
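To estimate this model we use the following Markov chain Monte Carlo
sampling scheme with the hyperparameter $\pi$ integrated out.
(1) Generate from $\beta_\gamma \mid \Sigma, \gamma, y$
(2) Generate from $\Sigma \mid \beta, \gamma, y = \Sigma \mid \beta_\gamma, \gamma, y$
(3) For $i = 1, 2, \ldots, m$
    For $j = 1, 2, \ldots, p^i$
        Generate from $\gamma^i_j \mid \Sigma, \gamma_{\setminus \gamma^i_j}, y$ using the sampling step described below.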
In this sampling scheme $\beta_\gamma$ is generated from a multivariate normal. Generation
of the matrix $\Sigma$ directly from the posterior at step (2) is difficult because the
fractional prior $\beta_\gamma \mid \gamma, \Sigma$ is centered at $\hat\beta_\gamma$, which is a function of $\Sigma$.
Consequently, we use a Metropolis-Hastings step where the proposal Wishart density
is the posterior under a flat conditional prior for $\beta_\gamma$. This works well, with
between 60% and 90% of the generated iterates being accepted.
Details of how to generate from the distributions at steps (1) and (2) are given in
Appendix A. It is important to note that care has been taken to generate
$\gamma^i_j$ without conditioning on $\beta^i_j$ at step (3), otherwise the sampling scheme would
be reducible because $\gamma^i_j$ is known exactly given $\beta^i_j$.
Step (3) generates an iterate of $\gamma$ one element at a time. As discussed in Wong
et al. (1997), using a Gibbs sampler is computationally demanding because $\gamma$ has
$\sum_{i=1}^{m} p^i$ elements, with generation from each conditional posterior density
a computationally intensive exercise. To speed up the generation we use the
following 'sampling' step, which is an application of the Metropolis-Hastings
algorithm. Let $\rho(\gamma^i_j) = p(\gamma^i_j \mid \Sigma, \gamma_{\setminus \gamma^i_j}, y)$ be the conditional posterior density of
$\gamma^i_j$ and $s(\gamma^i_j) = p(\gamma^i_j \mid \gamma_{\setminus \gamma^i_j})$ be its conditional prior density, with $\pi$ integrated out in
both cases, where $\gamma_{\setminus \gamma^i_j}$ denotes $\gamma$ with the element $\gamma^i_j$ removed. Then, if $\gamma^{old}$ is the
previous value of $\gamma^i_j$, a new value $\gamma^{new}$ can be generated as follows.
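Sampling Step
(a) Proposal
If $\gamma^{old} = 1$ then generate $\gamma^{new}$ from the proposal density
$Q(\gamma^{old} = 1 \to \gamma^{new} = 0) = s(\gamma^i_j = 0) \min\{1, \rho(\gamma^i_j = 0) / s(\gamma^i_j = 0)\}$.
If $\gamma^{old} = 0$ then generate $\gamma^{new}$ from the proposal density
$Q(\gamma^{old} = 0 \to \gamma^{new} = 1) = s(\gamma^i_j = 1) \min\{1, \rho(\gamma^i_j = 1) / s(\gamma^i_j = 1)\}$.
(b) Metropolis-Hastings acceptance probabilities
If $\gamma^{old} = 0$ and $\gamma^{new} = 1$, then accept $\gamma^{new}$ with probability
$\alpha_{01} = \min\{1, s(\gamma^i_j = 0) / \rho(\gamma^i_j = 0)\}$; otherwise set $\gamma^{new} = 0$.
If $\gamma^{old} = 1$ and $\gamma^{new} = 0$, then accept $\gamma^{new}$ with probability
$\alpha_{10} = \min\{1, s(\gamma^i_j = 1) / \rho(\gamma^i_j = 1)\}$; otherwise set $\gamma^{new} = 1$.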
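Appendix A calculates the conditional posterior and conditional prior densities, with the latter being
a trivial calculation. Generating from the proposal in part (a) can be undertaken
efficiently as follows.
First generate $u$ from a uniform distribution on (0, 1), then
(i) if $\gamma^{old} = 1$ and $u < s(\gamma^i_j = 1)$, set $\gamma^{new} = 1$.
(ii) if $\gamma^{old} = 1$ and $u > s(\gamma^i_j = 1)$, generate $\gamma^{new}$ from the density
$p(\gamma^{new} = 0) = \min(1, \rho(\gamma^i_j = 0) / s(\gamma^i_j = 0))$.
(iii) if $\gamma^{old} = 0$ and $u < s(\gamma^i_j = 0)$, then set $\gamma^{new} = 0$.
(iv) if $\gamma^{old} = 0$ and $u > s(\gamma^i_j = 0)$, generate $\gamma^{new}$ from the density
$p(\gamma^{new} = 1) = \min(1, \rho(\gamma^i_j = 1) / s(\gamma^i_j = 1))$.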
Generation is efficient because in the nonparametric regression problem most
of the indicators will be zero and $s(\gamma^i_j = 0)$ will be close to one. Hence, most of the
time $\gamma^{old} = 0$ and $u < s(\gamma^i_j = 0)$, so that case (iii) is undertaken most frequently
and the conditional posterior is calculated infrequently.
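The sampling step can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than the authors' code: the names `sampling_step`, `flip_probability` and `post1_fn` are hypothetical, the conditional posterior is called `rho` in the code, and the prior probability is assumed to lie strictly between 0 and 1. The posterior is passed as a callable so that, as described above, it is only evaluated when the first uniform draw does not settle the outcome.

```python
import random


def sampling_step(old, s1, post1_fn, rng=random):
    """One draw of a binary indicator via the sampling step.

    old      : current value of the indicator (0 or 1)
    s1       : conditional prior probability s(1), with 0 < s1 < 1
    post1_fn : zero-argument callable returning the conditional posterior
               probability rho(1); called only when it is needed
    """
    s = {0: 1.0 - s1, 1: s1}
    if rng.random() < s[old]:
        return old  # cases (i) and (iii): no posterior evaluation required
    new = 1 - old
    rho1 = post1_fn()
    rho = {0: 1.0 - rho1, 1: rho1}
    # cases (ii) and (iv): propose the flip with probability min(1, rho(new)/s(new))
    if rng.random() >= min(1.0, rho[new] / s[new]):
        return old
    # part (b): accept the flip with probability min(1, s(old)/rho(old))
    accept = 1.0 if rho[old] == 0.0 else min(1.0, s[old] / rho[old])
    return new if rng.random() < accept else old


def flip_probability(old, s1, rho1):
    """Exact probability that sampling_step changes the indicator."""
    s = {0: 1.0 - s1, 1: s1}
    rho = {0: 1.0 - rho1, 1: rho1}
    new = 1 - old
    propose = s[new] * min(1.0, rho[new] / s[new])
    accept = 1.0 if rho[old] == 0.0 else min(1.0, s[old] / rho[old])
    return propose * accept
```

A quick way to convince oneself that the step targets the conditional posterior is to check detailed balance numerically: `rho(0) * flip_probability(0, s1, rho1)` should equal `rho(1) * flip_probability(1, s1, rho1)` for any admissible pair of values.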
Appendix B proves that the sampling method described above is a correct
application of the Metropolis-Hastings method and therefore that the conditional posterior
$p(\gamma^i_j \mid \Sigma, \gamma_{\setminus \gamma^i_j}, y)$ is the invariant distribution of each step. The appendix
also includes a lemma demonstrating that the Metropolis-Hastings ratios at
part (b) of the sampling step will usually be one, or close to one. Because steps
(1)-(3) of the sampling scheme either generate directly from the conditional
posterior distributions, or use a Metropolis-Hastings step, the scheme
converges to its invariant distribution, which is the posterior distribution
$\beta, \Sigma, \gamma \mid y$ (Tierney, 1994).
The sampling scheme is much faster because step (3) involves a much smaller
number of complex calculations than the full Gibbs sampler. Moreover, focusing
the generations is especially important in nonparametric SUR models
compared to a single equation regression, because there are more basis terms.
We have found this sampler to have strong empirical convergence properties,
a point that is demonstrated in the examples in Sections 3 and 4. These sections
also compare the sampler to the Gibbs alternative and demonstrate that it is
more computationally efficient.
A sampler that generates solely from the parameter space of $\gamma$ is not
considered as it is difficult to calculate the posterior distribution $\gamma^i_j \mid \gamma_{\setminus \gamma^i_j}, y$. Similarly,
samplers that generate from either the parameter space of $(\gamma, \Sigma)$ or $(\gamma, \beta)$ are
not considered because it is difficult to recognize the conditional posterior
distribution $\Sigma \mid \gamma, y$, or calculate $\gamma^i_j \mid \gamma_{\setminus \gamma^i_j}, \beta, y$.
2.4. Parameter estimation
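Given an initial state for the Markov chain and a 'warmup period', after
which the sampler is assumed to have converged to the joint posterior distribution,
we can collect iterates $(\beta^{[1]}, \Sigma^{[1]}, \gamma^{[1]}), \ldots, (\beta^{[J]}, \Sigma^{[J]}, \gamma^{[J]})$ which form
a Monte Carlo sample from the joint posterior distribution. It is this sample that
is used for inference.
The posterior mean $E[\Sigma \mid y]$ is estimated by the histogram estimate
$\hat\Sigma = (1/J) \sum_{j=1}^{J} \Sigma^{[j]}$. We do not use a mixture estimate because the distribution
of $\Sigma$ conditional on the other parameters is difficult to recognize (see Appendix A.2).
The posterior mean $E[\beta \mid y]$ is estimated by the mixture estimate
$\hat\beta = \frac{1}{J} \sum_{j=1}^{J} E[\beta \mid \Sigma^{[j]}, \gamma^{[j]}, y]$. (2.5)
Each of the conditional expectations in the sum is simple to calculate because
$E[\beta_\gamma \mid \Sigma, \gamma, y] = \hat\beta_\gamma$, while the elements of $\beta$ excluded by $\gamma$ are set to
zero.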
The posterior mean $E[f^i(z) \mid y]$ of each function at Eq. (1.1), at any point $z$ in
the domain of $x^i$, is estimated using the mixture estimate
$\hat f^i(z) = \frac{1}{J} \sum_{j=1}^{J} E[f^i(z) \mid \Sigma^{[j]}, \gamma^{[j]}, y] = \mathbf{b}' \frac{1}{J} \sum_{j=1}^{J} E[\beta^i \mid \Sigma^{[j]}, \gamma^{[j]}, y] = \mathbf{b}' \hat\beta^i$.
Here, $\mathbf{b} = (b_1(z), \ldots, b_{p^i}(z))'$ is a vector containing the basis function expansion of
$f^i$ evaluated at the point $z$. The vector $\hat\beta^i$ is made up of the elements of $\hat\beta$ that
correspond to $\beta^i$. If the function is univariate, so that $x^i$ is a scalar, then $\hat f^i$ is an
estimate of a curve, while for higher dimensions it is a surface. For additive
nonparametric models the component function estimates can easily be
calculated separately by identifying the basis terms and regression coefficient
estimates that correspond to each function and forming the inner product of
these.
2.5. Standardized residuals
Common diagnostics in parametric models are estimates of the standardized
errors $r = (R \otimes I_n) e$, with $R'R = \Sigma^{-1}$, which can be shown to be distributed
$N(0, I_{nm})$. We estimate the posterior mean $E[r \mid y]$ of these using the histogram
estimate
$\hat r = \frac{1}{J} \sum_{j=1}^{J} (R^{[j]} \otimes I_n) e^{[j]}$. (2.6)
In the above, $R^{[j]}$ is a Cholesky factor, such that $R^{[j]\prime} R^{[j]} = (\Sigma^{[j]})^{-1}$, while
$e^{[j]} = y - X \beta^{[j]}$. After calculation, these standardized residuals can be separated
into $m$ sets, one for each regression, so that $\hat r' = (\hat r^{1\prime}, \ldots, \hat r^{m\prime})$. Such residuals
should be approximately $N(0, I_{nm})$.
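The histogram estimate of the standardized residuals can be sketched with NumPy as follows. This is an illustration only: the function name `standardized_residuals` and the array layout (one row of residuals per equation) are assumptions, and the factor R is taken as the transpose of the lower Cholesky factor of the inverse covariance so that R'R equals the inverse covariance. Storing the residuals as an m-by-n array avoids forming the Kronecker product explicitly.

```python
import numpy as np


def standardized_residuals(sigma_draws, resid_draws):
    """Average (R kron I_n) e over the MCMC iterates.

    sigma_draws : array of shape (J, m, m), covariance iterates
    resid_draws : array of shape (J, m, n); row i of resid_draws[j] holds
                  the residuals of equation i at iterate j
    Returns an (m, n) array with one row of standardized residuals per equation.
    """
    sigma_draws = np.asarray(sigma_draws, dtype=float)
    resid_draws = np.asarray(resid_draws, dtype=float)
    total = np.zeros_like(resid_draws[0])
    for sigma, e in zip(sigma_draws, resid_draws):
        # L L' = Sigma^{-1}, so R = L' satisfies R'R = Sigma^{-1}
        R = np.linalg.cholesky(np.linalg.inv(sigma)).T
        # R @ e equals (R kron I_n) applied to the equation-stacked residual vector
        total += R @ e
    return total / len(sigma_draws)
```

Row i of the result is the set of standardized residuals for regression i, which can then be plotted against the regressor as a diagnostic.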
3. Real data examples
In this section we apply the approach to two systems where the errors are
likely to be correlated across equations and the functional forms of the
relationships $f^i$ are unknown a priori and require estimation. In such cases, we show
that the function estimates differ substantially from those obtained using the
equation-by-equation equivalent estimator. We also show that the sampling
scheme results in a major decrease in the computational burden of generating $\gamma$
over the full one-at-a-time Gibbs sampler.
3.1. Print advertising data
We demonstrate our procedure using $n = 457$ observations of data from six
issues of an Australian monthly women's magazine collected by Starch INRA
Hooper. Each observation corresponds to an advertisement placed in the
magazine and the following three advertisement exposure scores, which are
recorded from an experimental audience, are used as measures of the various
levels of effectiveness of the print advertisement.
$y^1$ (noted score): Proportion of respondents who claim to recognize the ad as
having been seen by them in that issue.
$y^2$ (associated score): Proportion of the respondents who claim to have
noticed the advertiser's brand or company name or logo.
$y^3$ (read-most score): Proportion of respondents who claim to have read
half or more of the copy.
These scores are considered to measure advertisement exposure at increasing
levels of depth. It has long been thought that the positioning of an advertisement
within an issue affects its exposure to an audience (Hanssens and Weitz, 1980).
To quantify this we constructed the variable $P$ = (page number)/(number of
pages in issue) to represent the position in the issue in which each advertisement
appeared.
To estimate how the exposure of a print advertisement is affected by its
position in the magazine, we considered the three nonparametric regressions
$y^i = f^i(P) + e^i$ for $i = 1, 2, 3$. (3.1)
To model each unknown function we use a thin plate spline basis (Powell, 1987),
with basis terms $b_j(x) = |x - k_j|^2 \log(|x - k_j|)$, $j = 1, \ldots, n$, where $k_1, \ldots, k_n$ are
the so-called 'knots' which we set equal to the $n$ observations of the independent
variable. Following previous authors (Wahba, 1990) we also augment these
basis terms with an intercept and linear term, so that $p^1 = p^2 = p^3 = 459$.
Expected features in the functions $f^i$ include high casual attention to advertisements
placed in the front (and to a lesser extent back) of the magazine, while the
pre-editorial slots (where $P$ is about 0.7) are thought to attract more in-depth
attention. The three scores $y^1$, $y^2$ and $y^3$ are highly positively correlated and it is
likely that the errors are correlated, so a SUR model appears appropriate.
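The design matrix implied by this basis has one column per knot plus the intercept and linear term, which is how 457 observations lead to 459 basis terms. A minimal sketch follows; the function name `thin_plate_design` and the row-list return format are assumptions made for illustration, and the basis value at a knot is set to its limiting value of zero.

```python
import math


def thin_plate_design(x, knots=None):
    """Rows [1, x_t, b_1(x_t), ..., b_n(x_t)] with b_j(x) = |x - k_j|^2 log|x - k_j|."""
    if knots is None:
        knots = x  # knots placed at the observations of the regressor
    rows = []
    for xt in x:
        row = [1.0, xt]  # intercept and linear term
        for k in knots:
            r = abs(xt - k)
            # r^2 log r tends to 0 as r -> 0, so use 0 at a knot
            row.append(r * r * math.log(r) if r > 0 else 0.0)
        rows.append(row)
    return rows
```

With more columns than observations, ordinary least squares on this matrix would interpolate the data, which is why the indicator variables are needed to select a subset of the terms.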
In a parametric SUR model with the same independent variables the generalized
and ordinary least-squares estimates of the regression coefficients are the
same. However, this result does not extend to the case where variable selection is
undertaken on the terms in the regressions. In particular, this applies to the
nonparametric function estimators used in this paper as they are based on
variable selection and model averaging applied to linear basis decompositions.
Fig. 1. (a)-(c) Function estimates of $f^1$, $f^2$ and $f^3$, respectively. Bold lines are the nonparametric
SUR estimates, while the dashed lines are the single equation estimates. Panels (d)-(f) contain scatter
plots of $P$ versus the standardized uncorrelated residuals resulting from the nonparametric SUR fit.
The equations at (3.1) were estimated both as a SUR system and individually
with the analogous single equation estimator that ignores any correlation
between the equations. The resulting function estimates are plotted in
Figs. 1(a)-(c) and it can be seen that the SUR and equation-by-equation
estimates differ substantially. The estimate $\hat\Sigma$ from the SUR model is given
below (correlations in italics)
$\hat\Sigma =$
$y^1 = f^1_1(\mathrm{TOD}) + f^1_2(\mathrm{TOW}) + f^1_3(\cdot) + e^1$,
$y^2 = f^2_1(\mathrm{TOD}) + f^2_2(\mathrm{TOW}) + e^2$, (3.2)
where the periodic profile of electricity load has been decomposed into periodic
daily ($f^i_1$) and weekly ($f^i_2$) effects for both states. An additive temperature effect is
considered for NSW, but we do not include one for VIC because half-hourly
temperature data were not available to us for this state.
We model both $f^i_1$ and $f^i_2$ with a periodic quadratic reproducing basis (Luo
and Wahba, 1997) where $b_i(x) = ((|x - k_i| - \frac{1}{2})^2 - 1/12)/2$. We place the knots
$k_i$ at all the possible observations, so that for the time of day effect
$k_i = i/48$, $i = 1, \ldots, 48$, and for the weekly effect $k_i = i/336$, $i = 1, \ldots, 336$. This
basis was chosen because it ensures that the functions $f^1_1$, $f^1_2$, $f^2_1$ and $f^2_2$ are all
periodic on [0, 1). The temperature effect was modeled using a thin plate spline
with knots at all observations.
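The periodicity of this basis is easy to verify numerically. The sketch below is illustrative, with hypothetical function names, and assumes the regressor has been rescaled to the unit interval (time of day or time of week divided by the period).

```python
def periodic_quadratic_basis(x, knot):
    """b(x) = ((|x - k| - 1/2)^2 - 1/12) / 2 for x and k in the unit interval."""
    t = abs(x - knot)
    return ((t - 0.5) ** 2 - 1.0 / 12.0) / 2.0


def half_hourly_knots(periods):
    """Knots k_i = i / periods for i = 1, ..., periods (48 for daily, 336 for weekly)."""
    return [i / periods for i in range(1, periods + 1)]
```

Each basis term takes the same value at x = 0 and x = 1, so any linear combination of the terms joins up smoothly from the end of one day (or week) to the start of the next.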
The estimated covariance is (correlation in italics)
$\hat\Sigma = \begin{pmatrix} 29088.13 & 26093.55 \\ \mathit{0.5208} & 86299.88 \end{pmatrix}$,
where $\mathrm{var}(e^1) < \mathrm{var}(e^2)$ because we control for the variation in temperature in
the NSW equation. The positive correlation between the states is likely to be
due, in part, to common weather effects and television programming. Figs. 2(a)
and (b) plot the sum of the estimated daily and weekly effects for both states
against TOW, while the estimated additive Sydney temperature effect is given in
Fig. 3. Because the regressions are additive, we present the functions normalized
so that $f^i_j(0) = 0$. The data were collected during March 1998 (late Summer) and
higher temperatures result in increased load due to increased usage of air
conditioning. Estimates of the periodic profile are more detailed for NSW as the
model controls for temperature, while the periodic profiles are smoother for
VIC. The functions are highly nonlinear and, as Harvey and Koopman (1993)
highlight, are extremely difficult to model using parametric nonlinear models.
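The italic entry of the estimated covariance matrix can be recovered from the other three entries, which is a useful check on how the matrix is laid out. The short calculation below is only a sketch; the variable names are assumptions, and the first equation is labeled NSW because the text attributes the smaller error variance to the NSW equation.

```python
import math

var_nsw = 29088.13  # estimated var(e^1), first diagonal entry
var_vic = 86299.88  # estimated var(e^2), second diagonal entry
cov = 26093.55      # estimated covariance, upper off-diagonal entry

# correlation = covariance / (product of standard deviations)
corr = cov / math.sqrt(var_nsw * var_vic)

# error standard deviations, in the same units as the load data
sd_nsw = math.sqrt(var_nsw)
sd_vic = math.sqrt(var_vic)
```

Here `corr` agrees with the reported 0.5208 to four decimal places, and the implied error standard deviations are roughly 171 and 294 for the first and second equations.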
The two equations at (3.2) were also estimated using the analogous single
equation estimator. Fig. 4 plots the difference between the SUR and single
equation estimates of the sum of the daily and weekly periodic effects. The range
of the difference is about 250 and 400 megawatts for VIC and NSW, respectively,
which is substantial compared to the estimates of $\sqrt{\mathrm{var}(e^1)}$ and $\sqrt{\mathrm{var}(e^2)}$.
Because the load profiles evolve slowly during the year and are effectively only
static over periods of around two to three weeks, larger sample sizes cannot be
used and it is useful to exploit the correlation between equations.
Fig. 2. Sum of the estimated daily and weekly periodic effects for VIC (panel (a)) and NSW (panel
(b)). The dashed curves are those resulting from estimation using the proposed sampler and the bold
curves are those resulting from estimation using a full Gibbs sampler. Both estimates are plotted
together for comparison, although they are almost identical and are hard to visually distinguish.
To investigate the performance of the sampling scheme in Section 2.3 we also
estimated the nonparametric SUR model for the electricity load data using a full
Gibbs sampler which generated directly from each of the $\sum_{i=1}^{m} p^i$ posterior
distributions of the binary indicators $\gamma^i_j$ at step (3) of the sampling scheme. The
full Gibbs sampler was run using the same number of iterations for the warmup
and sampling periods as the proposed sampler. Figs. 2 and 3 compare the
resulting curve estimates for the Gibbs sampler (bold lines) and proposed
sampler (dashed lines). The curve estimates are virtually identical and sometimes
the two curve estimates are so similar that the dashed lines cannot be made out.
It is difficult to see how this could occur if the proposed sampler were not
converging to the same distribution as the Gibbs sampler and the sampling
scheme were not mixing well with respect to the indicator variables $\gamma$. The results
are identical regardless of initial starting state for either sampler.
Fig. 3. Same as in Fig. 2, but for the estimated Sydney additive temperature effect $\hat f^1_3$.
4. Simulation experiments
4.1. Positively correlated univariate regressions
This simulation is concerned with the case where the errors are highly
correlated between regressions. It highlights the improvement in the quality of
the estimates that can be obtained when such correlation is modeled rather than
ignored. There are $m = 4$ univariate regressions, with the covariance matrix
specified below at (4.1). Note that the standard deviation of the errors
$(\mathrm{var}(e^i))^{1/2} = 1$ is high compared to the range of the functions.
Four true functions were carefully chosen to represent a wide variety of
possible relationships. These are $f^1(x) = \sin(8x)$ (which is highly oscillatory),
$f^2(x) = (\phi(x, 0.2, 0.25) + \phi(x, 0.6, 0.2))/4$, with $\phi(x, a, b)$ being a normal density
of mean $a$ and standard deviation $b$ (which requires a locally adaptive estimator
as there are different degrees of smoothness on the left and right of the function),
$f^3(x) = 1.5x$ (which was chosen as many relationships are often thought to
be linear) and $f^4(x) = \cos(2x)$ (which is a smooth nonlinear function). The
independent variables for the four univariate regressions were $x^1 \sim U(0, 1)$,
$x^2 \sim U(0, 1)$ and
$\begin{pmatrix} x^3 \\ x^4 \end{pmatrix} \sim N\left( \begin{pmatrix} 0.5 \\ 0.5 \end{pmatrix}, 0.3 \begin{pmatrix} 1 & 0.6 \\ 0.6 & 1 \end{pmatrix} \right)$.
Fig. 4. The difference between the SUR and single equation estimates of the sum of the daily and
weekly periodic effects. Panel (a) is for the Victorian equation and panel (b) is for the New South
Wales equation.
We generated $n = 100$ data points from this true SUR model and applied the
nonparametric SUR estimator to this data. We use a quartic reproducing kernel
basis (Luo and Wahba, 1997) with basis terms
$b_i(x) = \frac{1}{24} \left[ \left( |x - x_i| - \frac{1}{2} \right)^4 - \frac{1}{2} \left( |x - x_i| - \frac{1}{2} \right)^2 + \frac{7}{240} \right]$ for $i = 1, \ldots, n$,
augmented with an intercept and linear term.
To assess the quality of the resulting estimates of the four functions, we
calculated the mean squared difference between the function estimates and the
true functions. This measure of distance between the two is defined as
$\mathrm{MSD}_i = \frac{1}{200} \sum_{k=1}^{200} (\hat f^i(z_k) - f^i(z_k))^2$,
where $\min(x^i) = z_1 < z_2 < \cdots < z_{200} = \max(x^i)$ is an evenly spaced grid over
the domain of $x^i$. For the same data we also fit an analogous single equation
univariate nonparametric estimator to the four regressions. The mean squared
difference was also calculated for each of these four function estimates.
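The mean squared difference is straightforward to compute once a function estimate is available as a callable. The following is a minimal sketch; the function name `mean_squared_difference` and the callable interface for the estimate and the true function are assumptions made for illustration.

```python
def mean_squared_difference(f_hat, f_true, x, grid_size=200):
    """MSD between an estimate and the true function on an evenly spaced grid.

    f_hat, f_true : callables taking a scalar and returning a scalar
    x             : observed values of the regressor, defining the grid range
    """
    lo, hi = min(x), max(x)
    step = (hi - lo) / (grid_size - 1)
    # grid runs from min(x) to max(x) inclusive
    grid = [lo + k * step for k in range(grid_size)]
    return sum((f_hat(z) - f_true(z)) ** 2 for z in grid) / grid_size
```

Evaluating on a fixed grid rather than at the observations means the measure is not dominated by regions where the design points happen to cluster.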
The entire process was repeated one hundred times. Figs. 5(a)-(d) give
boxplots of the one hundred resulting values of $\log(\mathrm{MSD}_i)$ for each of the four
functions ($i = 1, 2, 3, 4$) and for both the nonparametric SUR estimator
(NSUR) and individual nonparametric estimators (NR). Fig. 5 shows that
taking into account the correlation between the errors has substantially and
consistently improved the resulting estimates of all the regression functions.
To examine the qualitative improvement that occurs, we focus on the single
data set corresponding to the 50th sorted value of $\sum_{i=1}^{4} \mathrm{MSD}_i$ for the
nonparametric SUR estimator. This data set can be regarded as providing a 'typical'
example of the procedure and is plotted as four scatter plots in Figs. 5(e)-(h) and
again in Figs. 5(i)-(l). The nonparametric SUR estimates of the four functions for
this data appear in Figs. 5(e)-(h) and the estimates for the separate nonparametric
regressions appear in Figs. 5(i)-(l). These figures show that the nonparametric
SUR estimator significantly outperforms the separate nonparametric estimators
which ignore the correlation between the separate regressions. The variance of
the errors $\Sigma$ and its estimate $\hat\Sigma$ for this data set are given below.
$\Sigma = \ldots$ (4.1)
Fig. 5. (a)-(d) Boxplots of the $\log(\mathrm{MSD}_i)$ for $i = 1, 2, 3$ and 4, respectively. The left-hand boxplot is
for the NSUR estimator, while the right-hand boxplot is for the NR estimation procedure. Panels
(e)-(h) contain scatter plots of $x^i$ against $y^i$, along with the function estimates $\hat f^i$ (bold line) and true
functions $f^i$ (dotted line) for $i = 1, 2, 3, 4$ that result from applying the NSUR estimator. Panels
(i)-(l) plot the function estimates $\hat f^i$ (bold line) and true functions $f^i$ (dotted line) for $i = 1, 2, 3, 4$ that
result from applying the NR procedure to the same data.
Table 2
Time taken by the NSUR procedure to complete a fit to data generated from the model in Section
4.1 for four different sample sizes. In the sampling procedure 2000 iterates were used for the warmup
and the Monte Carlo sample consisted of a subsequent 1000 iterates. The machine used was a
low-end workstation.

Sample size   n = 100   n = 200   n = 400   n = 800
Time (s)      43        58        213       3850
The estimate compares favorably to the 'best possible' estimate,
$\widehat{\mathrm{var}}((e^1, e^2, e^3, e^4)')$, that arises from the sample variance of the true errors
themselves, which are known because this is a simulated example.
To demonstrate that the NSUR estimator is practical to implement, we report
the time required to fit this four equation system for a variety of sample sizes in
Table 2. The computer used was a low-end modern workstation while the code
was written efficiently in FORTRAN. Although these timings are implementation
dependent, they do indicate that this computationally intensive Markov
chain Monte Carlo procedure runs in a reasonable time.
4.2. Unrelated regressions
The previous simulation considers a related set of regressions and demonstrates
the improvements in the regression function estimates that can occur
when correlation between the regressions is modeled and estimated, rather than
ignored. However, is there a risk of degrading the function estimates by
modeling correlation that does not exist?
To investigate this case, we repeat the simulation undertaken above, except
that the regressions are unrelated, with $\Sigma = 0.5 I_4$. Fig. 6 provides the equivalent
output for this case as was produced in Fig. 5. It can be seen from the boxplots in
Figs. 6(a)-(d) that, in general, there is a slight deterioration in the $\log(\mathrm{MSD}_i)$ for
the NSUR estimator compared to the NR estimation procedure. This is expected
as the regressions are actually not related and the NSUR procedure also
estimates $\Sigma$. However, the loss in the efficiency of the function estimates is very
small and the function estimates from the NSUR estimator (Figs. 6(e)-(h)) are
almost identical to those from the NR procedure (Figs. 6(i)-(l)) for the median
dataset.
Fig. 6. As in Fig. 5, but for the uncorrelated model in Section 4.2.
Nevertheless, in examples when $m$ is large, reliably estimating the $m(m+1)/2$
free parameters of $\Sigma$ in addition to the unknown functions may prove difficult.
In such cases, a more parsimonious representation for $\Sigma$ can be used and
estimated in combination with the unknown functions using MCMC. For
example, Shephard and Pitt (1998) discuss the MCMC estimation of factor
decompositions for $\Sigma$, while Smith and Kohn (1999) examine parsimonious
representations for $\Sigma$ based on the Cholesky decomposition of $\Sigma^{-1}$.
Acknowledgements
The authors would like to thank Denzil Fiebig, Paul Kofman, Steve Marron,
Gael Martin and Tom Smith for useful comments. They would also like to
thank two anonymous referees for improving the clarity of the presentation and
for pointing out an error in a previous implementation of the sampling scheme.
Both Michael Smith and Robert Kohn are grateful for the support of Australian
Research Council grants.
Appendix A
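A.1. Generating from $\beta_\gamma \mid \Sigma, \gamma, y$
This conditional distribution can be calculated exactly, as
$p(\beta_\gamma \mid \Sigma, \gamma, y) \propto p(y \mid \beta_\gamma, \gamma, \Sigma) \, p(\beta_\gamma \mid \gamma, \Sigma)$
$\propto \exp\left\{ -\frac{1}{2} \frac{n+1}{n} (\beta_\gamma - \hat\beta_\gamma)' X_\gamma' A X_\gamma (\beta_\gamma - \hat\beta_\gamma) \right\}$,
so that
$\beta_\gamma \mid \Sigma, \gamma, y \sim N\left( \hat\beta_\gamma, \frac{n}{n+1} (X_\gamma' A X_\gamma)^{-1} \right)$.
Here, $\hat\beta_\gamma$ and $A$ are defined in Section 2.2.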
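Step (1) of the sampler therefore reduces to a generalized least-squares fit followed by a multivariate normal draw. The NumPy sketch below is illustrative only: the function names and argument layout are assumptions, the response is assumed to be stacked equation by equation, and the matrix A is formed densely for clarity even though an efficient implementation would exploit its Kronecker structure.

```python
import numpy as np


def beta_posterior_moments(X, y, sigma):
    """Mean and covariance of the included coefficients given Sigma, gamma and y.

    X     : (n*m, q) stacked design matrix of the included basis terms
    y     : (n*m,) stacked response, ordered equation by equation
    sigma : (m, m) error covariance matrix
    """
    m = sigma.shape[0]
    n = y.shape[0] // m
    A = np.kron(np.linalg.inv(sigma), np.eye(n))   # A = Sigma^{-1} kron I_n
    XtAX = X.T @ A @ X
    beta_hat = np.linalg.solve(XtAX, X.T @ A @ y)  # generalized least squares
    cov = (n / (n + 1)) * np.linalg.inv(XtAX)      # shrunk by n / (n + 1)
    return beta_hat, cov


def draw_beta(X, y, sigma, rng):
    """One draw from the conditional posterior, using a NumPy Generator."""
    mean, cov = beta_posterior_moments(X, y, sigma)
    return rng.multivariate_normal(mean, cov)
```

A call such as `draw_beta(X, y, sigma, np.random.default_rng(0))` returns one iterate of the included coefficients; the excluded coefficients are simply zero.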
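A.2. Generating from $\Sigma \mid \beta_\gamma, \gamma, y$
This conditional distribution is difficult to recognize as $\Sigma$ is embedded
in the conditional prior for $\beta_\gamma$. Therefore, to obtain an iterate we use a
Metropolis-Hastings step; see Chib and Greenberg (1995b) for an introduction
to this method. The proposal density from which we generate a candidate iterate
is given by
$q(\Sigma) \propto p(y \mid \beta_\gamma, \gamma, \Sigma) \, p(\Sigma) \propto |\Sigma|^{-(n+m+1)/2} \exp\left\{ -\frac{1}{2} \mathrm{tr}(\Sigma^{-1} E'E) \right\}$,
where $E = [e^1, \ldots, e^m]$ is the $(n \times m)$ matrix of residuals $e^i = y^i - X^i_{\gamma^i} \beta^i_{\gamma^i}$.
A candidate $\Sigma^{new}$ replaces the previous value $\Sigma^{old}$ with probability
$\min\left\{ \frac{p(\Sigma^{new} \mid \beta_\gamma, \gamma, y) \, q(\Sigma^{old})}{p(\Sigma^{old} \mid \beta_\gamma, \gamma, y) \, q(\Sigma^{new})}, 1 \right\} = \min\left\{ \frac{p(\beta_\gamma \mid \gamma, \Sigma^{new})}{p(\beta_\gamma \mid \gamma, \Sigma^{old})}, 1 \right\}$.
High acceptance rates of 60-90% are obtained because the proposal density $q(\cdot)$
is equal to the correct conditional density except for the factor $p(\beta_\gamma \mid \gamma, \Sigma)$.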
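A.3. Calculating $\gamma^i_j \mid \Sigma, \gamma_{\setminus \gamma^i_j}, y$
This density requires calculation to enable generation of $\gamma^i_j$ in the sampling
scheme.
$p(\gamma^i_j \mid \Sigma, \gamma_{\setminus \gamma^i_j}, y) \propto \int p(y \mid \Sigma, \gamma, \beta_\gamma) \, p(\beta_\gamma \mid \Sigma, \gamma) \, d\beta_\gamma \; p(\gamma)$
$\propto (n+1)^{-q_\gamma/2} \exp\left\{ -\frac{1}{2} S(\gamma, \Sigma) \right\} p(\gamma^i_j \mid \gamma_{\setminus \gamma^i_j})$ (A.1)
where $S(\gamma, \Sigma) = y'Ay - y'AX_\gamma (X_\gamma' A X_\gamma)^{-1} X_\gamma' A y$ and in Eq. (A.1) the regression
coefficient is integrated out using
$\beta_\gamma \mid \Sigma, \gamma, y \sim N\left( \hat\beta_\gamma, \frac{n}{n+1} (X_\gamma' A X_\gamma)^{-1} \right)$.
The conditional posterior can then be calculated by evaluating (A.1) for $\gamma^i_j = 1$
and $\gamma^i_j = 0$ and normalizing.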
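A.4. Calculating $p(\gamma^i_j \mid \gamma_{\setminus \gamma^i_j})$
The conditional prior can be calculated by integrating out the hyperprior $\pi^i$,
with $p(\gamma^i_j \mid \gamma_{\setminus \gamma^i_j}) \propto \int_0^1 (\pi^i)^{q^i_\gamma} (1 - \pi^i)^{(p^i - q^i_\gamma)} \, d\pi^i = B(q^i_\gamma + 1, p^i - q^i_\gamma + 1)$, so that
$p(\gamma^i_j = 1 \mid \gamma_{\setminus \gamma^i_j}) = 1/(1 + h)$, where $h = (p^i - a^i)/(a^i + 1)$ and $a^i = \sum_{k \neq j} \gamma^i_k$ is the
number of elements of $\gamma^i$, other than $\gamma^i_j$, that are one.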
Appendix B
This appendix provides a proof that the invariant distribution of the proposed
sampling step in Section 2.3 is $\rho(\gamma^i_j) = p(\gamma^i_j \mid \Sigma, \gamma_{\setminus \gamma^i_j}, y)$.
The proof is as follows. As $\gamma^{new}$ is a binary variable, the proof is undertaken by
calculating the Metropolis-Hastings ratio for the two transitions
$(\gamma^{old} = 0 \to \gamma^{new} = 1)$ and $(\gamma^{old} = 1 \to \gamma^{new} = 0)$ and showing that it is applied
correctly in step (b) of the sampling step. Let $Q(\gamma^{old} \to \gamma^{new})$ be the proposal
density, then
$Q(1 \to 0) = s(0) \min\{1, \rho(0)/s(0)\}$ and $Q(0 \to 1) = s(1) \min\{1, \rho(1)/s(1)\}$. (B.1)
The Metropolis-Hastings ratios are
$\alpha_{01} = \min\left\{ 1, \frac{\rho(1) Q(1 \to 0)}{\rho(0) Q(0 \to 1)} \right\}$, $\alpha_{10} = \min\left\{ 1, \frac{\rho(0) Q(0 \to 1)}{\rho(1) Q(1 \to 0)} \right\}$.
Because $\rho(1) = 1 - \rho(0)$ and $s(1) = 1 - s(0)$, $\rho(1)/s(1) > 1$ iff $\rho(0)/s(0) < 1$ and
$\rho(1)/s(1) < 1$ iff $\rho(0)/s(0) > 1$. Using these inequalities and substituting the
proposal densities at (B.1) into the ratios, it can be seen that $\alpha_{01} = \min(1, s(0)/\rho(0))$
and $\alpha_{10} = \min(1, s(1)/\rho(1))$. These correspond to the ratios applied in part (b) of
the sampling step. Hence the sampling step is reversible and its invariant
distribution is $\rho(\gamma^i_j)$.
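Lemma. Let $\ell(0) = p(y \mid \gamma^i_j = 0, \gamma_{\setminus \gamma^i_j}, \Sigma)$ and $\ell(1) = p(y \mid \gamma^i_j = 1, \gamma_{\setminus \gamma^i_j}, \Sigma)$ be the
likelihoods for $\gamma^i_j = 0$ and $\gamma^i_j = 1$. Then,
$\alpha_{01} = s(0) + s(1) \min(1, \ell(1)/\ell(0))$ and $\alpha_{10} = s(1) + s(0) \min(1, \ell(0)/\ell(1))$. (B.2)
Proof. Note that the conditional posterior probabilities, written $\rho(1)$ and $\rho(0)$, are
$\rho(1) = \frac{s(1)\ell(1)}{s(1)\ell(1) + s(0)\ell(0)}$ and $\rho(0) = \frac{s(0)\ell(0)}{s(1)\ell(1) + s(0)\ell(0)}$.
Substituting these into the expressions for $\alpha_{01}$ and $\alpha_{10}$ found in part (b) results in
the expressions at (B.2).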
The importance of this lemma is that it reveals that the Metropolis-Hastings
ratios will usually be one, or close to one. Because $s(1) + s(0) = 1$, $\alpha_{01} = 1$ if
$\ell(1)/\ell(0) \ge 1$, while $\alpha_{10} = 1$ if $\ell(0)/\ell(1) \ge 1$. Also, even if $\ell(1)/\ell(0) < 1$, because
in the nonparametric regression problem $s(0)$ is usually very close to 1,
$\alpha_{01} > s(0)$ is close to 1.
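The lemma can be checked numerically by computing the acceptance probability for a proposed move from 0 to 1 in two ways: directly from the conditional posterior, and from the closed form in terms of the prior and the likelihood ratio. The sketch below is illustrative; the function names are hypothetical, and `lik0` and `lik1` stand for the two likelihood values, which only need to be known up to a common constant.

```python
def acceptance_from_posterior(s0, lik0, lik1):
    """min(1, s(0) / posterior(0)), with the posterior formed by Bayes' rule."""
    s1 = 1.0 - s0
    post0 = s0 * lik0 / (s0 * lik0 + s1 * lik1)
    return min(1.0, s0 / post0)


def acceptance_from_lemma(s0, lik0, lik1):
    """Closed form s(0) + s(1) * min(1, lik1 / lik0)."""
    return s0 + (1.0 - s0) * min(1.0, lik1 / lik0)
```

With a prior probability of exclusion such as 0.99, the closed form shows immediately that the acceptance probability can never fall below 0.99, however unfavorable the likelihood ratio.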
References
Bartels, R., Fiebig, D., Plumb, M., 1996. Gas or electricity, which is cheaper?: an econometric
approach with an application to Australian expenditure data. Energy Journal 17, 33-58.
Box, G., Jenkins, G., Reinsel, G., 1994. Time Series Analysis, 3rd Edition. Prentice-Hall, New Jersey.
Chib, S., Greenberg, E., 1995a. Hierarchical analysis of SUR models with extensions to correlated
serial errors and time varying parameter models. Journal of Econometrics 68, 339-360.
Chib, S., Greenberg, E., 1995b. Understanding the Metropolis-Hastings algorithm. The American
Statistician 49, 327-335.
Denison, D., Mallick, B., Smith, A., 1998. Automatic Bayesian curve fitting. Journal of the Royal
Statistical Society, Series B 60, 333-350.
Donoho, D., Johnstone, I., 1994. Ideal spatial adaptation by wavelet shrinkage. Biometrika 81,
425-455.
Friedman, J., 1991. Multivariate adaptive regression splines (with discussion). The Annals of
Statistics 19, 1-141.
Hanssens, D., Weitz, B., 1980. The effectiveness of industrial print advertisements across product
categories. Journal of Marketing Research 17, 294-306.
Harvey, A., Koopman, S., 1993. Forecasting hourly electricity demand using time-varying splines.
Journal of the American Statistical Association 88 (424), 1228-1253.
O'Hagan, A., 1995. Fractional Bayes factors for model comparison (with discussion). Journal of the
Royal Statistical Society, Series B 57, 99-138.
Holmes, C., Mallick, B., 1998. Bayesian radial basis functions of variable dimension. Neural
Computation 10, 1217-1233.
Luo, Z., Wahba, G., 1997. Hybrid adaptive splines. Journal of the American Statistical Association
92, 107-116.
Mandy, D., Martins-Filho, C., 1993. Seemingly unrelated regressions under additive
heteroskedasticity: theory and share equation applications. Journal of Econometrics 58, 315-346.
Min, C., Zellner, A., 1993. Bayesian and non-Bayesian methods for combining models and forecasts
with applications to forecasting international growth rates. Journal of Econometrics 56, 89-118.
Powell, M., 1987. Radial basis functions for multivariate interpolation: a review. In: Mason, J., Cox,
M. (Eds.), Algorithms for Approximation.
Shephard, N., Pitt, M., 1998. Analysis of time varying covariances: a factor stochastic volatility
approach. Preprint.
Smith, M., 2000. Modeling and short term forecasting of New South Wales electricity system load.
Journal of Business and Economic Statistics, to appear.
Smith, M., Kohn, R., 1996. Nonparametric regression via Bayesian variable selection. Journal of
Econometrics 75 (2), 317-344.
Smith, M., Kohn, R., 1999. Bayesian parsimonious covariance matrix estimation. Preprint.
Srivastava, V., Giles, D., 1987. Seemingly Unrelated Regression Equations Models. Marcel Dekker,
New York.
Tierney, L., 1994. Markov chains for exploring posterior distributions. The Annals of Statistics 22,
1701-1762.
Wahba, G., 1990. Spline Models for Observational Data. SIAM, Philadelphia.
Wong, F., Hansen, M., Kohn, R., Smith, M., 1997. Focused sampling and its application to
nonparametric and robust regression. Preprint.
Yee, T., Wild, C., 1996. Vector generalised additive models. Journal of the Royal Statistical Society,
Series B 58, 481-493.
Zellner, A., 1962. An efficient method of estimating seemingly unrelated regression equations and
tests for aggregation bias. Journal of the American Statistical Association 57, 500-509.
Zellner, A., 1963. Estimators for seemingly unrelated regression equations: some exact finite sample
results. Journal of the American Statistical Association 58, 977-992.
Zellner, A., 1971. An Introduction to Bayesian Inference in Econometrics. Wiley, New York.