
CIRJE-F-971

A Variant of AIC Using Bayesian Marginal Likelihood
Yuki Kawakubo
Graduate School of Economics, The University of Tokyo
Tatsuya Kubokawa
The University of Tokyo
Muni S. Srivastava
University of Toronto
April 2015

CIRJE Discussion Papers can be downloaded without charge from:
http://www.cirje.e.u-tokyo.ac.jp/research/03research02dp.html

Discussion Papers are a series of manuscripts in their draft form. They are not intended for circulation or distribution except as indicated by the author. For that reason Discussion Papers may not be reproduced or distributed without the written consent of the author.

A Variant of AIC Using Bayesian Marginal Likelihood


Yuki Kawakubo, Tatsuya Kubokawa and Muni S. Srivastava
April 27, 2015

Abstract
We propose an information criterion which measures the prediction risk of the predictive density based on the Bayesian marginal likelihood from a frequentist point of view. We derive the criteria for selecting variables in linear regression models by putting the prior on the regression coefficients, and discuss the relationship between the proposed criteria and other related ones. There are three advantages of our method. Firstly, it is a compromise between the frequentist and Bayesian standpoints because it evaluates the frequentist risk of the Bayesian model, so that it is less influenced by prior misspecification. Secondly, a non-informative improper prior can also be used for constructing the criterion; when the uniform prior is assumed on the regression coefficients, the resulting criterion is identical to the residual information criterion (RIC) of Shi and Tsai (2002). Lastly, the criteria have the consistency property for selecting the true model.

Key words and phrases: AIC, BIC, consistency, Kullback–Leibler divergence, linear regression model, residual information criterion, variable selection.

1 Introduction

The problem of selecting appropriate models has been extensively studied in the literature since Akaike (1973, 1974), who derived the so-called Akaike information criterion (AIC). Since AIC and its variants are based on the risk of the predictive densities with respect to the Kullback–Leibler (KL) divergence, they can select a good model in the light of prediction. It is known, however, that the AIC-type criteria do not have the consistency property, namely the probability that the criteria select the true model does not converge to 1. Another approach to model selection is Bayesian procedures such as Bayes factors and the Bayesian information criterion (BIC) suggested by Schwarz (1978), both of which are constructed based on the Bayesian marginal likelihood. Bayesian procedures for model selection have the consistency property in some specific models, while they do not select models in terms of prediction. In addition, it is known that Bayes factors do not work for improper prior distributions and that the BIC does not use any specific prior information. In this paper, we provide a unified framework to derive an information criterion for model selection so that it can produce various information criteria including AIC, BIC and the residual information criterion (RIC) suggested by Shi and Tsai (2002). In particular, we propose an intermediate criterion between AIC and BIC using the empirical Bayes method.

Author affiliations:
Yuki Kawakubo: Graduate School of Economics, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-0033, JAPAN; Research Fellow of Japan Society for the Promotion of Science. E-mail: y.k.5.58.2010@gmail.com
Tatsuya Kubokawa: Faculty of Economics, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-0033, JAPAN. E-mail: tatsuya@e.u-tokyo.ac.jp
Muni S. Srivastava: Department of Statistics, University of Toronto, 100 St George Street, Toronto, Ontario, CANADA M5S 3G3. E-mail: srivasta@utstat.toronto.edu
To explain the unified framework in a general setup, let $y$ be an $n$-variate observable random vector whose density is $f(y|\theta)$ for a vector of unknown parameters $\theta$. Let $\hat{f}(\tilde{y}; y)$ be a predictive density for $f(\tilde{y}|\theta)$, where $\tilde{y}$ is an independent replication of $y$. We here evaluate the predictive performance of $\hat{f}(\tilde{y}; y)$ in terms of the following risk:

    R(\theta; \hat{f}) = \int \Big[ \int \log\Big\{ \frac{f(\tilde{y}|\theta)}{\hat{f}(\tilde{y}; y)} \Big\} f(\tilde{y}|\theta)\, d\tilde{y} \Big] f(y|\theta)\, dy.   (1.1)

Since this is interpreted as a risk with respect to the KL divergence, we call it the KL risk. The spirit of AIC suggests that we can provide an information criterion for model selection as an (asymptotically) unbiased estimator of the information

    I(\theta; \hat{f}) = -2 \int\!\!\int \log\{\hat{f}(\tilde{y}; y)\} f(\tilde{y}|\theta) f(y|\theta)\, d\tilde{y}\, dy = E\big[-2 \log\{\hat{f}(\tilde{y}; y)\}\big],   (1.2)

which is a part of (1.1) (multiplied by 2), where $E$ denotes the expectation with respect to the distribution $f(\tilde{y}, y|\theta) = f(\tilde{y}|\theta) f(y|\theta)$. Let $\Delta = I(\theta; \hat{f}) - E[-2\log\{\hat{f}(y; y)\}]$. Then, the AIC variant based on the predictor $\hat{f}(\tilde{y}; y)$ is defined by

    \mathrm{IC}(\hat{f}) = -2 \log\{\hat{f}(y; y)\} + \hat{\Delta},

where $\hat{\Delta}$ is an (asymptotically) unbiased estimator of $\Delta$.
It is interesting to point out that $\mathrm{IC}(\hat{f})$ produces AIC and BIC for specific predictors.

(AIC) Put $\hat{f}(\tilde{y}; y) = f(\tilde{y}|\hat{\theta})$ for the maximum likelihood estimator $\hat{\theta}$ of $\theta$. Then, $\mathrm{IC}(f(\tilde{y}|\hat{\theta}))$ is the exact AIC or the corrected AIC suggested by Sugiura (1978) and Hurvich and Tsai (1989), which is approximated by the AIC of Akaike (1973, 1974), namely $-2\log\{f(y|\hat{\theta})\} + 2\dim(\theta)$.

(BIC) Put $\hat{f}(\tilde{y}; y) = f_{\pi_0}(\tilde{y}) = \int f(\tilde{y}|\theta)\,\pi_0(\theta)\, d\theta$ for a proper prior distribution $\pi_0(\theta)$. Since it can be easily seen that $I(\theta; f_{\pi_0}) = E[-2\log\{f_{\pi_0}(y)\}]$, we have $\Delta = 0$ in this case, so that $\mathrm{IC}(f_{\pi_0}) = -2\log\{f_{\pi_0}(y)\}$, which is the Bayesian marginal likelihood. It is noted that $-2\log\{f_{\pi_0}(y)\}$ is approximated by $\mathrm{BIC} = -2\log\{f(y|\hat{\theta})\} + \log(n)\dim(\theta)$.
The criterion $\mathrm{IC}(\hat{f})$ can produce not only the conventional information criteria AIC and BIC, but also various criteria between AIC and BIC. For example, suppose that $\theta$ is divided as $\theta = (\beta^t, \omega^t)^t$ for a $p$-dimensional parameter vector of interest $\beta$ and a $q$-dimensional nuisance parameter vector $\omega$. We assume that $\beta$ has a prior density $\pi(\beta|\omega, \lambda)$ with hyperparameter $\lambda$. The model is described as

    y|\beta \sim f(y|\beta, \omega), \qquad \beta \sim \pi(\beta|\omega, \lambda),

and $\omega$ and $\lambda$ are estimated from the data. Inference based on such a model is called an empirical Bayes procedure. Put $\hat{f}(\tilde{y}; y) = f_\pi(\tilde{y}|\hat{\omega}, \hat{\lambda}) = \int f(\tilde{y}|\beta, \hat{\omega})\,\pi(\beta|\hat{\omega}, \hat{\lambda})\, d\beta$ for some estimators $\hat{\omega}$ and $\hat{\lambda}$. Then, the information in (1.2) is

    I(\theta; f_\pi) = -2 \int\!\!\int \log\{f_\pi(\tilde{y}|\hat{\omega}, \hat{\lambda})\} f(\tilde{y}|\beta, \omega) f(y|\beta, \omega)\, d\tilde{y}\, dy,   (1.3)

and the resulting information criterion is

    \mathrm{IC}(f_\pi) = -2 \log\{f_\pi(y|\hat{\omega}, \hat{\lambda})\} + \hat{\Delta}_\pi,   (1.4)

where $\hat{\Delta}_\pi$ is an (asymptotically) unbiased estimator of $\Delta_\pi = I(\theta; f_\pi) - E[-2\log\{f_\pi(y|\hat{\omega}, \hat{\lambda})\}]$.
There are three motivations for considering the information $I(\theta; f_\pi)$ in (1.3) and the information criterion $\mathrm{IC}(f_\pi)$ in (1.4).

Firstly, it is noted that the Bayesian predictor $f_\pi(\tilde{y}|\hat{\omega}, \hat{\lambda})$ is evaluated by the risk $R(\theta; f_\pi)$ in (1.1), which is based on a frequentist point of view. On the other hand, the Bayesian risk is

    r(\lambda; \hat{f}) = \int R(\theta; \hat{f})\,\pi(\beta|\omega, \lambda)\, d\beta,   (1.5)

which measures the prediction error of $\hat{f}(\tilde{y}; y)$ under the assumption that the prior information is correct, where $\theta = (\beta^t, \omega^t)^t$. The resulting Bayesian criteria such as PIC (Kitagawa, 1997) and DIC (Spiegelhalter et al., 2002) are sensitive to prior misspecification, since they depend on the prior information. Because $R(\theta; f_\pi)$ can measure the prediction error of the Bayesian model from the standpoint of frequentists, however, the resulting criterion $\mathrm{IC}(f_\pi)$ is less influenced by prior misspecification.

Secondly, we can construct the information criterion $\mathrm{IC}(f_\pi)$ when the prior distribution of $\beta$ is improper, since the information $I(\theta; f_\pi)$ in (1.3) can be defined formally for the corresponding improper marginal likelihood. Because the Bayesian risk $r(\lambda; f_\pi)$ does not exist for the improper prior, however, we cannot obtain the corresponding Bayesian criteria and cannot use the Bayesian risk. Objective Bayesians want to avoid informative priors, and many non-informative priors are improper. The suggested criterion $\mathrm{IC}(f_\pi)$ can respond to such a request. For example, objective Bayesians assume the uniform improper prior on the regression coefficients in linear regression models. It is interesting to note that the resulting variable selection criterion (1.4) is identical to the residual information criterion (RIC) of Shi and Tsai (2002), as shown in the next section.

Lastly, this criterion has the consistency property. We derive the criterion for the variable selection problem in the general linear regression model and prove that the criterion selects the true model with probability tending to one. The BIC and the marginal likelihood are known to have the consistency property (Nishii, 1984), while most AIC-type criteria are not consistent. On the other hand, AIC-type criteria have the property of choosing a good model in the sense of minimizing the prediction error (Shibata, 1981; Shao, 1997). Our proposed criterion is consistent for selection of the parameters of interest and selects a good model in the light of prediction based on the empirical Bayes model.

The rest of the paper is organized as follows. In Section 2, we obtain the information criterion (1.4) in the linear regression model with a general covariance structure and compare it with other related criteria. In Section 3, we prove the consistency of the criteria. In Section 4, we investigate the performance of the criteria through simulations. Section 5 concludes the paper with some discussion.

2 Proposed Criteria

2.1 Variable selection criteria for linear regression model

Consider the linear regression model

    y = X\beta + \varepsilon,   (2.1)

where $y$ is an $n \times 1$ observation vector of the response variables, $X$ is an $n \times p$ matrix of the explanatory variables, $\beta$ is a $p \times 1$ vector of the regression coefficients, and $\varepsilon$ is an $n \times 1$ vector of the random errors. Throughout the paper, we assume that $X$ has full column rank $p$. Here, the random error $\varepsilon$ has the distribution $\mathcal{N}_n(0, \sigma^2 V)$, where $\sigma^2$ is an unknown scalar and $V$ is a known positive definite matrix. We consider the problem of selecting the explanatory variables and assume that the true model is included in the family of the candidate models, which is the common assumption for obtaining the criterion.
We shall construct variable selection criteria of the form (1.4) for the regression model (2.1). We consider the following two situations.

[i] Normal prior for $\beta$. We first assume the prior distribution of $\beta$,

    \pi(\beta|\sigma^2) \sim \mathcal{N}_p(0, \sigma^2 W),

where $W$ is a suitably chosen $p \times p$ matrix of full rank. Examples of $W$ are $W = \lambda (X^t X)^{-1}$ for $\lambda > 0$ when $V$ is the identity matrix, which was introduced by Zellner (1986), or more simply $W = \lambda^{-1} I_p$. Because the likelihood is $f(y|\beta, \sigma^2) \sim \mathcal{N}_n(X\beta, \sigma^2 V)$, the marginal likelihood function is

    f_\pi(y|\sigma^2) = \int f(y|\beta, \sigma^2)\,\pi(\beta|\sigma^2)\, d\beta
                     = (2\pi\sigma^2)^{-n/2} |V|^{-1/2} |W|^{-1/2} |X^t V^{-1} X + W^{-1}|^{-1/2} \exp\{-y^t A y/(2\sigma^2)\},

where $A = V^{-1} - V^{-1} X (X^t V^{-1} X + W^{-1})^{-1} X^t V^{-1}$. Note that $A = (V + B)^{-1}$ for $B = X W X^t$, namely $f_\pi(y|\sigma^2) \sim \mathcal{N}_n(0, \sigma^2 (V + B))$. Then we take the predictive density as $\hat{f}(\tilde{y}; y) = f_\pi(\tilde{y}|\hat{\sigma}^2)$, and the information (1.3) can be written as

    I_{\pi,1}(\theta) = E\big[ n\log(2\pi\hat{\sigma}^2) + \log|V| + \log|W X^t V^{-1} X + I_p| + \tilde{y}^t A \tilde{y}/\hat{\sigma}^2 \big],   (2.2)

where $\hat{\sigma}^2 = y^t P y/n$ for $P = V^{-1} - V^{-1} X (X^t V^{-1} X)^{-1} X^t V^{-1}$, and $E$ denotes the expectation with respect to the distribution $f(\tilde{y}, y|\beta, \sigma^2) = f(\tilde{y}|\beta, \sigma^2) f(y|\beta, \sigma^2)$ for $\theta = (\beta^t, \sigma^2)^t$. Note that $\beta$ is the parameter of interest and $\sigma^2$ is the nuisance parameter, which corresponds to $\omega$ in the previous section. Then we propose the following information criterion.
Proposition 2.1 The information $I_{\pi,1}(\theta)$ in (2.2) is unbiasedly estimated by the information criterion

    \mathrm{IC}_{\pi,1} = -2\log\{f_\pi(y|\hat{\sigma}^2)\} + \frac{2n}{n-p-2},   (2.3)

where

    -2\log\{f_\pi(y|\hat{\sigma}^2)\} = n\log\hat{\sigma}^2 + \log|V| + \log|W X^t V^{-1} X + I_p| + y^t A y/\hat{\sigma}^2 + (\mathrm{const}),

namely, $E(\mathrm{IC}_{\pi,1}) = I_{\pi,1}(\theta)$.

If $n^{-1} W^{1/2} X^t V^{-1} X W^{1/2}$ converges to a $p \times p$ positive definite matrix as $n \to \infty$, $\log|W X^t V^{-1} X + I_p|$ can be approximated by $p \log n$. In that case, $\mathrm{IC}_{\pi,1}$ is approximately expressed as

    \mathrm{IC}_{\pi,1}^{*} = n\log\hat{\sigma}^2 + \log|V| + p\log n + 2 + y^t A y/\hat{\sigma}^2,

when $n$ is large.
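To make the construction concrete, the following is a minimal numerical sketch (ours, not the authors' implementation) of how $\mathrm{IC}_{\pi,1}$ in (2.3) could be computed with numpy; the function name and the use of generic matrix inverses are illustrative assumptions.

    import numpy as np

    def ic_pi1(y, X, V, W):
        """Sketch of IC_{pi,1} in (2.3): -2 log marginal likelihood (up to an
        additive constant) plus the exact bias correction 2n/(n - p - 2)."""
        n, p = X.shape
        Vinv = np.linalg.inv(V)
        XtVX = X.T @ Vinv @ X
        # P = V^{-1} - V^{-1} X (X^t V^{-1} X)^{-1} X^t V^{-1},  sigma^2_hat = y^t P y / n
        P = Vinv - Vinv @ X @ np.linalg.solve(XtVX, X.T @ Vinv)
        s2 = float(y @ P @ y) / n
        # A = V^{-1} - V^{-1} X (X^t V^{-1} X + W^{-1})^{-1} X^t V^{-1}
        Winv = np.linalg.inv(W)
        A = Vinv - Vinv @ X @ np.linalg.solve(XtVX + Winv, X.T @ Vinv)
        _, logdetV = np.linalg.slogdet(V)
        _, logdetWXVX = np.linalg.slogdet(W @ XtVX + np.eye(p))
        m2loglik = n * np.log(s2) + logdetV + logdetWXVX + float(y @ A @ y) / s2
        return m2loglik + 2.0 * n / (n - p - 2)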
Alternatively, the KL risk $r(\lambda; \hat{f})$ in (1.5) can also be used for evaluating the risk of the predictive density $f_\pi(\tilde{y}|\hat{\sigma}^2)$, since the prior distribution is proper. The resulting criterion is

    \mathrm{IC}_{\pi,2} = n\log\hat{\sigma}^2 + \log|V| + p\log n + p,   (2.4)

which is an asymptotically unbiased estimator of $I_{\pi,2}(\sigma^2) = E_\pi[I_{\pi,1}(\theta)]$ up to a constant, where $E_\pi$ denotes the expectation with respect to the prior distribution $\pi(\beta|\sigma^2)$, namely $E_\pi E(\mathrm{IC}_{\pi,2}) - I_{\pi,2}(\sigma^2) \to (\mathrm{const})$ as $n \to \infty$. It is interesting to point out that $\mathrm{IC}_{\pi,2}$ is analogous to the criterion proposed by Bozdogan (1987), known as the consistent AIC, who suggested replacing the penalty term $2p$ in the AIC with $p + p\log n$.
[ii] Uniform prior for $\beta$. We next assume the uniform prior for $\beta$, namely $\beta \sim \mathrm{uniform}(\mathbb{R}^p)$. Though this is an improper prior distribution, we can obtain the marginal likelihood function formally:

    f_r(y|\sigma^2) = \int f(y|\beta, \sigma^2)\, d\beta
                    = (2\pi\sigma^2)^{-(n-p)/2} |V|^{-1/2} |X^t V^{-1} X|^{-1/2} \exp\{-y^t P y/(2\sigma^2)\},

which is known as the residual likelihood (Patterson and Thompson, 1971), where $P = V^{-1} - V^{-1} X (X^t V^{-1} X)^{-1} X^t V^{-1}$. Then we take the predictive density as $\hat{f}(\tilde{y}; y) = f_r(\tilde{y}|\tilde{\sigma}^2)$, and the information (1.3) can be written as

    I_r(\theta) = E\big[ (n-p)\log(2\pi\tilde{\sigma}^2) + \log|V| + \log|X^t V^{-1} X| + \tilde{y}^t P \tilde{y}/\tilde{\sigma}^2 \big],   (2.5)

where $\tilde{\sigma}^2 = y^t P y/(n-p)$, which is the residual maximum likelihood (REML) estimator of $\sigma^2$ based on the residual likelihood $f_r(y|\sigma^2)$. Then we propose the following information criterion.

Proposition 2.2 The information $I_r(\theta)$ in (2.5) is unbiasedly estimated by the information criterion

    \mathrm{IC}_r = -2\log\{f_r(y|\tilde{\sigma}^2)\} + \frac{2(n-p)}{n-p-2},   (2.6)

where

    -2\log\{f_r(y|\tilde{\sigma}^2)\} = (n-p)\log\tilde{\sigma}^2 + \log|V| + \log|X^t V^{-1} X| + y^t P y/\tilde{\sigma}^2 + (\mathrm{const}),

namely, $E(\mathrm{IC}_r) = I_r(\theta)$.

Note that $y^t P y/\tilde{\sigma}^2 = n-p$. If $n^{-1} X^t V^{-1} X$ converges to a $p \times p$ positive definite matrix as $n \to \infty$, $\log|X^t V^{-1} X|$ can be approximated by $p\log n$. In that case, we can approximately write

    \mathrm{IC}_r^{*} = (n-p)\log\tilde{\sigma}^2 + \log|V| + p\log n + \frac{(n-p)^2}{n-p-2},   (2.7)

when $n$ is large. It is important to note that $\mathrm{IC}_r^{*}$ is identical to the RIC proposed by Shi and Tsai (2002) up to a constant. Since $(n-p)^2/(n-p-2) = (n+2) + \{4/(n-p-2) - p\}$ and $n+2$ is irrelevant to the model, we can subtract $n+2$ from $\mathrm{IC}_r^{*}$ in (2.7), which results in the RIC exactly. Note that the criterion based on $f_r(y|\sigma^2)$ and $r(\lambda; f_r)$ cannot be constructed, because the corresponding KL risk diverges to infinity for the improper prior.
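As a companion to the earlier sketch for $\mathrm{IC}_{\pi,1}$, a minimal implementation of $\mathrm{IC}_r$ in (2.6) might look as follows; again the function name is ours and the code is only a sketch of the formulas above, not the authors' software.

    import numpy as np

    def ic_r(y, X, V):
        """Sketch of IC_r in (2.6): -2 log residual likelihood (up to an
        additive constant) plus the exact correction 2(n - p)/(n - p - 2)."""
        n, p = X.shape
        Vinv = np.linalg.inv(V)
        XtVX = X.T @ Vinv @ X
        P = Vinv - Vinv @ X @ np.linalg.solve(XtVX, X.T @ Vinv)
        s2 = float(y @ P @ y) / (n - p)          # REML estimator of sigma^2
        _, logdetV = np.linalg.slogdet(V)
        _, logdetXVX = np.linalg.slogdet(XtVX)
        # y^t P y / s2 = n - p by construction
        m2loglik = (n - p) * np.log(s2) + logdetV + logdetXVX + (n - p)
        return m2loglik + 2.0 * (n - p) / (n - p - 2)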

2.2 Extension to the case of unknown covariance

In the derivation of the criteria, we have assumed that the scaled covariance matrix $V$ of the error term vector $\varepsilon$ is known. However, it is often the case that $V$ is unknown and is some function of an unknown parameter $\psi$, namely $V = V(\psi)$. In that case, $V$ in each criterion is replaced with its plug-in estimator $V(\hat{\psi})$, where $\hat{\psi}$ is some consistent estimator of $\psi$. This strategy is also used in many other studies, for example in Shi and Tsai (2002), who proposed the RIC. We suggest that $\psi$ be estimated based on the full model. This method of estimating the nuisance parameters from the full model is similar to the $C_p$ criterion of Mallows (1973). The scaled covariance matrix $W$ of the prior distribution of $\beta$ is also assumed to be known. In practice, its structure should be specified, and we have to estimate the parameters $\lambda$ involved in $W$ from the data. In the same manner as $V$, $W$ in each criterion is replaced with $W(\hat{\lambda})$. We propose that $\lambda$ be estimated based on each candidate model under consideration, because the structure of $W$ depends on the model.

We here give three examples for the regression model (2.1): a regression model with constant variance, a variance components model, and a regression model with ARMA errors, where the second and the third include unknown parameters in the covariance matrix.

[1] Regression model with constant variance. In the case where $V = I_n$, (2.1) represents a multiple regression model with constant variance. In this model, the scaled covariance matrix $V$ does not contain any unknown parameters.
[2] Variance components model. Consider a variance components model (Henderson, 1950) described by

    y = X\beta + Z_2 v_2 + \cdots + Z_r v_r + \varepsilon,   (2.8)

where $Z_i$ is an $n \times m_i$ matrix with $V_i = Z_i Z_i^t$, $v_i$ is an $m_i \times 1$ random vector having the distribution $\mathcal{N}_{m_i}(0, \psi_i I_{m_i})$ for $i \geq 2$, $\varepsilon$ is an $n \times 1$ random vector with distribution $\mathcal{N}_n(0, V_0 + \psi_1 V_1)$ for known $n \times n$ matrices $V_0$ and $V_1$, and $\varepsilon, v_2, \ldots, v_r$ are mutually independently distributed. The nested error regression model (NERM) is a special case of the variance components model, given by

    y_{ij} = x_{ij}^t \beta + v_i + \varepsilon_{ij}, \quad (i = 1, \ldots, m;\ j = 1, \ldots, n_i),   (2.9)

where the $v_i$'s and $\varepsilon_{ij}$'s are mutually independently distributed as $v_i \sim \mathcal{N}(0, \tau^2)$ and $\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)$, and $n = \sum_{i=1}^m n_i$. Note that the NERM in (2.9) is obtained from the variance components model (2.8) by putting $\psi_1 = \sigma^2$, $\psi_2 = \tau^2$, $V_1 = I_n$ and $Z_2 = \mathrm{diag}(j_{n_1}, \ldots, j_{n_m})$, where $j_k$ is the $k$-dimensional vector of ones. This model is often used for clustered data, and $v_i$ can be seen as the random effect of the cluster (Battese et al., 1988). For such a model, when one is interested in a specific cluster or in predicting the random effects, the conditional AIC proposed by Vaida and Blanchard (2005), which is based on the conditional likelihood given the random effects, is appropriate. However, when the aim of the analysis is focused on the population, the NERM can be seen as a linear regression model in which the random effects are absorbed into the error term, namely we can apply (2.1) with error term $Z_2 v_2 + \varepsilon$ and $V = V(\psi) = \psi V_2 + I_n$, where $\psi = \tau^2/\sigma^2$ and $V_2 = Z_2 Z_2^t = \mathrm{diag}(J_{n_1}, \ldots, J_{n_m})$ for $J_k = j_k j_k^t$. In that case, our proposed variable selection procedure is valid.
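For concreteness, the scaled covariance matrix $V(\psi) = \psi V_2 + I_n$ of the NERM viewed as (2.1) can be built as in the following small sketch; the function name and the scipy usage are our own illustrative choices.

    import numpy as np
    from scipy.linalg import block_diag

    def nerm_scaled_cov(psi, group_sizes):
        """V(psi) = psi * diag(J_{n_1}, ..., J_{n_m}) + I_n for the NERM,
        where psi = tau^2 / sigma^2 and J_k is the k x k matrix of ones."""
        V2 = block_diag(*[np.ones((k, k)) for k in group_sizes])
        return psi * V2 + np.eye(sum(group_sizes))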
[3] Regression model with autoregressive moving average errors. Consider the regression model (2.1), assuming that the random errors are generated by an ARMA($q, r$) process defined by

    \varepsilon_i - \phi_1 \varepsilon_{i-1} - \cdots - \phi_q \varepsilon_{i-q} = u_i - \vartheta_1 u_{i-1} - \cdots - \vartheta_r u_{i-r},

where $\{u_i\}$ is a sequence of independent normal random variables having mean $0$ and variance $\tau^2$. A special case of this model is the regression model with AR(1) errors satisfying $\varepsilon_1 \sim \mathcal{N}(0, \tau^2/(1-\rho^2))$ and $\varepsilon_i = \rho \varepsilon_{i-1} + u_i$, $u_i \sim \mathcal{N}(0, \tau^2)$, for $i = 2, 3, \ldots, n$. When we define $\sigma^2 = \tau^2/(1-\rho^2)$, the $(i,j)$-element of the scaled covariance matrix $V$ in (2.1) is $\rho^{|i-j|}$.
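Similarly, the scaled covariance matrix of the AR(1) example, whose $(i,j)$-element is $\rho^{|i-j|}$, can be formed directly; this small sketch is ours.

    import numpy as np

    def ar1_scaled_cov(rho, n):
        """Scaled covariance matrix V with (i, j)-element rho^{|i - j|}."""
        idx = np.arange(n)
        return rho ** np.abs(idx[:, None] - idx[None, :])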

3 Consistency of the Criteria

In this section, we prove that the proposed criteria have the consistency property. Our asymptotic framework is that $n$ goes to infinity while the true dimension of the regression coefficients is fixed. Following Shi and Tsai (2002), we first show that the criteria are consistent for the regression model with constant variance and prespecified $W$, and then extend the result to the regression model with a general covariance matrix and to the case where $W$ is estimated.
To discuss the consistency, we define the class of candidate models and the true model more formally. Let the $n \times p_\omega$ matrix $X_\omega$ consist of all the explanatory variables and assume that $\mathrm{rank}(X_\omega) = p_\omega$. To define the candidate model by an index $j$, suppose that $j$ denotes a subset of $\omega = \{1, \ldots, p_\omega\}$ containing $p_j$ elements, namely $p_j = \#(j)$, and that $X_j$ consists of the $p_j$ columns of $X_\omega$ indexed by the elements of $j$. Note that $X_\omega = X$ and $p_\omega = p$. We define the index set by $\mathcal{J} = \mathcal{P}(\omega)$, namely the power set of $\omega$. Then the model $j$ is

    y = X_j \beta_j + \varepsilon_j, \qquad \varepsilon_j \sim \mathcal{N}_n(0, \sigma_j^2 V),

where $X_j$ is $n \times p_j$ and $\beta_j$ is $p_j \times 1$. The prior distribution of $\beta_j$ is $\beta_j \sim \mathcal{N}_{p_j}(0, \sigma^2 W_j)$. The true model is denoted by $j_0$, and $X_{j_0}\beta_{j_0}$ is abbreviated to $X_0\beta_0$, which is the true mean of $y$. $\mathcal{J}$ is the collection of all the candidate models, and we divide $\mathcal{J}$ into two subsets $\mathcal{J}_+$ and $\mathcal{J}_-$, where $\mathcal{J}_+ = \{j \in \mathcal{J} : j_0 \subset j\}$ and $\mathcal{J}_- = \mathcal{J} \setminus \mathcal{J}_+$. Note that the true model $j_0$ is the smallest model in $\mathcal{J}_+$. Let $\hat{j}$ denote the model selected by some criterion. Following Shi and Tsai (2002), we make the following assumptions.

(A1) $E(\varepsilon_1^{4s}) < \infty$ for some positive integer $s$.

(A2) $0 < \liminf_{n\to\infty} \min_{j\in\mathcal{J}} |X_j^t X_j/n|$ and $\limsup_{n\to\infty} \max_{j\in\mathcal{J}} |X_j^t X_j/n| < \infty$.

(A3) $\liminf_{n\to\infty} n^{-1} \inf_{j\in\mathcal{J}_-} \|X_0\beta_0 - H_j X_0\beta_0\|^2 > 0$, where $H_j = X_j (X_j^t X_j)^{-1} X_j^t$.

We can now obtain asymptotic properties of the criteria for the regression model with constant variance.
Theorem 1 If assumptions (A1)–(A3) are satisfied, $\mathcal{J}_+$ is not empty, the $\varepsilon_i$'s are independent and identically distributed (iid), and $W_j$ in the prior distribution of $\beta_j$ is prespecified, then the criteria $\mathrm{IC}_{\pi,1}$, $\mathrm{IC}_{\pi,1}^{*}$, $\mathrm{IC}_{\pi,2}$, $\mathrm{IC}_r$ and $\mathrm{IC}_r^{*}$ are consistent, namely $P(\hat{j} = j_0) \to 1$ as $n \to \infty$.

The proof of Theorem 1 is given in Appendix B.

We next consider the regression model with a general covariance structure and the case where $W_j$ is estimated from the data. In this case, $V$ and $W_j$ are replaced with their plug-in estimators $V(\hat{\psi})$ and $W_j(\hat{\lambda}_j)$, respectively.

Theorem 2 Assume that $\hat{\psi} - \psi_0$ and $\hat{\lambda}_j - \lambda_{j,0}$ tend to $0$ in probability as $n \to \infty$ for all $j \in \mathcal{J}$. In addition, assume that the elements of $V(\psi)$ and $W_j(\lambda_j)$ are continuous functions of $\psi$ and $\lambda_j$, and that $V(\psi)$ and $W_j(\lambda_j)$ are positive definite in neighborhoods of $\psi_0$ and $\lambda_{j,0}$, for all $j \in \mathcal{J}$. If assumptions (A1)–(A3) are satisfied when $X_j$ and $\varepsilon$ are replaced with $V^{-1/2} X_j$ and $\eta = V^{-1/2}\varepsilon$ respectively, $\mathcal{J}_+$ is not empty and the $\eta_i$'s are iid, then the criteria $\mathrm{IC}_{\pi,1}$, $\mathrm{IC}_{\pi,1}^{*}$, $\mathrm{IC}_{\pi,2}$, $\mathrm{IC}_r$ and $\mathrm{IC}_r^{*}$ are consistent.

For the proof of Theorem 2, we can use the same techniques as those for the proof of Theorem 1.

4 Simulations

In this section, we compare the numerical performance of the proposed criteria with some
other conventional ones, which are AIC, BIC, the corrected AIC (AICC) by Sugiura (1978)
and Hurvich and Tsai (1989). We shall consider the three regression modelsregression
model with constant variance, NERM, and regression model with AR(1) errorswhich are
taken as examples of (2.1) in Section 2.2. For the NERM, we consider the balanced sample
case, namely n1 = = nm (= n0 ). In each simulation, 1000 realizations are generated
from (2.1) with = (1, 1, 1, 1, 0, 0, 0)t , namely the full model is seven-dimensional and the
true model is four-dimensional. All explanatory variables are randomly generated from the
1/2
standard normal distribution. The signal-to-noise ratio (SNR = {var(xti )/var(i )} ) is
controlled at 1, 3, and 5. In the NERM, three cases of variance ratio = 2 / 2 are considered
with = 0.5, 1 and 2. In the regression model with AR(1) errors, three correlation structures
are considered with AR parameter = 0.1, 0.5 and 0.8.
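A hedged sketch of one replication of this design is given below; the function name, the unit-diagonal assumption on $V$ used to set the error scale from the target SNR, and the random seed are our own choices and may differ in detail from the authors' code.

    import numpy as np

    rng = np.random.default_rng(0)

    def generate_data(n, beta, V, snr):
        """One replication of (2.1): rows of X are standard normal and
        eps ~ N(0, sigma^2 V), with sigma set so that
        {var(x_i^t beta)/var(eps_i)}^{1/2} equals the target SNR
        (assuming V has unit diagonal, so that var(eps_i) = sigma^2)."""
        p = len(beta)
        X = rng.standard_normal((n, p))
        sigma = np.sqrt(float(beta @ beta)) / snr   # var(x_i^t beta) = ||beta||^2
        L = np.linalg.cholesky(V)
        eps = sigma * (L @ rng.standard_normal(n))
        return X, X @ beta + eps

    beta_true = np.array([1.0, 1, 1, 1, 0, 0, 0])
    X, y = generate_data(40, beta_true, np.eye(40), snr=3.0)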
When deriving the criteria $\mathrm{IC}_{\pi,1}$, $\mathrm{IC}_{\pi,1}^{*}$ and $\mathrm{IC}_{\pi,2}$, we set the prior distribution of $\beta$ as $\mathcal{N}_p(0, \sigma^2\lambda^{-1}I_p)$, namely $W = \lambda^{-1}I_p$. The hyperparameter $\lambda$ is estimated by maximizing the marginal likelihood $f_\pi(y|\hat{\sigma}^2)$, where the estimate $\hat{\sigma}^2 = y^t P y/n$ of $\sigma^2$ is plugged in. The unknown parameter $\psi$ involved in $V$ is estimated by some consistent estimator based on the full model. In the NERM, $\psi = \tau^2/\sigma^2$ is estimated by $\hat{\tau}^2_{\mathrm{PR}}/\hat{\sigma}^2_{\mathrm{PR}}$, where $\hat{\sigma}^2_{\mathrm{PR}}$ and $\hat{\tau}^2_{\mathrm{PR}}$ are the unbiased estimators proposed by Prasad and Rao (1990). Let $S_0 = y^t\{I_n - X(X^t X)^{-1}X^t\}y$ and $S_1 = y^t\{E - EX(X^t E X)^{-1}X^t E\}y$, where $E = \mathrm{diag}(E_1, \ldots, E_m)$ and $E_i = I_{n_0} - n_0^{-1}J_{n_0}$ for $i = 1, \ldots, m$. Then the Prasad–Rao estimators of $\sigma^2$ and $\tau^2$ are

    \hat{\sigma}^2_{\mathrm{PR}} = S_1/(n - m - p), \qquad \hat{\tau}^2_{\mathrm{PR}} = \big\{S_0 - (n-p)\hat{\sigma}^2_{\mathrm{PR}}\big\}/n_{*},

where $n_{*} = n - \mathrm{tr}[Z_2^t X(X^t X)^{-1}X^t Z_2]$. In the regression model with AR(1) errors, the AR parameter $\rho$ is estimated by the maximum likelihood estimator based on the full model. Note that $\psi$ is estimated based on the full model, while $\sigma^2$ and $\lambda$ are estimated based on each candidate model using the plug-in version of $V(\hat{\psi})$.
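The Prasad–Rao estimators above translate directly into code; the following is our own sketch for a (possibly unbalanced) NERM, with the group sizes passed explicitly. It is only an illustration of the $S_0$ and $S_1$ formulas in the text, not the authors' implementation.

    import numpy as np
    from scipy.linalg import block_diag

    def prasad_rao(y, X, group_sizes):
        """Prasad-Rao (1990) unbiased estimators of sigma^2 and tau^2,
        following the S0 and S1 quadratic forms defined in the text."""
        n, p = X.shape
        m = len(group_sizes)
        # E = diag(E_1, ..., E_m) with E_i = I_{n_i} - n_i^{-1} J_{n_i}
        E = block_diag(*[np.eye(k) - np.ones((k, k)) / k for k in group_sizes])
        H = X @ np.linalg.solve(X.T @ X, X.T)
        S0 = float(y @ (np.eye(n) - H) @ y)
        EX = E @ X
        S1 = float(y @ (E - EX @ np.linalg.solve(X.T @ EX, EX.T)) @ y)
        sigma2 = S1 / (n - m - p)
        Z2 = block_diag(*[np.ones((k, 1)) for k in group_sizes])
        n_star = n - np.trace(Z2.T @ X @ np.linalg.solve(X.T @ X, X.T @ Z2))
        tau2 = (S0 - (n - p) * sigma2) / n_star
        return sigma2, tau2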


The candidate models include all the subsets of the full model, and the model is selected by each criterion. The performance of the criteria is measured by the number of times the true model is selected and by the prediction error of the selected model based on quadratic loss, namely $\|X\hat{\beta} - X_0\beta_0\|^2/n$.
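The all-subsets search described here can be sketched as follows; the helper name and the use of one of the criterion sketches above are our own illustrative assumptions.

    import numpy as np
    from itertools import combinations

    def select_model(y, X_full, V, criterion):
        """Evaluate criterion(y, X_j, V) over every non-empty subset j of the
        columns of the full design matrix and return the minimizing subset."""
        p_full = X_full.shape[1]
        best, best_val = None, np.inf
        for size in range(1, p_full + 1):
            for j in combinations(range(p_full), size):
                val = criterion(y, X_full[:, list(j)], V)
                if val < best_val:
                    best, best_val = j, val
        return best

    # e.g. with the ic_r sketch above: j_hat = select_model(y, X, np.eye(len(y)), ic_r)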
Tables 1–3 give the number of times the true model is selected by each criterion, and the average prediction error of the selected model is shown in Tables 4–6, for each of the regression models. From these tables, we can see the following three facts. Firstly, the number of times the true model is selected approaches 1000 for all the proposed criteria, which is numerical evidence of their consistency. Though the BIC is also consistent, its small sample performance is not as good as that of our criteria. Secondly, the proposed criteria are not only consistent but also have smaller prediction error even when the sample size is small. In particular, $\mathrm{IC}_{\pi,1}$ is the best in most of the experiments except when both the sample size and the SNR are small; AIC and AICC perform well in that situation in terms of prediction error. Thirdly, $\mathrm{IC}_{\pi,1}$ and $\mathrm{IC}_r$ have better performance than their approximations $\mathrm{IC}_{\pi,1}^{*}$ and $\mathrm{IC}_r^{*}$, respectively, but the difference gets smaller as $n$ becomes larger.
Table 1: The number of selecting the true model by the criteria in 1000 realizations of the regression model with constant variance.

                        SNR = 1   SNR = 3   SNR = 5
  n = 20   AIC              130       428       428
           BIC              118       587       588
           AICC              89       749       755
           IC_π,1           115       843       905
           IC*_π,1           73       732       738
           IC_π,2            73       731       737
           IC_r             143       797       882
           IC*_r            147       828       898

  n = 40   AIC              419       536       536
           BIC              424       800       800
           AICC             470       687       687
           IC_π,1           472       900       938
           IC*_π,1          353       876       876
           IC_π,2           352       876       876
           IC_r             462       895       934
           IC*_r            478       899       941

  n = 80   AIC              546       553       553
           BIC              827       872       872
           AICC             604       613       613
           IC_π,1           750       934       968
           IC*_π,1          839       928       928
           IC_π,2           838       928       928
           IC_r             722       937       968
           IC*_r            739       936       969

Table 2: The number of selecting the true model by the criteria in 1000 realizations of the nested error regression model.

                           ψ = 0.5              ψ = 1                ψ = 2
  SNR                  1     3     5        1     3     5        1     3     5
  n0 = 5   AIC        75   382   385      104   387   393      137   393   403
  m = 4    BIC        60   546   564       90   539   574      134   556   586
           AICC       57   694   736       86   693   742      140   691   748
           IC_π,1     71   821   902      119   829   915      181   835   941
           IC*_π,1    35   646   702       66   656   715      115   662   721
           IC_π,2     34   642   701       70   653   715      116   657   720
           IC_r      244   789   888      310   818   900      432   838   924
           IC*_r      78   723   912      106   723   922      149   698   941

  n0 = 5   AIC       220   458   458      235   465   465      259   473   473
  m = 8    BIC       169   731   731      208   739   741      240   746   750
           AICC      219   607   607      251   612   612      284   625   627
           IC_π,1    319   891   936      369   913   943      436   928   953
           IC*_π,1   107   838   839      151   836   841      199   843   853
           IC_π,2    112   838   839      158   836   841      202   841   853
           IC_r      436   890   934      512   903   943      593   925   953
           IC*_r     209   901   944      230   901   949      247   905   962

  n0 = 5   AIC       418   522   522      417   528   528      418   545   545
  m = 16   BIC       394   853   853      407   859   859      416   866   866
           AICC      452   594   594      447   603   603      443   616   616
           IC_π,1    622   926   955      618   941   959      624   951   968
           IC*_π,1   299   911   910      332   913   913      354   915   915
           IC_π,2    300   910   910      331   913   913      358   915   915
           IC_r      691   925   954      695   942   960      709   953   968
           IC*_r     443   932   959      438   946   964      421   956   972

Table 3: The number of selecting the true model by the criteria in 1000 realizations of the regression model with AR(1) errors.

                           ρ = 0.1              ρ = 0.5              ρ = 0.8
  SNR                  1     3     5        1     3     5        1     3     5
  n = 20   AIC       110   347   346      125   373   372      193   377   414
           BIC        91   482   482      118   523   542      209   502   587
           AICC       84   642   646      117   652   688      228   584   715
           IC_π,1    101   741   834      138   774   862      274   742   901
           IC*_π,1    64   620   625       83   625   667      194   554   688
           IC_π,2     63   618   624       88   621   667      197   553   685
           IC_r      123   698   801      224   769   846      562   808   901
           IC*_r     122   702   790      144   715   826      241   646   825

  n = 40   AIC       365   483   483      356   544   544      283   533   551
           BIC       369   756   756      334   791   793      286   737   797
           AICC      416   642   642      373   671   672      308   668   700
           IC_π,1    422   877   917      450   909   949      427   901   976
           IC*_π,1   315   844   844      281   859   862      247   754   856
           IC_π,2    314   844   844      282   858   862      255   752   854
           IC_r      430   866   917      507   903   945      685   918   974
           IC*_r     412   865   918      377   881   932      316   785   932

  n = 80   AIC       516   521   521      483   552   552      333   553   553
           BIC       789   851   851      598   865   865      334   859   868
           AICC      586   593   593      525   614   614      367   637   638
           IC_π,1    738   926   949      691   936   962      550   961   979
           IC*_π,1   795   912   912      560   905   905      289   899   919
           IC_π,2    792   912   912      559   905   905      296   889   919
           IC_r      713   921   951      724   938   961      692   966   980
           IC*_r     714   920   951      600   911   952      382   908   958

Table 4: The prediction error of the best model selected by the criteria for the regression model with constant variance.

                      SNR = 1    SNR = 3    SNR = 5
  n = 20   AIC          1.59      0.141     0.0504
           BIC          1.74      0.131     0.0468
           AICC         1.77      0.120     0.0421
           IC_π,1       1.70      0.111     0.0372
           IC*_π,1      1.92      0.122     0.0429
           IC_π,2       1.92      0.122     0.0430
           IC_r         1.56      0.116     0.0383
           IC*_r        1.57      0.114     0.0374

  n = 40   AIC          0.708     0.0660    0.0238
           BIC          0.862     0.0568    0.0205
           AICC         0.732     0.0609    0.0219
           IC_π,1       0.754     0.0523    0.0180
           IC*_π,1      1.05      0.0534    0.0192
           IC_π,2       1.05      0.0534    0.0192
           IC_r         0.716     0.0524    0.0181
           IC*_r        0.718     0.0522    0.0179

  n = 80   AIC          0.292     0.0321    0.0115
           BIC          0.265     0.0260    0.00936
           AICC         0.283     0.0310    0.0112
           IC_π,1       0.265     0.0244    0.00841
           IC*_π,1      0.285     0.0245    0.00883
           IC_π,2       0.285     0.0245    0.00883
           IC_r         0.270     0.0243    0.00841
           IC*_r        0.267     0.0243    0.00840

Table 5: The prediction error of the best model selected by the criteria for the nested error regression model.

                          ψ = 0.5                     ψ = 1                       ψ = 2
  SNR               1       3        5         1       3        5         1       3        5
  n0 = 5  AIC      1.80   0.150   0.0524     1.61   0.145   0.0494     1.44   0.140   0.0463
  m = 4   BIC      1.96   0.159   0.0496     1.76   0.159   0.0473     1.50   0.158   0.0447
          AICC     2.00   0.172   0.0458     1.78   0.174   0.0443     1.53   0.179   0.0428
          IC_π,1   1.93   0.151   0.0417     1.72   0.156   0.0410     1.47   0.159   0.0400
          IC*_π,1  2.14   0.190   0.0470     1.88   0.188   0.0452     1.60   0.186   0.0434
          IC_π,2   2.14   0.192   0.0470     1.87   0.190   0.0452     1.59   0.186   0.0434
          IC_r     1.48   0.135   0.0422     1.37   0.137   0.0413     1.21   0.139   0.0404
          IC*_r    1.91   0.251   0.0413     1.72   0.271   0.0434     1.52   0.316   0.0471

  n0 = 5  AIC      0.983  0.0696  0.0251     0.907  0.0659  0.0237     0.824  0.0619  0.0223
  m = 8   BIC      1.25   0.0622  0.0224     1.11   0.0615  0.0216     1.01   0.0613  0.0209
          AICC     1.10   0.0655  0.0236     0.989  0.0627  0.0226     0.899  0.0610  0.0215
          IC_π,1   0.989  0.0567  0.0197     0.888  0.0554  0.0196     0.804  0.0565  0.0194
          IC*_π,1  1.41   0.0594  0.0211     1.21   0.0613  0.0207     1.07   0.0635  0.0202
          IC_π,2   1.40   0.0594  0.0211     1.20   0.0613  0.0207     1.07   0.0652  0.0202
          IC_r     0.743  0.0568  0.0197     0.666  0.0557  0.0196     0.602  0.0565  0.0194
          IC*_r    1.16   0.0609  0.0196     1.07   0.0691  0.0195     1.01   0.0841  0.0193

  n0 = 5  AIC      0.451  0.0341  0.0123     0.440  0.0325  0.0117     0.434  0.0308  0.0111
  m = 16  BIC      0.740  0.0294  0.0106     0.711  0.0289  0.0104     0.684  0.0284  0.0102
          AICC     0.489  0.0333  0.0120     0.483  0.0319  0.0115     0.481  0.0304  0.0109
          IC_π,1   0.433  0.0280  0.00980    0.435  0.0277  0.00983    0.438  0.0275  0.00980
          IC*_π,1  0.864  0.0283  0.0102     0.812  0.0281  0.0101     0.770  0.0279  0.0100
          IC_π,2   0.862  0.0283  0.0102     0.811  0.0281  0.0101     0.766  0.0279  0.0100
          IC_r     0.348  0.0280  0.00981    0.346  0.0277  0.00982    0.344  0.0274  0.00980
          IC*_r    0.658  0.0278  0.00977    0.664  0.0276  0.00979    0.683  0.0281  0.00978

Table 6: The prediction error of the best model selected by the criteria for the regression model with AR(1) errors.

                          ρ = 0.1                     ρ = 0.5                     ρ = 0.8
  SNR               1       3        5         1       3        5         1       3        5
  n = 20  AIC      1.85   0.175   0.0628     1.71   0.173   0.0613     1.75   0.256   0.0759
          BIC      2.01   0.170   0.0595     1.82   0.192   0.0583     1.73   0.305   0.0802
          AICC     2.04   0.163   0.0549     1.85   0.203   0.0577     1.69   0.343   0.0909
          IC_π,1   1.96   0.153   0.0492     1.78   0.182   0.0538     1.66   0.312   0.0831
          IC*_π,1  2.17   0.164   0.0559     1.95   0.211   0.0566     1.71   0.355   0.0962
          IC_π,2   2.17   0.164   0.0559     1.95   0.216   0.0566     1.72   0.354   0.0972
          IC_r     1.79   0.154   0.0505     1.59   0.157   0.0516     1.71   0.224   0.0720
          IC*_r    1.84   0.159   0.0507     1.72   0.212   0.0557     1.77   0.361   0.114

  n = 40  AIC      0.783  0.0733  0.0264     0.812  0.0685  0.0246     1.09   0.124   0.0365
          BIC      0.973  0.0639  0.0230     0.992  0.0640  0.0225     1.13   0.162   0.0408
          AICC     0.824  0.0681  0.0245     0.881  0.0662  0.0236     1.12   0.135   0.0360
          IC_π,1   0.839  0.0587  0.0204     0.832  0.0595  0.0207     1.07   0.136   0.0347
          IC*_π,1  1.15   0.0601  0.0216     1.10   0.0638  0.0218     1.14   0.202   0.0432
          IC_π,2   1.15   0.0601  0.0216     1.10   0.0639  0.0218     1.15   0.202   0.0441
          IC_r     0.782  0.0592  0.0203     0.711  0.0598  0.0208     0.919  0.118   0.0347
          IC*_r    0.813  0.0592  0.0203     0.857  0.0631  0.0209     1.12   0.195   0.0446

  n = 80  AIC      0.311  0.0342  0.0123     0.365  0.0329  0.0118     0.666  0.0520  0.0187
          BIC      0.301  0.0282  0.0102     0.500  0.0290  0.0105     0.861  0.0613  0.0182
          AICC     0.303  0.0331  0.0119     0.377  0.0322  0.0116     0.693  0.0526  0.0186
          IC_π,1   0.286  0.0262  0.00920    0.365  0.0279  0.00983    0.643  0.0530  0.0179
          IC*_π,1  0.331  0.0266  0.00958    0.566  0.0284  0.0102     0.914  0.0692  0.0181
          IC_π,2   0.333  0.0266  0.00958    0.566  0.0284  0.0102     0.910  0.0692  0.0181
          IC_r     0.290  0.0264  0.00918    0.330  0.0278  0.00984    0.514  0.0499  0.0179
          IC*_r    0.292  0.0264  0.00919    0.414  0.0283  0.00991    0.778  0.0662  0.0180

5 Discussion

We have derived variable selection criteria for the linear regression model based on the frequentist KL risk of the predictive density constructed from the Bayesian marginal likelihood. We have proved the consistency of the criteria and have shown through simulations that they also perform well in the sense of prediction.

We have given some advantages of the approach based on the frequentist risk $R(\theta; \hat{f})$ in (1.1). We here explain them more clearly through a comparison with the related Bayesian criteria. When the prior distribution $\pi(\beta|\omega, \lambda)$ is proper, we can treat the Bayesian prediction risk

    r(\lambda; \hat{f}) = \int R(\theta; \hat{f})\,\pi(\beta|\omega, \lambda)\, d\beta

in (1.5). When $\omega$ and $\lambda$ are known, the predictive density $\hat{f}(\tilde{y}; y)$ which minimizes $r(\lambda; \hat{f})$ is the Bayesian predictive density (posterior predictive density) $f(\tilde{y}|y, \omega, \lambda)$ given by

    f(\tilde{y}|y, \omega, \lambda) = \int f(\tilde{y}|\beta, \omega)\,\pi(\beta|y, \omega, \lambda)\, d\beta
    = \frac{\int f(\tilde{y}|\beta, \omega) f(y|\beta, \omega)\,\pi(\beta|\omega, \lambda)\, d\beta}{\int f(y|\beta, \omega)\,\pi(\beta|\omega, \lambda)\, d\beta}.

When $\omega$ and $\lambda$ are unknown, we can consider the Bayesian risk of the plug-in predictive density $f(\tilde{y}|y, \hat{\omega}, \hat{\lambda})$. The resulting criterion is known as the predictive likelihood (Akaike, 1980a) or the PIC (Kitagawa, 1997). The deviance information criterion (DIC) of Spiegelhalter et al. (2002) and the Bayesian predictive information criterion (BPIC) of Ando (2007) are related criteria based on the Bayesian prediction risk $r(\lambda; \hat{f})$.

Akaike's Bayesian information criterion (ABIC) (Akaike, 1980b) is another information criterion based on the Bayesian marginal likelihood, given by

    \mathrm{ABIC} = -2\log\{f_\pi(y|\hat{\lambda})\} + 2\dim(\lambda),

where the nuisance parameter $\omega$ is not considered. The ABIC measures the following KL risk:

    \int \Big[ \int \log\Big\{ \frac{f_\pi(\tilde{y}|\lambda)}{f_\pi(\tilde{y}|\hat{\lambda})} \Big\} f_\pi(\tilde{y}|\lambda)\, d\tilde{y} \Big] f_\pi(y|\lambda)\, dy,

which is not the same as either $R(\theta; \hat{f})$ or $r(\lambda; \hat{f})$. The ABIC is a criterion for choosing the hyperparameter $\lambda$ in the same sense as the AIC. However, it is noted that the ABIC works as a model selection criterion for $\beta$ because it is based on the Bayesian marginal likelihood.

A drawback of such Bayesian criteria is that we cannot construct them for improper prior distributions $\pi(\beta|\omega, \lambda)$, since the corresponding Bayesian prediction risks do not exist. On the other hand, we can construct the corresponding criteria based on $R(\theta; \hat{f})$, because the approach suggested in this paper measures the prediction risk in the framework of frequentists. In fact, putting the uniform improper prior on the regression coefficients in the linear regression model, we obtain the RIC of Shi and Tsai (2002). Note that the criteria based on the improper marginal likelihood work for variable selection only when the marginal likelihood itself does. For the case where improper priors cannot be used for model selection, intrinsic priors have been proposed in the literature (Berger and Pericchi, 1996; Casella and Moreno, 2006, and others), which give an objective and automatic procedure. As future work, it is worthwhile to consider such an automatic procedure in the framework of our proposed criteria.
Acknowledgments.
The first author was supported in part by Grant-in-Aid for Scientific Research (26-10395)
from Japan Society for the Promotion of Science (JSPS). The second author was supported
in part by Grant-in-Aid for Scientific Research (23243039 and 26330036) from JSPS. The
third author was supported in part by NSERC of Canada.

A Derivations of the Criteria

In this section, we show the derivations of the criteria. To this end, we use the following lemma, which was shown in Section A.2 of Srivastava and Kubokawa (2010).

Lemma A.1 Assume that $C$ is an $n \times n$ symmetric matrix, $M$ is an idempotent matrix of rank $p$, and $u \sim \mathcal{N}(0, I_n)$. Then,

    E\Big[\frac{u^t C u}{u^t (I_n - M) u}\Big] = \frac{\mathrm{tr}(C)}{n-p-2} - \frac{2\,\mathrm{tr}[C(I_n - M)]}{(n-p)(n-p-2)}.
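Lemma A.1 is easy to check numerically. The following Monte Carlo sketch (ours, with arbitrary illustrative dimensions) compares the sample mean of the ratio with the right-hand side of the identity.

    import numpy as np

    rng = np.random.default_rng(1)
    n, p = 30, 4

    C = rng.standard_normal((n, n)); C = (C + C.T) / 2    # symmetric C
    B = rng.standard_normal((n, p))
    M = B @ np.linalg.solve(B.T @ B, B.T)                 # idempotent of rank p
    Q = np.eye(n) - M

    u = rng.standard_normal((100_000, n))
    num = np.einsum('ij,jk,ik->i', u, C, u)               # u^t C u
    den = np.einsum('ij,jk,ik->i', u, Q, u)               # u^t (I_n - M) u
    lhs = (num / den).mean()
    rhs = np.trace(C) / (n - p - 2) - 2 * np.trace(C @ Q) / ((n - p) * (n - p - 2))
    print(lhs, rhs)   # the two values agree up to Monte Carlo error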

A.1 Derivation of $\mathrm{IC}_{\pi,1}$ in (2.3)

It is sucient to show that the bias correction ,1 = I,1 () E [2 log{f (y|


2 )}] is
2n/(n p 2), where I,1 () is given by (2.2). It follows that
,1 =E (e
y t Ae
y /
2 ) E (y t Ay/
2)
=E (e
y t Ae
y ) E (1/
2 ) E (y t Ay/
2 ).
Firstly,
E (e
y t Ae
y ) =E [(e
y X + X)t A(e
y X + X)]
t
t
2
= tr (AV ) + X AX.

(A.1)

Secondly, noting that n


2 = y t P y = 2 ut (I n M )u for
u =V 1/2 (y X)/,
M =I n V 1/2 X(X t V 1 X)1 X t V 1/2 ,
and that P X = 0, we can obtain

)
[
]
1
1
= nE 2 t
E (1/
) =nE
ytP y
u (I n M )u
n
= 2
.
(n p 2)

(A.2)

17

(A.3)

Finally,

]
t
1/2
t
2 t 1/2

u
V
AV
u
+

X
AX
= nE
E (y t Ay/
2 ) =nE
2 ut (I n M )u
{
}
tr (AV )
2tr (AV P V )
t X t AX
=n

+
.
n p 2 (n p)(n p 2) 2 (n p 2)
(

y t Ay
ytP y

(A.4)

The last equation in the above can be derived by Lemma A.1. Combining (A.1), (A.3) and
(A.4), we get
2n tr (AV P V )
,1 =
.
(n p)(n p 2)
We can see that
tr (AV P V ) =tr {(V + B)1 (V + B B)P V }
=tr (P V ) tr {(V + B)1 BP V }
=tr (I n M ) = n p,

(A.5)

since BP = XW X t P = 0, then we obtain ,1 = 2n/(n p 2).
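The exact bias correction $2n/(n-p-2)$ can also be verified by simulation. The sketch below is our own check, with arbitrary illustrative values of $\beta$, $\sigma$ and $W$; it estimates $E(\tilde{y}^t A \tilde{y}/\hat{\sigma}^2) - E(y^t A y/\hat{\sigma}^2)$ by Monte Carlo and compares it with $2n/(n-p-2)$.

    import numpy as np

    rng = np.random.default_rng(2)
    n, p = 25, 3
    X = rng.standard_normal((n, p))
    beta, sigma = rng.standard_normal(p), 1.3
    W = np.eye(p)                                   # prior scale matrix; V = I_n here
    A = np.linalg.inv(np.eye(n) + X @ W @ X.T)      # A = (V + B)^{-1}, B = X W X^t
    P = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)

    reps, diffs = 50_000, []
    for _ in range(reps):
        y = X @ beta + sigma * rng.standard_normal(n)
        y_rep = X @ beta + sigma * rng.standard_normal(n)   # independent replication
        s2 = float(y @ P @ y) / n
        diffs.append(float(y_rep @ A @ y_rep) / s2 - float(y @ A @ y) / s2)

    print(np.mean(diffs), 2 * n / (n - p - 2))      # agree up to Monte Carlo error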

A.2 Derivation of $\mathrm{IC}_{\pi,2}$ in (2.4)

From the facts that $E(\mathrm{IC}_{\pi,1}) = I_{\pi,1}(\theta)$ and that $E_\pi E(\mathrm{IC}_{\pi,1}) = E_\pi[I_{\pi,1}(\theta)] = I_{\pi,2}(\sigma^2)$, it suffices to show that $E_\pi E(\mathrm{IC}_{\pi,1})$ is approximated as

    E_\pi E(\mathrm{IC}_{\pi,1}) \approx E_\pi E[n\log\hat{\sigma}^2 + \log|V| + p\log n + 2 + y^t A y/\hat{\sigma}^2]
    \approx E_\pi E[n\log\hat{\sigma}^2 + \log|V| + p\log n + p] + (n+2) = E_\pi E(\mathrm{IC}_{\pi,2}) + (n+2),

when $n$ is large. Note that $n+2$ is irrelevant to the model. It follows that

    E\Big(\frac{y^t A y}{\hat{\sigma}^2}\Big)
    = n E\Big[\frac{y^t\{V^{-1} - V^{-1}X(X^tV^{-1}X + W^{-1})^{-1}X^tV^{-1}\}y}{y^t\{V^{-1} - V^{-1}X(X^tV^{-1}X)^{-1}X^tV^{-1}\}y}\Big]
    = n + n E\Big[\frac{y^tV^{-1}X(X^tV^{-1}X + W^{-1})^{-1}W^{-1}(X^tV^{-1}X)^{-1}X^tV^{-1}y}{y^t\{V^{-1} - V^{-1}X(X^tV^{-1}X)^{-1}X^tV^{-1}\}y}\Big]
    = n + \frac{n}{\sigma^2(n-p-2)} E\big[y^tV^{-1}X(X^tV^{-1}X + W^{-1})^{-1}W^{-1}(X^tV^{-1}X)^{-1}X^tV^{-1}y\big]
    = n + \frac{n}{\sigma^2(n-p-2)}\big[\sigma^2\,\mathrm{tr}\{(X^tV^{-1}X + W^{-1})^{-1}W^{-1}\} + \beta^tX^tV^{-1}X(X^tV^{-1}X + W^{-1})^{-1}W^{-1}\beta\big],

and that

    E_\pi[\beta^tX^tV^{-1}X(X^tV^{-1}X + W^{-1})^{-1}W^{-1}\beta] = \sigma^2\,\mathrm{tr}[X^tV^{-1}X(X^tV^{-1}X + W^{-1})^{-1}].

If $n^{-1}X^tV^{-1}X$ converges to a $p \times p$ positive definite matrix as $n \to \infty$, then $\mathrm{tr}[(X^tV^{-1}X + W^{-1})^{-1}W^{-1}] \to 0$ and $\mathrm{tr}[X^tV^{-1}X(X^tV^{-1}X + W^{-1})^{-1}] \to p$. Then we obtain $E_\pi E(y^tAy/\hat{\sigma}^2 - n) \to p$, which is what we wanted to show.

A.3 Derivation of $\mathrm{IC}_r$ in (2.6)

We shall show that the bias correction $\Delta_r = I_r(\theta) - E[-2\log\{f_r(y|\tilde{\sigma}^2)\}]$ equals $2(n-p)/(n-p-2)$, where $I_r(\theta)$ is given by (2.5). Then,

    \Delta_r = E(\tilde{y}^t P \tilde{y}/\tilde{\sigma}^2) - E(y^t P y/\tilde{\sigma}^2) = E(\tilde{y}^t P \tilde{y})\, E(1/\tilde{\sigma}^2) - (n-p).

Since $E(\tilde{y}^t P \tilde{y}) = (n-p)\sigma^2$ and $E(1/\tilde{\sigma}^2) = (n-p)/\{\sigma^2(n-p-2)\}$, we get $\Delta_r = 2(n-p)/(n-p-2)$.

B Proof of Theorem 1

We only prove the consistency of $\mathrm{IC}_{\pi,1}$; the proof of the consistency of the other criteria can be done in the same manner. Because we see that

    P(\hat{j} = j) \le P\{\mathrm{IC}_{\pi,1}(j) < \mathrm{IC}_{\pi,1}(j_0)\}

for any $j \in \mathcal{J} \setminus \{j_0\}$, it suffices to show that $P\{\mathrm{IC}_{\pi,1}(j) < \mathrm{IC}_{\pi,1}(j_0)\} \to 0$, or equivalently $P\{\mathrm{IC}_{\pi,1}(j) - \mathrm{IC}_{\pi,1}(j_0) > 0\} \to 1$, as $n \to \infty$. When $V = I_n$, we obtain

    \mathrm{IC}_{\pi,1}(j) - \mathrm{IC}_{\pi,1}(j_0) = I_1 + I_2 + I_3,

where

    I_1 = n\log(\hat{\sigma}_j^2/\hat{\sigma}_0^2) + y^t A_j y/\hat{\sigma}_j^2 - y^t A_0 y/\hat{\sigma}_0^2,
    I_2 = \log|X_j^t X_j + W_j^{-1}| - \log|X_0^t X_0 + W_0^{-1}|,
    I_3 = \log\{|W_j|/|W_0|\} + \frac{2n}{n - p_j - 2} - \frac{2n}{n - p_0 - 2},

for $\hat{\sigma}_j^2 = y^t(I_n - H_j)y/n$, $\hat{\sigma}_0^2 = \hat{\sigma}_{j_0}^2$, $A_j = I_n - X_j(X_j^t X_j + W_j^{-1})^{-1}X_j^t$ and $H_0 = H_{j_0}$. We evaluate the asymptotic behaviors of $I_1$, $I_2$ and $I_3$ for $j \in \mathcal{J}_-$ and $j \in \mathcal{J}_+ \setminus \{j_0\}$ separately.

[Case of $j \in \mathcal{J}_-$]. Firstly, we evaluate $I_1$. We decompose $I_1 = I_{11} + I_{12}$, where $I_{11} = n\log(\hat{\sigma}_j^2/\hat{\sigma}_0^2)$ and $I_{12} = y^t A_j y/\hat{\sigma}_j^2 - y^t A_0 y/\hat{\sigma}_0^2$. It follows that

    \hat{\sigma}_j^2 - \hat{\sigma}_0^2 = (X_0\beta_0 + \varepsilon)^t(I_n - H_j)(X_0\beta_0 + \varepsilon)/n - \varepsilon^t(I_n - H_0)\varepsilon/n = \|X_0\beta_0 - H_j X_0\beta_0\|^2/n + o_p(1).

Then we can see that

    n^{-1} I_{11} = \log\Big(1 + \frac{\hat{\sigma}_j^2 - \hat{\sigma}_0^2}{\hat{\sigma}_0^2}\Big) = \log\Big\{1 + \frac{\|X_0\beta_0 - H_j X_0\beta_0\|^2}{n\sigma^2}\Big\} + o_p(1),   (B.1)

and it follows from assumption (A3) that

    \liminf_{n\to\infty} \log\Big\{1 + \frac{\|X_0\beta_0 - H_j X_0\beta_0\|^2}{n\sigma^2}\Big\} > 0.   (B.2)

Because $y^t A_j y/(n\hat{\sigma}_j^2) = 1 + o_p(1)$ and $y^t A_0 y/(n\hat{\sigma}_0^2) = 1 + o_p(1)$, we obtain

    n^{-1} I_{12} = o_p(1).   (B.3)

Secondly, we evaluate $I_2$. It follows that

    \log|X_j^t X_j + W_j^{-1}| = p_j \log n + \log|X_j^t X_j/n + W_j^{-1}/n| = p_j \log n + O(1).

It can also be seen that $\log|X_0^t X_0 + W_0^{-1}| = p_0 \log n + O(1)$. Then,

    n^{-1} I_2 = (p_j - p_0)\, n^{-1}\log n + o(1) = o(1).   (B.4)

Lastly, it is easy to see that

    n^{-1} I_3 = o(1).   (B.5)

From (B.1)–(B.5), it follows that

    P\{\mathrm{IC}_{\pi,1}(j) - \mathrm{IC}_{\pi,1}(j_0) > 0\} \to 1,   (B.6)

for all $j \in \mathcal{J}_-$.

[Case of $j \in \mathcal{J}_+ \setminus \{j_0\}$]. Firstly, we evaluate $I_1$. From the fact that

    \hat{\sigma}_0^2 - \hat{\sigma}_j^2 = \varepsilon^t(H_j - H_0)\varepsilon/n = O_p(n^{-1}),   (B.7)

it follows that

    (\log n)^{-1} I_{11} = (\log n)^{-1} n \log\Big\{\frac{\hat{\sigma}_0^2 - (\hat{\sigma}_0^2 - \hat{\sigma}_j^2)}{\hat{\sigma}_0^2}\Big\} = (\log n)^{-1} n \log\{1 + O_p(n^{-1})\} = o_p(1).   (B.8)

As for $I_{12}$, from (B.7) and $y^t A_j y - y^t A_0 y = O_p(1)$, we can obtain

    I_{12} = y^t A_j y/\hat{\sigma}_j^2 - y^t A_0 y/\hat{\sigma}_0^2 = (y^t A_j y - y^t A_0 y)/\hat{\sigma}_0^2 + O_p(1) = O_p(1).

Then,

    (\log n)^{-1} I_{12} = o_p(1).   (B.9)

Secondly, we evaluate $I_2$. Since $p_j > p_0$ for all $j \in \mathcal{J}_+ \setminus \{j_0\}$,

    \liminf_{n\to\infty} (\log n)^{-1} I_2 = p_j - p_0 > 0.   (B.10)

Finally, it is easy to see that

    (\log n)^{-1} I_3 = o(1).   (B.11)

From (B.8)–(B.11), it follows that

    P\{\mathrm{IC}_{\pi,1}(j) - \mathrm{IC}_{\pi,1}(j_0) > 0\} \to 1,   (B.12)

for all $j \in \mathcal{J}_+ \setminus \{j_0\}$.

Combining (B.6) and (B.12), we obtain $P\{\mathrm{IC}_{\pi,1}(j) - \mathrm{IC}_{\pi,1}(j_0) > 0\} \to 1$ for all $j \in \mathcal{J} \setminus \{j_0\}$, which shows that $\mathrm{IC}_{\pi,1}$ is consistent.

References
Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In 2nd International Symposium on Information Theory (B.N. Petrov and F. Csaki, eds.), 267–281, Akademiai Kiado, Budapest.

Akaike, H. (1974). A new look at the statistical model identification. System identification and time-series analysis. IEEE Trans. Autom. Contr., AC-19, 716–723.

Akaike, H. (1980a). On the use of predictive likelihood of a Gaussian model. Ann. Inst. Statist. Math., 32, 311–324.

Akaike, H. (1980b). Likelihood and the Bayes procedure. In Bayesian Statistics (J.M. Bernardo, M.H. DeGroot, D.V. Lindley and A.F.M. Smith, eds.), 141–166, University Press, Valencia, Spain.

Ando, T. (2007). Bayesian predictive information criterion for the evaluation of hierarchical Bayesian and empirical Bayes models. Biometrika, 94, 443–458.

Battese, G.E., Harter, R.M. and Fuller, W.A. (1988). An error-components model for prediction of county crop areas using survey and satellite data. J. Amer. Statist. Assoc., 83, 28–36.

Berger, J.O. and Pericchi, L.R. (1996). The intrinsic Bayes factor for model selection and prediction. J. Amer. Statist. Assoc., 91, 109–122.

Bozdogan, H. (1987). Model selection and Akaike's information criterion (AIC): The general theory and its analytical extensions. Psychometrika, 52, 345–370.

Casella, G. and Moreno, E. (2006). Objective Bayesian variable selection. J. Amer. Statist. Assoc., 101, 157–167.

Henderson, C.R. (1950). Estimation of genetic parameters. Ann. Math. Statist., 21, 309–310.

Hurvich, C.M. and Tsai, C.-L. (1989). Regression and time series model selection in small samples. Biometrika, 76, 297–307.

Kitagawa, G. (1997). Information criteria for the predictive evaluation of Bayesian models. Commun. Statist. Theory Meth., 26, 2223–2246.

Mallows, C.L. (1973). Some comments on Cp. Technometrics, 15, 661–675.

Nishii, R. (1984). Asymptotic properties of criteria for selection of variables in multiple regression. Ann. Statist., 12, 758–765.

Patterson, H.D. and Thompson, R. (1971). Recovery of inter-block information when block sizes are unequal. Biometrika, 58, 545–554.

Prasad, N.G.N. and Rao, J.N.K. (1990). The estimation of the mean squared error of small-area estimators. J. Amer. Statist. Assoc., 85, 163–171.

Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist., 6, 461–464.

Shao, J. (1997). An asymptotic theory for linear model selection. Statist. Sin., 7, 221–264.

Shi, P. and Tsai, C.-L. (2002). Regression model selection: a residual likelihood approach. J. Royal Statist. Soc. B, 64, 237–252.

Shibata, R. (1981). An optimal selection of regression variables. Biometrika, 68, 45–54.

Spiegelhalter, D.J., Best, N.G., Carlin, B.P. and van der Linde, A. (2002). Bayesian measures of model complexity and fit. J. Royal Statist. Soc. B, 64, 583–639.

Srivastava, M.S. and Kubokawa, T. (2010). Conditional information criteria for selecting variables in linear mixed models. J. Multivariate Anal., 101, 1970–1980.

Sugiura, N. (1978). Further analysis of the data by Akaike's information criterion and the finite corrections. Commun. Statist. Theory Meth., 7, 13–26.

Vaida, F. and Blanchard, S. (2005). Conditional Akaike information for mixed-effects models. Biometrika, 92, 351–370.

Zellner, A. (1986). On assessing prior distributions and Bayesian regression analysis with g-prior distributions. In Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti (P.K. Goel and A. Zellner, eds.), 233–243, North-Holland/Elsevier, Amsterdam.
