doi: 10.1093/biomet/asq081
GÖRAN KAUERMANN AND JEAN D. OPSOMER

Department of Statistics, Colorado State University, Fort Collins, Colorado 80523, U.S.A.
jopsomer@stat.colostate.edu

SUMMARY

A number of criteria exist to select the penalty in penalized spline regression, but the selection of the number of spline basis functions has received much less attention in the literature. We propose a likelihood-based criterion to select the number of basis functions in penalized spline regression. The criterion is easy to apply, and we describe its theoretical and practical properties.
Some key words: Knot selection; Maximum likelihood; Mixed model; Nonparametric regression.
1. INTRODUCTION

In penalized spline regression (Ruppert et al., 2003), an unknown smooth function is estimated by least squares using a high-dimensional spline basis, and a penalty is imposed on the spline coefficients to achieve a smooth fit. The penalty can either be treated as a fixed smoothing parameter that balances the bias and variance properties of the estimator, or it can be obtained more naturally by specifying a prior distribution on the spline coefficients. An advantage of the latter approach is that the penalty parameter is the ratio of the residual variance and the a priori variance of the spline coefficients, so that the penalty parameter is amenable to maximum likelihood estimation. A comparison of maximum likelihood and cross-validated penalty parameters is found in Kauermann (2005). In this note, we show that the dimension of the spline basis can also be treated as a parameter in a likelihood, which can then be optimized with respect to this parameter.

The usual approach to determining the dimension of the spline basis in penalized spline regression is to select a sufficiently large basis and let the penalty adjust for the amount of smoothness of the curve. The spline basis is determined by a set of knots, and the number of knots K is often used as a convenient way to denote the dimension of the spline basis. There has been little theoretical investigation of the effect of K on the properties of penalized spline regression. Li & Ruppert (2008) set a lower bound on the asymptotic order at which K should increase as the sample size n increases. Claeskens et al. (2009) showed that there are in fact two asymptotic scenarios for K, leading to either classical smoothing spline asymptotics or regression spline asymptotics, depending on how fast K increases with n. In Kauermann et al. (2009), the asymptotic behaviour of K and n was tackled by making use of the link between penalized spline smoothing and mixed models.
While these articles give asymptotic rates for K based on n , the results do not address the issue of how to select K for a given sample size n .
In practical terms, the choice of the dimension K has an effect on the properties of the nonparametric function estimator. Too few knots can result in biased estimation, because features of the mean function are missed when the spline basis is not sufficiently large. Conversely, using too many knots increases the risk of overfitting. This is partly offset by the selection of an appropriate penalty, but a more parsimonious spline basis will generally result in a more stable fit, even after allowing for the effect of the penalization. In an extensive simulation study, Ruppert (2002) demonstrated the severe bias caused by values of K that are too small, and showed that once K is sufficiently large, the effect of further increasing K is modest. Specifically, he noted that increasing K beyond what is required to capture the features of the underlying mean function causes only a modest increase in mean squared error. Ruppert (2002) investigated generalized cross-validation to select K, and also suggested the heuristic rule of thumb K = min(40, n/4), which works well in practice for many datasets. The generalized cross-validation criterion only applies under the framework of treating the penalty as a tuning constant. In the current note, we consider the mixed-model specification of penalized spline regression and treat K as a parameter in the model likelihood. This provides a natural way to integrate the selection of K within the mixed-model framework, for which no such method is currently available.
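Ruppert's rule of thumb is directly computable; a one-line sketch (the function name is our own):

```python
def ruppert_rule_of_thumb(n):
    """Heuristic default number of knots from Ruppert (2002): K = min(40, n/4)."""
    return min(40, n // 4)
```

For example, n = 300 gives K = 40, while n = 100 gives K = 25.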
2. MIXED MODELS

We consider the nonparametric regression model
\[
Y_i = m(x_i) + \varepsilon_i \quad (i = 1, \ldots, n), \tag{1}
\]
where m(\cdot) is an unknown smooth function, the \varepsilon_i are independent and identically distributed errors with \varepsilon_i \sim (0, \sigma^2), and the x_i take values in [0, 1], without loss of generality. We estimate m(\cdot) using penalized splines. To do so, we replace m(x) in (1) by a high-dimensional spline of the form m(x) = X(x)\beta + Z_K(x)u_K. Here, X(\cdot) is a low-dimensional basis while Z_K(\cdot) is high-dimensional. Using truncated polynomials, we set X(x) = (1, x, x^2/2!, \ldots, x^q/q!) and Z_K(x) = \{(x - \kappa_1)_+^q/q!, \ldots, (x - \kappa_K)_+^q/q!\}, where (t)_+^q = t^q for t > 0 and 0 otherwise, and the knots \kappa_1 < \cdots < \kappa_K cover the range of x. Alternatively, one can transform the truncated polynomials to B-splines (de Boor, 1978), which exhibit numerically more stable behaviour. Under the mixed-model framework for fitting penalized splines, model (1) is replaced by
\[
Y \mid u_K \sim N(X\beta + Z_K u_K, \sigma^2 I_n), \qquad u_K \sim N(0, \sigma^2_{u_K} I_K), \tag{2}
\]
where Y = (Y_1, \ldots, Y_n)^T, X and Z_K are matrices with ith row X(x_i) and Z_K(x_i), respectively, and I_n and I_K are the n- and K-dimensional identity matrices. The loglikelihood function for model (2) can be written as
\[
l_K(\beta, \sigma^2, \lambda) = -\frac{n}{2}\log(\sigma^2) - \frac{1}{2}\log |V_{K,\lambda}| - \frac{1}{2\sigma^2}(Y - X\beta)^T V_{K,\lambda}^{-1}(Y - X\beta),
\]
with V_{K,\lambda} = I_n + Z_K Z_K^T/\lambda and \lambda = \sigma^2/\sigma^2_{u_K}. The likelihood depends on the parameters \beta, \sigma^2 and \lambda, as well as on the spline dimension K, written as a subscript. The loglikelihood is usually maximized with respect to the parameters, but K is kept fixed. Our proposal in this paper is to maximize l_K(\beta, \sigma^2, \lambda) with respect to both the parameters and K. While in principle it might be possible to maximize l_K directly and simultaneously with respect to all the parameters and K, the discrete nature of K and the fact that the structure of V_{K,\lambda} changes with K make this impracticable. Instead, we therefore consider the maximized loglikelihood
\[
l_K(\hat\beta, \hat\sigma^2, \hat\lambda) = -\frac{n}{2}\log(\hat\sigma^2) - \frac{1}{2}\log |V_{K,\hat\lambda}|, \tag{3}
\]
where we have inserted the maximum likelihood estimators for \beta, \sigma^2 and \lambda and removed terms not depending on K; see Schall (1991) or Searle et al. (1992). The resulting function (3) depends only on K, and it can
Miscellanea
Fig. 1. (a) Simulated data points (n = 300) drawn from the true mean function (bold), which is estimated using truncated splines (dashed, visually indistinguishable from the solid line). (b) Value of the log profile likelihood as a function of K. Maximum likelihood estimates of (c) 1/\hat\sigma^2_{u_K} and (d) \hat\sigma^2 for different values of K. The vertical line in (b), (c) and (d) marks the optimal value of K.
now be maximized with respect to K. For each evaluation at a value of K, the quantities \hat\beta, \hat\sigma^2 and \hat\lambda need to be recomputed. Figure 1 shows an example. Panel (a) gives n = 300 data points drawn from a smooth function, shown as the solid line. The dashed line, which is visually indistinguishable from the solid line, is the best fit using truncated linear polynomials when K is chosen to maximize the likelihood (3). The latter is plotted against K in panel (b). We see that the likelihood increases for small K but, once it reaches its weak maximum at K_0, it flattens out. Using the maximum likelihood principle, we therefore propose to select the number of knots as K_0, the maximizer of the curve l_K(\hat\beta, \hat\sigma^2, \hat\lambda) against K.

We do not pursue the full theoretical development for our approach in this note but refer to the Supplementary Material for a technical derivation. Instead, we here motivate more heuristically a number of key features of the approach, which determine the behaviour of criterion (3) as a function of K and of its maximizer K_0. First, the spline dimension K corresponds to knots \kappa_1, \ldots, \kappa_K and we assume that the distance between the knots decreases as K increases, i.e., \kappa_k - \kappa_{k-1} = O(K^{-1}) for 2 \le k \le K. Secondly, when we vary K for a fixed function m(\cdot), we require that \sigma^2_{u_K} = O(K^{-1}). This is reasonable in order to maintain the same amount of total variability in the function m(\cdot) and its estimator, and clearly corresponds to the behaviour exhibited by the maximum likelihood estimator \hat\sigma^2_{u_K} in Fig. 1(c). In fact, the condition on \sigma^2_{u_K} guarantees that the random function X\beta + Z_K u_K with prior in (2) is differentiable in the limit as K increases. In contrast, \sigma^2 is a fixed constant and, for sufficiently large K, its maximum likelihood estimator \hat\sigma^2 is consistent. If K is too small, the estimator of the mean function is biased and the resulting estimate \hat\sigma^2 overestimates the true variance. Once K is large enough, the bias vanishes and the consistency of the estimator ensures that \hat\sigma^2 remains stable for a wide range of values of K. This is illustrated in Fig. 1(d). The combination of these two behaviours explains the pattern exhibited by l_K(\hat\beta, \hat\sigma^2, \hat\lambda) in Fig. 1(b) and provides some justification for using K_0 as a data-driven choice for a suitable number of knots in the mixed-model penalized spline regression context. In the Supplementary Material we prove that under particular assumptions we may expand l_K(\hat\beta, \hat\sigma^2, \hat\lambda) to obtain
\[
l_K(\hat\beta, \hat\sigma^2, \hat\lambda) = O_p(n) + O_p\{\log(n)\}\, O(1 + K^{-(2q+2)}),
\]
with q the polynomial degree of the basis. Bearing the behaviour seen in Fig. 1 in mind and/or following the theoretical results, we suggest the following simple strategy for selecting K. Maximizing l_K(\hat\beta, \hat\sigma^2, \hat\lambda) with respect to K ensures that we avoid too low values of K, for which the spline basis is too inflexible to
Fig. 2. Simulation results for the logit and bump functions. Logit function: (a) simulated data, a typical realization with mean function (solid line) and spline fit (dashed line); (b) profile likelihood, boxplots of the value of the maximized loglikelihood for different values of K; (c) histogram of K_0, the values of K maximizing the sample likelihoods. Corresponding plots for the bump function are in (d)–(f).
fit the data well. Once K exceeds K 0 , further increases in K do not increase the likelihood and do not help in achieving a better fit. This automatic balancing of the spline complexity and amount of penalization is a major advantage of the mixed-model approach to spline fitting.
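The selection procedure can be sketched numerically. The following is a minimal illustration under simplifications of our own: a truncated-linear basis with equidistant knots, the parameters \beta and \sigma^2 profiled out analytically, and a crude grid search over \lambda in place of a proper maximum likelihood optimizer; all function names are ours.

```python
import numpy as np
from math import factorial


def design(x, K, q=1):
    """X(x) = (1, x, ..., x^q/q!) and truncated-polynomial basis
    Z_K(x) = {(x - kappa_k)_+^q / q!} with K equidistant interior knots on [0, 1]."""
    x = np.asarray(x, float)
    X = np.column_stack([x**j / factorial(j) for j in range(q + 1)])
    kappa = np.linspace(0.0, 1.0, K + 2)[1:-1]          # K interior knots
    Z = np.maximum(x[:, None] - kappa[None, :], 0.0)**q / factorial(q)
    return X, Z


def profile_loglik(y, X, Z, lam):
    """Profile loglikelihood for fixed lambda: with V = I_n + Z Z'/lambda,
    beta_hat is the generalized least squares estimate and
    sigma2_hat = r' V^{-1} r / n, giving (up to constants)
    l = -(n/2) log(sigma2_hat) - (1/2) log|V|, as in criterion (3)."""
    n = len(y)
    V = np.eye(n) + Z @ Z.T / lam
    Vinv = np.linalg.inv(V)                             # O(n^3) sketch only
    beta = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)
    r = y - X @ beta
    sigma2 = r @ Vinv @ r / n
    _, logdet = np.linalg.slogdet(V)
    return -0.5 * n * np.log(sigma2) - 0.5 * logdet


def select_K(y, x, K_grid, lam_grid=np.logspace(-4, 4, 33)):
    """K_0 = argmax over K of (max over lambda of) the profile loglikelihood."""
    scores = {K: max(profile_loglik(y, *design(x, K), lam) for lam in lam_grid)
              for K in K_grid}
    return max(scores, key=scores.get)
```

Since (3) flattens out beyond K_0 rather than peaking sharply, in practice one may prefer the smallest K whose profile likelihood lies within a small tolerance of the maximum.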
3. SIMULATIONS

We conduct a limited simulation study to illustrate the practical performance of the selection criterion, following the settings given in Ruppert (2002). Let the covariates x be equidistant in [0, 1]. As spline basis we make use of B-splines, which are easily implemented as mixed models as shown in Wand & Ormerod (2008); that is, the spline can be written as X\beta + Z_K u_K with u_K \sim N(0, \sigma^2_{u_K} I_K). We make use of four mean functions in what follows. First, we simulate from the logit function m(x) = [1 + \exp\{-20(x - 0.5)\}]^{-1}, where we set the sample size to n = 100 and simulate with error variance \sigma^2 = 0.2^2. The results are shown in Fig. 2. Panel (a) displays a typical realization, and panel (b) shows boxplots of values of the maximized likelihood (3) for different values of K, based on 100 replicates of the simulation. The vertical line is the median of the selected K_0. In panel (c) we give a histogram of the optimally chosen values of K_0. Apparently, the selection is quite stable, preferring small values of K. As a second simulation, we employ the bump function m(x) = x + 2\exp\{-16(x - 0.5)^2\}. The results for \sigma^2 = 0.3^2 are shown in Fig. 2, panels (d) to (f), with the same organization as in panels (a) to (c). As before, the value of the likelihood stabilizes rapidly as K increases and a relatively small value of K is preferred. The last two functions we simulate from are sine curves, m(x) = \sin(2\nu\pi x). We set \sigma^2 = 1 and n = 200, and provide the results in Fig. 3, panels (a) to (c) for \nu = 3 and (d) to (f) for \nu = 6. The interpretation is the same, and for \nu = 3 it appears that splines with a small number of knots perform best. For \nu = 6, one needs a larger number of knots, as the proposed selection criterion shows. The selected number of knots in the latter case is clearly larger than that obtained by using the Ruppert rule of thumb mentioned in § 1.
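The four simulation settings are easy to reproduce; a sketch in which the helper names are ours, and the frequency parametrization m(x) = sin(2νπx) for the sine curves is our assumption:

```python
import numpy as np


def logit_fn(x):
    """Logit mean function: m(x) = [1 + exp{-20(x - 0.5)}]^{-1} (n = 100, sigma = 0.2)."""
    return 1.0 / (1.0 + np.exp(-20.0 * (x - 0.5)))


def bump_fn(x):
    """Bump mean function: m(x) = x + 2 exp{-16(x - 0.5)^2} (sigma = 0.3)."""
    return x + 2.0 * np.exp(-16.0 * (x - 0.5)**2)


def sine_fn(x, nu):
    """Sine mean function, assumed m(x) = sin(2 nu pi x) (nu = 3 or 6; sigma = 1, n = 200)."""
    return np.sin(2.0 * nu * np.pi * x)


def simulate(mean_fn, n, sigma, rng):
    """Equidistant covariates on [0, 1] with i.i.d. Gaussian noise."""
    x = np.linspace(0.0, 1.0, n)
    return x, mean_fn(x) + sigma * rng.normal(size=n)
```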
Fig. 3. Simulation results for two sine functions. Plots (a)–(c) show results for the sine function with \nu = 3 and plots (d)–(f) show results for \nu = 6. See Fig. 2 for a description of the individual plots.
Fig. 4. Mean squared error (top row) and selected number of knots (bottom row), based on minimizing the generalized cross-validation criterion, GCV, and on maximizing the likelihood in the mixed model, for the simulations: logit (a) and (e), bump (b) and (f), sine with \nu = 3 (c) and (g), and sine with \nu = 6 (d) and (h).
Finally, we compare our criterion with the procedure based on the generalized cross-validation criterion in Ruppert (2002), which we minimize with respect to both the smoothing parameter and the spline dimension. Figure 4 compares the mean squared error and the number of knots obtained by both methods over 100 realizations for the four mean functions above. Overall, both methods appear to behave similarly with respect to their mean squared error, with the exception of occasional substantially larger values for generalized cross-validation. The likelihood approach appears to select more knots, except for the logit function.

4. CONCLUSION

We have proposed a likelihood-based method for selecting the number of knots in penalized spline regression. While such a method already existed under the smoothing framework (Ruppert, 2002), our approach is the first that can do so under the mixed-model framework. An important advantage of the latter framework is that it makes it possible to combine model fitting and the selection of the spline basis dimension and penalty settings using a unified objective function. Not only does this simplify the computational aspects, it also removes some of the subjective decisions often associated with applying nonparametric methods. A possible extension of this work is to evaluate the likelihood-based criterion for knot placement and for determining the degree of the spline. While this could certainly be considered, both factors tend to have only a minor impact on the spline fits once the penalty and the number of knots are allowed to vary. We do not pursue this further here.

ACKNOWLEDGEMENT

The first author acknowledges financial support provided by the Deutsche Forschungsgemeinschaft.

SUPPLEMENTARY MATERIAL

Supplementary material available at Biometrika online includes details of the technical derivations.

REFERENCES
CLAESKENS, G., KRIVOBOKOVA, T. & OPSOMER, J. (2009). Asymptotic properties of penalized spline estimators. Biometrika 96, 529–44.
DE BOOR, C. (1978). A Practical Guide to Splines. Berlin: Springer.
KAUERMANN, G. (2005). A note on smoothing parameter selection for penalised spline smoothing. J. Statist. Plan. Infer. 127, 53–69.
KAUERMANN, G., KRIVOBOKOVA, T. & FAHRMEIR, L. (2009). Some asymptotic results on generalized penalized spline smoothing. J. R. Statist. Soc. B 71, 487–503.
LI, Y. & RUPPERT, D. (2008). On the asymptotics of penalized splines. Biometrika 95, 415–36.
RUPPERT, D. (2002). Selecting the number of knots for penalized splines. J. Comp. Graph. Statist. 11, 735–57.
RUPPERT, D., WAND, M. & CARROLL, R. (2003). Semiparametric Regression. Cambridge: Cambridge University Press.
SCHALL, R. (1991). Estimation in generalized linear models with random effects. Biometrika 78, 719–27.
SEARLE, S., CASELLA, G. & MCCULLOCH, C. (1992). Variance Components. New York: Wiley.
WAND, M. & ORMEROD, J. (2008). On semiparametric regression with O'Sullivan penalized splines. Aust. N.Z. J. Statist. 50, 179–98.