www.elsevier.com/locate/csda
Czech Republic
b Department of Biostatistics and Computational Biology, University of Rochester, 601 Elmwood Avenue,
Abstract
We introduce a nonparametric test intended for large-scale simultaneous inference in situations
where the utility of distribution-free tests is limited because of their discrete nature. Such situations are
frequently dealt with in microarray analysis where the number of tests is much larger than the sample
size. The proposed test statistic is based on a certain distance between the distributions from which
the samples under study are drawn. In a simulation study, the proposed permutation test is compared
with permutation counterparts of the t-test and the Kolmogorov–Smirnov test. The usefulness of the
proposed test is discussed in the context of microarray gene expression data and illustrated with an
application to real datasets.
© 2005 Elsevier B.V. All rights reserved.
Keywords: Two-sample statistic; Permutation tests; Nonparametric inference; Microarray analysis
1. Introduction
This work is motivated by statistical challenges of microarray gene expression data analysis. In this rapidly evolving field of computational biology, little focus has been on the choice
Corresponding author. Tel.: +1 585 275 6688; fax: +1 585 273 1031.
L. Klebanov et al. / Computational Statistics & Data Analysis 50 (2006) 3619–3628
To overcome the above-described obstacle, recourse to a continuous test statistic is necessary when testing the hypothesis $H_0$ for each gene. The traditional t-statistic is the most common choice in microarray analysis. To avoid an overly strong assumption of normality of gene expression levels, permutation techniques are used to mimic the sampling distribution of the t-statistic under the null hypothesis (Dudoit et al., 2002, 2003). This makes the corresponding statistical test essentially nonparametric. Also, the (unpooled) t-statistic is asymptotically (as the sample size tends to infinity) distribution-free. However, the t-statistic is designed to test the hypothesis $H_0' : \mu_F = \mu_G$, where $\mu_F$ and $\mu_G$ are the mean values of the distributions F and G, respectively, while we are looking for an alternative method for testing the stronger hypothesis $H_0 : F(x) = G(x)$.
In this paper, we propose a test statistic designed for testing the equality of distributions ($H_0$) rather than only of the mean values ($H_0'$). To claim the usefulness of a test, one has to assess its power in situations where other statistical tests are known to be optimal or nearly so. The proposed permutation test proves to be almost as powerful as the permutation t-test in the case of normally distributed data and location alternatives. We also discuss some computational and practical issues related to applications of the proposed resampling test in the analysis of microarray data.
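The permutation version of the unpooled t-test mentioned above can be sketched as follows. This is a minimal illustration under our own assumptions, not the implementation used in the paper: the function names `welch_t` and `permutation_pvalue`, the two-sided form of the test, the number of permutations `m`, and the use of random (rather than exhaustive) permutations are all illustrative choices.

```python
import numpy as np


def welch_t(x, y):
    """Unpooled two-sample t-statistic (separate variance estimates)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    se = np.sqrt(x.var(ddof=1) / x.size + y.var(ddof=1) / y.size)
    return (x.mean() - y.mean()) / se


def permutation_pvalue(x, y, stat, m=2000, seed=0):
    """Two-sided permutation p-value for stat(x, y).

    Assumption: m random relabelings of the pooled sample are used to
    approximate the null distribution of the statistic.
    """
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([np.asarray(x, float), np.asarray(y, float)])
    n1 = len(x)
    observed = abs(stat(x, y))
    hits = 0
    for _ in range(m):
        perm = rng.permutation(pooled)
        if abs(stat(perm[:n1], perm[n1:])) >= observed:
            hits += 1
    # The observed labeling is counted as one of the permutations.
    return (hits + 1) / (m + 1)
```

Because the reference distribution is generated by relabeling, no normality assumption enters the p-value; the statistic itself, however, still responds primarily to a difference in means.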
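$$N(\mu, \nu) = \left[ 2 \int_{\mathbb{R}^1} \int_{\mathbb{R}^1} K(x, y)\, d\mu(x)\, d\nu(y) - \int_{\mathbb{R}^1} \int_{\mathbb{R}^1} K(x, y)\, d\mu(x)\, d\mu(y) - \int_{\mathbb{R}^1} \int_{\mathbb{R}^1} K(x, y)\, d\nu(x)\, d\nu(y) \right]^{1/2} \tag{1}$$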
The distance $N = N(\mu, \nu)$ is well-defined and can be shown to be a metric in the space of all probability measures on $\mathbb{R}^1$ (Zinger et al., 1989), so that the null hypothesis in two-sample comparisons can be formulated as $H_0 : N(\mu, \nu) = 0$. A multivariate extension of the metric N was considered in Szabo et al. (2002, 2003) and Xiao et al. (2004), where it was used to search for differentially expressed gene combinations. A normalized version of N can be defined as $N_{\mathrm{norm}} = N/A$, where
A = [expression involving integrals over $\mathbb{R}^1$; integrand not recoverable from the source]   (2)
If $K(x, y) = \varphi(x - y)$ and $\varphi(\cdot)$ is homogeneous of any order, then $N_{\mathrm{norm}}$ is both location and scale invariant.
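Consider two independent random samples $x_1, \ldots, x_{n_1}$ and $y_1, \ldots, y_{n_2}$ of sizes $n_1$ and $n_2$, respectively, and introduce the empirical counterpart of $N(\mu, \nu)$ as follows:
$$N = \left[ \frac{1}{n_1 n_2} \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} 2K(x_i, y_j) - \frac{1}{n_1^2} \sum_{i=1}^{n_1} \sum_{j=1}^{n_1} K(x_i, x_j) - \frac{1}{n_2^2} \sum_{i=1}^{n_2} \sum_{j=1}^{n_2} K(y_i, y_j) \right]^{1/2}. \tag{3}$$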
A distinct advantage of the approach based on N is a wide selection of negative definite kernels that are sensitive to various departures from the hypothesis $\mu = \nu$ (Szabo et al., 2002; Xiao et al., 2004). In this paper, we use the Euclidean distance between points representing experimental measurements: $K(x, y) = |x - y|$. Here x and y denote observations in two samples on a particular variable.
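The empirical statistic (3) with the kernel $K(x, y) = |x - y|$ can be computed directly from pairwise absolute differences. The sketch below is illustrative only: the names `n_statistic` and `n_test_pvalue`, the number of random permutations `m`, and the one-sided rejection rule (large values of N count against the null hypothesis) are our assumptions rather than details taken from the paper.

```python
import numpy as np


def n_statistic(x, y):
    """Empirical N-distance of formula (3) with kernel K(x, y) = |x - y|."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Each .mean() averages over all pairs, i.e. a double sum divided by
    # n1*n2, n1**2 or n2**2 (diagonal terms are zero for this kernel).
    kxy = np.abs(x[:, None] - y[None, :]).mean()
    kxx = np.abs(x[:, None] - x[None, :]).mean()
    kyy = np.abs(y[:, None] - y[None, :]).mean()
    # Guard against a tiny negative value caused by rounding.
    return np.sqrt(max(2.0 * kxy - kxx - kyy, 0.0))


def n_test_pvalue(x, y, m=999, seed=0):
    """Permutation p-value for the N-statistic (assumed one-sided)."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([np.asarray(x, float), np.asarray(y, float)])
    n1 = len(x)
    observed = n_statistic(x, y)
    hits = 0
    for _ in range(m):
        perm = rng.permutation(pooled)
        if n_statistic(perm[:n1], perm[n1:]) >= observed:
            hits += 1
    return (hits + 1) / (m + 1)
```

For two single observations 0 and 3 the cross term equals 3 and both within-sample terms vanish, so the statistic is $\sqrt{6}$; for two identical samples it is zero.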
(m)
Pr Yp 0.95 0.005 0.95. In other words, the choice of depends on the value of
p used to determine the corresponding significance level. Using formula (4) and solving the
resultant equation numerically, we obtain m ≈ 16,688 in this case. For = 0.0075, however,
the required number of permutations is much smaller, namely, m ≈ 4,349. It is worth noting
that the estimate of m based on (4) is distribution-free and its derivation does not rely on
any asymptotic argument.
Fig. 1. Normal case: power functions for the N-test, t-test, and the Kolmogorov–Smirnov test. x-axis: values of the shift parameter, y-axis: estimated power. The t-test slightly outperforms the N-test, both being superior to the Kolmogorov–Smirnov test.
Fig. 2. Normal mixture given by F0: power functions for the three tests. Departures from the null hypothesis are modeled by the parametric family F1 (see Section 4 for explanations). The same notation as in Fig. 1. The t-test slightly outperforms the N-test, both being superior to the Kolmogorov–Smirnov test.
Fig. 3. Normal mixture given by F0: power functions for the three tests. Departures from the null hypothesis are modeled by the parametric family F2 (see Section 4 for explanations). The same notation as in Fig. 2. The N-test clearly outperforms the t-test.
(YAMC) cells (D'Abaco et al., 1996; Whitehead et al., 1993) and their derivatives transformed by activated Ras and a dominant-negative mutant p53 following retroviral gene transfer were chosen as an experimental model. RNA was isolated from YAMC cells (control #1), cells infected with control retroviruses (control #2), and cells expressing activated Ras and mutant p53 genes. Labeled cRNA was used to probe high-density Affymetrix Mouse Genome 430 2.0 arrays. Ten biological replicates were prepared for each of the three experimental conditions. In this application, the total number of probe sets (genes) was about 45,000. The data were normalized using the quantile normalization procedure (Bolstad et al., 2003; Irizarry et al., 2003) at the probe feature level. The raw (not normalized) but background-corrected expression data were obtained from the output of the Bioconductor RMA (Robust Multi-Array Average) procedure with the option normalization = false.
We used the step-down maximum-test-statistic multiple testing procedure by Westfall and Young (Westfall and Young, 1993, pp. 116–117), which controls the FWER at a given level. The N-statistic was computed from log-expressions. The Westfall and Young procedure was also run with the t-statistic. Given that the sample size is relatively small and the number of genes is very large in this setting, we were unable to use the Kolmogorov–Smirnov test for the reasons discussed in the Introduction.
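A step-down maximum-statistic procedure of this kind can be sketched as follows. This is a simplified illustration in the spirit of the Westfall and Young algorithm, not a reproduction of the analysis reported here: the function names, the data layout (genes in rows, arrays in columns), the number of random label permutations `n_perm`, and the plain proportion used for the adjusted p-values are all assumptions made for the example.

```python
import numpy as np


def n_statistic(x, y):
    """Empirical N-distance with kernel K(x, y) = |x - y|."""
    kxy = np.abs(x[:, None] - y[None, :]).mean()
    kxx = np.abs(x[:, None] - x[None, :]).mean()
    kyy = np.abs(y[:, None] - y[None, :]).mean()
    return np.sqrt(max(2.0 * kxy - kxx - kyy, 0.0))


def maxt_stepdown(data, labels, stat, n_perm=200, seed=0):
    """Step-down max-statistic adjusted p-values (illustrative sketch).

    data   : array of shape (genes, arrays)
    labels : boolean array, True for arrays in the first group
    stat   : two-sample statistic, larger absolute value = more evidence
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    labels = np.asarray(labels, dtype=bool)

    def all_stats(lab):
        return np.abs(np.array([stat(row[lab], row[~lab]) for row in data]))

    observed = all_stats(labels)
    order = np.argsort(-observed)          # most extreme gene first
    counts = np.zeros(observed.size)
    for _ in range(n_perm):
        perm = all_stats(rng.permutation(labels))[order]
        # Successive maxima taken from the least extreme gene upward.
        succ_max = np.maximum.accumulate(perm[::-1])[::-1]
        counts += succ_max >= observed[order]
    # Enforce monotonicity of the adjusted p-values along the ordering.
    adj_sorted = np.maximum.accumulate(counts / n_perm)
    adjusted = np.empty_like(adj_sorted)
    adjusted[order] = adj_sorted
    return adjusted
```

Genes whose adjusted p-value does not exceed the chosen FWER level are declared differentially expressed. Permuting the array labels once for all genes, rather than gene by gene, preserves the dependence between genes within each array.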
The testing procedure resulted in a larger set of differentially expressed genes when
applied to the normalized data as compared to the same analysis of the raw (background
corrected but not normalized) expression data. The observed discrepancy was even more
pronounced when the t-test was used. The same effect was obvious when the rank adjustment
(Szabo et al., 2002) was used for data normalization. Our analysis based on the N-test has
resulted in two sets of genes declared differentially expressed between the treatment group
(Condition RP) and each of the control groups. Their intersection includes about 300 genes
when the FWER is controlled at the level of 0.001. Among the known genes, we found ligands, receptors, intracellular signal transducers and transcription factors reported to be involved in controlling cell functions such as survival, motility, invasiveness and angiogenesis. These biological findings will be described at length in another paper. The t-test selected only 197 genes at the same FWER level. This can be attributed in part to the fact that the N-test is sensitive to more complex dissimilarities between the distributions under comparison than those expected under location alternatives. This is not a universal tendency, as evidenced by another application given below. The rejection rate generally depends on the distribution of data under the unknown alternative hypothesis. The results for the two tests become closer for higher levels of the FWER. The numbers of genes selected by the N-test and the t-test were equal to 792 and 775, respectively, when both tests were applied at the 0.05 level.

                 Normalized    Non-normalized
KS               794           623
N                536           528
t                1058          790
Intersection     486           496

Fig. 4. The numbers of genes selected by different methods: normalized versus nonnormalized data on childhood leukemia.
In the second application, use was made of the St. Jude Children's Research Hospital (SJCRH) Database on childhood leukemia. There are 335 arrays (Affymetrix, Santa Clara, CA) in the SJCRH dataset, each array representing d = 12,558 genes. The data are publicly available from the following website: http://www.stjuderesearch.org/data/ALL1/. The SJCRH data include the information on gene expression in normal blood and various types of childhood leukemia. We selected two different types of childhood leukemia denoted by TALL and Heprdip, respectively. Each group of patients was represented by 43 arrays. The data were normalized using the same RMA procedure. We used the step-down algorithm by Westfall and Young to control the FWER at the level of 0.05. Since the sample size was sufficiently large in this analysis, we could perform all three statistical tests studied in Section 4. The N-test selects 536 genes while the Kolmogorov–Smirnov test produces a list of 794 genes. There are 486 genes in the intersection of the two lists. The t-test selects 1,058 genes. These results are summarized in Fig. 4. Clearly, the N-test appears to be the most conservative one in this particular example, as opposed to what we have seen in the other application. It is worth noting that the observed rejection rate is not equivalent to the power of a test. The results for nonnormalized data are largely similar (Fig. 4), with all the tests selecting fewer genes than in the case of normalized data.
Szabo, A., Boucher, K., Jones, D., Klebanov, L., Tsodikov, A., Yakovlev, A., 2003. Multivariate exploratory tools for microarray data analysis. Biostatistics 4, 555–567.
Tusher, V.G., Tibshirani, R., Chu, G., 2001. Significance analysis of microarrays applied to the ionizing radiation response. Proc. Nat. Acad. Sci. 98, 5116–5121.
Vakhaniya, N.N., Tarieladze, V.I., Chobanyan, S.A., 1987. Probability Distributions on Banach Spaces. Reidel, Dordrecht, Holland.
Westfall, P.H., Young, S., 1993. Resampling-Based Multiple Testing. Wiley, New York.
Whitehead, R.H., Van Eeden, P.E., Noble, M.D., Ataliotis, P., Jat, P.S., 1993. Establishment of conditionally immortalized epithelial cell lines from both colon and small intestine of adult H-2Kb-tsA58 transgenic mice. Proc. Nat. Acad. Sci. 90, 587–591.
Xiao, Y., Frisina, R., Gordon, A., Klebanov, L., Yakovlev, A., 2004. Multivariate search for differentially expressed gene combinations. BMC Bioinformatics 5 (Article 164).
Zinger, A.A., Klebanov, L.B., Kakosyan, A.V., 1989. Characterization of distributions by mean values of statistics in connection with some probability metrics. In: Stability Problems for Stochastic Models. VNIISI, Moscow, pp. 47–55.