
Computational Statistics & Data Analysis 50 (2006) 3619-3628

www.elsevier.com/locate/csda

A permutation test motivated by microarray data analysis
L. Klebanov^a, A. Gordon^b, Y. Xiao^b, H. Land^c, A. Yakovlev^b,*
^a Department of Probability and Statistics, Charles University, Sokolovska 83, Praha-8, CZ-18675, Czech Republic
^b Department of Biostatistics and Computational Biology, University of Rochester, 601 Elmwood Avenue, Box 630, Rochester, New York 14642, USA
^c Department of Biomedical Genetics and James P. Wilmot Cancer Center, University of Rochester, 601 Elmwood Avenue, Box 630, Rochester, New York 14642, USA


Received 12 April 2005; received in revised form 15 August 2005; accepted 16 August 2005
Available online 7 September 2005

Abstract
We introduce a nonparametric test intended for large-scale simultaneous inference in situations
where the utility of distribution-free tests is limited because of their discrete nature. Such situations are
frequently dealt with in microarray analysis where the number of tests is much larger than the sample
size. The proposed test statistic is based on a certain distance between the distributions from which
the samples under study are drawn. In a simulation study, the proposed permutation test is compared
with permutation counterparts of the t-test and the Kolmogorov-Smirnov test. The usefulness of the
proposed test is discussed in the context of microarray gene expression data and illustrated with an
application to real datasets.
© 2005 Elsevier B.V. All rights reserved.
Keywords: Two-sample statistic; Permutation tests; Nonparametric inference; Microarray analysis

* Corresponding author. Tel.: +1 585 275 6688; fax: +1 585 273 1031.
E-mail address: andrei_yakovlev@urmc.rochester.edu (A. Yakovlev).
0167-9473/$ - see front matter © 2005 Elsevier B.V. All rights reserved.
doi:10.1016/j.csda.2005.08.005

1. Introduction
This work is motivated by the statistical challenges of microarray gene expression data analysis.
In this rapidly evolving field of computational biology, little attention has been paid to the choice
of an appropriate test statistic in conjunction with multiple two-sample permutation tests
conducted to detect differentially expressed genes. Although the test statistic proposed in the
present paper may be of interest in the analysis of two- or multi-sample problems in general,
it is especially relevant to the analysis of microarray data for the reasons given below.
The set of microarray expression data on d distinct genes is represented by a random
vector $X = (X_1, \ldots, X_d)$ with stochastically dependent components. The dimension d of
X is extremely high relative to the number of observations (replicates of experiments).
The standard practice is to test the hypothesis of no differential expression for each gene
when comparisons are made between two (or more) different experimental conditions. In
terms of the unknown marginal cumulative distribution functions (c.d.f.s) $F_i$ and $G_i$ of the
ith component of X, the two-sample null hypothesis is formulated as $H_0: F_i(x) = G_i(x)$,
$i = 1, \ldots, d$. The problem of multiple testing represents a real challenge in this situation
because the number of genes is extremely large, ranging typically from $10^3$ to $5 \times 10^4$.
This problem is discussed at length in a recent paper by Dudoit et al. (2003). Resampling
techniques (cf., Pesarin, 2001; Westfall and Young, 1993) are used as a key element in many
modern approaches to the problem of multiple dependent tests inherent in significance
analysis of microarrays.
No matter what concept of error rate is chosen to approach the problem of multiple testing,
one of the most fundamental questions is how this rate can be estimated for a specific
test statistic. As far as the minimum p-value (min p-value) multiple testing procedures (cf.,
Dudoit et al., 2003; Westfall and Young, 1993) are concerned, the most appealing possibility
would be resorting to distribution-free statistics. The sampling distribution of such
a statistic does not depend upon which distribution generated the observed data under the
null hypothesis. Furthermore, the unadjusted p-values can be computed exactly for the most
widely used distribution-free statistics, such as the Kolmogorov-Smirnov or Mann-Whitney
statistics, which is a serious computational advantage when the Bonferroni-type methods
are used for multiple testing adjustment. However, it is not always possible to take advantage
of the above virtues because all such statistics have an attainable maximum, a property that
may make the adjusted p-values too large to reject even a single hypothesis at a reasonable
family-wise error rate (see Dudoit et al., 2003 for definition), thereby effectively stymieing
multiple testing inference in small sample studies. This obstacle was recently discussed in
Lee et al. (2005) in the context of microarray data analysis.
To illustrate the latter point, let us consider the Bonferroni adjustment for multiple testing.
Suppose the probability for a given distribution-free statistic to attain its maximum is equal
to $\delta$; then the Bonferroni-adjusted p-values will all be greater than or equal to $\delta d$, where d is
the total number of hypotheses (genes). Note that $\delta$ is fixed and determined by the size of a
sample, while d may be very large. Therefore, the p-values thus adjusted may be too large
to reject even a single hypothesis at a pre-set level of the family-wise error rate (FWER).
As a result, even nonoverlapping samples may not lead to rejection of the null hypothesis.
The Westfall and Young permutation method (Westfall and Young, 1993) suffers from the
same problem, because the probability that at least one out of the d statistics attains its
maximum (which manifests itself in nonoverlapping samples resulting from permutations)
may be prohibitively large, with no adjusted p-values being sufficiently small to reject even
a single null hypothesis. This applies to any distribution-free rank statistic and any multiple
testing procedure that controls the FWER.
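The size of this effect can be checked with a few lines of arithmetic. The sketch below is only an illustration under stated assumptions: there are no ties, and a rank statistic attains its extreme value only when the two samples do not overlap, so that 1 (one-sided) or 2 (two-sided) of the equally likely splits of the pooled sample are extreme. The function names are hypothetical and do not come from the paper.

```python
from math import comb


def min_attainable_p(n1, n2, two_sided=True):
    """Smallest p-value a two-sample rank statistic can attain (no ties assumed).

    The extreme value occurs only for nonoverlapping samples, i.e. for
    1 (one-sided) or 2 (two-sided) of the C(n1 + n2, n1) equally likely splits.
    """
    extreme_splits = 2 if two_sided else 1
    return min(1.0, extreme_splits / comb(n1 + n2, n1))


def bonferroni_floor(n1, n2, n_tests, two_sided=True):
    """Lower bound on every Bonferroni-adjusted p-value: d times the minimum p."""
    return min(1.0, n_tests * min_attainable_p(n1, n2, two_sided))


if __name__ == "__main__":
    # Sample sizes and gene count as in the first application of Section 5.
    print(bonferroni_floor(10, 10, 45_000))
```

With 10 replicates per group and about 45,000 genes, the bound is about 0.49, so no hypothesis could be rejected at an FWER of 0.05 however well separated the two samples are.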


To overcome the above-described obstacle, recourse to a continuous test statistic is necessary
when testing the hypothesis $H_0$ for each gene. The traditional t-statistic is the most
common choice in microarray analysis. To avoid an overly strong assumption of normality
of gene expression levels, permutation techniques are used to mimic the sampling distribution
of the t-statistic under the null hypothesis (Dudoit et al., 2002, 2003). This makes the
corresponding statistical test essentially nonparametric. Also, the (unpooled) t-statistic is
asymptotically (as the sample size tends to infinity) distribution-free. However, the t-statistic
is designed to test the hypothesis $H_0': \mu_F = \mu_G$, where $\mu_F$ and $\mu_G$ are the mean values of
the distributions F and G, respectively, while we are looking for an alternative method for
testing the stronger hypothesis $H_0: F(x) = G(x)$.
In this paper, we propose a test statistic designed for testing the equality of distributions
($H_0$) rather than only of the mean values ($H_0'$). To claim the usefulness of a test, one has to
assess its power in situations where other statistical tests are known to be optimal or nearly
so. The proposed permutation test proves to be almost as powerful as the permutation t-test
in the case of normally distributed data and location alternatives. We also discuss some
computational and practical issues related to applications of the proposed resampling test
in the analysis of microarray data.

2. Theoretical framework: the test statistic


Let X and Y be two random variables with probability distributions $\mu$ and $\nu$, respectively,
defined on the real line $R^1$. Let $K(x, y)$ be a strictly negative definite kernel (Vakhaniya
et al., 1987), that is, $\sum_{i,j=1}^{s} K(x_i, x_j) h_i h_j \le 0$ for any $x_1, \ldots, x_s$ and $h_1, \ldots, h_s$ with
$\sum_{i=1}^{s} h_i = 0$, with equality if and only if all $h_i = 0$. Introduce the following distance between two
probability distributions $\mu$ and $\nu$:

$$N(\mu, \nu) = \left[ 2 \int_{R^1} \int_{R^1} K(x, y)\, d\mu(x)\, d\nu(y) - \int_{R^1} \int_{R^1} K(x, y)\, d\mu(x)\, d\mu(y) - \int_{R^1} \int_{R^1} K(x, y)\, d\nu(x)\, d\nu(y) \right]^{1/2}. \tag{1}$$

The distance $N = N(\mu, \nu)$ is well-defined and can be shown to be a metric in the space of all
probability measures on $R^1$ (Zinger et al., 1989), so that the null hypothesis in two-sample
comparisons can be formulated as $H_0: N(\mu, \nu) = 0$. A multivariate extension of the metric
N was considered in Szabo et al. (2002, 2003), Xiao et al. (2004), where it was used to
search for differentially expressed gene combinations. A normalized version of N can be
defined as $N_{\mathrm{norm}} = N / \sqrt{A}$, where

$$A = \int_{R^1} \int_{R^1} K(x, y)\, d\mu(x)\, d\mu(y) + \int_{R^1} \int_{R^1} K(x, y)\, d\nu(x)\, d\nu(y). \tag{2}$$

If $K(x, y) = \varphi(x - y)$ and $\varphi(\cdot)$ is homogeneous of any order, then $N_{\mathrm{norm}}$ is both location
and scale invariant.
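The invariance claim can be verified in one line. The derivation below is a sketch under stated assumptions: the kernel depends only on the difference of its arguments through a function written here as $\varphi$, this function is homogeneous of order r (so that $\varphi(ct) = c^r \varphi(t)$ for $c > 0$), and the normalizing constant A is a sum of double integrals of K, so that the normalized distance is N divided by the square root of A. The symbols r, a and c are introduced here for illustration only.

$$K(cx + a,\, cy + a) = \varphi\bigl(c(x - y)\bigr) = c^{r} \varphi(x - y) = c^{r} K(x, y), \qquad c > 0,$$

$$\text{hence}\quad N^2 \mapsto c^{r} N^2, \qquad A \mapsto c^{r} A, \qquad \frac{N}{\sqrt{A}} \mapsto \frac{c^{r/2} N}{c^{r/2} \sqrt{A}} = \frac{N}{\sqrt{A}}.$$

The shift a cancels inside the kernel and the scale factor cancels in the ratio, which is why the square root of A, rather than A itself, is the natural normalizer.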


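Consider two independent random samples $x_1, \ldots, x_{n_1}$ and $y_1, \ldots, y_{n_2}$ of sizes $n_1$ and
$n_2$, respectively, and introduce the empirical counterpart of $N(\mu, \nu)$ as follows:

$$N = \left[ \frac{1}{n_1 n_2} \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} 2 K(x_i, y_j) - \frac{1}{n_1^2} \sum_{i=1}^{n_1} \sum_{j=1}^{n_1} K(x_i, x_j) - \frac{1}{n_2^2} \sum_{i=1}^{n_2} \sum_{j=1}^{n_2} K(y_i, y_j) \right]^{1/2}. \tag{3}$$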
A distinct advantage of the approach based on N is a wide selection of negative definite
kernels that are sensitive to various departures from the hypothesis $\mu = \nu$ (Szabo et al.,
2002; Xiao et al., 2004). In this paper, use is made of the Euclidean distance between
points representing experimental measurements: $K(x, y) = |x - y|$. Here x and y denote
observations in two samples on a particular variable.
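The empirical statistic translates directly into code. The following is a minimal sketch in plain Python, assuming the double sums run over all pairs including the diagonal terms; the function names are illustrative and not part of the paper, and the kernel is passed as an argument only to reflect the remark that other negative definite kernels may be used.

```python
from math import sqrt


def abs_kernel(a, b):
    """K(x, y) = |x - y|, the kernel used in this paper."""
    return abs(a - b)


def mean_kernel(u, v, kernel):
    """Average of kernel(u_i, v_j) over all pairs, diagonal terms included."""
    return sum(kernel(a, b) for a in u for b in v) / (len(u) * len(v))


def n_statistic(x, y, kernel=abs_kernel):
    """Empirical N-distance between the samples x and y."""
    squared = (2.0 * mean_kernel(x, y, kernel)
               - mean_kernel(x, x, kernel)
               - mean_kernel(y, y, kernel))
    # Guard against tiny negative values caused by floating-point rounding.
    return sqrt(max(squared, 0.0))
```

For example, with x = (0, 1) and y = (2, 3) the three averages are 2, 0.5 and 0.5, so the statistic equals the square root of 3, about 1.73; for two identical samples it is 0.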

3. Computational framework: accuracy of permutation quantiles


It is always desirable to determine how many permutations are required to provide a
sufficiently accurate estimation of the critical region for a given test. This issue has not
received much attention in the literature on microarray analysis. In relevant publications,
permutation tests are sometimes performed with just 100 permutations (Storey and Tibshirani,
2003). For the purposes of illustration, we consider the case where the samples under
comparison are of equal size n. Unequal samples can readily be accommodated within the
same framework. From the multiple testing perspective, the estimates discussed below apply
to the maximum test statistic distribution yielded by resampling methods (Westfall and
Young, 1993).
Let $F_N(z)$ be the permutation c.d.f. of a given continuous test statistic Z or the maximum
of such statistics in the multiple testing setting. Here $N = (2n)!/(n!\,n!)$ is the total number of
distinct permutations yielded by the two samples under comparison. Let m be the number
of actual random permutations conducted to estimate a p-quantile of the distribution $F_N(z)$.
The test statistics $Z_i$, $i = 1, \ldots, m$, associated with each permutation represent independent
and identically distributed random variables.
Let $Z_{(1)}, \ldots, Z_{(m)}$ be the sequence of the m order statistics from a sample of Z's of size
m. We estimate the p-quantile of $F_N(z)$ by the empirical quantile $Z_{[pm]}$, where [a] is the
integral part of a. Introduce the following random variable: $Y_p^{(m)} = F_N(Z_{[pm]})$. Although
$F_N(z)$ is in fact discrete, we can treat it as a continuous c.d.f. whenever N is large. For
simplicity suppose that such a condition is met. It is well known (David, 1981) that

$$\Pr\{Y_p^{(m)} \le y\} = I_y([pm],\, m - [pm] + 1), \tag{4}$$

where $I_x(a, b)$ is the normalized incomplete beta function. Now we can find a value
of m for which the estimate $Y_p^{(m)}$ of a preset value of p is accurate to an arbitrary $\varepsilon$
with sufficiently high probability. For example, we may choose $\varepsilon = 0.005$ to ensure that
$\Pr\{|Y_p^{(m)} - 0.95| \le 0.005\} \ge 0.95$. In other words, the choice of $\varepsilon$ depends on the value of
p used to determine the corresponding significance level. Using formula (4) and solving the
resultant equation numerically, we obtain $m \approx 16{,}688$ in this case. For $\varepsilon = 0.0075$, however,
the required number of permutations is much smaller, namely, $m \approx 4{,}349$. It is worth noting
that the estimate of m based on (4) is distribution-free and its derivation does not rely on
any asymptotic argument.
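Formula (4) can be evaluated without special-function libraries, because for integer arguments the normalized incomplete beta function equals a binomial tail probability: I_y(k, m - k + 1) is the probability that at least k of m independent uniform variables fall below y. The sketch below assumes that the accuracy requirement is the two-sided event "the estimate lies within the chosen tolerance of p"; this reading, the function names and the small rounding guard are assumptions made here, and no attempt is made to reproduce the values of m quoted above.

```python
from math import exp, lgamma, log


def order_stat_cdf(y, k, m):
    """I_y(k, m - k + 1), computed as Pr(Binomial(m, y) >= k) for integer k >= 1."""
    if y <= 0.0:
        return 0.0
    if y >= 1.0:
        return 1.0
    log_y, log_1my = log(y), log(1.0 - y)
    total = 0.0
    for j in range(k, m + 1):
        # log of C(m, j) * y^j * (1 - y)^(m - j), kept in log space for stability
        log_term = (lgamma(m + 1) - lgamma(j + 1) - lgamma(m - j + 1)
                    + j * log_y + (m - j) * log_1my)
        total += exp(log_term)
    return min(total, 1.0)


def coverage(m, p, eps):
    """Probability that the estimated level lies within eps of p for m permutations."""
    k = int(p * m + 1e-9)  # [pm], the integral part of p * m (guarded against rounding)
    if k < 1:
        raise ValueError("m is too small for the requested quantile")
    return order_stat_cdf(p + eps, k, m) - order_stat_cdf(p - eps, k, m)
```

Scanning m upward until coverage(m, p, eps) first exceeds the target probability gives a required number of permutations under the assumed criterion.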

4. Power of the test: a simulation study


To assess the power of the proposed test, we designed our simulation study as follows:
1. Logarithms of gene expression signals are generated from a normal distribution $\mathcal{N}(\mu, \sigma)$
with mean $\mu$ and standard deviation $\sigma$. In the context of microarray data analysis, this
design implies that the original gene expression levels are log-transformed.
2. One of the two samples under comparison is generated from the distribution with $\mu = 0$
and $\sigma = 1$. To generate the other sample, the parameter $\mu$ is set at different values from
the range [0, 3] with an increment of 0.03, keeping $\sigma = 1$. This design allows us to model
the class of location (shift) alternatives. The size of each sample is equal to 8.
3. The resultant pair of samples is used to compute the observed values of the N- and
t-statistics. The null distribution is generated by randomly permuting the samples 5,000
times.
4. Steps 1-3 are repeated 5,000 times. The number (and the proportion) of rejections of the
null hypothesis at a significance level of 0.05 is recorded for each value of $\mu$.
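The steps above can be sketched as a small Monte Carlo routine. This is an illustration rather than the code used in the study: the replication counts are reduced by default so that it runs quickly, the t-statistic is taken in its unpooled form with its absolute value as the test statistic, and the permutation p-value uses the (hits + 1)/(permutations + 1) convention. All three choices, and all function names, are assumptions made here.

```python
import random
from statistics import mean, variance


def n_stat(x, y):
    """Empirical N-statistic with kernel |x - y|."""
    kxy = sum(abs(a - b) for a in x for b in y) / (len(x) * len(y))
    kxx = sum(abs(a - b) for a in x for b in x) / len(x) ** 2
    kyy = sum(abs(a - b) for a in y for b in y) / len(y) ** 2
    return max(2.0 * kxy - kxx - kyy, 0.0) ** 0.5


def abs_t_stat(x, y):
    """Absolute value of the unpooled two-sample t-statistic."""
    se2 = variance(x) / len(x) + variance(y) / len(y)
    return abs(mean(x) - mean(y)) / se2 ** 0.5


def perm_p_value(x, y, stat, n_perm, rng):
    """Permutation p-value for a statistic whose large values are extreme."""
    observed = stat(x, y)
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if stat(pooled[:len(x)], pooled[len(x):]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def estimate_power(shift, stat, n=8, n_perm=200, n_rep=100, alpha=0.05, seed=0):
    """Proportion of rejections when the second sample is N(shift, 1)."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_rep):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        y = [rng.gauss(shift, 1.0) for _ in range(n)]
        if perm_p_value(x, y, stat, n_perm, rng) <= alpha:
            rejections += 1
    return rejections / n_rep
```

Calling estimate_power(shift, n_stat) and estimate_power(shift, abs_t_stat) over a grid of shifts gives rough power curves; the 5,000 permutations and 5,000 repetitions of the study itself correspond to n_perm=5000 and n_rep=5000, which is slow in pure Python.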
Shown in Fig. 1 are the power functions for the two tests. We also included the Kolmogorov-Smirnov
two-sample test in this study for comparison. Fig. 1 shows that the permutation
t-test outperforms the one based on the N-statistic. However, the losses of power at different
values of $\mu$ do not appear to be substantial and the performance of the N-statistic is quite
competitive with that of the t-statistic. Needless to say, no conclusion regarding arbitrary
alternatives can be drawn from this particular example. Both tests provide the nominal
significance level under the null hypothesis in this study. The N-test clearly outperforms the
Kolmogorov-Smirnov test in this situation. It is seen in Fig. 1 that the latter test does not
attain the nominal size of 0.05. The reason has to do with the granularity of its p-values.
To study the effect of the shape of the underlying distribution on the power of the tests
under comparison, we conducted additional simulations based on normal mixtures. To model
the null hypothesis, the following two-component mixture was used: $F_0 = 0.5\,[\mathcal{N}(-1, 1) + \mathcal{N}(1, 1)]$.
The set of location alternatives was modeled as $F_1 = 0.5\,[\mathcal{N}(-1 + \mu, 1) + \mathcal{N}(1 + \mu, 1)]$,
resulting in an increase in the overall mean much as in the normal case described above.
The power curves shown in Fig. 2 indicate that the t-test still slightly outperforms the N-test,
both being superior to the Kolmogorov-Smirnov test. However, the situation is not the same
if we model the alternative as $F_2 = 0.5\,[\mathcal{N}(-1, 1) + \mathcal{N}(1 + \mu, 1)]$, thereby increasing the
mean in only one component, in which case the N-test is more powerful than both the t-test
and the Kolmogorov-Smirnov test for moderately distant alternatives (Fig. 3). This example
is biologically meaningful as it refers to frequently encountered situations where the study
population of subjects includes both responders and nonresponders to a given biological
stimulus. This is a situation where the statistic N offers a clear advantage.

Fig. 1. Normal case: power functions for the N-test, t-test, and the Kolmogorov-Smirnov test. x-axis: values of
the parameter $\mu$ (shift), y-axis: estimated power. The t-test slightly outperforms the N-test, both being superior to the
Kolmogorov-Smirnov test.

Fig. 2. Normal mixture given by $F_0$: power functions for the three tests. Departures from the null hypothesis are
modeled by the parametric family $F_1$ (see Section 4 for explanations). The same notation as in Fig. 1. The t-test
slightly outperforms the N-test, both being superior to the Kolmogorov-Smirnov test.
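The two mixture designs of Section 4 can be sketched as samplers. The component means of -1 and 1 under the null hypothesis are an assumption made here, because the signs are not legible in the source; the function names and the use of a single shift argument are likewise illustrative.

```python
import random


def sample_mixture(n, means, rng, sd=1.0):
    """Draw n values from an equal-weight mixture of normals with the given means."""
    return [rng.gauss(rng.choice(means), sd) for _ in range(n)]


def sample_f0(n, rng):
    """Null model: components centred at -1 and 1 (assumed)."""
    return sample_mixture(n, (-1.0, 1.0), rng)


def sample_f1(n, shift, rng):
    """Location alternative: both components are shifted."""
    return sample_mixture(n, (-1.0 + shift, 1.0 + shift), rng)


def sample_f2(n, shift, rng):
    """Responder/nonresponder alternative: only one component is shifted."""
    return sample_mixture(n, (-1.0, 1.0 + shift), rng)
```

Under the second alternative the overall mean moves by only half of the shift while the shape of the distribution changes, which is the kind of departure a test of means alone is not designed to detect.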

Fig. 3. Normal mixture given by $F_0$: power functions for the three tests (x-axis: shift). Departures from the null hypothesis are
modeled by the parametric family $F_2$ (see Section 4 for explanations). The same notation as in Fig. 2. The N-test
clearly outperforms the t-test.

5. Examples of data analysis

Our experimental study was concerned with the identification of genes differentially
expressed as a consequence of malignant cell transformation. Young adult mouse colon
(YAMC) cells (D'Abaco et al., 1996; Whitehead et al., 1993) and their derivatives transformed
by activated Ras and a dominant-negative mutant p53 following retroviral gene
transfer were chosen as an experimental model. RNA was isolated from YAMC cells (control #1),
cells infected with control retroviruses (control #2), and cells expressing activated
Ras and mutant p53 genes. Labeled cRNA was used to probe high-density Affymetrix Mouse
Genome 430 2.0 arrays. Ten biological replicates for each of the three experimental conditions
were prepared. In this application, the total number of probe sets (genes) was about
45,000. The data were normalized using the quantile normalization procedure (Bolstad et
al., 2003; Irizarry et al., 2003) at the probe feature level. The raw (not normalized) but
background-corrected expression data were obtained from the output of the Bioconductor RMA
(Robust Multi-Array Average) procedure with the option normalization = false.
We used the step-down maximum-test-statistic multiple testing procedure by Westfall
and Young (Westfall and Young, 1993, pp. 116-117) controlling the FWER at a given level.
The N-statistic was computed from log-expressions. The Westfall and Young procedure was
also run with the t-statistic. Given that the sample size is relatively small and the number
of genes is very large in this setting, we were unable to use the Kolmogorov-Smirnov test
for the reasons discussed in the Introduction.
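For orientation, the step-down maximum-statistic adjustment can be sketched as follows. This is a simplified illustration of the procedure as it is commonly described (for example in Dudoit et al., 2003), not the implementation used for the analyses reported here. It assumes that larger values of the statistic are more extreme, that whole arrays (columns) are permuted so that dependence between genes is preserved, and that random rather than exhaustive permutations are used; the data layout and function names are hypothetical.

```python
import random


def abs_mean_diff(x, y):
    """A simple stand-in statistic; any statistic with large = extreme can be used."""
    return abs(sum(x) / len(x) - sum(y) / len(y))


def maxt_adjusted_p(data, n1, stat, n_perm=1000, seed=0):
    """Step-down max-statistic adjusted p-values.

    data[g] holds the n1 + n2 values of gene g; the first n1 columns are group 1.
    """
    rng = random.Random(seed)
    d = len(data)
    observed = [stat(row[:n1], row[n1:]) for row in data]
    # Genes ordered from the most to the least extreme observed statistic.
    order = sorted(range(d), key=lambda g: observed[g], reverse=True)
    counts = [0] * d
    cols = list(range(len(data[0])))
    for _ in range(n_perm):
        rng.shuffle(cols)  # one relabelling of arrays, shared by all genes
        g1, g2 = cols[:n1], cols[n1:]
        running_max = float("-inf")
        # Successive maxima, built from the least extreme gene upward.
        for rank in range(d - 1, -1, -1):
            row = data[order[rank]]
            value = stat([row[c] for c in g1], [row[c] for c in g2])
            running_max = max(running_max, value)
            if running_max >= observed[order[rank]]:
                counts[rank] += 1
    adjusted = [0.0] * d
    previous = 0.0
    for rank, g in enumerate(order):
        previous = max(previous, counts[rank] / n_perm)  # enforce monotonicity
        adjusted[g] = previous
    return adjusted
```

Genes whose adjusted p-value does not exceed the chosen level are declared differentially expressed; substituting the N-statistic or the t-statistic for abs_mean_diff gives the two variants compared in this section.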
The testing procedure resulted in a larger set of differentially expressed genes when
applied to the normalized data as compared to the same analysis of the raw (background
corrected but not normalized) expression data. The observed discrepancy was even more
pronounced when the t-test was used. The same effect was obvious when the rank adjustment
(Szabo et al., 2002) was used for data normalization. Our analysis based on the N-test has
resulted in two sets of genes declared differentially expressed between the treatment group
(Condition RP) and each of the control groups. Their intersection includes about 300 genes
when the FWER is controlled at the level of 0.001. Among the known genes we found
known ligands, receptors, intracellular signal transducers and transcription factors reported
to be involved in controlling cell functions such as survival, motility, invasiveness and
angiogenesis. These biological findings will be described at length in another paper. The
t-test selected only 197 genes at the same FWER level. This can be attributed in part to the fact
that the N-test is sensitive to more complex dissimilarities between the distributions under
comparison than those expected under location alternatives. This is not a universal tendency
as evidenced by another application given below. The rejection rate generally depends on
the distribution of data under the unknown alternative hypothesis. The results for the two
tests become closer for higher levels of the FWER. The numbers of genes selected by the
N-test and the t-test were equal to 792 and 775, respectively, when both tests were applied
at the 0.05 level.

Fig. 4. The numbers of genes selected by different methods: normalized versus nonnormalized data on childhood
leukemia. Normalized: KS: 794, N: 536, t: 1058, overlap of N and KS: 486. Non-normalized: KS: 623, N: 528,
t: 790, overlap of N and KS: 496.
In the second application, use was made of the St. Jude Children's Research Hospital
(SJCRH) Database on childhood leukemia. There are 335 arrays (Affymetrix, Santa Clara,
CA) in the SJCRH dataset, each array representing d = 12,558 genes. The data are publicly
available from the following website: http://www.stjuderesearch.org/data/ALL1/. The
SJCRH data include the information on gene expression in normal blood and various types
of childhood leukemia. We selected two different types of childhood leukemia denoted by
TALL and Hyperdip, respectively. Each group of patients was represented by 43 arrays. The
data were normalized using the same RMA procedure. We used the step-down algorithm
by Westfall and Young to control the FWER at the level of 0.05. Since the sample size was
sufficiently large in this analysis, we could perform all three statistical tests studied in
Section 4. The N-test selects 536 genes while the Kolmogorov-Smirnov test produces a
list of 794 genes. There are 486 genes in the intersection of the two lists. The t-test selects
1,058 genes. These results are summarized in Fig. 4. Clearly, the N-test appears to be the
most conservative one in this particular example, as opposed to what we have seen in the
other application. It is worth noting that the observed rejection rate is not equivalent to
the power of a test. The results for nonnormalized data are largely similar (Fig. 4) with all
the tests selecting fewer genes than in the case of normalized data.


6. Discussion and conclusions


While two-sample permutation tests have enjoyed wide use in microarray data analysis,
the search for better test statistics continues. The proposed nonparametric test statistic can
be considered an alternative to distribution-free statistics whenever the number of hypotheses
to be tested is large and the sample size is small. It can also be considered supplemental
to the t-statistic as it is designed to detect more complex changes in the marginal distributions
of gene expression signals than only in the mean values. Other distinct advantages of
the N-statistic are:
1. It is numerically stable and does not suffer from the instability caused by small variances
as does the commonly used t-statistic (Tusher et al., 2001).
2. It accommodates both continuous and ordinal data.
3. In simulation studies, the proposed test appears to be almost as powerful as the permutation t-test even in the case of normally distributed data and location alternatives.
The issue of the required number of permutations naturally arises in the design of permutation tests. The present paper addresses this issue in terms of the accuracy of estimated critical regions. The recommended estimate of the number of permutations given in
Section 3 applies to all permutation tests.
Acknowledgements
This research was supported by Czech Ministry of Education Grant MSM 113200008,
NIH Grants GM075299 and CA090663, and a Discovery Grant from the James P. Wilmot
Cancer Center.
References
D'Abaco, G.M., Whitehead, R.H., Burgess, A.W., 1996. Synergy between Apc min and an activated ras mutation
is sufficient to induce colon carcinomas. Mol. Cell Biol. 16, 884-891.
Bolstad, B.M., Irizarry, R.A., Astrand, M., Speed, T.P., 2003. A comparison of normalization methods for high
density oligonucleotide array data based on variance and bias. Bioinformatics 19, 185-193.
David, H.A., 1981. Order Statistics. second ed. Wiley, New York.
Dudoit, S., Yang, Y.H., Speed, T.P., Callow, M.J., 2002. Statistical methods for identifying differentially expressed
genes in replicated cDNA microarray experiments. Statist. Sinica 12, 111-139.
Dudoit, S., Shaffer, J.P., Boldrick, J.C., 2003. Multiple hypothesis testing in microarray experiments. Statist. Sci.
18, 71-103.
Irizarry, R.A., Gautier, L., Cope, L.M., 2003. An R package for analyses of Affymetrix oligonucleotide arrays.
In: Parmigiani, G., Garrett, E.S., Irizarry, R.A., Zeger, S.L. (Eds.), The Analysis of Gene Expression Data.
Springer, New York, pp. 102-119.
Lee, M.-L.T., Gray, R.J., Björkbacka, H., Freeman, M.W., 2005. Generalized rank tests for replicated microarray data.
Statist. Appl. Genet. Mol. Biol. 4(1) (Article 3).
Pesarin, F., 2001. Multivariate Permutation Tests: With Applications in Biostatistics. Wiley, Chichester.
Storey, J.D., Tibshirani, R., 2003. Statistical significance for genomewide studies. Proc. Nat. Acad. Sci. 100,
9440-9445.
Szabo, A., Boucher, K., Carroll, W., Klebanov, L., Tsodikov, A., Yakovlev, A., 2002. Variable selection and pattern
recognition with gene expression data generated by the microarray technology. Math. Biosci. 176, 71-98.
Szabo, A., Boucher, K., Jones, D., Klebanov, L., Tsodikov, A., Yakovlev, A., 2003. Multivariate exploratory tools
for microarray data analysis. Biostatistics 4, 555-567.
Tusher, V.G., Tibshirani, R., Chu, G., 2001. Significance analysis of microarrays applied to the ionizing radiation
response. Proc. Nat. Acad. Sci. 98, 5116-5121.
Vakhaniya, N.N., Tarieladze, V.I., Chobanyan, S.A., 1987. Probability Distributions on Banach Spaces. Reidel,
Dordrecht, Holland.
Westfall, P.H., Young, S., 1993. Resampling-Based Multiple Testing. Wiley, New York.
Whitehead, R.H., Van Eeden, P.E., Noble, M.D., Ataliotis, P., Jat, P.S., 1993. Establishment of conditionally
immortalized epithelial cell lines from both colon and small intestine of adult H-2Kb-tsA58 transgenic mice.
Proc. Nat. Acad. Sci. 90, 587-591.
Xiao, Y., Frisina, R., Gordon, A., Klebanov, L., Yakovlev, A., 2004. Multivariate search for differentially expressed
gene combinations. BMC Bioinformatics 5 (Article 164).
Zinger, A.A., Klebanov, L.B., Kakosyan, A.V., 1989. Characterization of distributions by mean values of statistics
in connection with some probability metrics. In: Stability Problems for Stochastic Models. VNIISI, Moscow,
pp. 47-55.
