
17 Other Methods of Inference

A Confidence Interval for the Median
    Nonparametric Statistics
    Nonparametric Confidence Interval
    Parametric versus Nonparametric
Transformations and Intervals
Prediction Intervals
    Prediction Intervals for the Normal Model
    Nonparametric Prediction Interval
Proportions Based on Small Samples
Summary

4/20/08


Businesses that have employees at some point have to solve the principal-agent problem. From an economic point of view, the principal-agent problem is to make sure that employees (the agents) share the goals of the business's management (the principal). The situation is most apparent in setting compensation policies. The salary paid should motivate employees to work in a manner that's most beneficial to the company as a whole rather than their personal self-interest. For instance, a manufacturing facility might pay employees a piece rate rather than an hourly wage. The more items produced, the more the employee is paid and the more goods the manufacturer has to sell. The principal-agent problem becomes harder to solve when employees operate in less easily watched, far-flung locations. As an example, consider the situation faced by an insurance company. Agents of the company operate out of small offices around the country. Think about what can happen if agents are paid a percentage of the cost of the insurance policies that they sell. Unless the company carefully reviews the records of the insured drivers, agents might be encouraged to sell policies to risky drivers. The agent gets paid right away, but the insurance company faces losses from subsequent accidents. (Some claim a similar problem produced the economic turmoil in the US housing market and banking industry in 2007-2008.) One way to avoid this misalignment of incentives is to change the compensation structure. The insurer could pay employees incrementally over the life of the policy rather than at the time of the initial sale. A second way to avoid this issue is to monitor the claims produced by different agents, comparing each agent to a standard. For example, it is known that claims on auto insurance average near $3,200 nationwide, with the median claim equal to about $2,000. We will explore the second approach in this chapter, using methods of inference that do not rely on a normal model.
Normal models and averages typically come together in statistics, but they do not necessarily provide the best procedure in every situation. The methods of this chapter are also useful when data fail to meet the CLT condition.


A Confidence Interval for the Median


The following histogram shows the distribution of claims on 42 auto insurance policies that originated at a local agency. Most of the claims are reasonably small, but the distribution is very skewed.
What's the diamond? The diamond superimposed on the boxplot is a 95% confidence interval for μ.

Figure 17-1. Claims on auto insurance policies.

The average claim is $3,632 whereas the median claim is $2,456. Half of the claims are for $2,456 or less, but these account for only 14% of the total value of the claims. The few expensive claims generate most of the costs. Is the average of this sample compatible with the nationwide average of $3,200? We can use a 95% confidence interval for the mean, treating these 42 claims as a sample from a population with mean μ. If this agency is operating differently from what is typical around the country, μ may be higher or lower than $3,200. If the confidence interval for μ includes $3,200, there is not a significant difference. If $3,200 lies outside the interval (particularly if costs are significantly higher than the national average), then there is evidence that policies sold by this agency generate different claims from what the insurer expects. Before we can use a t-interval for μ, we need to check the appropriate conditions. The data are a random sample of claims from this agency, so the SRS condition is met. Checking the CLT condition, however, the skewness K3 = 2.407, so the sample size must be at least 10 K3² = 10(2.407)² ≈ 58. It isn't. Also, the kurtosis K4 = 6.473, requiring n > 65. We need another method for constructing a confidence interval in this situation. For reference, the nominal 95% confidence t-interval is (we say nominal since we have little faith that this interval has 95% coverage)

x̄ ± t0.025,41 s/√n = $3,632 ± 2.02 × $4,254/√42 ≈ $2,306 to $4,958
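This arithmetic is easy to verify. The sketch below uses the summary statistics quoted above; the critical value t(0.025, 41) = 2.02 is taken from the text rather than computed.

```python
import math

# Summary statistics quoted in the text for the 42 claims
xbar, s, n = 3632.0, 4254.0, 42
t_crit = 2.02                      # t(0.025, 41), as quoted in the text

# Nominal 95% t-interval: x-bar +/- t * s / sqrt(n)
margin = t_crit * s / math.sqrt(n)
lo, hi = xbar - margin, xbar + margin
print(round(lo), round(hi))        # 2306 4958
```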


Figure 17-1 shows this interval as a diamond superimposed over the boxplot. The national average $3,200 lies well inside this range. Were we to accept this interval, then there is no evidence to suggest that policies


written by this agency generate statistically significantly higher (or lower) claims.

Nonparametric Statistics
Why something else? 1. Fail to meet sample size condition. 2. CI based on mean is too wide.

There are two reasons to consider an alternative to the t-interval for the mean in the previous section. First, the sample is not large enough to satisfy the CLT condition. We cannot justify using a t-distribution to determine the confidence interval. Second, as a summary statistic, the mean is highly variable from sample to sample in the presence of outliers. As a result, a 95% confidence interval based on x̄ may be much wider than a 95% confidence interval based on a different characteristic of the sample that is less variable from sample to sample.

nonparametric statistics Statistical methods that make fewer assumptions about a population, avoiding the need to specify a distribution.

Nonparametric statistics consist of methods that avoid making assumptions about the population. In spite of their name, nonparametric statistics involve parameters of the distribution of the population, and inferences based on nonparametric statistics do require assumptions. Rather than make assumptions about the shape of the population or sampling distribution, however, nonparametric statistics typically require only that the data be a sample.
Many nonparametric methods rely on sorting the data. Sorting is tedious to do by hand, and consequently nonparametric methods were slow to gain popularity. Computers removed this barrier. The connection to sorting also affects the choice of parameters. In place of the mean, nonparametric methods are more suited to parameters such as the population median. We use the Greek letter δ (delta) for the population median. Half of a population is smaller than δ and half is larger, analogous to how a sample median divides the data in half. If the random variable X denotes a random draw from a population with median δ, then P(X ≤ δ) = 1/2. If the distribution of the population is symmetric, then μ = δ and inferences about the median are inferences about the mean. If the distribution is right-skewed, as in this application, the mean and median are quite different. For these data, δ < μ. The average claim in the population is about $3,200 whereas the median claim is near $2,000.

Nonparametric confidence interval


The first step in finding a confidence interval for δ is to put the observed data in order. To show the calculations (which are usually done by computer), we will work with only the first 10 claims in this sample, as if n = 10. After illustrating the calculations, we'll build the interval for the full sample. In order, the first 10 claims (to the nearest dollar) are 617 732 1298 2450 3168 3545 4498 4539 4808 5754


The cheapest claim is $617, and the most expensive is $5,754. Because n = 10 is even, the sample median m is the average of the two middle claims, (3168 + 3545)/2 = $3,356.50. In place of a normal model or t-distribution, the nonparametric confidence interval for the median uses a simple but clever idea. Let X1, X2, ..., X10 be random variables that stand for the costs of 10 randomly chosen claims. Because of the importance of sorting, it is convenient to have a notation that denotes the ordered values. The ordered values are known as order statistics and are usually written as X(1) < X(2) < ... < X(10). Parentheses around the subscripts signal that these are ordered random variables. We use lower-case letters to denote observed values. For example, the two smallest values are x(1) = $617 and x(2) = $732. To build a confidence interval for δ without relying on hypothetical models like the normal, the nonparametric interval relies on two things that must be true if these data are a sample from a population with median δ. Since δ is the median of the population, it must be the case that P(X1 < δ) = 1/2, P(X2 < δ) = 1/2, ..., P(X10 < δ) = 1/2. Because the data are a sample, the random variables X1, X2, ..., X10 must be independent. Hence, the chance of getting a sample of 10 claims in which every claim is less than δ is P(X1 < δ and X2 < δ and ... and X10 < δ) = (1/2)^10 ≈ 0.001. A confidence interval emerges from looking at this and similar statements from a different point of view that introduces the order statistics. If every observation is less than δ, then the largest observation is less than δ. Hence, P(X(10) < δ) = (1/2)^10. Let's do another calculation, then find a pattern. What is the probability that exactly 9 of the 10 claims are less than δ? That's the same as the probability of getting 9 heads in 10 tosses of a fair coin.
The counting methods used in Chapter 11 imply that P(9 out of 10 claims are less than δ) = 10C9 (1/2)^10 = 10(1/2)^10. Using order statistics, the event in which 9 out of 10 claims are less than δ occurs if and only if δ lies between the two largest values, so P(X(9) < δ ≤ X(10)) = 10(1/2)^10. In general, this approach assigns probabilities to the intervals between the ordered data. The probability that δ lies between two adjacent order statistics is

order statistics The sample values in ascending order.

nCk = n! / (k! (n − k)!)


P(X(i) < δ ≤ X(i+1)) = 10Ci (1/2)^10. If we define x(0) = −∞ and x(11) = +∞, this formula works for the 11 disjoint intervals defined by the 10 values of the sample. The first interval is everything below the smallest value, and the last interval is everything above the largest value. This table summarizes the probabilities, which are the same as those for a binomial random variable with n = 10 and p = 1/2.

i    Interval             Probability
0    δ ≤ 617              0.00098
1    617 < δ ≤ 732        0.0098
2    732 < δ ≤ 1298       0.044
3    1298 < δ ≤ 2450      0.117
4    2450 < δ ≤ 3168      0.205
5    3168 < δ ≤ 3545      0.246
6    3545 < δ ≤ 4498      0.205
7    4498 < δ ≤ 4539      0.117
8    4539 < δ ≤ 4808      0.044
9    4808 < δ ≤ 5754      0.0098
10   5754 < δ             0.00098

Table 17-1. Probabilities for location of median.

To form a confidence interval, we combine several segments. Because the events that define the segments are disjoint, the probabilities add. Using two segments to illustrate the calculations, we have P(X(9) < δ) = P(X(9) < δ ≤ X(10) or X(10) < δ) = P(X(9) < δ ≤ X(10)) + P(X(10) < δ) = 0.0098 + 0.00098 = 0.01078. To get a confidence interval for δ, we choose the segments that have the most probability. For example, the interval [x(2) to x(9)] = [$732 to $4,808] is a confidence interval for the median with coverage 0.044 + 0.117 + ... + 0.044 ≈ 0.978. The coverage is the sum of the probabilities of the included segments (segments 2 through 8 in Table 17-1). In this example, as is true in general, we cannot get an interval whose coverage is exactly 0.95. The coverage of the shorter confidence interval [$1,298 to $4,808] is 0.978 − 0.044 = 0.934 because it omits the segment from $732 to $1,298, which has probability 0.044. To summarize, using a sample of 10 claims, we obtain a 97.8% confidence interval for δ, [$732 to $4,808]; we don't have to worry about an assumption that we cannot verify. This interval suggests that the claims from this agency are compatible with the insurance company's experience. The company expects a median claim near $2,000, which lies well within the interval.
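The segment probabilities and coverage for the 10-claim illustration can be sketched with only the standard library:

```python
from math import comb

# First 10 claims, in ascending order
claims = [617, 732, 1298, 2450, 3168, 3545, 4498, 4539, 4808, 5754]
n = len(claims)

# P(X(i) < delta <= X(i+1)) = nCi * (1/2)^n: Binomial(n, 1/2) probabilities
seg_prob = [comb(n, i) * 0.5 ** n for i in range(n + 1)]

# Coverage of [x(2), x(9)] is the sum of the probabilities of segments 2..8
coverage = sum(seg_prob[2:9])
interval = (claims[1], claims[8])  # x(2) and x(9), 0-based indexing
print(interval, coverage)          # (732, 4808) with coverage about 0.978
```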


Are You There?


The salaries of 3 senior executives listed in the annual report of a publicly traded firm are $310,000, $350,000, and $425,000. Let δ denote the median salary earned by this type of executive, and assume these 3 are a random sample from the population of senior executives at firms like this one. (a) Which value is x(2)?¹ (b) What is the probability that all 3 of these executives earn more than δ?² (c) What is the probability that the interval $310,000 to $350,000 holds δ?³ (d) What level of confidence should we attach to the interval $310,000 to $425,000?⁴

Parametric versus Nonparametric


Let's compare the nonparametric interval for the median using all 42 cases to the t-interval for the mean. As in the prior example, we begin by sorting the 42 claims into ascending order. The probability contribution of each segment defined by the order statistics is then P(X(i) < δ ≤ X(i+1)) = 42Ci (1/2)^42. A spreadsheet is useful for this calculation. Adding up the contributions of segments as in Table 17-1, we find that P(X(14) < δ ≤ X(27)) ≈ 0.946, which is close to the usual 95% coverage. The ordered values from the data are x(14) = $1,217 and x(27) = $3,168, so the 94.6% confidence interval for the median claim is [$1,217 to $3,168]. The nominal 94.6%-coverage t-interval for μ is considerably higher and longer, running from $1,870 to $4,411. Nonparametric methods are not without limitations, however. The confidence interval for the median illustrates two weaknesses common to nonparametric methods:
- The coverage is limited to certain values. We can come close to, but not exactly obtain, the 95% coverage that is the de facto standard in many applications.
- The parameter of the population is related to percentiles rather than the mean. The median is not the same as the mean if the distribution of the population is skewed.

We often need a confidence interval for μ and cannot use an interval for δ. If we care about total costs, we care about the mean, not the median. By multiplying a confidence interval for μ by n we obtain a confidence
¹ The middle value, $350,000.
² P(all 3 larger than δ) = (1/2)^3 = 1/8.
³ P(X(1) < δ ≤ X(2)) = 3C1 (1/2)^3 = 3/8.
⁴ The coverage of the range of this small sample is 3/8 + 3/8 = 3/4.


interval for the total amount associated with n items. That's not possible to do with the median.
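The spreadsheet calculation for the full sample of 42 claims can also be sketched directly; this evaluates the coverage of the interval between x(14) and x(27) used above:

```python
from math import comb

n = 42
seg = [comb(n, i) * 0.5 ** n for i in range(n + 1)]

def coverage(i, j):
    # P(X(i) < delta <= X(j)) = sum of segment probabilities i, i+1, ..., j-1
    return sum(seg[i:j])

# The interval [x(14), x(27)] used in the text
print(round(coverage(14, 27), 3))  # close to 0.95
```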

Transformations and Intervals


When dealing with positive data that are right-skewed like the insurance claims in the histogram of Figure 17-1, it is often recommended to transform the data in order to obtain a more symmetric distribution. As an illustration, the following histogram summarizes base 10 logs of the 42 claims. Each value of this new variable is the common logarithm of a claim, yi = log10 xi . Since these are base 10 logs, we can associate 2 with 100, 3 with 1000, and 4 with 10000 to make plots easier to interpret.

Figure 17-2. Base 10 logarithm of the value of insurance claims.

If you compare this histogram to the original histogram in Figure 17-1, the log transformation has removed most of the skewness from the sample. If we check the normal quantile plot (Chapter 12), the log of the claims could be a sample from a normally distributed population. All of the data remain within the dashed region around the diagonal reference line.

Figure 17-3. Normal quantile plot of the log of the claims.

The skewness and kurtosis of the logarithms of the claims are also close to zero. After taking logs, K3 = -0.166 and K4 = -0.370. These data clearly


satisfy the CLT condition, so we can use a t-interval for the mean of the population on a log scale. The average of the logarithms of the claims is ȳ = 3.312 with sy = 0.493. The 95% confidence t-interval for the mean on the log scale is then 3.312 ± 2.02 × 0.493/√42 = [3.16 to 3.47]
This interval satisfies the conditions for using a t-interval; we can rely on the coverage being 95%. But now we run into a problem: How are we to interpret this interval on the log10 scale? Unless we can interpret a confidence interval, it is not of much value.

tip
log(average) ≠ average(log)

The difficulty with interpreting this interval is that the log of an average is not the same as the average of the logs. The average claim in this sample is x̄ = $3,632, so log10(x̄) ≈ 3.560. That is quite a bit larger than the average of the data on the log scale (ȳ = 3.312). Similarly, if we unlog ȳ (raise 10 to the power indicated by the mean), we get 10^3.312 ≈ $2,051. That's nowhere near the average claim x̄. If we transform the endpoints of the confidence interval on the log scale back to the scale of dollars, we have similar problems. Returned to the scale of dollars, the confidence interval based on the logarithm of the claims is [10^3.16 to 10^3.47] = [$1,445 to $2,951]. On the scale of dollars, the 95% confidence t-interval does not include x̄, the observed average claim. Why have we discussed this interval here? The explanation is that the t-interval obtained from the transformed data is similar to the nonparametric confidence interval for the median, [$1,217 to $3,168]. Transforming to a log scale does indeed remove the skewness and produce data that resemble a sample from a normal distribution. This transformation, however, changes the meaning of the confidence interval. A t-interval computed on a log scale and converted back to the scale of dollars is closer to a confidence interval for the median δ than a confidence interval for μ. That's not a problem if you have taken this route to avoid the calculations needed to obtain the nonparametric interval for the median. (Transforming to logs and forming a confidence interval is easy with most software, but the nonparametric interval for δ is less common.) It is a problem, however, if you mistakenly think that you end up with a confidence interval for μ. If you want a confidence interval for μ, you are stuck with the t-interval for the mean of the observed sample and all of the attendant limitations (such as the uncertain coverage).
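The unlogging arithmetic is easy to check; this sketch uses the summary numbers quoted in the text:

```python
# Summary numbers quoted in the text
ybar = 3.312                  # mean of log10(claims)
lo_log, hi_log = 3.16, 3.47   # t-interval endpoints on the log scale

# Unlogging the mean of the logs gives roughly the median claim, not the mean
center = 10 ** ybar
lo, hi = 10 ** lo_log, 10 ** hi_log
print(round(center), round(lo), round(hi))  # 2051 1445 2951
```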

Prediction Intervals
Prediction intervals are a different type of statistical interval that are sometimes, with great misfortune, confused with confidence intervals. A prediction interval (or the closely related tolerance interval) is a range that contains a chosen percentage of the population. Rather than hold a

prediction interval Interval that holds a future draw from the population with chosen probability.


parameter such as μ with a given level of confidence, a prediction interval holds a future draw from a population. A confidence interval for the mean insurance claim, for instance, asserts a range for μ, the average claim in the population. The larger the sample, the smaller the confidence interval becomes. A prediction interval anticipates the size of the next claim, allowing for the random variation associated with an individual value.

Prediction Intervals for the Normal Model


The percentiles of the Empirical Rule are approximate prediction intervals for an observation from a normal distribution. Because a prediction interval must anticipate the next value, not a mean, we cannot rely on the Central Limit Theorem to smooth away deviations from normality. If a population is exactly normally distributed with known mean μ and known standard deviation σ, then 95% of the population lies between μ − 1.96σ and μ + 1.96σ. These intervals are called prediction intervals because we can think of μ as a prediction of a random draw from the population. The term 1.96σ bounds the size of the probable prediction error, P(μ − 1.96σ ≤ X ≤ μ + 1.96σ) = 0.95. That is, 95% of prediction errors X − μ are less than 1.96σ in magnitude. In general, the 100(1 − α)% tolerance interval for a normal population is μ − zα/2 σ to μ + zα/2 σ. Notice that the sample size does not affect this interval; it's the same regardless of n because we assume that μ and σ are known rather than estimated from data. Also, the width of the interval is determined by σ itself, not the standard error. The width of the 95% prediction interval is about 4 times the standard deviation σ. Two aspects of this prediction interval change if μ and σ are estimated from data. First, percentiles of the t-distribution replace the normal percentile zα/2. Second, the interval adds a factor that compensates for using a less accurate guess of the next value. The variance of the prediction error is σ² if μ is known, but (1 + 1/n)σ² if the predictor is X̄. If X ~ N(μ, σ) is an independent draw from the same population as the sample that produces X̄ and S, then

P(X̄ − tα/2,n−1 √(1 + 1/n) S ≤ X ≤ X̄ + tα/2,n−1 √(1 + 1/n) S) = 0.95

Both adjustments make this interval longer than the prediction interval based on μ and σ. It seems reasonable to have a longer interval since we know less about the population. The percentile for the t-distribution is larger than the corresponding percentile for a normal distribution (tα/2,n−1 > zα/2), and the additional factor √(1 + 1/n) > 1.

tolerance interval A statistical interval designed to hold a fraction of the population.
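As an illustration, this sketch reuses the log-scale summary statistics from the transformation section (ȳ = 3.312, s = 0.493, n = 42) and the t value 2.02 quoted in the text; the resulting prediction interval is much wider than the confidence interval [3.16 to 3.47]:

```python
import math

# 95% prediction interval X-bar +/- t * sqrt(1 + 1/n) * S, using the
# log-scale summaries from this chapter; t(0.025, 41) = 2.02 is from the text
ybar, s, n, t_crit = 3.312, 0.493, 42, 2.02
margin = t_crit * math.sqrt(1 + 1 / n) * s
lo, hi = ybar - margin, ybar + margin
print(round(lo, 2), round(hi, 2))  # 2.3 4.32
```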


Nonparametric Prediction Interval


A simple nonparametric procedure produces a prediction interval that does not require assumptions about the shape of the population. This nonparametric interval is also easy to compute. The nonparametric prediction interval relies on a fundamental property of the order statistics X(1) < X(2) < ... < X(n) of a sample. So long as our data lack ties (each value is unique), then P(X(i) ≤ X ≤ X(i+1)) = 1/(n + 1). Every gap between adjacent order statistics has equal probability 1/(n + 1) of holding the next observation. The intervals below the minimum X(1) and above the maximum X(n) also have probability 1/(n + 1): P(X ≤ X(1)) = 1/(n + 1) and P(X(n) ≤ X) = 1/(n + 1). These properties of the order statistics hold regardless of the shape of the population. As an example, the smallest two insurance claims among the 42 in the prior example are x(1) = $158 and x(2) = $255. Hence, the probability that the next claim (after this sample) from the sampled population is less than $158 is 1/(42 + 1) ≈ 0.023. Similarly, the probability that the next claim is between $158 and $255 is also 0.023. As with the nonparametric interval for the median, we can combine several segments to increase the coverage. Each segment that is included increases the coverage by 1/(n + 1). In general, the coverage of the interval between the ith smallest value X(i) and the jth smallest value X(j) is (j − i)/(n + 1) so long as j > i. Using the claims data, P(x(2) ≤ X ≤ x(41)) = P($255 ≤ X ≤ $17,305) = (41 − 2)/43 ≈ 0.91. There is a 91% probability that the next claim is between $255 and $17,305.
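The coverage arithmetic is simple enough to sketch directly, with the order-statistic values from the text:

```python
# Order statistics of the 42 claims quoted in the text
n = 42
x_2, x_41 = 255, 17305

# Coverage of [x(i), x(j)] as a prediction interval is (j - i)/(n + 1)
coverage = (41 - 2) / (n + 1)
print(x_2, x_41, round(coverage, 2))  # 255 17305 0.91
```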

Are You There?


The salaries listed in the prior AYT are $310,000, $350,000, and $425,000. Assume these 3 are a random sample. (a) What is the probability that the next draw from this population is larger than $310,000?⁵ (b) Find a 50% tolerance interval for the next random draw from this population.⁶ (c) Is this the only possible 50% interval, or are there others?⁷

⁵ 0.75, because there is a 25% chance for each of the 4 subintervals.
⁶ The middle interval, from $310,000 to $425,000.
⁷ There are 6 50% intervals; pick any pair of the 4 subintervals defined by the data.


Example 17.1 Executive Salaries

Motivation

state the question

Fees earned by an executive placement service are 5% of the starting annual total compensation package. A current client is looking to move into the telecom industry as CEO. How much can the firm expect to earn by placing this client as a CEO?

Method

describe the data and select an approach

1. Identify parameter, if there is one
2. Identify population
3. Describe data
4. Pick an interval
5. Check conditions

In this situation, directors of the placement service want a range for the likely earnings from placing this candidate. They are not interested in what happens on average (μ or δ), but rather want a range that accommodates the variability among individuals; they want a tolerance interval. The relevant population is hypothetical. From public records, the placement service has the total compensation (in millions of dollars) earned by all 23 CEOs in the telecom industry. That's the population of current CEOs. We can, however, think of these 23 as a sample from the collection of all possible compensation packages that are available within this industry. These 23 compensation totals represent a sample of what is possible. It is clear from the histogram of these data that the distribution is not normal. Consequently, we will use a nonparametric tolerance interval. We'll use a 75%-coverage tolerance interval obtained by trimming off the bottom 3 and top 3 subintervals. The coverage of the prediction interval between X(3) and X(21) is (21 − 3)/24 = 0.75.
SRS Condition. If we view these compensation figures as a sample from the negotiation process used in the telecom industry to set salaries, then it is reasonable to call these data a sample. It is hard to believe, though, that the data are independent. These executives know how much the others earn; after all, we do! We will note this issue in our message.

Mechanics
The following table shows a portion of the sorted observations.

x(1)    0.2
x(2)    0.421338
x(3)    0.743801
x(4)    0.80606

do the analysis


x(6)    1.627999
x(19)   20.064332
x(20)   25.719144
x(21)   29.863393
x(22)   29.914396
x(23)   34.606343

The interval x(3) to x(21), $743,801 to $29,863,393, is a 75% tolerance interval.

Message

summarize the results

The compensation packages of three out of four placements in this industry can be predicted to fall in the range from about $750,000 to $30 million. The implied fees thus range from a relatively paltry $37,500 all the way up to $1,500,000. This range is wide because of the massive variation in pay packages: from $200,000 to almost $35 million. If we take account of possible dependencies in the data, the range may be larger still. The best recommendation might be to suggest that the placement agency needs to learn why some salaries are so much higher than others. (We will come to methods for doing just that beginning in Chapter 19.)

Proportions based on Small Samples


Student's t-distribution provides a confidence interval for means of small samples from a normal population. The nonparametric interval for the median handles inference for the center of a population that is not normal. But neither method applies to a proportion. Here are the results of a small survey. We observed 7 customers make a purchase at a supermarket. The data record whether the customer used a credit card to pay for the purchase: Yes, No, No, Yes, No, No, No. The sample proportion is p̂ = 2/7 ≈ 0.286. How should we compute a 95% confidence interval for p, the population proportion that pay with a credit card? Nonparametric methods don't help because the data have only two distinct values. Because of the matching values, sorting the data is not informative. We cannot use a t-interval either. These yeses and nos (or 0s and 1s) are not a sample from a normal population. With a large sample, the Central Limit Theorem takes care of us, but n = 7 is too small. These data do not satisfy the CLT condition for proportions (n p̂ < 10). The lower endpoint of the z-interval in this summary confirms a problem:
p̂                         0.2857
se(p̂) = √(p̂(1 − p̂)/n)     0.1707
95% z-interval             −0.0489 to 0.6204
n                          7

Table 17-2. Summary of a small survey.


The 95% z-interval is −4.9% to 62%. The confidence interval for the proportion p includes negative values. That doesn't make sense because we know that p > 0 since 2 of these 7 paid with a credit card. This graph of the normal model for the sampling distribution of p̂ shows what went wrong. The mean is p̂ and se(p̂) = √(p̂(1 − p̂)/n).

Figure 17-4. Normal approximation to the sampling distribution.

puts some The normal model for the sampling distribution of p probability below zero in a range that is not possible. A t-distribution does the same thing. The CLT condition for proportions (Chapter 15,16) avoids 10 and n(1- p ) 10. In order to use this problem by requiring both n p the z-test or z-interval for a proportion, the CLT condition requires at least 10 successes and 10 failures. This small sample includes only 2 successes and 5 failures. One remedy for this problem is work directly with a binomial distribution. Another that is easier is to move the sampling distribution away from 0 and 1. The way this is done is a remarkably easy: add four cases to the data, 2 successes and 2 failures. The interval is centered on the = (# successes+2)/(n+4) and uses p to determine the adjusted proportion p standard error in the revised z-interval. Wilsons Interval for a Proportion.8 Add 2 successes and 2 failures to the data and then compute the usual z-interval from the augmented data. (1 p ) (1 p ) p p 1.96 1.96 ,p p n+4 n+4

Checklist
SRS Condition. The data are a simple random sample from the relevant population (that is, the data are Bernoulli trials).
This adjustment moves the center of the sampling distribution closer to 1/2, away from the troublesome boundaries at 0 and 1. This adjustment also packs the sampling distribution closer to p̃. The hard part is to prove that the procedure indeed produces a better 95% confidence interval. This idea for improving a confidence interval by adding more successes and failures dates back to the French mathematician Pierre-Simon Laplace. (A more
⁸ Pierre-Simon Laplace was a renowned mathematician and astronomer of the 18th and 19th centuries and proved a version of the Central Limit Theorem. This interval is studied in Agresti, A., & Coull, B. A. (1998), "Approximate is better than 'exact' for interval estimation of binomial proportions," The American Statistician, 52, 119-126. Wilson also proposed a more elaborate interval, but this one does quite well and is easy to compute.


elaborate confidence interval for p, known as the score interval for a proportion, gives similar results but is more challenging to compute.) In this example, adding 2 successes and 2 failures increases n from 7 to 11 and makes p̃ = (2 + 2)/(7 + 4) ≈ 0.364. The revised 95% z-interval for p is 0.364 ± 1.96 √(0.364(1 − 0.364)/11) ≈ [0.079, 0.647]. With 2 successes and 2 failures added to the data, the Wilson z-interval for the proportion no longer includes negative values.
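A minimal sketch of the adjustment (the function name `plus_four_interval` is ours, not standard):

```python
import math

def plus_four_interval(successes, n, z=1.96):
    """Adjusted z-interval: add 2 successes and 2 failures, then
    compute the usual z-interval from the augmented data."""
    p = (successes + 2) / (n + 4)
    se = math.sqrt(p * (1 - p) / (n + 4))
    return p - z * se, p + z * se

lo, hi = plus_four_interval(2, 7)     # the credit-card survey: 2 of 7
print(round(lo, 3), round(hi, 3))     # about 0.079 and 0.648

lo2, hi2 = plus_four_interval(9, 19)  # the relapse study in Example 17.3
print(round(lo2, 2), round(hi2, 2))   # about 0.27 and 0.68
```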

Example 17.3 Drug Testing

Motivation

state the question

Surgery can remove a tumor, but that does not mean the cancer won't return. To prevent a relapse, pharmaceutical companies develop drugs that delay or eliminate the chance of a relapse. To assess the impact of the medication, we have to have a baseline for comparison. If every patient survives for a long time without the benefits of the drug, it is hard to make the case that health insurance should pay thousands for the treatment. Studies designed to establish the baseline are expensive and can take years. The expense can exceed $30,000 to $40,000 per subject if surgery and hospitalization are involved. The company in this example is developing a drug to prolong the time before a relapse of cancer. For the drug to appeal to physicians and insurers, management believes that the drug must cut the rate of relapses in half. That's the goal, but the scientists need to know if that's possible. As a start, they'd like to know the current time to relapse.

Method
1. Identify parameter, if there is one
2. Identify population
3. Describe data
4. Pick an interval
5. Check conditions

describe the data and select an approach The parameter of interest is p, the probability of relapse over a relevant time horizon. In this case, the horizon is 2 years after surgery. The population is the collection of adults who are older than 50 and have this operation. The data in this case are 19 patients who were observed for 24 months after surgery to remove a single tumor. Among these, 9 suffered a relapse: within 2 years, doctors found another tumor in 9 of the 19 patients.

We will provide the scientists a 95% confidence interval for p. The sample is small, with 9 who relapse and 10 who do not. Before we go further, let's check the relevant conditions. SRS Condition. These data come from a medical study sponsored by the National Institutes of Health and published in a leading journal, so the sample should be representative. CLT Condition for Proportions. We do not quite have enough data to satisfy this condition; it requires at least 10 successes and 10 failures.


These data are a random sample from the population, but they do not meet the CLT condition. To make the 95% confidence interval more reliable, we will use Laplace's adjustment and add 2 successes and 2 failures.

Mechanics

do the analysis

This table shows the standard output of a statistics package without any adjustments. Although this interval does not include negative values as in the prior example, we cannot rely on its being a 95% confidence interval for p since the data violate the CLT condition.

p̂                  0.47368
n                   19
√(p̂(1 − p̂)/n)      0.11455
95% z-interval      0.24917, 0.69824

The improved interval that adds the artificial cases is similar but shifted toward 0.5, and the interval is shorter. By adding 2 successes and 2 failures, p̂ = (9+2)/(19+4) ≈ 0.478, and the 95% z-interval becomes

0.478 ± 1.96 √(0.478(1 − 0.478)/23) ≈ [0.27 to 0.68]
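The numbers in the table and in the adjusted interval can be checked directly. This short Python sketch (variable names are ours) computes both the unadjusted software output and the improved interval:

```python
from math import sqrt

x, n, z = 9, 19, 1.96                       # 9 relapses among 19 patients

# Unadjusted z-interval, as a statistics package would report it
p_hat = x / n
se = sqrt(p_hat * (1 - p_hat) / n)
unadjusted = (p_hat - z * se, p_hat + z * se)

# Laplace's adjustment: add 2 successes and 2 failures
p_adj = (x + 2) / (n + 4)
se_adj = sqrt(p_adj * (1 - p_adj) / (n + 4))
adjusted = (p_adj - z * se_adj, p_adj + z * se_adj)
```

Note that the adjusted interval is both shifted toward 0.5 and slightly shorter, as the text describes.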

Message

summarize the results

We are 95% confident that the proportion of patients under these conditions who suffer a relapse within 24 months of surgery lies in the range of 27% to 68%. In order for the proposed drug to cut this proportion in half, it must be able to reduce the rate of relapses to somewhere between 13% and 34%.


Summary
Not all data satisfy the conditions for a z-interval (proportions) or a t-interval (means). Most often, the failure is due to a small sample or a distribution that is far from normal.

A nonparametric confidence interval for the median provides an alternative to the t-interval for the mean.

By adding 4 artificial observations (2 successes and 2 failures), Wilson's interval produces a more reliable z-interval for the proportion p.

Tolerance intervals (or prediction intervals) are statistical intervals designed to hold a chosen percentage of the population. These intervals do not benefit from averaging and hence require either a nonparametric approach or strong assumptions about the form of the population.

Key Terms
nonparametric statistics, 17-4
order statistics, 17-5
prediction interval, 17-9
tolerance interval, 17-9
Wilson's interval for a proportion, 17-14

Best Practices
Check your assumptions carefully when dealing with small samples. We've been able to rely on the Central Limit Theorem to simplify our methods. We can't do that when we work with small samples. For these, we have to either make strong assumptions, such as normality, or find methods that avoid the assumptions (here, the nonparametric confidence interval for the median).

Consider a nonparametric alternative if you suspect non-normal data. If data are a sample from a normal population, the t-interval is the best alternative. If the data aren't normally distributed, the t-interval can be senseless and do things like include negative values in a confidence interval for a parameter that we know has to be positive.

Use the adjustment procedure for proportions from small samples. It's easy to add 2 successes and 2 failures, and the resulting interval has much better coverage properties.

Verify that your data are a simple random sample. We talked a lot about random samples in Chapter 14, but this is the most important condition of all. Every statistical method requires that the data be a sample from the appropriate population.

Pitfalls

Optimistic use of t-intervals for the mean of a small sample. A t-interval might be appropriate, but only if you are quite sure that the population you're sampling has a normal distribution.

Expecting software to know to use the right procedure. Software will only do so much for you. Most packages do not check to see that your sample size is adequate for the method that you've used. For all it knows, you might be a teacher preparing an example to illustrate what can go wrong!

Thinking you can prove normality using a normal quantile plot. The normal quantile plot can show that data are not a sample from a normal population. If the data remain close to the diagonal reference in this display, however, that's no proof that the population is normal. It only shows that the data might be a sample from a normal population. We cannot prove the null hypothesis of normality.

Using a confidence interval when you need a tolerance interval. If you want a range that anticipates the value of the next observation, you need a tolerance (or prediction) interval. Most 95% confidence intervals contain only a very small proportion of the data; they are designed to hold the population mean with a chosen confidence level, not to contain data values.

Formulas
Order statistics. The data put into ascending order, identified by parentheses in the subscripts:
minimum = X(1) < X(2) < … < X(n) = maximum

Nonparametric confidence interval for the median. Combine subintervals defined by the ordered data, with probabilities assigned to the n + 1 subintervals from the binomial distribution with parameters n and ½. The probability attached to the subinterval between the ith largest value and the (i+1)st largest value is
P(X(i) < median < X(i+1)) = nCi (½)^n

Nonparametric tolerance interval. Combine subintervals defined by the ordered data, with equal probability 1/(n+1) assigned to each. If X denotes an independent random draw from the population, then
P(X(i) < X ≤ X(i+1)) = 1/(n+1)
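These formulas translate directly into a short program. The Python sketch below (the function name and the trimming loop are our own illustration, not from the chapter) computes the binomial probability attached to each subinterval and then trims order statistics symmetrically from both ends for as long as the remaining middle still carries at least the requested coverage:

```python
from math import comb

def median_interval(data, coverage=0.95):
    """Nonparametric confidence interval for the median from order statistics."""
    x = sorted(data)
    n = len(x)
    # P(median lies between X(i) and X(i+1)) = nCi * (1/2)^n, for i = 0..n,
    # where X(0) = -infinity and X(n+1) = +infinity
    probs = [comb(n, i) * 0.5 ** n for i in range(n + 1)]
    best = None
    for k in range(1, n // 2 + 1):
        # Interval [X(k), X(n-k+1)] covers the gaps k..n-k
        cover = sum(probs[k : n - k + 1])
        if cover >= coverage:
            best = (x[k - 1], x[n - k], cover)   # 0-indexed: X(k) is x[k-1]
        else:
            break
    return best

# For the tolerance interval, no search is needed: a single independent draw
# falls in any one gap with probability 1/(n+1), so [X(1), X(n)] holds a new
# observation with probability (n-1)/(n+1).

lo, hi, cover = median_interval(range(1, 12))    # data 1, 2, ..., 11
```

With n = 11 observations, the widest valid trim keeps [X(2), X(10)], whose binomial coverage is about 98.8%; trimming one more order statistic from each side would drop coverage below 95%.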

About the Data


The data for the cancer trial are from the National Center for Health Statistics (NCHS). Data on the cost of claims made to auto insurers come from the Insurance Information Institute. Both sources are available online. Salaries of telecom executives come from company listings in the telecom industry in the 2003 Execucomp database. The stopping distances come from studies done by the National Highway Traffic Safety Administration, located at http://www.nhtsa.dot.gov.


Software Hints
Modern statistics software has reduced the importance of many of the details in this chapter. If you're willing to accept what it does, you can ignore many of the concerns caused by small samples. Some would argue that this chapter is all mechanics, and we tend to agree. Even so, you'll be able to appreciate what your software can do after this chapter. The calculation of the interval for the median in this chapter requires some programming on your part. Alternatively, most statistics packages include so-called nonparametric methods that resemble the methods shown here for the median. Check the documentation for your software.

Excel

Minitab

Minitab includes a specialized confidence interval designed for proportions. Follow the menu commands Stat > Basic Statistics > 1 Proportion to use this approach. (Minitab computes an interval based on the likelihood ratio test rather than the method discussed in this chapter.)

JMP

Following the menu items Analyze > Distribution opens a dialog that allows you to pick the column that holds the data for the test. Specify this column, click OK, and JMP produces the histogram and boxplot of the data. The table below the histogram shows x̄, s, the standard error, and the endpoints of the 95% t-interval for μ. To obtain a different level of confidence, click on the red triangle in the output window beside the name of the variable above the histogram. Choose the item Confidence Interval and specify the level of confidence. If the variable is categorical, the Analyze > Distribution commands produce a frequency table. Click the red triangle above the bar chart, select the item Confidence Interval, and choose the level of the interval. The output window then expands to show a confidence interval for the proportion in each category. JMP obtains these intervals by a refined version (called the score interval) of the methods in this chapter.

