STATISTICAL INFERENCE FOR DISCRETE DISTRIBUTIONS
Howard G. Tucker
University of California, Irvine

TABLE OF CONTENTS

Chapter 1. Review of Basic Probability
  Section 1. Probability and Conditional Probability
  Section 2. Random Variables and their Distributions
  Section 3. Expectation
  Section 4. Limit Theorems
  Section 5. Random Numbers and their Uses
  Section 6. Probability Generating Functions
  Section 7. Hypergeometric and Multinomial Distributions
  Section 8. Sufficient Statistics

Chapter 2. Estimation
  Section 1. Sampling Finite Populations
  Section 2. The Precision of Unbiased Estimates
  Section 3. Review of Maximum Likelihood Estimates
  Section 4. The Rao-Blackwell Theorem

Chapter 3. Analysis of Categorical Data
  Section 1. Randomized and Conditional Tests
  Section 2. The Irwin-Fisher Test
  Section 3. The Irwin-Fisher Test Adapted to Small Probabilities
  Section 4. The Irwin-Fisher Test Adapted to 2-by-2 Contingency Tables
  Section 5. Test for Equality of Two Probabilities in a Multinomial Distribution
  Section 6. Test for Equality of Ratios of Two Multinomial Probabilities
  Section 7. McNemar's Test
  Section 8. The Chi-Square Goodness of Fit Test
  Section 9. Test for Trend in Bernoulli Trials

Chapter 4. Elementary Decision Making
  Section 1. The Neyman-Pearson Fundamental Lemma
  Section 2. Discriminant Analysis
  Section 3. The Sequential Probability Ratio Test
  Section 4. SPRT for Binomial Sampling
  Section 5. The SPRT Ends With Probability One
  Section 6. The Average Sample Number for the SPRT

CHAPTER 1. REVIEW OF BASIC PROBABILITY.

§1. Probability and Conditional Probability. This is a review chapter. Thus definitions will be stated without much discussion or motivation, and theorems will mostly be stated without proofs. The material in this chapter is given to remind the student what he or she has studied in the undergraduate post-calculus course in probability and statistics.

We shall first recall the definition of a probability space, $(\Omega, \mathcal{A}, P)$. The first member of this triple, $\Omega$, denotes a non-empty set of outcomes of a game or experiment. The elements of this set are called elementary events. The second member of this triple, $\mathcal{A}$, denotes a sigma-algebra of subsets of $\Omega$, upon which we elaborate as follows. The elements of $\mathcal{A}$ are called events, and this non-empty set of subsets of $\Omega$ is assumed to satisfy the following requirements: (i) if $A \in \mathcal{A}$, then $A^c \in \mathcal{A}$, where $A^c$ denotes the complement of $A$ with respect to $\Omega$, i.e., $A^c = \Omega \setminus A = \{\omega \in \Omega : \omega \notin A\}$; (ii) $\Omega \in \mathcal{A}$ and $\phi \in \mathcal{A}$, where $\phi$ denotes the empty subset of $\Omega$; and (iii) if $\{A_n\}$ denotes any countable sequence of events, i.e., elements of $\mathcal{A}$, then $\bigcup_n A_n \in \mathcal{A}$.

The third member of the triple, $P$, is called a probability. It is a real-valued function defined over $\mathcal{A}$ which satisfies the following requirements: (i) $0 \le P(A) \le 1$ for all $A \in \mathcal{A}$, (ii) $P(\phi) = 0$ and $P(\Omega) = 1$, and (iii) if $\{A_n\}$ is any countable disjoint sequence of elements of $\mathcal{A}$, then $P(\bigcup_n A_n) = \sum_n P(A_n)$.

Once one has a probability $P$ over $\mathcal{A}$, then it is possible to consider other probabilities, called conditional probabilities. If $A$ and $B$ are elements of $\mathcal{A}$ and if $P(B) > 0$, then the conditional probability of $A$ given $B$, denoted by $P(A|B)$, is defined by $P(A|B) = P(AB)/P(B)$, where $AB = \{\omega \in \Omega : \omega \in A \text{ and } \omega \in B\}$.
It should be noted (which means "it can be easily proved") that for any particular $B \in \mathcal{A}$ such that $P(B) > 0$, the function $P(\cdot|B)$ defined over $\mathcal{A}$ is also a probability, i.e., it satisfies: (i) $0 \le P(A|B) \le 1$ for all $A \in \mathcal{A}$, (ii) $P(\phi|B) = 0$ and $P(\Omega|B) = 1$, and (iii) if $\{A_n\}$ is any countable sequence of disjoint events, then $P(\bigcup_n A_n | B) = \sum_n P(A_n|B)$.

There are three basic theorems concerning conditional probability. They are:

Theorem of Total Probabilities. If $\{A_n\}$ is any countable sequence of disjoint events in $\mathcal{A}$ which satisfies $P(A_n) > 0$ for all $n$, if $H \in \mathcal{A}$, and if $H \subset \bigcup_n A_n$, then $P(H) = \sum_n P(H|A_n)P(A_n)$.

Multiplication Rule. If $A_0, A_1, \dots, A_n$ are any $n+1$ events in $\mathcal{A}$ ($n \ge 1$), and if $P(A_0 A_1 \cdots A_{n-1}) > 0$, then
$$P(A_0 A_1 \cdots A_n) = P(A_0)P(A_1|A_0)P(A_2|A_0A_1)\cdots P(A_n|A_0A_1\cdots A_{n-1}).$$

Bayes' Rule. Under the hypotheses of the theorem of total probabilities,
$$P(A_j|H) = \frac{P(H|A_j)P(A_j)}{\sum_n P(H|A_n)P(A_n)} \quad \text{for all } j.$$

The last notion we wish to recall and review in this section is that of independent events. Let $\mathcal{C}$ be a non-empty subset of $\mathcal{A}$, and suppose that, for every non-empty and finite subset $\{A_1, \dots, A_n\}$ of distinct events in $\mathcal{C}$, the equation $P(A_1 A_2 \cdots A_n) = P(A_1)P(A_2)\cdots P(A_n)$ is satisfied. Then the events in $\mathcal{C}$ are said to be independent. It should be noted that if $A$ and $B$ are independent events, and if $P(B) > 0$, then $P(A|B) = P(A)$.

EXERCISES

1. Let $A_1, A_2, \dots, A_n$ be $n \ge 2$ events. How many equations of the form $P(\bigcap_i A_i) = \prod_i P(A_i)$ must these events satisfy in order that they be independent?

2. Prove: if $A_0, A_1, \dots, A_{n-1}$ are events which satisfy $P(A_0 A_1 \cdots A_{n-1}) > 0$, then $P(A_0 A_1 \cdots A_j) > 0$ for $0 \le j \le n-1$.

3. Prove: if $P(B) > 0$, then $P(A|B) = P(AB|B)$ and $P(B|B) = 1$.

4. Let $\Omega$ consist of $N \ne 0$ elementary events, and let $\mathcal{A}$ denote the set of all subsets of $\Omega$. For every $A \in \mathcal{A}$, define $N_A$ to be the number of elementary events in $A$, and define $P(A) = N_A/N$. Prove that $P$ so defined satisfies the definition of a probability.

5. In Problem 4, if $A \in \mathcal{A}$ and $B \in \mathcal{A}$, and if $N_B > 0$, prove that $P(A|B) = N_{AB}/N_B$.

6. Let $A$ and $B$ be events satisfying $P(AB) > 0$. Prove that $P(CB|A) = P(C|BA)P(B|A)$ for all $C \in \mathcal{A}$.

§2. Random Variables and Their Distributions. We shall recall in this section the definitions of random variables and random vectors and their distribution functions.

A real-valued function $X : \Omega \to \mathbb{R}^1$ defined over $\Omega$ is called a random variable if, for every real number $x$, $\{\omega \in \Omega : X(\omega) \le x\} \in \mathcal{A}$. The distribution function $F_X$ of $X$ is defined by $F_X(x) = P[X \le x]$ for all real $x$. A random variable $X$ is called discrete if there is a countable set of real numbers $\{x_1, x_2, \dots\}$, called its range, such that $\sum_n P[X = x_n] = 1$; in this case its discrete density is the function $f_X(x) = P[X = x]$. When several random variables $X_1, \dots, X_n$ are defined on the same probability space they constitute a random vector, and we write $F_{X_1,\dots,X_n}(x_1,\dots,x_n)$ and $f_{X_1,\dots,X_n}(x_1,\dots,x_n)$ instead of $F_X(x)$ and $f_X(x)$. Finally, it should be recalled that $n$ discrete random variables are said to be independent if their joint distribution function $F_{X_1,\dots,X_n}(x_1,\dots,x_n)$ factors into the product of the univariate marginals $F_{X_1}(x_1),\dots,F_{X_n}(x_n)$ for all real $x_1,\dots,x_n$, i.e., $F_{X_1,\dots,X_n}(x_1,\dots,x_n) = \prod_{i=1}^n F_{X_i}(x_i)$ for all real $x_1,\dots,x_n$. This is equivalent to $f_{X_1,\dots,X_n}(x_1,\dots,x_n) = \prod_{i=1}^n f_{X_i}(x_i)$ for all real $x_1,\dots,x_n$.

In this course we shall deal only with discrete univariate and discrete multivariate distributions, and hence we shall not review here the so-called continuous or mixed distributions.

In closing, one should recall one of the most important discrete random variables, the indicator of an event. If $A$ is an event, then we define its indicator $I_A$ as the random variable which satisfies
$$I_A(\omega) = \begin{cases} 1 & \text{if } \omega \in A,\\ 0 & \text{if } \omega \notin A.\end{cases}$$
Every discrete random variable (vector) can be written as a linear combination of indicators. Indeed, if $X$ is a discrete random variable whose values are $x_1, x_2, \dots$, then $X = \sum_n x_n I_{[X = x_n]}$.

EXERCISES

1. Prove: The random variables $X_1, \dots, X_n$ are independent if and only if the events $[X_1 \le x_1], \dots, [X_n \le x_n]$ are independent for every $n$-tuple of real numbers $x_1, \dots, x_n$.

2. Prove: if $A$ and $B$ are events, and if $A \subset B$, then $P(A) \le P(B)$.
Use this to prove that the distribution function $F_X(x)$ of a random variable $X$ is nondecreasing, i.e., if $x' < x''$, then $F_X(x') \le F_X(x'')$.

3. Prove: If $X$ is a discrete random variable whose range is $\{x_1, x_2, \dots\}$, then $X = \sum_n x_n I_{[X = x_n]}$.

§3. Expectation. We recall here the definition of expectation and some definitions and results related to this definition. If $X$ is a random variable, if $\int_0^\infty (1 - F_X(x))\,dx < \infty$ and if $\int_{-\infty}^0 F_X(x)\,dx < \infty$, then the expectation of $X$, denoted by $EX$ or $E(X)$, is defined by
$$EX = \int_0^\infty (1 - F_X(x))\,dx - \int_{-\infty}^0 F_X(x)\,dx.$$
If $X$ has a discrete distribution as defined in Section 2, then it can be proved that $EX = \sum_n x_n f_X(x_n)$. The corresponding $n$th moments of $X$, $E(X^n)$, are obtained from the formula $E(X^n) = \sum_k x_k^n f_X(x_k)$, providing the sum converges absolutely.

A basic property of expectation is that it is a linear functional, i.e., if $X$ and $Y$ are random variables defined on the same probability space whose expectations exist and are finite, and if $a$ and $b$ are real numbers, then $E(aX + bY) = aE(X) + bE(Y)$.

An important moment is the second central moment, called the variance, whose definition and properties we now review. If $X$ is a random variable, and if $E(X^2) < \infty$, we define the variance of $X$, which we denote by $\mathrm{Var}\,X$ or $\mathrm{Var}(X)$, by $\mathrm{Var}(X) = E((X - EX)^2)$. It follows that $\mathrm{Var}(X) = E(X^2) - (E(X))^2$. If $X_1, \dots, X_n$ are independent random variables with finite second moments, then
$$\mathrm{Var}\Big(\sum_{i=1}^n X_i\Big) = \sum_{i=1}^n \mathrm{Var}(X_i).$$
Variance measures the amount of spread of the distribution function of a random variable $X$ but not its location. Indeed, one easily proves these important properties: $\mathrm{Var}(c + X) = \mathrm{Var}(X)$ and $\mathrm{Var}(cX) = c^2\mathrm{Var}(X)$ for every constant $c$.

EXERCISES

1. Prove: If $X$ is a random variable, if $E(X^2) < \infty$, and if $\varphi : \mathbb{R}^1 \to \mathbb{R}^1$ is the function defined by $\varphi(t) = E((X - t)^2)$, then $\varphi$ achieves its minimum value at $t = E(X)$, and $\varphi(E(X)) = \mathrm{Var}(X)$.

2. Prove: If $X$ is a random variable with finite expectation, and if $X$ has a distribution function which is symmetric about $\mu$, i.e., $P[X - \mu \ge \epsilon] = P[X - \mu \le -\epsilon]$ for every $\epsilon > 0$, then $E(X) = \mu$.

3. Let $X$ be a discrete random variable whose distribution function satisfies
$$F_X(x) = \begin{cases} 0 & \text{if } x < x_1,\\ p_1 & \text{if } x_1 \le x < x_2,\\ p_1 + p_2 & \text{if } x_2 \le x < x_3,\\ 1 & \text{if } x \ge x_3,\end{cases}$$
where $p_1 > 0$, $p_2 > 0$, $p_1 + p_2 < 1$ and $x_1 < x_2 < x_3$. Prove that $\mathrm{Var}(X) > 0$.

4. Prove: If $c$ is a fixed constant, and if $X$ is a random variable satisfying $P[X = c] = 1$, then $\mathrm{Var}(X) = 0$.

5. Prove: If $X$ is a discrete random variable satisfying $E(X^2) < \infty$, then $\mathrm{Var}(X) = 0$ if and only if there exists a constant $c$ such that $P[X = c] = 1$.

6. Prove the properties of variance given in this section for a discrete random variable $X$ with finite second moment.

7. Prove: If $X$ is a discrete random variable with finite expectation, then $\int_0^\infty (1 - F_X(x))\,dx - \int_{-\infty}^0 F_X(x)\,dx = \sum_n x_n f_X(x_n)$, where $\{x_1, x_2, \dots\}$ is the range of $X$.

§4. Limit Theorems. Probability theory contains limit theorems that are useful in making approximations. We recall here the basic limit theorems that we shall make most use of in the future. These theorems deal with a sequence of random variables $\{X_n\}$, all of which have the same discrete distribution function $F(x)$ and are independent. We say in this case that $\{X_n\}$ are independent and identically distributed with common discrete distribution function $F$. A short way of writing this is: $\{X_n\}$ are i.i.d.$(F)$. It should be noted that if $\{X_n\}$ are i.i.d.$(F)$, and if the $m$th moment of $X_1$ exists and is finite, then the $m$th moment exists, is finite and is the same for every $X_n$.

Definition: A sequence of random variables $\{Z_n\}$ is said to converge in probability to a random variable $Z$ (written: $Z_n \xrightarrow{P} Z$) if, for every $\epsilon > 0$, $P[|Z_n - Z| \ge \epsilon] \to 0$ as $n \to \infty$.
LEMMA 1 (Chebyshev's inequality). If $X$ is a random variable, and if $E(X^2) < \infty$, then
$$P[|X - EX| \ge \epsilon] \le \frac{\mathrm{Var}(X)}{\epsilon^2} \quad \text{for every } \epsilon > 0.$$

Proof: Let $R$ denote the range of the random variable $X$. Then, for any $\epsilon > 0$,
$$E(X^2) = \sum_{x \in R} x^2 P[X = x] = \sum_{x \in R,\, |x| < \epsilon} x^2 P[X = x] + \sum_{x \in R,\, |x| \ge \epsilon} x^2 P[X = x] \ge \epsilon^2 \sum_{x \in R,\, |x| \ge \epsilon} P[X = x] = \epsilon^2 P[|X| \ge \epsilon].$$
If the random variable $X$ is replaced by the random variable $X - E(X)$, the conclusion is obtained. QED.

THEOREM 1 (Law of Large Numbers). If $\{X_n\}$ are i.i.d.$(F)$ with common expectation $\mu$ and common variance $\sigma^2$, and if $\bar X_n$ denotes the $n$th arithmetic mean, i.e., $\bar X_n = (X_1 + \cdots + X_n)/n$, then $\bar X_n \xrightarrow{P} \mu$ as $n \to \infty$.

Proof: By properties of variance we obtain $\mathrm{Var}(\bar X_n) = \sigma^2/n$. Thus, by Chebyshev's inequality, for every $\epsilon > 0$, $P[|\bar X_n - \mu| \ge \epsilon] \le \sigma^2/(n\epsilon^2)$. Letting $n \to \infty$, we obtain $P[|\bar X_n - \mu| \ge \epsilon] \to 0$ as $n \to \infty$, i.e., $\bar X_n \xrightarrow{P} \mu$. QED.

Note that the law of large numbers states: for every $\epsilon > 0$, no matter how small $\epsilon$ might be, the probability that $\bar X_n$ will differ from the common expectation by an amount less than $\epsilon$ converges to 1 as $n \to \infty$. In practical terms, this states that $\bar X_n$ is a "good estimate" of $\mu$; this statement will be discussed in more precise terms later.

THEOREM 2. If $\{X_n^{(1)}\}, \dots, \{X_n^{(k)}\}$ are $k$ sequences of random variables such that, for $1 \le j \le k$, $X_n^{(j)} \xrightarrow{P} a_j$ as $n \to \infty$ for some constant $a_j$, and if $f : \mathbb{R}^k \to \mathbb{R}^1$ is a function that is continuous at the point $(a_1, \dots, a_k) \in \mathbb{R}^k$, then $f(X_n^{(1)}, \dots, X_n^{(k)}) \xrightarrow{P} f(a_1, \dots, a_k)$ as $n \to \infty$.

Proof: Let $\epsilon > 0$ be arbitrary. Since by hypothesis $f$ is continuous at $(a_1, \dots, a_k)$, there exists $\delta_\epsilon > 0$ such that if $|x_1 - a_1| < \delta_\epsilon, \dots, |x_k - a_k| < \delta_\epsilon$, then $|f(x_1, \dots, x_k) - f(a_1, \dots, a_k)| < \epsilon$. Hence
$$P[|f(X_n^{(1)}, \dots, X_n^{(k)}) - f(a_1, \dots, a_k)| \ge \epsilon] \le \sum_{j=1}^k P[|X_n^{(j)} - a_j| \ge \delta_\epsilon] \to 0 \text{ as } n \to \infty. \quad \text{QED.}$$

As an application of the above theorem and the law of large numbers we prove that the sample variance converges to the population variance in probability as the sample size tends to infinity. More precisely, let $\{X_n\}$ be i.i.d.$(F)$, and assume that the common distribution $F$ has finite fourth moment, i.e., $\sum_x x^4 f(x) < \infty$. It follows that the sequence $\{X_n^2\}$ of squares is a sequence of i.i.d. random variables with finite common second moment. Let us denote
$$\bar X_n = \frac{1}{n}\sum_{i=1}^n X_i \quad \text{and} \quad s_n^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar X_n)^2.$$
Then one easily verifies that
$$s_n^2 = \frac{n}{n-1}\Big(\frac{1}{n}\sum_{i=1}^n X_i^2 - \bar X_n^2\Big).$$
Now consider these three sequences of random variables: $\{n/(n-1)\}$, $\{\frac{1}{n}\sum X_i^2\}$ and $\{\bar X_n\}$. Since $n/(n-1) \to 1$, then, as a degenerate random variable, $n/(n-1) \xrightarrow{P} 1$. By the law of large numbers, $\frac{1}{n}\sum X_i^2 \xrightarrow{P} E(X_1^2)$ and $\bar X_n \xrightarrow{P} E(X_1)$. The function $f(x,y,z)$ defined by $f(x,y,z) = x(y - z^2)$ is easily seen to be continuous at the point $(1, E(X_1^2), E(X_1))$. Since $s_n^2 = f(n/(n-1), \frac{1}{n}\sum X_i^2, \bar X_n)$, it follows from Theorem 2 that $s_n^2 \xrightarrow{P} E(X_1^2) - (E(X_1))^2 = \mathrm{Var}(X_1)$, which proves the assertion.

The following stronger theorem, known as the central limit theorem, did not receive a rigorous proof in the undergraduate prerequisite course. This theorem, or the two theorems used to prove it, must be accepted without proof until basic results from measure-theoretic probability are learned.

THEOREM 3 (The Central Limit Theorem). If $\{X_n\}$ satisfies the hypotheses of Theorem 1, then for every real $x$,
$$P\Big[\frac{\sqrt{n}(\bar X_n - \mu)}{\sigma} \le x\Big] \to \Phi(x) \text{ as } n \to \infty, \quad \text{where } \Phi(x) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^x e^{-t^2/2}\,dt.$$

An important special case concerns Bernoulli trials. A sequence of Bernoulli trials with respect to an event $E$ is a sequence of independent trials in each of which $E$ occurs with the same probability $p = P(E)$. If $S_n$ denotes the number of times $E$ occurs in the first $n$ trials, then $S_n$ is the sum of $n$ i.i.d. random variables, each taking the value 1 with probability $p$ and 0 with probability $1-p$, so that $E(S_n) = np$ and $\mathrm{Var}(S_n) = np(1-p)$.

THEOREM 4. If $S_n$ is the number of occurrences of $E$ in $n$ Bernoulli trials with $p = P(E)$, then for every $\epsilon > 0$, $P[|S_n/n - p| \ge \epsilon] \to 0$ as $n \to \infty$.

THEOREM 5 (Laplace-DeMoivre Theorem). If $S_n$ is as in Theorem 4, then for every real $x$,
$$P\Big[\frac{S_n - np}{\sqrt{np(1-p)}} \le x\Big] \to \Phi(x) \text{ as } n \to \infty.$$

An application of this theorem is as follows. A coin is tossed 400 times. Assuming that it is unbiased, we shall find (approximately) the probability that the number of times it comes up heads is not greater than 205. Using the Laplace-DeMoivre theorem as it stands, an approximation to this probability is worked out as follows:
$$P[S_{400} \le 205] = P\Big[\frac{S_{400} - 200}{10} \le \frac{205 - 200}{10}\Big] \approx \Phi(0.5) \approx 0.6915.$$
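The two numbers in this example can be checked directly. The sketch below (plain Python, written only for this illustration and not part of the original text) computes the exact binomial probability $P[S_{400} \le 205]$ and the Laplace-DeMoivre approximation $\Phi(0.5)$; the small disagreement between them motivates the refinement discussed next.

```python
from math import comb, erf, sqrt

def phi(x):
    # Standard normal distribution function, written in terms of the error function.
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

n, p, k = 400, 0.5, 205

# Exact binomial probability P[S_n <= k].
exact = sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k + 1))

# Laplace-DeMoivre approximation, with no correction of any kind.
approx = phi((k - n * p) / sqrt(n * p * (1 - p)))

print(f"exact P[S_400 <= 205]          = {exact:.4f}")   # about 0.709
print(f"Laplace-DeMoivre approximation = {approx:.4f}")  # about 0.691
```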
A more accurate approximation of $P[S_n \le k]$ for a sequence of $n$ Bernoulli trials is obtained by using the integer correction. If $p = P(E_j)$ for all $j$, then, indeed,
$$P[S_n \le k] = P\Big[\frac{S_n - np}{\sqrt{np(1-p)}} \le \frac{k - np}{\sqrt{np(1-p)}}\Big],$$
but for this right hand expression, instead of approximating it by $\Phi((k - np)/\sqrt{np(1-p)})$, we obtain greater precision by using $\Phi((k + \tfrac12 - np)/\sqrt{np(1-p)})$. For the example above, if we use the integer correction, we obtain $P[S_{400} \le 205] \approx \Phi(0.55) \approx 0.709$.

A sequence of random variables $\{X_n\}$ which is i.i.d.$(F)$ for some distribution $F$ is realized in practice by taking observations on a population with replacement. Ideally, one stirs the population as if it were a collection of numbered tags in a bowl. One selects an individual or element from this population, takes note of its value and then returns it to the population. In this manner, $X_1$ has been observed. One repeats this process to obtain $X_2$. A sample of size $n$, $X_1, \dots, X_n$, is the result of following the procedure for selecting a tag $n$ times. In this case, $X_1, \dots, X_n$ are i.i.d.$(F)$, where $F$ is their common distribution function. In this case $F$ is defined by
$$F(x) = \frac{\text{number of tags in the bowl with numbers} \le x}{N}.$$

EXERCISES

1. Prove: If $X$ is a discrete random variable with finite expectation, then $P[|X| \ge \epsilon] \le E(|X|)/\epsilon$ for every $\epsilon > 0$.

§5. Random Numbers and Their Uses. A table of random numbers (or any random number generator) offers one the opportunity of simulating any game or experiment where the outcome is due to chance. Basically, the single digit columns (or rows) of a page of this table are successively formed by a procedure that is assumed to be equivalent to the following. Ten chips numbered $0, 1, \dots, 9$ are placed in a box. They are thoroughly mixed, one is selected at random, the number on it is recorded, and it is returned to the box. Thus a table of recorded random numbers gives us the results of repeating this procedure. Each column (or row) is in effect obtained in this manner. Thus it is not necessary for us to have the chips and box on hand, and we need not spend the time in the mechanics of this procedure; we merely use a handy table of random numbers or any calculator or computer random number generator.

Suppose we wish to simulate a game in which $A$ and $B$ are events, where $P(A) = .6$, $P(B) = .5$ and $P(AB) = .3$. With the ten chips and the box, we might decide that event $A$ occurs if we select a chip with a number in the set $\{1,2,3,4,5,6\}$ and $B$ occurs if we select a chip from the set $\{4,5,6,7,8\}$. Observe that $AB$ occurs if and only if the chip selected is from the set $\{4,5,6\}$. We can simulate 100 plays of this game almost immediately by selecting a page and a column at random and selecting 100 successive integers down that column and then down the next.

If one wishes to select a two-decimal number at random in the unit interval $(0,1)$ (i.e., with the uniform distribution), one would select two chips at random with replacement. If the number on the first chip is $x$, and if the number on the second chip is $y$, then the two-decimal number selected at random is $.xy$. We observe that the probability of obtaining each possible outcome is $1/100$. This is simulated 250 times with a table of random numbers as follows. One first selects a page at random and then a pair of adjacent columns at random. Then, one after another, one records the number-pairs in each row of these two columns. When the bottom of the page is reached, one goes to the top of the next two adjacent columns, continuing until one has recorded all 250 integer pairs.

Sampling from finite populations with and without replacement is not a short story, and the uses of a table of random numbers to effect various sampling schemes require a certain amount of mathematical development.
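The chips-in-a-box scheme translates directly into a computer simulation. The following sketch simulates 100 plays of the game above, with a pseudo-random digit standing in for a drawn chip; the digit sets for $A$ and $B$ are the ones chosen in the text, and the seed is an arbitrary choice made only for reproducibility.

```python
import random

random.seed(1)  # arbitrary seed, for reproducibility only

A_digits = {1, 2, 3, 4, 5, 6}   # P(A) = 6/10
B_digits = {4, 5, 6, 7, 8}      # P(B) = 5/10; A and B share {4,5,6}, so P(AB) = 3/10

counts = {"A": 0, "B": 0, "AB": 0}
plays = 100
for _ in range(plays):
    chip = random.randrange(10)          # one "chip": a random digit 0,...,9
    if chip in A_digits:
        counts["A"] += 1
    if chip in B_digits:
        counts["B"] += 1
    if chip in A_digits and chip in B_digits:
        counts["AB"] += 1

for event, c in counts.items():
    print(event, c / plays)   # relative frequencies near .6, .5 and .3
```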
If one has a population of 100 high-rise buildings and wanted to select 5 of them at random, it would be somewhat difficult to put them all in an urn, stir them thoroughly and then select the five. Indirectly one could do this as follows: number the buildings from 1 to 100, then get 100 metal-rimmed tags, number them from 1 to 100, put the tags in a box, stir thoroughly and select five without replacement. If one wishes to select five buildings at random with replacement, one would select a tag at random, record the number on it, return it to the box and go on to the next selection, each time recording the number and returning the tag to the box until five not necessarily distinct numbers are recorded.

In sampling $n$ times with replacement from a population of $N$ units, $C : U_1, \dots, U_N$, our fundamental probability set, $\Omega$, will be the set of all ordered $n$-tuples of units $\{(U_{i_1}, \dots, U_{i_n})\}$ with repetitions allowed in each $n$-tuple. There are $N^n$ such equally likely outcomes, and thus the probability of any one particular outcome, say $(U_{k_1}, \dots, U_{k_n})$, is $P[(U_{k_1}, \dots, U_{k_n})] = 1/N^n$.

In sampling $n$ times without replacement from the same population of $N$ units, the fundamental probability set, $\Omega$, will be the set of all ordered $n$-tuples $\{(U_{i_1}, \dots, U_{i_n})\}$ of distinct units from $C : U_1, \dots, U_N$. There are $N!/(N-n)!$ such equally likely outcomes, and the probability of any one particular outcome, say $(U_{k_1}, \dots, U_{k_n})$ (in this order), is $P[(U_{k_1}, \dots, U_{k_n})] = (N-n)!/N!$.

EXERCISES

1. How would you simulate a game, using a table of random numbers, in which events $A$ and $B$ satisfy $P(A) = .36$, $P(B) = .5$ and $P(AB) = .18$?

2. Let $X$ be a random variable whose discrete density places probabilities $.35$, $.25$ and $.40$ on its three possible values. How would you simulate a sample of size $n$ with this density? I.e., using a table of random numbers, how would you simulate observations on $n$ i.i.d. random variables $X_1, \dots, X_n$, all with the same discrete density as $X$?

3. Suppose $X$ and $Y$ are two random variables whose joint distribution is given by the graph below. How would you simulate one observation on the random vector $(X, Y)$ through the use of a random number generator?

4. There are two groups of patients with 10 patients in each group. Group A is given treatment A, and each patient in this group responds positively with probability $p_A = .60$. Group B is given treatment B, and each patient in this group responds with probability $p_B = .80$. The patients can be assumed to be independent of one another. How would you simulate the outcomes of this clinical trial with the aid of a table of random numbers?

5. An urn contains 10 red balls and 15 white balls. One selects 6 balls at random without replacement to get $X$ red balls and $6 - X$ white balls. How would you simulate this trial through the use of a table of random numbers?

§6. Probability Generating Functions. Probability generating functions are useful in connection with distributions of non-negative integer-valued random variables, and they exist only for such random variables.

Definition. If $X$ is a non-negative integer-valued random variable, i.e., if $\sum_{n \ge 0} P[X = n] = 1$, then the probability generating function, $\varphi_X$, of $X$ is the function defined by
$$\varphi_X(u) = \sum_{n \ge 0} P[X = n]\,u^n$$
for all real $u$ for which the series converges.

From the material on power series covered in elementary calculus, it follows that the radius of convergence of the series defining $\varphi_X$ is not less than 1. Also, for $n \ge 0$,
$$P[X = n] = \frac{\varphi_X^{(n)}(0)}{n!}.$$
This last remark shows that $\varphi_X$ and the distribution function of $X$ determine each other.
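Because a random variable with finitely many values has a polynomial generating function, the remark that $\varphi_X$ and the distribution determine each other can be made concrete with a few lines of code. The sketch below uses the density of a fair die (an arbitrary choice, not an example from the text), stores $\varphi_X$ by its coefficient list, and recovers each $P[X = n]$ as $\varphi_X^{(n)}(0)/n!$.

```python
from math import factorial

# Density of a fair six-sided die: density[n] = P[X = n] for n = 0, 1, ..., 6.
# (An arbitrary illustration; this particular density is not from the text.)
density = [0.0] + [1 / 6] * 6

# phi_X(u) = sum_n P[X = n] u^n is a polynomial, stored here by its coefficient list.

def derivative(coeffs):
    """Coefficient list of the derivative of the polynomial with coefficients coeffs."""
    return [n * c for n, c in enumerate(coeffs)][1:] or [0.0]

coeffs = list(density)            # coefficients of phi_X^(n), updated as n grows
for n in range(len(density)):
    value_at_zero = coeffs[0]     # phi_X^(n)(0) is the constant coefficient
    recovered = value_at_zero / factorial(n)
    print(f"P[X = {n}] recovered from the pgf: {recovered:.4f}  (true value {density[n]:.4f})")
    coeffs = derivative(coeffs)
```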
THEOREM 1. If $X$ and $Y$ are independent, non-negative integer-valued random variables, then $\varphi_{X+Y}(u) = \varphi_X(u)\varphi_Y(u)$ for all $u$ for which $\varphi_X$ and $\varphi_Y$ are defined.

Proof: We first observe that
$$\varphi_X(u)\varphi_Y(u) = \Big(\sum_i P[X = i]u^i\Big)\Big(\sum_j P[Y = j]u^j\Big) = \sum_n \Big(\sum_{i=0}^n P[X = i]P[Y = n-i]\Big)u^n.$$
But, since $X$ and $Y$ are independent, $\sum_{i=0}^n P[X = i]P[Y = n-i] = P[X + Y = n]$. Hence $\varphi_X(u)\varphi_Y(u) = \sum_n P[X+Y = n]u^n = \varphi_{X+Y}(u)$. QED.

THEOREM 2. If $X$ is a non-negative integer-valued random variable, and if the radius of convergence of the power series expansion of $\varphi_X(u)$ is greater than 1, then $E(X) = \varphi_X'(1)$, $E(X^2) = \varphi_X''(1) + \varphi_X'(1)$, and $\mathrm{Var}(X) = \varphi_X''(1) + \varphi_X'(1) - (\varphi_X'(1))^2$.

Proof: Since the radius of convergence of the power series expansion of $\varphi_X(u)$ is greater than 1, we may differentiate term by term at $u = 1$ to obtain
$$\varphi_X'(1) = \sum_n n P[X = n] = E(X), \quad \text{and} \quad \varphi_X''(1) = \sum_n n(n-1)P[X = n] = E(X^2) - E(X).$$
Thus $E(X^2) = \varphi_X''(1) + \varphi_X'(1)$, and $\mathrm{Var}(X) = E(X^2) - (E(X))^2 = \varphi_X''(1) + \varphi_X'(1) - (\varphi_X'(1))^2$. QED.

Now we review some important discrete distributions that occur in statistics. We begin with the binomial distribution. In Section 4 we defined what is meant by a sequence of Bernoulli trials with respect to an event $E$ which might or might not occur in a trial, its probability of occurring in a particular trial being $p = P(E)$. If $X$ denotes the number of times that $E$ occurs in $n$ Bernoulli trials, then $X$ is said to have the binomial distribution, $B(n,p)$, and we sometimes write: $X$ is $B(n,p)$. This random variable has the density
$$f_X(k) = \begin{cases} \binom{n}{k}p^k(1-p)^{n-k} & 0 \le k \le n,\\ 0 & \text{otherwise.}\end{cases}$$
Upon recalling the binomial theorem, $(a+b)^n = \sum_{k=0}^n \binom{n}{k}a^k b^{n-k}$, one sees that the probabilities of the binomial distribution add up to 1.

THEOREM 3. If $X$ is $B(n,p)$, then $\varphi_X(u) = (1 + p(u-1))^n$, $E(X) = np$ and $\mathrm{Var}(X) = np(1-p)$.

Proof: By the definition of $\varphi_X$ we have $\varphi_X(u) = \sum_{k=0}^n \binom{n}{k}p^k(1-p)^{n-k}u^k$, and applying the binomial theorem, we obtain $\varphi_X(u) = (pu + (1-p))^n$, which yields our first conclusion. Applying Theorem 2 to this formula we obtain the formulas for mean and variance. QED.

There is a reproductive property of the binomial distribution, which is the following.

THEOREM 4. If $X$ and $Y$ are independent random variables, if $X$ is $B(m,p)$ and if $Y$ is $B(n,p)$, then $X + Y$ is $B(m+n,p)$.

Proof: By Theorems 1 and 3, $\varphi_{X+Y}(u) = \varphi_X(u)\varphi_Y(u) = (1 + p(u-1))^{m+n}$. QED.

The Poisson distribution ranks among the more useful distributions in statistics. A random variable $X$ is said to have a Poisson distribution with parameter $\lambda > 0$ if $P[X = n] = e^{-\lambda}\lambda^n/n!$, $n = 0, 1, 2, \dots$. In this case we write: $X$ is $P(\lambda)$. Because of the never-to-be-forgotten identity, $e^\lambda = \sum_{n \ge 0}\lambda^n/n!$, one can easily verify that the probabilities add up to one.

THEOREM 5. If $X$ is $P(\lambda)$, then $\varphi_X(u) = e^{\lambda(u-1)}$ and $E(X) = \mathrm{Var}(X) = \lambda$.

Proof: By the definitions of $\varphi_X$ and $P(\lambda)$, and by the never-to-be-forgotten identity given above, we have $\varphi_X(u) = \sum_n e^{-\lambda}\lambda^n u^n/n! = e^{-\lambda}e^{\lambda u} = e^{\lambda(u-1)}$. Applying Theorem 2, we obtain $E(X) = \mathrm{Var}(X) = \lambda$. QED.

The Poisson distribution also has a reproductive property, given by the next theorem.

THEOREM 6. If $X$ and $Y$ are independent random variables, if $X$ is $P(\lambda)$ and if $Y$ is $P(\mu)$, then $X + Y$ is $P(\lambda + \mu)$.

Proof: By Theorems 1 and 5, $\varphi_{X+Y}(u) = \varphi_X(u)\varphi_Y(u) = e^{\lambda(u-1)}e^{\mu(u-1)} = e^{(\lambda+\mu)(u-1)}$, which is the probability generating function of $P(\lambda + \mu)$. QED.
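Theorem 6 can also be checked numerically: convolving two Poisson densities term by term must reproduce the Poisson density with the summed parameter. The sketch below does this for $\lambda = 1.5$ and $\mu = 2.3$ (arbitrary values chosen only for the check), truncating all densities at a point beyond which the tails are negligible.

```python
from math import exp, factorial

def poisson_density(lam, n_max):
    """P[X = n] = e^{-lam} lam^n / n! for n = 0, ..., n_max."""
    return [exp(-lam) * lam**n / factorial(n) for n in range(n_max + 1)]

lam, mu, n_max = 1.5, 2.3, 40   # arbitrary parameters for the check
f = poisson_density(lam, n_max)
g = poisson_density(mu, n_max)

# Density of X + Y by direct convolution: P[X + Y = n] = sum_i P[X = i] P[Y = n - i].
conv = [sum(f[i] * g[n - i] for i in range(n + 1)) for n in range(n_max + 1)]

# Theorem 6 says this must agree with the P(lam + mu) density.
target = poisson_density(lam + mu, n_max)
print(max(abs(a - b) for a, b in zip(conv, target)))   # ~1e-16, i.e. equal up to rounding
```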
We conclude this section by turning from probability generating functions themselves to the relationship between the two distributions just discussed. Their relationship is that of an important limit theorem.

THEOREM 7 (Poisson approximation to the binomial distribution). If $X_n$ is $B(n, p_n)$, and if $np_n \to \lambda > 0$ as $n \to \infty$, then, for $k = 0, 1, 2, \dots$, $P[X_n = k] \to e^{-\lambda}\lambda^k/k!$ as $n \to \infty$.

Proof: We shall apply the well known result about limits, namely, $a_n \to a$ as $n \to \infty$ if and only if $\lim_{n\to\infty}(1 + a_n/n)^n = e^a$. Now
$$P[X_n = k] = \binom{n}{k}p_n^k(1-p_n)^{n-k} = \frac{(np_n)^k}{k!}\prod_{j=1}^{k-1}\Big(1 - \frac{j}{n}\Big)\Big(1 - \frac{np_n}{n}\Big)^{n-k}.$$
Since $np_n \to \lambda$ and $j/n \to 0$ as $n \to \infty$ for $1 \le j \le k$, the above limit result yields $\lim_{n\to\infty} P[X_n = k] = e^{-\lambda}\lambda^k/k!$. QED.

Theorem 7 lays the theoretical foundations for using the Poisson distribution in practice. Very frequently, binomial distributions arise when $n$ is very large, $p$ is very small, and $np$ is "just right." In such a case this binomial distribution may be approximated by $P(np)$. This happens when one considers "rare events", like the number of automobile accidents on a freeway during one day, the number of people in a census tract who die on a given day, etc.

EXERCISES

1. Let $X$ denote the trial number in a sequence of Bernoulli trials at which the event $E$ occurs for the first time. Prove that $P[X = n] = (1-p)^{n-1}p$ for $n = 1, 2, \dots$, and that $\varphi_X(u) = pu/(1 - (1-p)u)$.

2. Suppose $Y$ is uniformly distributed over $\{1, 2, \dots, N\}$, i.e., $P[Y = k] = 1/N$ for $1 \le k \le N$. Prove that $\varphi_Y(u) = \dfrac{u(1 - u^N)}{N(1 - u)}$ for $u \ne 1$.

3. The distribution given in Problem 1 is called the geometric distribution. Prove that $E(X) = 1/p$.

4. In Problem 1, suppose that $Y$ denotes the trial number at which $E$ occurs for the second time. Find the probability generating function of $Y$, find the density of $Y$, and prove that $\varphi_Y(u) = (\varphi_X(u))^2$.

5. Prove: If in Problem 1 we let $X_1 = X$, and if, for $i = 2, 3, \dots$, we let $X_i$ denote the number of trials after the $(i-1)$th occurrence of $E$ needed for the $i$th occurrence to take place, then $X_1, X_2, \dots$ are i.i.d. random variables.

6. In Problem 5, find the density of $X_1 + \cdots + X_r$.

§7. The Hypergeometric and Multinomial Distributions. Sampling with replacement produces (a sequence of) random variables which are i.i.d. However desirable this is mathematically, it is sometimes more desirable from both a theoretical and a practical point of view to sample without replacement. The most basic distribution obtained when this is done is the hypergeometric distribution. Let us suppose we have an urn which contains $r$ red balls and $b$ black balls. If one selects $n$ of these balls at random without replacement, $1 \le n \le r + b$, and if $X$ denotes the number of red balls obtained, then $X$ is said to have the hypergeometric distribution, with density
$$P[X = k] = \binom{r}{k}\binom{b}{n-k}\Big/\binom{r+b}{n} \quad \text{for } \max\{0, n-b\} \le k \le \min\{r, n\}.$$
The multinomial distribution $MN(n, p_1, \dots, p_r)$ arises from $n$ independent trials in each of which exactly one of $r$ outcomes occurs, the $i$th outcome with probability $p_i$. If $K_i$ denotes the number of trials in which the $i$th outcome occurs, then
$$P[K_1 = k_1, \dots, K_r = k_r] = \frac{n!}{k_1!\cdots k_r!}\,p_1^{k_1}\cdots p_r^{k_r} \quad \text{for } k_1 \ge 0, \dots, k_r \ge 0,\ k_1 + \cdots + k_r = n.$$

§8. Sufficient Statistics. A statistic $Y = g(X_1, \dots, X_n)$ is said to be sufficient for an unknown parameter $\theta \in \Theta$ if the conditional distribution of $(X_1, \dots, X_n)$ given the event $[Y = y]$ does not depend on $\theta$, for every $y$ in the range of $Y$.

Let $X_1, \dots, X_n$ be a sample on a random variable whose distribution is $P(\lambda)$, where $\lambda > 0$. In this case, $\theta = \lambda$ and $\Theta = (0, \infty)$. We shall show that $\bar X_n$ is sufficient for $\lambda$. We observe that if $x_1, \dots, x_{n-1}$ are $n-1$ non-negative integers satisfying $0 \le x_1 + \cdots + x_{n-1} \le k$, then
$$P([X_1 = x_1]\cdots[X_{n-1} = x_{n-1}] \mid [\bar X_n = k/n]) = \frac{k!}{x_1!\cdots x_{n-1}!\,(k - x_1 - \cdots - x_{n-1})!}\Big(\frac{1}{n}\Big)^k,$$
which does not depend on $\lambda$. Hence, by the definition given above, $\bar X_n$ is a sufficient statistic for $\lambda$.

Next, let $X_1, \dots, X_n$ be i.i.d. $B(1,p)$. Let $S = \sum_{i=1}^n X_i$. We shall show that $S$ is sufficient for $p$. For any $0$-$1$ values $x_1, \dots, x_n$ with $x_1 + \cdots + x_n = s$,
$$P([X_1 = x_1]\cdots[X_n = x_n] \mid [S = s]) = \frac{p^s(1-p)^{n-s}}{\binom{n}{s}p^s(1-p)^{n-s}} = \frac{1}{\binom{n}{s}},$$
which does not depend on $p$. Hence $S$ is a sufficient statistic for $p$.

EXERCISES

1. Let $X$ and $Y$ be independent random variables, each with distribution $P(\lambda)$, where $\lambda > 0$ is unknown. Prove that $X + Y$ is a sufficient statistic for $\lambda$.

2. Let $(X_1, X_2)$ be $MN(m, p, \theta p)$, let $(Y_1, Y_2)$ be $MN(n, q, \theta q)$, and assume $(X_1, X_2)$ and $(Y_1, Y_2)$ are independent random vectors. Prove that the random vector $(X_1 + Y_1, X_1 + X_2, Y_1 + Y_2)$ is sufficient for $(m, n, p, q, \theta)$.
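The sufficiency computation for the Bernoulli sample above can also be verified by brute force. The sketch below enumerates all 0-1 samples of a small size (the choice $n = 4$ and the two values of $p$ are arbitrary, made only for the illustration) and confirms that the conditional probabilities given $S = s$ are the same for both values of $p$, each equal to $1/\binom{n}{s}$.

```python
from itertools import product
from math import comb

# Check by enumeration that for X_1, ..., X_n i.i.d. B(1, p) the conditional distribution
# of the sample given S = sum X_i does not depend on p, so that S is sufficient for p.
n = 4   # small sample size, arbitrary

def conditional_probs(p, s):
    """P([X_1=x_1]...[X_n=x_n] | [S=s]) for every 0-1 sequence with sum s."""
    p_s = comb(n, s) * p**s * (1 - p)**(n - s)          # P[S = s]
    out = {}
    for xs in product([0, 1], repeat=n):
        if sum(xs) == s:
            joint = p**s * (1 - p)**(n - s)             # P[X_1 = x_1, ..., X_n = x_n]
            out[xs] = joint / p_s
    return out

for s in range(n + 1):
    a = conditional_probs(0.2, s)
    b = conditional_probs(0.7, s)
    same = all(abs(a[k] - b[k]) < 1e-12 for k in a)
    print(s, same, round(next(iter(a.values())), 4))    # each value equals 1 / C(n, s)
```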
CHAPTER 2. ESTIMATION.

§1. Sampling Finite Populations. In this chapter we shall consider a few topics in statistical estimation. Some of these might be in the nature of review; some might be new. We shall begin by introducing unbiased estimation as it applies to sampling finite populations. We next review the notion of maximum likelihood. Then a development of conditional expectation and the Rao-Blackwell theorem are given.

In sampling a finite population, the model will be presented as
$$\mathcal{U} : U_1, U_2, \dots, U_N, \qquad y : \mathcal{U} \to \{Y_1, Y_2, \dots, Y_N\}.$$
Here, $\mathcal{U}$ denotes the set of $N$ distinct units $U_1, \dots, U_N$, and $y$ is a function defined over $\mathcal{U}$ which assigns to unit $U_i$ the number $Y_i$. The problem encountered is to determine the value of $Y$ defined by $Y = Y_1 + \cdots + Y_N$, or of $\bar Y$ defined by $\bar Y = Y/N$. However, one might be limited to being able to determine the $y$-values of only $n$ units, where $1 \le n < N$.

Hence the likelihood is maximized when $N = [st/k]$, i.e., the maximum likelihood estimate of $N$ is $\hat N = [st/k]$.

EXAMPLE 2. Let $X$ be a random variable whose distribution is $B(n,p)$. Suppose $n$ is known, and $p$ is unknown. Based on one observation on $X$ we wish to determine the maximum likelihood estimate of $p$. Let $f(x)$ denote the density of $X$. Substituting $X$ for $x$ we obtain
$$f(X) = \binom{n}{X}p^X(1-p)^{n-X}.$$
Suppose $\omega \in [1 \le X \le n-1]$, i.e., suppose that we observe $1 \le X \le n-1$. Then
$$\frac{d}{dp}f(X(\omega)) = \binom{n}{X(\omega)}p^{X(\omega)-1}(1-p)^{n-X(\omega)-1}\big(X(\omega)(1-p) - p(n - X(\omega))\big).$$
Setting this derivative equal to zero and solving for $p$ (and denoting the solution by $\hat p$) we obtain $\hat p = X(\omega)/n$. (Note: since $(1-p)^r p^s$ for positive integers $r$ and $s$ is $0$ if $p = 0$ or $p = 1$, is positive for all $p \in (0,1)$, has a continuous derivative, and its first derivative is zero at only one point in $(0,1)$, it achieves its unique maximum at this point.) If $\omega \in [X = 0]$, then $f(X(\omega)) = (1-p)^n$ achieves its maximum value at $p = 0$, and if $\omega \in [X = n]$, then $f(X(\omega)) = p^n$ achieves its maximum value at $p = 1$. Thus we may speak of the maximum likelihood estimate in this case; it is $\hat p = X/n$.

EXAMPLE 3. Let $X_1, \dots, X_n$ be independent observations on a random variable $X$ which is known to be uniformly distributed over $\{1, 2, \dots, N\}$, where $N$ is an unknown positive integer. We shall find a maximum likelihood estimate of $N$. Since the discrete density of $X$ is $f_X(x) = 1/N$ for $x \in \{1, \dots, N\}$ and $0$ otherwise, we obtain $L$, the joint density of $X_1, \dots, X_n$, as
$$L = \begin{cases} 1/N^n & \text{if } 1 \le X_i \le N \text{ for } i = 1, \dots, n,\\ 0 & \text{otherwise.}\end{cases}$$
Since $L$ is a decreasing function of $N$ over the values of $N$ for which it is positive, namely $N \ge \max\{X_1, \dots, X_n\}$, it is maximized at $\hat N = \max\{X_1, \dots, X_n\}$, which is the maximum likelihood estimate of $N$.

EXERCISES

1. Prove: If $f$ is a real-valued function defined on $[0,1]$ which satisfies i) $f$ is continuous on $[0,1]$, ii) $f(x) > f(0)$ and $f(x) > f(1)$ for all $x \in (0,1)$, iii) $f'(x)$ exists at all $x \in (0,1)$, and iv) there is exactly one value $x_0$ of $x$ in $(0,1)$ at which $f'(x_0) = 0$, then $f$ achieves its unique maximum in $(0,1)$ at $x_0$.

§4. The Rao-Blackwell Theorem. It might occur in one's statistical experience that one has more than one unbiased estimate for some unknown parameter. The problem will be: which one is best, i.e., which one has minimum variance. This problem is sometimes solvable by use of the Rao-Blackwell theorem, to which this section is devoted. In order not to bring up convergence problems, we shall consider here in our mathematical development only those discrete random variables with a finite number of values. We first develop the necessary background in conditional expectation.

DEFINITION 1. If $X$ is a random variable and $Y$ is a vector random variable, and if $y \in \mathrm{Range}\,Y$, i.e., $P[Y = y] > 0$, then we define $E(X|Y = y)$, the conditional expectation of $X$ given $[Y = y]$, by $E(X|Y = y) = \sum_x x\,P([X = x]|[Y = y])$.

LEMMA 1. If $X$ is a random variable and if $Y$ is a vector random variable, then $E(X|Y = y) = E(X I_{[Y=y]})/P[Y = y]$ for all $y \in \mathrm{Range}\,Y$.

Proof:
$$E(X|Y = y) = \sum_x x\,P([X = x][Y = y])/P[Y = y] = \frac{1}{P[Y=y]}E\Big(\sum_x x\,I_{[X=x]}I_{[Y=y]}\Big) = \frac{E(X I_{[Y=y]})}{P[Y = y]}. \quad \text{QED.}$$

EXAMPLE: For the joint distribution pictured (figure omitted), $E(X|Y = 0) = 1/2$, $E(X|Y = 1) = 1$ and $E(X|Y = 2) = 3/2$.

THEOREM 1. If $X$ and $Y$ are random variables, if $Z$ is a random vector, and if $a$ and $b$ are constants, then for every $z \in \mathrm{Range}\,Z$, $E(aX + bY|Z = z) = aE(X|Z = z) + bE(Y|Z = z)$.

Proof: By means of Lemma 1 we obtain
$$E(aX + bY|Z = z) = \frac{E((aX + bY)I_{[Z=z]})}{P[Z=z]} = a\frac{E(X I_{[Z=z]})}{P[Z=z]} + b\frac{E(Y I_{[Z=z]})}{P[Z=z]} = aE(X|Z = z) + bE(Y|Z = z). \quad \text{QED.}$$

THEOREM 2. If $X$ is a random variable and $Y$ is a random vector, and if $g(x,y)$ is any function over $(\mathrm{Range}\,X) \times (\mathrm{Range}\,Y)$, then for all $y \in \mathrm{Range}\,Y$, $E(g(X,Y)|Y = y) = E(g(X,y)|Y = y)$.

Proof: By Lemma 1,
$$E(g(X,Y)|Y = y) = \frac{E(g(X,Y)I_{[Y=y]})}{P[Y=y]} = \frac{E(g(X,y)I_{[Y=y]})}{P[Y=y]} = E(g(X,y)|Y = y). \quad \text{QED.}$$
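Since the figure for the example above is not reproduced here, the following sketch computes conditional expectations from an explicit joint density instead. The table is hypothetical, chosen only so the arithmetic is easy to follow; the computation is exactly the one prescribed by Definition 1.

```python
# A hypothetical joint density P[X = x, Y = y] (not the table pictured in the original).
joint = {
    (0, 0): 0.10, (1, 0): 0.20, (2, 0): 0.10,
    (0, 1): 0.05, (1, 1): 0.15, (2, 1): 0.10,
    (0, 2): 0.10, (1, 2): 0.05, (2, 2): 0.15,
}

def conditional_expectation(y):
    """E(X | Y = y) = sum_x x P([X = x] | [Y = y]), as in Definition 1."""
    p_y = sum(p for (x, yy), p in joint.items() if yy == y)          # P[Y = y]
    return sum(x * p for (x, yy), p in joint.items() if yy == y) / p_y

for y in (0, 1, 2):
    print(f"E(X | Y = {y}) = {conditional_expectation(y):.4f}")

# The identity E(E(X|Y)) = E(X), established later in this section, already holds here:
e_x = sum(x * p for (x, _), p in joint.items())
e_e = sum(conditional_expectation(y) * sum(p for (_, yy), p in joint.items() if yy == y)
          for y in (0, 1, 2))
print(round(e_x, 6), round(e_e, 6))   # both equal 1.1 for this table
```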
THEOREM 3. If $X$ and $Y$ are random vectors, if $X$ and $Y$ are independent, if $g(x)$ is any function defined over the range of $X$, and if $y \in \mathrm{Range}\,Y$, then $E(g(X)|Y = y) = E\,g(X)$.

Proof: Independence implies
$$E(g(X)|Y = y) = \sum_x g(x)\,P([X = x]|[Y = y]) = \sum_x g(x)\,P[X = x] = E\,g(X). \quad \text{QED.}$$

DEFINITION 2. If $X$ is a random variable and $Y$ is a random vector, then we define the conditional expectation of $X$ given $Y$, $E(X|Y)$, to be the random variable
$$E(X|Y) = \sum_y E(X|Y = y)\,I_{[Y=y]}.$$
Note: $E(X|Y)$ is a function of $Y$.

THEOREM 4. If $X, Y$ are random variables, if $Z$ is a vector random variable, and if $a$ and $b$ are constants, then $E(aX + bY|Z) = aE(X|Z) + bE(Y|Z)$.

Proof: This is an immediate consequence of the above definition and Theorem 1. QED.

THEOREM 5. If $X$ and $Y$ are as in Definition 1, then $E(E(X|Y)) = E(X)$.

Proof: By Definitions 1 and 2,
$$E(E(X|Y)) = \sum_y E(X|Y = y)\,P[Y = y] = \sum_y \sum_x x\,P([X = x]|[Y = y])\,P[Y = y] = \sum_x x \sum_y P([X = x][Y = y]) = \sum_x x\,P[X = x] = E(X). \quad \text{QED.}$$

THEOREM 6. If $X$ and $Y$ are as in Definition 1, and if $g(y)$ is any function over $\mathrm{Range}(Y)$, then $E(g(Y)X|Y) = g(Y)E(X|Y)$.

Proof: By Theorems 1 and 2,
$$E(g(Y)X|Y) = \sum_y E(g(Y)X|Y = y)\,I_{[Y=y]} = \sum_y E(g(y)X|Y = y)\,I_{[Y=y]} = \sum_y g(y)E(X|Y = y)\,I_{[Y=y]} = g(Y)\sum_y E(X|Y = y)\,I_{[Y=y]} = g(Y)E(X|Y). \quad \text{QED.}$$

COROLLARY: If $X$ is any random vector and $g$ is a function defined over $\mathrm{Range}\,X$, then $E(g(X)|X) = g(X)$.

We shall later discover that sufficient statistics are useful in hypothesis testing. Now we shall see how useful they are in decreasing the variance of an unbiased estimate via the Rao-Blackwell theorem. As mentioned earlier, in order to maintain mathematical rigor, the whole treatment here is for the discrete case where each random variable takes only a finite number of values. The conclusions hold for far more general situations. The basic theorem needed for all of this is Theorem 6.

THEOREM 7. Let $X$ and $Y$ be as in Definition 1, and let $f$ be any function. Then
$$E((X - f(Y))^2) = E((X - E(X|Y))^2) + E((E(X|Y) - f(Y))^2).$$
(This is the Pythagorean theorem in another setting.)

Proof: Let us denote $Z = (X - E(X|Y))(E(X|Y) - f(Y))$. Then
$$E((X - f(Y))^2) = E\big(((X - E(X|Y)) + (E(X|Y) - f(Y)))^2\big) = E((X - E(X|Y))^2) + E((E(X|Y) - f(Y))^2) + 2E(Z).$$
Thus it suffices to prove $EZ = 0$. Applying Theorems 4, 5 and 6 we obtain
$$E(Z) = E(E(Z|Y)) = E\big(E((E(X|Y) - f(Y))(X - E(X|Y))|Y)\big) = E\big((E(X|Y) - f(Y))E(X - E(X|Y)|Y)\big) = E\big((E(X|Y) - f(Y))(E(X|Y) - E(X|Y))\big) = 0. \quad \text{QED.}$$

An immediate consequence of Theorem 7 is:

COROLLARY 1. If $X, Y$ and $f$ are as in Theorem 7, then $E((X - f(Y))^2)$ is minimized when $f(Y) = E(X|Y)$.

DEFINITION 3: If $X$ and $Y$ are as in Definition 1, we define the conditional variance of $X$ given $Y$, $\mathrm{Var}(X|Y)$, by $\mathrm{Var}(X|Y) = E(X^2|Y) - (E(X|Y))^2$.

COROLLARY 2. If $X$ and $Y$ are as in Definition 1, then $\mathrm{Var}(X) = E(\mathrm{Var}(X|Y)) + \mathrm{Var}(E(X|Y))$.

Proof: In Theorem 7 take $f(Y)$ equal to the constant $EX$. Then $E((X - f(Y))^2) = \mathrm{Var}\,X$. In addition, since $E(E(X|Y)) = E(X)$, it follows that $E((E(X|Y) - E(X))^2) = \mathrm{Var}(E(X|Y))$. Finally,
$$E((X - E(X|Y))^2) = E\big(E((X - E(X|Y))^2|Y)\big) = E\big(E(X^2|Y) - (E(X|Y))^2\big) = E(\mathrm{Var}(X|Y)). \quad \text{QED.}$$

Corollary 2 is a very useful result in sample survey theory when one wishes to compute a variance in a two-stage sampling procedure.

COROLLARY 3 (The Rao-Blackwell Theorem). If $X$ and $Y$ are as in Theorem 7, then $\mathrm{Var}(X) \ge \mathrm{Var}(E(X|Y))$, with equality holding if and only if $X$ is a function of $Y$.

Proof: The inequality itself follows from the easily verified fact that $\mathrm{Var}(X|Y) \ge 0$. From the proof of Corollary 2 we see that equality holds if and only if $E((X - E(X|Y))^2) = 0$, i.e., $X - E(X|Y) =$ some constant, which is easily shown to be equal to zero, and thus $X = E(X|Y)$, which is a function of $Y$. QED.
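Corollary 3 is easy to see in a small simulation. In the sketch below, $X_1$ from a Bernoulli sample plays the role of $X$ and the sample total plays the role of $Y$; conditioning $X_1$ on the total replaces it by the sample mean, and the simulated variances behave as the corollary predicts. The Bernoulli model, the sample size and the value of $p$ are choices made only for this illustration.

```python
import random

random.seed(2)
n, p, reps = 10, 0.3, 20_000     # illustration parameters only

x1_values, rb_values = [], []
for _ in range(reps):
    xs = [1 if random.random() < p else 0 for _ in range(n)]
    s = sum(xs)
    x1_values.append(xs[0])      # X_1: an unbiased estimate of p
    # E(X_1 | S = s) = s / n: by symmetry E(X_1|S) = ... = E(X_n|S), and they sum to S.
    rb_values.append(s / n)      # the conditioned (Rao-Blackwellized) estimate

def mean_var(values):
    m = sum(values) / len(values)
    return m, sum((x - m) ** 2 for x in values) / (len(values) - 1)

print("X_1        : mean %.3f, variance %.4f" % mean_var(x1_values))   # ~0.3, ~0.21
print("E(X_1 | S) : mean %.3f, variance %.4f" % mean_var(rb_values))   # ~0.3, ~0.021
```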
THEOREM 8. Let $X_1, \dots, X_n$ be observable discrete random variables with joint discrete density $p(x_1, \dots, x_n|\theta)$ which depends on an unknown parameter $\theta \in \Theta \subset \mathbb{R}$. Let $Z = f(X_1, \dots, X_n)$ be an unbiased estimate of $\theta$, and let $Y = g(X_1, \dots, X_n)$ be a sufficient statistic for $\theta$. If we define $\varphi(y) = E_\theta(Z|Y = y)$, then i) $\varphi(Y)$ does not depend on $\theta$, ii) $\varphi(Y)$ is an unbiased estimate of $\theta$, and iii) $\mathrm{Var}\,\varphi(Y) \le \mathrm{Var}\,Z$.

Proof: i) We observe that
$$\varphi(Y) = \sum_y E_\theta(Z|Y = y)\,I_{[Y=y]} = \sum_y \Big(\sum_z z\,p_\theta(z|y)\Big)I_{[Y=y]},$$
where $p_\theta(z|y) = P_\theta([Z = z]|[Y = y])$. Since $Y$ is sufficient for $\theta$, we see that $p_\theta(z|y) = p(z|y)$ does not depend on $\theta$. Hence $\varphi(Y) = \sum_y (\sum_z z\,p(z|y))I_{[Y=y]}$, which does not depend on $\theta$.

ii) In order to prove that $\varphi(Y)$ is an unbiased estimate of $\theta$, we compute
$$E\,\varphi(Y) = \sum_y \Big(\sum_z z\,p(z|y)\Big)p_Y(y|\theta) = \sum_z z\,p_Z(z|\theta) = E_\theta(Z) = \theta,$$
since by hypothesis $Z$ is an unbiased estimate of $\theta$.

iii) follows immediately from the Rao-Blackwell theorem. QED.

Thus we see that if we have an unbiased estimate of a parameter $\theta$, and if $Y$ is a sufficient statistic for $\theta$, then the variance of any unbiased estimate of $\theta$ which is not a (measurable) function of $Y$ can be reduced by using as an unbiased estimate the conditional expectation of the original unbiased estimate given $Y$.

EXERCISES

1. Prove: If $X$ is a random variable, and if $Y = aX + b$, where $a$ and $b$ are constants, then $E(Y|X) = Y$.

2. If the joint density of two random variables $X$ and $Y$ is as given in the picture (figure omitted), compute $E(X|Y = 0)$, $E(X|Y = 1)$, $E(X|Y = 2)$, $E(X|Y = 3)$ and $E(X)$.

3. Prove: $E(E(X|Y)|Y) = E(X|Y)$.

4. Prove: $E(X - E(X|Y)|Y) = 0$.

5. Prove: $P[\mathrm{Var}(X|Y) \ge 0] = 1$.

6. Prove: $\mathrm{Var}(X|Y) = E((X - E(X|Y))^2|Y)$.

7. Prove: If $X, Y$ and $Z$ are random variables, and if $X$ is a function of $Z$, then i) $\mathrm{Var}(X + Y|Z) = \mathrm{Var}(Y|Z)$ and ii) $\mathrm{Var}(XY|Z) = X^2\mathrm{Var}(Y|Z)$.

8. Let $X_1, \dots, X_n$ be a sample of size $n$ on $X$ which is $P(\lambda)$, $\lambda$ unknown. i) Is $\bar X_n$ an unbiased estimate of $\lambda$? ii) Is $s_n^2$ an unbiased estimate of $\lambda$? iii) Is $\bar X_n$ a sufficient statistic for $\lambda$? iv) Which is a better estimate of $\lambda$, $s_n^2$ or $E(s_n^2|\bar X_n)$, and why? v) (This might be difficult, but it is worth a try.) Prove: $E(s_n^2|\bar X_n) = \bar X_n$. vi) Which is better as an estimate for $\lambda$, $\bar X_n$ or $s_n^2$?

CHAPTER 3. ANALYSIS OF CATEGORICAL DATA.

§1. Randomized and Conditional Tests. In the basic course in mathematical statistics the idea of a statistical test of hypothesis was introduced. In somewhat general form, it goes like this. The experimenter observes $n$ random variables whose joint distribution is known to be in a certain family of distributions. Let us denote the true distribution of the observed random variables by $F$, and let $\{F_p : p \in P\}$ denote the family. Thus we know $F = F_a$ for some $a \in P$. We select a subset $P_0 \subset P$ and might wish to test the hypothesis $H_0$ that $a \in P_0$ against the alternative $a \in P \setminus P_0$. The maximum probability with which we are willing to reject the null hypothesis $H_0 : a \in P_0$ when $H_0$ is true is called the level of significance and is usually denoted by $\alpha$. It is natural to demand that, among all tests which satisfy the preselected level of significance, one should be used which gives the maximum probability of rejecting $H_0$ when $a \in P \setminus P_0$, i.e., when we want to. It is this problem which leads to a great deal of the development of mathematical statistics: the problem of the power of the test.

However, before one suffers the embarrassment of riches, i.e., which test is best in the above sense, he or she is anxious to lay hands on at least one test. A general procedure is this. Suppose there is a function $Z$ of the observations $X_1, \dots, X_n$ whose distribution is completely known if $H_0$ is true, i.e., if $a \in P_0$.
If so, then one can find, say, numbers $c < d$ such that the probability that $Z < c$ when $H_0$ is true is $\le \alpha/2$ and the probability that $Z > d$ when $H_0$ is true is $\le \alpha/2$. Since $\alpha$ is small, it is reasoned that $Z$ has little chance of taking a value outside $[c,d]$ when $H_0$ is true. Thus, if one observes $X_1, \dots, X_n$ and computes the value of $Z$, and if $c \le Z \le d$, then one might decide not to reject $H_0$. But if $Z < c$ or $Z > d$, then one would wish to reject $H_0$ as being true, since if it were, there would be little chance of this happening. Now sometimes very sound intuitive reasons tell you that one only wishes to reject $H_0$ if $Z$ is "too large", i.e., one now wants to select $d$ so that $P[Z > d] \le \alpha$ when $H_0$ is true and then to reject $H_0$ if the computed value of $Z$ is $> d$. A similar statement goes for "too small". This notion will become clearer later. What is important, then, is to find a function $Z$ of the observations whose distribution can be determined and computed when the null hypothesis $H_0$ is true. This will be done in the tests we obtain in our treatment of analysis of categorical data. What are categorical data? They are data involving the numbers of yes's and no's, the numbers of defectives and non-defectives, or the numbers of those that survive and those that do not survive. Presented in this chapter are some of the tests in the analysis of categorical data that are of considerable importance.

We shall devote the remainder of this section to two kinds of tests that we shall meet rather frequently: randomized tests and conditional tests. Their developments follow.

Suppose one has an integer-valued test statistic $T$ (a special observable random variable) and a pre-assigned level of significance $\alpha$. Let us suppose that the test is to reject the null hypothesis $H_0$ when $T$ is "too large". What can happen is that there is an integer $t$ such that $P_{H_0}[T \ge t] > \alpha$ and $P_{H_0}[T \ge t+1] \le \alpha$. Let $u = P_{H_0}[T \ge t]$ and $v = P_{H_0}[T \ge t+1]$. Next, suppose we decide to reject $H_0$ by means of the following mechanism:
i) if $T \ge t+1$, reject $H_0$,
ii) if $T \le t-1$, do not reject $H_0$,
iii) if $T = t$, reject $H_0$ with probability $(\alpha - v)/(u - v)$.

THEOREM 1. If the above mechanism is used, then $P_{H_0}[\text{Reject } H_0] = \alpha$.

Proof: $P_{H_0}[\text{Reject } H_0] = P_{H_0}[T \ge t+1] + \dfrac{\alpha - v}{u - v}P_{H_0}[T = t] = v + \dfrac{\alpha - v}{u - v}(u - v) = \alpha$. QED.

Suppose, then, that we can determine a number $K$ such that $P_{H_0}[T \ge K] = \alpha$. Then we would reject $H_0$ if the value of $T$ that we observe is equal to or greater than $K$. However, if we cannot determine the distribution of $T$ when $H_0$ is true, a conditional procedure might work. Namely, we might have at our disposal an observation on another random variable $N$ which, say, takes values $1, 2, \dots, r$. It might occur that we can find the conditional distribution of $T$ given the event $[N = n]$ for $1 \le n \le r$. If so, then for each $n$ we may determine a number $K_n$ such that $P_{H_0}([T \ge K_n]|[N = n]) = \alpha$ (randomizing as above if necessary). The procedure is then: if one observes $N = n$ and $T \ge K_n$, then reject $H_0$; if not, then do not reject $H_0$.

THEOREM 2. If the above procedure is used, then $P_{H_0}[\text{Reject } H_0] = \alpha$.

Proof: By the theorem of total probabilities,
$$P_{H_0}([\text{Reject } H_0]) = \sum_n P_{H_0}([\text{Reject } H_0]|[N = n])\,P_{H_0}([N = n]) = \sum_n P_{H_0}([T \ge K_n]|[N = n])\,P_{H_0}([N = n]) = \sum_n \alpha\,P_{H_0}[N = n] = \alpha. \quad \text{QED.}$$
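The randomization mechanism can be made concrete with a small computation. In the sketch below the null distribution of $T$ is taken to be $B(20, 1/2)$ and $\alpha = .05$; these choices are an illustration only, not an example from the text. The code finds the integer $t$ with $P_{H_0}[T \ge t] > \alpha \ge P_{H_0}[T \ge t+1]$, the probabilities $u$ and $v$, and the randomization probability $(\alpha - v)/(u - v)$, and then verifies that the resulting rejection probability under $H_0$ is exactly $\alpha$.

```python
from math import comb

def upper_tail(t, n=20, p=0.5):
    """P_{H0}[T >= t] for T ~ B(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(t, n + 1))

alpha, n = 0.05, 20

# Find the integer t with P[T >= t] > alpha >= P[T >= t+1].
t = next(t for t in range(n + 1) if upper_tail(t) > alpha >= upper_tail(t + 1))
u, v = upper_tail(t), upper_tail(t + 1)
gamma = (alpha - v) / (u - v)          # probability of rejecting when T = t

# Size of the randomized test: reject always if T >= t+1, with probability gamma if T = t.
size = v + gamma * (u - v)
print(f"t = {t}, u = {u:.4f}, v = {v:.4f}, gamma = {gamma:.3f}, size = {size:.4f}")
```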
EXERCISES

1. The randomized test discussed at the beginning of this section is a random variable, denoted by $P_{H_0}([\text{Reject } H_0]|T)$ and defined by
$$P([\text{Reject } H_0]|T) = I_{[T \ge t+1]} + \frac{\alpha - v}{u - v}\,I_{[T = t]}.$$
If we let $E_{H_0}$ denote the expectation when $H_0$ is true, prove that $E_{H_0}(P([\text{Reject } H_0]|T)) = P_{H_0}[\text{Reject } H_0]$.

2. Let $\theta$ be any positive integer, and let $X_1, \dots, X_n$ be a sample of size $n$ on the uniform distribution over the integers $\{1, 2, \dots, \theta\}$, i.e., $X_1, \dots, X_n$ are i.i.d. and $P[X_i = j] = 1/\theta$ for $1 \le i \le n$, $1 \le j \le \theta$. We wish to test $H_0 : \theta = 10$ against the alternative $H_1 : \theta > 10$ with level of significance $\alpha = .08$. Suppose we wish to reject the null hypothesis in favor of the alternative when $T$ defined by $T = \max\{X_i : 1 \le i \le n\}$ satisfies $T > c$, where $c$ is made to satisfy $P_{H_0}[T > c] = \alpha$ (approximately). i) Evaluate $P_{H_0}[T > 1]$. ii) Find the density of $T$ when $H_0$ is true. iii) Find the smallest value of $c$ which satisfies $P_{H_0}[T > c] \le \alpha$. iv) Find a randomized test, which is a random variable $Z = Z(T)$ such that when $T$ is observed one rejects $H_0$ with probability $Z$, such that $P_{H_0}([\text{Reject } H_0]) = E_{H_0}(Z) = \alpha$.

§2. The Irwin-Fisher Test. The Irwin-Fisher test is a basic two-sample test that is frequently applied. It also appears as a conditional test, as we shall see in a later section of this chapter. The basis of this test is the following theorem.

THEOREM 1. If $X$ and $Y$ are independent random variables, if $X$ is $B(m,p)$ and if $Y$ is $B(n,p)$, then the conditional distribution of $X$ given the event $[X + Y = r]$ is the hypergeometric distribution:
$$P([X = k]|[X + Y = r]) = \frac{\binom{m}{k}\binom{n}{r-k}}{\binom{m+n}{r}} \quad \text{if } \max\{0, r-n\} \le k \le \min\{m, r\}.$$

Proof: Because of independence,
$$P([X = k]|[X + Y = r]) = \frac{P([X = k][X + Y = r])}{P[X + Y = r]} = \frac{P([X = k][Y = r-k])}{P[X + Y = r]} = \frac{\binom{m}{k}p^k(1-p)^{m-k}\binom{n}{r-k}p^{r-k}(1-p)^{n-r+k}}{\binom{m+n}{r}p^r(1-p)^{m+n-r}} = \frac{\binom{m}{k}\binom{n}{r-k}}{\binom{m+n}{r}}$$
for $\max\{0, r-n\} \le k \le \min\{m, r\}$. QED.

We are now able to develop the Irwin-Fisher test. Let $X, Y$ be two independent, observable random variables, where $X$ is $B(m,p')$ and $Y$ is $B(n,p'')$, where $m$ and $n$ are known, but $p'$ and $p''$ are not known. We wish to test the composite hypothesis $H_0 : p' = p''$ against the composite alternative $H_1 : p' \ne p''$. Note that $H_0$ is composite in that it is the set of all bivariate densities $\{f(x,y|p) : 0 < p < 1\}$, where $f(x,y|p) = P([X = x][Y = y])$ when $p' = p'' = p$, i.e.,
$$f(x,y|p) = \binom{m}{x}p^x(1-p)^{m-x}\binom{n}{y}p^y(1-p)^{n-y}$$
for $0 \le x \le m$, $0 \le y \le n$.
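The conditional distribution of Theorem 1 is the distribution the Irwin-Fisher test computes with, and it is easy to tabulate. The sketch below uses $m = 8$, $n = 10$, $r = 6$ and one arbitrary value of $p$ (all of these numbers are for illustration only, not from the text) and checks that the defining ratio $P([X = k][Y = r-k])/P[X + Y = r]$ agrees with the hypergeometric formula of Theorem 1; the common value of $p$ cancels, which is what makes a conditional test possible without knowing it.

```python
from math import comb

m, n, r, p = 8, 10, 6, 0.37   # m, n, r arbitrary; p is arbitrary because it cancels

def binom(N, k, q):
    """Binomial density B(N, q) evaluated at k."""
    return comb(N, k) * q**k * (1 - q)**(N - k)

# P[X + Y = r] under the null hypothesis p' = p'' = p.
p_sum = sum(binom(m, i, p) * binom(n, r - i, p)
            for i in range(max(0, r - n), min(m, r) + 1))

for k in range(max(0, r - n), min(m, r) + 1):
    via_definition = binom(m, k, p) * binom(n, r - k, p) / p_sum
    via_theorem1 = comb(m, k) * comb(n, r - k) / comb(m + n, r)
    print(k, round(via_definition, 6), round(via_theorem1, 6))   # the two columns agree
```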
