
The University of Texas at Dallas, Mechanical Engineering Department

Statistics for Data Sciences - Mini-Project 2


Edgardo Javier García Cartagena
March 26, 2014

1 Exercise 1
Recall that we have two estimators of the parameter θ in a Uniform(0, θ) distribution. One is the MLE, the maximum of the sample, and the other is the MOME, which is twice the sample average. Suppose we have a random sample of size n from a Uniform(0, θ) population. Which of the two estimators is preferable? Answer this question by comparing the mean squared errors, i.e., E[(estimator of θ − θ)²], of the two estimators, computed using Monte Carlo simulations. Use a variety of values for n and θ, e.g., n = 5, 10, 30, 100, and θ = 1, 2, 4. Do you see any patterns in the results?
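In symbols, for a random sample X₁, …, Xₙ from Uniform(0, θ), the two estimators and the comparison criterion are

$$\hat{\theta}_{\mathrm{MLE}} = \max_{1 \le i \le n} X_i, \qquad \hat{\theta}_{\mathrm{MOME}} = 2\bar{X}, \qquad \mathrm{MSE}(\hat{\theta}) = E\big[(\hat{\theta} - \theta)^2\big].$$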

1.1 Overview
The code is written so that results are obtained for all suggested values of θ and n at once. The parameter θ is estimated by simulating many random data samples of different sizes, using the R command runif(n, min = 0, max = theta) to generate random numbers from a uniform distribution for the suggested values of θ and the suggested sample sizes. For each θ and sample size n, a Monte Carlo simulation of 10,000 runs is performed. In every run the parameter θ is estimated with both the method of moments and maximum likelihood. After every simulation the mean squared error is calculated for both methods and the two are compared. To visualize the results, superimposed curves of the mean squared error are plotted for the two methods and for the different values of θ, showing how the error of each method changes as a function of the sample size.
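As a condensed illustration of this procedure, the sketch below estimates both mean squared errors for a single combination, here n = 30 and θ = 2 (values chosen only for illustration); the complete script that loops over every suggested combination is listed in Section 1.3.

# Minimal sketch: Monte Carlo MSE of the MLE and MOME for one (n, theta) pair
m     <- 10000                       # number of Monte Carlo replications
n     <- 30                          # sample size (illustrative value)
theta <- 2                           # true parameter (illustrative value)
mle   <- numeric(m)
mome  <- numeric(m)
for (i in 1:m) {
  x       <- runif(n, min = 0, max = theta)   # one simulated sample
  mle[i]  <- max(x)                           # maximum likelihood estimate
  mome[i] <- 2 * mean(x)                      # method of moments estimate
}
mean((mle  - theta)^2)               # Monte Carlo MSE of the MLE
mean((mome - theta)^2)               # Monte Carlo MSE of the MOME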

1.2 Results and Discussion


For all the cases simulated with the different values of θ and n, it is observed from Figure 1.1 and Figure 1.2 that the maximum likelihood estimator of θ always has a smaller error than the method of moments estimator. As the sample size of the simulated data increases, the error decreases for both methods.
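This behavior is consistent with the standard closed-form expressions for the two mean squared errors (quoted here only as a check on the simulation):

$$\mathrm{MSE}(\hat{\theta}_{\mathrm{MLE}}) = \frac{2\theta^2}{(n+1)(n+2)}, \qquad \mathrm{MSE}(\hat{\theta}_{\mathrm{MOME}}) = \frac{\theta^2}{3n},$$

so the MLE error decreases roughly like 1/n², while the MOME error decreases only like 1/n, and both scale with θ².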

[Plot: mean squared error versus sample size n, with superimposed curves for MLE and MOME at θ = 1, 2, 4.]

Figure 1.1: Superimposed curves of mean squared error as a function of sample size for the method of moments and maximum likelihood estimators with different values of θ.

[Plot: the same curves zoomed to the mean squared error range [0.0, 0.1].]

Figure 1.2: Zoom into the mean squared error range [0.0, 0.1] to show the errors more clearly when the sample size is 100.

1.3 R Code
m = 10000                                    # number of Monte Carlo simulations
theta = c(1, 2, 4)                           # parameter values to estimate
n = c(5, 10, 30, 100)                        # sample sizes
symb = c(1, 5, 4)                            # line types for plotting
mle = rep(0, m)                              # initialize MLE variable
mome = rep(0, m)                             # initialize MOME variable
mlerms = array(0, dim = c(length(n), length(theta)))    # init MLE MSE array
momerms = array(0, dim = c(length(n), length(theta)))   # init MOME MSE array
for (iii in 1:length(theta)) {               # loop over the suggested theta values
  for (ii in 1:length(n)) {                  # loop over the suggested sample sizes
    for (i in 1:m) {                         # Monte Carlo loop
      nd = runif(n[ii], min = 0, max = theta[iii])   # uniform sample
      mle[i] = max(nd)                       # MLE calculation
      mome[i] = 2.0 * mean(nd)               # MOME calculation
    }
    mlerms[ii, iii] = mean((mle - theta[iii])^2)    # MLE mean squared error
    momerms[ii, iii] = mean((mome - theta[iii])^2)  # MOME mean squared error
  }
}
# PLOTTING RESULTS
pdf(file = "rmsFull.pdf", width = 13, height = 8)
for (iii in 1:length(theta)) {
  plot(n, mlerms[, iii], type = "o", lty = symb[iii], lwd = 2, cex = 2, cex.lab = 2,
       ylim = c(0.0, max(momerms)), xlab = NA, ylab = NA, axes = F)
  box()
  axis(side = 1, cex.axis = 1.5)
  axis(side = 2, cex.axis = 1.5)
  mtext(side = 1, "Sample Size (n)", line = 3, cex = 2.5)
  mtext(side = 2, "Mean Squared Error", line = 2.3, cex = 2.5)
  par(new = TRUE)
  plot(n, momerms[, iii], type = "o", lty = symb[iii], lwd = 2, cex = 2, cex.lab = 2,
       pch = 22, ylim = c(0.0, max(momerms)), xlab = NA, ylab = NA, axes = F)
  par(new = TRUE)
}
legend("topright",
       c(expression(paste("MLE, ", theta == 1)), expression(paste("MOME, ", theta == 1)),
         expression(paste("MLE, ", theta == 2)), expression(paste("MOME, ", theta == 2)),
         expression(paste("MLE, ", theta == 4)), expression(paste("MOME, ", theta == 4))),
       pch = 21:22, cex = 2, lty = c(1, 1, 5, 5, 4, 4), lwd = 2)
dev.off()

./mp2ex1.r

2 Exercise 2
We know how to construct a large-sample confidence interval for a population proportion p. How good is this confidence interval when n is not very large? Answer this question by computing the coverage probability of this interval using Monte Carlo simulations. Take the level of confidence to be 95%, but use a variety of values for n and p, e.g., n = 5, 10, 30, 100, and p = 0.05, 0.1, 0.25, 0.5, 0.9, 0.95. Do you see any patterns in the results?
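The interval in question is the usual large-sample (Wald) interval based on the normal approximation to the sample proportion p̂,

$$\hat{p} \pm z_{0.975}\,\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}, \qquad z_{0.975} \approx 1.96,$$

and its coverage probability is the probability that this random interval contains the true p.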

2.1 Overview
As in Exercise 1, the code generates results for all suggested values. To investigate how the coverage probability is affected when the sample size is small, a Monte Carlo simulation of the coverage probability is performed. In each run, sample data of the given size are generated from a Bernoulli distribution, and the sample mean is used as the estimate of the proportion from which the coverage quantity is computed. The Monte Carlo simulation is performed for all combinations of the suggested values of n and p. To smooth the values obtained for each combination of n and p, the average over the Monte Carlo runs is calculated, and the behavior for the different sample sizes is then examined by plotting coverage probability versus sample size for the different p, and coverage probability versus the suggested probabilities for the different sample sizes.
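For one combination of n and p, the empirical coverage can be obtained by recording in each run whether the resulting interval actually contains the true p; a minimal sketch of that check is given below (n = 10 and p = 0.25 are illustrative values only), while the full script used for the figures is listed in Section 2.3.

# Minimal sketch: empirical coverage of the 95% interval for one (n, p) pair
m <- 10000                           # number of Monte Carlo replications
n <- 10                              # sample size (illustrative value)
p <- 0.25                            # true proportion (illustrative value)
z <- qnorm(0.975)                    # 97.5% normal quantile, about 1.96
covered <- logical(m)
for (i in 1:m) {
  x     <- rbinom(n, 1, p)                       # one Bernoulli sample
  p_hat <- mean(x)                               # sample proportion
  half  <- z * sqrt(p_hat * (1 - p_hat) / n)     # interval half-width
  covered[i] <- (p_hat - half <= p) && (p <= p_hat + half)  # interval contains p?
}
mean(covered)                        # estimated coverage probability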

2.2 Results and Discussion


From Figure 2.2 it can be seen that for small sample sizes the coverage probability varies noticeably across the simulated probabilities, which suggests that the confidence interval is not accurate with small samples. In contrast, when the sample size is large, similar coverage probabilities are obtained for the different simulated probabilities.

[Plot: coverage probability versus sample size n, with one curve for each of p = 0.05, 0.1, 0.25, 0.5, 0.9, 0.95.]

Figure 2.1: Coverage probability as a function of the number of observations (sample size n).

[Plot: coverage probability versus p, with one curve for each of n = 5, 10, 30, 100.]

Figure 2.2: Coverage probability as a function of the different probabilities p.

2.3 R Code
m = 10000                                    # number of Monte Carlo simulations
n = c(5, 10, 30, 1000)                       # sample sizes
p = c(0.05, 0.1, 0.25, 0.5, 0.9, 0.95)       # suggested probabilities
symbnum = c(0, 1, 2, 4, 5, 16)               # point symbols for plotting
covprob_tmp = rep(0, m)
covprob = array(0, dim = c(length(n), length(p)))   # init covprob array
for (iii in 1:length(p)) {                   # loop over the suggested p
  for (ii in 1:length(n)) {                  # loop over the suggested n
    for (i in 1:m) {
      rb = rbinom(n[ii], 1, p[iii])          # Bernoulli sample
      p_est = mean(rb)                       # estimated proportion
      covprob_tmp[i] = qnorm(0.975) * sqrt(p_est * (1 - p_est) / n[ii])  # 95% interval half-width (plotted as the coverage measure)
    }
    covprob[ii, iii] = mean(covprob_tmp)     # average over the Monte Carlo simulation
  }
}
# PLOTTING RESULTS
pdf(file = "covprob.pdf", width = 13, height = 8)
for (iii in 1:length(p)) {
  plot(n, covprob[, iii], type = "o", lwd = 2, cex = 2, cex.lab = 1.5, pch = symbnum[iii],
       ylim = c(0.0, 0.4), xlab = NA, ylab = NA, axes = F)
  box()
  axis(side = 1, cex.axis = 1.5)
  axis(side = 2, cex.axis = 1.5)
  mtext(side = 1, "Sample Size (n)", line = 3, cex = 2.5)
  mtext(side = 2, "Coverage Probability", line = 2.3, cex = 2.5)
  par(new = TRUE)
}
legend("topright", c("p=0.05", "p=0.1", "p=0.25", "p=0.5", "p=0.9", "p=.95"),
       pch = symbnum, cex = 2, lty = 1, lwd = 2)
dev.off()

pdf(file = "covprob2.pdf", width = 13, height = 8)
for (iii in 1:length(n)) {
  plot(p, covprob[iii, ], type = "o", lwd = 2, cex = 2, cex.lab = 1.5, pch = symbnum[iii],
       ylim = c(0.0, 0.4), xlab = NA, ylab = NA, axes = F)
  box()
  axis(side = 1, cex.axis = 1.5)
  axis(side = 2, cex.axis = 1.5)
  mtext(side = 1, "Probability (p)", line = 3, cex = 2.5)
  mtext(side = 2, "Coverage Probability", line = 2.3, cex = 2.5)
  par(new = TRUE)
}
legend("topright", c("n=5", "n=10", "n=30", "n=100"), pch = symbnum, cex = 2, lty = 1, lwd = 2)
dev.off()

./mp2ex2.r
