Stats Term Project

Kim Padilla
In this project, my data was taken from the US Department of Labor statistics on college
graduates in 2014. First, I take the compensations for each field and find the mean, which is
the average of all the data, and the standard deviation, which measures how far the data
typically deviate from the mean. I then create a histogram for each field to show the frequency
of compensation in that field. Along with the histograms, I create a graph called a box plot,
which uses five values (the five-number summary) to divide the data into four parts.
The chart below contains the sample mean and standard deviation of each field. I found the
mean by typing =AVERAGE(D2:D51). The range D2:D51 identifies the cells that contain the data,
so it changes depending on which data you are working with. Finding the standard deviation is
a similar process: I typed =STDEV.S(D2:D51).
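As a rough cross-check outside Excel, the same two calculations can be sketched in Python. The compensation values below are made up for illustration and are not the project data; the variable names are likewise only placeholders.

```python
import statistics

# Hypothetical compensation values standing in for one field's column (D2:D51).
compensation = [38000, 41500, 39250, 40750, 42000]

# Counterpart of Excel's =AVERAGE(D2:D51).
sample_mean = statistics.mean(compensation)

# Counterpart of Excel's =STDEV.S(D2:D51): the sample standard deviation,
# which divides by n - 1 rather than n.
sample_std = statistics.stdev(compensation)

print(f"mean = {sample_mean:.2f}, std dev = {sample_std:.2f}")
```

Swapping in the 50 real values for a field should give the same mean and standard deviation that Excel reports, up to rounding.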
Major Field                     Sample Mean   Sample Std. Dev.
Business                        54537         7231.722644
Communications                  46227         7214.316661
Computer Science                59542         4669.968517
Education                       40021         2364.816335
Engineering                     60064         7878.855846
Humanities & Social Sciences    38817         7146.182097
Math & Sciences                 41923         4938.287816
Histograms
To create a histogram, I first needed to copy and paste each field onto a new Excel sheet.
To do that, I clicked and dragged down one column. To open a new sheet, I went to the bottom
of Excel and clicked the plus sign next to the sheet named Compensation Data. Once the data was
copied and pasted, I created a new column in column C, labeled it Compensation, and used the
cells underneath to enter the compensation categories (the bins). After I got all of my numbers
onto the new sheet, I went to the top and clicked Data, then Data Analysis. A box popped up,
and I scrolled down to Histogram and pressed OK. For the Input Range, I clicked cell A1, and
$A$1 appeared in the Input Range box. I then clicked cell A51, and $A$51 appeared in the box, so
that the Input Range covered cells A1 through A51. I did the same with column C for the Bin
Range. Before pressing OK, I checked Labels and Chart Output. To get rid of the gaps between
the bars on my chart, I clicked one bar of the graph. A panel appeared on the right side of the
screen, where I clicked the bar symbol, went down to Gap Width, and took it down to 0%.
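The counting that the Histogram tool does can be sketched in a few lines of Python. This is only an illustration under assumptions: the data values are made up, the function name is hypothetical, and the bin values are the axis labels from the Education chart. Each bin is treated as an upper bound, with a final "More" bucket for anything above the last bin.

```python
from bisect import bisect_left

def histogram_counts(values, bin_upper_bounds):
    """Count values per bin; each bin holds values up to and including its bound."""
    counts = [0] * (len(bin_upper_bounds) + 1)  # last slot is the "More" bucket
    for v in values:
        # bisect_left finds the first bin whose upper bound is >= v.
        counts[bisect_left(bin_upper_bounds, v)] += 1
    return counts

# Hypothetical data; the bin bounds copy the Education chart's axis labels.
values = [33000, 36200, 37500, 39900, 41000, 41800, 44100, 46500]
bins = [32500, 35000, 37500, 40000, 42500, 45000]

for label, count in zip(bins + ["More"], histogram_counts(values, bins)):
    print(label, count)
```

The bin values must be sorted in increasing order for the lookup to work, which matches how the Compensation column was filled in.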
For the most part, my histograms showed what I expected, with engineering and computer
science having the highest compensation and education and humanities having the lowest. Due to
the diversity of the business and engineering fields, their compensation seemed to be a little
more spread out. Most of the graphs looked approximately normal, with bell-shaped curves. Some
fields seemed to be more spread out across the graph than others, which I feel is due to the
different work environments (for example, in mathematics and science an individual may work as
a teacher in a school district, teach in a college, or work in a high-security laboratory).
Mathematics and science had more of a peak, which I felt could be due to a certain area of that
field being more popular than another. Business was skewed to the right and Computer Science
was skewed to the left. My field is Humanities and Social Science, where most of the data lies
between 35000 and 45000.
[Figures: histograms of Frequency (vertical axis) against Compensation (horizontal axis) for
Business, Communications, Computer Science, Education, Engineering, Humanities & Social
Science, and Mathematics & Science. Recoverable bin labels: Computer Science uses 45000, 50000,
55000, 60000, 65000, 70000, More; Education uses 32500, 35000, 37500, 40000, 42500, 45000, More.]
I created a box plot for each field. To construct a box plot, I needed to find the five-number
summary, which consists of five values: minimum, first quartile, second quartile (the median),
third quartile, and maximum. In Excel, I found each value separately. For the minimum, I typed
=MIN(D2:D51). For the first, second, and third quartiles, I typed =QUARTILE(D2:D51,1),
=QUARTILE(D2:D51,2), and =QUARTILE(D2:D51,3). Last, to find the maximum value, I typed
=MAX(D2:D51). At this point I was able to construct my box plot. The quartiles make up the box,
and the whiskers extend out to the minimum and maximum values.
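For comparison, the five-number summary can be sketched in Python. The data here is made up and the function name is hypothetical. The "inclusive" method is chosen on the assumption that it interpolates the same way Excel's QUARTILE function does, treating the minimum as the 0th percentile and the maximum as the 100th.

```python
import statistics

def five_number_summary(data):
    """Return (min, Q1, Q2, Q3, max) for a list of numbers."""
    # n=4 splits the data into quartiles and returns the three cut points.
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return min(data), q1, q2, q3, max(data)

# Hypothetical compensation values, not the project data.
sample = [31000, 35500, 38000, 40250, 42000, 47500]
print(five_number_summary(sample))
```

The default method in statistics.quantiles is "exclusive", which can give different quartiles on small samples, so the method argument matters if the goal is to match the spreadsheet.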
Quartiles and box plots
Field                            Min      Q1         Q2               Q3         Max
Business                         40745    49229.25   54728.5          58885.25   71803
Communications                   33282    41689.75   45420            51103.25   62359
Computer Science                 46785    56937.25   590835.5 [sic]   62462.25   69490
Education                        35250    38698.75   39857            41589      44591
Engineering                      46475    54975      60042            66394.75   80236
Humanities and Social Science    23114    33733.25   39600.5          42940.5    54220
Mathematics and Science          31920    38739      41939.5          45000.25   54261
All of the fields had outliers, and some were significantly more extreme than others. This
could be due to the fact that some fields are in higher demand. The t-distribution is appropriate
for this data. We may use the t-distribution for a mean when the population standard deviation
is unknown and either the population is normal or the sample has more than 30 values. There are
50 values in each sample, and it is a simple random sample.
A confidence interval is a range of values used to estimate the true value of a population
parameter. We use confidence intervals to get a range of values (an interval estimate) rather
than a single value (a point estimate).
1) 95% Confidence Interval for the mean
Sample mean = 38817, standard deviation = 7146, alpha = .05, sample size = 50
Margin of error using the t-distribution formula, E = t(α/2) · s/√n
E = 2042.42
Therefore the confidence interval is 36774.58 < μ < 40859.42
2) 99% Confidence Interval for the standard deviation
Standard deviation = 7232, sample size = 50, alpha = .01
Chi-square value (right) = 79.490, chi-square value (left) = 27.991
5678.06 < σ < 9568.57
3) 80% Confidence Interval for the proportion
Sample size = 350, x = 149, sample proportion = .4257, 1 - sample proportion = .5743,
alpha = .20, z value = 1.28
Margin of error using the z distribution = .0338
.3919 < p < .4595
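The three intervals above can be re-derived with a short Python sketch. The function names are hypothetical, and two of the critical values are assumptions rather than values taken from the text: t = 2.021 is the table value that reproduces the 2042.42 margin of error (it is the table row for 40 degrees of freedom), and the left chi-square value 27.991 is the table value that reproduces the 9568.57 upper bound.

```python
import math

def mean_interval(mean, s, n, t_crit):
    """Confidence interval for a mean: mean +/- t * s / sqrt(n)."""
    margin = t_crit * s / math.sqrt(n)
    return mean - margin, mean + margin

def std_dev_interval(s, n, chi2_left, chi2_right):
    """Confidence interval for a standard deviation using chi-square values."""
    df = n - 1
    return math.sqrt(df * s**2 / chi2_right), math.sqrt(df * s**2 / chi2_left)

def proportion_interval(x, n, z_crit):
    """Confidence interval for a proportion: p_hat +/- z * sqrt(p_hat*q_hat/n)."""
    p_hat = x / n
    margin = z_crit * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# t_crit = 2.021 is an assumed table value (see the note above).
print(mean_interval(38817, 7146, 50, t_crit=2.021))
# chi2_left = 27.991 is an assumed table value (see the note above).
print(std_dev_interval(7232, 50, chi2_left=27.991, chi2_right=79.490))
print(proportion_interval(149, 350, z_crit=1.28))
```

The critical values are passed in as arguments because the Python standard library has no t or chi-square tables; they would come from the same tables used for the hand calculation.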
Interpretation
We are 95% confident that the true mean humanities and social science compensation is
between 36774.58 and 40859.42. We are 99% confident that the true standard deviation of
business compensation is between 5678.06 and 9568.57. We are 80% confident that the true
proportion of students whose starting compensation is over $50,000 is between .3919 and .4595.
Hypothesis testing tests a claim about a property of a population, such as a mean,
proportion, or standard deviation. We use it to test claims about these parameters for any
kind of population using sample data.
1) Hypothesis Testing for Education
H0: μ = 35,000
H1: μ < 35,000
Alpha = .05
DF = 49
Sample mean = 40021
Standard deviation = 2364.816335
T statistic value = 15.02
P value = 1
Since the p value is more than the alpha value, we fail to reject H0.
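The test statistic for the Education test can be recomputed with a minimal Python sketch. The function name is hypothetical, and the critical value of about -1.677 is an assumed table value for a left-tailed test at alpha = .05 with 49 degrees of freedom. From the rounded inputs listed above, the statistic comes out to about 15.01, close to the reported 15.02.

```python
import math

def one_sample_t(sample_mean, hypothesized_mean, s, n):
    """t statistic for a one-sample test of a mean."""
    return (sample_mean - hypothesized_mean) / (s / math.sqrt(n))

t_stat = one_sample_t(40021, 35000, 2364.816335, 50)

# Assumed table value: left-tailed, alpha = .05, 49 degrees of freedom.
t_critical = -1.677

# In a left-tailed test, H0 is rejected only if t falls below the critical value.
reject_h0 = t_stat < t_critical
print(round(t_stat, 2), reject_h0)
```

Because the statistic is large and positive while the alternative hypothesis points to the left tail, it is nowhere near the rejection region, which agrees with a p value of essentially 1.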
2) Hypothesis Testing for compensation package value over 40,000
H0: p = .8
H1: p ≠ .8
Alpha = .01
Sample size = 350
Sample proportion = .769
z statistic value = -1.4699
P value = .1416
We fail to reject H0 because the p value is more than alpha.
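The proportion test can be checked the same way. In this sketch the function name is hypothetical and the count of 269 graduates is an assumption: it is not stated in the text, but 269/350 rounds to .769 and reproduces the reported z value. The p value is computed as two-tailed, which is what gives .1416.

```python
import math
from statistics import NormalDist

def proportion_z_test(p_hat, p0, n):
    """Two-tailed z test for a proportion; returns (z, p value)."""
    z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
    # Double the area in the tail beyond |z| for a two-tailed test.
    p_value = 2 * NormalDist().cdf(-abs(z))
    return z, p_value

# 269 is an assumed count (see the note above); 350 is the stated sample size.
z, p_value = proportion_z_test(269 / 350, 0.8, 350)
print(round(z, 4), round(p_value, 4), p_value > 0.01)
```

Since the p value is larger than alpha = .01, the sketch reaches the same decision as the hand calculation.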
Interpretation
In the first test, we failed to reject H0, so there is not sufficient evidence to support
the claim that the mean education compensation is less than $35,000. In the second test, we
fail to reject the claim that the proportion of students with a college degree who are making
over $40,000 is 80%.
Conclusion
Conditions for interval estimate and hypothesis testing:
1) Sample must be a simple random sample.
2) The population should be normally distributed or the sample size should be more than 30.
3) np and nq should each be at least 5 for the interval estimate for proportion data, so that
the binomial distribution can be approximated by the normal distribution.
4) To find the interval estimate for the standard deviation, the population should be normally
distributed, which is a strict requirement for the chi-squared distribution.
Do the sample data satisfy the above conditions?
The data that was collected are from a simple random sample. We do not know whether the
population is normally distributed, but since the sample size is 50, which is more than 30,
the central limit theorem lets us treat the sample mean as approximately normally distributed.
Also, for the proportion data, np and nq are both at least 5. Possible errors can be due to
sampling error, data entry error, data reporting error, or rounding error. Another source of
error could be the different benefit packages offered for the different degrees. We could use
stratified sampling, since the data is divided into different strata, which may give a better
estimate than a simple random sample. This project demonstrates that statistical methods can be
used effectively to analyze data in order to make better decisions.
Reflection
This project has taught me how to use Microsoft Excel and what it is capable of,
so in the future I won't be intimidated by the software and shy away from it. I was able to
really work on my problem-solving skills. Not having someone sit next to me forced me to figure
Excel out on my own by trial and error. I can now say that I know how to use Excel, and if asked
to find the mean and standard deviation I can do it successfully.
The majority of the time in the real world, statistics is not done by hand. This
project exposed me to how statistics looks when done with a computer. I had to learn how
to read the information off of the computer, which I found difficult at times because I have
done most of my work by hand. Being familiar with the steps for getting my results by hand, I
was forced to compare how I would obtain the information both by Excel and by hand.
Knowing that the computer is capable of doing the majority of the work, I feel a
little more comfortable with statistics in the real world and would not mind using it again.
Most of the time while I was calculating the math by hand, I would either miss numbers or write
the numbers down wrong, which then gave me the wrong answer. With Excel, I just need to double
check the data that I enter into the computer and it does the work for me. It made the work a
lot easier and a lot less stressful. Compared to when we were given our first project in class,
by the end of this project I felt that I had started to actually understand statistics and
could further my education in it.
