
Homework 1

Liz Kantor
Mon Feb 5 05:16:09 2018

Homework policy: This homework is due by 8:00am (EST) on the due date. Homework is to
be handed in via the course website in pdf format. Although we prefer you use Rmarkdown
or Word, you do not need to type the homework; there are many ways (scanner in the library
or phone apps) to convert written homework into a pdf file. Ask the teaching staff if you need
assistance.

Late homework will not be accepted. You are encouraged to discuss homework problems with
other students (and with the instructor and TFs, of course), but you must write your final
answer in your own words. Solutions prepared “in committee” or by copying someone else’s
paper are not acceptable.

Please keep your computer output to a minimum and focus on the required answer. The
easiest way to put your computer output into your homework is to cut and paste it into a
Word file and use the font “courier new”.

Problem 1

For the following surveys, discuss any problems you think exist and suggest how to fix the
issues.

a) A retail store manager wants to conduct a study regarding the shopping habits of his
customers. He selects the first 60 customers who enter his store on a Saturday morning.
Answer: The first 60 customers who enter the store on a Saturday morning form a convenience sample. It captures
only early Saturday shoppers, whose habits may differ from those of customers who shop on other days or at
other times. The manager could create a better sample by randomly surveying customers who enter his store on
different days and at different times.

b) The village of Oak Lawn wishes to conduct a study regarding the income level of households
within the village. The village manager selects 10 homes in the southwest corner of the village
and sends an interviewer to the homes to determine household income.
Answer: Because housing is often organized by socioeconomic status, surveying 10 homes in the same area
of the village will not be an accurate representation of household income levels in the village. The village
manager could create a stronger sample by randomly selecting a larger number of households to survey.
Because the sample is random, it is unlikely that any one area of the village will be misrepresented.

c) An antigun advocate wants to estimate the percentage of people who favor stricter gun
laws. He conducts a nationwide survey of 1,203 randomly selected adults 18 years old and
older. The interviewer asks the respondents, “Do you favor harsher penalties for individuals
who sell guns illegally?”
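Answer: This question exhibits deliberate wording bias, because it subtly suggests that people who sell guns
illegally are bad and thus deserve harsher penalties. It also asks about penalties for illegal sales rather than
about stricter gun laws in general, so it does not measure what the advocate wants to estimate. A better
question would be “What sort of laws should there be regarding people who sell guns illegally?”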

Problem 2

Suppose you are back in high school and the campaign manager for your friend who is running
for senior class president. You would like to know what proportion of students would vote
for her if the election was held today. The class is too big to ask everyone (314 students).
Comment on whether or not each of the following sampling procedures should be used. Explain
why or why not.

a) Poll everyone in your friend’s math class.


Answer: This sample should not be used because everyone in the friend’s math class knows her and will have
an opinion based on their interactions. The sample would leave out the part of the senior class that does not
know her personally.

b) Assign every student in the senior class a number from 1 to 314. Then, use a random
number generator to select 30 students to poll.
Answer: This would be an appropriate sample. Every student in the senior class has the same chance of being
selected, so the procedure introduces no systematic bias.
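
As a minimal sketch of the procedure in part b, R's built-in `sample` function can draw 30 distinct student numbers from 1 to 314. The seed value and the variable names are illustrative assumptions, not part of the assignment.

```r
# Simple random sample of 30 students from a class of 314.
set.seed(100)                 # arbitrary seed, used only for reproducibility
student_ids <- 1:314          # one number per senior
polled <- sample(student_ids, size = 30, replace = FALSE)
sort(polled)                  # the students to poll
```

Sampling without replacement guarantees that no student is polled twice.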

c) Ask every student who is going through the lunch line in the cafeteria who they will vote
for.
Answer: While this is better than surveying students in the friend’s class, this would still likely contain
some bias, as students who go to the cafeteria together could be in the same friend groups and therefore
more likely to have some association with the friend.

Problem 3

In R, read in the results of a small survey done by visitors to a regional mall.


mydata=read.csv("http://people.fas.harvard.edu/~mparzen/stat100/smallsurvey.csv")
# number of rows
nrow(mydata)

## [1] 30
# number of columns
ncol(mydata)

## [1] 10
# names of the variables
names(mydata)

## [1] "id" "gender" "residence" "politicalparty"
## [5] "numbchildren" "age" "income" "jobhappy"
## [9] "tvhours" "radiohours"
# for example, mean of the income variable
mean(mydata$income)

## [1] 45.4

a) How many rows of data are in this data set?


Answer: This data set has 30 rows of data

b) How many variables are in this data set? (The ncol(mydata) command could be useful here.)
Answer: This data set has 10 variables (columns).

c) One way to examine categorical variables is with a pie chart. Produce a pie chart of where
people live (the residence variable) by using the pie command. Comment on the graph:
pie(table(mydata$residence))

[Pie chart of the residence variable with three slices: Rural, Suburban, Urban]
Answer: The graph looks like a pie and that makes me hungry.

d) Another way to examine categorical variables is with a bar chart. Produce a bar chart
of political affiliation (the politicalparty variable) by using the barplot command. Comment on the
graph: why can’t we use a histogram for this variable?
mydata=read.csv("http://people.fas.harvard.edu/~mparzen/stat100/smallsurvey.csv")
barplot(table(mydata$politicalparty))
[Bar chart of politicalparty counts (vertical axis from 0 to 10) with bars for Democrat, Independent, Other, and Republican]


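Answer: We can’t use a histogram for this variable because a histogram groups a quantitative variable into
numeric intervals, while politicalparty is a qualitative (categorical) variable with no numeric scale.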

e) Find the average of the income variable.


Answer: The average of the income variable is 45.4.

f) We can subset data in different ways (see handout on class site for how to do this). Compare
the average income and standard deviation of income for men and women.
female.income=mydata$income[mydata$gender=="F"]
male.income=mydata$income[mydata$gender=="M"]
# female average income
mean(female.income)

## [1] 37.4
#male average income
mean(male.income)

## [1] 53.4
Answer: The average male income is 53.4 and the average female income is 37.4.
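
The question also asks for the standard deviations, which `sd` can compute on the same kind of subsets. Because the survey file may not be reachable, the sketch below uses a small made-up data frame (`toy`, with invented values) in place of `mydata`; the same calls should apply to the real data.

```r
# Hypothetical stand-in for mydata; the values are invented for illustration.
toy <- data.frame(
  gender = c("F", "F", "F", "M", "M", "M"),
  income = c(30, 40, 50, 45, 55, 65)
)

# Mean and standard deviation of income within each gender.
group_means <- tapply(toy$income, toy$gender, mean)
group_sds   <- tapply(toy$income, toy$gender, sd)
group_means
group_sds
```

Using `tapply` avoids writing one subsetting line per group, which helps when a variable has more than two categories.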

g) The variable jobhappy measures on a 1-10 scale how happy someone is with their job.
Compare the average income for someone with a jobhappy rating of 8 or more versus the
average income of someone with a jobhappy rating of 3 or less. What do you find?
happy.income=mydata$income[mydata$jobhappy>=8]
unhappy.income=mydata$income[mydata$jobhappy<=3]
# happy average income
mean(happy.income)

## [1] 37.25
#unhappy average income
mean(unhappy.income)

## [1] 51.42
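Answer: The average income for respondents with a jobhappy rating of 8 or more is 37.25, while the average
income for respondents with a rating of 3 or less is 51.4167. In this sample, the people who are happiest with
their jobs earn less on average than the people who are least happy.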

Problem 4

This question uses an old data set on cars from Consumer Reports. To load the data into R
enter the following command
mydata=read.csv("http://people.fas.harvard.edu/~mparzen/stat100/cars10.csv")
#Always good to know the variable names
names(mydata)

## [1] "make" "price" "mpg" "headroom"
## [5] "trunk" "weight" "length" "turn"
## [9] "displacement" "gear_ratio" "foreign"
#Calculate some means and medians
mean(mydata$price)

## [1] 6165
median(mydata$price)

## [1] 5006
price=mydata$price
origin=mydata$foreign
price.foreign=price[origin=="Foreign"]
price.domestic=price[origin=="Domestic"]

a) Calculate the mean price of the automobiles in the data set.


Answer: The average price of automobiles in the data set is 6165.2568.

b) Calculate the median price of the automobiles in the data set.


Answer: The median price of automobiles in the data set is 5006.5.

c) What does the difference between the mean and median price indicate about the shape of
the distribution for the price?
Answer: The fact that the mean price is greater than the median price suggests that the distribution is
skewed right.
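
A small illustration of why a mean above the median points to right skew: one large value pulls the mean upward but leaves the median alone. The numbers below are invented for the example and are not taken from the cars data.

```r
# Invented right-skewed values: most are small, one is very large.
prices <- c(4000, 4500, 5000, 5500, 16000)
mean(prices)     # pulled upward by the single large value
median(prices)   # the middle value, unaffected by the size of the maximum
```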

d) Calculate the mean price of automobiles separately for the domestic and foreign cars and
compare the results. Note that foreign is coded “Foreign” for foreign cars and “Domestic”
for domestic cars.
Answer: The mean price of foreign cars is 6384.6818 and the mean price of domestic cars is 6072.4231, so
foreign cars cost about 312 more on average.

e) Make a histogram of the price of cars. What shape does the histogram take? (Is it
symmetric? Skewed?)
Answer:
hist(price,
main="Automobile Prices",
xlab="Price",
breaks=10)

[Histogram “Automobile Prices”: x-axis Price (4000 to 16000), y-axis Frequency (0 to 25)]

f) Discuss the difference in distributions of mpg for foreign and domestic cars. [Do this by
comparing means, medians and histograms.]
Answer:
mydata=read.csv("http://people.fas.harvard.edu/~mparzen/stat100/cars10.csv")
mpg=mydata$mpg
mpg.foreign=mpg[origin=="Foreign"]
mpg.domestic=mpg[origin=="Domestic"]

hist(mpg.foreign,
main="MPG for Foreign Cars",
xlab="MPG",
breaks=10)

[Histogram “MPG for Foreign Cars”: x-axis MPG (15 to 40), y-axis Frequency (0 to 5)]
hist(mpg.domestic,
main="MPG for Domestic Cars",
xlab="MPG",
breaks=10)

[Histogram “MPG for Domestic Cars”: x-axis MPG (15 to 30), y-axis Frequency (0 to 10)]
The mean and median mpg for foreign cars are 24.7727 and 24.5, respectively. The mean and median mpg for
domestic cars are 19.8269 and 19, respectively, so foreign cars get roughly 5 mpg more on both measures. Both
distributions are skewed right, but more domestic cars than foreign cars get less than 15 mpg.

g) Make a scatter plot of the variables weight and length. Does there appear to be any
association between the variables?
weight=mydata$weight
length=mydata$length
plot(weight,length,
main="Weight vs. Length",
xlab="Weight",
ylab="Length")

[Scatter plot “Weight vs. Length”: x-axis Weight (2000 to 4500), y-axis Length (140 to 220)]
Answer: Yes, there seems to be a positive association between weight and length for the cars in our data set.

Figure 1:

Problem 5

Unfortunately, a friend of yours has been diagnosed with cancer. You obtain a histogram of
the survival time (in months) of patients diagnosed with this form of cancer as shown in the
figure above. The median survival time for individuals with this form of cancer is 11 months,
while the mean survival time is 69 months. What words of encouragement should you share
with your friend from a statistical point of view? [It is also recommended that you read the essay
“The Median Isn’t the Message” found on the course web site.]
Answer: I would tell her that a median survival time of 11 months does not mean she is only going to live
for 11 more months. Half of the people with this diagnosis live for more than 11 months, some as long as 80
or even 160 months. The mean of 69 months is far above the median, which shows that the distribution is
strongly skewed right, with a long tail of long-term survivors. Also, the number of very short survival times
may be inflated because many diagnoses are made at or near death, which pulls the median down.

Problem 6

When my friend Seth transferred from Harvard to Yale, many of his friends remarked that the
average student IQ increased at both places. Is this possible and if so, how? Briefly explain.
Answer: If Seth’s IQ is lower than the mean IQ at Harvard but higher than the mean IQ at Yale, both
schools will see an increase in their average IQs upon his transfer.
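
A quick numerical check of this argument, using invented IQ scores (the values and variable names below are hypothetical and chosen only so that Seth sits between the two school means):

```r
# Hypothetical IQ scores; Seth's score lies between the two school means.
seth    <- 120
harvard <- c(130, 140, 150)   # Harvard students other than Seth
yale    <- c(100, 110)        # Yale students before Seth arrives

harvard_before <- mean(c(harvard, seth))  # Seth still at Harvard
harvard_after  <- mean(harvard)           # Seth has left
yale_before    <- mean(yale)
yale_after     <- mean(c(yale, seth))     # Seth has arrived

c(harvard_before, harvard_after, yale_before, yale_after)
```

With these numbers the Harvard mean rises from 135 to 140 and the Yale mean rises from 105 to 110.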

Problem 7

Suppose the diameters of a sample of new tires coming off one production line turned out to
have a standard deviation of 0. Would the manufacturer be happy or unhappy, assuming the
average diameter was correct? Explain.
Answer: The manufacturer would be extremely happy because if the average diameter is correct and the standard
deviation is 0, every tire in the sample was made with exactly the correct diameter.

Problem 8

Use this data set for the following question {10,20,30,40,50}. Feel free to use R for this
problem. You can define this data set in R with the command
x=c(10,20,30,40,50)
mean(x)

## [1] 30
sd(x)

## [1] 15.81
mean(x+5)

## [1] 35
sd(x+5)

## [1] 15.81
### and so on

a) Find the standard deviation and mean.


Answer: The mean is 30 and the standard deviation is 15.8114.

b) Add 5 to each value, and then find the standard deviation and mean.
Answer: The new mean of the data is 35 and the new standard deviation is 15.8114.

c) Subtract 5 from each value and find the standard deviation and mean.
Answer: The new mean of the data is 25 and the new standard deviation is 15.8114.

d) Multiply each value by 5 and find the standard deviation and mean.
Answer: The new mean of the data is 150 and the new standard deviation is 79.0569.

e) Divide each value by 5 and find the standard deviation and mean.
Answer: The new mean of the data is 6 and the new standard deviation is 3.1623.

f) Generalize the results of parts b through e.


Answer: The mean is always affected by a linear transformation, while the standard deviation is only affected
by multiplication or division by a factor. These rules can be defined as follows: for a data set x with a mean
m and a standard deviation s, transformed by a positive constant n,
For $x + n$: $m' = m + n$, $s' = s$
For $x - n$: $m' = m - n$, $s' = s$
For $x \cdot n$: $m' = m \cdot n$, $s' = s \cdot n$
For $x / n$: $m' = m / n$, $s' = s / n$
If n is negative, the standard deviation is multiplied or divided by $|n|$ instead, since a standard deviation
can never be negative.
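
These rules can be checked in R for any multiplier and shift. The sketch below uses the data set from this problem; the names `a`, `b`, and `y` are illustrative. Note that for a negative multiplier the standard deviation scales by the absolute value of the multiplier, which is why `abs()` appears in the last line.

```r
x <- c(10, 20, 30, 40, 50)
a <- 5    # multiplier (try a negative value to see why abs() is needed)
b <- 5    # shift

y <- a * x + b
all.equal(mean(y), a * mean(x) + b)   # the mean follows the transformation
all.equal(sd(y), abs(a) * sd(x))      # the sd ignores b and scales by |a|
```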

Problem 9

A company has 30 employees, including a director. The lowest salary among the 30 employees
is $22,000. The director’s salary is $180,000, which is more than twice as much as anyone
else’s salary. Decide for each of the following statements about the 30 salaries whether it is
true, false, or you cannot tell on the basis of the information at hand. You do not have to give
an explanation.

a) The average salary is below $60,000.


Answer: Cannot be determined

b) The median salary is below $60,000.


Answer: Cannot be determined

c) If all salaries are increased by $1,000, that adds $1,000 to the average.
Answer: True

d) If the director’s salary is doubled, and all other salaries remain the same, that increases
the average salary.
Answer: True

e) If the director’s salary is doubled, and all other salaries remain the same, that increases the
median salary.
Answer: False

f) The standard deviation of the salaries is larger than $180,000.


Answer: False
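
Parts c through f can be illustrated with one invented salary list that satisfies the constraints in the problem. The specific values below are assumptions made for the example; parts a and b remain undetermined precisely because many other lists also satisfy the constraints.

```r
# Invented salaries that satisfy the problem's constraints:
# the lowest is 22,000 and the director's 180,000 is more than twice any other.
salaries <- c(22000, rep(50000, 28), 180000)

doubled <- salaries
doubled[which.max(doubled)] <- 2 * max(doubled)   # double the director's pay

c(mean(salaries), mean(doubled))       # the mean rises
c(median(salaries), median(doubled))   # the median does not move
mean(salaries + 1000) - mean(salaries) # a flat raise adds 1,000 to the mean
sd(salaries) < diff(range(salaries))   # the sd cannot exceed the range
```

Since the range here is 180,000 - 22,000 = 158,000, the standard deviation must be below 180,000 for any list meeting the constraints.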

Problem 10

A mutual fund has a mean rate of return of about 12.3%, with a standard deviation of 15.7%.

a) According to Chebyshev’s Inequality, at least 75% of returns will be between what values?

Answer: Chebyshev’s rule states that for any data set and for any number k that is greater than one, the
proportion of the data that lies within k standard deviations of the mean is at least
$$1 - \frac{1}{k^2}$$
Setting this proportion equal to 3/4 and solving for k, we find
$$k = \sqrt{(1 - 3/4)^{-1}} = \sqrt{4} = 2$$
So, at least 75% of the data will fall within 2 standard deviations of the mean. We then calculate the bounds:
$$[12.3 - 2(15.7),\ 12.3 + 2(15.7)] = [-19.1,\ 43.7]$$
Therefore, at least 75% of the returns fall between -19.1% and 43.7%.

b) According to Chebyshev’s Inequality, at least 88.9% of returns will be between what two
values?
Answer: Using the same procedure as in 10a, we find that k = 3 and that at least 88.9% of the returns fall
between -34.8% and 59.4%.
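
The calculation in parts a and b can be wrapped in a small helper. The function name `chebyshev_interval` and its arguments are hypothetical, introduced only for this sketch; 88.9% is entered as 8/9 so that k comes out as 3.

```r
# Chebyshev: at least 1 - 1/k^2 of the data lies within k sds of the mean.
# Solving 1 - 1/k^2 = p for k gives k = sqrt(1 / (1 - p)).
chebyshev_interval <- function(mu, sigma, p) {
  k <- sqrt(1 / (1 - p))
  c(lower = mu - k * sigma, upper = mu + k * sigma)
}

chebyshev_interval(12.3, 15.7, 3 / 4)   # k = 2
chebyshev_interval(12.3, 15.7, 8 / 9)   # k = 3
```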

c) Should an investor be surprised if she has a negative rate of return? Why?


Answer: The investor should not be surprised. A return of 0% is only 12.3/15.7 ≈ 0.78 standard deviations
below the mean, so negative returns fall well within the fund’s typical variation.

d) If we were going to use the Empirical Rule, what would we need to assume about the
returns?
Answer: We would need to assume that the distribution of returns is approximately normal, that is, symmetric
and mound shaped.
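
If that assumption is made, the Empirical Rule intervals and the chance of a negative return follow directly from the mean and standard deviation. The sketch below treats the returns as exactly normal, which is an assumption for illustration and not something the problem states; the variable names are illustrative.

```r
mu    <- 12.3   # mean return, in percent
sigma <- 15.7   # standard deviation, in percent

# Intervals covering roughly 68%, 95%, and 99.7% under the Empirical Rule.
empirical <- data.frame(
  coverage = c("68%", "95%", "99.7%"),
  lower    = mu - (1:3) * sigma,
  upper    = mu + (1:3) * sigma
)
empirical

# Under a normal model, the chance that a return is below zero.
p_negative <- pnorm(0, mean = mu, sd = sigma)
p_negative
```

The intervals are narrower claims than Chebyshev's: the same 2 standard deviation range is said to hold about 95% of returns rather than at least 75%.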

Problem 11
Suppose $x_1 = 2$, $x_2 = -1$ and $x_3 = 0$. Find $2 + \sum_{i=1}^{3} 5x_i$.
Answer: The 3 above the summation sign is the upper limit of the index i, not an exponent, so
$$2 + \sum_{i=1}^{3} 5x_i = 2 + 5(2 + (-1) + 0) = 2 + 5(1) = 7$$
