
Assignment 4

Submission deadline: Friday, Feb 24th, 2017 by 11:59 PM


Submission format: upload document in Canvas

1. (20 points) Data visualization. Consider the babynames data from assignment 2.

(a) Create a subset of the data with female babies named Mary from 1880-2014.
(b) Create a subset of the data with female babies named Sophia from 1880-2014.
(c) Construct a plot of the proportion of female babies named Mary from 1880-2014. On
the same plot, add/overlay a plot of the proportion of female babies named Sophia from
1880-2014.
(d) Briefly describe your interpretation of the plot in (c).

Hint: You can call geom_line() twice to overlay two or more plots in R, i.e., two or more
y-series with the same x-series. Run the following code for illustration.

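library(ggplot2)
x = (1:100)/10
y1 = sin(x)  # first y-series
y2 = cos(x)  # second y-series
data = data.frame(x, y1, y2)
# Fixed colors go outside aes(); the second geom_line() overlays y2 on the same x
ggplot(data, aes(x = x, y = y1)) + geom_line(color = "red") +
  geom_line(aes(x = x, y = y2), color = "blue")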
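
As a further illustration for parts (a) and (b), the sketch below subsets a small made-up data frame with subset(). The column names year, sex, name and prop, and all of the values, are assumptions for illustration only; the babynames data from assignment 2 may be organized differently.

```r
# Made-up stand-in for the babynames data (hypothetical columns and values).
babies <- data.frame(
  year = rep(c(1880, 1950, 2014), times = 3),
  sex  = rep(c("F", "F", "M"), each = 3),
  name = rep(c("Mary", "Sophia", "John"), each = 3),
  prop = c(0.07, 0.03, 0.001, 0.0002, 0.0001, 0.009, 0.08, 0.04, 0.005)
)

# Keep only the rows for one female name within the year range.
mary   <- subset(babies, sex == "F" & name == "Mary"   & year >= 1880 & year <= 2014)
sophia <- subset(babies, sex == "F" & name == "Sophia" & year >= 1880 & year <= 2014)
```

Each subset keeps the year and prop columns, so it can be supplied as the data argument of its own geom_line() layer.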

2. (20 points) Web scraping. Extract the data table from the Safe Routes website at
http://apps.saferoutesinfo.org/legislation_funding%20/state_apportionment.cfm, and
analyze the data to answer the following questions:

(a) Identify the top 5 states that received the most funds in 2010.
(b) Construct a plot of the data set with years on the x-axis, and total funding received by all
states on the y-axis.
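
Once the table is in a data frame (for example via read_html() and html_table() from the rvest package), the analysis could look roughly like the sketch below. The layout is an assumption: a State column plus one column per year holding dollar amounts as text. The three rows are invented placeholders, not values from the website.

```r
# Invented placeholder table; the real table's layout may differ.
funds <- data.frame(
  State  = c("A", "B", "C"),
  `2009` = c("$1,000", "$2,500", "$500"),
  `2010` = c("$3,000", "$1,200", "$4,000"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Strip "$" and "," so the amounts can be treated as numbers.
year_cols <- setdiff(names(funds), "State")
funds[year_cols] <- lapply(funds[year_cols],
                           function(v) as.numeric(gsub("[$,]", "", v)))

# States ordered by 2010 funding, largest first (use 5 instead of 2 on real data).
top <- head(funds[order(funds[["2010"]], decreasing = TRUE), c("State", "2010")], 2)

# Total funding over all states for each year, in a shape ready for plotting.
totals <- colSums(funds[year_cols])
totals_df <- data.frame(year = as.integer(names(totals)), total = as.numeric(totals))
```

The totals_df data frame has one row per year, so it can be passed directly to ggplot() with year on the x-axis and total on the y-axis.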

3. (20 points) Statistical learning intuition. Consider the Iris data set from assignment 2.
Construct the following plots in R.

(a) Plot of Petal Length (x-axis) vs Petal Width (y-axis). Briefly describe the relation between
petal length and petal width as you observe from the plot.
(b) Plot of Petal Length (x-axis) vs Petal Width (y-axis), with different colors for the different
classes of plants.
(c) Plot of Sepal Length (x-axis) vs Sepal Width (y-axis), with different colors for the different
classes of plants.
(d) Observing the plots in (b) and (c), if you had to distinguish between classes by using
either petal dimensions or sepal dimensions, which one would you choose --- petals or
sepals, and why?
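
Hint: one possible way to color points by class, sketched with ggplot2. This assumes the Iris data from assignment 2 matches R's built-in iris data frame (columns Petal.Length, Petal.Width, Sepal.Length, Sepal.Width and Species); if your copy uses other column names, adjust accordingly.

```r
library(ggplot2)

# iris ships with R; Species is a factor with one level per class of plant.
# Mapping color to Species inside aes() gives each class its own color.
p <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species)) +
  geom_point()
print(p)
```

Swapping in Sepal.Length and Sepal.Width gives the corresponding plot for sepals.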

4. (20 points) Linear regression. Load the anscombe dataset in R. (Hint: data.anscombe
= anscombe)

(a) Fit linear regression of (i) y1 on x1 (ii) y2 on x2 (iii) y3 on x3 and (iv) y4 on x4. Write
down the four fitted regression lines.
(b) Construct the following plots (i) y1 vs x1 (ii) y2 vs x2 (iii) y3 vs x3 and (iv) y4 vs x4.
(c) Use your judgement and describe the discrepancy between the plots and the regression
lines.
(d) For each of the four cases in the anscombe dataset, explain whether a linear regression
model is appropriate.
Hint: Look at regression diagnostics like plot of residuals against x, plot of leverage (or
influence), etc.
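
The diagnostics mentioned in the hint can be sketched as follows. This is only one possible approach, using base R's lm(), resid() and hatvalues(); the object names fits, coefs and max_leverage are illustrative choices.

```r
# Fit y1 ~ x1, ..., y4 ~ x4 on the built-in anscombe data.
fits <- lapply(1:4, function(i) {
  lm(as.formula(paste0("y", i, " ~ x", i)), data = anscombe)
})

# One row per case: intercept and slope of the fitted line.
coefs <- t(sapply(fits, coef))
colnames(coefs) <- c("intercept", "slope")
print(round(coefs, 3))

# Residuals against x for each case, in a 2 x 2 grid.
par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], resid(fits[[i]]),
       xlab = paste0("x", i), ylab = "residual")
  abline(h = 0, lty = 2)
}

# Largest leverage (hat value) in each case; values near 1 flag a point
# that determines the fitted line almost entirely on its own.
max_leverage <- sapply(fits, function(f) max(hatvalues(f)))
print(round(max_leverage, 3))
```

Calling plot() on a single fitted model, e.g. plot(fits[[1]]), produces R's standard set of diagnostic plots as an alternative.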

5. (20 points) Linear regression. Load the mtcars data in R. Consider mpg to be the
response variable, and all other variables as features.

(a) Compute the correlation coefficient between mpg and all other features in the dataset.
What are the two features most strongly correlated with mpg?
(Hint: a strong correlation can be either positive or negative, use abs(x) to obtain the
absolute value of a number x.)
(b) Fit two simple linear regression models: model 1 using the strongest feature from (a) and
model 2 using the second strongest feature from (a). Report the linear regression
formula (i.e., report the line equation) and the value of R^2 from the two models. If you
had to choose between these two models, which one would you choose and why?
(c) Fit a multiple linear regression model with all features. Which features are significant in
this model? What is the value of R^2 in this model?
(d) Using stepAIC, identify the best subset of features. Fit a multiple linear regression model
using the best subset of features. Write down the regression formula and R^2 for this
model. Are any of the features from (a) included in this model? Do they have the same
coefficients as they had in model 1 or model 2 from (b)? If the coefficient values have
changed, explain why.
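
Hint: the workflow in (a) to (d) could be sketched roughly as below. It assumes the MASS package (which provides stepAIC) is installed; the object names cors, ranked, top2, m1, full and best are illustrative, and stepAIC is run with its default settings, which is only one of several reasonable choices.

```r
library(MASS)  # provides stepAIC()

# (a) Correlation of mpg with every other column, ranked by absolute value.
cors <- cor(mtcars)["mpg", ]
cors <- cors[names(cors) != "mpg"]
ranked <- sort(abs(cors), decreasing = TRUE)
top2 <- names(ranked)[1:2]

# (b) Simple linear regression on the most strongly correlated feature.
m1 <- lm(reformulate(top2[1], response = "mpg"), data = mtcars)
print(coef(m1))
print(summary(m1)$r.squared)

# (c) Multiple linear regression with all features.
full <- lm(mpg ~ ., data = mtcars)
print(summary(full)$coefficients)  # the last column holds the p-values

# (d) Stepwise selection by AIC, starting from the full model.
best <- stepAIC(full, trace = FALSE)
print(formula(best))
print(summary(best)$r.squared)
```

Comparing coef(m1) with coef(best) shows whether a shared feature keeps the same coefficient once other features enter the model.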

Assignment instructions:

1. Honor code: The Virginia Tech honor pledge for assignments is as follows:
I have neither given nor received unauthorized assistance on this assignment.

The pledge is to be written out on all graded assignments at the university and signed by
the student. Type your name to sign.
2. Submit your assignment as a document (word, pdf or similar) to Canvas, clearly marked
with the student's name and assignment number, e.g., Sengupta_Srijan_HW3.pdf. Your
submission should include R code and answers to problems. You can put answers and
R code into a single file or submit two separate files for R and answers.

3. Late submission: 10 points off for late submission within 24 hours of deadline, 20 points
off for late submission within 48 hours of deadline. Late assignments beyond 48 hours will
not be accepted. Check Canvas regularly for assignments and submission dates.

4. You are free to discuss assignment problems with your classmates, but submitted work
(answers and code) must be your own. Students are not allowed to copy code or
answers from each other, and must write their own code and answers.