Bailey Wang

Professor Gupta
STA141A
4/21/18

Homework 1

Question 1

How many observations are recorded in the dataset? How many colleges are recorded?

There are 3312 observations in the dataset. However, only 2431 colleges are recorded (rows where
main_campus is TRUE).
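
A minimal sketch of why the two counts can differ, using a small made-up data frame in place of the real dataset. The `toy` table and its values are hypothetical; only the `main_campus` column name comes from the data. It assumes, as the appendix code does, that each college has exactly one main campus row.

```r
# Hypothetical stand-in for the dataset: 5 rows, 3 of them main campuses
toy <- data.frame(
  name = c("A", "A-branch", "B", "C", "C-branch"),
  main_campus = c(TRUE, FALSE, TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

n_obs <- nrow(toy)                  # every row is one observation
n_colleges <- sum(toy$main_campus)  # TRUE counts as 1, so this counts main campuses only
```

Branch campuses add rows without adding colleges, which is why the row count is larger.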

Question 2

How many features are there? How many of these are categorical? How many are discrete? Are
there any other kinds of features in this dataset?

Features stored as integers:

"unit_id" "branches" "avg_sat" "undergrad_pop" "grad_pop" "cost" "tuition"
"tuition_nonresident" "revenue_per_student" "spend_per_student"
"avg_faculty_salary" "avg_10yr_salary" "sd_10yr_salary" "med_10yr_salary"
"net_cost" "total_net_income"

Categorical: race, main_campus, open_admissions, primary_degree, highest_degree,
ft_faculty, admission, retention, completion, fed_loan, pell_grant, default_3yr_rate,
repay_5yr_rate_withdraw, repay_5yr_rate

Discrete: unit_id, branches, avg_sat, undergrad_pop, grad_pop, cost, tuition, tuition_nonresident,
revenue_per_student, spend_per_student, avg_faculty_salary, avg_10yr_salary, sd_10yr_salary,
med_10yr_salary, net_cost, total_net_income

There are no other kinds of features, just these two.

Question 3

How many missing values are in the dataset? Which feature has the most missing values? Are
there any patterns?

There are 23398 missing values in the dataset. The feature which has the most missing is the
"other race" feature at 51 missing values. One pattern is that when one race feature is missing, the
other race features are most likely missing as well. Similarly, if one of the gender features is
missing, the other is missing too.
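
A small sketch of how these counts can be obtained, shown on a made-up data frame rather than the real dataset. The `toy` table, its column values, and the variable names `na_counts`, `total_na`, `most_missing`, and `both_missing` are hypothetical.

```r
# Hypothetical data with missing values that occur together in the race columns
toy <- data.frame(
  race_white = c(0.5, NA, 0.7, NA),
  race_other = c(0.1, NA, 0.2, NA),
  tuition    = c(1000, 2000, NA, 3000)
)

na_counts <- colSums(is.na(toy))   # missing values per feature
total_na <- sum(na_counts)         # missing values in the whole table

# which.max returns the first maximum, so ties go to the earlier column
most_missing <- names(which.max(na_counts))

# Rows where both race features are missing at the same time
both_missing <- sum(is.na(toy$race_white) & is.na(toy$race_other))
```

Comparing `both_missing` with the per-column counts is one way to check whether features tend to be missing together.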

Question 4

Are there more public colleges or private colleges recorded? For each of these, what are the
proportions of highest degree awarded? Display this information in one graph and comment on
what you see.

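Number of colleges by ownership:

Not Public   Public
      2596      716

Counts of highest degree awarded by ownership:

            Other   Certificate   Associate   Bachelor   Graduate
Public         10             0          16        134        556
Nonprofit      73             3          19        393       1222
Profit        161             2          70        368        285

Proportions of highest degree awarded by ownership (column proportions):

            Other   Certificate   Associate   Bachelor   Graduate
Public      .0409             0       .1523      .1497      .2695
Nonprofit   .2991            .6       .1809      .4391      .5923
Profit      .6598            .4       .6667      .4111      .1381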

There are 716 public colleges and 2596 private colleges (nonprofit and for-profit), so more
private colleges than public colleges are recorded. The proportion of highest degree for public is
13.7% and for private is 86.3%. For both public and non-public colleges, the most common highest
degree awarded is Graduate.

Question 5

What is the average undergraduate population? What is the median? What are the deciles?
Display these statistics and the distribution graphically. Do you notice anything unusual?

Quartiles of undergrad_pop:

Percentile   0    25    50    75     100
Value        0   428  1295  3372  166816

Deciles of undergrad_pop:

Percentile   0   10     20   30     40    50      60      70      80      90     100
Value        0  153  319.2  536  847.6  1295  1811.8  2674.5  4550.8  9629.8  166816

The mean is 3599.502, or roughly 3600 (since you cannot have a portion of a person). The
median is 1295. In the deciles, the maximum value appears to be an outlier. However, removing
that outlier would remove valuable information. Upon further inspection, the outlier is University
of Phoenix-Online Campus, an online school, so a very large number of students taking classes
there is plausible.
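
A minimal sketch of how one extreme value separates the mean from the median, and of how deciles come out of `quantile`. The `pop` vector is made up for illustration and is not taken from the dataset.

```r
# Hypothetical enrollment figures with one very large school
pop <- c(100, 200, 300, 400, 500, 600, 700, 800, 900, 100000)

pop_mean <- mean(pop)      # pulled upward by the single large value
pop_median <- median(pop)  # barely affected by it

# Deciles: 11 cut points from 0% to 100% in steps of 10%
deciles <- quantile(pop, probs = seq(0, 1, by = 0.1))
```

Here the mean lands far above the median, the same pattern the statistics above show for undergrad_pop.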

Question 6

Compare tuition graphically in the 5 most populous states. Discuss conclusions you can draw
from your results.

Aside from outliers in CA, IL, and FL, the tuition distributions for CA, IL, FL, and TX all appear to
have roughly equal variances. The exception is NY, which has an extremely large variance.

Question 7

For the following questions, use code to justify your answer:


a. What is the name of the university with the largest value of avg_sat?
b. Does the university with the largest amount of undergrad_pop have open admissions?
c. List the zip code of the public university with the smallest value of avg_family_inc.
d. Does the university you found in part b. also have the largest amount of grad_pop?

a. California Institute of Technology has the largest average SAT.
b. University of Phoenix-Online Campus has the largest undergraduate population and has open
admissions.
c. The zip code for the public university with the smallest average family income is 11101.
d. University of Phoenix-Online Campus does not have the largest graduate population.

Question 8

For schools that are for-profit in ownership and issue Bachelor’s degrees as their
primary_degree, do the following:
a. Visualize revenue_per_student and spending_per_student and describe the relationship. What
issues may arise when fitting a linear regression model?
b. Create a new variable called total_net_income. Think carefully about how this variable would
be calculated. Visualize the top 5 earning schools.

University                   total_net_income (USD)
University Phoenix-Online    2103883392
Ashford                       542588295
Kaplan                        333051147
DeVry                         315034368
Grand Canyon                  275134706
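
A short sketch of how a ranking like the one above can be produced. It assumes net income is the per-student margin multiplied by the undergraduate population, which is the formula the appendix code uses. The `toy` schools and their figures are hypothetical.

```r
# Hypothetical schools; only the column names match the dataset
toy <- data.frame(
  name = c("S1", "S2", "S3"),
  undergrad_pop = c(1000, 5000, 200),
  revenue_per_student = c(12000, 9000, 30000),
  spend_per_student = c(10000, 4000, 35000),
  stringsAsFactors = FALSE
)

# Per-student margin scaled by enrollment (graduate students are ignored here)
toy$total_net_income <- toy$undergrad_pop *
  (toy$revenue_per_student - toy$spend_per_student)

# Sort from highest to lowest net income
top <- toy[order(-toy$total_net_income), ]
```

A school can have a high margin per student and still rank low if its enrollment is small, so the ranking is driven largely by size.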

For most schools, the lower the revenue per student, the less they spend per student. The outliers
that spend more on fewer students are most likely from private colleges. One issue that may arise
when fitting a linear regression model is that these outliers skew the fitted line.

Question 9

Now, examine the relationship between avg_sat and admission for all schools.
a. Use an appropriate plot to visualize the relationship. Split the data into two groups based on
their combination of avg_sat and admission. Justify your answer.
b. Using code to justify your answers, comment on how the following continuous variables
change depending on group:
(a) med_10yr_salary
(b) The percentage of race_white and race_asian combined
(c) The percentage of graduate students enrolled at a university
c. Using code to justify your answers, comment on whether the categorical variables are
dependent or independent of group:
(a) open_admission
(b) main_campus
(c) ownership
(d) Whether the university has more than 1 branch or not

Graduate students make up 22.1% of total enrollment.

open_admissions: if it is true, the school accepts anyone who applies. It is independent.
main_campus depends on the city and zip code: because certain colleges have branches, it is
dependent on the city and zip code.
ownership may be dependent on the number of graduate or undergraduate students, depending on
the type of ownership.
branches is independent: it does not rely on any other variable to decide whether there are one or
more branches.
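
One way to check dependence on group with code is a chi-squared test of independence on a two-way table. This is a sketch of that technique on made-up data, not the method used in the appendix. The `group` and `open_adm` vectors are hypothetical.

```r
# Hypothetical data: 50 schools in each group
group <- rep(c(TRUE, FALSE), each = 50)
open_adm <- c(rep(c(TRUE, FALSE), times = c(5, 45)),   # group TRUE: 5 open admissions
              rep(c(TRUE, FALSE), times = c(30, 20)))  # group FALSE: 30 open admissions

tab <- table(group, open_adm)  # 2 x 2 table of counts
result <- chisq.test(tab)      # null hypothesis: the two variables are independent
p_value <- result$p.value
```

A small p-value is evidence against independence. The same pattern applies to main_campus, ownership, and an indicator for having more than one branch.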

Question 10

Examine the relationship between avg_10yr_salary and avg_family_inc for all schools.

a. Use an appropriate plot for these two variables. Fit a linear regression model that predicts
avg_10yr_salary using avg_family_inc. Add this line to the plot you used. Investigate the groups
of points that may be affecting the regression line.
b. Describe a categorical variable that would improve the fit of the regression line based on your
investigation in part a. What would the levels of this variable be?

The data points in the top corner are skewing the regression line. On closer inspection, those data
points are medical schools, so removing them should improve the regression line.

Categorical variables that would improve the regression line are ownership and
open_admissions. highest_degree would show which degrees lead to higher salaries.
For average family income, adding open_admissions would help explain the
differences between salaries coming from different colleges. Ownership often shows that
students from private colleges earn higher salaries than those from public colleges.
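
A minimal sketch of how adding a categorical variable can improve a fit like this one. The data are made up, and the `is_medical` indicator is a hypothetical variable introduced only for illustration, motivated by the medical schools noted above. Values are in arbitrary units.

```r
# Hypothetical schools; two of them are flagged as medical schools
toy <- data.frame(
  avg_family_inc = c(20, 30, 40, 50, 60, 70, 80, 90),
  is_medical = factor(c("no", "no", "no", "yes", "no", "no", "yes", "no"))
)
noise <- c(1, -2, 0.5, 1.5, -1, 2, -0.5, -1.5)
toy$avg_10yr_salary <- 30 + 0.4 * toy$avg_family_inc +
  40 * (toy$is_medical == "yes") + noise

# Model without and with the categorical variable
fit_simple <- lm(avg_10yr_salary ~ avg_family_inc, data = toy)
fit_factor <- lm(avg_10yr_salary ~ avg_family_inc + is_medical, data = toy)

r2_simple <- summary(fit_simple)$r.squared
r2_factor <- summary(fit_factor)$r.squared
```

The second model gives the flagged schools their own intercept instead of dropping them, so those points no longer pull the line upward.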

data=readRDS("~/Desktop/RStudio/Data141a/college_scorecard_2013.rds")
library(tidyverse)

##1

nrow(data)

length(data$main_campus)
#Two different ways to find the number of observations:
#one is nrow and the other is the length of a single column
sum(data$main_campus)
#Rows can include branch campuses, so the number of colleges may differ from the
#number of rows; summing main_campus counts each college once

##2
table1 <- sapply(data, class)
dmy<- 'integer'
sapply(data, class)==dmy
(sapply(data,class)==dmy)[sapply(data, class)==dmy]

table1==dmy
table1[table1==dmy]
names(table1[table1==dmy])
#Lecture

##3
is.na(data)
dataNA=colSums(is.na(data))
which.max(dataNA)
#is.na flags the missing values and colSums counts them for each feature
#The which.max function finds the index of the feature with the most NA values.

NA_counts<-colSums(is.na(data))

sum(NA_counts)

hist(NA_counts)
table(NA_counts)
which(NA_counts==490)

##4

ownership1 = data %>% filter(ownership == "Public")


count(ownership1[1])

ownership2 = data%>% filter(ownership == "Nonprofit")


count(ownership2[1])

ownership3 = data %>% filter(ownership == "For Profit")


count(ownership3[1])

ownership4 = data %>% filter(ownership != "Public")


count(ownership4[1])

#First we filter out a public and nonpublic and count them.

table(data$ownership=="Public")
table(factor(data$ownership=="Public",labels=c("Not Public","Public")))

degree=as.numeric((data$highest_degree))
ownership=as.numeric((data$ownership))

table(data$ownership,data$highest_degree)
#Lecture

#From there, we can count the degrees for public against nonpublic

ggplot(ownership1, aes(x=highest_degree, y= ownership, col = highest_degree))+geom_col()


ggplot(ownership4, aes(x=highest_degree, y= ownership, col = highest_degree))+geom_col()

props <- prop.table(table(data$ownership,data$highest_degree),margin=1)


prop.table(table(data$ownership,data$highest_degree),margin=2)
#Lecture

data$IsPublic<-factor(data$ownership=="Public",labels=c("Not Public","Public"))

dim(data)

props <- prop.table(table(data$IsPublic,data$highest_degree),margin=1)


mosaicplot(props)
mosaicplot(props, color = c("red","blue","green","purple","yellow" ), shade = FALSE,
main = "Proportions of High Degree Per College Type",
xlab = "High Degree Awarded", ylab = "Ownership", las=2)

#Next we plot the data. Mosaic plots and barplots relay the same information,
#however, the mosaic plot shows all the data on the same graph.

mat1<-is.na(data)
mat2<-!is.na(data)
dim(mat1)
dim(mat2)

sum(mat1)/(3312*51)
sum(mat2)/(3312*51)
#Lecture

#Use the sum function to find the number of missing values, then divide by the total number of cells

##5
undergrad_mean = mean(data$undergrad_pop, na.rm = TRUE)
undergrad_median = median(data$undergrad_pop, na.rm = TRUE)

#First we calculate the mean and median using mean and median function

quantile(data$undergrad_pop, na.rm = TRUE)


quantile(data$undergrad_pop, na.rm=TRUE, prob=seq(0,1,length = 11))

#https://stackoverflow.com/questions/26273892/
#r-splitting-dataset-into-quartiles-deciles-what-is-the-right-method

#Next we find the quartiles and deciles with the quantile function; for the deciles,
#we need to specify the probabilities.

density1 = density(data$undergrad_pop, na.rm = TRUE)


probs = quantile(data$undergrad_pop, na.rm = TRUE)

x = is.na(data$undergrad_pop)

y=data[x == FALSE,]
table(is.na(y$undergrad_pop))
table(x)

dataomit=na.omit(data)
#Base graphics are layered with separate calls, not with +
hist(dataomit$undergrad_pop)
abline(v = undergrad_mean, col = "red")
abline(v = undergrad_median, col = "blue")

hist(dataomit$undergrad_pop)
abline(v=quantile(dataomit$undergrad_pop, c(.25,.5,.75)), col = "brown" )
hist(dataomit$undergrad_pop)
abline(v=quantile(dataomit$undergrad_pop, prob=seq(0,1,length = 11)), col = "purple" )

hist(dataomit$undergrad_pop, main = "Histogram of Undergrad Population")


abline(v=undergrad_mean, col = "red", lwd=3, lty=2)
abline(v=undergrad_median, col = "blue")
abline(v=quantile(dataomit$undergrad_pop, c(.25,.5,.75)), col = "green" )
abline(v=quantile(dataomit$undergrad_pop, prob=seq(0,1,length = 11)), col = "purple" )

#We omit the NA values since changing them to 0 or any value could skew the data result/graph
#Afterwards, we create ablines for the mean, median, quartiles, and deciles.

#https://www.r-bloggers.com/quartiles-deciles-and-percentiles/

##6
top5= data[data$state %in% c("CA","TX", "NY", "IL", "FL"),]
topcollege5 = droplevels(top5$state)

boxplot(avg_sat~topcollege5, data = top5, main = "Average SAT score VS Top 5 Populous States")
#legend(400,400,legend = c(""), lty = 1:5, col = c(""))

boxplot((undergrad_pop+grad_pop)~topcollege5, data=top5, main = "")


boxplot(tuition~topcollege5, data=top5, main = "Tuition in the 5 Most Populous States")

plot(density(top5$tuition[top5$state == "CA"], na.rm = T))


lines(density(top5$tuition[top5$state == "TX"], na.rm = T), lty = 2, col = "red")
lines(density(top5$tuition[top5$state == "NY"], na.rm = T), lty = 3, col = "blue")
lines(density(top5$tuition[top5$state == "IL"], na.rm = T), lty = 4, col = "green")
lines(density(top5$tuition[top5$state == "FL"], na.rm = T), lty = 5, col = "purple")

#2 x 3 grid so that all five histograms fit on one page
par(mfrow = c(2, 3),mar=c(2,2,2,1))
hist(top5$tuition[top5$state == "CA"], col = "yellow", main = "CA")
hist(top5$tuition[top5$state == "TX"], col = "blue", main = "TX")
hist(top5$tuition[top5$state == "NY"], col = "red", main = "NY")
hist(top5$tuition[top5$state == "IL"], col = "green", main = "IL")
hist(top5$tuition[top5$state == "FL"], col = "pink", main = "FL")
#Lecture

#California, Texas, New York, Illinois, Florida are given as the 5 most populous states
#Droplevels to remove data points which are not in use
#Use boxplot to show the top populous states and their tuition

##7
data_satavg=data[,c(6,14)]
data[which.max(data$avg_sat),]
#Using the which function to find the index where the avg_sat is the largest,
#then calling it in the data

data_openpop=data[,c(6,5,15)]
data[which.max(data$undergrad_pop),]
#Using the which function to find the index where the undergrad_pop is the largest,
#then calling it in the data
data_zipfamily=data[,c(6,9,13,29)]
a=data %>% filter(ownership == "Public")
a[which.min(a$avg_family_inc),]
#subset the data to filter only public, then from the subset data, use which to find the smallest
#avg_family_inc index, call the index from the subset data to find the zip code.

data_gradpop=data[,c(6,5,15,16)]
data[which.max(data$grad_pop),]
#Using the which function to find the index where the grad_pop is the largest,
#then calling it in the data

##8
newdata1=data %>% filter(ownership == "For Profit", primary_degree == "Bachelor")
ggplot(newdata1, aes(x=revenue_per_student, y=spend_per_student))+
geom_point()+geom_smooth(method = "lm", se = FALSE, lty=2)

newdata1$total_net_income = newdata1$undergrad_pop*
(newdata1$revenue_per_student - newdata1$spend_per_student)

data2 <- newdata1[order(newdata1$total_net_income) , ]


data2 <- head(newdata1[order(-newdata1$total_net_income, newdata1$main_campus) ,],5)
data2$total_net_income
data2$name
data3=cbind(data2$name, (data2$total_net_income) )

#Divide by 1e6 so the axis matches its label (millions of USD)
barplot(data2$total_net_income/1e6, col = c("red","blue","green","yellow","purple"),
names.arg= c("Phoenix","Ashford","Kaplan","DeVry","Grand Canyon"),
ylab = "Total net income (millions of USD)", main = "Top 5 Earning Schools")

#Subset the data by ownership and primary_degree
#Afterwards, plot the graph using the new data as the data frame.
#Next, to create total_net_income, we subtract spend_per_student from
#revenue_per_student and multiply the difference by undergrad_pop
#Plot total_net_income for the top 5 schools as a barplot

##9
data$avg_sat
data$admission

plot(data$avg_sat,data$admission)

#group is TRUE for schools with avg_sat above 1200 and admission below .4
group<-(data$avg_sat > 1200) & (data$admission < .4)
#Schools with a missing avg_sat or admission are placed in the FALSE group
group2<-ifelse(is.na(group),FALSE,group)

props2 <- prop.table(table(data$avg_sat > 1200,data$admission < .4),margin=1)


mosaicplot(props2)
mosaicplot(props2, color = c("red","blue" ), shade = FALSE,
main = "Average SAT score vs Admission",
xlab = "Average SAT score above 1200", ylab = "Admission below .4", las=2)

ggplot(data, aes(x=avg_sat, y=admission, col = group2))+geom_point()+


geom_smooth(method = "lm", se = FALSE, col = "blue",lty=2)

ggplot(data, aes(avg_sat, med_10yr_salary, col = group2))+geom_point()+


geom_smooth(method = "lm", se = FALSE, col = "blue",lty=2)

#ggplot has no main argument, so the titles are added with ggtitle
ggplot(data, aes(x=avg_sat,y=race_white+race_asian, col = group2))+
geom_boxplot(aes(x=group2))+ggtitle("Average SAT score vs Race:White+Asian")

ggplot(data, aes(avg_sat, grad_pop, col = group2))+
geom_boxplot(aes(x=group2))+ggtitle("Graduate Population vs Average SAT")
sum(dataomit$grad_pop)/sum(dataomit$grad_pop+dataomit$undergrad_pop)

#First subset the data by avg_sat and admission; however, doing just that will not help
#To provide a better subset, we require conditions on the data we're working on
#Then plot the data using the subsetted data, and continue using it for the other graphs

##10
ggplot(data, aes(x=avg_family_inc, y=avg_10yr_salary))+geom_point()+
geom_smooth(method = "lm", se = FALSE, lty=2)

data4= data[-c(1789,2030,1956,682,1331,891,1085,1577,1241,1337,2180,1635,2204),]

ggplot(data4, aes(x=avg_family_inc, y=avg_10yr_salary))+geom_point()+


geom_smooth(method = "lm", se = FALSE, lty=2)

#Plot the data using the original data frame; afterwards, find the outliers, remove them,
#and replot the data