
POPULARITY OF MUSIC RECORDS

The music industry has a well-developed market with a global annual revenue around $15 billion. The recording industry is highly competitive and is dominated by three big production companies which make up nearly 82% of the total annual album sales.
Artists are at the core of the music industry and record labels provide
them with the necessary resources to sell their music on a large scale. A
record label incurs numerous costs (studio recording, marketing,
distribution, and touring) in exchange for a percentage of the profits
from album sales, singles and concert tickets.
Unfortunately, the success of an artist's release is highly uncertain: a
single may be extremely popular, resulting in widespread radio play and
digital downloads, while another single may turn out quite unpopular,
and therefore unprofitable.
Knowing the competitive nature of the recording industry, record labels
face the fundamental decision problem of which musical releases to
support to maximize their financial success.
How can we use analytics to predict the popularity of a song? In this
assignment, we challenge ourselves to predict whether a song will reach
a spot in the Top 10 of the Billboard Hot 100 Chart.
Taking an analytics approach, we aim to use information about a song's
properties to predict its popularity. The dataset songs.csv consists of all
songs which made it to the Top 10 of the Billboard Hot 100 Chart from
1990-2010 plus a sample of additional songs that didn't make the Top
10. This data comes from three sources: Wikipedia, Billboard.com,
and EchoNest.
The variables included in the dataset either describe the artist or the
song, or they are associated with the following song attributes: time
signature, loudness, key, pitch, tempo, and timbre.
Here's a detailed description of the variables:
year = the year the song was released
songtitle = the title of the song
artistname = the name of the artist of the song
songID and artistID = identifying variables for the song and artist

timesignature and timesignature_confidence = a variable estimating the time signature of the song, and the confidence in the estimate
loudness = a continuous variable indicating the average amplitude
of the audio in decibels
tempo and tempo_confidence = a variable indicating the
estimated beats per minute of the song, and the confidence in the
estimate
key and key_confidence = a variable with twelve levels indicating
the estimated key of the song (C, C#, . . ., B), and the confidence in the
estimate
energy = a variable that represents the overall acoustic energy of
the song, using a mix of features such as loudness
pitch = a continuous variable that indicates the pitch of the song
timbre_0_min, timbre_0_max, timbre_1_min, timbre_1_max, ..., timbre_11_min, and timbre_11_max = variables that indicate the minimum/maximum values over all segments for each of the twelve values in the timbre vector (resulting in 24 continuous variables)
Top10 = a binary variable indicating whether or not the song made
it to the Top 10 of the Billboard Hot 100 Chart (1 if it was in the top 10,
and 0 if it was not)
Use the read.csv function to load the dataset "songs.csv" into R.
How many observations (songs) are from the year 2010?
373

First, navigate to the directory on your computer containing the file "songs.csv". You can load the dataset by using the command:
songs = read.csv("songs.csv")
Then, you can count the number of songs from 2010 by using the
table function:

table(songs$year)
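As a quick cross-check (an optional sketch, not part of the original solution), you can count the 2010 songs directly:
sum(songs$year == 2010)   # sums a logical vector of TRUE/FALSE values; should return 373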
How many songs does the dataset include for which the artist name is
"Michael Jackson"?
18

If you look at the structure of the dataset by typing str(songs), you can
see that there are 1032 different values of the variable "artistname".
So if we create a table of artistname, it will be challenging to find
Michael Jackson. Instead, we can use subset:
MichaelJackson = subset(songs, artistname == "Michael Jackson")
Then, by typing str(MichaelJackson) or nrow(MichaelJackson), we
can see that there are 18 observations.
Which of these songs by Michael Jackson made it to the Top 10? Select
all that apply.
You Rock My World, You Are Not Alone - correct
Beat It
You Rock My World
Billie Jean
You Are Not Alone
We can answer this question by using our subset MichaelJackson from
the previous question. If you output the vector
MichaelJackson$songtitle, you can see the row number of each of the
songs. Then, you can see whether or not that song made it to the top
10 by outputing the value of Top10 for that row. For example, "Beat
It" is the 13th song in our subset. So then if we type:

MichaelJackson$Top10[13]
we get 0, which means that this song did not make it to the Top 10.
The song "You Rock My World" is first on the list, so if we type:
MichaelJackson$Top10[1]
we get 1, which means that this song did make it to the Top 10.
As a shortcut, you could just output:
MichaelJackson[ , c("songtitle", "Top10")]
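Another optional variant (a sketch, beyond what the explanation above shows) lists only the titles that reached the Top 10:
MichaelJackson$songtitle[MichaelJackson$Top10 == 1]   # prints the titles of the two hits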
The variable corresponding to the estimated time signature
(timesignature) is discrete, meaning that it only takes integer values (0,
1, 2, 3, . . . ). What are the values of this variable that occur in our
dataset? Select all that apply.
0, 1, 3, 4, 5, 7 - correct
0
1
2
3
4
5
6
7
8
Which timesignature value is the most frequent among songs in our
dataset?
4 - correct
You can answer these questions by using the table command:
table(songs$timesignature)
The only values that appear in the table for timesignature are 0, 1, 3,
4, 5, and 7. We can also read from the table that 6787 songs have a
value of 4 for the timesignature, which is the highest count out of all
of the possible timesignature values.
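If you prefer not to scan the table by eye, a one-line alternative (an optional sketch, not required by the assignment) is:
names(which.max(table(songs$timesignature)))   # returns "4", the most frequent time signature value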

Out of all of the songs in our dataset, the song with the highest tempo is
one of the following songs. Which one is it?
Until The Day I Die
Wanna Be Startin' Somethin' - correct
My Happy Ending
You Make Me Wanna...
You can answer this question by using the which.max function. The
output of which.max(songs$tempo) is 6206, meaning that the song
with the highest tempo is in row 6206. We can output the song title
by typing:
songs$songtitle[6206]
The song title is: Wanna be Startin' Somethin'.
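Equivalently, a one-step variant (an optional sketch, assuming the same songs data frame) pulls the row and title together:
songs[which.max(songs$tempo), c("songtitle", "tempo")]   # row with the maximum tempo, plus its title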
We wish to predict whether or not a song will make it to the Top 10. To
do this, first use the subset function to split the data into a training set
"SongsTrain" consisting of all the observations up to and including 2009
song releases, and a testing set "SongsTest", consisting of the 2010 song
releases.
How many observations (songs) are in the training set?
7201

You can split the data into the training set and the test set by using the
following commands:
SongsTrain = subset(songs, year <= 2009)
SongsTest = subset(songs, year == 2010)

The training set has 7201 observations, which can be found by looking at the structure with str(SongsTrain) or by typing nrow(SongsTrain).
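As an optional sanity check (assuming the subsets above), the two pieces should account for every song in the dataset:
nrow(SongsTrain) + nrow(SongsTest) == nrow(songs)   # should print TRUE, since all years fall in 1990-2010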
In this problem, our outcome variable is "Top10" - we are trying to
predict whether or not a song will make it to the Top 10 of the Billboard
Hot 100 Chart. Since the outcome variable is binary, we will build a
logistic regression model. We'll start by using all song attributes as our
independent variables, which we'll call Model 1.
We will only use the variables in our dataset that describe the numerical
attributes of the song in our logistic regression model. So we won't use
the variables "year", "songtitle", "artistname", "songID" or "artistID".
We have seen in the lecture that, to build the logistic regression model,
we would normally explicitly input the formula including all the
independent variables in R. However, in this case, this is a tedious
amount of work since we have a large number of independent variables.
There is a nice trick to avoid doing so. Let's suppose that, except for the
outcome variable Top10, all other variables in the training set are inputs
to Model 1. Then, we can use the formula
SongsLog1 = glm(Top10 ~ ., data=SongsTrain, family=binomial)
to build our model. Notice that the "." is used in place of enumerating all
the independent variables. (Also, keep in mind that you can choose to
put quotes around binomial, or leave out the quotes. R can understand
this argument either way.)
However, in our case, we want to exclude some of the variables in our
dataset from being used as independent variables ("year", "songtitle",
"artistname", "songID", and "artistID"). To do this, we can use the
following trick. First define a vector of variable names called nonvars - these are the variables that we won't use in our model.
nonvars = c("year", "songtitle", "artistname", "songID", "artistID")
To remove these variables from your training and testing sets, type the
following commands in your R console:
SongsTrain = SongsTrain[ , !(names(SongsTrain) %in% nonvars) ]

SongsTest = SongsTest[ , !(names(SongsTest) %in% nonvars) ]


Now, use the glm function to build a logistic regression model to predict
Top10 using all of the other variables as the independent variables. You
should use SongsTrain to build the model.
Looking at the summary of your model, what is the value of the Akaike
Information Criterion (AIC)?
4827.2

To answer this question, you first need to run the three given
commands to remove the variables that we won't use in the model
from the datasets:
nonvars = c("year", "songtitle", "artistname", "songID", "artistID")
SongsTrain = SongsTrain[ , !(names(SongsTrain) %in% nonvars) ]
SongsTest = SongsTest[ , !(names(SongsTest) %in% nonvars) ]
Then, you can create the logistic regression model with the following command:
SongsLog1 = glm(Top10 ~ ., data=SongsTrain, family=binomial)
Looking at the bottom of the summary(SongsLog1) output, we can
see that the AIC value is 4827.2.
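The AIC can also be extracted directly from the fitted model object, which avoids scanning the summary output (an optional shortcut):
AIC(SongsLog1)   # returns the model's AIC, roughly 4827.2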
Let's now think about the variables in our dataset related to the
confidence of the time signature, key and tempo
(timesignature_confidence, key_confidence, and tempo_confidence).
Our model seems to indicate that these confidence variables are
significant (rather than the variables timesignature, key and tempo
themselves). What does the model suggest?

The lower our confidence about time signature, key and tempo, the more likely the song is to be in the Top 10
The higher our confidence about time signature, key and tempo, the more likely the song is to be in the Top 10 - correct
If you look at the output summary(model), where model is the name
of your logistic regression model, you can see that the coefficient
estimates for the confidence variables (timesignature_confidence,
key_confidence, and tempo_confidence) are positive. This means that
higher confidence leads to a higher predicted probability of a Top 10
hit.
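To see those signs without reading the full summary, an optional sketch (assuming the model is stored as SongsLog1, as above) pulls out just the three confidence coefficients:
coef(SongsLog1)[c("timesignature_confidence", "key_confidence", "tempo_confidence")]   # all three estimates are positive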
In general, if the confidence is low for the time signature, tempo, and
key, then the song is more likely to be complex. What does Model 1
suggest in terms of complexity?
Mainstream listeners tend to prefer more complex songs
Mainstream listeners tend to prefer less complex songs - correct
Since the coefficient values for timesignature_confidence,
tempo_confidence, and key_confidence are all positive, lower
confidence leads to a lower predicted probability of a song being a hit.
So mainstream listeners tend to prefer less complex songs.
Songs with heavier instrumentation tend to be louder (have higher
values in the variable "loudness") and more energetic (have higher
values in the variable "energy").

By inspecting the coefficient of the variable "loudness", what does Model 1 suggest?
Mainstream listeners prefer songs with heavy instrumentation - correct
Mainstream listeners prefer songs with light instrumentation
By inspecting the coefficient of the variable "energy", do we draw the
same conclusions as above?
No - correct

The coefficient estimate for loudness is positive, meaning that mainstream listeners prefer louder songs, which are those with heavier
instrumentation. However, the coefficient estimate for energy is
negative, meaning that mainstream listeners prefer songs that are less
energetic, which are those with light instrumentation. These
coefficients lead us to different conclusions!
What is the correlation between the variables "loudness" and "energy" in
the training set?
0.7399067

The correlation can be computed with the following command:
cor(SongsTrain$loudness, SongsTrain$energy)
Given that these two variables are highly correlated, Model 1 suffers
from multicollinearity. To avoid this issue, we will omit one of these two
variables and rerun the logistic regression. In the rest of this problem,
we'll build two variations of our original model: Model 2, in which we keep "energy" and omit "loudness", and Model 3, in which we keep "loudness" and omit "energy".
Create Model 2, which is Model 1 without the independent variable
"loudness". This can be done with the following command:
SongsLog2 = glm(Top10 ~ . - loudness, data=SongsTrain,
family=binomial)
We just subtracted the variable loudness. We couldn't do this with the
variables "songtitle" and "artistname", because they are not numeric
variables, and we might get different values in the test set that the
training set has never seen. But this approach (subtracting the variable
from the model formula) will always work when you want to remove
numeric variables.
Look at the summary of SongsLog2, and inspect the coefficient of the
variable "energy". What do you observe?
Model 2 suggests that songs with high energy levels tend to be more popular. This contradicts our observation in Model 1. - correct
Model 2 suggests that, similarly to Model 1, songs with low energy levels tend to be more popular.
The coefficient estimate for energy is positive in Model 2, suggesting
that songs with higher energy levels tend to be more popular.
However, note that the variable energy is not significant in this model.
Now, create Model 3, which should be exactly like Model 1, but without
the variable "energy".
Look at the summary of Model 3 and inspect the coefficient of the
variable "loudness". Remembering that higher loudness and energy both
occur in songs with heavier instrumentation, do we make the same
observation about the popularity of heavy instrumentation as we did
with Model 2?

Yes - correct

Model 3 can be created with the following command:


SongsLog3 = glm(Top10 ~ . - energy, data=SongsTrain,
family=binomial)
Looking at the output of summary(SongsLog3), we can see that
loudness has a positive coefficient estimate, meaning that our model
predicts that songs with heavier instrumentation tend to be more
popular. This is the same conclusion we got from Model 2.
In the remainder of this problem, we'll just use Model 3.
Make predictions on the test set using Model 3. What is the accuracy of
Model 3 on the test set, using a threshold of 0.45? (Compute the
accuracy as a number between 0 and 1.)
0.8793566

You can make predictions on the test set by using the command:
testPredict = predict(SongsLog3, newdata=SongsTest,
type="response")
Then, you can create a confusion matrix with a threshold of 0.45 by
using the command:
table(SongsTest$Top10, testPredict >= 0.45)
The accuracy of the model is (309+19)/(309+5+40+19) = 0.87936
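A programmatic version of the same arithmetic (an optional sketch; confMat is just a name chosen here for the confusion matrix):
confMat = table(SongsTest$Top10, testPredict >= 0.45)   # confusion matrix at the 0.45 threshold
sum(diag(confMat)) / sum(confMat)                       # correct predictions divided by all test songs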

Let's check if there's any incremental benefit in using Model 3 instead of a baseline model. Given the difficulty of guessing which song is going to
be a hit, an easier model would be to pick the most frequent outcome (a
song is not a Top 10 hit) for all songs. What would the accuracy of the
baseline model be on the test set? (Give your answer as a number
between 0 and 1.)
0.8418231

You can compute the baseline accuracy by tabling the outcome variable in the test set:
table(SongsTest$Top10)
The baseline model would get 314 observations correct, and 59
wrong, for an accuracy of 314/(314+59) = 0.8418231.
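The same number can be computed without reading the table by hand (an optional sketch, assuming SongsTest from above):
mean(SongsTest$Top10 == 0)   # proportion of test songs that are not Top 10 hits, about 0.8418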
It seems that Model 3 gives us a small improvement over the baseline
model. Still, does it create an edge?
Let's view the two models from an investment perspective. A production
company is interested in investing in songs that are highly likely to make
it to the Top 10. The company's objective is to minimize its risk of
financial losses attributed to investing in songs that end up unpopular.
A competitive edge can therefore be achieved if we can provide the
production company a list of songs that are highly likely to end up in the
Top 10. We note that the baseline model does not prove useful, as it
simply does not label any song as a hit. Let us see what our model has to
offer.
How many songs does Model 3 correctly predict as Top 10 hits in 2010
(remember that all songs in 2010 went into our test set), using a
threshold of 0.45?
19

How many non-hit songs does Model 3 predict will be Top 10 hits
(again, looking at the test set), using a threshold of 0.45?
5

According to our model's confusion matrix:
table(SongsTest$Top10, testPredict >= 0.45)
We have 19 true positives (Top 10 hits that we predict correctly), and
5 false positives (songs that we predict will be Top 10 hits, but end up
not being Top 10 hits).
What is the sensitivity of Model 3 on the test set, using a threshold of
0.45?
0.3220339

What is the specificity of Model 3 on the test set, using a threshold of 0.45?
0.9840764

Using the confusion matrix:
table(SongsTest$Top10, testPredict >= 0.45)
We can compute the sensitivity to be 19/(19+40) = 0.3220339, and the
specificity to be 309/(309+5) = 0.9840764.
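If you want these rates computed directly rather than by hand, an optional sketch (reusing testPredict from above, with confMat as a name chosen here):
confMat = table(SongsTest$Top10, testPredict >= 0.45)
confMat["1", "TRUE"] / sum(confMat["1", ])    # sensitivity: predicted hits among actual hits
confMat["0", "FALSE"] / sum(confMat["0", ])   # specificity: predicted non-hits among actual non-hits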
What conclusions can you make about our model? (Select all that apply.)
Model 3 favors specificity over sensitivity. - correct
Model 3 favors sensitivity over specificity.
Model 3 captures less than half of Top 10 songs in 2010. Model 3 therefore does not provide a useful list of candidate songs to investors, and hence offers no competitive edge.
Model 3 provides conservative predictions, and predicts that a song will make it to the Top 10 very rarely. So while it detects less than half of the Top 10 songs, we can be very confident in the songs that it does predict to be Top 10 hits. - correct
Model 3 has a very high specificity, meaning that it favors specificity
over sensitivity. While Model 3 only captures less than half of the Top
10 songs, it still can offer a competitive edge, since it is very
conservative in its predictions.

INTERNET PRIVACY POLL (OPTIONAL)


Internet privacy has gained widespread attention in recent years. To
measure the degree to which people are concerned about hot-button
issues like Internet privacy, social scientists conduct polls in which they
interview a large number of people about the topic. In this assignment,
we will analyze data from a July 2013 Pew Internet and American Life
Project poll on Internet anonymity and privacy, which involved
interviews across the United States. While the full polling data can be
found here, we will use a more limited version of the results, available
in AnonymityPoll.csv. The dataset has the following fields (all Internet
use-related fields were only collected from interviewees who either use
the Internet or have a smartphone):

Internet.Use: A binary variable indicating if the interviewee uses the Internet, at least occasionally (equals 1 if the interviewee uses the Internet, and equals 0 if the interviewee does not use the Internet).
Smartphone: A binary variable indicating if the interviewee has a
smartphone (equals 1 if they do have a smartphone, and equals 0 if they
don't have a smartphone).
Sex: Male or Female.
Age: Age in years.
State: State of residence of the interviewee.
Region: Census region of the interviewee (Midwest, Northeast,
South, or West).
Conservativeness: Self-described level of conservativeness of
interviewee, from 1 (very liberal) to 5 (very conservative).
Info.On.Internet: Number of the following items this interviewee
believes to be available on the Internet for others to see: (1) Their email
address; (2) Their home address; (3) Their home phone number; (4)
Their cell phone number; (5) The employer/company they work for; (6)
Their political party or political affiliation; (7) Things they've written
that have their name on it; (8) A photo of them; (9) A video of them; (10)
Which groups or organizations they belong to; and (11) Their birth date.
Worry.About.Info: A binary variable indicating if the interviewee
worries about how much information is available about them on the
Internet (equals 1 if they worry, and equals 0 if they don't worry).
Privacy.Importance: A score from 0 (privacy is not too important)
to 100 (privacy is very important), which combines the degree to which
they find privacy important in the following: (1) The websites they
browse; (2) Knowledge of the place they are located when they use the
Internet; (3) The content and files they download; (4) The times of day
they are online; (5) The applications or programs they use; (6) The
searches they perform; (7) The content of their email; (8) The people
they exchange email with; and (9) The content of their online chats or
hangouts with others.

Anonymity.Possible: A binary variable indicating if the interviewee thinks it's possible to use the Internet anonymously, meaning in such a way that online activities can't be traced back to them (equals 1 if he/she believes you can, and equals 0 if he/she believes you can't).

Tried.Masking.Identity: A binary variable indicating if the interviewee has ever tried to mask his/her identity when using the Internet (equals 1 if he/she has tried to mask his/her identity, and equals 0 if he/she has not tried to mask his/her identity).

Privacy.Laws.Effective: A binary variable indicating if the interviewee believes United States law provides reasonable privacy protection for Internet users (equals 1 if he/she believes it does, and equals 0 if he/she believes it doesn't).
Using read.csv(), load the dataset from AnonymityPoll.csv into a data
frame called poll and summarize it with the summary() and str()
functions.
How many people participated in the poll?
1002

The number of people who took the poll is equal to the number of
rows of the data frame, and can be obtained with nrow(poll) or from
the output of str(poll).
Let's look at the breakdown of the number of people with smartphones
using the table() and summary() commands on the Smartphone variable.
(HINT: These three numbers should sum to 1002.)

How many interviewees responded that they use a smartphone?
487

How many interviewees responded that they don't use a smartphone?
472

How many interviewees did not respond to the question, resulting in a
missing value, or NA, in the summary() output?
43

From the output of table(poll$Smartphone), we can read that 487 interviewees use a smartphone and 472 do not. From the
summary(poll$Smartphone) output, we see that another 43 had
missing values. As a sanity check, 487+472+43=1002, the total
number of interviewees.
By using the table() function on two variables, we can tell how they are
related. To use the table() function on two variables, just put the two
variable names inside the parentheses, separated by a comma (don't
forget to add poll$ before each variable name). In the output, the
possible values of the first variable will be listed on the left, and the
possible values of the second variable will be listed on the top. Each

entry of the table counts the number of observations in the data set that
have the value of the first variable in that row, and the value of the second
variable in that column. For example, suppose we want to create a table
of the variables "Sex" and "Region". We would type
table(poll$Sex, poll$Region)
in our R Console, and we would get as output
Midwest Northeast South West
Female 123 90 176 116
Male 116 76 183 122
This table tells us that we have 123 people in our dataset who are female
and from the Midwest, 116 people in our dataset who are male and from
the Midwest, 90 people in our dataset who are female and from the
Northeast, etc.
You might find it helpful to use the table() function to answer the
following questions:
Which of the following are states in the Midwest census region? (Select
all that apply.)
Kansas, Missouri, Ohio - correct
Colorado
Kansas
Kentucky
Missouri
Ohio
Pennsylvania
Which was the state in the South census region with the largest number
of interviewees?

Texas - correct
From table(poll$State, poll$Region), we can identify the census
region of a particular state by looking at the region associated with all
its interviewees. We can read that Colorado is in the West region,
Kentucky is in the South region, Pennsylvania is in the Northeast
region, but the other three states are all in the Midwest region. From
the same chart we can read that Texas is the state in the South region
with the largest number of interviewees, 72.
Another way to approach these problems would have been to subset
the data frame and then use table on the limited data frame. For
instance, to find which states are in the Midwest region we could have
used:
MidwestInterviewees = subset(poll, Region=="Midwest")
table(MidwestInterviewees$State)

and to find the number of interviewees from each South region state
we could have used:
SouthInterviewees = subset(poll, Region=="South")
table(SouthInterviewees$State)

As mentioned in the introduction to this problem, many of the response variables (Info.On.Internet, Worry.About.Info, Privacy.Importance,
Anonymity.Possible, and Tried.Masking.Identity) were not collected if
an interviewee does not use the Internet or a smartphone, meaning the
variables will have missing values for these interviewees.
How many interviewees reported not having used the Internet and not
having used a smartphone?
186

How many interviewees reported having used the Internet and having
used a smartphone?
470

How many interviewees reported having used the Internet but not having
used a smartphone?
285

How many interviewees reported having used a smartphone but not
having used the Internet?
17

These four values can be read from table(poll$Internet.Use, poll$Smartphone).

How many interviewees have a missing value for their Internet use?
1

How many interviewees have a missing value for their smartphone use?
43

The number of missing values can be read from summary(poll)
PROBLEM 2.3 - INTERNET AND SMARTPHONE USERS

Use the subset function to obtain a data frame called "limited", which is
limited to interviewees who reported Internet use or who reported
smartphone use. In lecture, we used the & symbol to use two criteria to
make a subset of the data. To only take observations that have a certain
value in one variable or the other, the | character can be used in place of
the & symbol. This is also called a logical "or" operation.
How many interviewees are in the new data frame?
792

The new data frame can be constructed with:
limited = subset(poll, Internet.Use == 1 | Smartphone == 1)
The number of rows can be computed with nrow(limited).
Important: For all remaining questions in this assignment please use the
limited data frame you created in Problem 2.3.
PROBLEM 3.1 - SUMMARIZING OPINIONS ABOUT
INTERNET PRIVACY

Which variables have missing values in the limited data frame? (Select
all that apply.)
Smartphone, Age, Conservativeness, Worry.About.Info, Privacy.Importance, Anonymity.Possible, Tried.Masking.Identity, Privacy.Laws.Effective - correct
Internet.Use
Smartphone
Sex
Age
State
Region
Conservativeness
Info.On.Internet
Worry.About.Info
Privacy.Importance
Anonymity.Possible
Tried.Masking.Identity
Privacy.Laws.Effective
EXPLANATION
You can read the number of missing values for each variable from
summary(limited)

What is the average number of pieces of personal information on the Internet, according to the Info.On.Internet variable?
3.795455

This can be obtained with mean(limited$Info.On.Internet) or summary(limited$Info.On.Internet).
How many interviewees reported a value of 0 for Info.On.Internet?

105

How many interviewees reported the maximum value of 11 for
Info.On.Internet?
8

EXPLANATION
These can be read from table(limited$Info.On.Internet)

What proportion of interviewees who answered the Worry.About.Info question worry about how much information is available about them on
the Internet? Note that to compute this proportion you will be dividing
by the number of people who answered the Worry.About.Info question,
not the total number of people in the data frame.
0.4886076

From table(limited$Worry.About.Info), we see that 386 of the interviewees worry about their info, and 404 do not. Therefore, there
were 386+404=790 people who answered the question, and the
proportion of them who worry about their info is 386/790=0.4886.

Note that we did not divide by 792 (the total number of people in the
data frame) to compute this proportion.
An easier way to compute this value is from the summary(limited)
output. The mean value of a variable that has values 1 and 0 will be
the proportion of the values that are a 1.
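In code, that shortcut looks like this (an optional sketch; na.rm=TRUE drops the interviewees who did not answer):
mean(limited$Worry.About.Info, na.rm = TRUE)   # proportion of respondents who worry, about 0.4886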
What proportion of interviewees who answered the Anonymity.Possible
question think it is possible to be completely anonymous on the Internet?
0.3691899

From table(limited$Anonymity.Possible), 278 respondents said anonymity is possible and 475 said it is not. Therefore, the desired
proportion is 278/(278+475)=0.3692. This can also be read from
summary(limited$Anonymity.Possible).
What proportion of interviewees who answered the
Tried.Masking.Identity question have tried masking their identity on the
Internet?
0.1632653

This can be computed with the command table(limited$Tried.Masking.Identity). The output tells us that of all
the respondents who answered the Tried.Masking.Identity question,
128 out of (128+656) have tried masking their identity on the internet.
What proportion of interviewees who answered the
Privacy.Laws.Effective question find United States privacy laws
effective?
0.2558459

We can find this number with the command table(limited$Privacy.Laws.Effective). The output tells us that 186 out
of (186+541) people who answered the Privacy.Laws.Effective
question find US privacy laws effective.
Often, we are interested in whether certain characteristics of
interviewees (e.g. their age or political opinions) affect their opinions on
the topic of the poll (in this case, opinions on privacy). In this section,
we will investigate the relationship between the characteristics Age and
Smartphone and outcome variables Info.On.Internet and
Tried.Masking.Identity, again using the limited data frame we built in an
earlier section of this problem.

Build a histogram of the age of interviewees. What is the best represented age group in the population?
People aged about 20 years old
People aged about 40 years old
People aged about 60 years old - correct
People aged about 80 years old
From hist(limited$Age), we see the histogram peaks at around 60
years old.
Both Age and Info.On.Internet are variables that take on many values, so
a good way to observe their relationship is through a graph. We learned
in lecture that we can plot Age against Info.On.Internet with the
command plot(limited$Age, limited$Info.On.Internet). However,
because Info.On.Internet takes on a small number of values, multiple
points can be plotted in exactly the same location on this graph.
What is the largest number of interviewees that have exactly the same
value in their Age variable AND the same value in their Info.On.Internet
variable? In other words, what is the largest number of overlapping
points in the plot plot(limited$Age, limited$Info.On.Internet)? (HINT:
Use the table function to compare the number of observations with
different values of Age and Info.On.Internet.)
6

By reviewing the output of table(limited$Age, limited$Info.On.Internet), we can see that there are 6 interviewees with age 53 and Info.On.Internet value 0, 6 with age 60 and Info.On.Internet value 0, and 6 with age 60 and Info.On.Internet value 1.
A more efficient way to have obtained the maximum number would
have been to run max(table(limited$Age, limited$Info.On.Internet))
To avoid points covering each other up, we can use the jitter() function
on the values we pass to the plot function. Experimenting with the
command jitter(c(1, 2, 3)), what appears to be the functionality of the
jitter command?
jitter randomly reorders the values passed to it, and two runs will yield the same result
jitter randomly reorders the values passed to it, and two runs will yield different results
jitter adds or subtracts a small amount of random noise to the values passed to it, and two runs will yield the same result
jitter adds or subtracts a small amount of random noise to the values passed to it, and two runs will yield different results - correct

By running the command jitter(c(1, 2, 3)) multiple times, we can see that the jitter function randomly adds or subtracts a small value from each number, and two runs will yield different results.
Now, plot Age against Info.On.Internet with plot(jitter(limited$Age),
jitter(limited$Info.On.Internet)). What relationship do you observe
between Age and Info.On.Internet?
Older age seems strongly associated with a larger value for Info.On.Internet
Older age seems moderately associated with a larger value for Info.On.Internet
Older age does not seem associated with a change in the value of Info.On.Internet
Older age seems moderately associated with a smaller value for Info.On.Internet - correct
Older age seems strongly associated with a smaller value for Info.On.Internet
For younger people aged 18-30, the average value of Info.On.Internet
appears to be roughly 5, while most people aged 60 and older have a
value less than 5. Therefore, older age appears to be associated with a
smaller value of Info.On.Internet, but from the spread of dots on the
image, it's clear the association is not particularly strong.

Use the tapply() function to obtain the summary of the Info.On.Internet value, broken down by whether an interviewee is a smartphone user.
What is the average Info.On.Internet value for smartphone users?
4.367556

What is the average Info.On.Internet value for non-smartphone users?
2.922807

The proper application of tapply here is:
tapply(limited$Info.On.Internet, limited$Smartphone, summary)
We can read the average for non-smartphone users from the summary output labeled with 0, and the average for smartphone users from the summary output labeled with 1.
Similarly use tapply to break down the Tried.Masking.Identity variable
for smartphone and non-smartphone users.
What proportion of smartphone users who answered the
Tried.Masking.Identity question have tried masking their identity when
using the Internet?

0.1925466

What proportion of non-smartphone users who answered the Tried.Masking.Identity question have tried masking their identity when
using the Internet?
0.1174377

We can get the breakdown for smartphone and non-smartphone users with:
tapply(limited$Tried.Masking.Identity, limited$Smartphone, table)
Among smartphone users, 93 tried masking their identity and 390 did
not, resulting in proportion 93/(93+390)=0.1925. Among non-smartphone users, 33 tried masking their identity and 248 did not,
resulting in proportion 33/(33+248)=0.1174.
This could have also been read from
tapply(limited$Tried.Masking.Identity, limited$Smartphone,
summary).
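The two proportions can also come straight out of tapply with mean (an optional variant, again relying on the mean of a 0/1 variable being the share of 1s):
tapply(limited$Tried.Masking.Identity, limited$Smartphone, mean, na.rm = TRUE)   # about 0.1174 for non-smartphone users, 0.1925 for smartphone users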

Next week, we will begin to more formally characterize how an outcome variable like Info.On.Internet can be predicted with a variable like Age
or Smartphone.
