We will be using RStudio in this course to carry out statistical procedures. RStudio uses the R programming
language, so in order to use RStudio, you must first download R. The same code can be run in R and RStudio, but
we will use RStudio because it tends to be easier to work with and has a nicer display than R.
3. Click on desktop.
4. On this page, under the open source edition, click on Download RStudio Desktop. This will take you to a
page with various platform installers.
5. Select the appropriate one for your operating system, and begin the installation process.
To check that you have correctly downloaded RStudio, open RStudio, and it should look like the following (the
display will be similar for all three operating systems).
Operating system     Command
Mac                  command+return
Windows or Linux     control+enter
(With any operating system, you can also copy and paste the code into the console.) This will send 2+2 to the
console and tell it to run. The answer will be returned in the console.
> 2-2
[1] 0
> 100/10
[1] 10
> sqrt(4)
[1] 2
> (200-5)*4+sin(2)
[1] 780.9093
Saving an Rscript
Suppose that you are interested in saving the rscript after typing these expressions. In order to do this, hit the
save button that is at the top of the rscript (not the one on the toolbar under the dropdown folder of code).
When you close out of RStudio, it will ask if you want to save your workspace. For our purposes in this course,
you should always say no. Instead, make sure to save the code you have typed in an rscript.
We can create vectors with more than one value by using the command c(), which stands for concatenate, to
create the list of values. For example, if we want to create a vector called red with the values 2, 3, 4, and 5 in it,
we could do so with the following code.
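The vector described above can be created along these lines. This is a minimal sketch; the call to length() is an addition for checking the result and is not part of the original example.

```r
# Create a vector named red holding the values 2, 3, 4, and 5
red <- c(2, 3, 4, 5)

# Typing the name of the vector prints its values in the console
red

# length() reports how many values the vector holds
length(red)
```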
We can give our vectors whatever name we like, but we cannot include spaces in their name. For example, the
following code will produce an error.
a vector <- c(1,2,3,4)
Instead, we need to remove the space, or we could put a period in place of the space.
avector <- c(1,2,3,4)
a.vector <- c(1,2,3,4)
It is also important to know that capitalization is recognized by R. We will find that the vector red that we created
before is not the same as Red.
We can perform mathematical operations with vectors of the same length. For example, if we create a new variable
called blue with four values in it, we can add blue to red, subtract blue from red, multiply red by blue, and
divide red by blue using the following code.
blue <- c(1,2,3,4)
red+blue
red-blue
red*blue
red/blue
Performing a mathematical operation on vectors creates a new vector, and we can assign a name to the new
vector. For example,
purple <- red+blue
We can ask RStudio to tell us the value in a certain position of a vector. For example, if we are interested
in determining the value in the third place of the vector purple, we can do so using brackets.
purple[3]
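As a sketch of what these commands return, the block below repeats the steps with the values of red assumed to be 2, 3, 4, and 5, as in the earlier description. The last line, which picks out several positions at once, is an extra illustration.

```r
red <- c(2, 3, 4, 5)    # assumed values, from the earlier description of red
blue <- c(1, 2, 3, 4)

# Operations are carried out position by position
purple <- red + blue
purple

# Brackets pick out a single position
purple[3]

# A vector of positions inside the brackets picks out several values at once
purple[c(1, 4)]
```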
We can also create vectors containing non-numerical values if we are working with a categorical variable. In order
for RStudio to recognize that the values in the vector are categories instead of numbers, we need to put quotation
marks around each value. For example, we may encounter a variable of yes and no responses. We can create such
a vector using the following code.
responses <- c("yes","no","no","yes")
Try creating vectors of the variables measurement and sex from the example dataset in class.
Subject  Treatment  Measurement  Sex  Age
1        drug       5            M    21
2        drug       8            M    25
3        drug       6            M    25
4        drug       14           F    21
5        drug       7            F    20
6        drug       6            M    20
7        placebo    6            F    30
8        placebo    2            F    28
9        placebo    5            M    22
10       placebo    3            M    22
11       placebo    3            M    24
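One possible way to enter these two variables is sketched below, with the values listed in subject order from 1 to 11. The vector names measurement and sex are a choice made for this illustration.

```r
# Numerical variable: no quotation marks
measurement <- c(5, 8, 6, 14, 7, 6, 6, 2, 5, 3, 3)

# Categorical variable: each value goes in quotation marks
sex <- c("M", "M", "M", "F", "F", "M", "F", "F", "M", "M", "M")

# Both vectors should have one entry per subject
length(measurement)
length(sex)

# Value recorded for subject 4
sex[4]
```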
In order to import the data into R so we can use it, we need to set our working directory to the location that
we have the data file saved in. To change your working directory, go to Session, then Set Working Directory, and
select Choose Directory.
This will open a popup. Find the folder that you have the data saved in, highlight it, and hit open. Now RStudio
will be able to access data files that you have saved in that folder.
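The working directory can also be set with code instead of the menu, using the base R functions getwd() and setwd(). The sketch below uses tempdir() only as a stand-in so that it runs on any computer; in practice you would put the path to your own data folder inside setwd().

```r
# Remember the current working directory so it can be restored afterwards
old_dir <- getwd()

# tempdir() is a placeholder for your own folder; replace it with the
# path to the folder that holds your data, written in quotation marks
setwd(tempdir())

# Confirm the change and list the files RStudio can now see
getwd()
list.files()

# Go back to the original working directory
setwd(old_dir)
```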
Next we can use the command read.csv() to load your data into R. To do this, type the name of the .csv file
inside the parentheses of read.csv() with quotation marks around the name. If the .csv file has the variable
columns labelled (as the document Cereals.csv does as shown in the image below), put a comma after the file name
and include header=TRUE. This tells RStudio to treat the first row of entries in the .csv document as the variable
names and the values in the second row as the start of the variable entries. If the columns are not labelled, then
include header=FALSE. Then RStudio will treat the entries in the first row as the start of the variable entries.
It is important that you name the data so you can work with it easily after it has been loaded into RStudio. Again,
we will use <- to name the data.
Thus, to upload the file Cereals.csv to RStudio and name it, use the following command. I have chosen to
call the data cereals.
cereals <- read.csv("Cereals.csv",header=TRUE)
If the file has uploaded correctly, you can type cereals into the console, and it will print the data.
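To see the effect of the header argument without needing Cereals.csv, the sketch below writes a tiny stand-in file and reads it both ways. The file contents, column names, and the object names toy and toy_no_header are all made up for this illustration.

```r
# Write a small stand-in .csv file (made-up values) to a temporary location
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Name,Shelf,Sugar",
             "A,top,3",
             "B,middle,12",
             "C,bottom,9"), csv_path)

# header=TRUE: the first row supplies the variable names
toy <- read.csv(csv_path, header = TRUE)
names(toy)
nrow(toy)

# header=FALSE: the first row is treated as data, and the columns
# receive the default names V1, V2, V3
toy_no_header <- read.csv(csv_path, header = FALSE)
names(toy_no_header)
nrow(toy_no_header)

# head() prints only the first few rows, which is handy for large files
head(toy)
```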
    Shelf       Sodiumgram          Proteingram
 bottom: 9   Min.   :0.000000   Min.   :0.03030
 middle:19   1st Qu.:0.003150   1st Qu.:0.03391
 top   :15   Median :0.004839   Median :0.06667
             Mean   :0.004634   Mean   :0.08162
             3rd Qu.:0.006481   3rd Qu.:0.09717
             Max.   :0.007407   Max.   :0.26667
0.004666667 0.005517241 0.004545455 0.007407407 0.003500000
0.005818182 0.006969697 0.006666667 0.005000000 0.004375000
0.001792453 0.002727273 0.007000000 0.004500000 0.004687500
0.005333333 0.004500000 0.005600000 0.006000000 0.004375000
0.003833333 0.005666667 0.002800000 0.000000000 0.006129032
0.007096774 0.004500000 0.004848485 0.000222222 0.000000000
If you want to work with the variables without typing as much, you can rename the variable. For example:
Sodiumgram <- cereals$Sodiumgram
Now typing Sodiumgram on its own prints the same output as cereals$Sodiumgram does.
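Function     Description
mean()       mean
median()     median
sd()         standard deviation
var()        variance
quantile()   quantiles
range()      minimum and maximum
min()        minimum
max()        maximum
summary()    minimum, quartiles, median, mean, and maximum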
In order to use the functions on a variable, type the name of the variable inside the parentheses. For example, if
I want to compute the mean of Sodiumgram, I will type
mean(cereals$Sodiumgram)
or
mean(Sodiumgram)
since I renamed the variable.
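As a small worked example that does not require the cereals file, the sketch below applies several of these functions to the measurement values from the class example dataset. Entering them as a vector named measurement is an assumption made for this illustration.

```r
# Measurement values for subjects 1 to 11 in the class example dataset
measurement <- c(5, 8, 6, 14, 7, 6, 6, 2, 5, 3, 3)

mean(measurement)      # sum of the values (65) divided by 11
median(measurement)    # middle value once the data are sorted
range(measurement)     # smallest and largest values
sd(measurement)        # standard deviation
var(measurement)       # variance, the square of the standard deviation
quantile(measurement)  # minimum, quartiles, and maximum
summary(measurement)   # several of the above in one display
```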
Tables
You can create tables in RStudio with categorical variables using the command table(). For example, to create
a frequency table of the different levels of the variable age in the dataset cereals, use the command
table(cereals$Age)
This produces the following output. We see that 17 of the cereals are considered adult cereals, and 26 are considered
children's cereals.
   adult children
      17       26
To create a two-way table, include the two variables of interest inside the table() command with a comma
between the variables. For example,
table(cereals$Age,cereals$Shelf)
creates a two-way frequency table of the variables age and shelf.
           bottom middle top
  adult         2      1  14
  children      7     18   1
If you are interested in creating a relative frequency table with proportions (instead of just the frequencies), first
create a table of the desired variable(s), and give it a name. Then divide the table by the summation of the entries
in the table. For example, to create a relative frequency table of the variable age, use the following command.
tab <- table(cereals$Age)
tab/sum(tab)
It produces this output.
adult children
0.3953488 0.6046512
Similarly, we can also use the same procedure for the two-way table.
tab2 <- table(cereals$Age,cereals$Shelf)
tab2/sum(tab2)
The output is as follows.
               bottom     middle        top
  adult    0.04651163 0.02325581 0.32558140
  children 0.16279070 0.41860465 0.02325581
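The same steps can be tried on the class example dataset. The sketch below assumes the variables sex and treatment have been entered as vectors in subject order. It also uses the base R function prop.table(), which is not mentioned in this handout but gives the same proportions as dividing by the sum.

```r
# Sex and treatment for subjects 1 to 11 in the class example dataset
sex <- c("M", "M", "M", "F", "F", "M", "F", "F", "M", "M", "M")
treatment <- c(rep("drug", 6), rep("placebo", 5))

# One-way and two-way frequency tables
table(sex)
tab2 <- table(sex, treatment)
tab2

# Relative frequencies: divide by the total number of entries
tab2 / sum(tab2)

# prop.table() is a built-in shortcut for the same calculation
prop.table(tab2)
```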
Graphing
There are three basic types of plotting functions in R: high-level plot functions, low-level plot functions, and
the layout command par (which will not be discussed in this handout). A high-level plot function creates a
complete plot, and a low-level plot function adds to an existing plot, that is, one created by a high-level plot
function.
Command      Plot
barplot()    bar plot
hist()       histogram
boxplot()    boxplot
Suppose we want to create a barplot of the age levels in the cereal dataset. We can do this using the following
command.
barplot(table(cereals$Age))
Note that you must use the table() command inside the command barplot() when creating a barplot if the
categorical variable is a vector, as cereals$Age is. The plot will appear in the lower right-hand corner of RStudio
as follows.
If we want to make a histogram and boxplot of the variable Sodiumgram, we can apply the commands to the
variable, and the graphs below are created.
hist(cereals$Sodiumgram)
boxplot(cereals$Sodiumgram)
[Figure: boxplot of cereals$Sodiumgram, and "Histogram of cereals$Sodiumgram" with x-axis cereals$Sodiumgram and y-axis Frequency]
These plots are fairly basic (and are lacking good titles and labels). We can add additional arguments to the code
to enhance the plots. The table below includes some options.
Option   Description
col      color: col="red", col="blue", ...
xlim     x-axis limits: xlim=c(min,max)
ylim     y-axis limits: ylim=c(min,max)
xlab     x-axis label: xlab="my label"
ylab     y-axis label: ylab="my label"
main     main title
sub      subtitle
For example, we can improve the boxplot by adjusting the code as follows. Note that I have included an
additional option specific to boxplots: horizontal=TRUE. This causes the boxplot to be horizontal instead of
the default of vertical.
boxplot(cereals$Sodiumgram,col="grey",ylim=c(-0.0005,0.009),xlab="Sodium (grams)",
main="Boxplot of Amount of Sodium in Sampled Cereals", horizontal=TRUE)
[Figure: "Boxplot of Amount of Sodium in Sampled Cereals", a horizontal boxplot with x-axis Sodium (grams)]
Additionally, we can easily create side-by-side boxplots in RStudio. For example, if we want to compare the
sodium amounts of cereals by the age group the cereal is marketed to, we can use the following command. Note
that we do this by including a ~ after the quantitative variable and then listing the categorical variable of
interest. The code also includes names=c("Adult","Children"), which allows us to adjust the category names.
boxplot(cereals$Sodiumgram~cereals$Age,col="grey",ylim=c(-0.0005,0.009),xlab="Sodium (grams)",
ylab="Age", main="Boxplots of Amount of Sodium in Sampled Cereals by Age",
names=c("Adult","Children"),horizontal=TRUE)
[Figure: side-by-side horizontal boxplots for Adult and Children, with x-axis Sodium (grams) and y-axis Age]
[Figure: histogram with x-axis Sodium (grams) and y-axis Frequency; the dashed line represents the mean amount of sodium]
Other options for the line are listed in the table below.
Option   Description
lty      line type: lty=1, lty=2, ...
lwd      line thickness: lwd=1, lwd=2, ...
col      color: col="red", col="blue", ...
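A line like the dashed one in the histogram can be added with a low-level plot function. The sketch below is one way to do it, assuming the base R function abline() is used and applying it to the measurement values from the class example dataset instead of the cereals data. The title, label, and colors are choices made for this illustration.

```r
# Measurement values for subjects 1 to 11 in the class example dataset
measurement <- c(5, 8, 6, 14, 7, 6, 6, 2, 5, 3, 3)

# High-level function: draws the complete histogram
h <- hist(measurement, col = "grey", xlab = "Measurement",
          main = "Histogram of Measurement")

# Low-level function: adds a vertical line at the mean to the existing plot
# lty=2 gives a dashed line, lwd=2 makes it thicker, col sets its color
abline(v = mean(measurement), lty = 2, lwd = 2, col = "red")
```

The histogram must be drawn first; running abline() without an existing plot produces an error.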
Saving Plots
There are several ways to save a plot in RStudio. I will only discuss one way. Once the plot has been created, click
on the export button above the plot, and select either Save as Image... or Save as PDF... depending on which you
would prefer. A window will appear that will allow you to name the plot, choose where you want to store it, and
save it.
If you type # at the beginning of a line of code, RStudio will treat the typing after # as a comment. It can
be a good idea to include comments in rscripts as a title, to explain what is happening in the code, etc.
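A short sketch of a commented script is shown below. The vector x and its values are made up for this illustration. Note that a comment can also follow code on the same line.

```r
# Title: practice with vectors
# The next line stores four values in a vector called x
x <- c(1, 2, 3, 4)

mean(x)  # everything after the # on this line is ignored by R

# mean(x + 100)   this whole line is a comment, so it is not run
```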