
Introduction to R and RStudio

We will be using RStudio in this course to carry out statistical procedures. RStudio uses the coding language
R. Thus, in order to use RStudio, you must first download R. The same code can be run in R and RStudio, but
we will use RStudio because it tends to be easier to work with and has a nicer display than R.

Lawrence University RStudio Server


The first option for using RStudio is to use the Lawrence University RStudio server. This can be accessed at
https://rstudio.lawrence.edu. Use your Lawrence email username and password to log in. This site can only be
accessed on the Lawrence campus.

Downloading R and RStudio


The second option for using RStudio is to download it. This will allow you to use RStudio when you are not on
campus and when you are no longer at Lawrence.
The first step in the process is to download R. Follow these instructions to do so.
1. Go to https://www.r-project.org.
2. Click on the link that says download R. This will take you to a page of CRAN Mirrors from around the
world.
3. Scroll down until you reach the USA mirrors.
4. Click on any of the options under USA, such as http://cran.cnr.Berkeley.edu/. This will direct you to another
page, where you will have the option of downloading R for Linux, Mac, or Windows.
5. Select the appropriate link for your computer's operating system, and follow the instructions for downloading the latest version of R. (At the time of writing, the most current version is R 3.2.1, "World-Famous Astronaut", released on 2015/06/18.)
To check that you have correctly downloaded R, open R, and it should look like one of the following displays based
on your operating system (Linux will appear similar to Windows).

Next you will download RStudio. Follow these instructions to do so.


1. Go to https://www.rstudio.com.
2. Place your cursor over products and click on RStudio. This will take you to a page with the options of either
selecting desktop or server.
3. Click on desktop.
4. On this page, under the open source edition, click on Download RStudio Desktop. This will take you to a
page with various platform installers.
5. Select the appropriate one for your operating system, and begin the installation process.
To check that you have correctly downloaded RStudio, open RStudio, and it should look like the following (the
display will be similar for all three operating systems).

Becoming Familiar with RStudio


The display of RStudio is broken into four sections as seen in the image below.
On the top left is an rscript. This is where you will type your code and is a document that can be saved
when you are done working, so you can return to it at a later time.
Below the rscript, in the bottom left corner, is the console. This is where the code is run and where results
will be produced.
To the right of the console, in the bottom right corner, is the space where figures will be displayed when you
create them. It is also where files can be accessed, help information is displayed, and packages can be loaded
(but we will not discuss packages now and maybe not at all in this course).
The top right area displays your environment, which includes current data sets you have loaded, variables
you have created, and so forth. (Since you have not worked with RStudio yet, there should be nothing in
your environment.)

Using RStudio as a Calculator and Working with an Rscript


Opening a New Rscript
If there is not an rscript already open, click on the sheet with a white plus sign inside a green circle at the top left
of RStudio, and select R Script from the dropdown menu. This will provide you with an untitled blank rscript.

Some Simple Computations


We will first use RStudio as a calculator. In the blank rscript, type the following:
2+2
In order to perform the computation, place your cursor on the same line as 2+2. Depending on whether you are
using a Mac or PC, use the following command.
Operating System    Command
Mac                 command+return
PC                  control+enter

(With any operating system, you can also copy and paste the code into the console.) This will send 2+2 to the
console and tell it to run. The answer will be returned in the console.

Try using RStudio to compute the following expressions.


2-2
100/10
sqrt(4)
(200-5)*4+sin(2)
The output should appear as follows.

> 2-2
[1] 0
> 100/10
[1] 10
> sqrt(4)
[1] 2
> (200-5)*4+sin(2)
[1] 780.9093

Saving an Rscript
Suppose that you are interested in saving the rscript after typing these expressions. In order to do this, hit the
save button that is at the top of the rscript (not the one on the toolbar under the dropdown folder of code).

When you close out of RStudio, it will ask if you want to save your workspace. For our purposes in this course,
you should always say no. Instead, make sure to save the code you have typed in an rscript.

Variables and Vectors


When working with data, we find ourselves working with variables that have a list of values collected in the sample. One way to work with these variables in R is to create a vector. In R, a vector is an object
with a list of values assigned to it. There is a lot that can be done with vectors, but the following section gives
a brief introduction to them.
We can make vectors by using <- to assign values to a name. Suppose that we want to create a vector called
yellow with only the value 1 assigned to it. We can do this by typing the following code in the rscript and running
it in the console.
yellow <- 1
Now, when I type yellow in the console and hit enter, it will output the number 1. (Note that we can leave out
the spaces, and the command will work in the same way. yellow<-1)

We can create vectors with more than one value by using the command c(), which stands for concatenate, to
create the list of values. For example, if we want to create a vector called red with the values 2, 3, 4, and 5 in it,
we could do so with the following code.
red <- c(2,3,4,5)


Then, when red is run in the console, the list of numbers assigned to it will be returned.

We can give our vectors whatever name we like, but we cannot include spaces in their name. For example, the
following code will produce an error.
a vector <- c(1,2,3,4)
Instead, we need to remove the space, or we could put a period in place of the space.
avector <- c(1,2,3,4)
a.vector <- c(1,2,3,4)

It is also important to know that capitalization is recognized by R. We will find that the vector red that we created
before is not the same as Red.
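
As a small sketch of this point, the code below checks whether each name is defined. The function exists() is a base R function that is not otherwise used in this handout; it is included here only for illustration.

```r
# R treats names that differ only in capitalization as different objects.
red <- c(2, 3, 4, 5)

# The name red (lowercase) has been created, so this returns TRUE.
exists("red")

# The name Red (capital R) has not been created, so this returns FALSE
# (unless an object called Red already exists in your session).
exists("Red")
```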

We can perform mathematical operations with vectors of the same length. For example, if we create a new variable
called blue with four values in it, we can add, subtract, multiply, and divide red with blue using the following
code.
blue <- c(1,2,3,4)
red+blue
red-blue
red*blue
red/blue

When the code is run in the console, it appears as follows.
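
Each operation is applied element by element: the first value of red is combined with the first value of blue, the second with the second, and so on. A sketch of what to expect, with the results written as comments:

```r
red <- c(2, 3, 4, 5)
blue <- c(1, 2, 3, 4)

red + blue   # 3 5 7 9
red - blue   # 1 1 1 1
red * blue   # 2 6 12 20
red / blue   # 2.000000 1.500000 1.333333 1.250000
```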

Performing a mathematical operation on vectors creates a new vector. We can assign a name to
the new vector. For example,
purple <- red+blue

We can ask RStudio to tell us the value in a certain position of a vector. For example, if we are interested
in the value in the third place of the vector purple, we can find it using brackets.
purple[3]
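
The following sketch rebuilds purple from red and blue so that it runs on its own. The last two lines go slightly beyond the text above: brackets also accept a vector of positions, and length() (a base R function) reports how many values a vector holds.

```r
red <- c(2, 3, 4, 5)
blue <- c(1, 2, 3, 4)
purple <- red + blue   # 3 5 7 9

purple[3]         # the third value: 7
purple[c(1, 4)]   # the first and fourth values: 3 9
length(purple)    # the number of values in the vector: 4
```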

We can also create vectors containing non-numerical values if we are working with a categorical variable. In order
for RStudio to recognize that the values in the vector are categories instead of numbers, we need to put quotation
marks around each value. For example, we may encounter a variable of yes and no responses. We can create such
a vector using the following code.
responses <- c("yes","no","no","yes")

Try creating vectors of the variables measurement and sex from the example dataset in class.
Subject  Treatment  Measurement  Sex  Age
1        drug       5            M    21
2        drug       8            M    25
3        drug       6            M    25
4        drug       14           F    21
5        drug       7            F    20
6        drug       6            M    20
7        placebo    6            F    30
8        placebo    2            F    28
9        placebo    5            M    22
10       placebo    3            M    22
11       placebo    3            M    24

The code and output should look as follows.


> measurement <- c(5,8,6,14,7,6,6,2,5,3,3)
> measurement
[1] 5 8 6 14 7 6 6 2 5 3 3
> sex <- c("M","M","M","F","F","M","F","F","M","M","M")
> sex
[1] "M" "M" "M" "F" "F" "M" "F" "F" "M" "M" "M"

Creating a dataset in RStudio


If we are working with a small dataset, it can be useful to simply type the data into RStudio. We just saw how to
create a single vector in RStudio, but in order for RStudio to recognize them as a dataset, we use the command
data.frame. For example, if we want RStudio to recognize the example dataset from class, we can use data.frame
in the following way. Note that we use <- to assign a name to the dataset, but inside the command data.frame,
we have to use = to create the variables. Also, the way I have formatted the code causes it to take up multiple
lines in an rscript. When transferring the code from the rscript to the console, send the first line to the console.
When you do this, the symbol + appears at the command line instead of >. RStudio has recognized that the
code entered is incomplete, and it is waiting for additional code before running the code entered. Thus, move the
cursor to the next line of code in the rscript (it may do this automatically for you), and send this line of code to
the console. Continue to do this until the whole chunk of code is in the console. To check if the dataset has been
entered correctly, type data into the console, and the data frame will print as seen in the image below.
data <- data.frame(treatment=c("drug","drug","drug","drug","drug","drug",
"placebo","placebo","placebo","placebo","placebo"),
measurement=c(5,8,6,14,7,6,6,2,5,3,3),
sex=c("M","M","M","F","F","M","F","F","M","M","M"),
age=c(21,25,25,21,20,20,30,28,22,22,24))
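
If the vectors already exist, the same dataset can be built from them instead of retyping the values. This is only a sketch: the name class_data is my own choice, and rep(), a base R function that repeats a value, is used as a shortcut that the handout does not otherwise cover.

```r
# Vectors for each variable in the example dataset from class.
treatment <- c(rep("drug", 6), rep("placebo", 5))   # "drug" six times, "placebo" five times
measurement <- c(5, 8, 6, 14, 7, 6, 6, 2, 5, 3, 3)
sex <- c("M", "M", "M", "F", "F", "M", "F", "F", "M", "M", "M")
age <- c(21, 25, 25, 21, 20, 20, 30, 28, 22, 22, 24)

# Inside data.frame(), = pairs each variable name with its vector.
class_data <- data.frame(treatment = treatment,
                         measurement = measurement,
                         sex = sex,
                         age = age)

nrow(class_data)    # number of rows (subjects): 11
names(class_data)   # "treatment" "measurement" "sex" "age"
```

Every vector must have the same length (here, 11), or data.frame() will report an error.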

Uploading a .csv Dataset


We can also upload a dataset to RStudio that is stored on our computer in a different type of document, such as
a .txt file, an Excel document, or a .csv file. However, we are only going to work with .csv files for now.
I have uploaded a dataset called Cereals.csv to Moodle. Download the dataset, and save it to a folder on your
computer that you can easily find again. (I would recommend creating a new folder for this course.)
If you are using the RStudio server, you get a small amount of storage space on the server, and you will need to
upload the dataset to your storage. To do this, I recommend creating a new folder on the server. Click on the
New Folder button shown in the picture below. I created the folder called data. Then go into your new folder,
click on the upload button seen in the picture below, and upload Cereals.csv to your new folder. If you are not
using the server, ignore this paragraph.

In order to import the data into R so we can use it, we need to set our working directory to the location that
we have the data file saved in. To change your working directory, go to Session, then Set Working Directory, and
select Choose Directory.

This will open a popup. Find the folder that you have the data saved in, highlight it, and hit open. Now RStudio
will be able to access data files that you have saved in that folder.
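
The working directory can also be set with code using the base R functions getwd() and setwd(). The sketch below uses tempdir() only as a stand-in folder so that it runs on any computer; in practice you would replace it with the path to your own course folder.

```r
# Show the current working directory.
getwd()

# Set the working directory with code instead of the menu.
# tempdir() is a placeholder; use the path to your own folder instead.
# setwd() returns the previous working directory, which we keep in old_dir.
old_dir <- setwd(tempdir())

# List the files that RStudio can now see in that folder.
list.files()

# Return to the original working directory.
setwd(old_dir)
```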

Next we can use the command read.csv() to load your data into R. To do this, type the name of the .csv file
inside the parentheses of read.csv() with quotation marks around the name. If the .csv file has the variable
columns labelled (as the document Cereals.csv does as shown in the image below), put a comma after the file name
and include header=TRUE. This tells RStudio to treat the first row of entries in the .csv document as the variable
names and the values in the second row as the start of the variable entries. If the columns are not labelled, then
include header=FALSE. Then RStudio will treat the entries in the first row as the start of the variable entries.

It is important that you name the data so you can work with it easily after it has been loaded into RStudio. Again,
we will use <- to name the data.
Thus, to upload the file Cereals.csv to RStudio and name it, use the following command. I have chosen to
call the data cereals.
cereals <- read.csv("Cereals.csv",header=TRUE)
If the file has uploaded correctly, you can type cereals into the console, and it will print the data.
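
To try read.csv() without Cereals.csv, the sketch below first writes a tiny stand-in file and then reads it back. The file name example.csv and its three rows are made up for illustration. I have also included stringsAsFactors=TRUE as an assumption about your setup: in R 4.0.0 and later, read.csv() no longer converts text columns to factors by default, so this argument is needed to get the Factor variables that appear in the output later in this handout.

```r
# Write a small made-up .csv file to a temporary folder.
csv_path <- file.path(tempdir(), "example.csv")
writeLines(c("ID,Age,Shelf",
             "1,adult,bottom",
             "2,children,middle",
             "3,children,top"),
           csv_path)

# header=TRUE: the first row holds the variable names.
# stringsAsFactors=TRUE: treat text columns as categorical (factor) variables.
example <- read.csv(csv_path, header = TRUE, stringsAsFactors = TRUE)

names(example)   # "ID" "Age" "Shelf"
nrow(example)    # 3
str(example)     # structure of the data
```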

Working with a Dataset and Some Simple Computations


Exploring the Uploaded Dataset
Once your data is uploaded to RStudio, it is always a good idea to get a general understanding of the dataset.
The following commands are helpful.
Command      Description
str()        Displays the general structure of the data
summary()    Gives some summary statistics for each of the variables in the data
head()       Displays the top several rows of the data
names()      Lists the names of the variables in the dataset

For example, the command


summary(cereals)
produces the following output.
> summary(cereals)
       ID             Age        Shelf      Sodiumgram         Proteingram
 Min.   : 1.0   adult   :17   bottom: 9   Min.   :0.000000   Min.   :0.03030
 1st Qu.:11.5   children:26   middle:19   1st Qu.:0.003150   1st Qu.:0.03391
 Median :22.0                 top   :15   Median :0.004839   Median :0.06667
 Mean   :22.0                             Mean   :0.004634   Mean   :0.08162
 3rd Qu.:32.5                             3rd Qu.:0.006481   3rd Qu.:0.09717
 Max.   :43.0                             Max.   :0.007407   Max.   :0.26667

Accessing Variables in the Dataset


As mentioned above, the command str() displays the general structure of the data. For example, we obtain the
following output if we enter str(cereals) into the console.
'data.frame':   43 obs. of  5 variables:
 $ ID         : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Age        : Factor w/ 2 levels "adult","children": 1 2 2 2 1 2 2 2 2 2 ...
 $ Shelf      : Factor w/ 3 levels "bottom","middle",..: 1 1 1 1 1 1 1 1 1 2 ...
 $ Sodiumgram : num  0.007 0.00667 0.00467 0.00697 0.007 ...
 $ Proteingram: num  0.1 0.0667 0.0333 0.0303 0.1 ...
It lists the variables in the dataset and what type of variable they are. Before each variable is a $, which is known
as the extract operator. Thus, in order to access a variable in the dataset, we must first type the name of the
dataset, a $, and then the name of the variable in the dataset. For example, if we want to print just the variable
Sodiumgram from the dataset cereals, we would type:
cereals$Sodiumgram
The output from this command is as follows. (Note that cereals$Sodiumgram is a vector.)
> cereals$Sodiumgram
 [1] 0.007000000 0.006666667 0.004666667 0.006969697 0.007000000 0.006000000 0.006129032
 [8] 0.004838710 0.001851852 0.005517241 0.006666667 0.004500000 0.004375000 0.007096774
[15] 0.007000000 0.006785714 0.004545455 0.005000000 0.004687500 0.003833333 0.004500000
[22] 0.006666667 0.006296296 0.007407407 0.004375000 0.005333333 0.005666667 0.004848485
[29] 0.002200000 0.007000000 0.003500000 0.001792453 0.004500000 0.002800000 0.000222222
[36] 0.001634615 0.002800000 0.005818182 0.002727273 0.005600000 0.000000000 0.000000000
[43] 0.002452830

If you want to work with the variables without typing as much, you can assign the variable to a shorter name. For example:
Sodiumgram <- cereals$Sodiumgram
Now, typing Sodiumgram prints the same output as cereals$Sodiumgram does.
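
One detail worth knowing is that the shorter name holds a copy of the variable, not a link to the dataset. The sketch below shows this with part of the example dataset from class; the name example_data is my own choice for illustration.

```r
# A small data frame with two variables from the class example.
example_data <- data.frame(measurement = c(5, 8, 6, 14, 7, 6, 6, 2, 5, 3, 3),
                           age = c(21, 25, 25, 21, 20, 20, 30, 28, 22, 22, 24))

# Extract one variable with $ and store it under a shorter name.
measurement <- example_data$measurement

# Changing the copy does not change the variable inside the dataset.
measurement[1] <- 100
measurement[1]                # 100
example_data$measurement[1]   # still 5
```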

Performing Commands on the Uploaded Dataset


We can use RStudio to easily perform many of the computations we discussed in class on quantitative variables
in a dataset (and vectors in general). The table below contains the functions for the computations.
Computation            Function
mean                   mean()
median                 median()
standard deviation     sd()
variance               var()
quantiles              quantile()
range                  range()
minimum                min()
maximum                max()
five number summary    summary()

In order to use the functions on a variable, type the name of the variable inside the parentheses. For example, if
I want to compute the mean of Sodiumgram, I will type
mean(cereals$Sodiumgram)
or
mean(Sodiumgram)
since I renamed the variable.
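
The same functions work on any numeric vector. As a sketch, here they are applied to the measurement variable from the example dataset in class, so the code runs without Cereals.csv.

```r
measurement <- c(5, 8, 6, 14, 7, 6, 6, 2, 5, 3, 3)

mean(measurement)       # 65/11, which is about 5.909
median(measurement)     # 6
range(measurement)      # 2 14 (the minimum and the maximum)
sd(measurement)         # standard deviation
var(measurement)        # variance
quantile(measurement)   # 0%, 25%, 50%, 75%, and 100% quantiles
summary(measurement)    # minimum, quartiles, mean, and maximum together
```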

Tables
You can create tables in RStudio with categorical variables using the command table(). For example, to create
a frequency table of the different levels of the variable age in the dataset cereals, use the command
table(cereals$Age)
This produces the following output. We see that 17 of the cereals are considered adult cereals, and 26 are considered
children's cereals.
   adult children
      17       26
To create a two-way table, include the two variables of interest inside the table() command with a comma
between the variables. For example,
table(cereals$Age,cereals$Shelf)
creates a two-way frequency table of the variables age and shelf.

           bottom middle top
  adult         2      1  14
  children      7     18   1

If you are interested in creating a relative frequency table with proportions (instead of just the frequencies), first
create a table of the desired variable(s), and give it a name. Then divide the table by the summation of the entries
in the table. For example, to create a relative frequency table of the variable age, use the following command.
tab <- table(cereals$Age)
tab/sum(tab)
It produces this output.
    adult  children
0.3953488 0.6046512
Similarly, we can also use the same procedure for the two-way table.
tab2 <- table(cereals$Age,cereals$Shelf)
tab2/sum(tab2)
The output is as follows.
               bottom     middle        top
  adult    0.04651163 0.02325581 0.32558140
  children 0.16279070 0.41860465 0.02325581
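
Base R also has a function, prop.table(), that performs this division for you. It is not the method used in this handout, but as a sketch, the code below shows that both approaches give the same proportions, using the sex and treatment variables from the example dataset in class.

```r
sex <- c("M", "M", "M", "F", "F", "M", "F", "F", "M", "M", "M")
treatment <- c(rep("drug", 6), rep("placebo", 5))   # rep() repeats a value

# One-way table: 4 F and 7 M.
tab <- table(sex)
tab / sum(tab)     # the approach described above
prop.table(tab)    # shortcut that gives the same proportions

# Two-way table of proportions.
tab2 <- table(treatment, sex)
prop.table(tab2)
```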

Graphing
There are three basic types of plotting functions in RStudio: high-level plot functions, low-level plot functions, and the layout command par
(which will not be discussed in this handout). Basically, a high-level plot function creates a complete plot, and a
low-level plot function adds to an existing plot, that is, one created by a high-level plot function.

High-Level Plot Functions


There are many plots that can be created in RStudio, but for now, we will just focus on three: barplots, histograms,
and boxplots. The plot functions for these are included in the table below.
Graph        Command
barplot      barplot()
histogram    hist()
boxplot      boxplot()

Suppose we want to create a barplot of the age levels in the cereal dataset. We can do this using the following
command.
barplot(table(cereals$Age))
Note that you must use the table() command inside the command barplot() when creating a barplot if the
categorical variable is a vector, as cereals$Age is. The plot will appear in the lower right-hand corner of RStudio
as follows.
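
The reason table() is needed is that barplot() draws one bar per count, so it must be given the counts rather than the raw categories. A sketch with the sex variable from the example dataset in class (the labels and title are my own choices):

```r
sex <- c("M", "M", "M", "F", "F", "M", "F", "F", "M", "M", "M")

# table() turns the raw categories into counts: 4 F and 7 M.
counts <- table(sex)
counts

# barplot() draws one bar for each count.
barplot(counts, xlab = "Sex", ylab = "Frequency",
        main = "Barplot of Sex in the Example Dataset")
```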

If we want to make a histogram and boxplot of the variable Sodiumgram, we can apply the commands to the
variable, and the graphs below are created.
hist(cereals$Sodiumgram)
boxplot(cereals$Sodiumgram)

[Figure: "Histogram of cereals$Sodiumgram" (x-axis: cereals$Sodiumgram, y-axis: Frequency) and the boxplot of cereals$Sodiumgram]

These plots are fairly basic (and are lacking good titles and labels). We can add additional arguments to the code
to enhance the plots. The table below includes some options.
Option    Description
col       color: col="red", col="blue", ...
xlim      x-axis limits: xlim=c(min,max)
ylim      y-axis limits: ylim=c(min,max)
xlab      x-axis label: xlab="my label"
ylab      y-axis label: ylab="my label"
main      main title
sub       subtitle

For example, we can improve the boxplot by adjusting the code in a way such as the following. Note that I
have included an additional option specific to boxplots: horizontal=TRUE. This causes the boxplot to be
horizontal instead of the default of vertical.
boxplot(cereals$Sodiumgram,col="grey",ylim=c(-0.0005,0.009),xlab="Sodium (grams)",
main="Boxplot of Amount of Sodium in Sampled Cereals", horizontal=TRUE)

[Figure: "Boxplot of Amount of Sodium in Sampled Cereals" (horizontal boxplot; x-axis: Sodium (grams))]

Additionally, we can easily create side-by-side boxplots in RStudio. For example, if we want to compare the
sodium amounts of cereals by the age group the cereal is marketed to, we can use the following command. Note
that we do this by including a ~ after the quantitative variable and then listing the categorical variable of interest.
The code also includes names=c("Adult","Children"), which allows us to adjust the category names.
boxplot(cereals$Sodiumgram~cereals$Age,col="grey",ylim=c(-0.0005,0.009),xlab="Sodium (grams)",
ylab="Age", main="Boxplots of Amount of Sodium in Sampled Cereals by Age",
names=c("Adult","Children"),horizontal=TRUE)
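
The ~ notation can also be combined with the data argument of boxplot(), so that the dataset name does not need to be repeated before each variable. This is an alternative to the command above, not a replacement for it. The sketch uses the example dataset from class so that it runs without Cereals.csv; the name class_data and the labels are my own choices.

```r
class_data <- data.frame(
  treatment = c(rep("drug", 6), rep("placebo", 5)),   # rep() repeats a value
  measurement = c(5, 8, 6, 14, 7, 6, 6, 2, 5, 3, 3)
)

# quantitative variable ~ categorical variable, with both looked up in class_data
boxplot(measurement ~ treatment, data = class_data, col = "grey",
        xlab = "Measurement", ylab = "Treatment",
        main = "Boxplots of Measurement by Treatment",
        horizontal = TRUE)
```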

[Figure: "Boxplots of Amount of Sodium in Sampled Cereals by Age" (horizontal boxplots; x-axis: Sodium (grams), y-axis: Age, with categories Adult and Children)]

Low-Level Plot Functions


Low-level plot functions can be executed only after a high-level plot has been created. There are lots of options
for this, but for now, I will only include the command abline(), which adds a line to a plot after it has been
created. This can be useful for displaying a statistic on the plot. For example, the following code places a dashed
line on the histogram of cereal sodium amounts at the mean sodium amount. Note that the v causes the line at
the mean to be vertical. If you require a horizontal line, use h instead. Also, lty indicates the type of line to be
used. Try using other numbers to view additional line types.
hist(cereals$Sodiumgram,xlab="Sodium (grams)",
main="Histogram of Amount of Sodium in Sampled Cereals",
sub="(dashed line represents mean amount of sodium)")
abline(v=mean(cereals$Sodiumgram),lty=2)

[Figure: "Histogram of Amount of Sodium in Sampled Cereals" (x-axis: Sodium (grams), y-axis: Frequency; subtitle: (dashed line represents mean amount of sodium))]

Other options for the line are listed in the table below.
Option    Description
lty       line type (lty=1, 2, ...)
lwd       line thickness (lwd=1, 2, ...)
col       color (col="red", "blue", ...)
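
As a sketch of how these options combine, the code below adds two lines to a histogram of the measurement variable from the example dataset in class: a dashed line at the mean and a thicker, dotted, blue line at the median. The choice of statistics and styles is only for illustration.

```r
measurement <- c(5, 8, 6, 14, 7, 6, 6, 2, 5, 3, 3)

# High-level plot: creates the histogram.
hist(measurement, xlab = "Measurement", main = "Histogram of Measurement")

# Low-level plots: each abline() adds a vertical line to the existing histogram.
abline(v = mean(measurement), lty = 2)                              # dashed line at the mean
abline(v = median(measurement), lty = 3, lwd = 2, col = "blue")     # dotted, thicker, blue line at the median
```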

Saving Plots
There are several ways to save a plot in RStudio. I will only discuss one way. Once the plot has been created, click
on the export button above the plot, and select either Save as Image... or Save as PDF... depending on which you
would prefer. A window will appear that will allow you to name the plot, choose where you want to store it, and
save it.
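
A plot can also be saved with code, which is one of the other ways not covered above. The sketch below uses the base R functions pdf() and dev.off(). The file name measurement_hist.pdf is made up, and tempdir() is only a stand-in folder so that the code runs on any computer.

```r
measurement <- c(5, 8, 6, 14, 7, 6, 6, 2, 5, 3, 3)

# Choose where to save the plot (replace tempdir() with your own folder).
plot_path <- file.path(tempdir(), "measurement_hist.pdf")

pdf(plot_path)      # start sending plots to the PDF file
hist(measurement, xlab = "Measurement", main = "Histogram of Measurement")
dev.off()           # close the file so that it is written to disk

file.exists(plot_path)   # TRUE if the file was saved
```

While the PDF file is open, plots do not appear in the lower right corner of RStudio; they are drawn in the file instead.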

General Tips for Using RStudio


There are often many ways to perform the same action in RStudio. If you find a way that you prefer, feel
free to use it.
When you are first learning RStudio (and even when you have worked with it for a long time), errors often
are the result of a missing comma, a misspelled word, the wrong capitalization, etc. Be precise and patient
when typing your code.
Type colors() in the console to see the list of colors available for the plotting commands.
If you place your cursor at the command line in the console and hit the up arrow, previously entered code
will appear, which can be helpful.
If you are ever looking for help with a command in R, type the command with a question mark before it into
the console. For example, ?mean. Then information will appear in the bottom right corner of the window.
The internet is also a great resource for help with R and RStudio.

If you type # at the beginning of a line of code, RStudio will treat the typing after # as a comment. It can
be a good idea to include comments in rscripts as a title, to explain what is happening in the code, etc.
