Anda di halaman 1dari 120

R Programming for Life Scientists

Version 2.0
Raymond R. Balise, Ph.D.
Health Research and Policy
Spectrum
Roadmap
What makes R different for the rest?
Setting up R
Types of data
Working with collections of data
Importing and exporting data
Writing functions
Graphics

When to Use R
Shoestring budget
Cutting edge statistics
Developing your own or fine-tuning existing
methods
Local expertise
Programming Languages
Procedural languages
C, Fortran, Cobol, Basic
use a model where the logic flows from the top of
the page to the bottom with calls to goto
subroutines as needed
It is hard to encapsulate the code.
Object oriented languages
C++, Visual Basic, JAVA
involves creating objects and then operating on them
R is Object Oriented (OO)
You create objects
vector of numbers, a graphic, etc.
You call methods/functions to operate on the
objects.
Working with an OO language requires you to
learn about special methods to create, access,
modify, or destroy objects and their properties.
R hides these processes.
It helps a lot if you want to write new statistics and
methods and is required for making new packages.
OO Example
With R you write code in the editor which I will
show you in a minute.
You can create an object which holds a bunch of
numbers (a vector, if you remember math)
You can then use (aka call) a function (aka
method) to operate on the object.
The summary() function
Create and display a numeric summary object
The plot() function
Create and display a graphic summary object

Make the
ages object
Call the
summary
function.
Call the
plot
function.
But wait theres more!
There is a lot of functionality built into R. It
ships with libraries that do many different
tasks. And you can download more.
Map most of the
USA.
Activate the map
datasets and
functions.
But hold on. There is MORE!
You can add options to the function calls to
make them do fancy things like color.
Or you can have one function act on the
output of another function.
And you can save output as objects!
Important Objects
Vectors are lists of numbers.
Dataframes are like database or spreadsheets.
Everything is an Object
Class:
character
numeric
logical
function
factor
data.frame
list
matrix
lm
table
lots more
R objects can have a class which indicates what they are used for.

character
numeric
logical
function
complex
raw


Mode:
Every object in R has a mode which determines how much space it uses.
Structures to hold data
Columns (of same length) with data like database
Columns of data (of different length)
Grid of data (all same type)
Holding simple data elements
Output from summary/graphic procedures
Where to Get R
R has two main websites. One describes the project:
http://www.r-project.org/
The other has most of the stuff you want to
download:
http://cran.r-project.org/
Because the R project has people working all over
the globe, the software download site is mirrored
everywhere. The closest mirror is USA CA1 (aka UC
Berkeley).
http://cran.cnr.berkeley.edu/
There is an R installer for all the common
operating systems:
cran.cnr.berkeley.edu/bin/windows/base/
cran.cnr.berkeley.edu/bin/macosx/
cran.cnr.berkeley.edu/bin/linux/
Each is basically self explanatory.
Installing on Windows
Double click the installer and just push next
until you get to this screen.






Specify that you
want to do
customized startup.

This will let you set
up R to work with
other programs
nicely.
Customize
Use these options, then hit Next> a bunch.
help.start() and push enter to start the help.
q() and push enter to quit but dont yet.
GUI
Use the built in
editor.
Save or restore all
the objects in use.
Save or reload the
code from the
console.
Keep all the text in
the console for the
session.
Set the working
directory to save
objects.
GUI
Edit existing data.
Tweak the
appearance of the
console.
Rprofile.site
If you have instructions that you always want
run when R starts up, you can include them in
the Rprofile.site file:
GUI
Common
commands.
Show the add on
packages currently
accessible.
Packages in R
User-supplied packages are typically found at
one of three places:
CRAN for all kinds of stuff
Omegahat for web-based statistics
Bioconductor for genomic analysis
R packages update often.
Your colleagues will recommend task-specific
packages.
Rcmdr is my favorite.
GUI
Use a previously
downloaded package.
I type library(name)
instead.
USA (CA1) is closest
to Stanford.
Choose which set of
packages to look at.
See the HUGE list of
packages.
Update often!
GUI
This is useful.
HTML
help
This is useful
but not
Google.
This will not find
information if you
have not installed
the packages.
Rseek.org is Google-driven
I highly recommend it.
Mac Quick Help
Search help for the
word "map".
Search for details
on a function if the
package is loaded
and you know the
functions name.
Windows Quick Help
Search help for the
word "map".
Search help for the
function named
"map".
Load the package.
Mac Install
Download and double click the dmg file.
Click customize and
make sure Tcl/Tk is
checked on.
X11
Some packages for R on the Mac (like Rcmdr)
require X11 to be installed.
I think it is part of the standard Leopard
installation but was an option with Tiger. If you
need it, try to install it off of the DVD that came
with your machine because people have reported
using the dmg files from Apple.com.
X11 and Add-on Packages
To get add on
packages, use this
menu.
You can click here
to make sure X11
works.
Getting or Updating Packages
Click Get List, click the package name, be sure
install dependencies is checked on, then click
install.
Instead of Point and Click
You can also run this code to have Mac or
Windows R download a list of packages:

usefulPackages = c("car", "foreign", "hexbin", "gdata",
"ggplot2", "gmodels", "gplots", "Hmisc", "reshape",
"Rcmdr")
install.packages(usefulPackages, dependencies = TRUE)

Be sure to take note of any packages that do not install.
marray, affy, Biobase, Rgraphviz were not available
I suggest you install the Rcmdr package first
thing.
Use the Install packages option on the package
menu to download Rcmdr
To make it available for your R session type:
library(Rcmdr)
CAPITALIZATION MATTERS!
The first time you run it, it will ask you if it can
download additional packages.
Your First Package
If you are on Windows you
can directly import Excel.
On a Mac you can
not directly import
from Excel.
Hate Typing?
Tab is your fiend. It will auto-complete if it
can or give you a list of functions that match
what you have typed. It woks very well on the
Mac. In Windows sometimes you need to type
tab twice.

In Windows if you type tab after a ( it displays
options for the function or they just appear in
the Mac.
Before Analysis
Rcmdr makes the most commonly done
analyses easy or if you are told the names of
the functions to use, writing the code is
almost tolerable.
Data manipulation is relatively difficult in R
compared to other analysis and data
management tools.
You need to know how to manipulate the data
objects.
Data Set Objects
Vectors
A bunch of data in a single row or column
All of the same type
Matrix
A row and column arrangement of data
All of the same type
Data frame
A row and column arrangement of data
Columns are of different types
List
Very free-form structure
A grouping of different types of data
Like a good spreadsheet
or relational database file
Types of Data Vectors
Numeric
Integer, real, and complex are different types but
you will not need to pay attention to the details
NA means missing
NAN means not a number
String
Characters of the alphabet
Logical
TRUE, FALSE or NA

R is case sensitive.
# means ignore the rest of the line.
; means a new command follows.
Making a Vector
Surround the expression
with () to display the result
automatically.
This will be VERY useful.
Make a sequence
OneToThirty = seq(1, 30)
OneToThirty # same as print(OneToThirty)
oneToThirty = seq(1, 30, by = 2); oneToThirty
x1230 = 1:30
(x1to30 = 1:30)
Making Vectors With c()
c stands for concatenate
ages = c(9, 11, 40, 41) ; ages
stooges = c("Larry", "Moe", "Curly", "Shemp"); stooges

Getting Details
You can use is functions and length to get
details on a vector.
is.vector(ages)
is.numeric(ages)
is.logical(ages)
length(ages)

You can add one to all four ages.
ages + c(1,1,1,1)
If you provide the scalar integer, R will
temporarily vectorize the 1 by recycling that
value to match the length of the ages vector.
ages + 1
It will recycle a series also.
ages
ages + c(1,2)
Recycling and Vectorizing
Naming Parts of a Vector
You can assign names to the elements of a vector.
This allows later access to the elements using the
names instead of the position.
names(ages) = stooges
ages
To erase them:
names(ages) = NULL; ages
Notice what happens when the lengths differ:
stooges= c("Larry", "Moe", "Curly")
names(ages) = stooges
ages
Attributes
When you add names to things (objects) they
acquire or change their names attribute.
attributes(ages)
When you strip off the names, the vector is
left with no attributes.
names(ages) = NULL
attributes(ages)


A data frame is an object with many
attributes.
R ships with a lot of datasets if you want one
help.start()
Click packages then datasets.
esoph
?esoph
attributes(esoph)
Complex Objects
Getting at Parts of a Vector
Specify the element number.
heyMoe = ages[2] ; heyMoe

Specify to drop everything except the element
number.
ages[c(-1, -3, -4)]
Specify a list with TRUE and FALSE
ages[c(FALSE, TRUE, FALSE, FALSE)]
Getting Parts with Names
ages = c(9, 11, 40, 41) ; ages
names(ages) = c("Larry", "Moe", "Curly", "Shemp")
ages

Specify the name.
heyMoe = ages["Moe"]


Duplicate Names
That code only returns the first one if there
are duplicates.
names(ages)[4] = "Moe"
ages
heyMoe = ages["Moe"]
heyMoe
Gives all if duplicates
names(ages) %in% "Moe"
ages[names(ages) %in% "Moe"]

Parts of a Data Frame
You can select columns of a data frame just
like you selected elements from a vector.
booze = esoph["alcgp"]
is.data.frame(booze)
esoph[2]
esoph[c(4,5)]
Choosing Records
If you put a single item or series inside of the
square brackets, R thinks you are requesting
columns.
If you want to get access to specific rows, you
include a comma after the rows.
blah[rows, columns]
esoph[ 1 , ]
esoph[ 1:3 , ]
Smarter Access to a Vector
You can use logic checks to find the record
numbers in a vector which meet your criteria.
ages < 21
which(ages < 21)
You can then subset down your data to the
records of interest using the [ ] subset
operator.
ages[which(ages < 21)]
ages[ages < 21]
You can find the records that pass a logic
check inside a data frame.
esoph["ncases"] > 0
esoph$ncases > 0
which(esoph$ncases > 0)
Smarter Access to a Data Frame
Subset a Data Frame
Recall that you can select rows with
frameName[rows,columns] and if you do not
include a comma, all records are chosen.
which(esoph$ncases > 0) gives you a list of
records which adhere to that rule. Therefore,
the code below gives you a subset
esoph[ which(esoph$ncases > 0) , ]
or
esoph[esoph$ncases > 0 , ]
Subset Data Frames Easily
If that logic is rough to think about, use the
subset function.
subset(esoph, ncases > 0)


Choosing Values
If you need specific values, you can use the &
(and) or the | (or) operators to get the
ordered set of TRUE and FALSE values.
ages > 21 & ages < 41
! means not
!(ages > 21 & ages < 41)
Notice that it is applying the one logic check
to the vector of ages. How does it do that?
Math on Data Frame Columns
You have seen how to do scalar and vector
algebra. Algebra on a data frame is easy.
names(esoph)
esoph$total=esoph$ncases + esoph$ncontrols

To see the end of the data frame, use tail()
tail(esoph)
Comparing Against Vectors
This one uses
recycling and
gives wrong
answers.
What happens when you try to compare a vector
to a set of things?
gender = c(NA, "Male", "Female", "Blue", "Female")
gender == "Male" | gender == "Female"
gender == c("Male", "Female")
R recycles the shorter vector to be the longer
length, then does the comparison. Use the %in%
operator if you want to compare as if you wrote a
series of or statements.
gender %in% c("Male", "Female")

Categorical Variables
R makes a distinction between variables holding
a bunch of characters from the alphabet and
variables holding categorical information. If
you have a classification/categorical variable,
you want R to treat it as a factor or an ordered
factor. Typical factors are treatment or gender.
dose = c("low", "placebo", "high", "low")
dose
typeof(dose)
Factors
To convert a character variable to a factor, use the
as.factor function.
doseF = as.factor(dose)
typeof(doseF)
class(doseF)
Behind the scenes, the character variable is
converted into numbers and the numbers are
given character strings to display.
In modern R the levels of the factor are ordered
alphabetically and the first one is represented
with the digit 1, the second is 2, etc.

There are is. or as. predicate functions
to check object types or convert
between types of objects.
Comparing Factors
Notice wrong answer thanks to recycling.
You can compare a factor vs. a constant value.
doseF == "high"
as.integer(doseF) == 1
Or you can compare vs. vectors (CAREFULLY).
doseF == c("high", "low")
doseF %in% c("high", "low")
R will stop you from comparing factors that have
different categories.
doseF2 = as.factor(c("blah", "placebo", "high", "low"))
doseF == doseF2
Recoding Factors
Often you will want to regroup factor levels.
amount=as.factor(c("placebo", "10mg", "5mg", "10mg"))
levels(amount)
regroup = list(none="placebo", some=c("5mg", "10mg"))
levels(amount) = regroup
amount

none
placebo
some
5mg
10mg
Numeric Factors
If you have numeric factors, be careful
converting from factors back to numbers.
ID = c(1000, 1000, 1001, 2)
IDf = factor(ID)
as.integer(IDf)
levels(IDf)
numbersAgain = as.numeric(levels(IDf))[IDf]


Recoding Values in a Vector
R has functions like if and ifelse to process
values.
ifelse(ages < 21, "Young", "Old")
ifelse(ages<21, "Young", ifelse(ages<41, "Old",
"Fossilized") )
Easier Recoding
Other packages like car have functions to recode:
library(car)
newAge2=recode(ages, ' 1:21="Young"; else= "Old" ')
newAge2
detach("package:car")

Attaching and Masking
You can see the order in which R searches for
things using the search() function.
Remove the
package from
the search path.
Attaching Data Frames
People who really hate typing attach data
frames so they can refer to them with short
names.
bmi = women$weight / women$height ^2 * 703
Instead you can attach the women data frame
and an easier formula write:
attach(women)
search()
bmi = weight / height ^2 * 703; bmi
detach(women)


Keeping Track
You can see what datasets are in each of the
work environments/packages with the list
stuff function ls().
rm(list=ls(all=TRUE))
ls()
search()
attach(women)
ls()
search()
head(women); head(height)

datasets

women

women
height
weight
.GlobalEnv
Look 1
st
Look 2
nd
Look 3
rd

Adding a Variable & Making a DF
women$bmi = weight / height ^2 * 703; bmi
head(women)
ls()

datasets

women

women
height
weight
.GlobalEnv
women
Look 1
st
Look 2
nd
Look 3
rd

The data frame
with bmi
and the data
frame without
bmi
Making a Data Frame
Frequently you will want to make data frames for
analysis with Rcmdr. Use the data.frame()
command:
attach(sleep)
pair = data.frame(extra[group=="A"], extra[group=="B"])

Using Rcmdr for a paired t-test
Click here.
Loading Text Data into R
Reading text files:
fakeAlleles=read.table("c:\\blah\\fakeAlleles.txt",
header=TRUE)
See if it worked:
fakeAlleles
names(fakeAlleles)
summary(fakeAlleles)
fakeAlleles$dude = as.character(fakeAlleles$dude)
fakeAlleles
A better option:
fakeAlleles = read.table("c:\\blah\\fakeAlleles.txt", header =
TRUE, colClasses = c("character", "factor","factor"))
Other Text Formats
Other text reading methods:
read.csv = coma separated values
read.csv2 = semicolon delimited files
read.delim = read tab delimited files
read.fwf = read fixed width format files
Use same options as read.table
If the data has bad or no column headings you may
also want to include:
read.table( stuff, col.names = c("name1", "name2") )
To prevent characters from coming in as factors:
options(stringsAsFactors = FALSE)

Data Frames
The data imported into a data frame.
class(fakeAlleles)
A data frame really is a list of vectors where the
vectors are all the same length.
as.list(fakeAlleles)
To select a column you specify the data frame $
variable name.
theDudes = fakeAlleles$dude
All the stuff you saw for logic checks on vectors
can be used on the parts of a data frame.
fakeAlleles$allele1 == "A"
Subsetting Vectors (again)
Recall that you can subset using the [ ]
operator:
ages = c(9, 11, 40, 41)
heyMoe = ages[2]
ages <21
ages[ages < 21]
The same voodoo works on the vectors that
make up data frames!
dudeTwo = fakeAlleles$dude[2]


Subsetting Data Frames
Parts (subsets) of data frames are referenced
by "column numbers comma row numbers":
The first record: fakeAlleles[1, ]
The 2
nd
and 3
rd
columns: fakeAlleles[ , c(2,3)]
The genotype for record 6: fakeAlleles[6, c(2,3)]
or by names:
fakeAlleles[, c("allele1", "allele2")]
Named Rows
You can name your rows also:
fakeAlleles = read.table("c:\\blah\\fakeAlleles.txt",
header = TRUE, colClasses = c("character",
"factor","factor"))
row.names(fakeAlleles) = fakeAlleles$dude
fakeAlleles = fakeAlleles[, -1]; fakeAlleles

fakeAlleles["006", c("allele1", "allele2")]



Getting Counts with Rcmdr
Subsetting Using Logic
You can use logic checks to subset:
fakeAlleles$allele1 == "A" & fakeAlleles$allele2 =="A"
fakeAlleles[ fakeAlleles$allele1 == "A" &
fakeAlleles$allele2 =="A", ]
Importing From Excel
If you have PERL on your machine, you can
use the read.xls() function in the gdata library
to easily get data out of Excel and into a data
frame.
Mac has PERL
Windows
http://www.activestate.com/activeperl/
Using read.xls
Windows:
library(gdata)
sleepy = read.xls("c:\\blah\\sleep.xls")
Mac:
library(gdata)
read.xls("/users/balise/desktop/sleep.xls")
Its that easy Behind the scenes it is converting
the xls file into a csv so you can use the text
importing options.
Do summary() on the data frame and notice what
happens to the missing value.
RODBC
ODBC is a language/convention for accessing
databases. R allows you to use ODBC
connections to burrow directly into databases
and other data containers like Excel.
library(RODBC)
channel<-odbcConnectExcel("c:\\blah\\sleep.xlsx")
sqlQuery(channel, "select * from [Sheet1$]")
odbcCloseAll()
SQL
If you have to learn one programming
language, learn SQL.
With it you can manipulate data stored in nearly
every commercial database.
You can aggregate, subset and modify data.
It is well implemented inside of both R and SAS.
SQL with R is nicely documented in Spector's
(2008) Data Manipulation with R. It is a must
own for people who want to learn R.

Exporting Text Files
R can write objects full of data, including data
frames, into text files.
By default, it will quote the character string and fill
in the letters NA where there were originally
missing values.
This code exports back to the original
appearance.
write.table(sleepy, file = "c:\\blah\\exported.tab",
sep ="\t", quote = FALSE, na ="")

Office 2007 Excel
ODBC connection
1. Control Pannels,
2. Double click Adminstrative Tools
3. Double click Data Sources (ODBC)
4. On the USER DNS tab choose ADD
5. Click Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)
6. Give the connection a name and browse to the file.
Jot down the name of the connection for the R code.
Using an ODBC connection
Once the ODBC connection is set-up use code like this:
library(RODBC)
connection = odbcConnect("sleepODBC")
dataFromODBC= sqlFetch(connection, "Sheet1")
odbcClose(connection)
Creating Programs
You can write line-by-line instructions in the R
console, use the editors built into R, or use a third
party editor (like Tinn-R for windows or JGR).
Console
Type history() to see the lines you have submitted
recently and then save to a file and re-run it later if
needed.
Built-in Editor
Mac: Click the blank page at top of the console
Windows: File > New Script

Windows Tinn-R Editor
http://www.sciviews.org/Tinn-R/index.html
OO Programming in R
OO programming requires
objects
classes
describe specific properties for groups of objects
inheritance
classes related to eachother (derived from other classes) have related
properties
polymorphism
the same function name applied to different classes does different things
R vs. JAVA: R typically has separate classes for actions instead of
bundling them with the data structures
JAVA
Animal -> domesticated -> dog (walks)
R
Animal -> domesticated -> dog
Movement -> Walks
Polymorphism is Fun
plot does different things depending on the
function arguments:
plot(sleepy)
plot(sleepy$extra)
plot(sleepy$baseline, sleepy$extra, sleepy$group)
Take a look at how it works:
isS4(plot)
methods(plot)
getAnywhere(plot.factor)

Arguments
Functions try to match arguments in the order they
appear in the definitions in the help files.
?plot
You can explicitly reference the arguments' full names
and they typically allow abbreviations (but dont
obfuscate your code).
The means that other arguments are allowed and
are passed along the class hierarchy to other methods.
Look at plot to see this.
The makes it easy to write bugged code because you if
misspell the argument name, it is silently passed along.
Writing Functions
You can easily write functions, but notice that the last
thing calculated is returned:
MandM = function(x){mean(x); median(x)}
MandM(sleepy$extra) # returns only the median
Store the values you want into a list:
MandM = function(x) {
blah = list(theMean=0, theMedian=0)
blah$theMean = mean(x)
blah$theMedian = median(x)
return (blah)
}
MandM(sleepy$extra)

Other Arguments
MandM(sleepy$baseline)
It points out that we need to deal with missing
values. Look up mean and median and you will
see they allow the na.rm parameter to determine
if missing values are dropped.
Using MandM(sleepy$baseline, na.rm=TRUE)
does not work because the parameter list does
not allow it. We want to allow that parameter to
be passed along. So rewrite the function.

M and M Again
Recall that an in the argument list means
"other stuff".
MandM = function(x, ...) {
blah = list(theMean=0, theMedian=0)
blah$theMean = mean(x , ...)
blah$theMedian = median(x, ...)
return (blah)
}
MandM(sleepy$extra)
MandM(sleepy$baseline, na.rm=TRUE)
Appling Your Function
R does allow you to write loops to iterate over
records or variables but if you are not writing
novel math functions, they can generally be
avoided.
R will try to vectorize and process:
MandM(c(sleepy$baseline,sleepy$extra))
Use sapply to apply a function to a data frame:
sapply(sleepy, MandM, rm.na=TRUE)
Better M and M
MandM = function(x, ...) {
blah = list(theMean=0, theMedian=0)
if(is.numeric(x)== TRUE) {
blah$theMean = mean(x , ...)
blah$theMedian = median(x, ...)
}
return (blah)
}
sapply(sleepy, MandM, na.rm=TRUE)

Yummy M and M
MandM = function(x, ...) {
blah = list(theMean=NaN, theMedian=NaN)
if(is.numeric(x)== TRUE) {
blah$theMean = mean(x , ...)
blah$theMedian = median(x, ...)
}
return (blah)
}
MandM(sleepy$extra)
MandM(sleepy$baseline, na.rm=TRUE)

sapply(sleepy, MandM, na.rm=TRUE)
Writing Novel Functions
Look hard on rseek.org before you reinvent
the wheel.
R syntax is very similar to C.
Select/Case logic is different (R short-circuits) .
The R Book by Crawley is too big to buy for
just this topic but it is good for syntax. Get it
from the library and read the early chapters.
The final chapter of Spector has a few
wonderful pages.
A Handy Function
# The "what the function" (wtf) function

wtf = function(x){
if (missing(x)) stop("\n\tUsage: wtf(x)\n\twhere x exists")

item = as.character(sys.call())[2]
if(is.factor(x) == TRUE) {
thingy = cat(item,
"\n FACTOR \n length", length(x),
"\n MODE:", mode(x),
"\n CLASS:", class(x), "\n"
)
}
if(is.vector(x) == TRUE) {
thingy = cat(item,
"\n VECTOR \n length", length(x),
"\n MODE:", mode(x),
"\n CLASS:", class(x), "\n"
)
}
if(is.matrix(x) == TRUE) {
thingy = cat(item,
"\n MAXTRIX \n dimentions", paste(dim(x),
collapse="x"),
"\n MODE:", mode(x),
"\n CLASS:", class(x), "\n"
)
}

if(is.list(x) == TRUE) {
thingy = cat(item,
"\n LIST \n Components:", paste(names(x), collapse=" "),
"\n MODE:", mode(x),
"\n CLASS:", class(x), "\n"
)
}
if(is.data.frame(x) == TRUE) {
thingy = cat(item,
"\n DATAFRAME \n Components:", paste(names(x),
collapse=" "),
"\n component lenght of", length(x[[1]]), "items",
"\n MODE:", mode(x),
"\n CLASS:", class(x), "\n"
)
}
if(is.function(x) == TRUE) {
thingy = cat(item,
"\n FUNCTION \n"
)
}
}
Destroying Efficiency
A matrix of data is really a vector with row and
column attributes added to it. This has profound
speed issues if you add to the size of a matrix
because the data has to be shifted all over the
place.
If you plan on writing your own functions to
manipulate matrices, build an empty matrix of
the maximum size (or guess bigger) rather than
using the functions to add rows or columns.
Writing Efficient Code
R has decent tools for profiling code.
The Rprof and summaryRprof functions will
help you figure out what is bogging down your
code.
Rprof()
MandM(rnorm(1000000))
Rprof(NULL)
summaryRprof()
Debugging in R
See Chapter 9 in Gentleman's book.
The browser() function can be put inside a function to
pause execution and see what is going on.
The codetools package is great for tweaking big
functions:
findLocals(), findGlobals(),
shows you if variables and functions originate inside of a
function
checkUsage(), and checkUsagePackage()
shows you what variables are modified or not touched in a
function
Creating Graphs
Basic plots are easy but tweaking them for
publications can be rough because the
documentation on the function arguments is
appalling.
Data Analysis and Graphics Using R by John
Maindonald and John Braun is extremely useful.
There are myriad graphics built into the core of R
plus more in the packages.
addictedtor.free.fr/graphiques/thumbs.php
Test Scores
scores = read.table("c:\\blah\\walkerScores.txt", header = TRUE)
rapply(scores, class)
scores$CENTER = as.factor(scores$CENTER)
scores$PAT = as.character(scores$PAT)
rapply(scores, class)
scores$isSick = ifelse(scores$SCORE > 0, 1, 0);
library(car)
(scores$SEV = with(scores, recode(SCORE, '0 = "None" ;1:30 =
"Mild"; 31:69 = "Moderate"; 70:100 = "Severe"; else = "BAD
DATA"')))
(scores$SEV = factor(scores$SEV, levels = c("None", "Mild",
"Moderate", "Severe"), ordered = TRUE));
Common Plots are Easy
attach(scores) #to avoid typing scores$
plot(SEV, main = "MainTitle", xlab = "xlab", ylab =
"ylab")
plot(SCORE)
hist (SCORE)
boxplot(SCORE)
boxplot(SCORE ~ SEX, ylim = c(0,100))
detach(scores)
Graphics Tweaks
-3 -2 -1 0 1 2 3
0
.
0
0
.
1
0
.
2
0
.
3
0
.
4
Density = dnorm
z
P
r
o
b
a
b
i
l
i
t
y

d
e
n
s
i
t
y
-3 -2 -1 0 1 2 3
0
.
0
0
.
2
0
.
4
0
.
6
0
.
8
1
.
0
Probability = pnorm
z
P
r
o
b
a
b
i
l
i
t
y
0.0 0.2 0.4 0.6 0.8 1.0
-
2
-
1
0
1
2
Quantiles = qnorm
p
Q
u
a
n
t
i
l
e

(
Z
)
Random numbers = rnorm
z
f
r
e
q
u
e
n
c
y
-4 -2 0 2
0
5
0
1
0
0
1
5
0
2
0
0
mfrow is used to set
number of rows and
columns of graphics on
a page
Strip Charts for Small Datasets
par(cex = 1.5) # big font
with(Gad, stripchart(HAMA ~ DOSEGRP, xlab =
"HAMA", pch = 16))
20 25 30 35
H
I
L
O
P
B
HAMA
3 Languages for the Price of 1
The graphics I have shown use the classic
graphic methods.
There are trellis plots from the lattice package
that split the data into multiple panes
automatically.
ggplot2 uses a "grammar of graphics"
approach (like SPSS).
Dont play with pie!
library(lattice)
trellis.par.set(list(fontsize=list(points=20)))
trellis.par.set(list(fontsize=list(text=25)))
dotplot(table(Gad$DOSEGRP), xlim = c(-1, 21))
Freq
HI
LO
PB
0 5 10 15 20
HI
LO
PB
DOSEGRP
The lattice package makes
trellis graphics (I didnt make
up these names!).
Compression Ratio
N
O
x

(
m
i
c
r
o
g
r
a
m
s
/
J
)
1
2
3
4
8 1216
EE
8 1216
EE
8 1216
EE
8 1216
EE
8 1216
EE
8 1216
EE
8 1216
EE
8 1216
EE
8 1216
EE
Typical lattice plot with
banding to show
subsets
trellis.par.set(list(fontsize=list(points=15)))
trellis.par.set(list(fontsize=list(text=15)))
EE <- equal.count(ethanol$E, number=9,
overlap=1/4)
## Constructing panel functions on the fly; prepanel
xyplot(NOx ~ C | EE, data = ethanol,
prepanel = function(x, y) prepanel.loess(x, y,
span = 1),
xlab = "Compression Ratio", ylab = "NOx
(micrograms/J)",
panel = function(x, y) {
panel.grid(h=-1, v= 2)
panel.xyplot(x, y)
panel.loess(x,y, span=1)
},
aspect = "xy")
Basic plot + geometric
details + adding details
+ adding more details +
yet more details
qplot(carat, price, data = diamonds, geom
= c("point", "smooth"))
Rcmdr has A LOT of great graphics built into
the point and click interface.
library(Rcmdr)
Look up my short course (5 talks) covering
basic statistics to see how to code many
graphics.
www.stanford.edu/~balise/HowToDoBiostatistics.htm
Use Rcmdr (R Commander)
You are Going to Need More Help
Data Manipulation with R by Spector.
A must-have book on how to read and write data with or without SQL,
manipulate data with R, aggregate data, and reshape datasets easily.
R Programming For Bioinformatics by Gentleman.
A very good intermediate level book on how R object-oriented
programming really works.
The R Book or Statistical Computing by Crawley.
These have nicely written intermediate level statistics.
But they are highly redundant across the two books.
Redundant
A couple datasets used for this talk were from Glenn
A. Walkers Common Statistical Methods for Clinical
Research with SAS Examples.
Buy Walkers book if you do clinical research and you use
SAS.
Data Analysis and Graphics Using R by Maindonald and Braun
I constantly use this book to figure out how to do graphics.
Using R for Introductory Statistics by Verzani.
This one has fantastic coverage of the R to do the common statistics.
It also has a nearly useless index at the end of book.

Biostatistics
John Fox, the guy who made Rcmdr, is an
excellent author and he provides an R based
supplement for his superb statitics book.
Spectrum
If you are doing biomedical research and have
questions we are here to help.
Study design
Analysis plan
Power and sample size calculation
(Limited availability help with SAS and R code)

med.stanford.edu/spctrm/biostatistician.html
Special thanks to 5F-X, Assemblage 23

Anda mungkin juga menyukai