
Interactive Web Apps with shiny : : Cheat Sheet
learn more at shiny.rstudio.com

Basics

A Shiny app is a web page (UI) connected to a computer running a live R session (Server). Users can manipulate the UI, which will cause the server to update the UI's displays (by running R code).

App template

Begin writing a new app with this template. Preview the app by running the code at the R command line.

library(shiny)
ui <- fluidPage()
server <- function(input, output){}
shinyApp(ui = ui, server = server)

• ui - nested R functions that assemble an HTML user interface for your app
• server - a function with instructions on how to build and rebuild the R objects displayed in the UI
• shinyApp - combines ui and server into a functioning app. Wrap with runApp() if calling from a sourced script or inside a function.

Building an App

Complete the template by adding arguments to fluidPage() and a body to the server function.

1. Add inputs to the UI with *Input() functions
2. Add outputs with *Output() functions
3. Tell the server how to render outputs with R in the server function. To do this:
   1. Refer to outputs with output$<id>
   2. Refer to inputs with input$<id>
   3. Wrap code in a render*() function before saving to output

library(shiny)

ui <- fluidPage(
  numericInput(inputId = "n",
    "Sample size", value = 25),
  plotOutput(outputId = "hist")
)

server <- function(input, output) {
  output$hist <- renderPlot({
    hist(rnorm(input$n))
  })
}

shinyApp(ui = ui, server = server)

Save your template as app.R. Alternatively, split your template into two files named ui.R and server.R.

# ui.R contains everything you would save to ui
fluidPage(
  numericInput(inputId = "n",
    "Sample size", value = 25),
  plotOutput(outputId = "hist")
)

# server.R ends with the function you would save to server. No need to call shinyApp().
function(input, output) {
  output$hist <- renderPlot({
    hist(rnorm(input$n))
  })
}

Save each app as a directory that contains an app.R file (or a server.R file and a ui.R file) plus optional extra files.

app-name            The directory name is the name of the app
├── app.R
├── global.R        (optional) defines objects available to both ui.R and server.R
├── DESCRIPTION     (optional) used in showcase mode
├── README          (optional)
├── <other files>   (optional) data, scripts, etc.
└── www             (optional) directory of files to share with web browsers (images, CSS, .js, etc.). Must be named "www"

Launch apps with runApp(<path to directory>)

Inputs - collect values from the user

Access the current value of an input object with input$<inputId>. Input values are reactive.

actionButton(inputId, label, icon, …)
actionLink(inputId, label, icon, …)
checkboxGroupInput(inputId, label, choices, selected, inline)
checkboxInput(inputId, label, value)
dateInput(inputId, label, value, min, max, format, startview, weekstart, language)
dateRangeInput(inputId, label, start, end, min, max, format, startview, weekstart, language, separator)
fileInput(inputId, label, multiple, accept)
numericInput(inputId, label, value, min, max, step)
passwordInput(inputId, label, value)
radioButtons(inputId, label, choices, selected, inline)
selectInput(inputId, label, choices, selected, multiple, selectize, width, size) (also selectizeInput())
sliderInput(inputId, label, min, max, value, step, round, format, locale, ticks, animate, width, sep, pre, post)
submitButton(text, icon) (Prevents reactions across entire app)
textInput(inputId, label, value)

Outputs - render*() and *Output() functions work together to add R output to the UI

DT::renderDataTable(expr, options, callback, escape, env, quoted) — works with dataTableOutput(outputId, icon, …)
renderImage(expr, env, quoted, deleteFile) — imageOutput(outputId, width, height, click, dblclick, hover, hoverDelay, hoverDelayType, brush, clickId, hoverId, inline)
renderPlot(expr, width, height, res, …, env, quoted, func) — plotOutput(outputId, width, height, click, dblclick, hover, hoverDelay, hoverDelayType, brush, clickId, hoverId, inline)
renderPrint(expr, env, quoted, func, width) — verbatimTextOutput(outputId)
renderTable(expr, …, env, quoted, func) — tableOutput(outputId)
renderText(expr, env, quoted, func) — textOutput(outputId, container, inline)
renderUI(expr, env, quoted, func) — uiOutput(outputId, inline, container, …) and htmlOutput(outputId, inline, container, …)

Share your app

The easiest way to share your app is to host it on shinyapps.io, a cloud based service from RStudio.

1. Create a free or professional account at http://shinyapps.io
2. Click the Publish icon in the RStudio IDE (>= 0.99) or run:
   rsconnect::deployApp("<path to directory>")

Or build or purchase your own Shiny Server at www.rstudio.com/products/shiny-server/
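As a compact illustration of the input → render*() → *Output() cycle described above, here is a minimal sketch (the widget ids "bins" and "hist" and the use of the built-in faithful data set are choices made for this example, not part of the sheet):

```r
library(shiny)

# UI: one reactive input (a slider) and one output placeholder
ui <- fluidPage(
  sliderInput(inputId = "bins", label = "Number of bins",
              min = 5, max = 50, value = 10),
  plotOutput(outputId = "hist")
)

# Server: renderPlot() reruns whenever input$bins changes,
# and its result is saved to output$hist
server <- function(input, output) {
  output$hist <- renderPlot({
    hist(faithful$eruptions, breaks = input$bins)
  })
}

# shinyApp() combines the two; wrap in runApp() if sourcing from a script
shinyApp(ui = ui, server = server)
```

Running this at the R command line launches the app in a browser; moving the slider rebuilds the histogram.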
RStudio® is a trademark of RStudio, Inc. • CC BY RStudio • info@rstudio.com • 844-448-1212 • rstudio.com More cheat sheets at http://www.rstudio.com/resources/cheatsheets/ Learn more at shiny.rstudio.com/tutorial • shiny 0.12.0 • Updated: 01/16
Reactivity

Reactive values work together with reactive functions. Call a reactive value from within the arguments of one of these functions to avoid the error "Operation not allowed without an active reactive context."

The reactivity toolkit:
• Create your own reactive values: *Input() functions, reactiveValues()
• Run arbitrary code in reaction to changes: observe(), observeEvent()
• Prevent reactions: isolate()
• Modularize reactions: reactive()
• Delay reactions: eventReactive()
• Render reactive output: render*() functions, saved to output$<outputId>

Create your own reactive values

Each input function creates a reactive value stored as input$<inputId>. reactiveValues(…) creates a list of reactive values whose values you can set.

# example snippets
library(shiny)
ui <- fluidPage(
  textInput("a", "", "A")
)
server <- function(input, output) {
  rv <- reactiveValues()
  rv$number <- 5
}
shinyApp(ui, server)

Render reactive output

render*() functions build an object to display. They will rerun the code in their body to rebuild the object whenever a reactive value in the code changes. Save the results to output$<outputId>.

library(shiny)
ui <- fluidPage(
  textInput("a", "", "A"),
  textOutput("b")
)
server <- function(input, output) {
  output$b <- renderText({
    input$a
  })
}
shinyApp(ui, server)

Prevent reactions

isolate(expr) runs a code block and returns a non-reactive copy of the results.

library(shiny)
ui <- fluidPage(
  textInput("a", "", "A"),
  textOutput("b")
)
server <- function(input, output) {
  output$b <- renderText({
    isolate({input$a})
  })
}
shinyApp(ui, server)

Trigger arbitrary code

observeEvent(eventExpr, handlerExpr, event.env, event.quoted, handler.env, handler.quoted, label, suspended, priority, domain, autoDestroy, ignoreNULL) runs the code in its 2nd argument when reactive values in its 1st argument change. See observe() for an alternative.

library(shiny)
ui <- fluidPage(
  textInput("a", "", "A"),
  actionButton("go", "Go")
)
server <- function(input, output) {
  observeEvent(input$go, {
    print(input$a)
  })
}
shinyApp(ui, server)

Modularize reactions

reactive(x, env, quoted, label, domain) creates a reactive expression that
• caches its value to reduce computation
• can be called by other code
• notifies its dependencies when it has been invalidated
Call the expression with function syntax, e.g. re().

library(shiny)
ui <- fluidPage(
  textInput("a", "", "A"),
  textInput("z", "", "Z"),
  textOutput("b")
)
server <- function(input, output) {
  re <- reactive({
    paste(input$a, input$z)
  })
  output$b <- renderText({
    re()
  })
}
shinyApp(ui, server)

Delay reactions

eventReactive(eventExpr, valueExpr, event.env, event.quoted, value.env, value.quoted, label, domain, ignoreNULL) creates a reactive expression with code in its 2nd argument that only invalidates when reactive values in its 1st argument change.

library(shiny)
ui <- fluidPage(
  textInput("a", "", "A"),
  actionButton("go", "Go"),
  textOutput("b")
)
server <- function(input, output) {
  re <- eventReactive(input$go, {input$a})
  output$b <- renderText({
    re()
  })
}
shinyApp(ui, server)

UI

An app's UI is an HTML document. Use Shiny's functions to assemble this HTML with R.

fluidPage(
  textInput("a", "")
)
## Returns HTML:
## <div class="container-fluid">
##   <div class="form-group shiny-input-container">
##     <label for="a"></label>
##     <input id="a" type="text"
##       class="form-control" value=""/>
##   </div>
## </div>

Add static HTML elements with tags, a list of functions that parallel common HTML tags, e.g. tags$a(). Unnamed arguments will be passed into the tag; named arguments will become tag attributes.

tags$a tags$abbr tags$address tags$area tags$article tags$aside tags$audio tags$b tags$base tags$bdi tags$bdo tags$blockquote tags$body tags$br tags$button tags$canvas tags$caption tags$cite tags$code tags$col tags$colgroup tags$command tags$data tags$datalist tags$dd tags$del tags$details tags$dfn tags$div tags$dl tags$dt tags$em tags$embed tags$eventsource tags$fieldset tags$figcaption tags$figure tags$footer tags$form tags$h1 tags$h2 tags$h3 tags$h4 tags$h5 tags$h6 tags$head tags$header tags$hgroup tags$hr tags$HTML tags$i tags$iframe tags$img tags$input tags$ins tags$kbd tags$keygen tags$label tags$legend tags$li tags$link tags$map tags$mark tags$menu tags$meta tags$meter tags$nav tags$noscript tags$object tags$ol tags$optgroup tags$option tags$output tags$p tags$param tags$pre tags$progress tags$q tags$rp tags$rt tags$ruby tags$s tags$samp tags$script tags$section tags$select tags$small tags$source tags$span tags$strong tags$style tags$sub tags$summary tags$sup tags$table tags$tbody tags$td tags$textarea tags$tfoot tags$th tags$thead tags$time tags$title tags$tr tags$track tags$u tags$ul tags$var tags$video tags$wbr

The most common tags have wrapper functions. You do not need to prefix their names with tags$.

ui <- fluidPage(
  h1("Header 1"),
  hr(),
  br(),
  p(strong("bold")),
  p(em("italic")),
  p(code("code")),
  a(href = "", "link"),
  HTML("<p>Raw html</p>")
)

To include a CSS file, use includeCSS(), or
1. Place the file in the www subdirectory
2. Link to it with
   tags$head(tags$link(rel = "stylesheet", type = "text/css", href = "<file name>"))

To include JavaScript, use includeScript(), or
1. Place the file in the www subdirectory
2. Link to it with
   tags$head(tags$script(src = "<file name>"))

To include an image,
1. Place the file in the www subdirectory
2. Link to it with img(src = "<file name>")

Layouts

Combine multiple elements into a "single element" that has its own properties with a panel function, e.g.

wellPanel(
  dateInput("a", ""),
  submitButton()
)

absolutePanel()    inputPanel()    tabPanel()
conditionalPanel() mainPanel()     tabsetPanel()
fixedPanel()       navlistPanel()  titlePanel()
headerPanel()      sidebarPanel()  wellPanel()

Organize panels and elements into a layout with a layout function. Add elements as arguments of the layout functions.

fluidRow()
ui <- fluidPage(
  fluidRow(column(width = 4),
           column(width = 2, offset = 3)),
  fluidRow(column(width = 12))
)

flowLayout()
ui <- fluidPage(
  flowLayout( # object 1,
              # object 2,
              # object 3
  )
)

sidebarLayout()
ui <- fluidPage(
  sidebarLayout(
    sidebarPanel(),
    mainPanel()
  )
)

splitLayout()
ui <- fluidPage(
  splitLayout( # object 1,
               # object 2
  )
)

verticalLayout()
ui <- fluidPage(
  verticalLayout( # object 1,
                  # object 2,
                  # object 3
  )
)

Layer tabPanels on top of each other, and navigate between them, with:

ui <- fluidPage( tabsetPanel(
  tabPanel("tab 1", "contents"),
  tabPanel("tab 2", "contents"),
  tabPanel("tab 3", "contents")))

ui <- fluidPage( navlistPanel(
  tabPanel("tab 1", "contents"),
  tabPanel("tab 2", "contents"),
  tabPanel("tab 3", "contents")))

ui <- navbarPage(title = "Page",
  tabPanel("tab 1", "contents"),
  tabPanel("tab 2", "contents"),
  tabPanel("tab 3", "contents"))
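To make the unnamed-vs-named argument rule for tags concrete, here is a small sketch (the URL and link text are placeholders); as.character() reveals the HTML a tag object generates:

```r
library(shiny)

# Unnamed arguments become the tag's children;
# named arguments become the tag's attributes
link <- tags$a(href = "https://shiny.rstudio.com", "Shiny home")

as.character(link)
# "<a href=\"https://shiny.rstudio.com\">Shiny home</a>"
```

The same pattern applies to every tags$ function and to the wrapper functions such as a(), p(), and div().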

RStudio IDE : : CHEAT SHEET

Documents and Apps

Open Shiny, R Markdown, knitr, Sweave, LaTeX, .Rd files and more in the Source Pane. Its toolbar lets you check spelling, render output, choose the output format and location, insert code chunks, jump to the previous or next chunk, set knitr chunk options, run a chunk and all previous chunks, run selected lines, publish to a server, and show the file outline. Access the markdown guide at Help > Markdown Quick Reference.

RStudio recognizes that files named app.R, server.R, ui.R, and global.R belong to a Shiny app. From the toolbar you can run the app, choose where to view it (app location or server), publish to shinyapps.io, and manage publish accounts.

Write Code

• Navigate tabs, open files in a new window, save, find and replace, compile as a notebook, run selected code, import data with a wizard, and reuse the history of past commands.
• Multiple cursors / column selection with Alt + mouse drag.
• Code diagnostics appear in the margin; hover over diagnostic symbols for details.
• Syntax highlighting based on your file's extension.
• Tab completion to finish function names, file paths, arguments, and more.
• Multi-language code snippets to quickly use common blocks of code.
• Jump to a function in the file, or change the file type.
• Press Up in the console to see command history.

R Support

• Environment pane: load or save the workspace, delete all saved objects, search inside the environment, choose the environment to display from the list of parent environments, and display objects as a list or grid. Saved objects are shown by type with a short description; open data in the data viewer or view function source code.
• Plots pane: RStudio opens plots in a dedicated pane. Navigate recent plots, open a plot in a window, export a plot, or delete one or all plots.
• Files pane: a file browser keyed to your working directory. Click a file or directory name to open it; create folders, upload, delete, or rename files, and change the displayed directory.
• Packages pane: a GUI package manager that lists every installed package. Install or update packages, and create a reproducible package library for your project. Click to load a package with library(); unclick to detach it with detach(). Delete a version from the installed library.
• Help pane: RStudio opens documentation in a dedicated pane, with a home page of helpful links and search within or across help files.
• Viewer pane: displays HTML content, such as Shiny apps, R Markdown reports, and interactive visualizations. Stop a Shiny app, publish to shinyapps.io, rpubs, RSConnect, …, or refresh.
• View(<data>) opens a spreadsheet-like view of a data set: filter rows by value or value range, sort by values, and search for a value.

Pro Features

• Share a project with collaborators and see active shared collaborators; cursors of shared users appear in the editor.
• Start a new R session in the current project, or close the R session in a project.
• Switch R version.
• Display .RPres slideshows with File > New File > R Presentation.

Project System

File > New Project. RStudio saves the call history, workspace, and working directory associated with a project, and reloads each when you re-open the project. The name of the current project is displayed in the IDE. Choose the working directory, maximize or minimize panes, and drag pane boundaries to resize them.

Debug Mode

Open debugger mode with debug(), browser(), or a breakpoint. RStudio will open debugger mode when it encounters a breakpoint while executing code.

• Click next to a line number to add or remove a breakpoint.
• The highlighted line shows where execution has paused.
• Launch debugger mode from the origin of an error; open the traceback to examine the functions that R called before the error occurred.
• While paused you can: run commands in the environment where execution has paused, examine variables in the executing environment, select a function in the traceback to debug, step through code one line at a time, step into and out of functions to run, resume execution, or quit debug mode.

Version Control with Git or SVN

Turn on at Tools > Project Options > Git/SVN. Stage files, show the file diff, commit staged files, push/pull to the remote, view history, and open a shell to type commands on the current branch. File status codes: A added, D deleted, M modified, R renamed, ? untracked.

Package Writing

File > New Project > New Directory > R Package turns a project into a package. Enable roxygen documentation with Tools > Project Options > Build Tools. Roxygen guide at Help > Roxygen Quick Reference.
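As a minimal sketch of entering debug mode from code (the function name f is arbitrary, chosen for this example): debug() flags a function so the debugger opens on its next call, isdebugged() checks the flag, and undebug() clears it.

```r
# A toy function to debug
f <- function(x) {
  y <- x * 2
  y + 1          # a breakpoint on this line would pause execution here
}

debug(f)          # open the debugger the next time f() is called
isdebugged(f)     # TRUE - f is flagged for debugging
undebug(f)        # remove the flag
isdebugged(f)     # FALSE

# Alternatively, place browser() inside a function body to pause
# at that exact point when the function runs interactively.
```

Use debugonce(f) instead of debug(f) to debug only the next call without having to call undebug() afterward.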
RStudio® is a trademark of RStudio, Inc. • CC BY RStudio • info@rstudio.com • 844-448-1212 • rstudio.com • Learn more at www.rstudio.com • RStudio IDE 0.99.832 • Updated: 2016-01
WHY RSTUDIO SERVER PRO?

RSP extends the open source server with a commercial license, support, and more:
• open and run multiple R sessions at once
• tune your resources to improve performance
• edit the same project at the same time as others
• see what you and others are doing on your server
• switch easily from one version of R to a different version
• integrate with your authentication, authorization, and audit practices
Download a free 45 day evaluation at www.rstudio.com/products/rstudio-server-pro/

1 LAYOUT                          Windows/Linux        Mac
Move focus to Source Editor       Ctrl+1               Ctrl+1
Move focus to Console             Ctrl+2               Ctrl+2
Move focus to Help                Ctrl+3               Ctrl+3
Show History                      Ctrl+4               Ctrl+4
Show Files                        Ctrl+5               Ctrl+5
Show Plots                        Ctrl+6               Ctrl+6
Show Packages                     Ctrl+7               Ctrl+7
Show Environment                  Ctrl+8               Ctrl+8
Show Git/SVN                      Ctrl+9               Ctrl+9
Show Build                        Ctrl+0               Ctrl+0

2 RUN CODE                        Windows/Linux        Mac
Search command history            Ctrl+Up              Cmd+Up
Navigate command history          Up/Down              Up/Down
Move cursor to start of line      Home                 Cmd+Left
Move cursor to end of line        End                  Cmd+Right
Change working directory          Ctrl+Shift+H         Ctrl+Shift+H
Interrupt current command         Esc                  Esc
Clear console                     Ctrl+L               Ctrl+L
Quit Session (desktop only)       Ctrl+Q               Cmd+Q
Restart R Session                 Ctrl+Shift+F10       Cmd+Shift+F10
Run current line/selection        Ctrl+Enter           Cmd+Enter
Run current (retain cursor)       Alt+Enter            Option+Enter
Run from current to end           Ctrl+Alt+E           Cmd+Option+E
Run the current function          Ctrl+Alt+F           Cmd+Option+F
Source a file                     Ctrl+Shift+O         Cmd+Shift+O
Source the current file           Ctrl+Shift+S         Cmd+Shift+S
Source with echo                  Ctrl+Shift+Enter     Cmd+Shift+Enter

3 NAVIGATE CODE                   Windows/Linux        Mac
Goto File/Function                Ctrl+.               Ctrl+.
Fold Selected                     Alt+L                Cmd+Option+L
Unfold Selected                   Shift+Alt+L          Cmd+Shift+Option+L
Fold All                          Alt+O                Cmd+Option+O
Unfold All                        Shift+Alt+O          Cmd+Shift+Option+O
Go to line                        Shift+Alt+G          Cmd+Shift+Option+G
Jump to                           Shift+Alt+J          Cmd+Shift+Option+J
Switch to tab                     Ctrl+Shift+.         Ctrl+Shift+.
Previous tab                      Ctrl+F11             Ctrl+F11
Next tab                          Ctrl+F12             Ctrl+F12
First tab                         Ctrl+Shift+F11       Ctrl+Shift+F11
Last tab                          Ctrl+Shift+F12       Ctrl+Shift+F12
Navigate back                     Ctrl+F9              Cmd+F9
Navigate forward                  Ctrl+F10             Cmd+F10
Jump to Brace                     Ctrl+P               Ctrl+P
Select within Braces              Ctrl+Shift+Alt+E     Ctrl+Shift+Alt+E
Use Selection for Find            Ctrl+F3              Cmd+E
Find in Files                     Ctrl+Shift+F         Cmd+Shift+F
Find Next                         Win: F3, Linux: Ctrl+G    Cmd+G
Find Previous                     Win: Shift+F3, Linux: Ctrl+Shift+G    Cmd+Shift+G
Jump to Word                      Ctrl+Left/Right      Option+Left/Right
Jump to Start/End                 Ctrl+Up/Down         Cmd+Up/Down
Replace and Find                  Ctrl+Shift+J         Cmd+Shift+J

4 WRITE CODE                      Windows/Linux        Mac
Attempt completion                Tab or Ctrl+Space    Tab or Cmd+Space
Navigate candidates               Up/Down              Up/Down
Accept candidate                  Enter, Tab, or Right Enter, Tab, or Right
Dismiss candidates                Esc                  Esc
Undo                              Ctrl+Z               Cmd+Z
Redo                              Ctrl+Shift+Z         Cmd+Shift+Z
Cut                               Ctrl+X               Cmd+X
Copy                              Ctrl+C               Cmd+C
Paste                             Ctrl+V               Cmd+V
Select All                        Ctrl+A               Cmd+A
Delete Line                       Ctrl+D               Cmd+D
Select                            Shift+[Arrow]        Shift+[Arrow]
Select Word                       Ctrl+Shift+Left/Right    Option+Shift+Left/Right
Select to Line Start              Alt+Shift+Left       Cmd+Shift+Left
Select to Line End                Alt+Shift+Right      Cmd+Shift+Right
Select Page Up/Down               Shift+PageUp/Down    Shift+PageUp/Down
Select to Start/End               Shift+Alt+Up/Down    Cmd+Shift+Up/Down
Delete Word Left                  Ctrl+Backspace       Ctrl+Opt+Backspace
Delete Word Right                 —                    Option+Delete
Delete to Line End                —                    Ctrl+K
Delete to Line Start              —                    Option+Backspace
Indent                            Tab (at start of line)   Tab (at start of line)
Outdent                           Shift+Tab            Shift+Tab
Yank line up to cursor            Ctrl+U               Ctrl+U
Yank line after cursor            Ctrl+K               Ctrl+K
Insert yanked text                Ctrl+Y               Ctrl+Y
Insert <-                         Alt+-                Option+-
Insert %>%                        Ctrl+Shift+M         Cmd+Shift+M
Show help for function            F1                   F1
Show source code                  F2                   F2
New document                      Ctrl+Shift+N         Cmd+Shift+N
New document (Chrome)             Ctrl+Alt+Shift+N     Cmd+Shift+Alt+N
Open document                     Ctrl+O               Cmd+O
Save document                     Ctrl+S               Cmd+S
Close document                    Ctrl+W               Cmd+W
Close document (Chrome)           Ctrl+Alt+W           Cmd+Option+W
Close all documents               Ctrl+Shift+W         Cmd+Shift+W
Extract function                  Ctrl+Alt+X           Cmd+Option+X
Extract variable                  Ctrl+Alt+V           Cmd+Option+V
Reindent lines                    Ctrl+I               Cmd+I
(Un)Comment lines                 Ctrl+Shift+C         Cmd+Shift+C
Reflow Comment                    Ctrl+Shift+/         Cmd+Shift+/
Reformat Selection                Ctrl+Shift+A         Cmd+Shift+A
Select within braces              Ctrl+Shift+E         Ctrl+Shift+E
Show Diagnostics                  Ctrl+Shift+Alt+P     Cmd+Shift+Alt+P
Transpose Letters                 —                    Ctrl+T
Move Lines Up/Down                Alt+Up/Down          Option+Up/Down
Copy Lines Up/Down                Shift+Alt+Up/Down    Cmd+Option+Up/Down
Add New Cursor Above              Ctrl+Alt+Up          Ctrl+Alt+Up
Add New Cursor Below              Ctrl+Alt+Down        Ctrl+Alt+Down
Move Active Cursor Up             Ctrl+Alt+Shift+Up    Ctrl+Alt+Shift+Up
Move Active Cursor Down           Ctrl+Alt+Shift+Down  Ctrl+Alt+Shift+Down
Find and Replace                  Ctrl+F               Cmd+F

5 DEBUG CODE                      Windows/Linux        Mac
Toggle Breakpoint                 Shift+F9             Shift+F9
Execute Next Line                 F10                  F10
Step Into Function                Shift+F4             Shift+F4
Finish Function/Loop              Shift+F6             Shift+F6
Continue                          Shift+F5             Shift+F5
Stop Debugging                    Shift+F8             Shift+F8

6 VERSION CONTROL                 Windows/Linux        Mac
Show diff                         Ctrl+Alt+D           Ctrl+Option+D
Commit changes                    Ctrl+Alt+M           Ctrl+Option+M
Scroll diff view                  Ctrl+Up/Down         Ctrl+Up/Down
Stage/Unstage (Git)               Spacebar             Spacebar
Stage/Unstage and move to next    Enter                Enter

7 MAKE PACKAGES                   Windows/Linux        Mac
Build and Reload                  Ctrl+Shift+B         Cmd+Shift+B
Load All (devtools)               Ctrl+Shift+L         Cmd+Shift+L
Test Package (Desktop)            Ctrl+Shift+T         Cmd+Shift+T
Test Package (Web)                Ctrl+Alt+F7          Cmd+Alt+F7
Check Package                     Ctrl+Shift+E         Cmd+Shift+E
Document Package                  Ctrl+Shift+D         Cmd+Shift+D

8 DOCUMENTS AND APPS              Windows/Linux        Mac
Preview HTML (Markdown, etc.)     Ctrl+Shift+K         Cmd+Shift+K
Knit Document (knitr)             Ctrl+Shift+K         Cmd+Shift+K
Compile Notebook                  Ctrl+Shift+K         Cmd+Shift+K
Compile PDF (TeX and Sweave)      Ctrl+Shift+K         Cmd+Shift+K
Insert chunk (Sweave and Knitr)   Ctrl+Alt+I           Cmd+Option+I
Insert code section               Ctrl+Shift+R         Cmd+Shift+R
Re-run previous region            Ctrl+Shift+P         Cmd+Shift+P
Run current document              Ctrl+Alt+R           Cmd+Option+R
Run from start to current line    Ctrl+Alt+B           Cmd+Option+B
Run the current code section      Ctrl+Alt+T           Cmd+Option+T
Run previous Sweave/Rmd code      Ctrl+Alt+P           Cmd+Option+P
Run the current chunk             Ctrl+Alt+C           Cmd+Option+C
Run the next chunk                Ctrl+Alt+N           Cmd+Option+N
Sync Editor & PDF Preview         Ctrl+F8              Cmd+F8
Previous plot                     Ctrl+Alt+F11         Cmd+Option+F11
Next plot                         Ctrl+Alt+F12         Cmd+Option+F12
Show Keyboard Shortcuts           Alt+Shift+K          Option+Shift+K
R Markdown Cheat Sheet
learn more at rmarkdown.rstudio.com

.Rmd files — An R Markdown (.Rmd) file is a record of your research. It contains the code that a scientist needs to reproduce your work along with the narration that a reader needs to understand your work.

Reproducible Research — At the click of a button, or the type of a command, you can rerun the code in an R Markdown file to reproduce your work and export the results as a finished report.

Dynamic Documents — You can choose to export the finished report as a html, pdf, MS Word, ODT, RTF, or markdown document; or as a html or pdf based slide show.

Workflow

1. Open a new .Rmd file at File ▶ New File ▶ R Markdown. Use the wizard that opens to pre-populate the file with a template.
2. Write the document by editing the template.
3. Knit the document to create a report. Use the knit button or render() to knit.
4. Preview the output in the IDE window.
5. Publish (optional) to a web server, or to accounts at rpubs.com, shinyapps.io, or RStudio Connect (use the publish button to sync accounts).
6. Examine the build log in the R Markdown console.
7. Use the output file that is saved alongside the .Rmd file.

The editor toolbar also lets you open the file in a window, save, spell check, find and replace (or find in document), show the outline, set the preview location, insert code chunks, go to a code chunk, reload the document, and run code chunks. The file path to the output document is shown after rendering (e.g. by pandoc).

.Rmd structure

• YAML header — optional section of render options written as key:value pairs (YAML). Appears at the start of the file, between lines of ---.
• Text — narration formatted with markdown, mixed with:
• Code chunks — chunks of embedded code. Each chunk begins with ```{r} and ends with ```. R Markdown will run the code and append the results to the doc. It will use the location of the .Rmd file as the working directory.

Interactive Documents

Turn your report into an interactive Shiny document in 4 steps:

1. Add runtime: shiny to the YAML header.
2. Call Shiny input functions to embed input objects.
3. Call Shiny render functions to embed reactive output.
4. Render with rmarkdown::run or click Run Document in the RStudio IDE.

---
output: html_document
runtime: shiny
---

```{r, echo = FALSE}
numericInput("n",
  "How many cars?", 5)

renderTable({
  head(cars, input$n)
})
```

Embed a complete app into your document with shiny::shinyAppDir(). Your report will be rendered as a Shiny app, which means you must choose an html output format, like html_document, and serve it with an active R Session.

Embed code with knitr syntax

Inline code — Insert with `r <code>`. Results appear as text without code, e.g.

Built with `r getRversion()`

Code chunks — One or more lines surrounded with ```{r} and ```. Place chunk options within the curly braces, after r:

```{r echo=TRUE}
getRversion()
```

Global options — Set with knitr::opts_chunk$set(), e.g.

```{r include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

Render at the command line — Use rmarkdown::render() to render/knit at the cmd line. Important args:
• input - file to render
• output_format
• output_options - list of render options (as in YAML)
• output_file
• output_dir
• params - list of params to use
• envir - environment to evaluate code chunks in
• encoding - of input file

Important chunk options

• cache - cache results for future knits (default = FALSE)
• cache.path - directory to save cached results in (default = "cache/")
• child - file(s) to knit and then include (default = NULL)
• collapse - collapse all output into a single block (default = FALSE)
• comment - prefix for each line of results (default = '##')
• dependson - chunk dependencies for caching (default = NULL)
• echo - display code in output document (default = TRUE)
• engine - code language used in chunk (default = 'R')
• error - display error messages in doc (TRUE) or stop render when errors occur (FALSE) (default = FALSE)
• eval - run code in chunk (default = TRUE)
• fig.align - 'left', 'right', or 'center' (default = 'default')
• fig.cap - figure caption as character string (default = NULL)
• fig.height, fig.width - dimensions of plots in inches
• highlight - highlight source code (default = TRUE)
• include - include chunk in doc after running (default = TRUE)
• message - display code messages in document (default = TRUE)
• results - (default = 'markup'); 'asis' - passthrough results; 'hide' - do not display results; 'hold' - put all results below all code
• tidy - tidy code for display (default = FALSE)
• warning - display code warnings in document (default = TRUE)

Options not listed above: R.options, aniopts, autodep, background, cache.comments, cache.lazy, cache.rebuild, cache.vars, dev, dev.args, dpi, engine.opts, engine.path, fig.asp, fig.env, fig.ext, fig.keep, fig.lp, fig.path, fig.pos, fig.process, fig.retina, fig.scap, fig.show, fig.showtext, fig.subcap, interval, out.extra, out.height, out.width, prompt, purl, ref.label, render, size, split, tidy.opts

Parameters

Parameterize your documents to reuse with different inputs (e.g., data sets, values, etc.)

1. Add parameters — Create and set parameters in the header as sub-values of params:

---
params:
  n: 100
  d: !r Sys.Date()
---

2. Call parameters — Call parameter values in code as params$<name>, e.g.

Today's date is `r params$d`

3. Set parameters — Set values with "Knit with parameters" or the params argument of render():

render("doc.Rmd",
  params = list(n = 1, d = as.Date("2015-01-01")))
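To see the inline `r <code>` syntax in action without building a whole document, knitr can knit a string directly; this is a sketch added for illustration (it assumes the knitr package is installed) rather than part of the sheet:

```r
library(knitr)

# Inline code is evaluated and replaced by its result as plain text
txt <- "The answer is `r 40 + 2`."
knit(text = txt, quiet = TRUE)
```

The returned string has the inline chunk replaced by "42", exactly as it would appear in the narration of a knitted report.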
RStudio® is a trademark of RStudio, Inc. • CC BY RStudio • info@rstudio.com • 844-448-1212 • rstudio.com More cheat sheets at http://www.rstudio.com/resources/cheatsheets/ Learn more at rmarkdown.rstudio.com • RStudio IDE 0.99.879 • Updated: 02/16
Pandoc's Markdown

Write with the syntax on the left to create the effect on the right (after render):

Plain text
End a line with two spaces to start a new paragraph.
*italics* and **bold**
`verbatim code`
sub/superscript^2^~2~
~~strikethrough~~
escaped: \* \_ \\
endash: --, emdash: ---
equation: $A = \pi*r^{2}$
equation block: $$E = mc^{2}$$
> block quote
# Header1 {#anchor}
## Header 2 {#css_id}
### Header 3 {.css_class}
#### Header 4
##### Header 5
###### Header 6
<!--Text comment-->
\textbf{Tex ignored in HTML}
<em>HTML ignored in pdfs</em>
<http://www.rstudio.com>
[link](www.rstudio.com)
Jump to [Header 1](#anchor)
image: ![Caption](smallorb.png)

* unordered list
    + sub-item 1
    + sub-item 2
        - sub-sub-item 1
* item 2
    Continued (indent 4 spaces)

1. ordered list
2. item 2
    i) sub-item 1
        A. sub-sub-item 1

(@) A list whose numbering
(@) continues after
an interruption

Term 1
: Definition 1

| Right | Left | Default | Center |
|------:|:-----|---------|:------:|
|    12 | 12   | 12      |   12   |
|   123 | 123  | 123     |  123   |
|     1 | 1    | 1       |   1    |

- slide bullet 1
- slide bullet 2
(>- to have bullets appear on click)

horizontal rule/slide break: ***

A footnote [^1]

Set render options with YAML

When you render, R Markdown
1. runs the R code, embeds results and text into an .md file with knitr
2. then converts the .md file into the finished format with pandoc

Set a document's default output format in the YAML header:

---
output: html_document
---
# Body

output value            creates
html_document           html
pdf_document            pdf (requires Tex)
word_document           Microsoft Word (.docx)
odt_document            OpenDocument Text
rtf_document            Rich Text Format
md_document             Markdown
github_document         Github compatible markdown
ioslides_presentation   ioslides HTML slides
slidy_presentation      slidy HTML slides
beamer_presentation     Beamer pdf slides (requires Tex)

Customize output with sub-options. Indent the format 2 spaces and its options 4 spaces:

---
output:
  html_document:
    code_folding: hide
    toc_float: TRUE
---
# Body

Sub-options (each applies only to a subset of the output formats):

• citation_package - the LaTeX package to process citations: natbib, biblatex, or none
• code_folding - lets readers toggle the display of R code: "none", "hide", or "show"
• colortheme - Beamer color theme to use
• css - CSS file to use to style document
• dev - graphics device to use for figure output (e.g. "png")
• duration - add a countdown timer (in minutes) to footer of slides
• fig_caption - should figures be rendered with captions?
• fig_height, fig_width - default figure height and width (in inches) for document
• highlight - syntax highlighting: "tango", "pygments", "kate", "zenburn", "textmate"
• includes - file of content to place in document (in_header, before_body, after_body)
• incremental - should bullets appear one at a time (on presenter mouse clicks)?
• keep_md - save a copy of the .md file that contains knitr output
• keep_tex - save a copy of the .tex file that contains knitr output
• latex_engine - engine to render latex: "pdflatex", "xelatex", or "lualatex"
• lib_dir - directory of dependency files to use (Bootstrap, MathJax, etc.)
• mathjax - set to local or a URL to use a local/URL version of MathJax to render
• md_extensions - markdown extensions to add to the default definition of R Markdown
• number_sections - add section numbering to headers
• pandoc_args - additional arguments to pass to Pandoc
• preserve_yaml - preserve YAML front matter in final document?
• reference_docx - docx file whose styles should be copied when producing docx output
• self_contained - embed dependencies into the doc
• slide_level - the lowest heading level that defines individual slides
• smaller - use the smaller font size in the presentation?
• smart - convert straight quotes to curly, dashes to em-dashes, … to ellipses, etc.
• template - Pandoc template to use when rendering file
• theme - Bootswatch or Beamer theme to use for page
• toc - add a table of contents at start of document
• toc_depth - the lowest level of headings to add to table of contents
• toc_float - float the table of contents to the left of the main content

Options not listed: extra_dependencies, fig_crop, fig_retina, font_adjustment, font_theme, footer, logo, html_preview, reference_odt, transition, variant, widescreen

html tabsets

Use the .tabset css class to place sub-headers into tabs:

# Tabset {.tabset .tabset-fade .tabset-pills}
## Tab 1
text 1
## Tab 2
text 2
### End tabset

Create a Reusable Template

1. Create a new package with an inst/rmarkdown/templates directory
2. In the directory, place a folder that contains:
   • template.yaml (see below)
   • skeleton.Rmd (contents of the template)
   • any supporting files
3. Install the package
4. Access the template in the wizard at File ▶ New File ▶ R Markdown

template.yaml
---
name: My Template
---

Table suggestions

Several functions format R data into tables:

data <- faithful[1:4, ]

```{r results = 'asis'}
knitr::kable(data, caption = "Table with kable")
```

```{r results = "asis"}
print(xtable::xtable(data, caption = "Table with xtable"),
  type = "html", html.table.attributes = "border=0")
```

```{r results = "asis"}
stargazer::stargazer(data, type = "html",
  title = "Table with stargazer")
```

Learn more in the stargazer, xtable, and knitr packages.

Citations and Bibliographies

Create citations with .bib, .bibtex, .copac, .enl, .json, .medline, .mods, .ris, .wos, and .xml files.

1. Set the bibliography file and CSL 1.0 Style file (optional) in the YAML header:

---
bibliography: refs.bib
csl: style.csl
---

2. Use citation keys in text:

Smith cited [@smith04].
Smith cited without author [-@smith04].
@smith04 cited in line.

3. Render. The bibliography will be added to the end of the document.
[^1]: Here is the footnote. knitr packages.
--- ```
RStudio® is a trademark of RStudio, Inc. • CC BY RStudio • info@rstudio.com • 844-448-1212 • rstudio.com More cheat sheets at http://www.rstudio.com/resources/cheatsheets/ Learn more at rmarkdown.rstudio.com • RStudio IDE 0.99.879 • Updated: 02/16
Probability Cheatsheet v2.0

Compiled by William Chen (http://wzchen.com) and Joe Blitzstein, with contributions from Sebastian Chiu, Yuan Jiang, Yuqi Hou, and Jessy Hwang. Material based on Joe Blitzstein's (@stat110) lectures (http://stat110.net) and Blitzstein/Hwang's Introduction to Probability textbook (http://bit.ly/introprobability). Licensed under CC BY-NC-SA 4.0. Please share comments, suggestions, and errors at http://github.com/wzchen/probability_cheatsheet.

Last Updated September 4, 2015

Thinking Conditionally

Independence

Independent Events A and B are independent if knowing whether A occurred gives no information about whether B occurred. More formally, A and B (which have nonzero probability) are independent if and only if one of the following equivalent statements holds:

P(A ∩ B) = P(A)P(B)
P(A|B) = P(A)
P(B|A) = P(B)

Conditional Independence A and B are conditionally independent given C if P(A ∩ B|C) = P(A|C)P(B|C). Conditional independence does not imply independence, and independence does not imply conditional independence.

Law of Total Probability (LOTP)

Let B1, B2, B3, ... Bn be a partition of the sample space (i.e., they are disjoint and their union is the entire sample space).

P(A) = P(A|B1)P(B1) + P(A|B2)P(B2) + ··· + P(A|Bn)P(Bn)
P(A) = P(A ∩ B1) + P(A ∩ B2) + ··· + P(A ∩ Bn)

For LOTP with extra conditioning, just add in another event C!

P(A|C) = P(A|B1, C)P(B1|C) + ··· + P(A|Bn, C)P(Bn|C)
P(A|C) = P(A ∩ B1|C) + P(A ∩ B2|C) + ··· + P(A ∩ Bn|C)

Special case of LOTP with B and B^c as partition:

P(A) = P(A|B)P(B) + P(A|B^c)P(B^c)
P(A) = P(A ∩ B) + P(A ∩ B^c)

Counting
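The LOTP identities are easy to sanity-check numerically. Below is a minimal Python sketch (mine, not part of the original cheatsheet) verifying P(A) = Σ P(A|Bi)P(Bi) for a fair die, with A = "roll is even" and the partition B1 = {1,2}, B2 = {3,4}, B3 = {5,6}:

```python
from fractions import Fraction

# Sample space: a fair six-sided die; each outcome has probability 1/6.
outcomes = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}                         # event: roll is even
partition = [{1, 2}, {3, 4}, {5, 6}]  # B1, B2, B3 partition the sample space

def P(event):
    return Fraction(len(event & outcomes), len(outcomes))

# Direct computation of P(A)
direct = P(A)

# LOTP, first form: P(A) = sum of P(A|Bi) P(Bi), where P(A|Bi) = P(A ∩ Bi)/P(Bi)
lotp = sum((P(A & B) / P(B)) * P(B) for B in partition)

# LOTP, second form: P(A) = sum of P(A ∩ Bi)
lotp2 = sum(P(A & B) for B in partition)

print(direct, lotp, lotp2)  # all three are 1/2
```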
Multiplication Rule

Let's say we have a compound experiment (an experiment with multiple components). If the 1st component has n1 possible outcomes, the 2nd component has n2 possible outcomes, ..., and the rth component has nr possible outcomes, then overall there are n1 n2 ... nr possibilities for the whole experiment.

(A tree diagram of cake/waffle and flavor choices illustrating this rule is omitted.)

Sampling Table

The sampling table gives the number of possible samples of size k out of a population of size n, under various assumptions about how the sample is collected.

                       Order Matters      Order Doesn't Matter
With Replacement       n^k                (n + k − 1 choose k)
Without Replacement    n!/(n − k)!        (n choose k)

Naive Definition of Probability

If all outcomes are equally likely, the probability of an event A happening is:

Pnaive(A) = (number of outcomes favorable to A) / (number of outcomes)

Unions, Intersections, and Complements

De Morgan's Laws A useful identity that can make calculating probabilities of unions easier by relating them to intersections, and vice versa. Analogous results hold with more than two sets.

(A ∪ B)^c = A^c ∩ B^c
(A ∩ B)^c = A^c ∪ B^c

Joint, Marginal, and Conditional

Joint Probability P(A ∩ B) or P(A, B) – Probability of A and B.

Marginal (Unconditional) Probability P(A) – Probability of A.

Conditional Probability P(A|B) = P(A, B)/P(B) – Probability of A, given that B occurred.

Conditional Probability is Probability P(A|B) is a probability function for any fixed B. Any theorem that holds for probability also holds for conditional probability.

Probability of an Intersection or Union

Intersections via Conditioning

P(A, B) = P(A)P(B|A)
P(A, B, C) = P(A)P(B|A)P(C|A, B)

Unions via Inclusion-Exclusion

P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(A ∩ B) − P(A ∩ C) − P(B ∩ C) + P(A ∩ B ∩ C).

Simpson's Paradox

It is possible to have

P(A | B, C) < P(A | B^c, C) and P(A | B, C^c) < P(A | B^c, C^c)

yet also P(A | B) > P(A | B^c).

(An illustration contrasting Dr. Hibbert and Dr. Nick on heart surgeries and band-aid removals is omitted.)

Bayes' Rule

Bayes' Rule, and with extra conditioning (just add in C!)

P(A|B) = P(B|A)P(A) / P(B)
P(A|B, C) = P(B|A, C)P(A|C) / P(B|C)

We can also write

P(A|B, C) = P(A, B, C)/P(B, C) = P(B, C|A)P(A)/P(B, C)

Odds Form of Bayes' Rule

P(A|B)/P(A^c|B) = [P(B|A)/P(B|A^c)] · [P(A)/P(A^c)]

The posterior odds of A are the likelihood ratio times the prior odds.

Random Variables and their Distributions

PMF, CDF, and Independence

Probability Mass Function (PMF) Gives the probability that a discrete random variable takes on the value x.

pX(x) = P(X = x)

(A plot of an example pmf on x = 0, 1, 2, 3, 4 is omitted.)

The PMF satisfies

pX(x) ≥ 0 and Σx pX(x) = 1
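As a concrete illustration of Bayes' rule and its odds form (the numbers here are mine, not from the cheatsheet): suppose event A has prior P(A) = 0.01 and we observe B, with P(B|A) = 0.95 and P(B|A^c) = 0.05. A sketch computing the posterior both ways:

```python
# Hypothetical numbers for illustration: A = "has condition", B = "tests positive".
p_A = 0.01           # prior P(A)
p_B_given_A = 0.95   # P(B|A)
p_B_given_Ac = 0.05  # P(B|A^c)

# LOTP gives the denominator: P(B) = P(B|A)P(A) + P(B|A^c)P(A^c)
p_B = p_B_given_A * p_A + p_B_given_Ac * (1 - p_A)

# Bayes' rule: P(A|B) = P(B|A)P(A) / P(B)
posterior = p_B_given_A * p_A / p_B

# Odds form: posterior odds = likelihood ratio * prior odds
prior_odds = p_A / (1 - p_A)
likelihood_ratio = p_B_given_A / p_B_given_Ac
posterior_odds = likelihood_ratio * prior_odds

# Converting the posterior odds back to a probability must agree with Bayes' rule.
posterior_from_odds = posterior_odds / (1 + posterior_odds)
print(round(posterior, 4), round(posterior_from_odds, 4))
```

Note how small the posterior stays despite the strong likelihood ratio: the prior odds of 1/99 dominate.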
Cumulative Distribution Function (CDF) Gives the probability that a random variable is less than or equal to x.

FX(x) = P(X ≤ x)

(A plot of an example cdf on x = 0, 1, 2, 3, 4 is omitted.)

The CDF is an increasing, right-continuous function with

FX(x) → 0 as x → −∞ and FX(x) → 1 as x → ∞

Independence Intuitively, two random variables are independent if knowing the value of one gives no information about the other. Discrete r.v.s X and Y are independent if for all values of x and y

P(X = x, Y = y) = P(X = x)P(Y = y)

Expected Value and Indicators

Expected Value and Linearity

Expected Value (a.k.a. mean, expectation, or average) is a weighted average of the possible outcomes of our random variable. Mathematically, if x1, x2, x3, ... are all of the distinct possible values that X can take, the expected value of X is

E(X) = Σi xi P(X = xi)

Linearity For any r.v.s X and Y, and constants a, b, c,

E(aX + bY + c) = aE(X) + bE(Y) + c

(A table of sampled values of X, Y, and X + Y is omitted; it illustrates that (1/n)Σ xi + (1/n)Σ yi = (1/n)Σ (xi + yi), i.e., E(X) + E(Y) = E(X + Y).)

Same distribution implies same mean If X and Y have the same distribution, then E(X) = E(Y) and, more generally,

E(g(X)) = E(g(Y))

Conditional Expected Value is defined like expectation, only conditioned on any event A.

E(X|A) = Σx x P(X = x|A)

Indicator Random Variables

Indicator Random Variable is a random variable that takes on the value 1 or 0. It is always an indicator of some event: if the event occurs, the indicator is 1; otherwise it is 0. They are useful for many problems about counting how many events of some kind occur. Write

IA = 1 if A occurs, 0 if A does not occur.

Note that IA^2 = IA, IA IB = I(A∩B), and I(A∪B) = IA + IB − IA IB.

Distribution IA ∼ Bern(p) where p = P(A).

Fundamental Bridge The expectation of the indicator for event A is the probability of event A: E(IA) = P(A).

Variance and Standard Deviation

Var(X) = E(X − E(X))^2 = E(X^2) − (E(X))^2
SD(X) = √Var(X)

Continuous RVs, LOTUS, UoU

Continuous Random Variables (CRVs)

What's the probability that a CRV is in an interval? Take the difference in CDF values (or use the PDF as described later).

P(a ≤ X ≤ b) = P(X ≤ b) − P(X ≤ a) = FX(b) − FX(a)

For X ∼ N(µ, σ^2), this becomes

P(a ≤ X ≤ b) = Φ((b − µ)/σ) − Φ((a − µ)/σ)

What is the Probability Density Function (PDF)? The PDF f is the derivative of the CDF F.

F′(x) = f(x)

A PDF is nonnegative and integrates to 1. By the fundamental theorem of calculus, to get from PDF back to CDF we can integrate:

F(x) = ∫_{−∞}^{x} f(t) dt

(Plots of a standard Normal PDF and CDF are omitted.)

To find the probability that a CRV takes on a value in an interval, integrate the PDF over that interval.

F(b) − F(a) = ∫_a^b f(x) dx

How do I find the expected value of a CRV? Analogous to the discrete case, where you sum x times the PMF, for CRVs you integrate x times the PDF.

E(X) = ∫_{−∞}^{∞} x f(x) dx

LOTUS

Expected value of a function of an r.v. The expected value of X is defined this way:

E(X) = Σx x P(X = x) (for discrete X)
E(X) = ∫_{−∞}^{∞} x f(x) dx (for continuous X)

The Law of the Unconscious Statistician (LOTUS) states that you can find the expected value of a function of a random variable, g(X), in a similar way, by replacing the x in front of the PMF/PDF by g(x) but still working with the PMF/PDF of X:

E(g(X)) = Σx g(x) P(X = x) (for discrete X)
E(g(X)) = ∫_{−∞}^{∞} g(x) f(x) dx (for continuous X)

What's a function of a random variable? A function of a random variable is also a random variable. For example, if X is the number of bikes you see in an hour, then g(X) = 2X is the number of bike wheels you see in that hour and h(X) = (X choose 2) = X(X−1)/2 is the number of pairs of bikes such that you see both of those bikes in that hour.

What's the point? You don't need to know the PMF/PDF of g(X) to find its expected value. All you need is the PMF/PDF of X.

Universality of Uniform (UoU)

When you plug any CRV into its own CDF, you get a Uniform(0,1) random variable. When you plug a Uniform(0,1) r.v. into an inverse CDF, you get an r.v. with that CDF. For example, let's say that a random variable X has CDF

F(x) = 1 − e^{−x}, for x > 0

By UoU, if we plug X into this function then we get a uniformly distributed random variable.

F(X) = 1 − e^{−X} ∼ Unif(0, 1)

Similarly, if U ∼ Unif(0, 1) then F^{−1}(U) has CDF F. The key point is that for any continuous random variable X, we can transform it into a Uniform random variable and back by using its CDF.

Moments and MGFs

Moments

Moments describe the shape of a distribution. Let X have mean µ and standard deviation σ, and Z = (X − µ)/σ be the standardized version of X. The kth moment of X is µk = E(X^k) and the kth standardized moment of X is mk = E(Z^k). The mean, variance, skewness, and kurtosis are important summaries of the shape of a distribution.

Mean E(X) = µ1
Variance Var(X) = µ2 − µ1^2
Skewness Skew(X) = m3
Kurtosis Kurt(X) = m4 − 3
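LOTUS can be checked directly on a small discrete example (mine, not from the sheet): for a fair die X and g(X) = X^2, LOTUS computes E(g(X)) = Σ g(x) pX(x) without ever finding the PMF of X^2, and agrees with the answer obtained by building that PMF explicitly:

```python
from fractions import Fraction
from collections import Counter

# Fair six-sided die: pX(x) = 1/6 for x = 1, ..., 6.
support = range(1, 7)
p = Fraction(1, 6)

def g(x):
    return x * x  # a function of the r.v.; g(X) is itself an r.v.

# LOTUS: E(g(X)) = sum over x of g(x) * pX(x); no PMF of g(X) needed.
E_g = sum(g(x) * p for x in support)

# For comparison, build the PMF of Y = g(X) explicitly and take E(Y) directly.
pmf_Y = Counter()
for x in support:
    pmf_Y[g(x)] += p
E_Y = sum(y * py for y, py in pmf_Y.items())

print(E_g, E_Y)  # both 91/6
```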
Moment Generating Functions

MGF For any random variable X, the function

MX(t) = E(e^{tX})

is the moment generating function (MGF) of X, if it exists for all t in some open interval containing 0. The variable t could just as well have been called u or v. It's a bookkeeping device that lets us work with the function MX rather than the sequence of moments.

Why is it called the Moment Generating Function? Because the kth derivative of the moment generating function, evaluated at 0, is the kth moment of X.

µk = E(X^k) = MX^(k)(0)

This is true by Taylor expansion of e^{tX} since

MX(t) = E(e^{tX}) = Σ_{k=0}^{∞} E(X^k) t^k / k! = Σ_{k=0}^{∞} µk t^k / k!

MGF of linear functions If we have Y = aX + b, then

MY(t) = E(e^{t(aX+b)}) = e^{bt} E(e^{(at)X}) = e^{bt} MX(at)

Uniqueness If it exists, the MGF uniquely determines the distribution. This means that for any two random variables X and Y, they are distributed the same (their PMFs/PDFs are equal) if and only if their MGFs are equal.

Summing Independent RVs by Multiplying MGFs If X and Y are independent, then

M(X+Y)(t) = E(e^{t(X+Y)}) = E(e^{tX}) E(e^{tY}) = MX(t) · MY(t)

The MGF of the sum of two random variables is the product of the MGFs of those two random variables.

Joint PDFs and CDFs

Joint Distributions

The joint CDF of X and Y is

F(x, y) = P(X ≤ x, Y ≤ y)

In the discrete case, X and Y have a joint PMF

pX,Y(x, y) = P(X = x, Y = y).

In the continuous case, they have a joint PDF

fX,Y(x, y) = ∂^2/∂x∂y FX,Y(x, y).

The joint PMF/PDF must be nonnegative and sum/integrate to 1.

Conditional Distributions

Conditioning and Bayes' rule for discrete r.v.s

P(Y = y|X = x) = P(X = x, Y = y)/P(X = x) = P(X = x|Y = y)P(Y = y)/P(X = x)

Conditioning and Bayes' rule for continuous r.v.s

fY|X(y|x) = fX,Y(x, y)/fX(x) = fX|Y(x|y)fY(y)/fX(x)

Hybrid Bayes' rule

fX(x|A) = P(A|X = x)fX(x)/P(A)

Marginal Distributions

To find the distribution of one (or more) random variables from a joint PMF/PDF, sum/integrate over the unwanted random variables.

Marginal PMF from joint PMF

P(X = x) = Σy P(X = x, Y = y)

Marginal PDF from joint PDF

fX(x) = ∫_{−∞}^{∞} fX,Y(x, y) dy

Independence of Random Variables

Random variables X and Y are independent if and only if any of the following conditions holds:

• Joint CDF is the product of the marginal CDFs
• Joint PMF/PDF is the product of the marginal PMFs/PDFs
• Conditional distribution of Y given X is the marginal distribution of Y

Write X ⊥⊥ Y to denote that X and Y are independent.

Multivariate LOTUS

LOTUS in more than one dimension is analogous to the 1D LOTUS. For discrete random variables:

E(g(X, Y)) = Σx Σy g(x, y) P(X = x, Y = y)

For continuous random variables:

E(g(X, Y)) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x, y) fX,Y(x, y) dx dy

Covariance and Transformations

Covariance and Correlation

Covariance is the analog of variance for two random variables.

Cov(X, Y) = E((X − E(X))(Y − E(Y))) = E(XY) − E(X)E(Y)

Note that

Cov(X, X) = E(X^2) − (E(X))^2 = Var(X)

Correlation is a standardized version of covariance that is always between −1 and 1.

Corr(X, Y) = Cov(X, Y) / √(Var(X)Var(Y))

Covariance and Independence If two random variables are independent, then they are uncorrelated. The converse is not necessarily true (e.g., consider X ∼ N(0, 1) and Y = X^2).

X ⊥⊥ Y −→ Cov(X, Y) = 0 −→ E(XY) = E(X)E(Y)

Covariance and Variance The variance of a sum can be found by

Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y)

Var(X1 + X2 + ··· + Xn) = Σ_{i=1}^{n} Var(Xi) + 2 Σ_{i<j} Cov(Xi, Xj)

If X and Y are independent then they have covariance 0, so

X ⊥⊥ Y =⇒ Var(X + Y) = Var(X) + Var(Y)

If X1, X2, ..., Xn are identically distributed and have the same covariance relationships (often by symmetry), then

Var(X1 + X2 + ··· + Xn) = nVar(X1) + 2 (n choose 2) Cov(X1, X2)

Covariance Properties For random variables W, X, Y, Z and constants a, b:

Cov(X, Y) = Cov(Y, X)
Cov(X + a, Y + b) = Cov(X, Y)
Cov(aX, bY) = ab Cov(X, Y)
Cov(W + X, Y + Z) = Cov(W, Y) + Cov(W, Z) + Cov(X, Y) + Cov(X, Z)

Correlation is location-invariant and scale-invariant For any constants a, b, c, d with a and c nonzero,

Corr(aX + b, cY + d) = Corr(X, Y)

Transformations

One Variable Transformations Let's say that we have a random variable X with PDF fX(x), but we are also interested in some function of X. We call this function Y = g(X). Also let y = g(x). If g is differentiable and strictly increasing (or strictly decreasing), then the PDF of Y is

fY(y) = fX(x) |dx/dy| = fX(g^{−1}(y)) |d/dy g^{−1}(y)|

The derivative of the inverse transformation is called the Jacobian.

Two Variable Transformations Similarly, let's say we know the joint PDF of U and V but are also interested in the random vector (X, Y) defined by (X, Y) = g(U, V). Let

∂(u, v)/∂(x, y) = [ ∂u/∂x  ∂u/∂y ; ∂v/∂x  ∂v/∂y ]

be the Jacobian matrix. If the entries in this matrix exist and are continuous, and the determinant of the matrix is never 0, then

fX,Y(x, y) = fU,V(u, v) |∂(u, v)/∂(x, y)|

The inner bars tell us to take the matrix's determinant, and the outer bars tell us to take the absolute value. In a 2 × 2 matrix,

| [ a b ; c d ] | = |ad − bc|

Convolutions

Convolution Integral If you want to find the PDF of the sum of two independent CRVs X and Y, you can do the following integral:

f(X+Y)(t) = ∫_{−∞}^{∞} fX(x) fY(t − x) dx

Example Let X, Y ∼ N(0, 1) be i.i.d. Then for each fixed t,

f(X+Y)(t) = ∫_{−∞}^{∞} (1/√(2π)) e^{−x²/2} (1/√(2π)) e^{−(t−x)²/2} dx

By completing the square and using the fact that a Normal PDF integrates to 1, this works out to f(X+Y)(t) being the N(0, 2) PDF.
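The convolution example (sum of two i.i.d. standard Normals) can be verified numerically. The sketch below (mine, not from the cheatsheet) approximates the convolution integral with a midpoint Riemann sum and compares it to the claimed N(0, 2) density at a few points:

```python
import math

def phi(x, var=1.0):
    # Normal(0, var) PDF
    return math.exp(-x * x / (2 * var)) / math.sqrt(2 * math.pi * var)

def conv_density(t, lo=-10.0, hi=10.0, n=4000):
    # Midpoint-rule approximation of f_{X+Y}(t) = integral of f_X(x) f_Y(t - x) dx
    # for X, Y ~ N(0, 1) i.i.d.; the tails beyond [lo, hi] are negligible.
    h = (hi - lo) / n
    total = 0.0
    for k in range(n):
        x = lo + (k + 0.5) * h
        total += phi(x) * phi(t - x)
    return total * h

for t in (0.0, 1.0, 2.5):
    approx = conv_density(t)
    exact = phi(t, var=2.0)  # the claimed N(0, 2) density
    print(t, round(approx, 6), round(exact, 6))
```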
Poisson Process

Definition We have a Poisson process of rate λ arrivals per unit time if the following conditions hold:

1. The number of arrivals in a time interval of length t is Pois(λt).
2. Numbers of arrivals in disjoint time intervals are independent.

For example, the numbers of arrivals in the time intervals [0, 5], (5, 12), and [13, 23) are independent with Pois(5λ), Pois(7λ), Pois(10λ) distributions, respectively.

(A timeline diagram marking arrival times T1, T2, T3, T4, T5 is omitted.)

Count-Time Duality Consider a Poisson process of emails arriving in an inbox at rate λ emails per hour. Let Tn be the time of arrival of the nth email (relative to some starting time 0) and Nt be the number of emails that arrive in [0, t]. Let's find the distribution of T1. The event T1 > t, the event that you have to wait more than t hours to get the first email, is the same as the event Nt = 0, which is the event that there are no emails in the first t hours. So

P(T1 > t) = P(Nt = 0) = e^{−λt} −→ P(T1 ≤ t) = 1 − e^{−λt}

Thus we have T1 ∼ Expo(λ). By the memoryless property and similar reasoning, the interarrival times between emails are i.i.d. Expo(λ), i.e., the differences Tn − Tn−1 are i.i.d. Expo(λ).

Order Statistics

Definition Let's say you have n i.i.d. r.v.s X1, X2, ..., Xn. If you arrange them from smallest to largest, the ith element in that list is the ith order statistic, denoted X(i). So X(1) is the smallest in the list and X(n) is the largest in the list.

Note that the order statistics are dependent, e.g., learning X(4) = 42 gives us the information that X(1), X(2), X(3) are ≤ 42 and X(5), X(6), ..., X(n) are ≥ 42.

Distribution Taking n i.i.d. random variables X1, X2, ..., Xn with CDF F(x) and PDF f(x), the CDF and PDF of X(i) are:

FX(i)(x) = P(X(i) ≤ x) = Σ_{k=i}^{n} (n choose k) F(x)^k (1 − F(x))^{n−k}

fX(i)(x) = n (n−1 choose i−1) F(x)^{i−1} (1 − F(x))^{n−i} f(x)

Uniform Order Statistics The jth order statistic of i.i.d. U1, ..., Un ∼ Unif(0, 1) is U(j) ∼ Beta(j, n − j + 1).

Conditional Expectation

Conditioning on an Event We can find E(Y|A), the expected value of Y given that event A occurred. A very important case is when A is the event X = x. Note that E(Y|A) is a number. For example:

• The expected value of a fair die roll, given that it is prime, is (1/3)·2 + (1/3)·3 + (1/3)·5 = 10/3.

• Let Y be the number of successes in 10 independent Bernoulli trials with probability p of success. Let A be the event that the first 3 trials are all successes. Then E(Y|A) = 3 + 7p, since the number of successes among the last 7 trials is Bin(7, p).

• Let T ∼ Expo(1/10) be how long you have to wait until the shuttle comes. Given that you have already waited t minutes, the expected additional waiting time is 10 more minutes, by the memoryless property. That is, E(T|T > t) = t + 10.

           Discrete Y                Continuous Y
E(Y)       Σy y P(Y = y)             ∫_{−∞}^{∞} y fY(y) dy
E(Y|A)     Σy y P(Y = y|A)           ∫_{−∞}^{∞} y f(y|A) dy

Conditioning on a Random Variable We can also find E(Y|X), the expected value of Y given the random variable X. This is a function of the random variable X. It is not a number except in certain special cases such as if X ⊥⊥ Y. To find E(Y|X), find E(Y|X = x) and then plug in X for x. For example:

• If E(Y|X = x) = x^3 + 5x, then E(Y|X) = X^3 + 5X.

• Let Y be the number of successes in 10 independent Bernoulli trials with probability p of success and X be the number of successes among the first 3 trials. Then E(Y|X) = X + 7p.

• Let X ∼ N(0, 1) and Y = X^2. Then E(Y|X = x) = x^2 since if we know X = x then we know Y = x^2. And E(X|Y = y) = 0 since if we know Y = y then we know X = ±√y, with equal probabilities (by symmetry). So E(Y|X) = X^2, E(X|Y) = 0.

Properties of Conditional Expectation

1. E(Y|X) = E(Y) if X ⊥⊥ Y
2. E(h(X)W|X) = h(X)E(W|X) (taking out what's known)
   In particular, E(h(X)|X) = h(X).
3. E(E(Y|X)) = E(Y) (Adam's Law, a.k.a. Law of Total Expectation)

Adam's Law (a.k.a. Law of Total Expectation) can also be written in a way that looks analogous to LOTP. For any events A1, A2, ..., An that partition the sample space,

E(Y) = E(Y|A1)P(A1) + ··· + E(Y|An)P(An)

For the special case where the partition is A, A^c, this says

E(Y) = E(Y|A)P(A) + E(Y|A^c)P(A^c)

Eve's Law (a.k.a. Law of Total Variance)

Var(Y) = E(Var(Y|X)) + Var(E(Y|X))

MVN, LLN, CLT

Law of Large Numbers (LLN)

Let X1, X2, X3, ... be i.i.d. with mean µ. The sample mean is

X̄n = (X1 + X2 + X3 + ··· + Xn)/n

The Law of Large Numbers states that as n → ∞, X̄n → µ with probability 1. For example, in flips of a coin with probability p of Heads, let Xj be the indicator of the jth flip being Heads. Then LLN says the proportion of Heads converges to p (with probability 1).

Central Limit Theorem (CLT)

Approximation using CLT

We use ∼̇ to denote "is approximately distributed". We can use the Central Limit Theorem to approximate the distribution of a random variable Y = X1 + X2 + ··· + Xn that is a sum of n i.i.d. random variables Xi. Let E(Y) = µY and Var(Y) = σY^2. The CLT says

Y ∼̇ N(µY, σY^2)

If the Xi are i.i.d. with mean µX and variance σX^2, then µY = nµX and σY^2 = nσX^2. For the sample mean X̄n, the CLT says

X̄n = (1/n)(X1 + X2 + ··· + Xn) ∼̇ N(µX, σX^2/n)

Asymptotic Distributions using CLT

We use →D to denote "converges in distribution to" as n → ∞. The CLT says that if we standardize the sum X1 + ··· + Xn then the distribution of the sum converges to N(0, 1) as n → ∞:

(1/(σ√n)) (X1 + ··· + Xn − nµX) →D N(0, 1)

In other words, the CDF of the left-hand side goes to the standard Normal CDF, Φ. In terms of the sample mean, the CLT says

√n (X̄n − µX)/σX →D N(0, 1)

Markov Chains

Definition

(A diagram of a five-state chain with its transition probabilities is omitted.)

A Markov chain is a random walk in a state space, which we will assume is finite, say {1, 2, ..., M}. We let Xt denote which element of the state space the walk is visiting at time t. The Markov chain is the sequence of random variables tracking where the walk is at all points in time, X0, X1, X2, .... By definition, a Markov chain must satisfy the Markov property, which says that if you want to predict where the chain will be at a future time, if we know the present state then the entire past history is irrelevant. Given the present, the past and future are conditionally independent. In symbols,

P(Xn+1 = j|X0 = i0, X1 = i1, ..., Xn = i) = P(Xn+1 = j|Xn = i)

State Properties

A state is either recurrent or transient.

• If you start at a recurrent state, then you will always return back to that state at some point in the future. ♪ You can check-out any time you like, but you can never leave. ♪

• Otherwise you are at a transient state. There is some positive probability that once you leave you will never return. ♪ You don't have to go home, but you can't stay here. ♪

A state is either periodic or aperiodic.

• If you start at a periodic state of period k, then the GCD of the possible numbers of steps it would take to return back is k > 1.

• Otherwise you are at an aperiodic state. The GCD of the possible numbers of steps it would take to return back is 1.
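The CLT can be illustrated by simulation (a sketch of mine, not from the cheatsheet): standardize the sum of n i.i.d. Unif(0, 1) draws, which has µX = 1/2 and σX^2 = 1/12, and check that the result has roughly mean 0 and variance 1, with about 95% of values inside ±1.96:

```python
import random, math

random.seed(42)

n = 50                     # terms per sum
reps = 20000               # number of standardized sums to simulate
mu, var = 0.5, 1.0 / 12    # mean and variance of Unif(0, 1)

def standardized_sum():
    # (X1 + ... + Xn - n*mu) / (sigma * sqrt(n)), which the CLT says is ~ N(0, 1)
    s = sum(random.random() for _ in range(n))
    return (s - n * mu) / math.sqrt(n * var)

zs = [standardized_sum() for _ in range(reps)]

mean = sum(zs) / reps
second_moment = sum(z * z for z in zs) / reps
within = sum(abs(z) < 1.96 for z in zs) / reps  # should be near 0.95

print(round(mean, 3), round(second_moment, 3), round(within, 3))
```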
Transition Matrix

Let the state space be {1, 2, ..., M}. The transition matrix Q is the M × M matrix where element qij is the probability that the chain goes from state i to state j in one step:

qij = P(Xn+1 = j|Xn = i)

To find the probability that the chain goes from state i to state j in exactly m steps, take the (i, j) element of Q^m.

qij^(m) = P(Xn+m = j|Xn = i)

If X0 is distributed according to the row vector PMF p⃗, i.e., pj = P(X0 = j), then the PMF of Xn is p⃗Q^n.

Chain Properties

A chain is irreducible if you can get from anywhere to anywhere. If a chain (on a finite state space) is irreducible, then all of its states are recurrent. A chain is periodic if any of its states are periodic, and is aperiodic if none of its states are periodic. In an irreducible chain, all states have the same period.

A chain is reversible with respect to s⃗ if si qij = sj qji for all i, j. Examples of reversible chains include any chain with qij = qji, with s⃗ = (1/M, 1/M, ..., 1/M), and random walk on an undirected network.

Stationary Distribution

Let us say that the vector s⃗ = (s1, s2, ..., sM) is a PMF (written as a row vector). We will call s⃗ the stationary distribution for the chain if s⃗Q = s⃗. As a consequence, if Xt has the stationary distribution, then all future Xt+1, Xt+2, ... also have the stationary distribution.

For irreducible, aperiodic chains, the stationary distribution exists, is unique, and si is the long-run probability of a chain being at state i. The expected number of steps to return to i starting from i is 1/si.

To find the stationary distribution, you can solve the matrix equation (Q′ − I)s⃗′ = 0. The stationary distribution is uniform if the columns of Q sum to 1.

Reversibility Condition Implies Stationarity If you have a PMF s⃗ and a Markov chain with transition matrix Q, then si qij = sj qji for all states i, j implies that s⃗ is stationary.

Random Walk on an Undirected Network

(A diagram of a five-node undirected network with node degrees 3, 3, 2, 4, 2 is omitted.)

If you have a collection of nodes, pairs of which can be connected by undirected edges, and a Markov chain is run by going from the current node to a uniformly random node that is connected to it by an edge, then this is a random walk on an undirected network. The stationary distribution of this chain is proportional to the degree sequence (this is the sequence of degrees, where the degree of a node is how many edges are attached to it). For example, the stationary distribution of random walk on the network shown above is proportional to (3, 3, 2, 4, 2), so it's (3/14, 3/14, 2/14, 4/14, 2/14).

Continuous Distributions

Uniform Distribution

Let us say that U is distributed Unif(a, b). We know the following:

Properties of the Uniform For a Uniform distribution, the probability of a draw from any interval within the support is proportional to the length of the interval. See Universality of Uniform and Order Statistics for other properties.

Example William throws darts really badly, so his darts are uniform over the whole room because they're equally likely to appear anywhere. William's darts have a Uniform distribution on the surface of the room. The Uniform is the only distribution where the probability of hitting in any specific region is proportional to the length/area/volume of that region, and where the density of occurrence in any one specific spot is constant throughout the whole support.

Normal Distribution

Let us say that X is distributed N(µ, σ^2). We know the following:

Central Limit Theorem The Normal distribution is ubiquitous because of the Central Limit Theorem, which states that the sample mean of i.i.d. r.v.s will approach a Normal distribution as the sample size grows, regardless of the initial distribution.

Location-Scale Transformation Every time we shift a Normal r.v. (by adding a constant) or rescale a Normal (by multiplying by a constant), we change it to another Normal r.v. For any Normal X ∼ N(µ, σ^2), we can transform it to the standard N(0, 1) by the following transformation:

Z = (X − µ)/σ ∼ N(0, 1)

Standard Normal The Standard Normal, Z ∼ N(0, 1), has mean 0 and variance 1. Its CDF is denoted by Φ.

Exponential Distribution

Let us say that X is distributed Expo(λ). We know the following:

Story You're sitting on an open meadow right before the break of dawn, wishing that airplanes in the night sky were shooting stars, because you could really use a wish right now. You know that shooting stars come on average every 15 minutes, but a shooting star is not "due" to come just because you've waited so long. Your waiting time is memoryless; the additional time until the next shooting star comes does not depend on how long you've waited already.

Example The waiting time until the next shooting star is distributed Expo(4) hours. Here λ = 4 is the rate parameter, since shooting stars arrive at a rate of 1 per 1/4 hour on average. The expected time until the next shooting star is 1/λ = 1/4 hour.

Expos as a rescaled Expo(1)

Y ∼ Expo(λ) → X = λY ∼ Expo(1)

Memorylessness The Exponential Distribution is the only continuous memoryless distribution. The memoryless property says that for X ∼ Expo(λ) and any positive numbers s and t,

P(X > s + t|X > s) = P(X > t)

Equivalently,

X − a|(X > a) ∼ Expo(λ)

For example, a product with an Expo(λ) lifetime is always "as good as new" (it doesn't experience wear and tear). Given that the product has survived a years, the additional time that it will last is still Expo(λ).

Min of Expos If we have independent Xi ∼ Expo(λi), then min(X1, ..., Xk) ∼ Expo(λ1 + λ2 + ··· + λk).

Max of Expos If we have i.i.d. Xi ∼ Expo(λ), then max(X1, ..., Xk) has the same distribution as Y1 + Y2 + ··· + Yk, where Yj ∼ Expo(jλ) and the Yj are independent.

Gamma Distribution

(PDF plots of Gamma(3, 1), Gamma(3, 0.5), Gamma(10, 1), and Gamma(5, 0.5) are omitted.)

Let us say that X is distributed Gamma(a, λ). We know the following:

Story You sit waiting for shooting stars, where the waiting time for a star is distributed Expo(λ). You want to see n shooting stars before you go home. The total waiting time for the nth shooting star is Gamma(n, λ).

Example You are at a bank, and there are 3 people ahead of you. The serving time for each person is Exponential with mean 2 minutes. Only one person at a time can be served. The distribution of your waiting time until it's your turn to be served is Gamma(3, 1/2).

Beta Distribution

(PDF plots of Beta(0.5, 0.5), Beta(2, 1), Beta(2, 8), and Beta(5, 5) are omitted.)

Conjugate Prior of the Binomial In the Bayesian approach to statistics, parameters are viewed as random variables, to reflect our uncertainty. The prior for a parameter is its distribution before observing data. The posterior is the distribution for the parameter after observing data. Beta is the conjugate prior of the Binomial because if you have a Beta-distributed prior on p in a Binomial, then the posterior distribution on p given the Binomial data is also Beta-distributed. Consider the following two-level model:

X|p ∼ Bin(n, p)
p ∼ Beta(a, b)

Then after observing X = x, we get the posterior distribution

p|(X = x) ∼ Beta(a + x, b + n − x)

Order statistics of the Uniform See Order Statistics.

Beta-Gamma relationship If X ∼ Gamma(a, λ), Y ∼ Gamma(b, λ), with X ⊥⊥ Y, then

• X/(X + Y) ∼ Beta(a, b)
• X + Y ⊥⊥ X/(X + Y)

This is known as the bank–post office result.
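The Beta-Binomial conjugacy claim can be checked by brute force (a sketch of mine, with illustrative hyperparameters): the posterior mean of Beta(a + x, b + n − x) is (a + x)/(a + b + n), and a fine grid over p, weighting the prior density by the Binomial likelihood, should reproduce it:

```python
from fractions import Fraction

# Hypothetical Beta-Binomial setup: prior p ~ Beta(a, b), data X ~ Bin(n, p).
a, b = 2, 5      # prior hyperparameters (illustrative)
n, x = 10, 7     # observed 7 successes in 10 trials

# Conjugacy claim: posterior is Beta(a + x, b + n - x), whose mean is
claimed_mean = Fraction(a + x, a + b + n)

# Brute-force check on a fine grid: posterior density at p is proportional to
# prior(p) * likelihood(p) = p^(a-1) (1-p)^(b-1) * p^x (1-p)^(n-x).
N = 20000
grid = [(k + 0.5) / N for k in range(N)]
weights = [p ** (a - 1 + x) * (1 - p) ** (b - 1 + n - x) for p in grid]
total = sum(weights)
grid_mean = sum(p * w for p, w in zip(grid, weights)) / total

print(float(claimed_mean), round(grid_mean, 6))
```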
χ² (Chi-Square) Distribution

Let us say that X is distributed χ²n. We know the following:

Story A Chi-Square(n) is the sum of the squares of n independent standard Normal r.v.s.

Properties and Representations

X is distributed as Z1² + Z2² + ··· + Zn² for i.i.d. Zi ∼ N(0, 1)
X ∼ Gamma(n/2, 1/2)

Discrete Distributions

Distributions for four sampling schemes

            Replace        No Replace

Binomial Distribution (continued)

• Conditional X|(X + Y = r) ∼ HGeom(n, m, r)
• Binomial-Poisson Relationship Bin(n, p) is approximately Pois(λ) if p is small.
• Binomial-Normal Relationship Bin(n, p) is approximately N(np, np(1 − p)) if n is large and p is not near 0 or 1.

Geometric Distribution

Let us say that X is distributed Geom(p). We know the following:

Story X is the number of "failures" that we will achieve before we achieve our first success. Our successes have probability p.

Example If each pokeball we throw has probability 1/10 to catch Mew, the number of failed pokeballs will be distributed Geom(1/10).

First Success Distribution

Equivalent to the Geometric distribution, except that it includes the first success in the count. This is 1 more than the number of failures. If X ∼ FS(p) then E(X) = 1/p.

Negative Binomial Distribution

Poisson Properties Let X ∼ Pois(λ1) and Y ∼ Pois(λ2), with X ⊥⊥ Y.

1. Sum X + Y ∼ Pois(λ1 + λ2)
2. Conditional X|(X + Y = n) ∼ Bin(n, λ1/(λ1 + λ2))
3. Chicken-egg If there are Z ∼ Pois(λ) items and we randomly and independently "accept" each item with probability p, then the number of accepted items Z1 ∼ Pois(λp), and the number of rejected items Z2 ∼ Pois(λ(1 − p)), and Z1 ⊥⊥ Z2.

Multivariate Distributions

Multinomial Distribution

Let us say that the vector X⃗ = (X1, X2, X3, ..., Xk) ∼ Multk(n, p⃗) where p⃗ = (p1, p2, ..., pk).

Story We have n items, which can fall into any one of the k buckets independently with the probabilities p⃗ = (p1, p2, ..., pk).

Example Let us assume that every year, 100 students in the Harry Potter Universe are randomly and independently sorted into one of four houses with equal probability. The number of people in each of the
Let us say that X is distributed NBin(r, p). We know the following: houses is distributed Mult4 (100, p
~), where p
~ = (0.25, 0.25, 0.25, 0.25).
Fixed # trials (n) Binomial HGeom
(Bern if n = 1) Story X is the number of “failures” that we will have before we Note that X1 + X2 + · · · + X4 = 100, and they are dependent.
Draw until r success NBin NHGeom achieve our rth success. Our successes have probability p. Joint PMF For n = n1 + n2 + · · · + nk ,
(Geom if r = 1) Example Thundershock has 60% accuracy and can faint a wild n! n n n
~ =~
P (X n) = p 1 p 2 . . . pk k
Raticate in 3 hits. The number of misses before Pikachu faints n1 !n2 ! . . . nk ! 1 2
Raticate with Thundershock is distributed NBin(3, 0.6).
Bernoulli Distribution Marginal PMF, Lumping, and Conditionals Marginally,
The Bernoulli distribution is the simplest case of the Binomial Hypergeometric Distribution Xi ∼ Bin(n, pi ) since we can define “success” to mean category i. If
distribution, where we only have one trial (n = 1). Let us say that X is you lump together multiple categories in a Multinomial, then it is still
Let us say that X is distributed HGeom(w, b, n). We know the Multinomial. For example, Xi + Xj ∼ Bin(n, pi + pj ) for i 6= j since
distributed Bern(p). We know the following: following: we can define “success” to mean being in category i or j. Similarly, if
Story A trial is performed with probability p of “success”, and X is Story In a population of w desired objects and b undesired objects, k = 6 and we lump categories 1-2 and lump categories 3-5, then
the indicator of success: 1 means success, 0 means failure. X is the number of “successes” we will have in a draw of n objects, (X1 + X2 , X3 + X4 + X5 , X6 ) ∼ Mult3 (n, (p1 + p2 , p3 + p4 + p5 , p6 ))
Example Let X be the indicator of Heads for a fair coin toss. Then without replacement. The draw of n objects is assumed to be a
Conditioning on some Xj also still gives a Multinomial:
X ∼ Bern( 21 ). Also, 1 − X ∼ Bern( 12 ) is the indicator of Tails. simple random sample (all sets of n objects are equally likely).
p1 pk−1
  
Examples Here are some HGeom examples. X1 , . . . , Xk−1 |Xk = nk ∼ Multk−1 n − nk , ,...,
Binomial Distribution 1 − pk 1 − pk
• Let’s say that we have only b Weedles (failure) and w Pikachus
Bin(10,1/2) (success) in Viridian Forest. We encounter n Pokemon in the Variances and Covariances We have Xi ∼ Bin(n, pi ) marginally, so
forest, and X is the number of Pikachus in our encounters. Var(Xi ) = npi (1 − pi ). Also, Cov(Xi , Xj ) = −npi pj for i 6= j.
0.30
0.25


• The number of Aces in a 5 card hand. Multivariate Uniform Distribution
0.20

● ●

• You have w white balls and b black balls, and you draw n balls. See the univariate Uniform for stories and examples. For the 2D
0.15
pmf

You will draw X white balls. Uniform on some region, probability is proportional to area. Every
● ●
0.10

• You have w white balls and b black balls, and you draw n balls point in the support has equal density, of value area of1 region . For the
0.05

● ●
without replacement. The number of white balls in your sample 3D Uniform, probability is proportional to volume.
is HGeom(w, b, n); the number of black balls is HGeom(b, w, n).
0.00

● ●
● ●

0 2 4 6 8 10
• Capture-recapture A forest has N elk, you capture n of them, Multivariate Normal (MVN) Distribution
x
tag them, and release them. Then you recapture a new sample ~ = (X1 , X2 , . . . , Xk ) is Multivariate Normal if every linear
A vector X
Let us say that X is distributed Bin(n, p). We know the following: of size m. How many tagged elk are now in the new sample? combination is Normally distributed, i.e., t1 X1 + t2 X2 + · · · + tk Xk is
Story X is the number of “successes” that we will achieve in n HGeom(n, N − n, m) Normal for any constants t1 , t2 , . . . , tk . The parameters of the
independent trials, where each trial is either a success or a failure, each Multivariate Normal are the mean vector µ ~ = (µ1 , µ2 , . . . , µk ) and
with the same probability p of success. We can also write X as a sum Poisson Distribution the covariance matrix where the (i, j) entry is Cov(Xi , Xj ).
of multiple independent Bern(p) random variables. Let X ∼ Bin(n, p)
and Xj ∼ Bern(p), where all of the Bernoullis are independent. Then Let us say that X is distributed Pois(λ). We know the following: Properties The Multivariate Normal has the following properties.
Story There are rare events (low probability events) that occur many • Any subvector is also MVN.
X = X1 + X2 + X3 + · · · + Xn different ways (high possibilities of occurences) at an average rate of λ
occurrences per unit space or time. The number of events that occur • If any two elements within an MVN are uncorrelated, then they
Example If Jeremy Lin makes 10 free throws and each one in that unit of space or time is X. are independent.
independently has a 43 chance of getting in, then the number of free • The joint PDF of a Bivariate Normal (X, Y ) with N (0, 1)
throws he makes is distributed Bin(10, 43 ). Example A certain busy intersection has an average of 2 accidents
marginal distributions and correlation ρ ∈ (−1, 1) is
per month. Since an accident is a low probability event that can
Properties Let X ∼ Bin(n, p), Y ∼ Bin(m, p) with X ⊥
⊥ Y. 1 1
 
happen many different ways, it is reasonable to model the number of 2 2
fX,Y (x, y) = exp − 2 (x + y − 2ρxy) ,
accidents in a month at that intersection as Pois(2). Then the number 2πτ 2τ
• Redefine success n − X ∼ Bin(n, 1 − p)
of accidents that happen in two months at that intersection is p
• Sum X + Y ∼ Bin(n + m, p) distributed Pois(4). with τ = 1 − ρ2 .
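The chicken-egg property of the Poisson is easy to sanity-check by simulation. A minimal Python sketch (the rate λ = 5 and acceptance probability p = 0.3 are illustrative choices, and the Poisson sampler uses Knuth's multiplication method):

```python
import math
import random

random.seed(42)

def poisson(lam):
    """Sample from Pois(lam) via Knuth's multiplication method."""
    L = math.exp(-lam)
    k, prod = 0, 1.0
    while True:
        prod *= random.random()
        if prod <= L:
            return k
        k += 1

lam, p = 5.0, 0.3
accepted, rejected = [], []
for _ in range(20000):
    z = poisson(lam)                                  # Z ~ Pois(lam) items
    a = sum(random.random() < p for _ in range(z))    # accept each with prob p
    accepted.append(a)
    rejected.append(z - a)

mean_a = sum(accepted) / len(accepted)    # should be close to lam * p = 1.5
mean_r = sum(rejected) / len(rejected)    # should be close to lam * (1-p) = 3.5
cov = (sum(x * y for x, y in zip(accepted, rejected)) / len(accepted)
       - mean_a * mean_r)                 # near 0, since Z1 and Z2 are independent
```

The empirical means match Pois(λp) and Pois(λ(1 − p)), and the sample covariance hovers near 0, reflecting the surprising independence of Z1 and Z2.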
Distribution Properties

Important CDFs

Standard Normal: Φ
Exponential(λ): F(x) = 1 − e^{−λx}, for x ∈ (0, ∞)
Uniform(0, 1): F(x) = x, for x ∈ (0, 1)

Convolutions of Random Variables

A convolution of n random variables is simply their sum. For the following results, let X and Y be independent.

1. X ∼ Pois(λ1), Y ∼ Pois(λ2) −→ X + Y ∼ Pois(λ1 + λ2)
2. X ∼ Bin(n1, p), Y ∼ Bin(n2, p) −→ X + Y ∼ Bin(n1 + n2, p). Bin(n, p) can be thought of as a sum of i.i.d. Bern(p) r.v.s.
3. X ∼ Gamma(a1, λ), Y ∼ Gamma(a2, λ) −→ X + Y ∼ Gamma(a1 + a2, λ). Gamma(n, λ) with n an integer can be thought of as a sum of i.i.d. Expo(λ) r.v.s.
4. X ∼ NBin(r1, p), Y ∼ NBin(r2, p) −→ X + Y ∼ NBin(r1 + r2, p). NBin(r, p) can be thought of as a sum of i.i.d. Geom(p) r.v.s.
5. X ∼ N(μ1, σ1²), Y ∼ N(μ2, σ2²) −→ X + Y ∼ N(μ1 + μ2, σ1² + σ2²)

Special Cases of Distributions

1. Bin(1, p) ∼ Bern(p)
2. Beta(1, 1) ∼ Unif(0, 1)
3. Gamma(1, λ) ∼ Expo(λ)
4. χ²_n ∼ Gamma(n/2, 1/2)
5. NBin(1, p) ∼ Geom(p)

Inequalities

1. Cauchy-Schwarz |E(XY)| ≤ √(E(X²)E(Y²))
2. Markov P(X ≥ a) ≤ E|X|/a for a > 0
3. Chebyshev P(|X − μ| ≥ a) ≤ σ²/a² for E(X) = μ, Var(X) = σ²
4. Jensen E(g(X)) ≥ g(E(X)) for g convex; reverse if g is concave

Formulas

Geometric Series

1 + r + r² + ... + r^{n−1} = Σ_{k=0}^{n−1} r^k = (1 − r^n)/(1 − r)
1 + r + r² + ... = 1/(1 − r) if |r| < 1

Exponential Function (e^x)

e^x = Σ_{n=0}^∞ x^n/n! = 1 + x + x²/2! + x³/3! + ... = lim_{n→∞} (1 + x/n)^n

Gamma and Beta Integrals

You can sometimes solve complicated-looking integrals by pattern-matching to a gamma or beta integral:

∫_0^∞ x^{t−1} e^{−x} dx = Γ(t)          ∫_0^1 x^{a−1}(1 − x)^{b−1} dx = Γ(a)Γ(b)/Γ(a + b)

Also, Γ(a + 1) = aΓ(a), and Γ(n) = (n − 1)! if n is a positive integer.

Euler's Approximation for Harmonic Sums

1 + 1/2 + 1/3 + ... + 1/n ≈ log n + 0.577...

Stirling's Approximation for Factorials

n! ≈ √(2πn) (n/e)^n

Miscellaneous Definitions

Medians and Quantiles Let X have CDF F. Then X has median m if F(m) ≥ 0.5 and P(X ≥ m) ≥ 0.5. For X continuous, m satisfies F(m) = 1/2. In general, the ath quantile of X is min{x : F(x) ≥ a}; the median is the case a = 1/2.

log Statisticians generally use log to refer to natural log (i.e., base e).

i.i.d. r.v.s Independent, identically-distributed random variables.

Example Problems

Contributions from Sebastian Chiu

Calculating Probability

A textbook has n typos, which are randomly scattered amongst its n pages, independently. You pick a random page. What is the probability that it has no typos? Answer: There is a (1 − 1/n) probability that any specific typo isn't on your page, and thus a (1 − 1/n)^n probability that there are no typos on your page. For n large, this is approximately e^{−1} = 1/e.

Linearity and Indicators (1)

In a group of n people, what is the expected number of distinct birthdays (month and day)? What is the expected number of birthday matches? Answer: Let X be the number of distinct birthdays and Ij be the indicator for the jth day being represented.

E(Ij) = 1 − P(no one born on day j) = 1 − (364/365)^n

By linearity, E(X) = 365(1 − (364/365)^n). Now let Y be the number of birthday matches and Ji be the indicator that the ith pair of people have the same birthday. The probability that any two specific people share a birthday is 1/365, so E(Y) = (n choose 2)/365.

Linearity and Indicators (2)

This problem is commonly known as the hat-matching problem. There are n people at a party, each with a hat. At the end of the party, they each leave with a random hat. What is the expected number of people who leave with the right hat? Answer: Each hat has a 1/n chance of going to the right person. By linearity, the average number of hats that go to their owners is n(1/n) = 1.

Linearity and First Success

This problem is commonly known as the coupon collector problem. There are n coupon types. At each draw, you get a uniformly random coupon type. What is the expected number of coupons needed until you have a complete set? Answer: Let N be the number of coupons needed; we want E(N). Let N = N1 + ... + Nn, where N1 is the draws to get our first new coupon, N2 is the additional draws needed to draw our second new coupon and so on. By the story of the First Success, N2 ∼ FS((n − 1)/n) (after collecting the first coupon type, there's a (n − 1)/n chance you'll get something new). Similarly, N3 ∼ FS((n − 2)/n), and Nj ∼ FS((n − j + 1)/n). By linearity,

E(N) = E(N1) + ... + E(Nn) = n/n + n/(n − 1) + ... + n/1 = n Σ_{j=1}^n 1/j

This is approximately n(log(n) + 0.577) by Euler's approximation.

Orderings of i.i.d. random variables

I call 2 UberX's and 3 Lyfts at the same time. If the times it takes for the rides to reach me are i.i.d., what is the probability that all the Lyfts will arrive first? Answer: Since the arrival times of the five cars are i.i.d., all 5! orderings of the arrivals are equally likely. There are 3!2! orderings that involve the Lyfts arriving first, so the probability that the Lyfts arrive first is 3!2!/5! = 1/10. Alternatively, there are (5 choose 3) ways to choose 3 of the 5 slots for the Lyfts to occupy, where each of the choices is equally likely. One of these choices has all 3 of the Lyfts arriving first, so the probability is 1/(5 choose 3) = 1/10.

Expectation of Negative Hypergeometric

What is the expected number of cards that you draw before you pick your first Ace in a shuffled deck (not counting the Ace)? Answer: Consider a non-Ace. Denote this to be card j. Let Ij be the indicator that card j will be drawn before the first Ace. Note that Ij = 1 says that j is before all 4 of the Aces in the deck. The probability that this occurs is 1/5 by symmetry. Let X be the number of cards drawn before the first Ace. Then X = I1 + I2 + ... + I48, where each indicator corresponds to one of the 48 non-Aces. Thus,

E(X) = E(I1) + E(I2) + ... + E(I48) = 48/5 = 9.6

Minimum and Maximum of RVs

What is the CDF of the maximum of n independent Unif(0,1) random variables? Answer: Note that for r.v.s X1, X2, ..., Xn,

P(min(X1, X2, ..., Xn) ≥ a) = P(X1 ≥ a, X2 ≥ a, ..., Xn ≥ a)

Similarly,

P(max(X1, X2, ..., Xn) ≤ a) = P(X1 ≤ a, X2 ≤ a, ..., Xn ≤ a)

We will use this principle to find the CDF of U(n), where U(n) = max(U1, U2, ..., Un) and Ui ∼ Unif(0, 1) are i.i.d.

P(max(U1, ..., Un) ≤ a) = P(U1 ≤ a, ..., Un ≤ a) = P(U1 ≤ a) ... P(Un ≤ a) = a^n

for 0 < a < 1 (and the CDF is 0 for a ≤ 0 and 1 for a ≥ 1).

Pattern-matching with e^x Taylor series

For X ∼ Pois(λ), find E(1/(X + 1)). Answer: By LOTUS,

E(1/(X + 1)) = Σ_{k=0}^∞ (1/(k + 1)) e^{−λ}λ^k/k! = (e^{−λ}/λ) Σ_{k=0}^∞ λ^{k+1}/(k + 1)! = (e^{−λ}/λ)(e^λ − 1)
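The coupon collector expectation E(N) = n Σ 1/j worked out above can be checked by simulation; a minimal Python sketch (n = 10 and the trial count are illustrative):

```python
import random

random.seed(0)

def coupons_needed(n):
    """Draw uniformly random coupon types until all n distinct types appear."""
    seen, draws = set(), 0
    while len(seen) < n:
        seen.add(random.randrange(n))
        draws += 1
    return draws

n = 10
exact = n * sum(1 / j for j in range(1, n + 1))   # n * H_n, about 29.29
trials = 20000
avg = sum(coupons_needed(n) for _ in range(trials)) / trials
```

The simulated average lands close to n·H_n, which in turn is close to Euler's approximation n(log n + 0.577) ≈ 28.8 for n = 10.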
Adam's Law and Eve's Law

William really likes speedsolving Rubik's Cubes. But he's pretty bad at it, so sometimes he fails. On any given day, William will attempt N ∼ Geom(s) Rubik's Cubes. Suppose each time, he has probability p of solving the cube, independently. Let T be the number of Rubik's Cubes he solves during a day. Find the mean and variance of T. Answer: Note that T | N ∼ Bin(N, p). So by Adam's Law,

E(T) = E(E(T | N)) = E(Np) = p(1 − s)/s

Similarly, by Eve's Law, we have that

Var(T) = E(Var(T | N)) + Var(E(T | N)) = E(Np(1 − p)) + Var(Np)
       = p(1 − p)(1 − s)/s + p²(1 − s)/s² = p(1 − s)(p + s(1 − p))/s²

MGF – Finding Moments

Find E(X³) for X ∼ Expo(λ) using the MGF of X. Answer: The MGF of an Expo(λ) is M(t) = λ/(λ − t). To get the third moment, we can take the third derivative of the MGF and evaluate at t = 0:

E(X³) = 6/λ³

But a much nicer way to use the MGF here is via pattern recognition: note that M(t) looks like it came from a geometric series:

1/(1 − t/λ) = Σ_{n=0}^∞ (t/λ)^n = Σ_{n=0}^∞ (n!/λ^n)(t^n/n!)

The coefficient of t^n/n! here is the nth moment of X, so we have E(X^n) = n!/λ^n for all nonnegative integers n.

Markov chains (1)

Suppose Xn is a two-state Markov chain with transition matrix

Q = [[1 − α, α], [β, 1 − β]]   (rows and columns indexed by states 0 and 1)

Find the stationary distribution s⃗ = (s0, s1) of Xn by solving s⃗Q = s⃗, and show that the chain is reversible with respect to s⃗. Answer: The equation s⃗Q = s⃗ says that

s0 = s0(1 − α) + s1β and s1 = s0α + s1(1 − β)

By solving this system of linear equations, we have

s⃗ = (β/(α + β), α/(α + β))

To show that the chain is reversible with respect to s⃗, we must show si qij = sj qji for all i, j. This is done if we can show s0 q01 = s1 q10. And indeed,

s0 q01 = αβ/(α + β) = s1 q10

Markov chains (2)

William and Sebastian play a modified game of Settlers of Catan, where every turn they randomly move the robber (which starts on the center tile) to one of the adjacent hexagons.

[Figure: hexagonal Settlers of Catan board, with the robber on the center tile.]

(a) Is this Markov chain irreducible? Is it aperiodic? Answer: Yes to both. The Markov chain is irreducible because it can get from anywhere to anywhere else. The Markov chain is aperiodic because the robber can return back to a square in 2, 3, 4, 5, ... moves, and the GCD of those numbers is 1.

(b) What is the stationary distribution of this Markov chain? Answer: Since this is a random walk on an undirected graph, the stationary distribution is proportional to the degree sequence. The degree for the corner pieces is 3, the degree for the edge pieces is 4, and the degree for the center pieces is 6. To normalize this degree sequence, we divide by its sum. The sum of the degrees is 6(3) + 6(4) + 7(6) = 84. Thus the stationary probability of being on a corner is 3/84 = 1/28, on an edge is 4/84 = 1/21, and in the center is 6/84 = 1/14.

(c) What fraction of the time will the robber be in the center tile in this game, in the long run? Answer: By the above, 1/14.

(d) What is the expected amount of moves it will take for the robber to return to the center tile? Answer: Since this chain is irreducible and aperiodic, to get the expected time to return we can just invert the stationary probability. Thus on average it will take 14 turns for the robber to return to the center tile.

Problem-Solving Strategies

Contributions from Jessy Hwang, Yuan Jiang, Yuqi Hou

1. Getting started. Start by defining relevant events and random variables. ("Let A be the event that I pick the fair coin"; "Let X be the number of successes.") Clear notation is important for clear thinking! Then decide what it is that you're supposed to be finding, in terms of your notation ("I want to find P(X = 3 | A)"). Think about what type of object your answer should be (a number? A random variable? A PMF? A PDF?) and what it should be in terms of. Try simple and extreme cases. To make an abstract experiment more concrete, try drawing a picture or making up numbers that could have happened. Pattern recognition: does the structure of the problem resemble something we've seen before?

2. Calculating probability of an event. Use counting principles if the naive definition of probability applies. Is the probability of the complement easier to find? Look for symmetries. Look for something to condition on, then apply Bayes' Rule or the Law of Total Probability.

3. Finding the distribution of a random variable. First make sure you need the full distribution, not just the mean (see next item). Check the support of the random variable: what values can it take on? Use this to rule out distributions that don't fit. Is there a story for one of the named distributions that fits the problem at hand? Can you write the random variable as a function of an r.v. with a known distribution, say Y = g(X)?

4. Calculating expectation. If it has a named distribution, check out the table of distributions. If it's a function of an r.v. with a named distribution, try LOTUS. If it's a count of something, try breaking it up into indicator r.v.s. If you can condition on something natural, consider using Adam's law.

5. Calculating variance. Consider independence, named distributions, and LOTUS. If it's a count of something, break it up into a sum of indicator r.v.s. If it's a sum, use properties of covariance. If you can condition on something natural, consider using Eve's Law.

6. Calculating E(X²). Do you already know E(X) or Var(X)? Recall that Var(X) = E(X²) − (E(X))². Otherwise try LOTUS.

7. Calculating covariance. Use the properties of covariance. If you're trying to find the covariance between two components of a Multinomial distribution, Xi, Xj, then the covariance is −npi pj for i ≠ j.

8. Symmetry. If X1, ..., Xn are i.i.d., consider using symmetry.

9. Calculating probabilities of orderings. Remember that all n! orderings of i.i.d. continuous random variables X1, ..., Xn are equally likely.

10. Determining independence. There are several equivalent definitions. Think about simple and extreme cases to see if you can find a counterexample.

11. Do a painful integral. If your integral looks painful, see if you can write your integral in terms of a known PDF (like Gamma or Beta), and use the fact that PDFs integrate to 1.

12. Before moving on. Check some simple and extreme cases, check whether the answer seems plausible, check for biohazards.

Biohazards

Contributions from Jessy Hwang

1. Don't misuse the naive definition of probability. When answering "What is the probability that in a group of 3 people, no two have the same birth month?", it is not correct to treat the people as indistinguishable balls being placed into 12 boxes, since that assumes the list of birth months {January, January, January} is just as likely as the list {January, April, June}, even though the latter is six times more likely.

2. Don't confuse unconditional, conditional, and joint probabilities. In applying P(A|B) = P(B|A)P(A)/P(B), it is not correct to say "P(B) = 1 because we know B happened"; P(B) is the prior probability of B. Don't confuse P(A|B) with P(A, B).

3. Don't assume independence without justification. In the matching problem, the probability that card 1 is a match and card 2 is a match is not 1/n². Binomial and Hypergeometric are often confused; the trials are independent in the Binomial story and dependent in the Hypergeometric story.

4. Don't forget to do sanity checks. Probabilities must be between 0 and 1. Variances must be ≥ 0. Supports must make sense. PMFs must sum to 1. PDFs must integrate to 1.

5. Don't confuse random variables, numbers, and events. Let X be an r.v. Then g(X) is an r.v. for any function g. In particular, X², |X|, F(X), and I_{X>3} are r.v.s. P(X² < X | X ≥ 0), E(X), Var(X), and g(E(X)) are numbers. X = 2R and F(X) ≥ −1 are events. It does not make sense to write ∫_{−∞}^{∞} F(X) dx, because F(X) is a random variable. It does not make sense to write P(X), because X is not an event.

6. Don't confuse a random variable with its distribution. To get the PDF of X², you can't just square the PDF of X. The right way is to use transformations. To get the PDF of X + Y, you can't just add the PDF of X and the PDF of Y. The right way is to compute the convolution.

7. Don't pull non-linear functions out of expectations. E(g(X)) does not equal g(E(X)) in general. The St. Petersburg paradox is an extreme example. See also Jensen's inequality. The right way to find E(g(X)) is with LOTUS.

Recommended Resources

• Introduction to Probability Book (http://bit.ly/introprobability)
• Stat 110 Online (http://stat110.net)
• Stat 110 Quora Blog (https://stat110.quora.com/)
• Quora Probability FAQ (http://bit.ly/probabilityfaq)
• R Studio (https://www.rstudio.com)
• LaTeX File (github.com/wzchen/probability cheatsheet)

Please share this cheatsheet with friends!
http://wzchen.com/probability-cheatsheet
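The stationary distribution and reversibility worked out in Markov chains (1) can be verified numerically; a Python sketch (α = 0.3 and β = 0.1 are illustrative transition probabilities):

```python
# Two-state chain with transition matrix Q = [[1-a, a], [b, 1-b]].
# Claimed stationary distribution: s = (b/(a+b), a/(a+b)).
a, b = 0.3, 0.1                      # illustrative values of alpha and beta
Q = [[1 - a, a], [b, 1 - b]]
s = [b / (a + b), a / (a + b)]

# One step of the chain applied to s; sQ should equal s.
sQ = [s[0] * Q[0][0] + s[1] * Q[1][0],
      s[0] * Q[0][1] + s[1] * Q[1][1]]

# Reversibility (detailed balance): s0 * q01 should equal s1 * q10.
detailed_balance = s[0] * Q[0][1] - s[1] * Q[1][0]
```

Here s = (0.25, 0.75), sQ reproduces s exactly, and the detailed-balance difference is 0.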
Distributions in R

Command What it does
help(distributions) shows documentation on distributions
dbinom(k,n,p) PMF P (X = k) for X ∼ Bin(n, p)
pbinom(x,n,p) CDF P (X ≤ x) for X ∼ Bin(n, p)
qbinom(a,n,p) ath quantile for X ∼ Bin(n, p)
rbinom(r,n,p) vector of r i.i.d. Bin(n, p) r.v.s
dgeom(k,p) PMF P (X = k) for X ∼ Geom(p)
dhyper(k,w,b,n) PMF P (X = k) for X ∼ HGeom(w, b, n)
dnbinom(k,r,p) PMF P (X = k) for X ∼ NBin(r, p)
dpois(k,r) PMF P (X = k) for X ∼ Pois(r)
dbeta(x,a,b) PDF f (x) for X ∼ Beta(a, b)
dchisq(x,n) PDF f (x) for X ∼ χ2n
dexp(x,b) PDF f (x) for X ∼ Expo(b)
dgamma(x,a,r) PDF f (x) for X ∼ Gamma(a, r)
dlnorm(x,m,s) PDF f (x) for X ∼ LN (m, s2 )
dnorm(x,m,s) PDF f (x) for X ∼ N (m, s2 )
dt(x,n) PDF f (x) for X ∼ tn
dunif(x,a,b) PDF f (x) for X ∼ Unif(a, b)
The table above gives R commands for working with various named
distributions. Commands analogous to pbinom, qbinom, and rbinom
work for the other distributions in the table. For example, pnorm,
qnorm, and rnorm can be used to get the CDF, quantiles, and random
generation for the Normal. For the Multinomial, dmultinom can be used
for calculating the joint PMF and rmultinom can be used for generating
random vectors. For the Multivariate Normal, after installing and
loading the mvtnorm package dmvnorm can be used for calculating the
joint PDF and rmvnorm can be used for generating random vectors.
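For readers without R at hand, the Binomial commands in the table have straightforward standard-library analogues in Python; a sketch (the function names mirror R's purely for comparison):

```python
from math import comb

def dbinom(k, n, p):
    """PMF P(X = k) for X ~ Bin(n, p), like R's dbinom(k, n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def pbinom(x, n, p):
    """CDF P(X <= x) for X ~ Bin(n, p), like R's pbinom(x, n, p)."""
    return sum(dbinom(k, n, p) for k in range(0, x + 1))

def qbinom(a, n, p):
    """Smallest k with P(X <= k) >= a, like R's qbinom(a, n, p)."""
    total = 0.0
    for k in range(n + 1):
        total += dbinom(k, n, p)
        if total >= a:
            return k
    return n
```

For example, dbinom(5, 10, 0.5) gives the peak of the Bin(10, 1/2) PMF, 252/1024 ≈ 0.246.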
Table of Distributions

Distribution | PMF/PDF and Support | Expected Value | Variance | MGF

Bernoulli Bern(p): P(X = 1) = p, P(X = 0) = q = 1 − p; E(X) = p; Var(X) = pq; MGF q + pe^t

Binomial Bin(n, p): P(X = k) = C(n, k) p^k q^{n−k}, k ∈ {0, 1, ..., n}; E(X) = np; Var(X) = npq; MGF (q + pe^t)^n

Geometric Geom(p): P(X = k) = q^k p, k ∈ {0, 1, 2, ...}; E(X) = q/p; Var(X) = q/p²; MGF p/(1 − qe^t), qe^t < 1

Negative Binomial NBin(r, p): P(X = n) = C(r + n − 1, r − 1) p^r q^n, n ∈ {0, 1, 2, ...}; E(X) = rq/p; Var(X) = rq/p²; MGF (p/(1 − qe^t))^r, qe^t < 1

Hypergeometric HGeom(w, b, n): P(X = k) = C(w, k) C(b, n − k)/C(w + b, n), k ∈ {0, 1, ..., n}; E(X) = μ = nw/(b + w); Var(X) = ((w + b − n)/(w + b − 1)) n (μ/n)(1 − μ/n); MGF messy

Poisson Pois(λ): P(X = k) = e^{−λ} λ^k/k!, k ∈ {0, 1, 2, ...}; E(X) = λ; Var(X) = λ; MGF e^{λ(e^t − 1)}

Uniform Unif(a, b): f(x) = 1/(b − a), x ∈ (a, b); E(X) = (a + b)/2; Var(X) = (b − a)²/12; MGF (e^{tb} − e^{ta})/(t(b − a))

Normal N(μ, σ²): f(x) = (1/(σ√(2π))) e^{−(x − μ)²/(2σ²)}, x ∈ (−∞, ∞); E(X) = μ; Var(X) = σ²; MGF e^{tμ + σ²t²/2}

Exponential Expo(λ): f(x) = λe^{−λx}, x ∈ (0, ∞); E(X) = 1/λ; Var(X) = 1/λ²; MGF λ/(λ − t), t < λ

Gamma Gamma(a, λ): f(x) = (1/Γ(a)) (λx)^a e^{−λx} (1/x), x ∈ (0, ∞); E(X) = a/λ; Var(X) = a/λ²; MGF (λ/(λ − t))^a, t < λ

Beta Beta(a, b): f(x) = (Γ(a + b)/(Γ(a)Γ(b))) x^{a−1} (1 − x)^{b−1}, x ∈ (0, 1); E(X) = μ = a/(a + b); Var(X) = μ(1 − μ)/(a + b + 1); MGF messy

Log-Normal LN(μ, σ²): f(x) = (1/(xσ√(2π))) e^{−(log x − μ)²/(2σ²)}, x ∈ (0, ∞); E(X) = θ = e^{μ + σ²/2}; Var(X) = θ²(e^{σ²} − 1); MGF doesn't exist

Chi-Square χ²_n: f(x) = (1/(2^{n/2} Γ(n/2))) x^{n/2 − 1} e^{−x/2}, x ∈ (0, ∞); E(X) = n; Var(X) = 2n; MGF (1 − 2t)^{−n/2}, t < 1/2

Student-t t_n: f(x) = (Γ((n + 1)/2)/(√(nπ) Γ(n/2))) (1 + x²/n)^{−(n+1)/2}, x ∈ (−∞, ∞); E(X) = 0 if n > 1; Var(X) = n/(n − 2) if n > 2; MGF doesn't exist
Dates and Times Made Easy with lubridate

Task | lubridate | Date | POSIXct
now (system time zone) | now() | | Sys.time()
now (GMT) | now("GMT") | Sys.Date() |
origin | origin | structure(0, class = "Date") | structure(0, class = c("POSIXt", "POSIXct"))
x days since origin | origin + days(x) | structure(floor(x), class = "Date") | structure(x*24*60*60, class = c("POSIXt", "POSIXct"))
next day | date + days(1) | date + 1 | seq(date, length = 2, by = "day")[2]
previous day | date - days(1) | date - 1 | seq(date, length = 2, by = "-1 day")[2]

DST and time zones
x days since date (day exactly 24 hours) | date + ddays(x) | | seq(date, length = 2, by = paste(x, "day"))[2]
x days since date (allowing for DST) | date + days(x) | date + floor(x) | seq(date, length = 2, by = paste(x, "DSTday"))[2]
display date in new time zone | with_tz(date, "TZ") | | as.POSIXct(format(as.POSIXct(date), tz = "TZ"), tz = "TZ")
keep clock time, replace time zone | force_tz(date, tz = "TZ") | |

Exploring
sequence | date + c(0:9) * days(1) | seq(date, length = 10, by = "day") | seq(date, length = 10, by = "DSTday")
every 2nd week | date + c(0:2) * weeks(2) | seq(date, length = 3, by = "2 week") | seq(date, length = 3, by = "2 week")
first day of month | floor_date(date, "month") | as.Date(format(date, "%Y-%m-01")) | as.POSIXct(format(date, "%Y-%m-01"))
round to nearest first of month | round_date(date, "month") | |
extract year value | year(date) | as.numeric(format(date, "%Y")) | as.numeric(format(date, "%Y"))
change year value | year(date) <- z | as.Date(format(date, "z-%m-%d")) | as.POSIXct(format(date, "z-%m-%d"))
day of week | wday(date) # Sun = 1 | as.numeric(format(date, "%w")) # Sun = 0 | as.numeric(format(date, "%w")) # Sun = 0
day of year | yday(date) | as.numeric(format(date, "%j")) | as.numeric(format(date, "%j"))
express as decimal of year | decimal_date(date) | |

Parsing dates
z = "1970-10-15" | ymd(z) | as.Date(z) | as.POSIXct(z)
z = "10/15/1970" | mdy(z) | as.Date(z, "%m/%d/%Y") | as.POSIXct(strptime(z, "%m/%d/%Y"))
z = 15101970 | dmy(z) | as.Date(as.character(z), format = "%d%m%Y") | as.POSIXct(as.character(z), tz = "GMT", format = "%d%m%Y")

Durations comparison
Duration | lubridate | Base R
1 second | seconds(1) | as.difftime(1, unit = "secs")
5 days, 3 hours and -1 minute | new_duration(day = 5, hour = 3, minute = -1) | as.difftime(60 * 24 * 5 + 60 * 3 - 1, unit = "mins") # Time difference of 7379 mins
1 month | months(1) |
1 year | years(1) |

Table 3: lubridate provides a simple alternative for many date and time related operations. Table adapted from Grothendieck and Petzoldt (2004).
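Several of the table's operations have direct analogues in Python's standard datetime module; a sketch for comparison (the sample date is arbitrary):

```python
from datetime import date, timedelta

d = date(1970, 10, 15)

next_day = d + timedelta(days=1)           # like date + days(1)
prev_day = d - timedelta(days=1)           # like date - days(1)
x_days_later = d + timedelta(days=30)      # like date + ddays(30)
first_of_month = d.replace(day=1)          # like floor_date(date, "month")
day_of_year = d.timetuple().tm_yday        # like yday(date)
parsed = date.fromisoformat("1970-10-15")  # like ymd("1970-10-15")
```

Note that Python's date, like R's Date class, ignores time zones and DST entirely; the with_tz/force_tz rows have no analogue without a time-zone-aware datetime.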
Data Visualization with ggplot2 : : CHEAT SHEET

Basics

ggplot2 is based on the grammar of graphics, the idea that you can build every graph from the same components: a data set, a coordinate system, and geoms—visual marks that represent data points.

[Diagram: data + geom + coordinate system = plot, with x = F, y = A]

To display values, map variables in the data to visual properties of the geom (aesthetics) like size, color, and x and y locations.

[Diagram: the same plot with color = F and size = A mapped as aesthetics]

Complete the template below to build a graph.

ggplot(data = <DATA>) +
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>),
    stat = <STAT>, position = <POSITION>) +
  <COORDINATE_FUNCTION> +
  <FACET_FUNCTION> +
  <SCALE_FUNCTION> +
  <THEME_FUNCTION>

data and <GEOM_FUNCTION> are required; the rest are not required—sensible defaults are supplied.

ggplot(data = mpg, aes(x = cty, y = hwy)) Begins a plot that you finish by adding layers to. Add one geom function per layer.

qplot(x = cty, y = hwy, data = mpg, geom = "point") Creates a complete plot with given data, geom, and mappings. Supplies many useful defaults.

last_plot() Returns the last plot

ggsave("plot.png", width = 5, height = 5) Saves last plot as 5' x 5' file named "plot.png" in working directory. Matches file type to file extension.

Geoms Use a geom function to represent data points, use the geom's aesthetic properties to represent variables. Each function returns a layer.

GRAPHICAL PRIMITIVES
a <- ggplot(economics, aes(date, unemploy))
b <- ggplot(seals, aes(x = long, y = lat))

a + geom_blank() (Useful for expanding limits)
b + geom_curve(aes(yend = lat + 1, xend = long + 1, curvature = z)) - x, xend, y, yend, alpha, angle, color, curvature, linetype, size
a + geom_path(lineend = "butt", linejoin = "round", linemitre = 1) x, y, alpha, color, group, linetype, size
a + geom_polygon(aes(group = group)) x, y, alpha, color, fill, group, linetype, size
b + geom_rect(aes(xmin = long, ymin = lat, xmax = long + 1, ymax = lat + 1)) - xmax, xmin, ymax, ymin, alpha, color, fill, linetype, size
a + geom_ribbon(aes(ymin = unemploy - 900, ymax = unemploy + 900)) - x, ymax, ymin, alpha, color, fill, group, linetype, size

LINE SEGMENTS
common aesthetics: x, y, alpha, color, linetype, size
b + geom_abline(aes(intercept = 0, slope = 1))
b + geom_hline(aes(yintercept = lat))
b + geom_vline(aes(xintercept = long))
b + geom_segment(aes(yend = lat + 1, xend = long + 1))
b + geom_spoke(aes(angle = 1:1155, radius = 1))

ONE VARIABLE continuous
c <- ggplot(mpg, aes(hwy)); c2 <- ggplot(mpg)

c + geom_area(stat = "bin") x, y, alpha, color, fill, linetype, size
c + geom_density(kernel = "gaussian") x, y, alpha, color, fill, group, linetype, size, weight
c + geom_dotplot() x, y, alpha, color, fill
c + geom_freqpoly() x, y, alpha, color, group, linetype, size
c + geom_histogram(binwidth = 5) x, y, alpha, color, fill, linetype, size, weight
c2 + geom_qq(aes(sample = hwy)) x, y, alpha, color, fill, linetype, size, weight

discrete
d <- ggplot(mpg, aes(fl))

TWO VARIABLES continuous x, continuous y
e <- ggplot(mpg, aes(cty, hwy))

e + geom_label(aes(label = cty), nudge_x = 1, nudge_y = 1, check_overlap = TRUE) x, y, label, alpha, angle, color, family, fontface, hjust, lineheight, size, vjust
e + geom_jitter(height = 2, width = 2) x, y, alpha, color, fill, shape, size
e + geom_point() x, y, alpha, color, fill, shape, size, stroke
e + geom_quantile() x, y, alpha, color, group, linetype, size, weight
e + geom_rug(sides = "bl") x, y, alpha, color, linetype, size
e + geom_smooth(method = lm) x, y, alpha, color, fill, group, linetype, size, weight
e + geom_text(aes(label = cty), nudge_x = 1, nudge_y = 1, check_overlap = TRUE) x, y, label, alpha, angle, color, family, fontface, hjust, lineheight, size, vjust

discrete x, continuous y
f <- ggplot(mpg, aes(class, hwy))

f + geom_col() x, y, alpha, color, fill, group, linetype, size
f + geom_boxplot() x, y, lower, middle, upper, ymax, ymin, alpha, color, fill, group, linetype, shape, size, weight
f + geom_dotplot(binaxis = "y", stackdir = "center") x, y, alpha, color, fill, group
f + geom_violin(scale = "area") x, y, alpha, color, fill, group, linetype, size, weight

discrete x, discrete y
g <- ggplot(diamonds, aes(cut, color))

g + geom_count() x, y, alpha, color, fill, shape, size, stroke

continuous bivariate distribution
h <- ggplot(diamonds, aes(carat, price))

h + geom_bin2d(binwidth = c(0.25, 500)) x, y, alpha, color, fill, linetype, size, weight
h + geom_density2d() x, y, alpha, colour, group, linetype, size
h + geom_hex() x, y, alpha, colour, fill, size

continuous function
i <- ggplot(economics, aes(date, unemploy))

i + geom_area() x, y, alpha, color, fill, linetype, size
i + geom_line() x, y, alpha, color, group, linetype, size
i + geom_step(direction = "hv") x, y, alpha, color, group, linetype, size

visualizing error
df <- data.frame(grp = c("A", "B"), fit = 4:5, se = 1:2)
j <- ggplot(df, aes(grp, fit, ymin = fit - se, ymax = fit + se))

j + geom_crossbar(fatten = 2) x, y, ymax, ymin, alpha, color, fill, group, linetype, size
j + geom_errorbar() x, ymax, ymin, alpha, color, group, linetype, size, width (also geom_errorbarh())
j + geom_linerange() x, ymin, ymax, alpha, color, group, linetype, size
j + geom_pointrange() x, y, ymin, ymax, alpha, color, fill, group, linetype, shape, size

maps
data <- data.frame(murder = USArrests$Murder, state = tolower(rownames(USArrests)))
map <- map_data("state")
k <- ggplot(data, aes(fill = murder))

k + geom_map(aes(map_id = state), map = map) + expand_limits(x = map$long, y = map$lat) map_id, alpha, color, fill, linetype, size

THREE VARIABLES
seals$z <- with(seals, sqrt(delta_long^2 + delta_lat^2)); l <- ggplot(seals, aes(long, lat))

l + geom_contour(aes(z = z)) x, y, z, alpha, colour, group, linetype,
l + geom_raster(aes(fill = z), hjust = 0.5, vjust = 0.5,
 interpolate=FALSE)

size, weight x, y, alpha, fill
d + geom_bar() 

x, alpha, color, fill, linetype, size, weight l + geom_tile(aes(fill = z)), x, y, alpha, color, fill,
linetype, size, width

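As a worked instance of the template above — a minimal sketch, assuming ggplot2 is installed, using its bundled mpg data:

```r
library(ggplot2)

# data + geoms (with aesthetic mappings) + coordinate system + theme;
# facet, scale, and position pieces are optional and get sensible defaults.
p <- ggplot(data = mpg, aes(x = cty, y = hwy)) +
  geom_point(aes(color = class)) +          # one geom function per layer
  geom_smooth(method = "lm", se = FALSE) +  # a second layer on the same data
  coord_cartesian(xlim = c(5, 40)) +
  theme_bw()

# Render to disk; the file type is inferred from the extension.
ggsave("plot.png", plot = p, width = 5, height = 5)
```

Each `+` adds one layer or option; anything left out falls back to the defaults noted in the template.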
RStudio® is a trademark of RStudio, Inc. • CC BY RStudio • info@rstudio.com • 844-448-1212 • rstudio.com • Learn more with browseVignettes(package = c("dplyr", "tibble")) • dplyr 0.5.0 • tibble 1.2.0 • Updated: 2017-01
Stats - An alternative way to build a layer

A stat builds new variables to plot (e.g., count, prop).

Visualize a stat by changing the default stat of a geom function, geom_bar(stat = "count"), or by using a stat function, stat_count(geom = "bar"), which calls a default geom to make a layer (equivalent to a geom function). Use ..name.. syntax to map stat variables to aesthetics.

i + stat_density2d(aes(fill = ..level..), geom = "polygon")
(geom to use; stat function; geom mappings; variable created by stat)

c + stat_bin(binwidth = 1, origin = 10), x, y | ..count.., ..ncount.., ..density.., ..ndensity..
c + stat_count(width = 1), x, y | ..count.., ..prop..
c + stat_density(adjust = 1, kernel = "gaussian"), x, y | ..count.., ..density.., ..scaled..
e + stat_bin_2d(bins = 30, drop = TRUE), x, y, fill | ..count.., ..density..
e + stat_bin_hex(bins = 30), x, y, fill | ..count.., ..density..
e + stat_density_2d(contour = TRUE, n = 100), x, y, color, size | ..level..
e + stat_ellipse(level = 0.95, segments = 51, type = "t")
l + stat_contour(aes(z = z)), x, y, z, order | ..level..
l + stat_summary_hex(aes(z = z), bins = 30, fun = max), x, y, z, fill | ..value..
l + stat_summary_2d(aes(z = z), bins = 30, fun = mean), x, y, z, fill | ..value..
f + stat_boxplot(coef = 1.5), x, y | ..lower.., ..middle.., ..upper.., ..width.., ..ymin.., ..ymax..
f + stat_ydensity(kernel = "gaussian", scale = "area"), x, y | ..density.., ..scaled.., ..count.., ..n.., ..violinwidth.., ..width..
e + stat_ecdf(n = 40), x, y | ..x.., ..y..
e + stat_quantile(quantiles = c(0.1, 0.9), formula = y ~ log(x), method = "rq"), x, y | ..quantile..
e + stat_smooth(method = "lm", formula = y ~ x, se = TRUE, level = 0.95), x, y | ..se.., ..x.., ..y.., ..ymin.., ..ymax..
ggplot() + stat_function(aes(x = -3:3), n = 99, fun = dnorm, args = list(sd = 0.5)), x | ..x.., ..y..
e + stat_identity(na.rm = TRUE)
ggplot() + stat_qq(aes(sample = 1:100), dist = qt, dparam = list(df = 5)), sample, x, y | ..sample.., ..theoretical..
e + stat_sum(), x, y, size | ..n.., ..prop..
e + stat_summary(fun.data = "mean_cl_boot")
h + stat_summary_bin(fun.y = "mean", geom = "bar")
e + stat_unique()

Scales

Scales map data values to the visual values of an aesthetic. To change a mapping, add a new scale.

(n <- d + geom_bar(aes(fill = fl)))
n + scale_fill_manual(                                  # aesthetic to adjust; prepackaged scale to use
  values = c("skyblue", "royalblue", "blue", "navy"),   # scale-specific arguments
  limits = c("d", "e", "p", "r"),                       # range of values to include in mapping
  breaks = c("d", "e", "p", "r"),                       # breaks to use in legend/axis
  name = "fuel",                                        # title to use in legend/axis
  labels = c("D", "E", "P", "R"))                       # labels to use in legend/axis

GENERAL PURPOSE SCALES - Use with most aesthetics
scale_*_continuous() - map cont' values to visual ones
scale_*_discrete() - map discrete values to visual ones
scale_*_identity() - use data values as visual ones
scale_*_manual(values = c()) - map discrete values to manually chosen visual ones
scale_*_date(date_labels = "%m/%d", date_breaks = "2 weeks") - treat data values as dates
scale_*_datetime() - treat data values as date times. Use same arguments as scale_x_date(). See ?strptime for label formats.

X & Y LOCATION SCALES - Use with x or y aesthetics (x shown here)
scale_x_log10() - Plot x on log10 scale
scale_x_reverse() - Reverse direction of x axis
scale_x_sqrt() - Plot x on square root scale

COLOR AND FILL SCALES (DISCRETE)
n <- d + geom_bar(aes(fill = fl))
n + scale_fill_brewer(palette = "Blues") For palette choices: RColorBrewer::display.brewer.all()
n + scale_fill_grey(start = 0.2, end = 0.8, na.value = "red")

COLOR AND FILL SCALES (CONTINUOUS)
o <- c + geom_dotplot(aes(fill = ..x..))
o + scale_fill_distiller(palette = "Blues")
o + scale_fill_gradient(low = "red", high = "yellow")
o + scale_fill_gradient2(low = "red", high = "blue", mid = "white", midpoint = 25)
o + scale_fill_gradientn(colours = topo.colors(6)) Also: rainbow(), heat.colors(), terrain.colors(), cm.colors(), RColorBrewer::brewer.pal()

SHAPE AND SIZE SCALES
p <- e + geom_point(aes(shape = fl, size = cyl))
p + scale_shape() + scale_size()
p + scale_shape_manual(values = c(3:7))
p + scale_radius(range = c(1, 6))
p + scale_size_area(max_size = 6)

Coordinate Systems

r <- d + geom_bar()
r + coord_cartesian(xlim = c(0, 5)), xlim, ylim - The default cartesian coordinate system
r + coord_fixed(ratio = 1/2), ratio, xlim, ylim - Cartesian coordinates with fixed aspect ratio between x and y units
r + coord_flip(), xlim, ylim - Flipped Cartesian coordinates
r + coord_polar(theta = "x", direction = 1), theta, start, direction - Polar coordinates
r + coord_trans(ytrans = "sqrt"), xtrans, ytrans, limx, limy - Transformed cartesian coordinates. Set xtrans and ytrans to the name of a window function.
π + coord_quickmap()
π + coord_map(projection = "ortho", orientation = c(41, -74, 0)), projection, orientation, xlim, ylim - Map projections from the mapproj package (mercator (default), azequalarea, lagrange, etc.)

Faceting

Facets divide a plot into subplots based on the values of one or more discrete variables.

t <- ggplot(mpg, aes(cty, hwy)) + geom_point()
t + facet_grid(. ~ fl) - facet into columns based on fl
t + facet_grid(year ~ .) - facet into rows based on year
t + facet_grid(year ~ fl) - facet into both rows and columns
t + facet_wrap(~ fl) - wrap facets into a rectangular layout

Set scales to let axis limits vary across facets:
t + facet_grid(drv ~ fl, scales = "free") - x and y axis limits adjust to individual facets
  "free_x" - x axis limits adjust
  "free_y" - y axis limits adjust

Set labeller to adjust facet labels:
t + facet_grid(. ~ fl, labeller = label_both)                  # fl: c, fl: d, fl: e, fl: p, fl: r
t + facet_grid(fl ~ ., labeller = label_bquote(alpha ^ .(fl)))
t + facet_grid(. ~ fl, labeller = label_parsed)                # c, d, e, p, r

Position Adjustments

Position adjustments determine how to arrange geoms that would otherwise occupy the same space.

s <- ggplot(mpg, aes(fl, fill = drv))
s + geom_bar(position = "dodge") - Arrange elements side by side
s + geom_bar(position = "fill") - Stack elements on top of one another, normalize height
s + geom_bar(position = "stack") - Stack elements on top of one another
e + geom_point(position = "jitter") - Add random noise to X and Y position of each element to avoid overplotting
e + geom_label(position = "nudge") - Nudge labels away from points

Each position adjustment can be recast as a function with manual width and height arguments:
s + geom_bar(position = position_dodge(width = 1))

Labels

t + labs(x = "New x axis label", y = "New y axis label",
  title = "Add a title above the plot",
  subtitle = "Add a subtitle below title",
  caption = "Add a caption below plot",
  <aes> = "New <aes> legend title")
t + annotate(geom = "text", x = 8, y = 9, label = "A") - geom to place; manual values for geom's aesthetics

Legends

n + theme(legend.position = "bottom") - Place legend at "bottom", "top", "left", or "right"
n + guides(fill = "none") - Set legend type for each aesthetic: colorbar, legend, or none (no legend)
n + scale_fill_discrete(name = "Title", labels = c("A", "B", "C", "D", "E")) - Set legend title and labels with a scale function.

Themes

r + theme_bw() - White background with grid lines
r + theme_gray() - Grey background (default theme)
r + theme_dark() - dark for contrast
r + theme_classic()
r + theme_light()
r + theme_linedraw()
r + theme_minimal() - Minimal theme
r + theme_void() - Empty theme

Zooming

Without clipping (preferred):
t + coord_cartesian(xlim = c(0, 100), ylim = c(10, 20))
With clipping (removes unseen data points):
t + xlim(0, 100) + ylim(10, 20)
t + scale_x_continuous(limits = c(0, 100)) + scale_y_continuous(limits = c(0, 100))
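The geom/stat equivalence described above can be checked directly — a sketch assuming ggplot2 is installed:

```r
library(ggplot2)

d <- ggplot(mpg, aes(fl))

# Equivalent layers: a geom with its default stat made explicit,
# and the stat function calling its default geom.
p1 <- d + geom_bar(stat = "count")
p2 <- d + stat_count(geom = "bar")

# A stat-created variable (..count..) remapped with the ..name.. syntax,
# here to show proportions instead of raw counts.
p3 <- d + geom_bar(aes(y = ..count.. / sum(..count..)))
```

All three are ordinary ggplot objects; printing any of them draws the bar chart.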
Data Visualization with ggplot2 Cheat Sheet

Basics
ggplot2 is based on the grammar of graphics, the idea that you can build every graph from the same few components: a data set, a set of geoms—visual marks that represent data points—and a coordinate system.

To display data values, map variables in the data set to aesthetic properties of the geom like size, color, and x and y locations.

Build a graph with qplot() or ggplot():

qplot(x = cty, y = hwy, color = cyl, data = mpg, geom = "point")
Creates a complete plot with given data, geom, and mappings. Supplies many useful defaults.

ggplot(data = mpg, aes(x = cty, y = hwy))
Begins a plot that you finish by adding layers to. No defaults, but provides more control than qplot().

ggplot(mpg, aes(hwy, cty)) +       # data
  geom_point(aes(color = cyl)) +   # add layers, elements with +
  geom_smooth(method = "lm") +     # layer = geom + default stat + layer-specific mappings
  coord_cartesian() +
  scale_color_gradient() +
  theme_bw()                       # additional elements

Add a new layer to a plot with a geom_*() or stat_*() function. Each provides a geom, a set of aesthetic mappings, and a default stat and position adjustment.

last_plot()
Returns the last plot.

ggsave("plot.png", width = 5, height = 5)
Saves last plot as a 5 inch x 5 inch file named "plot.png" in the working directory. Matches file type to file extension.

Geoms - Use a geom to represent data points, use the geom's aesthetic properties to represent variables. Each function returns a layer.

Graphical Primitives
c <- ggplot(map, aes(long, lat))
c + geom_polygon(aes(group = group)), x, y, alpha, color, fill, linetype, size
d <- ggplot(economics, aes(date, unemploy))
d + geom_path(lineend = "butt", linejoin = "round", linemitre = 1), x, y, alpha, color, linetype, size
d + geom_ribbon(aes(ymin = unemploy - 900, ymax = unemploy + 900)), x, ymax, ymin, alpha, color, fill, linetype, size
e <- ggplot(seals, aes(x = long, y = lat))
e + geom_segment(aes(xend = long + delta_long, yend = lat + delta_lat)), x, xend, y, yend, alpha, color, linetype, size
e + geom_rect(aes(xmin = long, ymin = lat, xmax = long + delta_long, ymax = lat + delta_lat)), xmax, xmin, ymax, ymin, alpha, color, fill, linetype, size

One Variable
Continuous
a <- ggplot(mpg, aes(hwy))
a + geom_area(stat = "bin"), x, y, alpha, color, fill, linetype, size
  a + geom_area(aes(y = ..density..), stat = "bin")
a + geom_density(kernel = "gaussian"), x, y, alpha, color, fill, linetype, size, weight
  a + geom_density(aes(y = ..count..))
a + geom_dotplot(), x, y, alpha, color, fill
a + geom_freqpoly(), x, y, alpha, color, linetype, size
  a + geom_freqpoly(aes(y = ..density..))
a + geom_histogram(binwidth = 5), x, y, alpha, color, fill, linetype, size, weight
  a + geom_histogram(aes(y = ..density..))

Discrete
b <- ggplot(mpg, aes(fl))
b + geom_bar(), x, alpha, color, fill, linetype, size, weight

Two Variables
Continuous X, Continuous Y
f <- ggplot(mpg, aes(cty, hwy))
f + geom_blank()
f + geom_jitter(), x, y, alpha, color, fill, shape, size
f + geom_point(), x, y, alpha, color, fill, shape, size
f + geom_quantile(), x, y, alpha, color, linetype, size, weight
f + geom_rug(sides = "bl"), alpha, color, linetype, size
f + geom_smooth(method = lm), x, y, alpha, color, fill, linetype, size, weight
f + geom_text(aes(label = cty)), x, y, label, alpha, angle, color, family, fontface, hjust, lineheight, size, vjust

Discrete X, Continuous Y
g <- ggplot(mpg, aes(class, hwy))
g + geom_bar(stat = "identity"), x, y, alpha, color, fill, linetype, size, weight
g + geom_boxplot(), lower, middle, upper, x, ymax, ymin, alpha, color, fill, linetype, shape, size, weight
g + geom_dotplot(binaxis = "y", stackdir = "center"), x, y, alpha, color, fill
g + geom_violin(scale = "area"), x, y, alpha, color, fill, linetype, size, weight

Discrete X, Discrete Y
h <- ggplot(diamonds, aes(cut, color))
h + geom_jitter(), x, y, alpha, color, fill, shape, size

Continuous Bivariate Distribution
i <- ggplot(movies, aes(year, rating))
i + geom_bin2d(binwidth = c(5, 0.5)), xmax, xmin, ymax, ymin, alpha, color, fill, linetype, size, weight
i + geom_density2d(), x, y, alpha, colour, linetype, size
i + geom_hex(), x, y, alpha, colour, fill, size

Continuous Function
j <- ggplot(economics, aes(date, unemploy))
j + geom_area(), x, y, alpha, color, fill, linetype, size
j + geom_line(), x, y, alpha, color, linetype, size
j + geom_step(direction = "hv"), x, y, alpha, color, linetype, size

Visualizing error
df <- data.frame(grp = c("A", "B"), fit = 4:5, se = 1:2)
k <- ggplot(df, aes(grp, fit, ymin = fit - se, ymax = fit + se))
k + geom_crossbar(fatten = 2), x, y, ymax, ymin, alpha, color, fill, linetype, size
k + geom_errorbar(), x, ymax, ymin, alpha, color, linetype, size, width (also geom_errorbarh())
k + geom_linerange(), x, ymin, ymax, alpha, color, linetype, size
k + geom_pointrange(), x, y, ymin, ymax, alpha, color, fill, linetype, shape, size

Maps
data <- data.frame(murder = USArrests$Murder, state = tolower(rownames(USArrests)))
map <- map_data("state")
l <- ggplot(data, aes(fill = murder))
l + geom_map(aes(map_id = state), map = map) + expand_limits(x = map$long, y = map$lat), map_id, alpha, color, fill, linetype, size

Three Variables
seals$z <- with(seals, sqrt(delta_long^2 + delta_lat^2))
m <- ggplot(seals, aes(long, lat))
m + geom_contour(aes(z = z)), x, y, z, alpha, colour, linetype, size, weight
m + geom_raster(aes(fill = z), hjust = 0.5, vjust = 0.5, interpolate = FALSE), x, y, alpha, fill
m + geom_tile(aes(fill = z)), x, y, alpha, color, fill, linetype, size
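The Maps example above joins USArrests (a base R dataset) to state polygons by name; the base-R prep step works on its own, without ggplot2:

```r
# Build the lookup table used by geom_map(): one row per state,
# keyed by lower-case state name to match map_data("state") regions.
data <- data.frame(murder = USArrests$Murder,
                   state  = tolower(rownames(USArrests)))

head(data$state, 2)  # "alabama" "alaska"
nrow(data)           # 50
```

Matching on lower-case names matters because map_data("state") stores its region names in lower case.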
RStudio® is a trademark of RStudio, Inc. • CC BY RStudio • info@rstudio.com • 844-448-1212 • rstudio.com Learn more at docs.ggplot2.org • ggplot2 0.9.3.1 • Updated: 3/15
Stats - An alternative way to build a layer

Some plots visualize a transformation of the original data set. Use a stat to choose a common transformation to visualize, e.g. a + geom_bar(stat = "bin").

Each stat creates additional variables to map aesthetics to. These variables use a common ..name.. syntax. stat functions and geom functions both combine a stat with a geom to make a layer, i.e. stat_bin(geom = "bar") does the same as geom_bar(stat = "bin").

i + stat_density2d(aes(fill = ..level..), geom = "polygon", n = 100)
(stat function; layer-specific mappings with variable created by transformation; geom for layer; parameters for stat)

1D distributions
a + stat_bin(binwidth = 1, origin = 10), x, y | ..count.., ..ncount.., ..density.., ..ndensity..
a + stat_bindot(binwidth = 1, binaxis = "x"), x, y | ..count.., ..ncount..
a + stat_density(adjust = 1, kernel = "gaussian"), x, y | ..count.., ..density.., ..scaled..

2D distributions
f + stat_bin2d(bins = 30, drop = TRUE), x, y, fill | ..count.., ..density..
f + stat_binhex(bins = 30), x, y, fill | ..count.., ..density..
f + stat_density2d(contour = TRUE, n = 100), x, y, color, size | ..level..

3 Variables
m + stat_contour(aes(z = z)), x, y, z, order | ..level..
m + stat_spoke(aes(radius = z, angle = z)), angle, radius, x, xend, y, yend | ..x.., ..xend.., ..y.., ..yend..
m + stat_summary_hex(aes(z = z), bins = 30, fun = mean), x, y, z, fill | ..value..
m + stat_summary2d(aes(z = z), bins = 30, fun = mean), x, y, z, fill | ..value..

Comparisons
g + stat_boxplot(coef = 1.5), x, y | ..lower.., ..middle.., ..upper.., ..outliers..
g + stat_ydensity(adjust = 1, kernel = "gaussian", scale = "area"), x, y | ..density.., ..scaled.., ..count.., ..n.., ..violinwidth.., ..width..

Functions
f + stat_ecdf(n = 40), x, y | ..x.., ..y..
f + stat_quantile(quantiles = c(0.25, 0.5, 0.75), formula = y ~ log(x), method = "rq"), x, y | ..quantile.., ..x.., ..y..
f + stat_smooth(method = "auto", formula = y ~ x, se = TRUE, n = 80, fullrange = FALSE, level = 0.95), x, y | ..se.., ..x.., ..y.., ..ymin.., ..ymax..

General Purpose
ggplot() + stat_function(aes(x = -3:3), fun = dnorm, n = 101, args = list(sd = 0.5)), x | ..y..
f + stat_identity()
ggplot() + stat_qq(aes(sample = 1:100), distribution = qt, dparams = list(df = 5)), sample, x, y | ..x.., ..y..
f + stat_sum(), x, y, size | ..size..
f + stat_summary(fun.data = "mean_cl_boot")
f + stat_unique()

Scales
Scales control how a plot maps data values to the visual values of an aesthetic. To change the mapping, add a custom scale.

n <- b + geom_bar(aes(fill = fl))
n + scale_fill_manual(                                  # aesthetic to adjust; prepackaged scale to use
  values = c("skyblue", "royalblue", "blue", "navy"),   # scale-specific arguments
  limits = c("d", "e", "p", "r"),                       # range of values to include in mapping
  breaks = c("d", "e", "p", "r"),                       # breaks to use in legend/axis
  name = "fuel",                                        # title to use in legend/axis
  labels = c("D", "E", "P", "R"))                       # labels to use in legend/axis

General Purpose scales - Use with any aesthetic: alpha, color, fill, linetype, shape, size
scale_*_continuous() - map cont' values to visual values
scale_*_discrete() - map discrete values to visual values
scale_*_identity() - use data values as visual values
scale_*_manual(values = c()) - map discrete values to manually chosen visual values

X and Y location scales - Use with x or y aesthetics (x shown here)
scale_x_date(labels = date_format("%m/%d"), breaks = date_breaks("2 weeks")) - treat x values as dates. See ?strptime for label formats.
scale_x_datetime() - treat x values as date times. Use same arguments as scale_x_date().
scale_x_log10() - Plot x on log10 scale
scale_x_reverse() - Reverse direction of x axis
scale_x_sqrt() - Plot x on square root scale

Color and fill scales
Discrete:
n + scale_fill_brewer(palette = "Blues") For palette choices: library(RColorBrewer); display.brewer.all()
n + scale_fill_grey(start = 0.2, end = 0.8, na.value = "red")
Continuous:
o <- a + geom_dotplot(aes(fill = ..x..))
o + scale_fill_gradient(low = "red", high = "yellow")
o + scale_fill_gradient2(low = "red", high = "blue", mid = "white", midpoint = 25)
o + scale_fill_gradientn(colours = terrain.colors(6)) Also: rainbow(), heat.colors(), topo.colors(), cm.colors(), RColorBrewer::brewer.pal()

Shape scales
p <- f + geom_point(aes(shape = fl))
p + scale_shape(solid = FALSE)
p + scale_shape_manual(values = c(3:7)) Manual shape values (R point shapes 0-25; see ?pch)

Size scales
q <- f + geom_point(aes(size = cyl))
q + scale_size_area(max = 6) Value mapped to area of circle (not radius)

Coordinate Systems
r <- b + geom_bar()
r + coord_cartesian(xlim = c(0, 5)), xlim, ylim - The default cartesian coordinate system
r + coord_fixed(ratio = 1/2), ratio, xlim, ylim - Cartesian coordinates with fixed aspect ratio between x and y units
r + coord_flip(), xlim, ylim - Flipped Cartesian coordinates
r + coord_polar(theta = "x", direction = 1), theta, start, direction - Polar coordinates
r + coord_trans(ytrans = "sqrt"), xtrans, ytrans, limx, limy - Transformed cartesian coordinates. Set xtrans and ytrans to the name of a window function.
z + coord_map(projection = "ortho", orientation = c(41, -74, 0)), projection, orientation, xlim, ylim - Map projections from the mapproj package (mercator (default), azequalarea, lagrange, etc.)

Position Adjustments
Position adjustments determine how to arrange geoms that would otherwise occupy the same space.
s <- ggplot(mpg, aes(fl, fill = drv))
s + geom_bar(position = "dodge") - Arrange elements side by side
s + geom_bar(position = "fill") - Stack elements on top of one another, normalize height
s + geom_bar(position = "stack") - Stack elements on top of one another
f + geom_point(position = "jitter") - Add random noise to X and Y position of each element to avoid overplotting
Each position adjustment can be recast as a function with manual width and height arguments:
s + geom_bar(position = position_dodge(width = 1))

Faceting
Facets divide a plot into subplots based on the values of one or more discrete variables.
t <- ggplot(mpg, aes(cty, hwy)) + geom_point()
t + facet_grid(. ~ fl) - facet into columns based on fl
t + facet_grid(year ~ .) - facet into rows based on year
t + facet_grid(year ~ fl) - facet into both rows and columns
t + facet_wrap(~ fl) - wrap facets into a rectangular layout
Set scales to let axis limits vary across facets:
t + facet_grid(y ~ x, scales = "free") - x and y axis limits adjust to individual facets
  "free_x" - x axis limits adjust
  "free_y" - y axis limits adjust
Set labeller to adjust facet labels:
t + facet_grid(. ~ fl, labeller = label_both)                 # fl: c, fl: d, fl: e, fl: p, fl: r
t + facet_grid(. ~ fl, labeller = label_bquote(alpha ^ .(x)))
t + facet_grid(. ~ fl, labeller = label_parsed)               # c, d, e, p, r

Labels
t + ggtitle("New Plot Title") - Add a main title above the plot
t + xlab("New X label") - Change the label on the X axis
t + ylab("New Y label") - Change the label on the Y axis
t + labs(title = "New title", x = "New x", y = "New y") - All of the above
Use scale functions to update legend labels.

Legends
t + theme(legend.position = "bottom") - Place legend at "bottom", "top", "left", or "right"
t + guides(color = "none") - Set legend type for each aesthetic: colorbar, legend, or none (no legend)
t + scale_fill_discrete(name = "Title", labels = c("A", "B", "C")) - Set legend title and labels with a scale function.

Themes
r + theme_bw() - White background with grid lines
r + theme_classic() - White background, no gridlines
r + theme_grey() - Grey background (default theme)
r + theme_minimal() - Minimal theme
ggthemes - Package with additional ggplot2 themes

Zooming
Without clipping (preferred):
t + coord_cartesian(xlim = c(0, 100), ylim = c(10, 20))
With clipping (removes unseen data points):
t + xlim(0, 100) + ylim(10, 20)
t + scale_x_continuous(limits = c(0, 100)) + scale_y_continuous(limits = c(0, 100))
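The zooming distinction above matters whenever a layer computes statistics: coord_cartesian() only zooms the view, while xlim()/ylim() drop data before stats are computed. A sketch assuming ggplot2 is installed:

```r
library(ggplot2)

t <- ggplot(mpg, aes(cty, hwy)) + geom_point() + geom_smooth(method = "lm")

# Without clipping: the lm smooth is still fit to every row of mpg;
# the axes simply show a window of the full plot.
z1 <- t + coord_cartesian(xlim = c(10, 20))

# With clipping: points with cty outside [10, 20] are removed before
# the smooth is fit, so the fitted line itself can change.
z2 <- t + xlim(10, 20)
```

Prefer the coord_cartesian() form when the zoom should not alter fitted lines, boxplot statistics, or bin counts.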
Package Development with devtools Cheat Sheet

Package Structure
A package is a convention for organizing files into directories. This sheet shows how to work with the 7 most common parts of an R package:

  DESCRIPTION   Setup
  R/            Write code
  tests/        Test
  man/          Document
  vignettes/    Teach
  data/         Add data
  NAMESPACE     Organize

The contents of a package can be stored on disk as a:
• source - a directory with sub-directories (as above)
• bundle - a single compressed file (.tar.gz)
• binary - a single compressed file optimized for a specific OS

Or installed into an R library (loaded into memory during an R session) or archived online in a repository. Use the functions below to move between these states:
• install.packages() - install a binary package from CRAN
• install.packages(type = "source") - install a source package from CRAN
• R CMD install - install a package from a file on disk
• devtools::install() - install a source package
• devtools::build() - build a bundle (or binary) from source
• devtools::install_github() - install a package from GitHub
• library() - load an installed package into memory
• Build & Reload (RStudio)

devtools::add_build_ignore("file")
Adds file to .Rbuildignore, a list of files that will not be included when the package is built.

Setup (DESCRIPTION)
The DESCRIPTION file describes your work and sets up how your package will work with other packages.
• You must have a DESCRIPTION file.
• Add the packages that yours relies on with devtools::use_package(). Adds a package to the Imports field (or the Suggests field if the second argument is "Suggests").

Package: mypackage
Title: Title of Package
Version: 0.1.0
Authors@R: person("Hadley", "Wickham", email = "hadley@me.com", role = c("aut", "cre"))
Description: What the package does (one paragraph)
Depends: R (>= 3.1.0)
License: GPL-2
LazyData: true
Imports:
    dplyr (>= 0.4.0),
    ggvis (>= 0.2)
Suggests:
    knitr (>= 0.1.0)

Imports: packages that your package must have to work. R will install them when it installs your package.
Suggests: packages that are not essential to yours. Users can install them manually, or not, as they like.

Licenses:
• CC0 - No strings attached.
• MIT - MIT license applies to your code if re-shared.
• GPL-2 - GPL-2 license applies to your code, and all code anyone bundles with it, if re-shared.

Write code (R/)
All of the R code in your package goes in R/. A package with just an R/ directory is still a very useful package.
• Create a new package project with devtools::create("path/to/name"). Creates a template to develop into a package.
• Save your code in R/ as scripts (extension .R).

Workflow
1. Edit your code.
2. Load your code with devtools::load_all() - re-loads all saved files in R/ into memory. Ctrl/Cmd + Shift + L (keyboard shortcut) saves all open files then calls load_all().
3. Experiment in the console.
4. Repeat.

• Use consistent style with r-pkgs.had.co.nz/r.html#style
• Click on a function and press F2 to open its definition
• Search for a function with Ctrl + .

Test (tests/)
Use tests/ to store unit tests that will inform you if your code ever breaks.
• Add a tests/ directory and import testthat with devtools::use_testthat(). Sets up the package to use automated tests with testthat.
• Write tests with context(), test_that(), and expectations.
• Save your tests as .R files in tests/testthat/.

Workflow
1. Modify your code or tests.
2. Test your code with devtools::test() - runs all tests saved in tests/. Ctrl/Cmd + Shift + T (keyboard shortcut).
3. Repeat until all tests pass.

Example test:
context("Arithmetic")
test_that("Math works", {
  expect_equal(1 + 1, 2)
  expect_equal(1 + 2, 3)
  expect_equal(1 + 3, 4)
})

Expectations:
expect_equal() - is equal within small numerical tolerance?
expect_identical() - is exactly equal?
expect_match() - matches specified string or regular expression?
expect_output() - prints specified output?
expect_message() - displays specified message?
expect_warning() - displays specified warning?
expect_error() - throws specified error?
expect_is() - output inherits from certain class?
expect_false() - returns FALSE?
expect_true() - returns TRUE?
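The expectations listed above slot into test_that() blocks; a minimal sketch of a test file as it might live at tests/testthat/test-arithmetic.R (the file name is illustrative), assuming testthat is installed:

```r
library(testthat)

test_that("arithmetic works", {
  expect_equal(0.1 + 0.2, 0.3)                   # equal within numerical tolerance
  expect_identical(c(1L, 2L), 1:2)               # exactly equal
  expect_match("devtools 1.6.1", "\\d+\\.\\d+")  # matches a regular expression
  expect_error(stop("boom"), "boom")             # throws the specified error
})
```

Note the first expectation would fail with expect_identical(): 0.1 + 0.2 differs from 0.3 by floating-point error, which is exactly the tolerance expect_equal() allows.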
RStudio® is a trademark of RStudio, Inc. • CC BY RStudio • info@rstudio.com • 844-448-1212 • rstudio.com Learn more at http://r-pkgs.had.co.nz • devtools 1.6.1 • Updated: 1/15
Document ( man/ )

man/ contains the documentation for your functions, the help pages in your package.

The roxygen package
roxygen lets you write documentation inline in your .R files with a shorthand syntax.
• Use roxygen comments to document each function beside its definition.
• Add roxygen documentation as comment lines that begin with #'.
• Place comment lines directly above the code that defines the object documented.
• Place a roxygen @ tag (below) after #' to supply a specific section of documentation.
• Untagged lines will be used to generate a title, description, and details section (in that order).
• Include helpful examples for each function.

Workflow
1. Add roxygen comments in your .R files.
2. Convert roxygen comments into documentation with devtools::document().
   Converts roxygen comments to .Rd files and places them in man/. Builds NAMESPACE.
   Ctrl/Cmd + Shift + D (Keyboard Shortcut)
3. Open help pages with ? to preview documentation.
4. Repeat.

#' Add together two numbers.
#'
#' @param x A number.
#' @param y A number.
#' @return The sum of \code{x} and \code{y}.
#' @examples
#' add(1, 1)
#' @export
add <- function(x, y) {
  x + y
}

.Rd formatting tags
\emph{italic text}               \email{name@@foo.com}
\strong{bold text}               \href{url}{display}
\code{function(args)}            \url{url}
\pkg{package}                    \link[=dest]{display}
\code{\link{function}}           \linkS4class{class}
\code{\link[package]{function}}
\dontrun{code}  \dontshow{code}  \donttest{code}
\deqn{a + b (block)}  \eqn{a + b (inline)}
\tabular{lcr}{
  left \tab centered \tab right \cr
  cell \tab cell     \tab cell  \cr
}

Common roxygen tags
@aliases    @inheritParams  @seealso
@concepts   @keywords       @format (data)
@describeIn @param          @source (data)
@examples   @rdname         @include
@export     @return         @slot (S4)
@family     @section        @field (RC)

Add data ( data/ )

The data/ directory allows you to include data with your package.
• Store data in one of data/, R/sysdata.rda, inst/extdata.
• Always use LazyData: true in your DESCRIPTION file.
• Save data as .RData files (suggested).
• Document the name of each exported data set.

devtools::use_data()
  Adds a data object to data/ (R/sysdata.rda if internal = TRUE).
devtools::use_data_raw()
  Adds an R script used to clean a data set to data-raw/. Includes data-raw/ on .Rbuildignore.

Store data in
• data/ to make data available to package users.
• R/sysdata.rda to keep data internal for use by your functions.
• inst/extdata to make raw data available for loading and parsing examples. Access this data with system.file().

Organize ( NAMESPACE )

The NAMESPACE file helps you make your package self-contained: it won't interfere with other packages, and other packages won't interfere with it.
• Export functions for users by placing @export in their roxygen comments.
• Import objects from other packages with package::object (recommended) or @import, @importFrom, @importClassesFrom, @importMethodsFrom (not always recommended).
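A minimal sketch of the data workflow described above. The data set name `mydata` and its contents are illustrative, not from the sheet; the devtools calls are commented out because they only make sense inside a package directory.

```r
# Build a small data set to ship with a package ("mydata" is a made-up name)
mydata <- data.frame(x = 1:3, y = c("a", "b", "c"))

# Inside a package directory you would then run:
# devtools::use_data(mydata)                   # writes data/mydata.rda
# devtools::use_data(mydata, internal = TRUE)  # writes R/sysdata.rda instead

# Document the exported data set by its name, with roxygen comments in an .R file:
#' Example data set.
#'
#' @format A data frame with 3 rows and 2 variables.
"mydata"
```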
Teach ( vignettes/ )

vignettes/ holds documents that teach your users how to solve real problems with your tools.
• Create a vignettes/ directory and a template vignette with devtools::use_vignette().
  Adds template vignette as vignettes/my-vignette.Rmd.
• Append YAML headers to your vignettes (like below).
• Write the body of your vignettes in R Markdown (rmarkdown.rstudio.com).

---
title: "Vignette Title"
author: "Vignette Author"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Vignette Title}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

NAMESPACE workflow
1. Modify your code or tests.
2. Document your package (devtools::document()).
3. Check NAMESPACE.
4. Repeat until NAMESPACE is correct.

Submit your package
r-pkgs.had.co.nz/release.html
Data Wrangling with dplyr and tidyr Cheat Sheet

Tidy Data - A foundation for wrangling in R

In a tidy data set: each variable is saved in its own column, and each observation is saved in its own row. Tidy data complements R's vectorized operations: R will automatically preserve observations as you manipulate variables. No other format works as intuitively with R.

Syntax - Helpful conventions for wrangling

dplyr::tbl_df(iris)
  Converts data to tbl class. tbl's are easier to examine than data frames. R displays only the data that fits onscreen:

  Source: local data frame [150 x 5]

     Sepal.Length Sepal.Width Petal.Length
  1           5.1         3.5          1.4
  2           4.9         3.0          1.4
  3           4.7         3.2          1.3
  4           4.6         3.1          1.5
  5           5.0         3.6          1.4
  ..          ...         ...          ...
  Variables not shown: Petal.Width (dbl), Species (fctr)

Reshaping Data - Change the layout of a data set

dplyr::data_frame(a = 1:3, b = 4:6)
  Combine vectors into data frame (optimized).
dplyr::arrange(mtcars, mpg)
  Order rows by values of a column (low to high).
dplyr::arrange(mtcars, desc(mpg))
  Order rows by values of a column (high to low).
dplyr::rename(tb, y = year)
  Rename the columns of a data frame.
tidyr::gather(cases, "year", "n", 2:4)
  Gather columns into rows.
tidyr::spread(pollution, size, amount)
  Spread rows into columns.
tidyr::separate(storms, date, c("y", "m", "d"))
  Separate one column into several.
tidyr::unite(data, col, ..., sep)
  Unite several columns into one.
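A minimal sketch of gather() and spread() on a small table in the spirit of the cases example above (the column names and values are illustrative):

```r
library(tidyr)

# A wide table: one column per year
cases <- data.frame(country = c("A", "B"),
                    `2011` = c(7, 37),
                    `2012` = c(14, 45),
                    check.names = FALSE)

# gather(): move the year column names into a "year" key column,
# and their values into an "n" value column
long <- gather(cases, "year", "n", 2:3)

# spread(): the inverse - one new column per unique "year" value
wide <- spread(long, year, n)
```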
dplyr::glimpse(iris)
  Information dense summary of tbl data.
utils::View(iris)
  View data set in spreadsheet-like display (note capital V).
dplyr::%>%
  Passes object on left-hand side as first argument (or . argument) of function on right-hand side.
  x %>% f(y) is the same as f(x, y)
  y %>% f(x, ., z) is the same as f(x, y, z)

"Piping" with %>% makes code more readable, e.g.
  iris %>%
    group_by(Species) %>%
    summarise(avg = mean(Sepal.Width)) %>%
    arrange(avg)

Subset Observations (Rows)

dplyr::filter(iris, Sepal.Length > 7)
  Extract rows that meet logical criteria.
dplyr::distinct(iris)
  Remove duplicate rows.
dplyr::sample_frac(iris, 0.5, replace = TRUE)
  Randomly select fraction of rows.
dplyr::sample_n(iris, 10, replace = TRUE)
  Randomly select n rows.
dplyr::slice(iris, 10:15)
  Select rows by position.
dplyr::top_n(storms, 2, date)
  Select and order top n entries (by group if grouped data).

Logic in R - ?Comparison, ?base::Logic
  <    Less than                  !=       Not equal to
  >    Greater than               %in%     Group membership
  ==   Equal to                   is.na    Is NA
  <=   Less than or equal to      !is.na   Is not NA
  >=   Greater than or equal to   &,|,!,xor,any,all   Boolean operators

Subset Variables (Columns)

dplyr::select(iris, Sepal.Width, Petal.Length, Species)
  Select columns by name or helper function.

Helper functions for select - ?select
select(iris, contains("."))
  Select columns whose name contains a character string.
select(iris, ends_with("Length"))
  Select columns whose name ends with a character string.
select(iris, everything())
  Select every column.
select(iris, matches(".t."))
  Select columns whose name matches a regular expression.
select(iris, num_range("x", 1:5))
  Select columns named x1, x2, x3, x4, x5.
select(iris, one_of(c("Species", "Genus")))
  Select columns whose names are in a group of names.
select(iris, starts_with("Sepal"))
  Select columns whose name starts with a character string.
select(iris, Sepal.Length:Petal.Width)
  Select all columns between Sepal.Length and Petal.Width (inclusive).
select(iris, -Species)
  Select all columns except Species.
RStudio® is a trademark of RStudio, Inc. • CC BY RStudio • info@rstudio.com • 844-448-1212 • rstudio.com devtools::install_github("rstudio/EDAWR") for data sets Learn more with browseVignettes(package = c("dplyr", "tidyr")) • dplyr 0.4.0• tidyr 0.2.0 • Updated: 1/15
Summarise Data

dplyr::summarise(iris, avg = mean(Sepal.Length))
  Summarise data into single row of values.
dplyr::summarise_each(iris, funs(mean))
  Apply summary function to each column.
dplyr::count(iris, Species, wt = Sepal.Length)
  Count number of rows with each unique value of variable (with or without weights).

Summarise uses summary functions, functions that take a vector of values and return a single value, such as:
  dplyr::first       First value of a vector.
  dplyr::last        Last value of a vector.
  dplyr::nth         Nth value of a vector.
  dplyr::n           # of values in a vector.
  dplyr::n_distinct  # of distinct values in a vector.
  min                Minimum value in a vector.
  max                Maximum value in a vector.
  mean               Mean value of a vector.
  median             Median value of a vector.
  var                Variance of a vector.
  sd                 Standard deviation of a vector.
  IQR                IQR of a vector.

Make New Variables

dplyr::mutate(iris, sepal = Sepal.Length + Sepal.Width)
  Compute and append one or more new columns.
dplyr::mutate_each(iris, funs(min_rank))
  Apply window function to each column.
dplyr::transmute(iris, sepal = Sepal.Length + Sepal.Width)
  Compute one or more new columns. Drop original columns.

Mutate uses window functions, functions that take a vector of values and return another vector of values, such as:
  dplyr::lead          Copy with values shifted by 1.
  dplyr::lag           Copy with values lagged by 1.
  dplyr::dense_rank    Ranks with no gaps.
  dplyr::min_rank      Ranks. Ties get min rank.
  dplyr::percent_rank  Ranks rescaled to [0, 1].
  dplyr::row_number    Ranks. Ties go to first value.
  dplyr::ntile         Bin vector into n buckets.
  dplyr::between       Are values between a and b?
  dplyr::cume_dist     Cumulative distribution.
  dplyr::cumall        Cumulative all
  dplyr::cumany        Cumulative any
  dplyr::cummean       Cumulative mean
  cumsum               Cumulative sum
  cummax               Cumulative max
  cummin               Cumulative min
  cumprod              Cumulative prod
  pmax                 Element-wise max
  pmin                 Element-wise min

Group Data

dplyr::group_by(iris, Species)
  Group data into rows with the same value of Species.
dplyr::ungroup(iris)
  Remove grouping information from data frame.
iris %>% group_by(Species) %>% summarise(…)
  Compute separate summary row for each group.
iris %>% group_by(Species) %>% mutate(…)
  Compute new variables by group.

Combine Data Sets

Mutating Joins
dplyr::left_join(a, b, by = "x1")
  Join matching rows from b to a.
dplyr::right_join(a, b, by = "x1")
  Join matching rows from a to b.
dplyr::inner_join(a, b, by = "x1")
  Join data. Retain only rows in both sets.
dplyr::full_join(a, b, by = "x1")
  Join data. Retain all values, all rows.

Filtering Joins
dplyr::semi_join(a, b, by = "x1")
  All rows in a that have a match in b.
dplyr::anti_join(a, b, by = "x1")
  All rows in a that do not have a match in b.

Set Operations
dplyr::intersect(y, z)
  Rows that appear in both y and z.
dplyr::union(y, z)
  Rows that appear in either or both y and z.
dplyr::setdiff(y, z)
  Rows that appear in y but not z.

Binding
dplyr::bind_rows(y, z)
  Append z to y as new rows.
dplyr::bind_cols(y, z)
  Append z to y as new columns. Caution: matches rows by position.
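A sketch of the join family on the small a and b tables pictured in the sheet (a has keys A, B, C; b has keys A, B, D):

```r
library(dplyr)

a <- data.frame(x1 = c("A", "B", "C"), x2 = 1:3,
                stringsAsFactors = FALSE)
b <- data.frame(x1 = c("A", "B", "D"), x3 = c(TRUE, FALSE, TRUE),
                stringsAsFactors = FALSE)

lj <- left_join(a, b, by = "x1")   # all rows of a; x3 is NA for "C"
ij <- inner_join(a, b, by = "x1")  # only "A" and "B" appear in both
aj <- anti_join(a, b, by = "x1")   # rows of a with no match in b: "C"
```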
Data Transformation with dplyr Cheat Sheet

dplyr functions work with pipes and expect tidy data. In tidy data:
• Each variable is in its own column.
• Each observation, or case, is in its own row.
Pipes: x %>% f(y) becomes f(x, y).

Manipulate Cases

Extract Cases
Row functions return a subset of rows as a new table. Use a variant that ends in _ for non-standard evaluation friendly code.

filter(.data, …)
  Extract rows that meet logical criteria. Also filter_().
  filter(iris, Sepal.Length > 7)
distinct(.data, ..., .keep_all = FALSE)
  Remove rows with duplicate values. Also distinct_().
  distinct(iris, Species)
sample_frac(tbl, size = 1, replace = FALSE, weight = NULL, .env = parent.frame())
  Randomly select fraction of rows.
  sample_frac(iris, 0.5, replace = TRUE)
sample_n(tbl, size, replace = FALSE, weight = NULL, .env = parent.frame())
  Randomly select size rows.
  sample_n(iris, 10, replace = TRUE)
slice(.data, …)
  Select rows by position. Also slice_().
  slice(iris, 10:15)
top_n(x, n, wt)
  Select and order top n entries (by group if grouped data).
  top_n(iris, 5, Sepal.Width)

Logical and boolean operators to use with filter()
  <   <=   is.na()    %in%   |   xor()
  >   >=   !is.na()   !      &
See ?base::Logic and ?Comparison for help.

Arrange Cases
arrange(.data, ...)
  Order rows by values of a column (low to high); use with desc() to order from high to low.
  arrange(mtcars, mpg)
  arrange(mtcars, desc(mpg))

Add Cases
add_row(.data, ..., .before = NULL, .after = NULL)
  Add one or more rows to a table.
  add_row(faithful, eruptions = 1, waiting = 1)

Summarise Cases
These apply summary functions to columns to create a new table. Summary functions take vectors as input and return one value (see back).

summarise(.data, …)
  Compute table of summaries. Also summarise_().
  summarise(mtcars, avg = mean(mpg))
count(x, ..., wt = NULL, sort = FALSE)
  Count number of rows in each group defined by the variables in … Also tally().
  count(iris, Species)

Variations
• summarise_all() - Apply funs to every column.
• summarise_at() - Apply funs to specific columns.
• summarise_if() - Apply funs to all cols of one type.

Group Cases
Use group_by() to create a "grouped" copy of a table. dplyr functions will manipulate each "group" separately and then combine the results.

  mtcars %>%
    group_by(cyl) %>%
    summarise(avg = mean(mpg))

group_by(.data, ..., add = FALSE)
  Returns copy of table grouped by …
  g_iris <- group_by(iris, Species)
ungroup(x, ...)
  Returns ungrouped copy of table.
  ungroup(g_iris)

Manipulate Variables

Extract Variables
Column functions return a set of columns as a new table. Use a variant that ends in _ for non-standard evaluation friendly code.

select(.data, …)
  Extract columns by name. Also select_if().
  select(iris, Sepal.Length, Species)

Use these helpers with select(), e.g. select(iris, starts_with("Sepal")):
  contains(match)            ends_with(match)   matches(match)
  num_range(prefix, range)   one_of(…)          starts_with(match)
  :, e.g. mpg:cyl            -, e.g. -Species   everything()

Make New Variables
These apply vectorized functions to columns. Vectorized funs take vectors as input and return vectors of the same length as output (see back).

mutate(.data, …)
  Compute new column(s).
  mutate(mtcars, gpm = 1/mpg)
transmute(.data, …)
  Compute new column(s), drop others.
  transmute(mtcars, gpm = 1/mpg)
mutate_all(.tbl, .funs, ...)
  Apply funs to every column. Use with funs().
  mutate_all(faithful, funs(log(.), log2(.)))
mutate_at(.tbl, .cols, .funs, ...)
  Apply funs to specific columns. Use with funs(), vars() and the helper functions for select().
  mutate_at(iris, vars(-Species), funs(log(.)))
mutate_if(.tbl, .predicate, .funs, ...)
  Apply funs to all columns of one type. Use with funs().
  mutate_if(iris, is.numeric, funs(log(.)))
add_column(.data, ..., .before = NULL, .after = NULL)
  Add new column(s).
  add_column(mtcars, new = 1:32)
rename(.data, …)
  Rename columns.
  rename(iris, Length = Sepal.Length)
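A runnable version of the grouped-summary pattern shown above, using the built-in mtcars data:

```r
library(dplyr)

# One summary row per group: mtcars has cyl values 4, 6, 8
avg_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(avg = mean(mpg))
```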
RStudio® is a trademark of RStudio, Inc. • CC BY RStudio • info@rstudio.com • 844-448-1212 • rstudio.com Learn more with browseVignettes(package = c("dplyr", "tibble")) • dplyr 0.5.0 • tibble 1.2.0 • Updated: 2017-01
Vectorized Functions - to use with mutate()
mutate() and transmute() apply vectorized functions to columns to create new columns. Vectorized functions take vectors as input and return vectors of the same length as output.

Offsets
  dplyr::lag() - Offset elements by 1
  dplyr::lead() - Offset elements by -1

Cumulative Aggregates
  dplyr::cumall() - Cumulative all()
  dplyr::cumany() - Cumulative any()
  cummax() - Cumulative max()
  dplyr::cummean() - Cumulative mean()
  cummin() - Cumulative min()
  cumprod() - Cumulative prod()
  cumsum() - Cumulative sum()

Rankings
  dplyr::cume_dist() - Proportion of all values <=
  dplyr::dense_rank() - rank with ties = min, no gaps
  dplyr::min_rank() - rank with ties = min
  dplyr::ntile() - bins into n bins
  dplyr::percent_rank() - min_rank scaled to [0,1]
  dplyr::row_number() - rank with ties = "first"

Math
  +, -, *, /, ^, %/%, %% - arithmetic ops
  log(), log2(), log10() - logs
  <, <=, >, >=, !=, == - logical comparisons

Misc
  dplyr::between() - x >= left & x <= right
  dplyr::case_when() - multi-case if_else()
  dplyr::coalesce() - first non-NA values by element across a set of vectors
  dplyr::if_else() - element-wise if() + else()
  dplyr::na_if() - replace specific values with NA
  pmax() - element-wise max()
  pmin() - element-wise min()
  dplyr::recode() - Vectorized switch()
  dplyr::recode_factor() - Vectorized switch() for factors

Summary Functions - to use with summarise()
summarise() applies summary functions to columns to create a new table. Summary functions take vectors as input and return single values as output.

Counts
  dplyr::n() - number of values/rows
  dplyr::n_distinct() - # of uniques
  sum(!is.na()) - # of non-NA's

Location
  mean() - mean, also mean(!is.na())
  median() - median

Logicals
  mean() - Proportion of TRUE's
  sum() - # of TRUE's

Position/Order
  dplyr::first() - first value
  dplyr::last() - last value
  dplyr::nth() - value in nth location of vector

Rank
  quantile() - nth quantile
  min() - minimum value
  max() - maximum value

Spread
  IQR() - Inter-Quartile Range
  mad() - mean absolute deviation
  sd() - standard deviation
  var() - variance

Row names
Tidy data does not use rownames, which store a variable outside of the columns. To work with the rownames, first move them into a column.
  rownames_to_column() - Move row names into col.
    a <- rownames_to_column(iris, var = "C")
  column_to_rownames() - Move col into row names.
    column_to_rownames(a, var = "C")
  Also has_rownames(), remove_rownames().

Combine Tables

Combine Variables
Use bind_cols() to paste tables beside each other as they are.
  bind_cols(…)
    Returns tables placed side by side as a single table.
    BE SURE THAT ROWS ALIGN.

Combine Cases
Use bind_rows() to paste tables below each other as they are.
  bind_rows(…, .id = NULL)
    Returns tables one on top of the other as a single table. Set .id to a column name to add a column of the original table names (as pictured).

Mutating Joins
Use a "Mutating Join" to join one table to columns from another, matching values with the rows that they correspond to. Each join retains a different combination of values from the tables.
  left_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), …)
    Join matching values from y to x.
  right_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), …)
    Join matching values from x to y.
  inner_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), …)
    Join data. Retain only rows with matches.
  full_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), …)
    Join data. Retain all values, all rows.

Use by = c("col1", "col2") to specify the column(s) to match on.
  left_join(x, y, by = "A")
Use a named vector, by = c("col1" = "col2"), to match on columns with different names in each data set.
  left_join(x, y, by = c("C" = "D"))
Use suffix to specify the suffix to give to duplicate column names.
  left_join(x, y, by = c("C" = "D"), suffix = c("1", "2"))

Filtering Joins
Use a "Filtering Join" to filter one table against the rows of another.
  semi_join(x, y, by = NULL, …)
    Return rows of x that have a match in y.
    USEFUL TO SEE WHAT WILL BE JOINED.
  anti_join(x, y, by = NULL, …)
    Return rows of x that do not have a match in y.
    USEFUL TO SEE WHAT WILL NOT BE JOINED.

Set Operations
  intersect(x, y, …) - Rows that appear in both x and y.
  setdiff(x, y, …) - Rows that appear in x but not y.
  union(x, y, …) - Rows that appear in x or y (duplicates removed). union_all() retains duplicates.
  Use setequal() to test whether two data sets contain the exact same rows (in any order).
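A quick sketch contrasting the ranking and cumulative functions listed above on one small vector:

```r
library(dplyr)

x <- c(10, 20, 20, 30)

r_min   <- min_rank(x)    # ties share the minimum rank, gaps follow
r_dense <- dense_rank(x)  # ties share a rank, no gaps
r_row   <- row_number(x)  # ties broken by position
csum    <- cumsum(x)      # running total
```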
R For Data Science Cheat Sheet: data.table
Learn R for data science interactively at www.DataCamp.com

data.table
data.table is an R package that provides a high-performance version of base R's data.frame with syntax and feature enhancements for ease of use, convenience and programming speed.

Load the package:
> library(data.table)

Creating A data.table
> set.seed(45L)                     Create a data.table
> DT <- data.table(V1=c(1L,2L),     and call it DT
       V2=LETTERS[1:3],
       V3=round(rnorm(4),4),
       V4=1:12)

General form: DT[i, j, by]
"Take DT, subset rows using i, then calculate j grouped by by"

Adding/Updating Columns By Reference in j Using :=
> DT[,V1:=round(exp(V1),2)]         V1 is updated by what is after :=
> DT                                Return the result by calling DT
> DT[,c("V1","V2"):=list(           Columns V1 and V2 are updated by
       round(exp(V1),2),            what is after :=
       LETTERS[4:6])]
> DT[,':='(V1=round(exp(V1),2),     Alternative to the above one. With [],
       V2=LETTERS[4:6])][]          you print the result to the screen
> DT[,V1:=NULL]                     Remove V1
> DT[,c("V1","V2"):=NULL]           Remove columns V1 and V2
> Cols.chosen=c("A","B")
> DT[,Cols.chosen:=NULL]            Delete the column with column name Cols.chosen
> DT[,(Cols.chosen):=NULL]          Delete the columns specified in the variable Cols.chosen

Advanced Data Table Operations
> DT[.N-1]                          Return the penultimate row of the DT
> DT[,.N]                           Return the number of rows
> DT[,.(V2,V3)]                     Return V2 and V3 as a data.table
> DT[,list(V2,V3)]                  Return V2 and V3 as a data.table
> DT[,mean(V3),by=.(V1,V2)]         Return the result of j, grouped by all possible
                                    combinations of groups specified in by

.SD & .SDcols
> DT[,print(.SD),by=V2]             Look at what .SD contains
> DT[,.SD[c(1,.N)],by=V2]           Select the first and last row grouped by V2
> DT[,lapply(.SD,sum),by=V2]        Calculate sum of columns in .SD grouped by V2
> DT[,lapply(.SD,sum),by=V2,        Calculate sum of V3 and V4 in .SD grouped by V2
       .SDcols=c("V3","V4")]
> DT[,lapply(.SD,sum),by=V2,        Calculate sum of V3 and V4 in .SD grouped by V2
       .SDcols=paste0("V",3:4)]
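A self-contained sketch of the DT[i, j, by] general form, on a table built in the spirit of the sheet's DT (values recycle to 12 rows):

```r
library(data.table)

DT <- data.table(V1 = c(1L, 2L),     # recycled over 12 rows
                 V2 = LETTERS[1:3],
                 V4 = 1:12)

# "Take DT, subset rows using i, then calculate j grouped by by":
# keep rows where V2 == "A", sum V4 within each V1 group
res <- DT[V2 == "A", .(V4.Sum = sum(V4)), by = V1]
```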
Subsetting Rows Using i
> DT[3:5,]                          Select 3rd to 5th row
> DT[3:5]                           Select 3rd to 5th row
> DT[V2=="A"]                       Select all rows that have value A in column V2
> DT[V2 %in% c("A","C")]            Select all rows that have value A or C in column V2

Manipulating on Columns in j
> DT[,V2]                           Return V2 as a vector
> DT[,.(V2,V3)]                     Return V2 and V3 as a data.table
> DT[,sum(V1)]                      Return the sum of all elements of V1 in a vector
> DT[,.(sum(V1),sd(V3))]            Return the sum of all elements of V1 and the
                                    std. dev. of V3 in a data.table
> DT[,.(Aggregate=sum(V1),          The same as the above, with new names
       Sd.V3=sd(V3))]
> DT[,.(V1,Sd.V3=sd(V3))]           Select column V1 and compute std. dev. of V3,
                                    which returns a single value and gets recycled
> DT[,.(print(V2),                  Print column V2 and plot V3
       plot(V3),
       NULL)]

Doing j by Group
> DT[,.(V4.Sum=sum(V4)),by=V1]      Calculate sum of V4 for every group in V1
> DT[,.(V4.Sum=sum(V4)),            Calculate sum of V4 for every group in V1 and V2
       by=.(V1,V2)]
> DT[,.(V4.Sum=sum(V4)),            Calculate sum of V4 for every group in sign(V1-1)
       by=sign(V1-1)]
> DT[,.(V4.Sum=sum(V4)),            The same as the above, with a new name
       by=.(V1.01=sign(V1-1))]      for the variable you're grouping by
> DT[1:5,.(V4.Sum=sum(V4)),by=V1]   Calculate sum of V4 for every group in V1
                                    after subsetting on the first 5 rows
> DT[,.N,by=V1]                     Count number of rows for every group in V1

Indexing And Keys
> setkey(DT,V2)                     A key is set on V2; output is returned invisibly
> DT["A"]                           Return all rows where the key column (set to V2)
                                    has the value A
> DT[c("A","C")]                    Return all rows where the key column (V2) has
                                    value A or C
> DT["A",mult="first"]              Return first row of all rows that match value A
                                    in key column V2
> DT["A",mult="last"]               Return last row of all rows that match value A
                                    in key column V2
> DT[c("A","D")]                    Return all rows where key column V2 has value
                                    A or D
> DT[c("A","D"),nomatch=0]          Return all rows where key column V2 has value
                                    A or D, dropping non-matches
> DT[c("A","C"),sum(V4)]            Return total sum of V4, for rows of key column
                                    V2 that have values A or C
> DT[c("A","C"),                    Return sum of column V4 for rows of V2 that have
       sum(V4),                     value A, and another sum for rows of V2 that
       by=.EACHI]                   have value C
> setkey(DT,V1,V2)                  Sort by V1 and then by V2 within each group of
                                    V1 (invisible)
> DT[.(2,"C")]                      Select rows that have value 2 for the first key
                                    (V1) and the value C for the second key (V2)
> DT[.(2,c("A","C"))]               Select rows that have value 2 for the first key
                                    (V1) and within those rows the value A or C for
                                    the second key (V2)

Chaining
> DT <- DT[,.(V4.Sum=sum(V4)),      Calculate sum of V4, grouped by V1
       by=V1]
> DT[V4.Sum>40]                     Select that group of which the sum is >40
> DT[,.(V4.Sum=sum(V4)),            Select that group of which the sum is >40
       by=V1][V4.Sum>40]            (chaining)
> DT[,.(V4.Sum=sum(V4)),            Calculate sum of V4, grouped by V1,
       by=V1][order(-V1)]           ordered on V1

set()-Family

set()
Syntax: for (i in from:to) set(DT, row, column, new value)
> rows <- list(3:4,5:6)
> cols <- 1:2
> for(i in seq_along(rows))         Sequence along the values of rows, and
  {set(DT,                          for the values of cols, set the values of
       i=rows[[i]],                 those elements equal to NA (invisible)
       j=cols[i],
       value=NA)}

setnames()
Syntax: setnames(DT,"old","new")[]
> setnames(DT,"V2","Rating")        Set name of V2 to Rating (invisible)
> setnames(DT,                      Change 2 column names (invisible)
       c("V2","V3"),
       c("V2.rating","V3.DC"))

setcolorder()
Syntax: setcolorder(DT,"neworder")
> setcolorder(DT,                   Change column ordering to contents
       c("V2","V1","V4","V3"))      of the specified vector (invisible)

DataCamp - Learn R for Data Science Interactively
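A runnable sketch of the chaining pattern above: group, then filter the grouped result in one expression (values chosen so the sums match the sheet's 36/42 example):

```r
library(data.table)

DT <- data.table(V1 = c(1L, 2L), V4 = 1:12)  # V1 recycles over 12 rows

# Chain: first aggregate by V1, then keep groups whose sum exceeds 40
big <- DT[, .(V4.Sum = sum(V4)), by = V1][V4.Sum > 40]
```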
Data Import with readr, tibble, and tidyr Cheat Sheet

R's tidyverse is built around tidy data stored in tibbles, an enhanced version of a data frame. The front side of this sheet shows how to read text files into R with readr. The reverse side shows how to create tibbles with tibble and to layout tidy data with tidyr.

Read functions - Read tabular data to tibbles

These functions share the common arguments:
read_*(file, col_names = TRUE, col_types = NULL, locale = default_locale(), na = c("", "NA"), quoted_na = TRUE, comment = "", trim_ws = TRUE, skip = 0, n_max = Inf, guess_max = min(1000, n_max), progress = interactive())

read_csv()
  Reads comma delimited files.  read_csv("file.csv")
read_csv2()
  Reads semicolon delimited files.  read_csv2("file2.csv")
read_delim(delim, quote = "\"", escape_backslash = FALSE, escape_double = TRUE)
  Reads files with any delimiter.  read_delim("file.txt", delim = "|")
read_fwf(col_positions)
  Reads fixed width files.  read_fwf("file.fwf", col_positions = c(1, 3, 5))
read_tsv()
  Reads tab delimited files. Also read_table().  read_tsv("file.tsv")

Useful arguments
Example file:  write_csv(x = read_csv("a,b,c\n1,2,3\n4,5,NA"), path = "file.csv")
Skip lines:    read_csv("file.csv", skip = 1)
No header:     read_csv("file.csv", col_names = FALSE)
Provide header: read_csv("file.csv", col_names = c("x", "y", "z"))
Read in a subset: read_csv("file.csv", n_max = 1)
Missing values: read_csv("file.csv", na = c("4", "5", "."))

Read non-tabular data
read_file(file, locale = default_locale())
  Read a file into a single string.
read_file_raw(file)
  Read a file into a raw vector.
read_lines(file, skip = 0, n_max = -1L, locale = default_locale(), na = character(), progress = interactive())
  Read each line into its own string.
read_lines_raw(file, skip = 0, n_max = -1L, progress = interactive())
  Read each line into a raw vector.
read_log(file, col_names = FALSE, col_types = NULL, skip = 0, n_max = -1, progress = interactive())
  Apache style log files.

Parsing data types

readr functions guess the types of each column and convert types when appropriate (but will NOT convert strings to factors automatically). A message shows the type of each column in the result:
## Parsed with column specification:
## cols(
##   age = col_integer(),      <- age is an integer
##   sex = col_character(),    <- sex is a character
##   earn = col_double()       <- earn is a double (numeric)
## )

1. Use problems() to diagnose problems.
   x <- read_csv("file.csv"); problems(x)

2. Use a col_ function to guide parsing.
   • col_guess() - the default
   • col_character()
   • col_double()
   • col_euro_double()
   • col_datetime(format = "") Also col_date(format = "") and col_time(format = "")
   • col_factor(levels, ordered = FALSE)
   • col_integer()
   • col_logical()
   • col_number()
   • col_numeric()
   • col_skip()
   x <- read_csv("file.csv", col_types = cols(
     A = col_double(),
     B = col_logical(),
     C = col_factor()
   ))

3. Else, read in as character vectors then parse with a parse_ function.
   • parse_guess(x, na = c("", "NA"), locale = default_locale())
   • parse_character(x, na = c("", "NA"), locale = default_locale())
   • parse_datetime(x, format = "", na = c("", "NA"), locale = default_locale()) Also parse_date() and parse_time()
   • parse_double(x, na = c("", "NA"), locale = default_locale())
   • parse_factor(x, levels, ordered = FALSE, na = c("", "NA"), locale = default_locale())
   • parse_integer(x, na = c("", "NA"), locale = default_locale())
   • parse_logical(x, na = c("", "NA"), locale = default_locale())
   • parse_number(x, na = c("", "NA"), locale = default_locale())
   x$A <- parse_number(x$A)

Other types of data
Try one of the following packages to import other types of files:
• haven - SPSS, Stata, and SAS files
• readxl - excel files (.xls and .xlsx)
• DBI - databases
• jsonlite - json
• xml2 - XML
• httr - Web APIs
• rvest - HTML (Web Scraping)

Write functions

Save x, an R object, to path, a file path, with:
write_csv(x, path, na = "NA", append = FALSE, col_names = !append)
  Tibble/df to comma delimited file.
write_delim(x, path, delim = " ", na = "NA", append = FALSE, col_names = !append)
  Tibble/df to file with any delimiter.
write_excel_csv(x, path, na = "NA", append = FALSE, col_names = !append)
  Tibble/df to a CSV for excel.
write_file(x, path, append = FALSE)
  String to file.
write_lines(x, path, na = "NA", append = FALSE)
  String vector to file, one element per line.
write_rds(x, path, compress = c("none", "gz", "bz2", "xz"), ...)
  Object to RDS file.
write_tsv(x, path, na = "NA", append = FALSE, col_names = !append)
  Tibble/df to tab delimited files.
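A minimal sketch of reading and parsing with readr. read_csv() accepts a literal string containing newlines, which makes the example self-contained; the dollar/percent strings are illustrative inputs for parse_number():

```r
library(readr)

# read_csv() treats a string with embedded "\n" as inline data
x <- read_csv("a,b,c\n1,2,3\n4,5,NA")

# parse_number() extracts the numeric part of messy character data
n <- parse_number(c("$1,200", "45%"))
```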
RStudio® is a trademark of RStudio, Inc. • CC BY RStudio info@rstudio.com • 844-448-1212 • rstudio.com Learn more at browseVignettes(package = c("readr", "tibble", "tidyr")) • readr 1.1.0 • tibble 1.2.12 • tidyr 0.6.0 • Updated: 2017-01
Tibbles - an enhanced data frame

The tibble package provides a new S3 class for storing tabular data, the tibble. Tibbles inherit the data frame class, but improve three behaviors:
• Display - When you print a tibble, R provides a concise view of the data that fits on one screen. (A large data frame prints until it reaches getOption("max.print") and omits the remaining rows.)
• Subsetting - [ always returns a new tibble; [[ and $ always return a vector.
• No partial matching - You must use full column names when subsetting.

# A tibble: 234 × 6
   manufacturer      model displ
          <chr>      <chr> <dbl>
1          audi         a4   1.8
2          audi         a4   1.8
3          audi         a4   2.0
# ... with 224 more rows, and 3
#   more variables: year <int>,
#   cyl <int>, trans <chr>

Tidy Data with tidyr

Tidy data is a way to organize tabular data. It provides a consistent data structure across packages. A table is tidy if: each variable is in its own column, and each observation, or case, is in its own row. Tidy data makes variables easy to access as vectors and preserves cases during vectorized operations.

Reshape Data - change the layout of values in a table

Use gather() and spread() to reorganize the values of a table into a new layout. Each uses the idea of a key column: value column pair.

gather(data, key, value, ..., na.rm = FALSE, convert = FALSE, factor_key = FALSE)
  Gather moves column names into a key column, gathering the column values into a single value column.

spread(data, key, value, fill = NA, convert = FALSE, drop = TRUE, sep = NULL)
  Spread moves the unique values of a key column into the column names, spreading the values of a value column across the new columns that result.

Split and Combine Cells

Use these functions to split or combine cells into individual, isolated values.

separate(data, col, into, sep = "[^[:alnum:]]+", remove = TRUE, convert = FALSE, extra = "warn", fill = "warn", ...)
  Separate each cell in a column to make several columns, e.g. splitting table3's rate column "0.7K/19M" into cases "0.7K" and pop "19M":
  separate(table3, rate, into = c("cases", "pop"))

separate_rows(data, ..., sep = "[^[:alnum:].]+", convert = FALSE)
  Separate each cell in a column to make several rows. Also separate_rows_().

tibble display
• Control the default appearance with options:
options(tibble.print_max = n, C 1999 cases 212K
A 1999 0.7K/19M A 1999 0.7K
C 1999 pop 1T
tibble.print_min = m, tibble.width = Inf) C 2000 cases 213K A 2000 2K/20M A 1999 19M
C 2000 pop 1T B 1999 37K/172M A 2000 2K
• View entire data set with View(x, title) or B 2000 80K/174M A 2000 20M
gather(table4a, `1999`, `2000`, key value
glimpse(x, width = NULL, …) C 1999 212K/1T B 1999 37K
key = "year", value = "cases") spread(table2, type, count) C 2000 213K/1T B 1999 172M
• Revert to data frame with as.data.frame() B 2000 80K
(required for some older packages) B 2000 174M
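A minimal round trip with gather() and spread(), using a toy table in the spirit of table4a (the values are illustrative):

```r
library(tidyr)

table4a <- data.frame(country = c("A", "B"),
                      `1999` = c(0.7, 37), `2000` = c(2, 80),
                      check.names = FALSE)
# wide -> long: the year column names become values of the key column
long <- gather(table4a, `1999`, `2000`, key = "year", value = "cases")
# long -> wide: spreading the key back out recovers the original layout
wide <- spread(long, year, cases)
```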
Handle Missing Values

drop_na(data, ...)
Drop rows containing NA's in ... columns.
drop_na(x, x2)

fill(data, ..., .direction = c("down", "up"))
Fill in NA's in ... columns with the most recent non-NA values.
fill(x, x2)

replace_na(data, replace = list(), ...)
Replace NA's by column.
replace_na(x, list(x2 = 2))

Expand Tables - quickly create tables with combinations of values

complete(data, ..., fill = list())
Adds to the data missing combinations of the values of the variables listed in ...
complete(mtcars, cyl, gear, carb)

expand(data, ...)
Create a new tibble with all possible combinations of the values of the variables listed in ...
expand(mtcars, cyl, gear, carb)

Construct a tibble in two ways

tibble(…)
Construct by columns.
tibble(x = 1:3, y = c("a", "b", "c"))

tribble(…)
Construct by rows.
tribble(~x, ~y,
         1, "a",
         2, "b",
         3, "c")

Both make this tibble:
# A tibble: 3 × 2
      x     y
  <int> <chr>
1     1     a
2     2     b
3     3     c

as_tibble(x, …) Convert data frame to tibble.
enframe(x, name = "name", value = "value") Converts a named vector to a tibble with a names column and a values column.
is_tibble(x) Test whether x is a tibble.

unite(data, col, ..., sep = "_", remove = TRUE)
Collapse cells across several columns to make a single column.
unite(table5, century, year, col = "year", sep = "")
Combines the century and year columns (e.g. "19" and "99") into a single year column ("1999").

separate_rows(table3, rate)
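A separate()/unite() round trip on a toy version of table5; the data values are illustrative:

```r
library(tidyr)

table5 <- data.frame(country = "Afghan", century = "19", year = "99",
                     stringsAsFactors = FALSE)
# combine the two columns into one, with no separator
united <- unite(table5, century, year, col = "year", sep = "")  # year = "1999"
# split the combined column back apart at character position 2
separate(united, year, into = c("century", "year"), sep = 2)
```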
R Programming Cheat Sheet
just the basics
Created by: Arianne Colton and Sean Chen

General
• R version 3.0 and greater adds support for 64 bit integers
• R is case sensitive
• R index starts from 1

Help
help(functionName) or ?functionName
Help Home Page: help.start()
Special Character Help: help('[')
Search Help: help.search(..) or ??..
Search Function with Partial Name: apropos('mea')
See Example(s): example(topic)

Objects in current environment
Display Object Name: objects() or ls()
Remove Object: rm(object1, object2, ..)
Notes:
1. Names starting with a period are accessible but invisible, so they will not be found by 'ls'
2. To guarantee memory removal, use 'gc', releasing unused memory to the OS. R performs automatic 'gc' periodically

Symbol Name Environment
• If multiple packages use the same function name, the function from the package loaded last gets called.
• To avoid this, precede the function with the name of the package, e.g. packageName::functionName(..)

Library
Only trust reliable R packages, i.e., 'ggplot2' for plotting, 'sp' for dealing with spatial data, 'reshape2', 'survival', etc.
Load Package: library(packageName) or require(packageName)
Unload Package: detach(packageName)
Note: require() returns the status (TRUE/FALSE)

Manipulating Strings
Putting Strings Together: paste('string1', 'string2', sep = '/') # separator ('sep') is a space by default
paste(c('1', '2'), collapse = '/') # returns '1/2'
Split String: stringr::str_split(string = v1, pattern = '-') # returns a list
Get Substring: stringr::str_sub(string = v1, start = 1, end = 3)
Match String: isJohnFound <- stringr::str_detect(string = df1$col1, pattern = ignore.case('John')) # returns TRUE/FALSE if John was found
df1[isJohnFound, c('col1', ...)]

Data Types
Check data type: class(variable)

Four basic data types:
1. Numeric - includes float/double, int, etc.
   is.numeric(variable)
2. Character (string)
   nchar(variable) # length of a character or numeric
3. Date/POSIXct
   • Date: stores just a date. In numeric form, number of days since 1/1/1970.
     date1 <- as.Date('2012-06-28'); as.numeric(date1)
   • POSIXct: stores a date and time. In numeric form, number of seconds since 1/1/1970.
     date2 <- as.POSIXct('2012-06-28 18:00')
   Note: Use the 'lubridate' and 'chron' packages to work with Dates
4. Logical
   • (TRUE = 1, FALSE = 0)
   • Use ==/!= to test equality and inequality
   as.numeric(TRUE) => 1

Data Structures

Vector
• Group of elements of the SAME type
• R is a vectorized language: operations are applied to each element of the vector automatically
• R has no concept of column vectors or row vectors
• Special vectors: letters and LETTERS, which contain lower-case and upper-case letters
Create Vector: v1 <- c(1, 2, 3)
Get Length: length(v1)
Check if All or Any is True: all(v1); any(v1)
Integer Indexing: v1[1:3]; v1[c(1,6)]
Boolean Indexing: v1[is.na(v1)] <- 0
Naming: c(first = 'a', ..) or names(v1) <- c('first', ..)

Factor
• as.factor(v1) gets you the levels, which is the number of unique values
• Factors can reduce the size of a variable because they only store unique values, but could be buggy if not used properly

List
Store any number of items of ANY type
Create List: list1 <- list(first = 'a', ...)
Create Empty List: vector(mode = 'list', length = 3)
Get Element: list1[[1]] or list1[['first']]
Append Using Numeric Index: list1[[6]] <- 2
Append Using Name: list1[['newElement']] <- 2
Note: repeatedly appending to a list, vector, data.frame etc. is expensive; it is best to create a list of a certain size, then fill it.

data.frame
• Each column is a variable, each row is an observation
• Internally, each column is a vector
• idata.frame is a data structure that creates a reference to a data.frame, therefore no copying is performed
Create Data Frame: df1 <- data.frame(col1 = v1, col2 = v2, v3)
Dimension: nrow(df1); ncol(df1); dim(df1)
Get/Set Column Names: names(df1); names(df1) <- c(...)
Get/Set Row Names: rownames(df1); rownames(df1) <- c(...)
Preview: head(df1, n = 10); tail(...)
Get Data Type: class(df1) # is data.frame
Index by Column(s): df1['col1'] or df1[1];† df1[c('col1', 'col3')] or df1[c(1, 3)]
Index by Rows and Columns: df1[c(1, 3), 2:3] # returns data from rows 1 & 3, columns 2 to 3
† Index method: df1$col1 or df1[, 'col1'] or df1[, 1] returns a vector. To return a single-column data.frame while using single-square brackets, use 'drop': df1[, 'col1', drop = FALSE]

data.table
What is a data.table
• Extends and enhances the functionality of data.frames
Differences: data.table vs. data.frame
• By default data.frame turns character data into factors, while data.table does not
• When you print data.frame data, all data prints to the console; with a data.table, it intelligently prints the first and last five rows
• Key Difference: data.tables are fast because they have an index like a database. I.e., this search, dt1$col1 > number, does a sequential scan (vector scan). After you create a key for this, it will be much faster via binary search.
Create data.table from data.frame: data.table(df1)
Index by Column(s)*: dt1[, 'col1', with = FALSE] or dt1[, list(col1)]
Show info for each data.table in memory (i.e., size, ...): tables()
Show Keys in data.table: key(dt1)
Create index for col1 and reorder data according to col1: setkey(dt1, col1)
Use Key to Select Data: dt1[c('col1Value1', 'col1Value2'), ]
Multiple Key Select: dt1[J('1', c('2', '3')), ]
Aggregation**:
dt1[, list(col1 = mean(col1)), by = col2]
dt1[, list(col1 = mean(col1), col2Sum = sum(col2)), by = list(col3, col4)]
* Accessing columns must be done via a list of actual names, not as characters. If column names are characters, then the "with" argument should be set to FALSE.
** Aggregate and d*ply functions will work, but the built-in aggregation functionality of data.table is faster

matrix
• Similar to data.frame except every element must be the SAME type, most commonly all numerics
• Functions that work with data.frame should work with matrix as well
Create Matrix: matrix1 <- matrix(1:10, nrow = 5) # fills rows 1 to 5, column 1 with 1:5, and column 2 with 6:10
Matrix Multiplication: matrix1 %*% t(matrix2) # where t() is transpose

array
• Multidimensional vector of the SAME type
• array1 <- array(1:12, dim = c(2, 3, 2))
• Using arrays is not recommended
• Matrices are restricted to two dimensions while an array can have any dimension
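A short sketch of the keyed data.table workflow described above; the table and column names are made up for illustration:

```r
library(data.table)

dt1 <- data.table(col1 = c("a", "b", "a"), col2 = c(10, 20, 30))
setkey(dt1, col1)       # build the index; rows are reordered by col1
dt1["a"]                # keyed lookup: binary search, not a vector scan
dt1[, list(m = mean(col2)), by = col1]   # aggregation by the key column
```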
Data Munging

Apply (apply, tapply, lapply, mapply)
• Apply - most restrictive. Must be used on a matrix; all elements must be the same type
• If used on some other object, such as a data.frame, it will be converted to a matrix first
apply(matrix1, 1, function) # 1 = rows, 2 = columns; if rows, then pass each row as input to the function
• By default, computation on NA (missing data) always returns NA, so if a matrix contains NAs, you can ignore them (use na.rm = TRUE in the apply(..), which doesn't pass NAs to your function)

lapply
Applies a function to each element of a list and returns the results as a list

sapply
Same as lapply except it returns the results as a vector
Note: lapply & sapply can both take a vector as input; a vector is technically a form of list

Aggregate (SQL GROUP BY)
• aggregate(formulas, data, function)
• Formulas: y ~ x, where y represents a variable that we want to make a calculation on, and x represents one or more variables we want to group the calculation by
• Can only use one function in aggregate(). To apply more than one function, use the plyr package
In the example below diamonds is a data.frame; price, cut, color etc. are columns of diamonds.
aggregate(price ~ cut, diamonds, mean) # get the average price of different cuts for the diamonds
aggregate(price ~ cut + color, diamonds, mean) # group by cut and color
aggregate(cbind(price, carat) ~ cut, diamonds, mean) # get the average price and average carat of different cuts

plyr ('split-apply-combine')
• ddply(), llply(), ldply(), etc. (1st letter = the type of input, 2nd = the type of output)
• plyr can be slow; most of the functionality in plyr can be accomplished using base functions or other packages, but plyr is easier to use
ddply: takes a data.frame, splits it according to some variable(s), performs a desired action on it and returns a data.frame
llply: can use this instead of lapply
• For sapply, can use laply ('a' is array/vector/matrix); however, the laply result does not include the names
Helper function: each() - supply multiple functions to a function like aggregate
aggregate(price ~ cut, diamonds, each(mean, median))

dplyr (for data.frame ONLY)
• Basic functions: filter(), slice(), arrange(), select(), rename(), distinct(), mutate(), summarise(), group_by(), sample_n()
• Chain functions:
df1 %>% group_by(year, month) %>% select(col1, col2) %>% summarise(col1mean = mean(col1))
• Much faster than plyr, with four types of easy-to-use joins (inner, left, semi, anti)
• Abstracts the way data is stored so you can work with data frames, data tables, and remote databases with the same set of functions

Functions and Controls
Create Function: say_hello <- function(first, last = 'hola') { }
Call Function: say_hello(first = 'hello')
• R automatically returns the value of the last line of code in a function. This is bad practice. Use return() explicitly instead.
• do.call() - specify the name of a function either as a string (i.e. 'mean') or as an object (i.e. mean) and provide arguments as a list.
do.call(mean, args = list(first = '1st'))

if / else / else if / switch
                                              if { } else   ifelse
Works with Vectorized Argument                No            Yes
Most Efficient for Non-Vectorized Argument    Yes           No
Works with NA *                               No            Yes
Use &&, || **†                                Yes           No
Use &, | ***†                                 No            Yes
* NA == 1 results in NA, thus if won't work, it'll be an error. For ifelse, NA will be returned instead
** &&, || are best used in if, since they only compare the first element of the vector from each side
*** &, | are necessary for ifelse, as they compare every element of the vector from each side
† &&, || are similar to if in that they don't work with vectors, whereas ifelse, &, | work with vectors
• Similar to C++/Java: for &, |, both sides of the operator are always checked. For &&, ||, if the left side fails, there is no need to check the right side.
• } else: else must be on the same line as }

Load Data from CSV
• Read csv: read.table(file = url or filepath, header = TRUE, sep = ',')
• The "stringsAsFactors" argument defaults to TRUE; set it to FALSE to prevent converting columns to factors. This saves computation time and maintains character data
• Other useful arguments are "quote" and "colClasses", specifying the character used for enclosing cells and the data type for each column.
• If the cell separator has been used inside a cell, then use read.csv2() or read.delim2() instead of read.table()

Database
Connect to Database: db1 <- RODBC::odbcConnect('conStr')
Query Database: df1 <- RODBC::sqlQuery(db1, 'SELECT ..', stringsAsFactors = FALSE)
Close Connection: RODBC::odbcClose(db1)
• Only one connection may be open at a time. The connection automatically closes if R closes or another connection is opened.
• If a table name has a space, use [ ] to surround the table name in the SQL string.
• which() in R is similar to 'where' in SQL

Included Data
R and some packages come with data included.
List Available Datasets: data()
List Available Datasets in a Specific Package: data(package = 'ggplot2')

Missing Data (NA and NULL)
NULL is not missing, it's nothingness. NULL is atomical and cannot exist within a vector. If used inside a vector, it simply disappears.
Check Missing Data: is.na()
Avoid Using: is.null()

Data Reshaping

Rearrange
Melt Data - from column to row:
reshape2::melt(df1, id.vars = c('col1', 'col2'), variable.name = 'newCol1', value.name = 'newCol2')
Cast Data - from row to column:
reshape2::dcast(df1, col1 + col2 ~ newCol1, value.var = 'newCol2')
If df1 has 3 more columns, col3 to col5, 'melting' creates a new df that has 3 rows for each combination of col1 and col2, with the values coming from the respective col3 to col5.

Combine (multiple sets into one)
1. cbind - bind by columns
   data.frame from two vectors: cbind(v1, v2)
   data.frame combining df1 and df2 columns: cbind(df1, df2)
2. rbind - similar to cbind but for rows; you can assign new column names to vectors in cbind
   cbind(col1 = v1, ...)
3. Joins - (merge, join, data.table) using common keys
   3.1 Merge
   • by.x and by.y specify the key columns to use in the join operation
   • Merge can be much slower than the alternatives
   merge(x = df1, y = df2, by.x = c('col1', 'col3'), by.y = c('col3', 'col6'))
   3.2 Join
   • join in the plyr package works similar to merge but much faster; the drawback is that key columns in each table must have the same name
   • join() has an argument for specifying left, right, inner joins
   join(x = df1, y = df2, by = c('col1', 'col3'))
   3.3 data.table
   dt1 <- data.table(df1, key = c('1', '2')); dt2 <- ...‡
   • Left Join: dt1[dt2]
   ‡ A data.table join requires specifying the keys for the data tables

Graphics

Default basic graphic
hist(df1$col1, main = 'title', xlab = 'x axis label')
plot(col2 ~ col1, data = df1), aka y ~ x, or plot(x, y)

Lattice and ggplot2 (more popular)
• Initialize the object and add layers (points, lines, histograms) using +; map variables in the data to an axis or aesthetic using 'aes'
ggplot(data = df1) + geom_histogram(aes(x = col1))
• Normalized histogram (pdf, not a relative frequency histogram):
ggplot(data = df1) + geom_density(aes(x = col1), fill = 'grey50')

Created by Arianne Colton and Sean Chen
data.scientist.info@gmail.com
Based on content from 'R for Everyone' by Jared Lander
Updated: December 2, 2015
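To illustrate the NA behavior of apply() described above, using a small toy matrix:

```r
m <- matrix(c(1, 2, NA, 4, 5, 6), nrow = 2)
apply(m, 1, sum)                 # a row containing NA sums to NA
apply(m, 1, sum, na.rm = TRUE)   # NAs are dropped before summing
```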
Base R Cheat Sheet

Getting Help
Accessing the help files
?mean Get help on a particular function.
help.search('weighted mean') Search the help files for a word or phrase.
help(package = 'dplyr') Find help for a package.
More about an object
str(iris) Get a summary of an object's structure.
class(iris) Find the class an object belongs to.

Using Packages
install.packages('dplyr') Download and install a package from CRAN.
library(dplyr) Load the package into the session, making all its functions available to use.
dplyr::select Use a particular function from a package.
data(iris) Load a built-in dataset into the environment.

Working Directory
getwd() Find the current working directory (where inputs are found and outputs are sent).
setwd('C://file/path') Change the current working directory.
Use projects in RStudio to set the working directory to the folder you are working in.

Vectors
Creating Vectors
c(2, 4, 6) Join elements into a vector: 2 4 6
2:6 An integer sequence: 2 3 4 5 6
seq(2, 3, by = 0.5) A complex sequence: 2.0 2.5 3.0
rep(1:2, times = 3) Repeat a vector: 1 2 1 2 1 2
rep(1:2, each = 3) Repeat elements of a vector: 1 1 1 2 2 2

Vector Functions
sort(x) Return x sorted.
rev(x) Return x reversed.
table(x) See counts of values.
unique(x) See unique values.

Selecting Vector Elements
By Position
x[4] The fourth element.
x[-4] All but the fourth.
x[2:4] Elements two to four.
x[-(2:4)] All elements except two to four.
x[c(1, 5)] Elements one and five.
By Value
x[x == 10] Elements which are equal to 10.
x[x < 0] All elements less than zero.
x[x %in% c(1, 2, 5)] Elements in the set 1, 2, 5.
Named Vectors
x['apple'] Element with name 'apple'.

Conditions
a == b Are equal
a != b Not equal
a > b Greater than
a < b Less than
a >= b Greater than or equal to
a <= b Less than or equal to
is.na(a) Is missing
is.null(a) Is null

Programming
For Loop
for (variable in sequence){
  Do something
}
Example
for (i in 1:4){
  j <- i + 10
  print(j)
}
While Loop
while (condition){
  Do something
}
Example
while (i < 5){
  print(i)
  i <- i + 1
}
If Statements
if (condition){
  Do something
} else {
  Do something different
}
Example
if (i > 3){
  print('Yes')
} else {
  print('No')
}
Functions
function_name <- function(var){
  Do something
  return(new_variable)
}
Example
square <- function(x){
  squared <- x*x
  return(squared)
}

Reading and Writing Data  Also see the readr package.
df <- read.table('file.txt') / write.table(df, 'file.txt') Read and write a delimited text file.
df <- read.csv('file.csv') / write.csv(df, 'file.csv') Read and write a comma separated value file. This is a special case of read.table/write.table.
load('file.RData') / save(df, file = 'file.RData') Read and write an R data file, a file type special for R.

RStudio® is a trademark of RStudio, Inc. • CC BY Mhairi McNeill • mhairihmcneill@gmail.com Learn more at web page or vignette • package version • Updated: 3/15
Types
Converting between common data types in R. You can always go from a higher value in the table to a lower value.
as.logical  TRUE, FALSE, TRUE  Boolean values (TRUE or FALSE).
as.numeric  1, 0, 1  Integers or floating point numbers.
as.character  '1', '0', '1'  Character strings. Generally preferred to factors.
as.factor  '1', '0', '1', levels: '1', '0'  Character strings with preset levels. Needed for some statistical models.

Maths Functions
log(x) Natural log.
exp(x) Exponential.
max(x) Largest element.
min(x) Smallest element.
round(x, n) Round to n decimal places.
signif(x, n) Round to n significant figures.
cor(x, y) Correlation.
sum(x) Sum.
mean(x) Mean.
median(x) Median.
quantile(x) Percentage quantiles.
rank(x) Rank of elements.
var(x) The variance.
sd(x) The standard deviation.

Variable Assignment
> a <- 'apple'
> a
[1] 'apple'

The Environment
ls() List all variables in the environment.
rm(x) Remove x from the environment.
rm(list = ls()) Remove all variables from the environment.
You can use the environment panel in RStudio to browse variables in your environment.

Matrices
m <- matrix(x, nrow = 3, ncol = 3) Create a matrix from x.
m[2, ] Select a row.
m[ , 1] Select a column.
m[2, 3] Select an element.
t(m) Transpose.
m %*% n Matrix Multiplication.
solve(m, n) Find x in: m * x = n.

Lists
l <- list(x = 1:5, y = c('a', 'b'))
A list is a collection of elements which can be of different types.
l[[2]] Second element of l.
l[1] New list with only the first element.
l$x Element named x.
l['y'] New list with only element named y.

Data Frames
df <- data.frame(x = 1:3, y = c('a', 'b', 'c'))
A special case of a list where all elements are the same length.
List subsetting: df$x; df[[2]]
Matrix subsetting: df[ , 2]; df[2, ]; df[2, 2]
Understanding a data frame
View(df) See the full data frame.
head(df) See the first 6 rows.
nrow(df) Number of rows.
ncol(df) Number of columns.
dim(df) Number of columns and rows.
cbind - Bind columns. rbind - Bind rows.

Strings  Also see the stringr package.
paste(x, y, sep = ' ') Join multiple vectors together.
paste(x, collapse = ' ') Join elements of a vector together.
grep(pattern, x) Find regular expression matches in x.
gsub(pattern, replace, x) Replace matches in x with a string.
toupper(x) Convert to uppercase.
tolower(x) Convert to lowercase.
nchar(x) Number of characters in a string.

Factors
factor(x) Turn a vector into a factor. Can set the levels of the factor and the order.
cut(x, breaks = 4) Turn a numeric vector into a factor by 'cutting' into sections.

Statistics
lm(y ~ x, data = df) Linear model.
glm(y ~ x, data = df) Generalised linear model.
summary Get more detailed information out of a model.
t.test(x, y) Perform a t-test for a difference between means.
pairwise.t.test Perform a t-test for paired data.
prop.test Test for a difference between proportions.
aov Analysis of variance.

Distributions
            Random    Density   Cumulative    Quantile
            Variates  Function  Distribution
Normal      rnorm     dnorm     pnorm         qnorm
Poisson     rpois     dpois     ppois         qpois
Binomial    rbinom    dbinom    pbinom        qbinom
Uniform     runif     dunif     punif         qunif

Plotting  Also see the ggplot2 package.
plot(x) Values of x in order.
plot(x, y) Values of x against y.
hist(x) Histogram of x.

Dates  See the lubridate package.
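The d/p/q/r naming pattern in the distributions table can be sketched with the normal distribution:

```r
pnorm(1.96)          # cumulative probability P(Z <= 1.96), about 0.975
qnorm(0.975)         # the matching quantile, about 1.96 (inverse of pnorm)
rnorm(3, mean = 0)   # three random variates from N(0, 1)
dnorm(0)             # density at 0, about 0.399
```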

R Reference Card 2.0
Public domain, v2.0 2012-12-24.
V 2 by Matt Baggott, matt@baggott.net
V 1 by Tom Short, t.short@ieee.org
Material from R for Beginners by permission of Emmanuel Paradis.

Getting help and info
help(topic) documentation on topic
?topic same as above; special chars need quotes: for example ?'&&'
help.search("topic") search the help system; same as ??topic
apropos("topic") the names of all objects in the search list matching the regular expression "topic"
help.start() start the HTML version of help
summary(x) generic function to give a "summary" of x, often a statistical one
str(x) display the internal structure of an R object
ls() show objects in the search path; specify pat="pat" to search on a pattern
ls.str() str for each variable in the search path
dir() show files in the current directory
methods(x) shows S3 methods of x
methods(class=class(x)) lists all the methods to handle objects of class x
findFn() searches a database of help packages for functions and returns a data.frame (sos)

Other R References
CRAN task views are summaries of R resources for task domains at: cran.r-project.org/web/views; can be accessed via the ctv package
R FAQ: cran.r-project.org/doc/FAQ/R-FAQ.html
R Functions for Regression Analysis, by Vito Ricci: cran.r-project.org/doc/contrib/Ricci-refcard-regression.pdf
R Functions for Time Series Analysis, by Vito Ricci: cran.r-project.org/doc/contrib/Ricci-refcard-ts.pdf
R Reference Card for Data Mining, by Yanchang Zhao: www.rdatamining.com/docs/R-refcard-data-mining.pdf
R Reference Card, by Jonathan Baron: cran.r-project.org/doc/contrib/refcard.pdf

Operators
<- Left assignment, binary
-> Right assignment, binary
= Left assignment, but not recommended
<<- Left assignment in outer lexical scope; not for beginners
$ List subset, binary
- Minus, can be unary or binary
+ Plus, can be unary or binary
~ Tilde, used for model formulae
: Sequence, binary (in model formulae: interaction)
:: Refer to a function in a package, i.e., pkg::function; usually not needed
* Multiplication, binary
/ Division, binary
^ Exponentiation, binary
%x% Special binary operators, x can be replaced by any valid name
%% Modulus, binary
%/% Integer divide, binary
%*% Matrix product, binary
%o% Outer product, binary
%x% Kronecker product, binary
%in% Matching operator, binary (in model formulae: nesting)
!x logical negation, NOT x
x & y elementwise logical AND
x && y logical AND, compares only the first element of each side
x | y elementwise logical OR
x || y logical OR, compares only the first element of each side
xor(x, y) elementwise exclusive OR
< Less than, binary
> Greater than, binary
== Equal to, binary
>= Greater than or equal to, binary
<= Less than or equal to, binary

Packages
install.packages("pkgs", lib) download and install pkgs from repository (lib) or other external source
update.packages checks for new versions and offers to install
library(pkg) loads pkg; if pkg is omitted it lists packages
detach("package:pkg") removes pkg from memory

Indexing vectors
x[n] nth element
x[-n] all but the nth element
x[1:n] first n elements
x[-(1:n)] elements from n+1 to end
x[c(1,4,2)] specific elements
x["name"] element named "name"
x[x > 3] all elements greater than 3
x[x > 3 & x < 5] all elements between 3 and 5
x[x %in% c("a","if")] elements in the given set

Indexing lists
x[n] list with elements n
x[[n]] nth element of the list
x[["name"]] element named "name"
x$name as above (w. partial matching)

Indexing matrices
x[i,j] element at row i, column j
x[i,] row i
x[,j] column j
x[,c(1,3)] columns 1 and 3
x["name",] row named "name"

Indexing data frames (same as matrices plus the following)
x[["name"]] column named "name"
x$name as above (w. partial matching)

Input and output (I/O)

R data object I/O
data(x) loads specified data set; if no arg is given it lists all available data sets
save(file, ...) saves the specified objects (...) in XDR platform-independent binary format
save.image(file) saves all objects
load(file) load datasets written with save

Database I/O
Useful packages: DBI interface between R and relational DBMS; RJDBC access to databases through the JDBC interface; RMySQL interface to MySQL database; RODBC ODBC database access; ROracle Oracle database interface driver; RpgSQL interface to PostgreSQL database; RSQLite SQLite interface for R
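A brief sketch of the indexing rules in the tables above, with made-up data:

```r
x <- list(a = 1:3, b = "hi")
x[1]        # a list containing only the first element
x[["a"]]    # the element itself: 1 2 3

m <- matrix(1:6, nrow = 2)  # filled column by column
m[2, ]      # row 2 as a vector: 2 4 6
m[, c(1, 3)]  # columns 1 and 3 as a matrix
```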

Other file I/O  array(x,dim=)  array with data x; specify Data selection and manipulation 
read.table(file), read.csv(file),  dimensions like dim=c(3,4,2); elements of x which.max(x), which.min(x) returns the index of
read.delim(“file”), read.fwf(“file”) read a recycle if x is not long enough the greatest/smallest element of x
file using defaults sensible for a matrix(x,nrow,ncol) matrix; elements of x recycle rev(x) reverses the elements of x
table/csv/delimited/fixed-width file and create a factor(x,levels) encodes a vector x as a factor sort(x) sorts the elements of x in increasing order; to
data frame from it. gl(n, k, length=n*k, labels=1:n) generate levels sort in decreasing order: rev(sort(x))
write.table(x,file), write.csv(x,file) saves x after (factors) by specifying the pattern of their levels; cut(x,breaks) divides x into intervals (factors); breaks
converting to a data frame k is the number of levels, and n is the number of is the number of cut intervals or a vector of cut
txtStart and txtStop: saves a transcript of replications points
commands and/or output to a text file expand.grid() a data frame from all combinations of match(x, y) returns a vector of the same length as x
(TeachingDemos) the supplied vectors or factors with the elements of x that are in y (NA
download.file(url) from internet   otherwise)
url.show(url) remote input Data conversion  which(x == a) returns a vector of the indices of x if
cat(..., file="", sep=" ") prints the arguments after coercing to character; sep is the character separator between arguments
print(x, ...) prints its arguments; generic, meaning it can have different methods for different objects
format(x, ...) format an R object for pretty printing
sink(file) output to file, until sink()

Clipboard I/O
File connections can also be used to read and write to the clipboard instead of a file.
Mac OS: x <- read.delim(pipe("pbpaste"))
Windows: x <- read.delim("clipboard")
See also read.clipboard (psych)

Data creation
c(...) generic function to combine arguments with the default forming a vector; with recursive=TRUE descends through lists combining all elements into one vector
from:to generates a sequence; ":" has operator priority; 1:4 + 1 is "2,3,4,5"
seq(from,to) generates a sequence; by= specifies increment; length= specifies desired length
seq(along=x) generates 1, 2, ..., length(x); useful in for loops
rep(x,times) replicate x times; use each= to repeat each element of x each times; rep(c(1,2,3),2) is 1 2 3 1 2 3; rep(c(1,2,3),each=2) is 1 1 2 2 3 3
data.frame(...) create a data frame of the named or unnamed arguments; data.frame(v=1:4, ch=c("a","B","c","d"), n=10); shorter vectors are recycled to the length of the longest
list(...) create a list of the named or unnamed arguments; list(a=c(1,2), b="hi", c=3)

as.array(x), as.character(x), as.data.frame(x), as.factor(x), as.logical(x), as.numeric(x) convert type; for a complete list, use methods(as)

Data information
is.na(x), is.null(x), is.nan(x); is.array(x), is.data.frame(x), is.numeric(x), is.complex(x), is.character(x); for a complete list, use methods(is)
x prints x
head(x), tail(x) returns first or last parts of an object
summary(x) generic function to give a summary
str(x) display internal structure of the data
length(x) number of elements in x
dim(x) retrieve or set the dimension of an object; dim(x) <- c(3,2)
dimnames(x) retrieve or set the dimension names of an object
nrow(x), ncol(x) number of rows/cols; NROW(x), NCOL(x) is the same but treats a vector as a one-row/col matrix
class(x) get or set the class of x; class(x) <- "myclass"
unclass(x) removes the class attribute of x
attr(x,which) get or set the attribute which of x
attributes(obj) get or set the list of attributes of obj

Data selection and manipulation
which(x == a) returns a vector of the indices of x for which the comparison operation is true (TRUE), in this example the values of i for which x[i] == a (the argument of this function must be a variable of mode logical)
choose(n, k) computes the combinations of k events among n repetitions = n!/[(n − k)!k!]
na.omit(x) suppresses the observations with missing data (NA)
na.fail(x) returns an error message if x contains at least one NA
complete.cases(x) returns only observations (rows) with no NA
unique(x) if x is a vector or a data frame, returns a similar object but with the duplicates suppressed
table(x) returns a table with the numbers of the different values of x (typically for integers or factors)
split(x, f) divides vector x into groups based on f
subset(x, ...) returns a selection of x with respect to criteria (..., typically comparisons: x$V1 < 10); if x is a data frame, the option select gives the variables to be kept (or dropped, using a minus)
sample(x, size) resample randomly and without replacement size elements in the vector x; for sampling with replacement use replace = TRUE
sweep(x, margin, stats) transforms an array by sweeping out a summary statistic
prop.table(x, margin) table entries as fraction of marginal table
xtabs(a ~ b, data=x) a contingency table from cross-classifying factors
replace(x, list, values) replace the elements of x listed in list with values
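A minimal sketch tying several of the creation and selection helpers above together (all base R; the data values are made up for illustration):

```r
# Build a small data frame; the length-1 vector n is recycled to length 4.
d <- data.frame(v = 1:4, ch = c("a", "B", "c", "d"), n = 10)

# rep() with each= repeats every element in place.
x <- rep(c(1, 2, 3), each = 2)         # 1 1 2 2 3 3

# which() gives the indices where a comparison is TRUE.
i <- which(x == 2)                     # 3 4

# subset() filters rows and (optionally) selects columns.
d2 <- subset(d, v < 3, select = c(v, ch))
nrow(d2)                               # 2
```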
Data reshaping
merge(a,b) merge two data frames by common col or row names
stack(x, ...) transform data available as separate cols in a data frame or list into a single col
unstack(x, ...) inverse of stack()
rbind(...), cbind(...) combines supplied matrices, data frames, etc. by rows or cols
melt(data, id.vars, measure.vars) changes an object into a suitable form for easy casting (reshape2 package)
cast(data, formula, fun) applies fun to melted data using formula (reshape2 package)
recast(data, formula) melts and casts in a single step (reshape2 package)
reshape(x, direction, ...) reshapes a data frame between 'wide' (repeated measurements in separate cols) and 'long' (repeated measurements in separate rows) format based on direction

Applying functions repeatedly
(m=matrix, a=array, l=list; v=vector, d=dataframe)
apply(x,index,fun) input: m; output: a or l; applies function fun to rows/cols/cells (index) of x
lapply(x,fun) input: l; output: l; apply fun to each element of list x
sapply(x,fun) input: l; output: v; user-friendly wrapper for lapply(); see also replicate()
tapply(x,index,fun) input: l; output: l; applies fun to subsets of x, as grouped based on index
by(data,index,fun) input: df; output is class "by"; wrapper for tapply
aggregate(x,by,fun) input: df; output: df; applies fun to subsets of x, as grouped based on index. Can use formula notation.
ave(data, by, fun = mean) gets mean (or other fun) of subsets of x based on list(s) by

plyr package functions have consistent names: the first character is the input data type, the second is the output. These may be d(ataframe), l(ist), a(rray), or _(discard). Functions have two or three main arguments, depending on input:
a*ply(.data, .margins, .fun, ...)
d*ply(.data, .variables, .fun, ...)
l*ply(.data, .fun, ...)
Three commonly used functions with the ply functions are summarise(), mutate(), and transform()

Math
Many math functions have a logical parameter na.rm=FALSE to specify missing data removal.
sin, cos, tan, asin, acos, atan, atan2, log, log10, exp
min(x), max(x) min/max of elements of x
range(x) min and max elements of x
sum(x) sum of elements of x
diff(x) lagged and iterated differences of vector x
prod(x) product of the elements of x
round(x, n) rounds the elements of x to n decimals
log(x, base) computes the logarithm of x
scale(x) centers and reduces the data; can center only (scale=FALSE) or reduce only (center=FALSE)
pmin(x,y,...), pmax(x,y,...) parallel minimum/maximum; returns a vector whose ith element is the min/max of x[i], y[i], ...
cumsum(x), cummin(x), cummax(x), cumprod(x) a vector whose ith element is the sum/min/max/product from x[1] to x[i]
union(x,y), intersect(x,y), setdiff(x,y), setequal(x,y), is.element(el,set) "set" functions
Re(x) real part of a complex number
Im(x) imaginary part
Mod(x) modulus; abs(x) is the same
Arg(x) angle in radians of the complex number
Conj(x) complex conjugate
convolve(x,y) compute convolutions of sequences
fft(x) Fast Fourier Transform of an array
mvfft(x) FFT of each column of a matrix
filter(x,filter) applies linear filtering to a univariate time series or to each series separately of a multivariate time series

Descriptive statistics
mean(x) mean of the elements of x
median(x) median of the elements of x
quantile(x,probs=) sample quantiles corresponding to the given probabilities (defaults to 0, .25, .5, .75, 1)
weighted.mean(x, w) mean of x with weights w
rank(x) ranks of the elements of x
describe(x) statistical description of data (in Hmisc package)
describe(x) statistical description of data useful for psychometrics (in psych package)
sd(x) standard deviation of x
density(x) kernel density estimates of x

Matrices
t(x) transpose
diag(x) diagonal
%*% matrix multiplication
solve(a,b) solves a %*% x = b for x
solve(a) matrix inverse of a
rowSums(x), colSums(x) sum of rows/cols for a matrix-like object (see also rowMeans(x), colMeans(x))

Distributions
Family of distribution functions: depending on the first letter, each family provides d(ensity), p (cumulative probability), q(uantile), or r(andom sample):
rnorm(n, mean=0, sd=1) Gaussian (normal)
rexp(n, rate=1) exponential
rgamma(n, shape, scale=1) gamma
rpois(n, lambda) Poisson
rweibull(n, shape, scale=1) Weibull
rcauchy(n, location=0, scale=1) Cauchy
rbeta(n, shape1, shape2) beta
rt(n, df) 'Student' (t)
rf(n, df1, df2) Fisher-Snedecor (F)
rchisq(n, df) Pearson (chi-square)
rbinom(n, size, prob) binomial
rgeom(n, prob) geometric
rhyper(nn, m, n, k) hypergeometric
rlogis(n, location=0, scale=1) logistic
rlnorm(n, meanlog=0, sdlog=1) lognormal
rnbinom(n, size, prob) negative binomial
runif(n, min=0, max=1) uniform
rwilcox(nn, m, n), rsignrank(nn, n) Wilcoxon

Correlation and variance
cor(x) correlation matrix of x if it is a matrix or a data frame (1 if x is a vector)
cor(x, y) linear correlation (or correlation matrix) between x and y
var(x) or cov(x) variance of the elements of x (calculated on n − 1); if x is a matrix or a data frame, the variance-covariance matrix is calculated
var(x, y) or cov(x, y) covariance between x and y, or between the columns of x and those of y if they are matrices or data frames
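A short sketch of the apply family, the r* distribution functions, and the descriptive statistics above (base R only; the seed and group labels are arbitrary):

```r
set.seed(1)                            # make the random draws reproducible
x <- rnorm(100, mean = 5, sd = 2)      # 100 draws from N(5, 2)
g <- rep(c("a", "b"), each = 50)       # two arbitrary groups

tapply(x, g, mean)                     # one mean per group
sapply(list(a = 1:3, b = 4:6), sum)    # a = 6, b = 15
apply(matrix(1:6, nrow = 2), 1, sum)   # row sums: 9 12

c(mean = mean(x), sd = sd(x))          # descriptive statistics
cor(1:10, (1:10)^2)                    # correlation of two vectors
```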
Some statistical tests
cor.test(a,b) test correlation; t.test() t test; prop.test(), binom.test() sign test; chisq.test() chi-square test; fisher.test() Fisher exact test; friedman.test() Friedman test; ks.test() Kolmogorov-Smirnov test... use help.search("test")

Models

Model formulas
Formulas use the form: response ~ termA + termB ...
Other formula operators are:
1 intercept, meaning the dependent variable has its mean value when the independent variables are zeros or have no influence
: interaction term
* factor crossing, a*b is the same as a+b+a:b
^ crossing to the specified degree, so (a+b+c)^2 is the same as (a+b+c)*(a+b+c)
- removes the specified term, can be used to remove the intercept as in resp ~ a - 1
%in% left term nested within the right: a + b %in% a is the same as a + a:b
I() operators inside parens are used literally: I(a*b) means a multiplied by b
| conditional on, should be parenthetical

Formula-based modeling functions commonly take the arguments: data, subset, and na.action.

Model functions
aov(formula, data) analysis of variance model
lm(formula, data) fit linear models
glm(formula, family, data) fit generalized linear models; family is a description of the error distribution and link function to be used; see ?family
nls(formula, data) nonlinear least-squares estimates of the nonlinear model parameters
lmer(formula, data) fit mixed effects model (lme4); see also lme() (nlme)
anova(fit, ...) provides sequential sums of squares and corresponding F-tests for objects
contrasts(fit, contrasts = TRUE) view contrasts associated with a factor; to set use: contrasts(fit, how.many) <- value
glht(fit, linfct) makes multiple comparisons using a linear function linfct (multcomp)
summary(fit) summary of model, often w/ t-values
confint(fit, parm) confidence intervals for one or more parameters in a fitted model
predict(fit, ...) predictions from fit
df.residual(fit) returns residual degrees of freedom
coef(fit) returns the estimated coefficients (sometimes with standard errors)
residuals(fit) returns the residuals
deviance(fit) returns the deviance
fitted(fit) returns the fitted values
logLik(fit) computes the logarithm of the likelihood and the number of parameters
AIC(fit), BIC(fit) compute Akaike or Bayesian information criterion
influence.measures(fit) diagnostics for lm & glm
approx(x,y) linearly interpolate given data points; x can be an xy plotting structure
spline(x,y) cubic spline interpolation
loess(formula) fit polynomial surface using local fitting
optim(par, fn, method = c("Nelder-Mead", "BFGS", "CG", "L-BFGS-B", "SANN")) general-purpose optimization; par is initial values, fn is the function to optimize (normally minimize)
nlm(f,p) minimize function f using a Newton-type algorithm with starting values p

Flow control
if(cond) expr
if(cond) cons.expr else alt.expr
for(var in seq) expr
while(cond) expr
repeat expr
break
next
switch
Use braces {} around statements
ifelse(test, yes, no) a value with the same shape as test filled with elements from either yes or no
do.call(funname, args) executes a function call from the name of the function and a list of arguments to be passed to it

Writing functions
function(arglist) expr function definition
missing test whether a value was specified as an argument to a function
require load a package within a function
<<- attempts assignment within the parent environment before searching up through environments
on.exit(expr) executes an expression at function end
return(value) or invisible(value)

Strings
paste(vectors, sep, collapse) concatenate vectors after converting to character; sep is a string to separate terms; collapse is an optional string to separate "collapsed" results; see also str_c below
substr(x,start,stop) get or assign substrings in a character vector. See also str_sub below
strsplit(x,split) split x according to the substring split
grep(pattern,x) searches for matches to pattern within x; see ?regex
gsub(pattern,replacement,x) replace pattern in x using regular expression matching; sub() is similar but only replaces the first occurrence
tolower(x), toupper(x) convert to lower/uppercase
match(x,table) a vector of the positions of first matches for the elements of x among table
x %in% table as above but returns a logical vector
pmatch(x,table) partial matches for the elements of x among table
nchar(x) # of characters. See also str_length below

stringr package provides a nice interface for string functions:
str_detect detects the presence of a pattern; returns a logical vector
str_locate locates the first position of a pattern; returns a numeric matrix with cols start and end (str_locate_all locates all matches)
str_extract extracts text corresponding to the first match; returns a character vector (str_extract_all extracts all matches)
str_match extracts "capture groups" formed by () from the first match; returns a character matrix with one column for the complete match and one column for each group
str_match_all extracts "capture groups" from all matches; returns a list of character matrices
str_replace replaces the first matched pattern; returns a character vector
str_replace_all replaces all matches
str_split_fixed splits a string into a fixed number of pieces based on a pattern; returns a character matrix
str_split splits a string into a variable number of pieces; returns a list of character vectors
str_c joins multiple strings, similar to paste
str_length gets the length of a string, similar to nchar
str_sub extracts substrings from a character vector, similar to substr
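A quick sketch of the model-fitting workflow above, using the built-in cars dataset (base R; the formula is arbitrary):

```r
# Fit a simple linear model: stopping distance as a function of speed.
fit <- lm(dist ~ speed, data = cars)

coef(fit)          # intercept and slope
summary(fit)       # t-values, R-squared, etc.
AIC(fit)           # information criterion
confint(fit)       # confidence intervals for the coefficients

# predict() on new data:
predict(fit, newdata = data.frame(speed = c(10, 20)))
```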
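The base string functions listed above in action (a minimal sketch; the example strings are made up):

```r
x <- c("apple pie", "banana split", "apple tart")

grep("apple", x)              # indices of the matches: 1 3
gsub("apple", "pear", x)      # replace every match
sub("p", "P", "pepper")       # only the first occurrence: "Pepper"
strsplit("a,b,c", ",")[[1]]   # "a" "b" "c"
toupper(x[1])                 # "APPLE PIE"
nchar(x)                      # 9 12 10
```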
Dates and Times
Class Date is dates without times. Class POSIXct is dates and times, including time zones. Class timeDate (in package timeDate) also includes financial centers.

lubridate package is great for manipulating times/dates and has 3 new object classes:
interval class: time between two specific instants. Create with new_interval() or subtract two times. Access with int_start() and int_end()
duration class: time spans with exact lengths; new_duration() creates a generic time span that can be added to a date; other functions that create duration objects start with d: dyears(), dweeks()...
period class: time spans that may not have consistent lengths in seconds; functions include: years(), months(), weeks(), days(), hours(), minutes(), and seconds()
ymd(date, tz), mdy(date, tz), dmy(date, tz) transform character or numeric dates to a POSIXct object using timezone tz (lubridate)

Other time packages: zoo, xts, its do irregular time series; TimeWarp has a holiday database from 1980+; timeDate also does holidays; tseries for analysis and computational finance; forecast for modeling univariate time series forecasts; fts for faster operations; tis for time indexes and time indexed series, compatible with FAME frequencies.

Date and time formats are specified with:
%a, %A Abbreviated and full weekday name
%b, %B Abbreviated and full month name
%d Day of the month (01-31)
%H Hours (00-23)
%I Hours (01-12)
%j Day of year (001-366)
%m Month (01-12)
%M Minute (00-59)
%p AM/PM indicator
%S Second as decimal number (00-61)
%U Week (00-53); first Sun is day 1 of wk 1
%w Weekday (0-6, Sunday is 0)
%W Week (00-53); first Mon is day 1 of wk 1
%y Year without century (00-99). Don't use.
%Y Year with century
%z (output only) signed offset from Greenwich; -0800 is 8 hours west of Greenwich
%Z (output only) Time zone as a character string

Graphs
There are three main classes of plots in R: base plots, grid & lattice plots, and the ggplot2 package. They have limited interoperability. Base, grid, and lattice are covered here; ggplot2 needs its own reference sheet.

Base graphics
Common arguments for base plots:
add=FALSE if TRUE superposes the plot on the previous one (if it exists)
axes=TRUE if FALSE does not draw the axes and the box
type="p" specifies the type of plot: "p" points, "l" lines, "b" points connected by lines, "o" same as previous but the lines are over the points, "h" vertical lines, "s" steps (data are represented by the top of the vertical lines), "S" same as previous but data are represented by the bottom of the vertical lines
xlim=, ylim= specifies the lower and upper limits of the axes, for example with xlim=c(1, 10) or xlim=range(x)
xlab=, ylab= annotates the axes; must be variables of mode character
main= main title; must be a variable of mode character
sub= sub-title (written in a smaller font)

Base plot functions
plot(x) plot of the values of x (on the y-axis) ordered on the x-axis
plot(x, y) bivariate plot of x (on the x-axis) and y (on the y-axis)
hist(x) histogram of the frequencies of x
barplot(x) histogram of the values of x; use horiz=TRUE for horizontal bars
dotchart(x) if x is a data frame, plots a Cleveland dot plot (stacked plots line-by-line and column-by-column)
boxplot(x) "box-and-whiskers" plot
stripchart(x) plot of the values of x on a line (an alternative to boxplot() for small sample sizes)
coplot(x~y | z) bivariate plot of x and y for each value or interval of values of z
interaction.plot(f1, f2, y) if f1 and f2 are factors, plots the means of y (on the y-axis) with respect to the values of f1 (on the x-axis) and of f2 (different curves); the option fun allows to choose the summary statistic of y (by default fun=mean)
matplot(x,y) bivariate plot of the first column of x vs. the first one of y, the second one of x vs. the second one of y, etc.
fourfoldplot(x) visualizes, with quarters of circles, the association between two dichotomous variables for different populations (x must be an array with dim=c(2, 2, k), or a matrix with dim=c(2, 2) if k=1)
assocplot(x) Cohen-Friendly graph showing the deviations from independence of rows and columns in a two dimensional contingency table
mosaicplot(x) 'mosaic' graph of the residuals from a log-linear regression of a contingency table
pairs(x) if x is a matrix or a data frame, draws all possible bivariate plots between the columns of x
plot.ts(x) if x is an object of class "ts", plot of x with respect to time; x may be multivariate but the series must have the same frequency and dates
ts.plot(x) same as above, but if x is multivariate the series may have different dates and must have the same frequency
qqnorm(x) quantiles of x with respect to the values expected under a normal distribution
qqplot(x, y) diagnostic plot of quantiles of y vs. quantiles of x; see also qqPlot in the car package and distplot in the vcd package
contour(x, y, z) contour plot (data are interpolated to draw the curves); x and y must be vectors and z must be a matrix so that dim(z) = c(length(x), length(y)) (x and y may be omitted). See also filled.contour, image, and persp
symbols(x, y, ...) draws, at the coordinates given by x and y, symbols (circles, squares, rectangles, stars, thermometers or "boxplots") whose sizes, colours, etc. are specified by supplementary arguments
termplot(mod.obj) plot of the (partial) effects of a regression model (mod.obj)
colorRampPalette creates a color palette (use: colfunc <- colorRampPalette(c("black", "white")); colfunc(10))

Low-level base plot functions
points(x, y) adds points (the option type= can be used)
lines(x, y) same as above but with lines
text(x, y, labels, ...) adds text given by labels at coordinates (x,y); a typical use is: plot(x, y, type="n"); text(x, y, names)
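A small sketch of the date classes and format codes above (base R; the example date is arbitrary):

```r
d <- as.Date("2023-07-04")             # Date: no time component

format(d, "%Y-%m-%d")                  # "2023-07-04"
format(d, "%j")                        # day of year: "185"
d + 30                                 # Date arithmetic works in days

t1 <- as.POSIXct("2023-07-04 12:00:00", tz = "UTC")  # date-time with zone
format(t1, "%H:%M")                    # "12:00"
```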
mtext(text, side=3, line=0, ...) adds text given by text in the margin specified by side (see axis() below); line specifies the line from the plotting area
segments(x0, y0, x1, y1) draws lines from points (x0,y0) to points (x1,y1)
arrows(x0, y0, x1, y1, angle=30, code=2) same as above with arrows at points (x0,y0) if code=2, at points (x1,y1) if code=1, or both if code=3; angle controls the angle from the shaft of the arrow to the edge of the arrow head
abline(a,b) draws a line of slope b and intercept a
abline(h=y) draws a horizontal line at ordinate y
abline(v=x) draws a vertical line at abscissa x
abline(lm.obj) draws the regression line given by lm.obj
rect(x1, y1, x2, y2) draws a rectangle with left, right, bottom, and top limits of x1, x2, y1, and y2, respectively
polygon(x, y) draws a polygon linking the points with coordinates given by x and y
legend(x, y, legend) adds the legend at the point (x,y) with the symbols given by legend
title() adds a title and optionally a sub-title
axis(side, vect) adds an axis at the bottom (side=1), on the left (2), at the top (3), or on the right (4); vect (optional) gives the abscissa (or ordinates) where tick-marks are drawn
rug(x) draws the data x on the x-axis as small vertical lines
locator(n, type="n", ...) returns the coordinates (x, y) after the user has clicked n times on the plot with the mouse; also draws symbols (type="p") or lines (type="l") with respect to optional graphic parameters (...); by default nothing is drawn (type="n")

Plot parameters
These can be set globally with par(...); many can be passed as parameters to plotting commands.
adj controls text justification (0 left-justified, 0.5 centred, 1 right-justified)
bg specifies the colour of the background (e.g. bg="red", bg="blue"; the list of the 657 available colours is displayed with colors())
bty controls the type of box drawn around the plot; allowed values are: "o", "l", "7", "c", "u" or "]" (the box looks like the corresponding character); if bty="n" the box is not drawn
cex a value controlling the size of texts and symbols with respect to the default; the following parameters have the same control for numbers on the axes, cex.axis, the axis labels, cex.lab, the title, cex.main, and the sub-title, cex.sub
col controls the color of symbols and lines; use color names: "red", "blue" (see colors()) or as "#RRGGBB"; see rgb(), hsv(), gray(), and rainbow(); as for cex there are: col.axis, col.lab, col.main, col.sub
font an integer that controls the style of text (1: normal, 2: italics, 3: bold, 4: bold italics); as for cex there are: font.axis, font.lab, font.main, font.sub
las an integer that controls the orientation of the axis labels (0: parallel to the axes, 1: horizontal, 2: perpendicular to the axes, 3: vertical)
lty controls the type of lines, can be an integer or string (1: "solid", 2: "dashed", 3: "dotted", 4: "dotdash", 5: "longdash", 6: "twodash"), or a string of up to eight characters (between "0" and "9") that specifies alternately the length, in points or pixels, of the drawn elements and the blanks; for example lty="44" will have the same effect as lty=2
lwd a numeric that controls the width of lines, default 1
mar a vector of 4 numeric values that control the space between the axes and the border of the graph, of the form c(bottom, left, top, right); the default values are c(5.1, 4.1, 4.1, 2.1)
mfcol a vector of the form c(nr,nc) that partitions the graphic window as a matrix of nr lines and nc columns; the plots are then drawn in columns
mfrow same as above but the plots are drawn by row
pch controls the type of symbol, either an integer between 1 and 25, or any single character within "" (the original sheet shows a chart of the 25 plotting symbols here)
ps an integer that controls the size in points of texts and symbols
pty a character that specifies the type of the plotting region, "s": square, "m": maximal
tck a value that specifies the length of tick-marks on the axes as a fraction of the smallest of the width or height of the plot; if tck=1 a grid is drawn
tcl a value that specifies the length of tick-marks on the axes as a fraction of the height of a line of text (by default tcl=-0.5)
xaxt if xaxt="n" the x-axis is set but not drawn (useful in conjunction with axis(side=1, ...))
yaxt if yaxt="n" the y-axis is set but not drawn (useful in conjunction with axis(side=2, ...))

Lattice graphics
Lattice functions return objects of class trellis and must be printed. Use print(xyplot(...)) inside functions where automatic printing doesn't work. Use lattice.theme and lset to change Lattice defaults.
In the normal Lattice formula, y ~ x | g1*g2 has combinations of optional conditioning variables g1 and g2 plotted on separate panels. Lattice functions take many of the same args as base graphics plus also data= the data frame for the formula variables and subset= for subsetting. Use panel= to define a custom panel function (see apropos("panel") and ?llines).
xyplot(y~x) bivariate plots (with many functionalities)
barchart(y~x) histogram of the values of y with respect to those of x
dotplot(y~x) Cleveland dot plot (stacked plots line-by-line and column-by-column)
densityplot(~x) density functions plot
histogram(~x) histogram of the frequencies of x
bwplot(y~x) "box-and-whiskers" plot
qqmath(~x) quantiles of x with respect to the values expected under a theoretical distribution
stripplot(y~x) single dimension plot, x must be numeric, y may be a factor
qq(y~x) quantiles to compare two distributions, x must be numeric, y may be numeric, character, or factor but must have two 'levels'
splom(~x) matrix of bivariate plots
parallel(~x) parallel coordinates plot
levelplot(z~x*y|g1*g2) coloured plot of the values of z at the coordinates given by x and y (x, y and z are all of the same length)
wireframe(z~x*y|g1*g2) 3d surface plot
cloud(z~x*y|g1*g2) 3d scatter plot
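A runnable sketch of the base-plot workflow above; it draws to a temporary PDF so it works non-interactively (the data are arbitrary):

```r
f <- tempfile(fileext = ".pdf")
pdf(f)                                      # open a file-based device
par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))   # two panels, tighter margins
plot(1:10, (1:10)^2, type = "b", pch = 19, col = "blue",
     xlab = "x", ylab = "x squared", main = "points + lines")
abline(h = 50, lty = 2)                     # dashed horizontal reference line
hist(rnorm(100), main = "histogram")
dev.off()                                   # close the device
file.exists(f)                              # TRUE
```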
R Programming Cheat Sheet
advanced
Created By: Arianne Colton and Sean Chen

Environments

Environment Basics

What is an Environment?
The data structure (that powers lexical scoping) is made up of two components: the frame, which contains the name-object bindings (and behaves much like a named list), and the parent environment.

Named List
• You can think of an environment as a bag of names. Each name points to an object stored elsewhere in memory.
• If an object has no names pointing to it, it gets automatically deleted by the garbage collector.

Parent Environment
• Every environment has a parent, another environment. Only one environment doesn't have a parent: the empty environment.
• The parent is used to implement lexical scoping: if a name is not found in an environment, then R will look in its parent (and so on).

Environments can also be useful data structures in their own right because they have reference semantics.

Four Special Environments
1. Global environment, access with globalenv(), is the interactive workspace. This is the environment in which you normally work. The parent of the global environment is the last package that you attached with library() or require().
2. Base environment, access with baseenv(), is the environment of the base package. Its parent is the empty environment.
3. Empty environment, access with emptyenv(), is the ultimate ancestor of all environments, and the only environment without a parent. Empty environments contain nothing.
4. Current environment, access with environment().

Search Path

What is the Search Path?
An R internal mechanism to look up objects, specifically functions.
• Access with search(), which lists all parents of the global environment. (See Figure 1)
• It contains one environment for each attached package.
• Objects in the search path environments can be found from the top-level interactive workspace.
• If you look for a name in a search, it will always start from the global environment first, then inside the latest attached package.

Figure 1. The Search Path

If there are functions with the same name in two different packages, the latest package will get called.
• Each time you load a new package with library()/require() it is inserted between the global environment and the package that was previously at the top of the search path.

search():
'.GlobalEnv' ... 'Autoloads' 'package:base'
library(reshape2); search():
'.GlobalEnv' 'package:reshape2' ... 'Autoloads' 'package:base'

Note: There is a special environment called Autoloads which is used to save memory by only loading package objects (like big datasets) when needed.

Access any environment on the search list: as.environment('package:base')
Find the environment where a name is defined: pryr::where('func1')

Function Environments
There are 4 environments for functions.

1. Enclosing environment (used for lexical scoping)
• When a function is created, it gains a reference to the environment where it was made. This is the enclosing environment.
• The enclosing environment belongs to the function, and never changes, even if the function is moved to a different environment.
• Every function has one and only one enclosing environment. For the three other types of environment, there may be 0, 1, or many environments associated with each function.
• You can determine the enclosing environment of a function by calling environment(func1).

2. Binding environment
• The binding environments of a function are all the environments which have a binding to it.
• The enclosing environment determines how the function finds values; the binding environments determine how we find the function.

Example for enclosing and binding environment:
y <- 1
e <- new.env()
e$g <- function(x) x + y
# function g: the enclosing environment is the global environment, and the binding environment is "e"

Note: Every R package has two environments associated with it (package and namespace). Every exported function is bound into the package environment, but enclosed by the namespace environment.

3. Execution environment
• Each time a function is called, a new environment is created to host execution. The parent of the execution environment is the enclosing environment of the function.
• Once the function has completed, this environment is thrown away.
Note: Each execution environment has two parents: a calling environment and an enclosing environment.
• R's regular scoping rules only use the enclosing parent; parent.frame() allows you to access the calling parent.

4. Calling environment
• This is the environment where the function was called.
• Looking up variables in the calling environment rather than in the enclosing environment is called dynamic scoping.
• Dynamic scoping is primarily useful for developing functions that aid interactive data analysis.

Figure 2. Function Environment

Binding Names to Values

Assignment
• Assignment is the act of binding (or rebinding) a name to a value in an environment.

Name rules
• A complete list of reserved words can be found in ?Reserved.

Regular assignment arrow, <-
• The regular assignment arrow always creates a variable in the current environment.

Deep assignment arrow, <<-
• The deep assignment arrow modifies an existing variable found by walking up the parent environments. If <<- doesn't find an existing variable, it will create one in the global environment. This is usually undesirable, because global variables introduce non-obvious dependencies between functions.

Environment Creation
• To create an environment manually, use new.env(). You can list the bindings in the environment's frame with ls() and see its parent with parent.env().
• When creating your own environment, note that you should set its parent environment to be the empty environment. This ensures you don't accidentally inherit objects from somewhere else.
Functions

Function Basics
The most important thing to understand about R is that functions are objects in their own right.

All R functions have three parts:
body() code inside the function
formals() list of arguments which controls how you can call the function
environment() "map" of the location of the function's variables (see "Enclosing Environment")

• When you print(func1) a function in R, it shows you these three important components. If the environment isn't displayed, it means that the function was created in the global environment.
• Like all objects in R, functions can also possess any number of additional attributes().

Every Operation is a Function Call
• Everything that exists is an object.
• Everything that happens in R is a function call, even if it doesn't look like it. (i.e. +, for, if, [, $, { ...)
Note: the backtick (`) lets you refer to functions or variables that have otherwise reserved or illegal names: e.g. x + y is the same as `+`(x, y)

Lexical Scoping

What is Lexical Scoping?
Looks up the value of a symbol. (See "Enclosing Environment" in the "Environment" section.)
• findGlobals() lists all the external dependencies of a function:
f <- function() x + 1
codetools::findGlobals(f)
> '+' 'x'

Function Arguments
• missing() tests whether a value was specified as an argument to a function:
i <- function(a, b) {
  missing(a)   # returns TRUE or FALSE
}
• By default, R function arguments are lazy -- they're only evaluated if they're actually used:
f <- function(x) {
  10
}
f(stop('This is an error!'))   # returns 10
However, since x is not used, stop("This is an error!") never gets evaluated.
• Default arguments are evaluated inside the function. This means that if the expression depends on the current environment the results will differ depending on whether you use the default value or explicitly provide one:
f <- function(x = ls()) {
  a <- 1
  x
}
f()       # returns 'a' 'x' -- ls() evaluated inside f
f(ls())   # ls() evaluated in the global environment

Return Values
• The last expression evaluated in a function becomes the return value, the result of invoking the function.
• Only use explicit return() for when you are returning early, such as for an error.
• Functions can return only a single object. But this is not a limitation because you can return a list containing any number of objects.
• Functions can return invisible values, which are not printed out by default when you call the function:
f1 <- function() 1

Infix Functions
• You can also create infix functions where the function name comes in between its arguments, like + or -.
• All user-created infix functions must start and end with %.
`%+%` <- function(a, b) paste0(a, b)
'new' %+% 'string'
• Useful way of providing a default value in case the output of another function is NULL:
`%||%` <- function(a, b) if (!is.null(a)) a else b
function_that_might_return_null() %||% default value

Replacement Functions
• Act like they modify their arguments in place, and have the special name xxx<-
• They typically have two arguments (x and value), although they can have more, and they must return the modified object.
`second<-` <- function(x, value) {
  x[2] <- value
  x
}
x <- 1:10
second(x) <- 5L
• I say they "act" like they modify their arguments in place, because they actually create a modified copy. We can see that by using pryr::address() to find the memory address of the underlying object.

Data Structures

1. Type: represents the object's base type; access with typeof()
2. Class: represents the object's abstract type
• 'type' of the object from R's object-oriented programming point of view
• Access with class()

                                typeof()    class()
strings or vector of strings    character   character
numbers or vector of numbers    double      numeric
list                            list        list
data.frame*                     list        data.frame

* Internally, data.frame is a list of equal-length vectors.

1d (vectors: atomic vector and list)
• Use is.atomic(x) || is.list(x) to test if an object is actually a vector, not is.vector().

Type        typeof()       what it is
Length      length()       how many elements
Attributes  attributes()   additional arbitrary metadata

Factors
• Factors are built on top of integer vectors using two attributes:
class(x)   # 'factor'
levels(x)  # defines the set of allowed values
• While factors look (and often behave) like character vectors, they are actually integers. Be careful when treating them like strings.
• Factors are useful when you know the possible values a variable may take, even if you don't see all
values in a given dataset.
environment(f) <- emptyenv() f2 <- function() invisible(1)
• Most data loading functions in R automatically
f()
• The most common function that returns invisibly is <- Homogeneous Heterogeneous
convert character vectors to factors, use the
# error in f(): could not ind function “+” * argument stringsAsFactors = FALSE to suppress
Primitive Functions 1d Atomic vector List this behavior.

* This doesn’t work because R relies on lexical scoping • There is one exception to the rule that functions 2d Matrix Data frame attriButes
to ind everything, even the + operator. It’s never have three components.
nd Array • All objects can have arbitrary additional attributes.
possible to make a function completely self-contained • Primitive functions, like sum(), call C code directly • Attributes can be accessed individually with attr() or
because you must always rely on functions deined in with .Primitive() and contain no R code.
base R or other packages. Note: R has no 0-dimensional or scalar types. Individual all at once (as a list) with attributes().
• Therefore their formals(), body(), and environment() numbers or strings, are actually vectors of length one,
are all NULL: attr(v1, 'attr1') <- 'my vector'
NOT scalars.
Function arguments sum : function (..., na.rm = FALSE) • By default, most attributes are lost when modifying a
Human readable description of any R data structure:
.Primitive('sum') vector. The only attributes not lost are the three most
When calling a function you can specify arguments by important:
position, by complete name, or by partial name. • Primitive functions are only found in the base str(variable)
Arguments are matched irst by exact name (perfect package, and since they operate at a low level, they Names
a character vector giving names(x)
matching), then by preix matching, and inally by position. can be more eficient. Every Object has a mode and a class each element a name
1. Mode: represents how an object is stored in memory; used to turn vectors into
• Function arguments are passed by reference and inFix Functions Dimensions
matrices and arrays
dim(x)
copied on modify. • ‘type’ of the object from R’s point of view
• Most functions in R are ‘preix’ operators: the name used to implement the S3
• You can determine if an argument was supplied or not of the function comes before the arguments. • Access with typeof() Class
object system
class(x)
with the missing() function.
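Lazy evaluation interacts with closures: an argument captured by an inner function is not evaluated until that function runs. A minimal sketch of the force() idiom (make_adder is an illustrative name, not from the text):

```r
make_adder <- function(x) {
  force(x)            # evaluate x now, while the intended value is available
  function(y) x + y   # the returned closure uses the captured x
}
add2 <- make_adder(2)
add2(10)   # 12
```

force(x) simply evaluates its argument; it makes the intent explicit whenever a lazily evaluated argument is captured by a closure.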
Subsetting (operators: [, [[, $)

Subsetting returns a copy of the original data, NOT a copy-on-modify reference.

Simplifying vs. Preserving Subsetting

• Simplifying subsetting returns the simplest possible data structure that can represent the output.
• Preserving subsetting keeps the structure of the output the same as the input.

            Simplifying*          Preserving
Vector      x[[1]]                x[1]
List        x[[1]]                x[1]
Factor      x[1:4, drop = T]      x[1:4]
Array       x[1, ] or x[, 1]      x[1, , drop = F] or x[, 1, drop = F]
Data frame  x[, 1] or x[[1]]      x[, 1, drop = F] or x[1]

• When you use drop = FALSE, it's preserving.
• Omitting drop = FALSE when subsetting matrices and data frames is one of the most common sources of programming errors.
• [[ is similar to [, except it can only return a single value and it allows you to pull pieces out of a list.

* Simplifying behavior varies slightly between different data types:
• Atomic vector: x[[1]] is the same as x[1].
• List: [ ] always returns a list; to get the contents use [[ ]].
• Factor: drops any unused levels, but it remains a factor class.
• Matrix or array: if any of the dimensions has length 1, drops that dimension.
• Data frame is similar: if the output is a single column, it returns a vector instead of a data frame.

Out of Bounds

• [ and [[ differ slightly in their behavior when the index is out of bounds (OOB).
• For example, when you try to extract the fifth element of a length-four vector (OOB): x[5] -> NA. Or subset a vector with NA or NULL: x[NULL] -> x[0].

Operator  Index     Atomic  List
[         OOB       NA      list(NULL)
[         NA_real_  NA      list(NULL)
[         NULL      x[0]    list(NULL)
[[        OOB       Error   Error
[[        NA_real_  Error   NULL
[[        NULL      Error   Error

• If the input vector is named, then the names of OOB, missing, or NULL components will be "<NA>".

$ Subsetting Operator

• $ is a useful shorthand for [[ combined with character subsetting:

x$y is equivalent to x[['y', exact = FALSE]]

• One common mistake with $ is to try and use it when you have the name of a column stored in a variable:

var <- 'cyl'
x$var
# doesn't work, translated to x[['var']]
# Instead use x[[var]]

• There's one important difference between $ and [[: $ does partial matching, [[ does not:

x <- list(abc = 1)
x$a -> 1          # since "exact = FALSE"
x[['a']] ->       # would be an error

Subsetting with Assignment

• All subsetting operators can be combined with assignment to modify selected values of the input vector.
• Subsetting with nothing can be useful in conjunction with assignment because it will preserve the original object class and structure:

df1[] <- lapply(df1, as.integer)
# df1 will remain a data frame

df1 <- lapply(df1, as.integer)
# df1 will become a list

Data Frame Subsetting

• Data frames possess the characteristics of both lists and matrices. If you subset with a single vector, they behave like lists; if you subset with two vectors, they behave like matrices.

List subsetting    df1[c('col1', 'col2')]
Matrix subsetting  df1[, c('col1', 'col2')]

The subsetting results are the same in this example.

• Single column subsetting: matrix subsetting simplifies by default, list subsetting does not.

str(df1[, 'col1']) -> int [1:3]
# the result is a vector

str(df1['col1']) -> 'data.frame'
# the result remains a data frame of 1 column

Examples

1. Lookup tables (character subsetting)

Character matching provides a powerful way to make lookup tables.

x <- c('m', 'f', 'u', 'f', 'f', 'm', 'm')
lookup <- c(m = 'Male', f = 'Female', u = NA)
lookup[x]
> m f u f f m m
> 'Male' 'Female' NA 'Female' 'Female' 'Male' 'Male'
unname(lookup[x])
> 'Male' 'Female' NA 'Female' 'Female' 'Male' 'Male'

2. Matching and merging by hand (integer subsetting)

Lookup table which has multiple columns of information:

grades <- c(1, 2, 2, 3, 1)
info <- data.frame(
  grade = 3:1,
  desc = c('Excellent', 'Good', 'Poor'),
  fail = c(F, F, T)
)

First method:

id <- match(grades, info$grade)
info[id, ]

Second method:

rownames(info) <- info$grade
info[as.character(grades), ]

• If you have multiple columns to match on, you'll need to first collapse them to a single column (with interaction(), paste(), or plyr::id()).
• You can also use merge() or plyr::join(), which do the same thing for you.

3. Expanding aggregated counts (integer subsetting)

• Sometimes you get a data frame where identical rows have been collapsed into one and a count column has been added.
• rep() and integer subsetting make it easy to uncollapse the data by subsetting with a repeated row index: rep(x, y) replicates the values in x, y times.

df1$countCol is c(3, 5, 1)
rep(1:nrow(df1), df1$countCol)
> 1 1 1 2 2 2 2 2 3

4. Removing columns from data frames (character subsetting)

There are two ways to remove columns from a data frame:

Set individual columns to NULL              df1$col3 <- NULL
Subset to return only the columns you want  df1[c('col1', 'col2')]

5. Selecting rows based on a condition (logical subsetting)

• Logical subsetting is probably the most commonly used technique for extracting rows out of a data frame.

df1[df1$col1 == 5 & df1$col2 == 4, ]

• Remember to use the vector boolean operators & and |, not the short-circuiting scalar operators && and ||, which are more useful inside if statements.
• subset() is a specialised shorthand function for subsetting data frames, and saves some typing because you don't need to repeat the name of the data frame:

subset(df1, col1 == 5 & col2 == 4)

Boolean Algebra vs. Sets (logical & integer subsetting)

• It's useful to be aware of the natural equivalence between set operations (integer subsetting) and boolean algebra (logical subsetting).
• Using set operations is more effective when:
  » You want to find the first (or last) TRUE.
  » You have very few TRUEs and very many FALSEs; a set representation may be faster and require less storage.
• which() allows you to convert a boolean representation to an integer representation. There's no reverse operation in base R.

which(c(T, F, T, F)) -> 1 3
# returns the indices of the TRUEs*

* The integer representation length is always <= the boolean representation length.

• When first learning subsetting, a common mistake is to use x[which(y)] instead of x[y]. Here the which() achieves nothing: it switches from logical to integer subsetting, but the result will be exactly the same.
• Also beware that x[-which(y)] is not equivalent to x[!y]. If y is all FALSE, which(y) will be integer(0) and -integer(0) is still integer(0), so you'll get no values, instead of all values.
• In general, avoid switching from logical to integer subsetting unless you want, for example, the first or last TRUE value.
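The which() pitfalls described above can be checked directly at the console; this is plain base R:

```r
x <- c(10, 20, 30, 40)
y <- c(TRUE, FALSE, TRUE, FALSE)
x[y]            # 10 30 -- logical subsetting
x[which(y)]     # 10 30 -- same result; which() gains nothing here

y0 <- rep(FALSE, 4)
x[!y0]          # 10 20 30 40 -- all values, as expected
x[-which(y0)]   # numeric(0): which(y0) is integer(0), and -integer(0)
                # is still integer(0), so nothing is selected
```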
Debugging, Condition Handling, & Defensive Programming

Debugging

Use traceback() and browser(), and interactive tools in RStudio:

• RStudio's error inspector or traceback(), which lists the sequence of calls that led to the error.
• RStudio's breakpoints or browser(), which open an interactive debug session at an arbitrary location in the code.
• RStudio's "Rerun with Debug" tool or options(error = browser)*, which open an interactive debug session where the error occurred.

* There are two other useful functions that you can use with the error option:

1. recover is a step up from browser, as it allows you to enter the environment of any of the calls in the call stack. This is useful because often the root cause of the error is a number of calls back.

2. dump.frames is an equivalent to recover for non-interactive code. It creates a last.dump.rda file in the current working directory. Then, in a later interactive R session, you load that file and use debugger() to enter an interactive debugger with the same interface as recover(). This allows interactive debugging of batch code.

# In batch R process ----
dump_and_quit <- function() {
  # Save debugging info to file last.dump.rda
  dump.frames(to.file = TRUE)
  # Quit R with error status
  q(status = 1)
}
options(error = dump_and_quit)

# In a later interactive session ----
load("last.dump.rda")
debugger()

Condition Handling (of Expected Errors)

1. Communicating potential problems to the user is the job of conditions: errors, warnings, and messages:
• Fatal errors are raised by stop() and force all execution to terminate. Errors are used when there is no way for a function to continue.
• Warnings are generated by warning() and are used to display potential problems, such as when some elements of a vectorised input are invalid.
• Messages are generated by message() and are used to give informative output in a way that can easily be suppressed by the user using ?suppressMessages().

2. Handling conditions programmatically:
• try() gives you the ability to continue execution even when an error occurs.
• tryCatch() lets you specify handler functions that control what happens when a condition is signaled:

result = tryCatch(code,
  error = function(c) "error",
  warning = function(c) "warning",
  message = function(c) "message"
)

Use conditionMessage(c) or c$message to extract the message associated with the original error.
• You can also capture the output of the try() and tryCatch() functions. If successful, it will be the last result evaluated in the block, just like a function. If unsuccessful, it will be an invisible object of class "try-error".

3. Custom signal classes:
• One of the challenges of error handling in R is that most functions just call stop() with a string.
• Since conditions are S3 classes, the solution is to define your own classes if you want to distinguish different types of error.
• Each condition-signalling function, stop(), warning(), and message(), can be given either a list of strings or a custom S3 condition object.

Defensive Programming

The basic principle of defensive programming is to "fail fast": to raise an error as soon as something goes wrong. In R, this takes three particular forms:

1. Checking that inputs are correct using stopifnot(), the 'assertthat' package, or simple if statements and stop().

2. Avoiding non-standard evaluation like subset(), transform(), and with(). These functions save time when used interactively, but because they make assumptions to reduce typing, when they fail, they often fail with uninformative error messages.

3. Avoiding functions that can return different types of output. The two biggest offenders are [ and sapply().

Note: Whenever subsetting a data frame in a function, you should always use drop = FALSE.

Object Oriented (OO) Field Guide

Object Oriented Systems

R has three object oriented systems (plus the base types):

1. S3 is a very casual system. It has no formal definition of classes. S3 implements a style of OO programming called generic-function OO.
• Generic-function OO -- a special type of function called a generic function decides which method to call.
  Example: drawRect(canvas, 'blue')
  Language: R
• Message-passing OO -- messages (methods) are sent to objects and the object determines which function to call.
  Example: canvas.drawRect('blue')
  Language: Java, C++, and C#

2. S4 works similarly to S3, but is more formal. There are two major differences to S3:
• S4 has formal class definitions, which describe the representation and inheritance for each class, and has special helper functions for defining generics and methods.
• S4 also has multiple dispatch, which means that generic functions can pick methods based on the class of any number of arguments, not just one.

3. Reference classes, called RC for short, are quite different from S3 and S4:
• RC implements message-passing OO, so methods belong to classes, not functions.
• $ is used to separate objects and methods, so method calls look like canvas$drawRect('blue').

S3

• S3 is R's first and simplest OO system. It is the only OO system used in the base and stats packages.
• In S3, methods belong to functions, called generic functions, or generics for short. S3 methods do not belong to objects or classes.
• Given a class, the job of an S3 generic is to call the right S3 method. You can recognise S3 methods by their names, which look like generic.class(). For example, the Date method for the mean() generic is called mean.Date(). This is the reason that most modern style guides discourage the use of . in function names: it makes them look like S3 methods.
• See all methods that belong to a generic:

methods('mean')
#> mean.Date
#> mean.default
#> mean.difftime

• List all generics that have a method for a given class:

methods(class = 'Date')

• S3 objects are usually built on top of lists, or atomic vectors with attributes. Factor and data frame are S3 classes.

Check if an object is an S3 object            is.object(x) & !isS4(x) or pryr::otype()
Check if it inherits from a specific class    inherits(x, 'classname')
Determine the class of any object             class(x)

Base Types (C Structure)

• Underlying every R object is a C structure (or struct) that describes how that object is stored in memory.
• The struct includes the contents of the object, the information needed for memory management, and a type.

typeof()  # determines an object's base type

• The "Data structures" section explains the most common base types (atomic vectors and lists), but base types also encompass functions, environments, and other more exotic objects like names, calls, and promises.
• To see if an object is a pure base type (i.e., it doesn't also have S3, S4, or RC behavior), check that is.object(x) returns FALSE.

Created by Arianne Colton and Sean Chen
data.scientist.info@gmail.com
Based on content from "Advanced R" by Hadley Wickham
Updated: January 15, 2016
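As a worked example of the tryCatch() pattern above, here is a wrapper that converts both warnings and errors into NA (safe_log is an illustrative name, not a base R function):

```r
safe_log <- function(x) {
  tryCatch(
    log(x),
    warning = function(c) NA_real_,  # e.g. log of a negative number
    error   = function(c) NA_real_   # e.g. a non-numeric argument
  )
}
safe_log(exp(1))  # 1
safe_log(-1)      # NA (log(-1) signals a warning, "NaNs produced")
safe_log("a")     # NA (non-numeric argument signals an error)
```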
Advanced R Cheat Sheet – Environments
Created by: Arianne Colton and Sean Chen

Environment Basics

Environment – data structure (with the two components below) that powers lexical scoping.
• Create environment: env1 <- new.env()

1. Named list ("bag of names") – each name points to an object stored elsewhere in memory. If an object has no names pointing to it, it gets automatically deleted by the garbage collector.
• Access with: ls('env1')

2. Parent environment – used to implement lexical scoping. If a name is not found in an environment, then R will look in its parent (and so on).
• Access with: parent.env('env1')

Four special environments

1. Empty environment – ultimate ancestor of all environments.
• Parent: none
• Access with: emptyenv()

2. Base environment – environment of the base package.
• Parent: empty environment
• Access with: baseenv()

3. Global environment – the interactive workspace that you normally work in.
• Parent: environment of last attached package
• Access with: globalenv()

4. Current environment – environment that R is currently working in (may be any of the above and others).
• Access with: environment()

Search Path

Search path – mechanism to look up objects, particularly functions.
• Access with: search() – lists all parents of the global environment (see Figure 1)
• Access any environment on the search path: as.environment('package:base')

Figure 1 – The Search Path
• Mechanism: always start the search from the global environment, then inside the latest attached package environment.
• New package loading with library()/require(): the new package is attached right after the global environment. (See Figure 2)
• Name conflict in two different packages: for functions with the same name, the latest attached package's function will get called.

search():
'.GlobalEnv' ... 'Autoloads' 'package:base'
library(reshape2); search()
'.GlobalEnv' 'package:reshape2' ... 'Autoloads' 'package:base'

NOTE: Autoloads: special environment used for saving memory by only loading package objects (like big datasets) when needed.

Figure 2 – Package Attachment

Binding Names to Values

Assignment – act of binding (or rebinding) a name to a value in an environment.

1. <- (regular assignment arrow) – always creates a variable in the current environment.
2. <<- (deep assignment arrow) – modifies an existing variable found by walking up the parent environments.

Warning: If <<- doesn't find an existing variable, it will create one in the global environment.

Function Environments

1. Enclosing environment – the environment where the function was created. It determines how the function finds values.
• The enclosing environment never changes, even if the function is moved to a different environment.
• Access with: environment(func1)

2. Binding environment – all environments that have a binding to the function. It determines how we find the function.
• Access with: pryr::where('func1')

Example (for enclosing and binding environments):

y <- 1
e <- new.env()
e$g <- function(x) x + y

• Function g's enclosing environment is the global environment;
• its binding environment is e.

3. Execution environment – newly created environment that hosts a function call's execution.
• Two parents:
  I. the enclosing environment of the function
  II. the calling environment of the function
• The execution environment is thrown away once the function has completed.

4. Calling environment – the environment where the function was called.
• Access with: parent.frame()
• Dynamic scoping:
  • About: look up variables in the calling environment rather than in the enclosing environment.
  • Usage: most useful for developing functions that aid interactive data analysis.

RStudio® is a trademark of RStudio, Inc. • CC BY Arianne Colton, Sean Chen • data.scientist.info@gmail.com • 844-448-1212 • rstudio.com Updated: 2/16
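The difference between <- and <<- is easiest to see in a closure, where <<- rebinds a name in the enclosing (not necessarily global) environment. A small sketch, assuming no variable i exists in the global environment:

```r
new_counter <- function() {
  i <- 0
  function() {
    i <<- i + 1   # walks up to the enclosing environment and rebinds i there
    i
  }
}
count <- new_counter()
count()  # 1
count()  # 2
i        # error: object 'i' not found -- nothing leaked into the global env
```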
Data Structures

     Homogeneous     Heterogeneous
1d   Atomic vector   List
2d   Matrix          Data frame
nd   Array

Note: R has no 0-dimensional or scalar types. Individual numbers or strings are actually vectors of length one, NOT scalars.

Human readable description of any R data structure:

str(variable)

Every object has a mode and a class:
1. Mode: represents how an object is stored in memory.
• 'type' of the object from R's point of view
• Access with: typeof()
2. Class: represents the object's abstract type.
• 'type' of the object from R's object-oriented programming point of view
• Access with: class()

                               typeof()    class()
strings or vector of strings   character   character
numbers or vector of numbers   numeric     numeric
list                           list        list
data.frame                     list        data.frame

Factors

1. Factors are built on top of integer vectors using two attributes:

class(x) -> 'factor'
levels(x)  # defines the set of allowed values

2. Useful when you know the possible values a variable may take, even if you don't see all values in a given dataset.

Warning on factor usage:
1. Factors look and often behave like character vectors, but they are actually integers. Be careful when treating them like strings.
2. Most data loading functions automatically convert character vectors to factors. (Use the argument stringsAsFactors = FALSE to suppress this behavior.)

Object Oriented (OO) Field Guide

Object Oriented Systems

R has three object oriented systems:

1. S3 is a very casual system. It has no formal definition of classes. It implements generic-function OO.
• Generic-function OO – a special type of function called a generic function decides which method to call.
  Example: drawRect(canvas, 'blue')
  Language: R
• Message-passing OO – messages (methods) are sent to objects and the object determines which function to call.
  Example: canvas.drawRect('blue')
  Language: Java, C++, and C#

2. S4 works similarly to S3, but is more formal. Two major differences to S3:
• Formal class definitions – describe the representation and inheritance for each class, with special helper functions for defining generics and methods.
• Multiple dispatch – generic functions can pick methods based on the class of any number of arguments, not just one.

3. Reference classes are very different from S3 and S4:
• Implements message-passing OO – methods belong to classes, not functions.
• Notation – $ is used to separate objects and methods, so method calls look like canvas$drawRect('blue').

S3

1. About S3:
• R's first and simplest OO system
• Only OO system used in the base and stats packages
• Methods belong to functions, not to objects or classes.

2. Notation:
• generic.class()
• Date method for the mean() generic: mean.Date()

3. Useful 'generic' operations:
• Get all methods that belong to the 'mean' generic: methods('mean')
• List all generics that have a method for the 'Date' class: methods(class = 'Date')

4. S3 objects are usually built on top of lists, or atomic vectors with attributes.
• Factor and data frame are S3 classes.
• Useful operations:

Check if object is an S3 object                 is.object(x) & !isS4(x) or pryr::otype()
Check if object inherits from a specific class  inherits(x, 'classname')
Determine class of any object                   class(x)

Base Type (C Structure)

R base types – the internal C-level types that underlie the above OO systems.
• Includes: atomic vectors, list, functions, environments, etc.
• Useful operation: determine if an object is a base type (not S3, S4 or RC): is.object(x) returns FALSE.
• Internal representation: a C structure (or struct) that includes:
  • the contents of the object
  • memory management information
  • a type – access with: typeof()
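A minimal S3 generic and method illustrate the generic.class() naming convention described above (area, circle, and the field r are illustrative names, not from the text):

```r
area <- function(shape) UseMethod("area")          # the generic
area.circle <- function(shape) pi * shape$r^2      # method for class 'circle'
area.default <- function(shape) stop("no area method for this class")

c1 <- structure(list(r = 2), class = "circle")     # an S3 object: list + class attribute
area(c1)                # 12.56637 (pi * 4); dispatches to area.circle
inherits(c1, "circle")  # TRUE
```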
Functions

Function Basics

Functions – objects in their own right.

All R functions have three parts:

body()         code inside the function
formals()      list of arguments which controls how you can call the function
environment()  "map" of the location of the function's variables (see "Enclosing Environment")

Every operation is a function call
• +, for, if, [, $, { …
• x + y is the same as `+`(x, y)

Note: the backtick (`) lets you refer to functions or variables that have otherwise reserved or illegal names.

Lexical Scoping

What is Lexical Scoping?
• Looks up the value of a symbol. (See "Enclosing Environment".)
• findGlobals() – lists all the external dependencies of a function:

f <- function() x + 1
codetools::findGlobals(f)
> '+' 'x'
environment(f) <- emptyenv()
f()
# error in f(): could not find function "+"

• R relies on lexical scoping to find everything, even the + operator.

Function Arguments

Arguments – passed by reference and copied on modify.
1. Arguments are matched first by exact name (perfect matching), then by prefix matching, and finally by position.
2. Check if an argument was supplied: missing()

i <- function(a, b) {
  missing(a)  # returns TRUE or FALSE
}

3. Lazy evaluation – since x is not used, stop("This is an error!") never gets evaluated:

f <- function(x) {
  10
}
f(stop('This is an error!')) -> 10

4. Force evaluation:

f <- function(x) {
  force(x)
  10
}

5. Default arguments evaluation:

f <- function(x = ls()) {
  a <- 1
  x
}
f() -> 'a' 'x'   # ls() evaluated inside f
f(ls())          # ls() evaluated in global environment

Return Values

• Last expression evaluated or explicit return(). Only use explicit return() when returning early.
• Return ONLY a single object. Workaround: return a list containing any number of objects.
• Invisible return value – not printed out by default when you call the function:

f1 <- function() invisible(1)

Primitive Functions

What are Primitive Functions?
1. Call C code directly with .Primitive() and contain no R code.

print(sum):
> function (..., na.rm = FALSE) .Primitive('sum')

2. formals(), body(), and environment() are all NULL.
3. Only found in the base package.
4. More efficient since they operate at a low level.

Infix Functions

What are Infix Functions?
1. The function name comes in between its arguments, like + or -.
2. All user-created infix functions must start and end with %.

`%+%` <- function(a, b) paste0(a, b)
'new' %+% 'string'

3. Useful way of providing a default value in case the output of another function is NULL:

`%||%` <- function(a, b) if (!is.null(a)) a else b
function_that_might_return_null() %||% default value

Replacement Functions

What are Replacement Functions?
1. Act like they modify their arguments in place, and have the special name xxx<-.
2. Actually create a modified copy. Can use pryr::address() to find the memory address of the underlying object.

`second<-` <- function(x, value) {
  x[2] <- value
  x
}
x <- 1:10
second(x) <- 5L
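The invisible-return behavior mentioned above is easy to observe at the console:

```r
f1 <- function() invisible(1)
f1()        # prints nothing
x <- f1()
x           # 1 -- the value is still returned, just not auto-printed
print(f1()) # 1 -- wrapping in print() forces display
(f1())      # 1 -- surrounding parentheses also force display
```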
Subsetting
Subsetting returns a copy of the Data Frame Subsetting Examples
original data, NOT copy-on modified
Data Frame – possesses the characteristics of both lists and
1. Lookup tables (character subsetting)
Simplifying vs. Preserving Subsetting matrices. If you subset with a single vector, they behave like lists; if
you subset with two vectors, they behave like matrices x <- c('m', 'f', 'u', 'f', 'f', 'm', 'm')
1. Simplifying subsetting lookup <- c(m = 'Male', f = 'Female', u = NA)
1. Subset with a single vector : Behave like lists
• Returns the simplest possible lookup[x]
data structure that can represent >m f u f f m m
df1[c('col1', 'col2')]
the output > 'Male' 'Female' NA 'Female' 'Female' 'Male' 'Male'
unname(lookup[x])
2. Preserving subsetting 2. Subset with two vectors : Behave like matrices > 'Male' 'Female' NA 'Female' 'Female' 'Male' 'Male'
• Keeps the structure of the output 2. Matching and merging by hand (integer subsetting)
df1[, c('col1', 'col2')]
the same as the input. Lookup table which has multiple columns of information:
• When you use drop = FALSE, it’s
The results are the same in the above examples, however, results are grades <- c(1, 2, 2, 3, 1)
preserving info <- data.frame(
different if subsetting with only one column. (see below)
Simplifying* Preserving grade = 3:1,
1. Behave like matrices desc = c('Excellent', 'Good', 'Poor'),
Vector x[[1]] x[1] fail = c(F, F, T)
str(df1[, 'col1']) -> int [1:3] )
List x[[1]] x[1]
First Method
Factor x[1:4, drop = T] x[1:4] • Result: the result is a vector
id <- match(grades, info$grade)
2. Behave like lists info[id, ]
x[1, , drop = F] or
Array x[1, ] or x[, 1]
x[, 1, drop = F] str(df1['col1']) -> ‘data.frame’ Second Method
Data x[, 1, drop = F] or rownames(info) <- info$grade
x[, 1] or x[[1]] • Result: the result remains a data frame of 1 column
frame x[1] info[as.character(grades), ]

Simplifying behavior varies slightly between different data types:

1. Atomic Vector
   • x[[1]] is the same as x[1]
2. List
   • [ ] always returns a list
   • Use [[ ]] to get list contents; this returns a single piece out of a list
3. Factor
   • Drops any unused levels, but it remains a factor class
4. Matrix or Array
   • If any of the dimensions has length 1, that dimension is dropped
5. Data Frame
   • If output is a single column, it returns a vector instead of a data frame

$ Subsetting Operator

1. About Subsetting Operator
   • Useful shorthand for [[ combined with character subsetting
   x$y is equivalent to x[['y', exact = FALSE]]
2. Difference vs. [[
   • $ does partial matching, [[ does not
   x <- list(abc = 1)
   x$a -> 1        # since "exact = FALSE"
   x[['a']] -> NULL  # [[ does no partial matching
3. Common mistake with $
   • Using it when you have the name of a column stored in a variable
   var <- 'cyl'
   x$var
   # doesn't work, translated to x[['var']]
   # Instead use x[[var]]

3. Expanding aggregated counts (integer subsetting)
   • Problem: a data frame where identical rows have been collapsed into one and a count column has been added
   • Solution: rep() and integer subsetting make it easy to uncollapse the data by subsetting with a repeated row index:
   rep(x, y) - rep replicates the values in x, y times.
   df1$countCol is c(3, 5, 1)
   rep(1:nrow(df1), df1$countCol)
   > 1 1 1 2 2 2 2 2 3

4. Removing columns from data frames (character subsetting)
   There are two ways to remove columns from a data frame:
   • Set individual columns to NULL: df1$col3 <- NULL
   • Subset to return only the columns you want: df1[c('col1', 'col2')]

5. Selecting rows based on a condition (logical subsetting)
   • This is the most commonly used technique for extracting rows out of a data frame.
   df1[df1$col1 == 5 & df1$col2 == 4, ]
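The three data-frame applications above can be sketched together; `df1` here is hypothetical aggregated data invented for illustration:

```r
# Hypothetical aggregated data: each row stands for countCol identical rows
df1 <- data.frame(col1     = c(5, 7, 5),
                  col2     = c(4, 2, 9),
                  col3     = c("x", "y", "z"),
                  countCol = c(3, 5, 1))

# 3. Expand the aggregated counts with rep() + integer subsetting
expanded <- df1[rep(1:nrow(df1), df1$countCol), ]
nrow(expanded)  # 9

# 4. Remove a column (two equivalent routes)
df1$countCol <- NULL            # modify in place
kept <- df1[c("col1", "col2")]  # or keep only the columns you want

# 5. Select rows matching a condition
df1[df1$col1 == 5 & df1$col2 == 4, ]  # matches the first row only
```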
RStudio® is a trademark of RStudio, Inc. • CC BY Arianne Colton, Sean Chen • data.scientist.info@gmail.com • 844-448-1212 • rstudio.com Updated: 2/16
Subsetting continued

Boolean Algebra vs. Sets (Logical and Integer Subsetting)

1. Using integer subsetting is more effective when:
   • You want to find the first (or last) TRUE.
   • You have very few TRUEs and very many FALSEs; a set representation may be faster and require less storage.

2. which() - conversion from a boolean representation to an integer representation

   which(c(T, F, T, F)) -> 1 3

   • Integer representation length: always <= boolean representation length
   • Common mistakes:
     I. Using x[which(y)] instead of x[y]
     II. x[-which(y)] is not equivalent to x[!y]

Recommendation: Avoid switching from logical to integer subsetting unless you want, for example, the first or last TRUE value.

Subsetting with Assignment

1. All subsetting operators can be combined with assignment to modify selected values of the input vector.

   df1$col1[df1$col1 < 8] <- 0

2. Subsetting with nothing in conjunction with assignment:
   • Why: preserves the original object's class and structure

   df1[] <- lapply(df1, as.integer)

Debugging, Condition Handling and Defensive Programming

Debugging Methods

1. traceback() or RStudio's error inspector
   • Lists the sequence of calls that led to the error
2. browser() or RStudio's breakpoints tool
   • Opens an interactive debug session at an arbitrary location in the code
3. options(error = browser) or RStudio's "Rerun with Debug" tool
   • Opens an interactive debug session where the error occurred
   • Error Options:
     options(error = recover)
     • Difference vs. 'browser': can enter the environment of any of the calls in the stack
     options(error = dump_and_quit)
     • Equivalent to 'recover' for non-interactive mode
     • Creates last.dump.rda in the current working directory

In a batch R process:

   dump_and_quit <- function() {
     # Save debugging info to file last.dump.rda
     dump.frames(to.file = TRUE)
     # Quit R with error status
     q(status = 1)
   }
   options(error = dump_and_quit)

In a later interactive session:

   load("last.dump.rda")
   debugger()

Condition Handling of Expected Errors

1. Communicating potential problems to users:
   I. stop()
      • Action: raise a fatal error and force all execution to terminate
      • Example usage: when there is no way for a function to continue
   II. warning()
      • Action: generate warnings to display potential problems
      • Example usage: when some elements of a vectorized input are invalid
   III. message()
      • Action: generate messages to give informative output
      • Example usage: when you would like to print the steps of a program's execution

2. Handling conditions programmatically:
   I. try()
      • Action: gives you the ability to continue execution even when an error occurs
   II. tryCatch()
      • Action: lets you specify handler functions that control what happens when a condition is signaled

   result = tryCatch(code,
     error   = function(c) "error",
     warning = function(c) "warning",
     message = function(c) "message"
   )

   Use conditionMessage(c) or c$message to extract the message associated with the original error.

Defensive Programming

Basic principle: "fail fast" - raise an error as soon as something goes wrong

1. stopifnot() or the 'assertthat' package - check that inputs are correct
2. Avoid subset(), transform() and with() - these use non-standard evaluation; when they fail, they often fail with uninformative error messages.
3. Avoid [ and sapply() - functions that can return different types of output.
   • Recommendation: Whenever subsetting a data frame in a function, you should always use drop = FALSE
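The tryCatch() pattern and the fail-fast principle above can be combined in a short runnable sketch; `safe_log` is a hypothetical helper invented for this example:

```r
# Fail fast: validate inputs up front with stopifnot()
safe_log <- function(x) {
  stopifnot(is.numeric(x), length(x) >= 1)
  log(x)
}

# Handle the expected error programmatically with tryCatch()
result <- tryCatch(
  safe_log("ten"),   # wrong type -> stopifnot() signals an error immediately
  error   = function(c) conditionMessage(c),
  warning = function(c) "warning"
)
result   # the stopifnot() message, e.g. "is.numeric(x) is not TRUE"

# Warnings are caught the same way: log(-1) warns "NaNs produced"
w <- tryCatch(safe_log(-1), warning = function(c) conditionMessage(c))
```

Because the type check fires before `log()` is ever called, the error message points at the real problem (a non-numeric input) instead of a downstream symptom.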