Data Science Boot Camp Survival Manual

Table of Contents
1. Prologue
2. Chapter 0 - Data Scientist's Toolbox
3. Chapter 1 - R Programming
4. Chapter 2 - Getting and Cleaning Data
5. Chapter 3 - Exploratory Data Analysis
6. Chapter 4 - Reproducible Research
7. Chapter 5 - Statistical Inference
8. Chapter 6 - Regression Models
9. Chapter 7 - Practical Machine Learning
10. Chapter 8 - Developing Data Products
11. Capstone
12. Epilogue
Prologue
Welcome recruits!
During the next year you will learn the fundamentals of data science. The Data Science Specialization, offered by Johns
Hopkins University, is challenging. Success requires a strategy. This book aims to equip each of you with the knowledge
and skills to complete boot-camp. The "Data Science Boot-Camp Survival Manual" alone cannot guarantee success. Listen
to the instructor's lectures and apply yourself to the evaluations throughout your training.
According to Jeff Leek and the Data Science Specialization Team the key word in data science is "science". To this end, the
focus of the ten-course series including a capstone project is to provide the learner with:
1. an introduction to the key ideas behind reproducible research,
2. an introduction to the tools and techniques to transform raw data into a presentable report,
3. an opportunity to gain hands-on practice so you can learn the techniques for yourself, and
4. an appreciation of the mathematics & statistics involved in data science.
Core Courses
The courses comprising the Data Science Specialization are:
Data Scientist's Toolbox
R Programming
Getting and Cleaning Data
Exploratory Data Analysis
Reproducible Research
Statistical Inference
Regression Models
Practical Machine Learning
Developing Data Products
These courses, taught by Brian Caffo, Jeff Leek, and Roger D. Peng, give the learner the foundational skills. While
the lectures and assignments build these foundational skills, learners often require further explanations. The course
forums allow learners to discuss the lecture topics and assignments. Yet each session of a course begins without the
shared knowledge of previous participants. Serving as a Community Teaching Assistant (CTA) made it clear that a
companion guide would be beneficial.
Are you up to the challenge of Johns Hopkins University's Data Science Specialization?
Structure of the Boot-Camp Survival Manual

Each chapter covers one of the core courses. A tutorial style balancing theory and practical application makes surviving
data science boot-camp possible. You learn the workflow typically involved in all phases of a data analysis project.
Chapter 0: The Data Scientist's Toolbox
URL: https://www.coursera.org/course/datascitoolbox
Synopsis: "Get an overview of the data, questions, and tools that data analysts and data scientists work with. This is the
first course in the Johns Hopkins Data Science Specialization."
Chapter 1: R Programming
URL: https://www.coursera.org/course/rprog
Synopsis: "Learn how to program in R and how to use R for effective data analysis. This is the second course in the Johns
Hopkins Data Science Specialization."
Chapter 2: Getting and Cleaning Data
URL: https://www.coursera.org/course/getdata
Synopsis: "Learn how to gather, clean, and manage data from a variety of sources. This is the third course in the Johns
Hopkins Data Science Specialization."
Chapter 3: Exploratory Data Analysis
URL: https://www.coursera.org/course/exdata
Synopsis: "Learn the essential exploratory techniques for summarizing data. This is the fourth course in the Johns Hopkins
Data Science Specialization."
Chapter 4: Reproducible Research
URL: https://www.coursera.org/course/repdata
Synopsis: "Learn the concepts and tools behind reporting modern data analyses in a reproducible manner. This is the fifth
course in the Johns Hopkins Data Science Specialization."
Chapter 5: Statistical Inference
URL: https://www.coursera.org/course/statinference
Synopsis: "Learn how to draw conclusions about populations or scientific truths from data. This is the sixth course in the
Johns Hopkins Data Science Course Track."
Chapter 6: Regression Models
URL: https://www.coursera.org/course/regmods
Synopsis: "Learn how to use regression models, the most important statistical analysis tool in the data scientist's toolkit.
This is the seventh course in the Johns Hopkins Data Science Specialization."
Chapter 7: Practical Machine Learning
URL: https://www.coursera.org/course/predmachlearn
Synopsis: "Learn the basic components of building and applying prediction functions with an emphasis on practical
applications. This is the eighth course in the Johns Hopkins Data Science Specialization."
Chapter 8: Developing Data Products
URL: https://www.coursera.org/course/devdataprod
Synopsis: "Learn the basics of creating data products using Shiny, R packages, and interactive graphics. This is the ninth
course in the Johns Hopkins Data Science Specialization."
Data Science Capstone
URL: https://www.coursera.org/course/dsscapstone
Synopsis: "The capstone project class will allow students to create a usable/public data product that can be used to show
your skills to potential employers. Projects will be drawn from real-world problems and will be conducted with industry,
government, and academic partners. "
Course synopses quoted from the course information pages at Coursera as at 1 April 2015.
Course Dependency and Recommended Sequence

Although the courses are standalone, the knowledge is cumulative. The pedagogical course dependencies are available
from Johns Hopkins University.
Figure 1 Course dependency diagram provided by Daniel M. Bontje (created 17 November 2014)
You need a language or system to perform the tasks (R Programming) and data to analyse (Getting and Cleaning Data). You
then get a sense of the data (Exploratory Data Analysis) before building models and drawing inferences (Statistical Inference,
Regression Models) or making predictions (Practical Machine Learning), and finally present your conclusions
and supporting evidence (Developing Data Products, Reproducible Research).
The recommended mathematics background is linear algebra and introductory statistics (descriptive and inferential).
Statistical Inference and Regression Models, both courses in this specialisation, cover the basic statistical concepts that form a
solid foundation for subsequent courses in the Data Science Specialization. These courses, along with Practical Machine
Learning, are the theoretical underpinnings, while the other six courses are applied in nature: obtaining data, scrubbing
data, exploring data, modeling data, and interpreting data, collectively known as the OSEMN (pronounced "awesome")
model.
Again, welcome to the Data Science Boot-Camp. Review the "Data Science Boot-Camp Survival Manual" on a regular basis
throughout your training.
Chapter 0 - The Data Scientist's Toolbox

We shall neither fail nor falter; we shall not weaken or tire...give us the tools and we will finish the job. - Winston
Churchill, Prime Minister of Great Britain
Primary Instructor: Jeff Leek, MS, PhD (Biostatistics)
The foundational course Data Scientist's Toolbox is a high-level overview of the specialisation. This course lays the
groundwork for the nine-course series plus capstone project. It takes a comprehensive approach, teaching fundamental
skills for data science regardless of data set.
The keyword in "data science" is science, not data. The method is not dependent upon the dataset size; it scales from small
data to big data. The data science method equates to the scientific method used in the natural sciences. The Financial
Times article, "Big data: are we making a big mistake?", argues for a rigorous methodology. An article, "The Data Science
Methodology", published on Data Science 101, argues for adoption of the scientific method familiar to scientists in the natural
sciences.
Data Science Methodology
1. problem formulation (hypothesis)
2. obtain data (experiment)
3. analysis (validate or refute hypothesis)
4. data product (report)
The courses in the Johns Hopkins University Data Science Specialization "emphasise a data science methodology rather
than focusing primarily on data science technique. [T]he instructors have taken care throughout to demonstrate a
responsible, scientifically-based approach to collecting, curating and analyzing data sources," says specialisation
participant John Frederick Thiels.
Learning Objectives
You will have learned the basic skills to successfully use the various tools required throughout the book and the data
science specialisation courses.
Tools of the Trade

To successfully complete the hands-on exercises in the book and course assignments (quizzes, programming, and course
projects) some software must be installed on your computer or in a hosted environment: Git, R and RStudio. A GitHub
account is mandatory because peer-assessed submissions must be accessible. Internet access is necessary to participate
fully in the courses, for example to watch or download lectures, take quizzes, submit programming assignments,
and take part in the peer assessment process. Due to the variety of operating system platforms on which the software
can be deployed, for this book, we decided to solely focus on Ubuntu Linux running locally or remotely in a virtualised
environment.
Before delving into how to use the various tools in our toolbox, it is important to consider the types of skills we need as
data-scientists-in-training. Firstly, linear algebra, probability and calculus at the introductory level are sufficient mathematics.
Secondly, introductory descriptive and inferential statistics including hypothesis testing is the recommended statistics
background. Thirdly, basic programming skills are recommended. None of the aforementioned skills is mandatory for the
Data Science Specialization. For those readers seeking to learn any of these skills there are courses available, including:
Pre-Calculus - Instructors: Sarah Eichhorn and Rachel Cohen Lehman, University of California, Irvine
Probability - Instructor: Santosh S. Venkatesh
Calculus: Single Variable - Instructor: Robert Ghrist, University of Pennsylvania

Descriptive Statistics - Instructor: Matthijs Rooduijn, University of Amsterdam
Inferential Statistics - Instructor: Annemarie Zand Scholten, University of Amsterdam
Data Analysis and Statistical Inference - Instructor: Mine Çetinkaya-Rundel, Duke University
Programming for Everybody (Python) - Instructor: Charles Severance, University of Michigan
Programming for Everybody (Python) deserves special mention because it is consistently highly-rated by course
participants for the teaching-style of "Dr. Chuck." You do not have to be a geek to enjoy this course.
Read the information page of each course especially if you prefer a self-teaching approach to learning. There are freely
available textbooks for some of these courses.
Virtualisation Software
While the various applications required for these courses can be installed on the host operating system of your computer,
we recommend using virtualisation software such as Oracle VirtualBox, VMware Workstation, Fusion or Player, or
Parallels Desktop, depending upon the operating system running on the computer. Another virtualisation option is the RStudio
Server Amazon Machine Image (AMI), or rolling your own local or hosted virtual machine instance.
This section describes two scenarios:
importing a ready-made disk image (AMI) of Ubuntu Linux 14.04 LTS (64-bit) on the Amazon Web Services Elastic
Compute Cloud (AWS EC2) hosting platform, and
importing a ready-made disk image of Ubuntu Linux 14.04 LTS (32-bit or 64-bit) into Oracle VirtualBox on your
computer.
An advantage of virtualisation software, running on your computer or remotely hosted by a service provider, is all the
required applications are kept separate from your computer's operating system and by default isolated from the host file
system.
Option A: Amazon Web Services Elastic Compute Cloud (EC2) with Amazon Machine Image
If you prefer installing Oracle VirtualBox and creating a virtual machine on your computer, you can skip this section.
Instructions are forthcoming.
Option B: Local Computer with Oracle VirtualBox
Please consult the instructions about downloading and installing Oracle VirtualBox onto your computer before proceeding.
Download the ready-made disk image of Ubuntu Linux (32-bit or 64-bit) based on the version supported by Oracle
VirtualBox and the architecture of the computer.
Note: Some computers are 64-bit but only allow 32-bit operating systems to run within virtualisation software.
Extract the compressed archive containing the disk image using p7zip.
$ 7za e Ubuntu_14.04.2-32bit.7z
7-Zip (A) [64] 9.20 Copyright (c) 1999-2010 Igor Pavlov 2010-11-18
p7zip Version 9.20 (locale=en_CA.UTF-8,Utf16=on,HugeFiles=on,2 CPUs)
Processing archive: Ubuntu_14.04.2-32bit.7z
Extracting 32bit/Ubuntu 14.04.2 (32bit).vdi
Extracting 32bit
Everything is Ok
Folders: 1
Files: 1
Size: 3807379456
Compressed: 776252068
$
After installing Oracle VirtualBox it is time to launch it so we can import the virtual machine disk image (.vdi).
Figure 0.1 Creating a new virtual machine instance

Click 'New' on the main menu. A dialogue box pop-up appears where you enter the name to assign to the virtual machine
and select the operating system and version. Click 'Next' to continue.
Figure 0.2 Allocating system memory to the new virtual machine instance
Select the amount of system memory (RAM) to allocate to the virtual machine. Allocate 2048 MB of system memory to this
virtual machine instance. This parameter can be modified later if necessary. Click 'Next' to continue.
Figure 0.3 Associating an existing virtual hard drive to the new virtual machine instance
Select 'Use an existing virtual hard drive file' and click on the file folder icon to navigate to the virtual hard drive file
previously downloaded and uncompressed. Click 'Create' to associate this disk image with the current virtual machine.
Figure 0.4 Mount the VirtualBox Guest Additions ISO

Make the VirtualBox Guest Additions ISO accessible to the virtual machine instance. At the main screen of Oracle
VirtualBox select the DataScientistsToolbox virtual machine. Click 'Settings', then 'Storage', followed by 'Empty'.
Figure 0.5 Mount the VirtualBox Guest Additions ISO

Click the CD/DVD icon and select VBoxGuestAdditions.iso from the dropdown list. Click 'OK' to return to the main screen.

Figure 0.6 Starting the new virtual machine instance
At the main screen of Oracle VirtualBox select the newly created virtual machine instance. Click 'Start' to launch the virtual
machine. At the login prompt type the password from the download webpage.
The final preparatory step is enabling the VirtualBox Guest Additions and updating any out-of-date packages installed on
the virtual machine. Open a terminal window (CTRL + ALT + T).
Activate the VirtualBox Guest Additions so the virtual machine instance integrates with the host system.
$ cd /media/osboxes/VBOXADDITIONS*
$ sudo sh VBoxLinuxAdditions.run
Upon successful installation, shut down the virtual machine instance by clicking the Gear icon in the upper right corner of the
virtual machine, then unmount the VirtualBox Guest Additions ISO by reversing the steps shown in Figures 0.4 and 0.5.
Alternatively, you may choose to leave the VirtualBox Guest Additions ISO attached.
Note: Whenever an updated Linux kernel is installed as part of the normal update process the VirtualBox Guest Additions
will have to be reapplied to ensure the shared clipboard, for example, continues to work. Do NOT forget to restart the virtual
machine instance so the VirtualBox Guest Additions are activated.
Figure 0.7 Enable/Disable shared clipboard and drag-and-drop

Enabling a shared clipboard between your computer and the virtual machine instance is configurable via the 'Settings'
menu.
Figure 0.8 Pointing device and device boot-order configuration

The mouse device type should be configured as 'PS/2 Mouse' whether using a wired or wireless mouse. The device boot
order should be configured to ensure the virtual disk image is the default boot device.
Restart the virtual machine instance.
Switching between standard mode and full-screen mode is as easy as Host_Key + F (RIGHT_CTRL + F by default).
For convenience launch a terminal session (CTRL + ALT + T) and when its icon appears in the application bar right-click
the mouse and select 'Lock to Launcher'. From this point forward any time a terminal session is wanted simply click the
'Terminal' icon.
Figure 0.9 System settings configuration

Before proceeding with updating the currently installed system software and applications we should select an Ubuntu Linux
package repository in geographic proximity to your location. This can be accomplished by clicking the 'System Settings'
icon in the application bar along the left-edge of the screen. Click 'Software & Updates'.
Next, open a terminal session (CTRL + ALT + T). When the terminal displays the shell prompt type the following commands
to update and upgrade the currently installed system software and applications. If you see the 'Software Updater' icon in
the application bar, you can apply software updates by clicking the icon instead.
$ sudo apt-get -y update

$ sudo apt-get -y upgrade
Figure 0.10 Editing the user name, password, language preference and enabling automatic login
Automatic login can be enabled and the display name for the user account and password can be changed, if desired, via
'System Settings' by clicking 'User Accounts'.
Figure 0.11 Automatic login enabled

Click 'Unlock' to enable editing of the user account configuration. Type the current password when prompted. If you want to
change the account name, click 'osboxes.org' and type the desired account name. If you want to change the password,
click on the asterisks and type the desired password. If you want to enable automatic login, click 'OFF' so that 'ON' is
visible. Finally, click 'Lock' to relock the user account configuration.
Getting Familiar with the Command-Line Interface (CLI)

After a short detour to familiarise ourselves with the command-line interface (CLI) we will install Git, R, and RStudio. Rest
assured that interacting with the command line is not required beyond this chapter. RStudio provides seamless integration with
the file system to navigate and manipulate files, version control and repository synchronisation between your computer and
repository hosting services, and the statistical computation and software development environment.
15-minute Introduction to Navigating and Manipulating the File System from the Terminal
Let's start exploring the basic features of the environment from the comfort of a terminal session and the command-line.
Open a terminal window (CTRL+ALT+T) if you are running a graphical desktop environment. By learning a few basic
commands to navigate and manipulate the file system you will feel at ease and understand what is going on behind the
scenes within the Files pane of RStudio.
Command | Description | Common Flags | Arguments
--- | --- | --- | ---
pwd | print working directory name | |
ls | list file and/or directory names | -l (long form); -a (hidden); -R (recursive) | [directory_path/][pattern] (optional)
mkdir | make directory | | [directory_path/]directory_name or [directory_path/]directory_name_list (mandatory)
cd | change directory | | [directory_path/][directory_name] (optional)
touch | create an empty file | | [directory_path/]file_name (mandatory)
echo | write a string (to stdout by default; redirect with > to create a file) | -e (interpret backslash escapes); -n (no trailing newline) | "a string of characters" (mandatory)
cp | copy file or directory | -r (recursive) | (source) [directory_path/][file_name] (target) [directory_path/][file_name] (mandatory)
mv | move file or directory | | (source) (target) (mandatory)
rm | remove/delete file or directory | -f (force); -r (recursive) | (mandatory)
Arguments in brackets are optional, but if the 'mandatory' designation is present, at least one of the arguments must be
supplied. Directory names and paths as well as file names may contain wildcard characters (* and ?) when used with some of
these commands.
Table 0.1 Basic File and Directory Commands
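The note under Table 0.1 mentions the wildcard characters * and ?. As a minimal sketch, assuming a scratch directory made with mktemp and hypothetical file names that are not part of the examples in this chapter, the following shows how the shell expands each pattern before the command sees it:

```shell
# Work in a throwaway directory so nothing in the home directory is touched.
scratch=$(mktemp -d)
cd "$scratch"

# Hypothetical file names, chosen only to illustrate the patterns.
touch file01.txt file02.txt file10.txt notes.md

# '*' matches any run of characters (including none).
all_txt=$(echo *.txt)
# '?' matches exactly one character.
one_char=$(echo file0?.txt)

echo "$all_txt"     # file01.txt file02.txt file10.txt
echo "$one_char"    # file01.txt file02.txt
```

Because the shell performs the expansion, the same patterns work with ls, cp, mv and rm alike.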

For each example type the commands to the right of the command prompt ($) to interactively follow along these examples.
Take your time working through the commands until you fully understand why each command produces the observed
results.
Example 1: Determine the current working directory
$ pwd
/home/osboxes
Example 2: List the file and subdirectory names in the current working directory
$ ls
Desktop Downloads Music Public Videos
Documents examples.desktop Pictures Templates
Example 3: Create a subdirectory named 'test' in the current directory
$ mkdir test
$ cd test
$ pwd
/home/osboxes/test
Example 4: Create subdirectories named '1', '2', '3', and '4' in the current directory
$ mkdir {1,2,3,4}
Example 5: List the files and subdirectory names in the current directory
$ ls
1 2 3 4
Example 6: Create some empty files and some files with content
$ touch 1/file01.txt 2/file02.txt
$ echo "Bonjour tout le monde"
Bonjour tout le monde
$ echo "Hello World!" > ./1/file0101.txt
$ echo "To be or not to be" > ./3/file03.txt
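Example 6 relies on output redirection to turn echo into a way of creating files. A minimal sketch, assuming a scratch directory and a hypothetical file name greeting.txt, contrasts > (create or overwrite) with >> (append) and checks the resulting file sizes in bytes:

```shell
scratch=$(mktemp -d)
target="$scratch/greeting.txt"    # hypothetical file name

# '>' creates the file, or replaces its content if it already exists.
echo "Hello World!" > "$target"
size_once=$(wc -c < "$target" | tr -d ' ')

# '>>' appends to the existing content instead of replacing it.
echo "Hello World!" >> "$target"
size_twice=$(wc -c < "$target" | tr -d ' ')

# A second '>' discards everything written so far.
echo "To be or not to be" > "$target"
size_over=$(wc -c < "$target" | tr -d ' ')

echo "$size_once $size_twice $size_over"    # 13 26 19
```

The sizes include the newline that echo adds to each string, which is why the 12-character greeting occupies 13 bytes.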
Example 7: Change to the directory immediately above the current directory and list the files and subdirectory names in the
subdirectory named '1'
$ cd ..
$ ls -l test/1
total 4
-rw-rw-r-- 1 osboxes osboxes 13 Apr 3 09:28 file0101.txt
-rw-rw-r-- 1 osboxes osboxes 0 Apr 3 09:27 file01.txt
Example 8: List the files ending with '.txt' in the subdirectory named '3'
$ ls -l test/3/*.txt
-rw-rw-r-- 1 osboxes osboxes 19 Apr 3 09:29 test/3/file03.txt
Example 9: (a) Copy the file 'file02.txt' from directory named '${HOME}/test/2' to directory '${HOME}/test/4' and name the
file 'file04.txt'
$ cp ./test/2/file02.txt ./test/4/file04.txt
(b) Copy the file 'file02.txt' from directory named '${HOME}/test/2' to directory '${HOME}/test/4' and name the file 'file02.txt'
$ cp ~/test/2/file02.txt ./test/4/file02.txt
Example 10: Make subdirectory '${HOME}/test/3' the current working directory and create a hidden file and a hidden
subdirectory
$ cd test/3
$ touch .hidden01.txt
$ mkdir .hidden
Example 11: List the names of non-hidden files and subdirectories in the current directory, then list them again including hidden entries
$ ls
file03.txt
$ ls -a
. .. file03.txt .hidden .hidden01.txt
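Names that begin with a dot are hidden from a plain ls. A minimal sketch, assuming a scratch directory that mirrors the names used in Examples 10 and 11, uses the -A flag (not listed in Table 0.1) to show hidden entries while leaving out the '.' and '..' directory entries:

```shell
scratch=$(mktemp -d)
cd "$scratch"

# One visible file, one hidden file and one hidden subdirectory.
touch file03.txt .hidden01.txt
mkdir .hidden

visible=$(ls)                                 # hidden names are skipped
entry_count=$(ls -A | wc -l | tr -d ' ')      # hidden names counted, '.' and '..' omitted

echo "$visible"        # file03.txt
echo "$entry_count"    # 3
```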
Example 12: Create a subdirectory named 'another' in the home directory of the user and copy the files and subdirectories
recursively from '${HOME}/test' to '${HOME}/another'
$ mkdir ~/another
$ cp -r ../* ~/another
Example 13: List the files and subdirectories in the home directory of the user
$ ls ~
another Documents examples.desktop Pictures Templates Videos
Desktop Downloads Music Public test
Example 14: List the file and subdirectory names in '${HOME}/another'
$ ls ~/another
1 2 3 4
Example 15: List the file names and recursively the subdirectories in '${HOME}/another'
$ ls -R ~/another
/home/osboxes/another:
1 2 3 4
/home/osboxes/another/1:
file0101.txt file01.txt
/home/osboxes/another/2:
file02.txt
/home/osboxes/another/3:
file03.txt
/home/osboxes/another/4:
file02.txt file04.txt
Example 16: Create a subdirectory named 'test/5' in the home directory of the user and move (copy and delete) the files
and/or subdirectories from '${HOME}/another'
$ mkdir ~/test/5
$ mv ~/another/* ../5
Example 17: List the file and subdirectory names in '${HOME}/another'
$ ls -a /home/osboxes/another
. ..
Example 18: List the file and subdirectory names in '${HOME}/test/5'
$ ls ../5
1 2 3 4
Example 19: List the file names and recursively the subdirectories in '${HOME}/test/5'
$ ls -R ~/test/5
/home/osboxes/test/5:
1 2 3 4
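/home/osboxes/test/5/1:
file0101.txt file01.txt
/home/osboxes/test/5/2:
file02.txt
/home/osboxes/test/5/3:
file03.txt
/home/osboxes/test/5/4:
file02.txt file04.txt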
Example 20: Make directory '/home/osboxes' the current working directory
$ cd
$ pwd
/home/osboxes
Example 21: Delete the subdirectories 'test' and 'another' from '${HOME}', and then list the file and subdirectory names in
the current directory
$ rm -rf test another

$ ls
Desktop Downloads Music Public Videos
Documents examples.desktop Pictures Templates
Example 22: Close the terminal session
$ exit
A cheatsheet for the Bourne Again SHell (BASH) has been prepared by the folks at Learn Code the Hard Way (LCodeTHW).
A complete manual for BASH is available from the GNU Project if you want to further explore the CLI and its capabilities.
Markdown - Writing Documentation the Easy Way

The markdown language, created by John Gruber, is relatively small and easy to learn, unlike markup languages such as
HTML and XML. Taking a portion of this book as an example, with some minor changes to demonstrate particular features,
we explore some of the more common markdown elements.
Prologue
===
# Introduction
During the next year you will learn the fundamentals of data science.
Surviving the nine courses which make up the [Data Science
Specialization][0001] offered by [Johns Hopkins University][jhu] requires a
**strategy**.
To this end, the focus of the ten-course series including a capstone project
is to provide the learner with:
1. an introduction to the key ideas behind reproducible research,
2. an introduction to the tools and techniques to transform raw
data into a presentable report,
4. an opportunity to gain hands-on practice so you can learn the
techniques for yourself, and
3. an appreciation of the mathematics & statistics involved in
data science.
## Core Courses
* Data Scientist's Toolbox
* R Programming
* Exploratory Data Analysis
* Getting and Cleaning Data
* Reproducible Research
* Statistical Inference
* Regression Models
* Practical Machine Learning
* Developing Data Products
![Course Dependency](dst_courses.png)
*Figure 1 Course dependency diagram*
[0001]: https://www.coursera.org/specialization/jhudatascience/1?utm_medium=courseDescripTop
[jhu]: http://www.jhu.edu
Listing 0.1 Sample markdown document

So that you can immediately practise each of the markdown elements used in the sample document, a concise description
is supplied with references to the sample document.
Font Modifiers
There are two styles of font modifier supported by standard markdown:
bold (text surrounded by **)
italics (text surrounded by *)
From the sample document we see that 'strategy' is modified during conversion to render bolded, whilst 'Figure 1 Course
dependency diagram' is modified during conversion to render italicised.
Headings
There are two styles of headers supported by standard markdown:
setext
First-level (text underlined by at least 3 equal-signs)
Secondary-level (text underlined by at least 3 dashes)

atx
First-level (# preceding text)
Secondary-level (## preceding text)
Third-level (### preceding text)
Fourth-level (#### preceding text)
From the sample document we see that 'Prologue' and 'Introduction' are first-level headers, and 'Core Courses' is a
second-level header.
Images
There are two styles of image links supported by standard markdown:
inline
filename: ![alternate text](directory_path/image "optional title")
reference
id: ![alternate text][string of digits | string of terms]
Links
There are two styles of links supported by standard markdown:
inline
URL: [random website](URL "optional title")
reference
id: [random website][string of digits | string of terms]
From the sample document we see that 'Data Science Specialization' is referenced by the id label (0001) whereas 'Johns
Hopkins University' is referenced by the id label (jhu). The actual URLs are collected at the end of the same document
although the label definitions could appear anywhere in the document.
Lists
There are two styles of lists supported by standard markdown:
ordered list
number (followed by a period and at least one space; physical ordering overrides numeric label
during conversion)
unordered list
* (asterisk)
- (dash)
+ (plus)
From the sample document we see an ordered list containing the learner outcomes and an unordered list containing the
names of each of the nine core courses.
Install the markdown (MD) to hyper-text markup language (HTML) converter to practise modifying the sample markdown
document.
$ sudo apt-get install markdown
A text editor combined with the markdown-to-html converter is all that is needed.
$ nano sample.md
$ markdown sample.md # sends HTML output to the screen
$ markdown sample.md > sample.html # sends HTML output to a file named 'sample.html'
$ firefox sample.html # view the rendered HTML in a web browser
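The edit-and-convert cycle can also be scripted. A minimal sketch, assuming a scratch directory and a cut-down, hypothetical sample.md rather than the full text of Listing 0.1; the conversion step is skipped when the markdown command is not installed:

```shell
scratch=$(mktemp -d)

# Write a small markdown file with a here-document instead of an editor.
cat > "$scratch/sample.md" <<'EOF'
Prologue
===

## Core Courses

1. first item
4. second item (position, not the label 4, decides the order)
3. third item

* **bold** and *italic* text
* a [reference link][jhu]

[jhu]: http://www.jhu.edu
EOF

# Convert only if the converter installed above is on the PATH.
if command -v markdown >/dev/null 2>&1; then
    markdown "$scratch/sample.md" > "$scratch/sample.html"
    echo "HTML written to $scratch/sample.html"
else
    echo "markdown converter not found; only $scratch/sample.md was written"
fi
```

Opening the resulting sample.html in a browser shows the ordered list numbered 1, 2, 3 despite the labels in the source.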
Take your time working through the sample markdown document until you fully understand why each element produces the
observed results. This book is written in markdown. In another course you will learn how to produce a
markdown document combining text and executable R code using R Markdown, and convert it to HTML and PDF using
RStudio.
Git - Version Control

Git is a distributed version control system allowing any number of people to collaboratively contribute to software
development or other projects. Some of the courses require learners to submit their programming assignments to GitHub
as part of a peer assessment grading process.
Installing Git
By installing the Git command-line client you can choose whether to manage your local and remote repositories from a
terminal session or within RStudio. Assuming you are running the Ubuntu Linux virtual machine or another Debian
GNU/Linux derived distribution, type the command shown to install the Git client.
$ sudo apt-get install -y git git-doc
If you have installed a different distribution refer to the system documentation to determine the package manager needed to
install software from the software repository.
15-minute Introduction to Version Control with Git from the Terminal
Let's start exploring the basic features of version control from the comfort of a terminal session. Open a terminal
window (CTRL+ALT+T) if you are running a graphical desktop environment.
Once RStudio is installed you will have integrated access to Git.
Command | Description | Common Flags | Arguments
--- | --- | --- | ---
git init | initialise a local repository; default is current working directory | | [directory_path/][directory_name] (optional)
git branch | determine the current branch | |
git checkout | create a new branch in the current repository | -b (new branch) | branch_name (mandatory)
git status | reports the status of the local repository | |
git show | reports the historical differences of the files in the local repository | |
git add | add files to the local repository | -A (add); -u (track file name changes and deletions) | (mandatory)
git commit | commit any changes to the local repository | -a (add); -m (message) | "a string of characters" (optional, mandatory)
git pull | fetch changes from another repository and merge with current repository | | source target (mandatory)
git push | update remote repository with changes from the current repository | -u (add upstream (tracking) reference) | target source (mandatory unless -u flag present)
git merge | flatten commit history before merging source branch with target branch | --squash | branch_name (mandatory)
git revert | undo changes to the local repository | | reference_point (mandatory)
Arguments in brackets are optional, but if the 'mandatory' designation is present, at least one of the arguments must be
supplied.
Table 0.2 Basic Git Commands
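Two entries in Table 0.2, git checkout -b and git merge --squash, are easier to follow when seen together. A minimal sketch, assuming a throwaway repository created with mktemp, a hypothetical file notes.txt, a hypothetical branch named feature, and placeholder identity values:

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q

# Placeholder identity, stored in this repository only.
git config --local user.name "Sample Learner"
git config --local user.email "learner@example.com"

echo "line 1" > notes.txt
git add notes.txt
git commit -q -m "initial commit"

# Remember the starting branch (its name depends on how Git is configured).
main_branch=$(git rev-parse --abbrev-ref HEAD)

# Create a new branch and make two commits on it.
git checkout -q -b feature
echo "line 2" >> notes.txt
git commit -q -a -m "add line 2"
echo "line 3" >> notes.txt
git commit -q -a -m "add line 3"

# Back on the starting branch, fold both commits into one staged change,
# then record that change as a single commit.
git checkout -q "$main_branch"
git merge -q --squash feature
git commit -q -m "squashed feature work"

commit_count=$(git rev-list --count HEAD)
line_count=$(wc -l < notes.txt | tr -d ' ')
echo "$commit_count commits, $line_count lines"    # 2 commits, 3 lines
```

The starting branch ends up with two commits rather than three, which is what "flatten commit history" in the table refers to.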

For each of the examples in this section type the commands to the right of the command prompt ($) to interactively follow
along these examples. Take your time working through the commands until you fully understand why each command
produces the observed results.
Preliminaries: Configure your email address and username to be used by Git. The flag --global means apply the
configuration to all of your Git repositories on the computer. The flag --local means apply the configuration to only the
current Git repository.
$ git config [--local | --global] user.email "userid@domain.tld"

$ git config [--local | --global] user.name "username"
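The bracketed notation above means choosing one of the two flags. A minimal sketch, assuming a throwaway repository and placeholder name and email values, sets the identity with --local and reads it back; a value set with --local takes precedence over any --global value when Git works inside that repository:

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q

# Placeholder values; --local writes them to this repository's .git/config only.
git config --local user.name "Sample Learner"
git config --local user.email "learner@example.com"

# Without a scope flag, git config reports the value that is in effect here.
effective_name=$(git config user.name)
local_email=$(git config --local --get user.email)

echo "$effective_name <$local_email>"    # Sample Learner <learner@example.com>
```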
Note: The output of some Git commands in these examples has been reformatted for presentation within this book.
Example 1: Create a local repository.
$ mkdir Projects
$ mkdir Projects/DataScientistsToolbox
$ mkdir Projects/DataScientistsToolbox/sample
$ cd Projects/DataScientistsToolbox/sample
$ git init
Initialised empty Git repository in /home/osboxes/Projects/DataScientistsToolbox/sample/.git/
$ ls -la
drwxrwxr-x 3 osboxes osboxes 4096 Apr 5 19:15 .
drwxrwxr-x 3 osboxes osboxes 4096 Apr 5 19:07 ..
drwxrwxr-x 7 osboxes osboxes 4096 Apr 5 19:15 .git
Example 2: Create an empty README.md file in the local repository.
$ touch README.md
$ git add .
$ git commit -m "initial commit"
[master (root-commit) b7c48f3] initial commit
1 file changed, 0 insertions(+), 0 deletions(-)
create mode 100644 README.md
$ git status
On branch master
nothing to commit, working directory clean
$ git show
commit b7c48f3e5cdc772e6a198c3633acd853a69a5778
Author: jhudss
Date: Sun Apr 5 19:21:21 2015 -0300
initial commit
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..e69de29
Example 3: Edit the README.md file and paste the sample markdown document into the file.
$ nano README.md
$ git add -A .
$ git commit -m "added content"
[master 8fd8eb8] added content
1 file changed, 41 insertions(+)
Example 4: Edit the README.md file swapping "Getting and Cleaning Data" and "Exploratory Data Analysis."
$ nano README.md
$ git commit -a -m "swapped order of two courses"
[master 87d0125] swapped order of two courses
1 file changed, 1 insertion(+), 1 deletion(-)
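A change to a tracked file is recorded only once it has been staged, either with git add or with the -a flag of git commit listed in Table 0.2. The following sketch illustrates the difference in a throwaway repository; the file content, commit messages, and variable names are assumptions made for illustration.

```shell
# Sketch: a modified file is committed only after it has been staged.
repo_dir="$(mktemp -d)"
cd "$repo_dir" || exit 1
git init --quiet
git config --local user.email "userid@domain.tld"
git config --local user.name "username"

echo "first line" > README.md
git add README.md
git commit --quiet -m "initial commit"

# Modify the tracked file without staging it
echo "second line" >> README.md

# A plain commit finds nothing staged and exits with a non-zero status
if git commit --quiet -m "unstaged change"; then
    plain_commit="succeeded"
else
    plain_commit="refused"
fi

# -a stages every modified tracked file before committing
git commit --quiet -a -m "staged with -a"
commit_count="$(git rev-list --count HEAD)"
echo "plain commit: $plain_commit, commits on branch: $commit_count"
```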
Example 5: Determine whether there are any changes.
$ git status
On branch master
$ git show
commit 87d012594aa5a8a39e99d4728dc8c853779587ab
Author: jhudss
Date: Sun Apr 5 19:34:34 2015 -0300
swapped order of two courses
index 756292a..48587e6 100644
--- a/README.md
+++ b/README.md
@@ -25,8 +25,8 @@ The courses comprising the Data Science Specialization are:
* Data Scientist's Toolbox
* R Programming
-* Exploratory Data Analysis
* Getting and Cleaning Data

+* Exploratory Data Analysis
* Reproducible Research
* Statistical Inference
* Regression Models
Example 6: Create a branch named 'draft'.
$ git checkout -b draft

Switched to a new branch 'draft'
$ git status
On branch draft
Example 7: Edit the README.md file to add "Git is easy. Git is fun. Thanks Linus!" anywhere in the file.
$ nano README.md
$ git status
On branch draft
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git checkout -- <file>..." to discard changes in working directory)
modified: README.md
no changes added to commit (use "git add" and/or "git commit -a")
$ git commit -a -m "thanked the creator of Git"
[draft 34af00f] thanked the creator of Git
Example 8: Switch to the 'master' branch and check the repository status.
$ git checkout master

Switched to branch 'master'
$ git status
On branch master
Example 9: Merge the 'draft' branch with the 'master' branch and check the repository status.
$ git merge draft

Updating 87d0125..34af00f
Fast-forward
README.md | 2 ++
$ git status
On branch master
$ git show
commit 34af00fc564fd28e485503715dd5a9a9a461329a
Author: jhudss
Date: Sun Apr 5 19:49:08 2015 -0300
thanked the creator of Git
index 48587e6..aa53fee 100644
--- a/README.md
+++ b/README.md
@@ -19,6 +19,8 @@ is to provide the learner with:
3. an appreciation of the mathematics & statistics involved in
data science.
+Git is easy. Git is fun. Thanks Linus!
+
## Core Courses
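Examples 6 to 9 can be condensed into one script. The sketch below is an illustration under stated assumptions: it runs in a throwaway repository, and it looks up the name of the starting branch instead of assuming 'master', because that name depends on the Git configuration. After a fast-forward merge both branches point at the same commit.

```shell
# Sketch: create a branch, commit on it, and fast-forward the original branch.
repo_dir="$(mktemp -d)"
cd "$repo_dir" || exit 1
git init --quiet
git config --local user.email "userid@domain.tld"
git config --local user.name "username"

echo "# Sample" > README.md
git add README.md
git commit --quiet -m "initial commit"

# Remember the starting branch; its name depends on the Git configuration
main_branch="$(git rev-parse --abbrev-ref HEAD)"

git checkout --quiet -b draft
echo "Git is easy. Git is fun. Thanks Linus!" >> README.md
git commit --quiet -a -m "thanked the creator of Git"

git checkout --quiet "$main_branch"
git merge --quiet draft

# After a fast-forward both branch names resolve to the same commit
main_commit="$(git rev-parse "$main_branch")"
draft_commit="$(git rev-parse draft)"
echo "$main_branch: $main_commit"
echo "draft: $draft_commit"
```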
A cheatsheet for Git and GitHub has been prepared by the folks at GitHub.
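Table 0.2 also lists git revert, which the examples above do not exercise. The sketch below shows one way to try it, assuming a throwaway repository and HEAD as the reference point. Rather than deleting history, git revert records a new commit that undoes the referenced one.

```shell
# Sketch: undo the most recent commit with git revert.
repo_dir="$(mktemp -d)"
cd "$repo_dir" || exit 1
git init --quiet
git config --local user.email "userid@domain.tld"
git config --local user.name "username"

echo "keep this line" > README.md
git add README.md
git commit --quiet -m "initial commit"

echo "unwanted line" >> README.md
git commit --quiet -a -m "added an unwanted line"

# --no-edit accepts the default revert message without opening an editor
git revert --no-edit HEAD > /dev/null

commit_count="$(git rev-list --count HEAD)"
readme_content="$(cat README.md)"
echo "commits: $commit_count"
echo "README.md now contains: $readme_content"
```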
GitHub - Repository Hosting Service Supporting the Git Version Control System
GitHub is a repository hosting service allowing any number of people to collaboratively contribute to software development
or other projects. Some of the courses require learners to submit their programming assignments to GitHub as part of a
peer assessment grading process.
15-minute Introduction to Version Control with GitHub from the Terminal and Web Browser
Figure 0.12 Create an account with GitHub

Before creating a repository on GitHub you must create an account, preferably with the same username and email address used
when configuring Git. If you use a different email address and username for your GitHub account, you can associate Git's
username and email address with this account.
Figure 0.13 Choose a Personal Plan

Select the repository hosting plan for your account. The default free plan is sufficient for peer assessments during the
Johns Hopkins University Data Science Specialization.
Figure 0.14 New Account Orientation Dashboard

After your GitHub account is set up you are ready to explore the service. You should at the very least update the profile
information before proceeding.
For each of the examples in this section type the commands to the right of the command prompt ($) to interactively follow
along these examples. Take your time working through the commands until you fully understand why each command
produces the observed results.
Example 1: Synchronise a local repository with an empty repository of the same name on GitHub. The commands below
create the empty repository on GitHub and push the content of the local repository to your GitHub account. Substitute your
GitHub account name for 'user_name' and type your account password when prompted.
$ curl -u user_name https://api.github.com/user/repos \
    -d "{\"name\":\"sample\",\"description\":\"learning about Git and GitHub\"}"
$ git remote add origin https://github.com/user_name/sample.git
$ git push origin master
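The remote and push steps can be rehearsed offline before using a real account. In the sketch below a local bare repository stands in for the empty GitHub repository; that substitution, the temporary paths, and the variable names are assumptions made for illustration only.

```shell
# Sketch: register a remote named 'origin' and push to it.
work_dir="$(mktemp -d)"

# A bare repository stands in for the empty GitHub repository (assumption)
git init --quiet --bare "$work_dir/sample.git"

mkdir "$work_dir/sample"
cd "$work_dir/sample" || exit 1
git init --quiet
git config --local user.email "userid@domain.tld"
git config --local user.name "username"

touch README.md
git add README.md
git commit --quiet -m "initial commit"
current_branch="$(git rev-parse --abbrev-ref HEAD)"

# Register the remote, then push the current branch to it
git remote add origin "$work_dir/sample.git"
git push --quiet origin "$current_branch"

# The remote should now hold the same commit as the local branch
remote_url="$(git remote get-url origin)"
local_commit="$(git rev-parse HEAD)"
remote_commit="$(git --git-dir="$work_dir/sample.git" rev-parse "$current_branch")"
echo "origin: $remote_url"
echo "local:  $local_commit"
echo "remote: $remote_commit"
```

With a real GitHub repository only the URL passed to git remote add changes.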
A cheatsheet for Git and GitHub has been prepared by the folks at GitHub.
R - Statistical Analysis and Computing Environment

R is a statistical analysis and computing environment providing "an integrated suite of software facilities for data
manipulation, calculation and graphical display."
Installing R
Add the line "deb http://cran.rstudio.com/bin/linux/ubuntu trusty/" to the end of the sources.list file.
$ sudo nano /etc/apt/sources.list
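If you prefer not to edit the file interactively, the line can be appended from the shell. The sketch below is a hedged illustration: a temporary file stands in for /etc/apt/sources.list, and append_once is a hypothetical helper that adds the line only when it is not already present, so running it twice does not create a duplicate entry.

```shell
# Sketch: append a repository line to a sources file only if it is missing.
# A temporary file stands in for /etc/apt/sources.list (assumption).
sources_file="$(mktemp)"
cran_line="deb http://cran.rstudio.com/bin/linux/ubuntu trusty/"

append_once() {
    # $1 = line to add, $2 = file; -x matches whole lines, -F disables regex
    grep -qxF "$1" "$2" || printf '%s\n' "$1" >> "$2"
}

append_once "$cran_line" "$sources_file"
append_once "$cran_line" "$sources_file"   # second call changes nothing

line_count="$(grep -cxF "$cran_line" "$sources_file")"
echo "occurrences: $line_count"
```

Writing to the real sources.list requires root privileges, for example by piping the line through sudo tee -a.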
Fetch the signing key for the CRAN repository.
$ sudo apt-key adv --keyserver keys.gnupg.net --recv-key 51716619E084DAB9
Install the latest version of R, which might be newer than shown in the figures.
$ sudo apt-get update

$ sudo apt-get upgrade
$ sudo apt-get install -y r-base r-doc-info r-mathlib libcurl4-gnutls-dev
15-minute Introduction to the R Statistical and Computational Environment

Let's start exploring the basic features of the R environment from the comfort of the R Console command-line interface.
Open a terminal window (CTRL + ALT + T) if you are running a graphical desktop environment and type 'R' followed by the
[ENTER] key. Once RStudio is installed you won't have to work at the command-line unless you choose to do so.
Command | Description | Arguments
install.packages | install a package from CRAN | package_name (mandatory)
install_github | install a package from GitHub | package_name (mandatory)
library | load a package | package_name (mandatory)
? | access the help system | [package_name] [function_name] (mandatory)
q() | exit R; prompts to save the environment before shutting down the R Statistical Analysis and Computing Environment |
Arguments in brackets are optional but if the 'mandatory' designation is present, at least one of the arguments must be supplied.
Table 0.3 Essential R Commands

For each example type the commands to the right of the command prompt (>) to interactively follow along these examples.
Take your time working through the commands until you fully understand why each command produces the observed
results.
Example: For simplicity we do not show the output of the commands used within R Console. We will install the devtools
package, as an exemplar, which will be needed to successfully compile other packages throughout these chapters and the
nine data science courses.

$ R
R version 3.1.1 (2014-07-10) -- "Sock it to Me"
Copyright (C) 2014 The R Foundation for Statistical Computing
Platform: i686-pc-linux-gnu (32-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> install.packages("devtools")
> library(devtools)
> ?devtools
> q()
$
RStudio - Integrated Development Environment

RStudio is an integrated development environment providing a platform "to tackle the toughest and most interesting
problems with R."
Installing RStudio
$ wget http://download1.rstudio.org/rstudio-0.98.1103-i386.deb -O ${HOME}/Downloads/rstudio.deb

$ sudo apt-get install libjpeg62
$ sudo dpkg -i ${HOME}/Downloads/rstudio.deb
15-minute Introduction to the RStudio Integrated Development Environment
Figure 0.15 RStudio Integrated Development Environment

Launch RStudio by clicking on the 'Button' icon near the upper left of the application bar, typing 'rstudio' into the search
field, and clicking on the RStudio icon. Once the application is visible as shown in Figure 0.15 right-click on the RStudio
icon in the application bar and select 'Lock to Launcher'.
Figure 0.16 Configure Global Options

Click 'Tools' on the main menu followed by 'Global Options' to configure RStudio.
Figure 0.17 Select the CRAN repository mirror to fetch packages

Select a geographically-nearby CRAN repository after clicking 'Packages'.
Figure 0.18 Configure code editing preferences

Click 'Code Editing' to configure the appearance and behaviour of the code editing pane.
Figure 0.19 Configure version control options

Click 'Git/SVN' to configure which version control system will be used. If Git has already been installed, the defaults
can be accepted. Click 'Apply'. Click 'OK'.
Figure 0.20 Create a directory

Click the 'Files' tab in the lower right pane and navigate to the Projects directory and click 'New Folder'. Type the name of
the course DataScientistsToolbox. If a directory named Projects does not exist, create it.
Figure 0.21 Create a new project - Step 1

In the upper right click 'Project (None)' and select 'New Project'.

Click 'New Directory' to create a new repository.

Click 'Empty Project' as the project type.

Navigate to ${HOME}/Projects/DataScientistsToolbox. Click 'Choose'.

Type a directory name for the project, select 'Create a git repository', and click 'Create Project'.
Figure 0.27 Create a new text file

Select 'File' on the main menu followed by 'New File' and select 'Text File' as the file type.
Figure 0.28 Save the student_grades.csv data file

Type the contents shown in the code editing pane. Click on the diskette icon or select 'File, Save' from the menu. Type the
file name and click 'Save'.
Figure 0.29 Set the working directory

Click 'Session' on the main menu and select 'Set Working Directory' followed by 'To Files Pane Location'.
Figure 0.30 Save the student_grades.R script

Click 'File' on the main menu followed by 'New File' and select 'R Script'.
Type the R code shown below. Then click 'File' followed by 'Save' before typing the file name and clicking 'Save'.
Figure 0.31 Read student grades file and output the contents
Highlight the code in the 'student_grades.R' tab. Click 'Run'.
Figure 0.32 Commit changes to the local repository

Click the 'Git' tab in the upper right pane. Click 'Commit'.
Figure 0.33 Select changes to be committed to the local repository

Select each of the four files by marking them as staged. Type a commit message. Click 'Commit' to commit these changes
to the local repository.
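The same staging and commit step can be performed from the command line. The sketch below is an approximation under stated assumptions: it uses a throwaway repository, the two student_grades file names come from the figures, and the other two file names (.gitignore and demo.Rproj) are guesses at what RStudio generated for the project.

```shell
# Sketch: stage four new files at once and commit them.
repo_dir="$(mktemp -d)"
cd "$repo_dir" || exit 1
git init --quiet
git config --local user.email "userid@domain.tld"
git config --local user.name "username"

# Placeholder files; .gitignore and demo.Rproj are assumed names
touch student_grades.csv student_grades.R .gitignore demo.Rproj

# -A stages every new, modified, and deleted file under the current directory
git add -A .
staged_count="$(git diff --cached --name-only | wc -l | tr -d ' ')"

# --short prints one line per file; 'A' marks a newly staged file
git status --short
git commit --quiet -m "added student grades data and script"
echo "files staged: $staged_count"
```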
Figure 0.34 Summary of changes to the local repository

Review the messages before clicking 'Close'. Afterwards close the 'Review Changes' pop-up window.
Figure 0.35 Tracking changes in an open project

Modify the R code as shown in the 'student_grades.R' tab. Did you notice the new entry under the 'Git' tab? Highlight the
last line of code and run it. Commit this change using the same procedure.
Figure 0.36 Push the contents of the local repository to GitHub

Log in to GitHub using a web browser and create an empty repository named 'demo'. In RStudio click the gear icon under
the 'Git' tab and select 'Shell'. For convenience we put the git commands in the code pane. Type these commands in the
shell, substituting your GitHub account. Type 'exit' to close the shell. Verify the repository on GitHub has been updated.
Log out of GitHub.
Figure 0.37 Close the currently active project

Click on 'demo' in the upper right corner of RStudio and click 'Close Project'.
Figure 0.38 GitHub repository named demo after the push from local repository
Congratulations! You successfully configured a virtual machine for use during the data science boot-camp.
Practise. Practise. Practise your newly acquired knowledge and skills in preparation for the course project.
Final Thoughts
Data Scientist's Toolbox introduced the statistical computing and graphing suite, the integrated development
environment, and the version / revision control system selected by the Data Science Specialization Lab Team in the
Biostatistics Department of Johns Hopkins University. The features and capabilities of these tools extend beyond the
basics presented in this chapter. While the graphical user interface is convenient we highly recommend and encourage you
to become comfortable with the command-line as well.
As a data science recruit outfitted with your kit (Git, R, RStudio, Ubuntu Linux, and GitHub account) the instructor for R
Programming awaits. Boot-camp has been easy up to this point. Read the "Data Science Boot-Camp Survival Manual"
regularly to avoid washing-out of boot-camp.
Recruits, dismissed.
Chapter 1 - R Programming
Chapter 2 - Getting and Cleaning Data
Chapter 3 - Exploratory Data Analysis
Chapter 4 - Reproducible Research
Chapter 5 - Statistical Inference
Chapter 6 - Regression Models
Chapter 7 - Practical Machine Learning
Chapter 8 - Developing Data Products
Capstone
Epilogue
