R Basics For Starters (DataRockie)

R
Basics for Starters

By DataRockie
One-day crash course about R programming, data manipulation, graphical visualization and a glimpse of data science
You are new to R programming
You have little or no experience in programming
You want to improve your data analysis capability
Build on your resume
Stay relevant in your career
What the fuck is R?
You will be familiar with the R environment
Be able to run basic R code and functions
Be able to do basic data exploration
Be able to do basic data manipulation
Be able to do basic graphical visualization
Understand basic concepts of data science
Run your first machine learning model
I am not a teacher
but an awakener.
Robert Frost
R is an environment in which statistical modeling, data manipulation and graphical visualization are implemented :D
The S language was first developed by John Chambers; R was later developed by Ross Ihaka and Robert Gentleman at the University of Auckland.
Free
Fast
Good for statistical modeling
Widely used across the world
Strong community
Great online learning resources
Panes: Source, Environment, Console, Help
you won't be able to make a delicious cake.
Numeric
Integer
Character
Logical (TRUE/FALSE)
Complex (rarely used)
Equality is tested with == (double =)
Inf: Infinity
NaN: Not a Number
NA: Not Available, i.e. missing value
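A small sketch of how these types and special values look at the console. The variable names (is_equal, pos_inf, not_num, missing_value) are made up for illustration and are not from the course.

```r
# Each basic type reports its class
class(3.14)      # "numeric"
class(2L)        # "integer" (the L suffix marks an integer literal)
class("hello")   # "character"
class(TRUE)      # "logical"

# == tests equality
is_equal <- (1 + 1 == 2)

# Special values
pos_inf <- 1 / 0        # Inf
not_num <- 0 / 0        # NaN
missing_value <- NA     # NA, a missing value
```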
<-
Assign value
Put the same type of data points into a long vector
Element-wise computation
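To make the vector idea concrete, here is a minimal sketch. The vector y and the result names are invented for this example.

```r
# c() combines values of the same type into one vector
x <- c(1, 2, 4, 6, 7)
y <- c(10, 20, 30, 40, 50)

# Arithmetic is applied element by element
doubled <- x * 2   # 2 4 8 12 14
total <- x + y     # 11 22 34 46 57
```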
Take the value(s) you want from a vector
Subset by position
There are three main types of subset: position, names, and logic
Subset by logic
Subset by name
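The three ways of subsetting can be sketched like this. The scores vector is a made-up example, not course data.

```r
# A named numeric vector
scores <- c(math = 80, art = 65, music = 90)

# Subset by position: first and third elements
by_position <- scores[c(1, 3)]

# Subset by name
by_name <- scores["art"]

# Subset by logic: keep elements where the condition is TRUE
by_logic <- scores[scores > 70]
```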
R is an environment in which statistical
methods are performed
R uses element-wise vectorized operations
Basic data types in R and how to do subsetting
Input               Function   Output
x <- c(1,2,4,6,7)   mean(x)    4.00

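NA = missing value and should not be included in the calculation
Simply type help() or ? followed by the function name to get full details on that function
na.rm = TRUE removes missing values before the calculation
In mean(x, na.rm = TRUE), x is the first argument and na.rm is the second argument
Note that base R functions such as mean() set na.rm to FALSE by default, so you have to set it to TRUE yourself
args(function name) reveals the arguments of that function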
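A short sketch of missing values in action, assuming a vector with one NA added for illustration. args() is shown on sd() because its argument list displays na.rm directly.

```r
# A vector with one missing value
x <- c(1, 2, NA, 6, 7)

# Without na.rm the missing value propagates into the result
mean_with_na <- mean(x)

# With na.rm = TRUE the NA is dropped before averaging
mean_without_na <- mean(x, na.rm = TRUE)

# Show the arguments of a function
args(sd)
```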
Popular functions
mean(x)      class(x)
sum(x)       is.numeric(x)
sd(x)        is.integer(x)
var(x)       is.character(x)
median(x)    is.factor(x)
max(x)       is.logical(x)
min(x)       as.numeric(x)
summary(x)   as.factor(x)
Your function name
function() (takes two arguments)
return() or print() to show the result in the console
We ask R to sum a+b
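A minimal sketch of such a function. The name my_sum is an assumption chosen for this example.

```r
# A function with two arguments, a and b
my_sum <- function(a, b) {
  result <- a + b
  return(result)
}

answer <- my_sum(2, 3)
print(answer)
```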
Concatenate string, sep =
If x > 0, then print positive
If x < 0, then print negative
If x = 0, then print zero
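The three rules above could be written roughly as follows. check_sign is a hypothetical name, and the function returns the label so that you can print() it or reuse it.

```r
# Return a label describing the sign of x
check_sign <- function(x) {
  if (x > 0) {
    "positive"
  } else if (x < 0) {
    "negative"
  } else {
    "zero"
  }
}

print(check_sign(5))
print(check_sign(-2))
print(check_sign(0))
```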

Please write your own function howlong50 that takes one argument
(year of birth) and returns the number of years before you turn 50
Example: in 2016, for year of birth 1988: 50 - (2016 - 1988) = 22
Try to use paste() inside your function
and also if .. else !
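One possible way to approach the exercise, as a sketch only. The current_year argument with a default of 2016 is an assumption added here so the result stays the same whenever you run it; you still call the function with one argument.

```r
# Years left before turning 50
# current_year is an assumed helper argument, not part of the exercise text
howlong50 <- function(birth_year, current_year = 2016) {
  age <- current_year - birth_year
  years_left <- 50 - age
  if (years_left > 0) {
    paste("You have", years_left, "years before you turn 50")
  } else {
    paste("You are already 50 or older")
  }
}

msg <- howlong50(1988)
print(msg)
```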
Well done guys!!
mutate
R does not recognize the mutate() function ..?
Simply type the functions below:
install.packages()
library()
R has over 10,000 packages ready for you to download and use
dplyr is the package name (the most useful)
Data Frame
A data frame is a two-dimensional table

data.frame
data.frame() will create a data frame using the vectors we specify, column by column.
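A tiny sketch of building a data frame from vectors. The friends table and its values are invented for illustration.

```r
# Each argument becomes one column; all vectors have the same length
friends <- data.frame(
  name = c("Ann", "Bob", "Cat"),
  age = c(25, 31, 28),
  likes_r = c(TRUE, FALSE, TRUE)
)

dim(friends)   # rows, then columns
print(friends)
```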
Just a few
clicks :D
function()
See the built-in datasets in R by typing data() in the console
str(x)       mtcars[1:3, ]
names(x)     mtcars[ , 1:5]
summary(x)   mtcars[1,3]
head(x)
tail(x)
dim(x)
We will learn efficient ways to work with data frames (dplyr package) in the next section
$
mtcars$mpg
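Here is a rough sketch of these commands applied to the built-in mtcars data. The result names are chosen for this example.

```r
# Rows 1 to 3, all columns
first_rows <- mtcars[1:3, ]

# All rows, columns 1 to 5
first_cols <- mtcars[, 1:5]

# A single cell: row 1, column 3
one_cell <- mtcars[1, 3]

# $ pulls out one column as a vector
mpg_values <- mtcars$mpg
avg_mpg <- mean(mpg_values)

str(mtcars)    # structure of the data frame
head(mtcars)   # first six rows
dim(mtcars)    # number of rows and columns
```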
Your turn to answer these three questions
using mtcars
which.min
Answer
A good data analyst knows
how to treat his/her data
Data manipulation
dplyr: the most useful package for data manipulation, created by Hadley Wickham
Object + Verb
Object: data frame
Verbs: select() + rename(), filter(), mutate(), arrange(), summarise()
select
select(dataframe, column1, column2, ...)
Now you can select the columns you want.
select
select(dataframe, column1, column2, ...)
You have selected columns 1:5 and 8
select
select(dataframe, starts_with())
starts_with()
ends_with()
contains()
You have just selected the columns whose names start with g
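The select() variations above might look like this on mtcars. It assumes dplyr is already installed; the result names are made up for the example.

```r
library(dplyr)

# Select columns by name
by_name <- select(mtcars, mpg, hp, wt)

# Select columns by position: 1 to 5 and 8
by_position <- select(mtcars, 1:5, 8)

# Select columns whose names start with "g"
starts_g <- select(mtcars, starts_with("g"))

names(by_position)
names(starts_g)
```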
rename
rename(dataframe, new_name = old_name)
filter
filter(dataframe, criterion)
filter() makes it very easy to select the rows that match your criteria
filter(mtcars, am == 1)
filters :D
filter(dataframe, criterion1, criterion2)
Explain: we ask R to filter mtcars (data frame) to rows where am == 1 and hp <= 100
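A sketch of the single and double filter, assuming dplyr is installed. Criteria separated by a comma are combined with AND.

```r
library(dplyr)

# One criterion: manual cars only
manual_cars <- filter(mtcars, am == 1)

# Two criteria: manual cars with hp of 100 or less
small_manual <- filter(mtcars, am == 1, hp <= 100)

nrow(manual_cars)
nrow(small_manual)
```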

%in%
filter(dataframe, column %in% c())
We want to keep the rows where cyl is 4 or 6
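As a quick sketch (dplyr assumed installed), %in% checks whether each value appears in the vector on its right:

```r
library(dplyr)

# Keep rows where cyl is 4 or 6
four_or_six <- filter(mtcars, cyl %in% c(4, 6))

table(four_or_six$cyl)
```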
mutate
mutate(dataframe, new_column = )
GC = gear * carb
*** mutate = create a new column based on existing columns

mutate + ifelse
hp > 100, then high

hp <= 100, then low
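Both mutate() ideas can be sketched in one call, assuming dplyr is installed. The column name hp_level is an assumption for this example; GC comes from the slide.

```r
library(dplyr)

cars_new <- mutate(
  mtcars,
  GC = gear * carb,                              # new column from two existing ones
  hp_level = ifelse(hp > 100, "high", "low")     # label each car by horsepower
)

head(select(cars_new, gear, carb, GC, hp, hp_level))
```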
select, filter, mutate
Your manager wants to know hp, wt, gear, carb and
GC (gear * carb) of cars with weight (wt) greater than 4.00 (wt is measured in 1,000 lbs).
Can you get this information for your boss?
You can do it in one line of code, but it is not easy to write
Alternatively, you can do it in multiple lines, using assignment to create new data frames
I will show you how to use piping to chain your code
%>%
(Piping)
We just take the mtcars data %>% select the columns we want %>% then
filter weight higher than 4.00 %>% and mutate a new column GC
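The chain described above could be written roughly like this, assuming dplyr is installed. heavy_cars is a name picked for the example.

```r
library(dplyr)

heavy_cars <- mtcars %>%
  select(hp, wt, gear, carb) %>%   # keep the columns the manager asked for
  filter(wt > 4) %>%               # cars heavier than 4.00
  mutate(GC = gear * carb)         # add the new column

print(heavy_cars)
```

Each step receives the result of the step before it, so the code reads in the same order as the sentence.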
arrange
arrange(dataframe, column you want to sort by)
Default order is from low to high
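A short sketch of both sort directions, assuming dplyr is installed. desc() is the dplyr helper for high-to-low order.

```r
library(dplyr)

# Default: low to high
low_to_high <- arrange(mtcars, mpg)

# Wrap the column in desc() for high to low
high_to_low <- arrange(mtcars, desc(mpg))

head(low_to_high$mpg)
head(high_to_low$mpg)
```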

summarise
summarise(dataframe, summary statistics)
summarise
min, average, max
five verbs
an adverb
group_by
Together with summarise()
can give you much better
insights about your data
Automatic or manual gear car,
which type has higher average weight?
group_by + summarise
0 = automatic gear
1 = manual gear
Manual gear cars seem to be lighter!
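The weight comparison can be sketched like this, assuming dplyr is installed. The summary column names (min_wt, avg_wt, max_wt) are invented for the example.

```r
library(dplyr)

weight_by_am <- mtcars %>%
  group_by(am) %>%                 # 0 = automatic, 1 = manual
  summarise(
    min_wt = min(wt),
    avg_wt = mean(wt),
    max_wt = max(wt)
  )

print(weight_by_am)
```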
Quiz
Install the new package hflights and load the new data set into R
Your boss wants to know the average distance by carrier, ordered from high to low.
How many carriers are in the data?

Which carrier has the highest average distance?
Solution for the quiz
Well done again!

Show me, don't tell me.
Data visualization
Why do you need graphs & visualization?
Storytelling
Data exploration (for personal use)
Publication (for public use)
Try barplot() for yourself !
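If you want a starting point for barplot(), here is one possible sketch in base R. Counting cars by cyl is a choice made for this example, not the course's own plot.

```r
# Count cars for each number of cylinders
cyl_counts <- table(mtcars$cyl)

# Draw the bars; barplot() also returns the x position of each bar
bar_mids <- barplot(
  cyl_counts,
  main = "Cars by number of cylinders",
  xlab = "cyl",
  ylab = "Count"
)
```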
ggvis is amongst the most popular packages for data visualization
created by Hadley Wickham and Winston Chang
graph = data + column + mark + property
Example: data = mtcars, column = wt and mpg, mark = point, property = red
This is what you will be able to create by end of this course.
Better than Excel? :D

Code callouts: data; ~x and ~y (columns); points = scatter plot; red (property)
Layers:
layer_points
layer_lines
layer_bars
layer_histograms
layer_smooths
layer_densities
Properties (you can customize your plot with these):
size
fill (color)
opacity (0-1)
stroke (color)
strokeWidth
shape (only for layer_points)
Map vs. Set
= maps a property to a variable in the data (map)
:= sets a property to a fixed value, such as red (set)
Understanding = vs. := is crucial for ggvis users
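A small sketch of the difference, assuming the ggvis package is installed. The object names mapped and fixed are made up for this example.

```r
library(ggvis)

# Map (=): fill follows a variable, so each cyl level gets its own color
mapped <- mtcars %>%
  ggvis(~wt, ~mpg, fill = ~factor(cyl)) %>%
  layer_points()

# Set (:=): fill is one fixed value, so every point is red
fixed <- mtcars %>%
  ggvis(~wt, ~mpg, fill := "red") %>%
  layer_points()
```

Typing mapped or fixed at the console draws the plot.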
layer_points()
circle (default)
square
cross
diamond
triangle-up
triangle-down
A histogram is a very popular graph for seeing the distribution of a continuous variable; you can change the bin width easily in ggvis
factor() is used to change the class of a variable to a factor
You can use a color name or color code to set the fill property
The argument se = TRUE adds a standard error band around the smooth curve, which makes your plot look professional
You can plot multiple layer_xxx into the same graph :D !!
group_by
We use table() to check the frequency of qualitative variables
Now there are 3 smooth curves because cyl has 3 levels
group_by() with more than one factor
~interaction( , ) creates combinations
3 cyl
2 am
So we have 3 x 2 = 6 groups
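The frequency check and the 3 x 2 arithmetic can be verified in base R with a quick sketch. The names cyl_freq, groups and n_groups are chosen for this example.

```r
# Frequency of each cyl level
cyl_freq <- table(mtcars$cyl)
print(cyl_freq)

# interaction() builds one factor from every cyl and am combination
groups <- interaction(mtcars$cyl, mtcars$am)
n_groups <- nlevels(groups)
print(n_groups)
```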
Try iris dataset, use layer_densities() to plot Sepal.Length grouped by Species
titles
Simple to use ;)
add_legend
Nice one !!
customize
range = c( , , ) input your colors
hp is numeric, so we use
scale_numeric() to set colors :D
On the next slide you'll learn how to use scale_nominal
scale_nominal
Use your preferred colors !!

You have already learned two of
the most useful skills in R.
Well done guys !!
The sexiest job of the 21st century
Data scientist
Changing technology
Changing mindset
What the fuck is
data science?
Data Science Key Terminology
that you should know
A computer program is said to learn from experience E with respect to
some class of tasks T and performance measure P, if its performance at
tasks in T, as measured by P, improves with experience E.
Your problem
Data selection
Data preparation
Model selection
Model training/tuning
Model performance/validation
Supervised: Regression, Classification
Unsupervised: Clustering
CARET
(For supervised learning)
first ML
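To give a feel for a first supervised model, here is a minimal sketch. It uses base R's lm() rather than caret, so it runs without extra packages; the 24/8 split, the seed, and the choice of wt and hp as predictors are all assumptions made for this illustration.

```r
set.seed(42)

# Data selection and preparation: split mtcars into training and test rows
train_rows <- sample(nrow(mtcars), size = 24)
train <- mtcars[train_rows, ]
test <- mtcars[-train_rows, ]

# Model training: linear regression predicting mpg from wt and hp
fit <- lm(mpg ~ wt + hp, data = train)

# Model validation: predict on unseen cars and measure the error (RMSE)
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$mpg - pred)^2))
print(rmse)
```

The steps follow the workflow above: prepare data, train on one part, then validate on the part the model has never seen.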
https://www.analyticsvidhya.com/
https://www.r-bloggers.com/
http://stackoverflow.com/
Now you have completed R Basics.
Hope you enjoy the class!
and I hope to see you again

DataRockie
Marketing, Neuro, DataScience
www.bitesize.studio
www.FB.com/datarockie
