
R

Basics for Starters


By Datarockie
One-day crash course
about R programming,
data manipulation,
graphical visualization
and a glimpse of data
science
You are new to R programming
You have little or no experience in programming
You want to improve your data analysis capability
Build on your resume
Stay relevant in your career
What the fuck is R?
You will be familiar with the R environment
Be able to write basic R code and functions
Be able to do basic data exploration
Be able to do basic data manipulation
Be able to do basic graphical visualization
Understand basic concepts of data science
Run your first machine learning model
I am not a teacher
but an awakener.
Robert Frost
R is an environment within which statistical
modeling, data manipulation and graphical
visualization are implemented :D

First developed by John Chambers (as the S language), later developed into R
by Ross Ihaka and Robert Gentleman at the University of Auckland.
Free
Fast
Good for statistical modeling
Widely used across the world
Strong community
Great online learning resources
The four panes: Source, Environment, Console, Help
you won't be
able to make delicious cake.
Numeric
Integer
Character
Logical (TRUE/FALSE)
Complex (rarely used)
Equal uses ==
(double =)
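A quick way to see these types for yourself is class(). This is only a small sketch; the example values are made up for illustration.

```r
# class() reports the data type of a value
class(3.14)      # "numeric"
class(5L)        # "integer" (the L suffix marks an integer)
class("hello")   # "character"
class(TRUE)      # "logical"
class(2 + 3i)    # "complex"

# == tests equality and gives back a logical value
5 == 5           # TRUE
5 == 6           # FALSE
```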
Inf = Infinity

NaN = Not a Number

NA = Not Available
i.e. missing value
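These special values show up in ordinary calculations. A minimal sketch in base R:

```r
1 / 0        # Inf  (infinity)
-1 / 0       # -Inf
0 / 0        # NaN  (not a number)
NA + 1       # NA   (anything combined with a missing value stays missing)

# Helper functions to test for them
is.infinite(1 / 0)   # TRUE
is.nan(0 / 0)        # TRUE
is.na(NA)            # TRUE
```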
<-
Assign value
Put the same type of data points into a long vector
Element-wise
computation
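A small sketch of vectors and element-wise computation. The vectors x and y are made-up examples.

```r
# c() combines values of the same type into a vector
x <- c(1, 2, 3, 4, 5)
y <- c(10, 20, 30, 40, 50)

# Operations are applied element by element
x + y    # 11 22 33 44 55
x * 2    # 2 4 6 8 10
x > 3    # FALSE FALSE FALSE TRUE TRUE
```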
Take the value(s) you want from a vector

Subset by position

There are three main types of subset: position, names, and logic
Take the value(s) you want from a vector

Subset by logic
Take the value(s) you want from a vector

Subset by name
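The three kinds of subsetting can be sketched on one small named vector. The vector scores and its values are hypothetical, just for illustration.

```r
# A named numeric vector (hypothetical exam scores)
scores <- c(math = 80, art = 65, music = 90)

# Subset by position
scores[1]          # math: 80
scores[c(1, 3)]    # math: 80, music: 90

# Subset by logic: keep the elements where the condition is TRUE
scores[scores > 70]   # math: 80, music: 90

# Subset by name
scores["art"]      # art: 65
```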
R is an environment in which statistical
methods are performed
R uses element-wise vectorized operations
Basic data types in R and how to do subsetting
Input: x <- c(1,2,4,6,7)
Function: mean(x)
Output: 4.00


NA = missing value and should not
be included in the calculation
Simply type help()
or ?... to get full
details on that function
na.rm = TRUE
First argument: x, the vector
Second argument: na.rm, which is FALSE by default; set it to TRUE to skip NA values
args(function name) reveals the arguments of that function
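Here is a short sketch of how a missing value affects mean(). Note that in base R na.rm is FALSE unless you set it yourself. The vector x is a made-up example.

```r
x <- c(1, 2, NA, 6, 7)

mean(x)                  # NA, because one value is missing
mean(x, na.rm = TRUE)    # 4, the NA is dropped before averaging

# args() shows a function's arguments and their defaults
args(sd)                 # function (x, na.rm = FALSE)
```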
> mean(x) > class(x)
> sum(x) > is.numeric(x)
> sd(x) > is.integer(x)
> var(x) > is.character(x)
> median(x) > is.factor(x)
> max(x) > is.logical(x)
> min(x) > as.numeric(x)
> summary(x) > as.factor(x)

Popular functions
Takes two arguments
Your function name
function()

return() or print() to print the result in console
We ask R to sum a+b
Concatenate string, sep =
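A minimal sketch of writing your own functions. The names add_two and greet are hypothetical, chosen only for this example.

```r
# A function that takes two arguments and returns their sum
add_two <- function(a, b) {
  result <- a + b
  return(result)
}
add_two(2, 3)    # 5

# paste() concatenates strings; sep controls what goes between them
greet <- function(first, last) {
  paste("Hello", first, last, sep = " ")
}
greet("Ada", "Lovelace")    # "Hello Ada Lovelace"
```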
If x > 0, then print positive

If x < 0, then print negative

If x = 0, then print zero
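The three rules above can be sketched with if .. else if .. else. The function name check_sign is hypothetical. Remember that the equality test is written == in code.

```r
check_sign <- function(x) {
  if (x > 0) {
    print("positive")
  } else if (x < 0) {
    print("negative")
  } else {
    print("zero")     # reached only when x == 0
  }
}

check_sign(5)     # prints "positive"
check_sign(-2)    # prints "negative"
check_sign(0)     # prints "zero"
```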


Please write your own function howlong50 that takes one argument
(year of birth) and returns the number of years before you turn 50

50 - (2016 - 1988) = 22
Try using paste() inside your function
and also if .. else !
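One possible answer, as a sketch only. The second argument current_year is an assumption added here so the 2016 example stays reproducible; the exercise itself asks for just the year of birth.

```r
howlong50 <- function(birth_year, current_year = 2016) {
  age <- current_year - birth_year
  years_left <- 50 - age
  if (years_left > 0) {
    paste("You will turn 50 in", years_left, "years")
  } else {
    paste("You are already 50 or older")
  }
}

howlong50(1988)    # "You will turn 50 in 22 years"
```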
Well done guys!!
mutate

R does not recognize mutate() function ..?

Simply type the functions below:
install.packages()
library()

R has 10,000 packages ready for you to download and use.
dplyr is the package name, the most useful one.
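In practice the two steps look like this. It is a sketch that assumes you have an internet connection for the one-time install, which is left commented out here.

```r
# Step 1: download the package (only needed once per computer)
# install.packages("dplyr")

# Step 2: load the package (needed in every new R session)
library(dplyr)

# Now R recognizes mutate()
cars_gc <- mutate(mtcars, GC = gear * carb)
head(cars_gc, 3)
```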
Data Frame

A data frame is a two-dimensional table


data.frame

data.frame() will create a data frame using vectors we specify by column.
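A small sketch of building a data frame column by column. The data frame friends and its contents are made up for illustration.

```r
# Each argument becomes one column; all vectors must have the same length
friends <- data.frame(
  name    = c("Ann", "Bob", "Cat"),
  age     = c(25, 31, 28),
  likes_r = c(TRUE, FALSE, TRUE)
)

friends         # print the table
dim(friends)    # 3 3 (3 rows, 3 columns)
```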
Just a few
clicks :D
function()

See built-in datasets in R by typing data() in console
> str(x)       > mtcars[1:3, ]
> names(x)     > mtcars[ , 1:5]
> summary(x)   > mtcars[1,3]
> head(x)
> tail(x)
> dim(x)

We will learn efficient ways to work with dataframe (dplyr package) in the
next section
$

mtcars$mpg
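The functions above can be tried directly on the built-in mtcars data. A short sketch:

```r
str(mtcars)        # structure: column names, types, first values
head(mtcars)       # first 6 rows
dim(mtcars)        # 32 11 (32 rows, 11 columns)

# [rows, columns]: leave one side empty to keep everything
mtcars[1:3, ]      # rows 1 to 3, all columns
mtcars[, 1:5]      # all rows, columns 1 to 5
mtcars[1, 3]       # row 1, column 3 (disp of the Mazda RX4): 160

# $ pulls out one column as a vector
mtcars$mpg
mean(mtcars$mpg)
```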
Your turn to answer these three questions
using mtcars
which.min

Answer
A good data analyst knows
how to treat his/her data
Data manipulation
dplyr is the most useful package for data manipulation,
created by Hadley Wickham

Object + Verb
Object: data frame
Verbs: select() + rename(), filter(), mutate(), arrange(), summarise()
select
select (dataframe, column1, column2, ...)

Now you can


select the
columns you
want.
select
select (dataframe, column1, column2, ...)

You have
selected column
1:5 and 8
select
select (dataframe, starts_with())

starts_with()
ends_with()
contains()

You have just selected the columns that start with g
rename
rename (dataframe, new name = selected column)
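A sketch of select() and rename() on mtcars, assuming dplyr is installed. The new column name horsepower is just an example.

```r
library(dplyr)

# Select columns by name
picked <- select(mtcars, mpg, hp, wt)

# Select columns by position: columns 1 to 5 and column 8
by_position <- select(mtcars, 1:5, 8)

# Select columns whose names start with "g" (in mtcars that is gear)
starts_g <- select(mtcars, starts_with("g"))

# rename(): new name on the left, existing column on the right
renamed <- rename(mtcars, horsepower = hp)

head(picked, 3)
names(starts_g)    # "gear"
```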
filter
filter (dataframe, criterion)

filter() makes it
very easy to
select the rows
that match your
criteria

filter(mtcars, am == 1)
filters :D
filter (dataframe, criterion1, criterion2)

Explain: we ask R to filter mtcars (dataframe), keeping rows where am == 1 and hp <= 100


%in%
filter (dataframe, criterion %in% c() )

We want to keep
the rows where
cyl is 4 or 6
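The three filter patterns from these slides, sketched on mtcars (assuming dplyr is installed):

```r
library(dplyr)

# One criterion: manual cars only
manual <- filter(mtcars, am == 1)

# Two criteria separated by a comma: both must be TRUE
manual_low_hp <- filter(mtcars, am == 1, hp <= 100)

# %in% keeps rows whose value appears in the given vector
cyl_4_or_6 <- filter(mtcars, cyl %in% c(4, 6))

nrow(manual)        # 13
nrow(cyl_4_or_6)    # 18
```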
mutate
mutate (dataframe, new_column = )

GC = gear * carb

*** mutate = create new column based on existing columns


mutate ifelse

hp > 100, then high


hp <= 100, then low
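A sketch of mutate(), first with simple arithmetic and then with ifelse(). The column name hp_level is hypothetical; dplyr is assumed to be installed.

```r
library(dplyr)

cars2 <- mutate(
  mtcars,
  GC = gear * carb,                           # new column from two existing ones
  hp_level = ifelse(hp > 100, "high", "low")  # label each car by horsepower
)

head(cars2[, c("gear", "carb", "GC", "hp", "hp_level")])
```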
select, filter, mutate
Your manager wants to know hp, wt, gear, carb and
GC (gear * carb) of cars with weight greater than 4.00 (wt is in 1000 lbs).
Can you get this information for your boss?
You can do it in one line of code, but it is not easy to write

Alternatively, you can do it in multiple lines using assignment (to create a new dataframe)

I will show you how to use piping to
chain your code
%>%
(Piping)

We just use mtcars data %>% select the columns we want %>% then
we filter weight higher than 4.00 %>% and mutate new column GC
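The piped version of the manager's request could look like this sketch. The result name heavy is hypothetical; dplyr is assumed to be installed.

```r
library(dplyr)

heavy <- mtcars %>%
  select(hp, wt, gear, carb) %>%   # keep only the columns we need
  filter(wt > 4) %>%               # cars heavier than 4.00
  mutate(GC = gear * carb)         # add the new column

heavy
```

Read %>% as "then": take mtcars, then select, then filter, then mutate.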
arrange
arrange (dataframe, column you want to arrange)

Default from low to high
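A sketch of arrange() in both directions, assuming dplyr is installed. desc() reverses the default order.

```r
library(dplyr)

low_to_high <- arrange(mtcars, mpg)          # default: ascending
high_to_low <- arrange(mtcars, desc(mpg))    # descending

head(low_to_high$mpg)
head(high_to_low$mpg)
```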


summarise
summarise (dataframe, summarized statistics)
summarise
min, average, max
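A sketch of summarise() computing the three statistics for mpg. The output column names are made up; dplyr is assumed to be installed.

```r
library(dplyr)

mpg_stats <- summarise(
  mtcars,
  min_mpg = min(mpg),
  avg_mpg = mean(mpg),
  max_mpg = max(mpg)
)

mpg_stats    # one row: min 10.4, max 33.9, average about 20.09
```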
five verbs
an adverb

group_by
Together with summarise()
can give you much better
insights about your data
Automatic or manual gear car,
which type has higher average weight?
group_by summarise

0 = automatic gear
1 = manual gear
Manual gear car seems to be lighter!
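The weight comparison can be sketched by grouping on am before summarising. The names wt_by_am and avg_wt are hypothetical; dplyr is assumed to be installed.

```r
library(dplyr)

wt_by_am <- mtcars %>%
  group_by(am) %>%                   # one group per gear type (0 and 1)
  summarise(avg_wt = mean(wt))       # one summary row per group

wt_by_am
```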
Quiz

Install new package hflights and load new data set into R
Your boss wants to know the average distance
by carrier, ordered from high to low.

How many carriers are in the data?


Which carrier has the highest average distance?
Solution for the quiz
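One way to approach the quiz, as a sketch rather than the official solution. It assumes the hflights package is installed and uses its UniqueCarrier and Distance columns.

```r
# install.packages("hflights")    # run once if not installed
library(dplyr)
library(hflights)

avg_dist <- hflights %>%
  group_by(UniqueCarrier) %>%
  summarise(avg_distance = mean(Distance, na.rm = TRUE)) %>%
  arrange(desc(avg_distance))     # high to low

nrow(avg_dist)       # how many carriers are in the data
head(avg_dist, 1)    # the carrier with the highest average distance
```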

Well done again!


Show me, not tell me.
Data visualization
Why do you need graphs & visualization
Storytelling
Data exploration (for personal use)
Publication (for public use)
Try barplot() for yourself !
ggvis is amongst the most popular packages for data visualization
created by Hadley Wickham and Winston Chang

graph = data + column + mark + property

data: mtcars
column: wt, mpg
mark: point
property: red
This is what you will be able to create by end of this course.

Better than Excel? :D


~x ~y
data

points = scatter plot


red

Layers:
layer_points
layer_lines
layer_bars
layer_histograms
layer_smooths
layer_densities

This is property, you can customize your plot:
size
fill (color)
opacity (0-1)
stroke (color)
strokeWidth
shape (only for layer_points)
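Putting data, columns, mark and properties together gives a scatter plot like this sketch. It assumes the ggvis package is installed; the property values are arbitrary examples.

```r
library(ggvis)

p <- mtcars %>%
  ggvis(~wt, ~mpg) %>%               # data, then the x and y columns
  layer_points(fill := "red",        # mark = points, with fixed properties
               size := 60,
               opacity := 0.7)

# Type p in the console to display the plot
```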
red
Map vs. Set
= maps a property to a variable in the data
:= sets a property to a fixed value

Understanding = vs. := is crucial for ggvis users
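A side-by-side sketch of the two operators, assuming ggvis is installed. The plot names mapped and fixed are hypothetical.

```r
library(ggvis)

# Map (=): fill depends on a variable, so each cyl level gets its own color
mapped <- mtcars %>%
  ggvis(~wt, ~mpg, fill = ~factor(cyl)) %>%
  layer_points()

# Set (:=): fill is a fixed value, so every point is red
fixed <- mtcars %>%
  ggvis(~wt, ~mpg, fill := "red") %>%
  layer_points()

# Type mapped or fixed in the console to display each plot
```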
layer_points()

circle (default)
square
cross
diamond
triangle-up
triangle-down
A histogram is a very popular
graph for seeing the distribution
of a continuous variable. You
can change the bin width
easily in ggvis
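A sketch of a histogram with a chosen bin width, assuming ggvis is installed. The width of 2 and the fill color are arbitrary examples.

```r
library(ggvis)

hist_plot <- mtcars %>%
  ggvis(~mpg) %>%                          # a histogram needs only an x column
  layer_histograms(width = 2,              # bin width, in mpg units
                   fill := "steelblue")

# Type hist_plot in the console to display the plot
```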
factor() is used to
change class of a variable

You can use color name or color code
to set fill property
The argument se = TRUE adds a standard error band
around the curve and makes your plot look professional

You can plot multiple layer_xxx into same
graph :D !!
group_by

We use table() to check the frequency of qualitative variables

Now there are 3 smooth curves because cyl has 3 levels
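A sketch of one smooth curve per cyl group, assuming both dplyr and ggvis are installed.

```r
library(dplyr)
library(ggvis)

table(mtcars$cyl)    # counts per level: 4 -> 11, 6 -> 7, 8 -> 14

smooth_plot <- mtcars %>%
  ggvis(~wt, ~mpg) %>%
  layer_points() %>%
  group_by(cyl) %>%                          # split into 3 groups
  layer_smooths(stroke = ~factor(cyl))       # one curve per group

# Type smooth_plot in the console to display the plot
```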
group_by() with more than one factor

~interaction( , .)
creates combinations
3 cyl
2 am
So we have 3 x 2 = 6 groups
Try iris dataset, use layer_densities() to plot Sepal.Length grouped by Species
titles

Simple to use ;)
add_legend

Nice one !!
customize

range = c( , ., .) input your colors

hp is numeric, so we use
scale_numeric() to set colors :D

Next slide you'll learn how to use
scale_nominal
scale_nominal

Use your preferred colors !!
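Titles, legend and custom colors can be combined as in this sketch. It assumes ggvis is installed; the titles and colors are arbitrary examples.

```r
library(ggvis)

# Nominal (categorical) fill: one color per cyl level
nominal_plot <- mtcars %>%
  ggvis(~wt, ~mpg, fill = ~factor(cyl)) %>%
  layer_points() %>%
  add_axis("x", title = "Weight (1000 lbs)") %>%
  add_axis("y", title = "Miles per gallon") %>%
  add_legend("fill", title = "Cylinders") %>%
  scale_nominal("fill", range = c("orange", "purple", "darkgreen"))

# Numeric fill: hp is numeric, so colors run along a gradient
numeric_plot <- mtcars %>%
  ggvis(~wt, ~mpg, fill = ~hp) %>%
  layer_points() %>%
  scale_numeric("fill", range = c("yellow", "red"))

# Type nominal_plot or numeric_plot in the console to display each plot
```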


You have already learned two of
the most useful skills in R.
Well done guys !!
The sexiest job of the 21st century
Data scientist
Changing technology
Changing mindset
What the fuck is
data science?
Data Science Key Terminology
that you should know
A computer program is said to learn from experience E with respect to
some class of tasks T and performance measure P if its performance at
tasks in T, as measured by P, improves with experience E
Your problem
Data selection
Data preparation
Model selection
Model training/tuning
Model performance/validation
Supervised: Regression, Classification
Unsupervised: Clustering
CARET
(For supervised learning)
first ML
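A first model could look like the sketch below. It is an assumption about what fits here, not the course's own example: a linear regression predicting mpg from wt and hp, trained with 5-fold cross-validation. It assumes the caret package is installed.

```r
library(caret)

set.seed(42)    # make the cross-validation folds reproducible

fit <- train(
  mpg ~ wt + hp,                  # predict mpg from weight and horsepower
  data = mtcars,
  method = "lm",                  # linear regression (supervised learning)
  trControl = trainControl(method = "cv", number = 5)
)

fit$results    # cross-validated RMSE, R-squared and MAE
```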
https://www.analyticsvidhya.com/
https://www.r-bloggers.com/
http://stackoverflow.com/
Now you have completed R Basics.
Hope you enjoy the class!

and I hope to see you again


DataRockie
Marketing, Neuro, DataScience
www.bitesize.studio
www.FB.com/datarockie
