
Introduction to R and RStudio

Edwin Ardiansyah

R is an integrated suite of software facilities for data manipulation,
calculation and graphical display. It includes

an effective data handling and storage facility,
a suite of operators for calculations on arrays, in particular matrices,
a large, coherent, integrated collection of intermediate tools for data analysis,
graphical facilities for data analysis and display either on-screen or on hardcopy, and
a well-developed, simple and effective programming language which includes conditionals, loops, user-defined recursive functions and input and output facilities.

RStudio is a powerful and productive user interface for R. It is free and open source.

BASIC FUNCTION

Calculator
R can be used as a calculator to evaluate arithmetic expressions.

Examples of operators: + - * / ^
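
As a minimal sketch of the calculator use described above, the lines below can be typed one at a time at the R console. The numbers are arbitrary and chosen only for illustration.

```r
# Basic arithmetic typed at the console; each result is noted in the comment
2 + 3        # addition: 5
10 - 4       # subtraction: 6
6 * 7        # multiplication: 42
20 / 8       # division: 2.5
2 ^ 5        # exponentiation: 32
(1 + 2) * 3  # parentheses control the order of operations: 9
```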


Vector and Assignment
In R, an ordered collection of numbers is called a vector.

We can use c( ) to create a vector.

Use the arrow sign <- to assign a vector or other data to an object.
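
A short sketch of creating and assigning a vector might look like this. The object name ages and its values are made up for illustration.

```r
# Create a numeric vector with c() and assign it to an object with <-
# (the object name `ages` and its values are hypothetical)
ages <- c(23, 35, 41, 29, 52)

ages       # print the whole vector
ages[2]    # second element: 35
ages * 2   # arithmetic is applied element by element
```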

Functions

R runs functions in the form fx( ), where fx is the function name and the arguments go inside the parentheses.

Examples of math functions: exp( ), sqrt( ), log( ), sin( ), cos( ), tan( ), abs( )

Examples of simple summary functions: mean( ), sd( ), max( ), min( ), length( ), sum( )
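
The functions listed above can be tried on a small vector. This is only a sketch; the vector x and its values are invented for the example.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # made-up example values

sqrt(16)     # 4
abs(-3.5)    # 3.5
exp(0)       # 1
log(1)       # natural logarithm: 0

mean(x)      # 5
sd(x)        # sample standard deviation
max(x)       # 9
min(x)       # 2
length(x)    # 8
sum(x)       # 40
```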


GETTING HELP

There are a number of ways to get help in R, and there is also a wide variety of online information.
Most installations of R come with a reasonably detailed help file called "An Introduction to R", but
this can be rather technical for first-time users of a statistics package.

A few simple functions, such as help( ) and the ? shortcut, can be used to open the help files.
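
A few base R calls that open help pages are sketched below. The topics looked up (mean, sd, "median") are arbitrary choices for the example.

```r
help(mean)              # open the help page for mean()
?sd                     # shorthand for help(sd)
help.search("median")   # search the installed help pages for a topic
example(mean)           # run the examples from the help page of mean()
```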

DATA MANAGEMENT
Importing dataset
To import a dataset in csv format, use the function read.csv( ).
In RStudio, you can also click Import Dataset in the Environment pane.
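
A minimal sketch of read.csv( ) is shown below. So that it runs anywhere, it first writes a tiny temporary file. The file, its values and the object name cdc are hypothetical; only the column names follow the survey variables used in this tutorial.

```r
# Write a tiny CSV to a temporary file so the example is self-contained.
# The file contents are made up; in practice you would pass the path
# of your own csv file to read.csv().
csv_path <- tempfile(fileext = ".csv")
writeLines(c("genhlth,age,gender",
             "good,77,m",
             "excellent,33,f",
             "fair,49,f"),
           csv_path)

# read.csv() returns a data frame; the first line is read as column names
cdc <- read.csv(csv_path)

head(cdc)   # show the first rows
str(cdc)    # show the column names and types
```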

Each one of these variables corresponds to a question that was asked in a survey. For example, for
genhlth, respondents were asked to evaluate their general health, responding either excellent, very
good, good, fair or poor. The exerany variable indicates whether the respondent exercised in the
past month (1) or did not (0). Likewise, hlthplan indicates whether the respondent had some form of
health coverage (1) or did not (0). The smoke100 variable indicates whether the respondent had
smoked at least 100 cigarettes in their lifetime. The other variables record the respondent's height in
inches, weight in pounds, desired weight (wtdesire), age in years, and gender.

Suppose we want to sort our data by age and store the result in a new object called cdcs. The
following is the code and a display of the result.
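
One way to do this sort in base R is with order( ). The sketch below uses a small made-up data frame named cdc as a stand-in for the survey data; the original code display is not available, so this is an assumption about the approach.

```r
# Hypothetical stand-in for the survey data (values are made up)
cdc <- data.frame(
  age    = c(77, 33, 49, 42, 55),
  gender = c("m", "f", "f", "f", "m"),
  weight = c(175, 125, 105, 132, 150)
)

# order() returns the row positions that put age in ascending order;
# indexing the rows with it gives the sorted data frame
cdcs <- cdc[order(cdc$age), ]

cdcs
```

Adding decreasing = TRUE inside order( ) would sort from oldest to youngest instead.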

For a continuous variable, we can get a summary that includes the minimum and maximum values, median,
mean, first and third quartiles, and the number of missing values if any, by calling the function summary( ).
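
As a small illustration, summary( ) can be called on a numeric vector. The weights below are made up and include one missing value so that the NA count appears in the output.

```r
# Made-up weights in pounds, with one missing value
weight <- c(175, 125, 105, 132, NA, 150)

# Reports Min., 1st Qu., Median, Mean, 3rd Qu., Max. and the number of NA's
s <- summary(weight)
s
```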

Compute and Recoding Variables


Suppose we want to compute a new variable BMI from the weight and height variables with the
formula BMI = weight / height^2. The code will be like the following display (don't forget to convert
weight to kg and height to m).
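
The BMI computation could be sketched as follows, assuming weight is recorded in pounds and height in inches as described earlier. The data frame cdc and its values are a made-up stand-in for the survey data.

```r
# Hypothetical stand-in for the survey data (values are made up)
cdc <- data.frame(
  height = c(70, 64, 60, 66),      # inches
  weight = c(175, 125, 105, 132)   # pounds
)

weight_kg <- cdc$weight * 0.45359237   # 1 pound = 0.45359237 kg
height_m  <- cdc$height * 0.0254       # 1 inch  = 0.0254 m

# Store BMI as a new column in the data frame
cdc$BMI <- weight_kg / height_m^2

round(cdc$BMI, 1)
```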
Now we want to divide the BMI into 3 groups and recode the variable, assigning 1 for BMI < 18.5, 2
for 18.5 ≤ BMI < 25, and 3 for BMI ≥ 25, then store the result in a new variable called wgroup.
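
One possible way to recode is a nested ifelse( ), sketched below on made-up BMI values. It assumes the third group is meant to cover a BMI of 25 and above, so that the three groups leave no gap.

```r
# Made-up BMI values, including the two cut points
BMI <- c(17.2, 18.5, 22.0, 24.9, 25.0, 31.4)

# Nested ifelse(): 1 if BMI < 18.5, 2 if 18.5 <= BMI < 25, 3 otherwise
# (the cut point of 25 for group 3 is an assumption)
wgroup <- ifelse(BMI < 18.5, 1, ifelse(BMI < 25, 2, 3))

wgroup
table(wgroup)   # number of observations in each group
```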

We can see that wgroup is still numeric. We want to change the variable to nominal and give
labels to it. R refers to a nominal variable as a factor.
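
The conversion can be sketched with factor( ). The codes below are made up, and the label names are assumptions chosen for illustration, not necessarily the ones used in the original display.

```r
wgroup <- c(1, 2, 2, 3, 2, 1)   # made-up numeric codes

# levels lists the codes in order; labels gives the text for each code.
# The label names here are hypothetical.
wgroup <- factor(wgroup,
                 levels = c(1, 2, 3),
                 labels = c("underweight", "normal", "overweight"))

class(wgroup)    # "factor"
levels(wgroup)   # the three labels, in order
table(wgroup)    # counts per label
```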
Now the wgroup variable has been changed to nominal and labeled.

Save object
R does not automatically save our modified object to disk. We can save the modified table by first
setting our working directory and then running the function write.csv( ).
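
A minimal sketch of saving a table is given below. It uses setwd( ) with a temporary folder so that it runs anywhere; the folder, the file name cdcs.csv and the table values are all hypothetical.

```r
# Hypothetical modified table (values are made up)
cdcs <- data.frame(age = c(33, 49, 77), wgroup = c(2, 1, 3))

# tempdir() is used so the example runs anywhere;
# in practice this would be your own project folder
old_wd <- setwd(tempdir())

# row.names = FALSE keeps R's row numbers out of the file
write.csv(cdcs, "cdcs.csv", row.names = FALSE)

file.exists("cdcs.csv")   # TRUE once the file has been written
setwd(old_wd)             # restore the previous working directory
```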

GRAPHICS IN R
Histogram
One way to display the distribution of continuous and discrete variables is to construct a histogram.
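
A histogram can be drawn with hist( ). The sketch below uses simulated weights rather than the survey data, and the titles and number of bins are arbitrary choices.

```r
set.seed(1)                                 # reproducible simulated data
weight <- rnorm(200, mean = 170, sd = 30)   # made-up weights in pounds

# breaks suggests the number of bins; main and xlab set the labels
h <- hist(weight,
          breaks = 20,
          main = "Distribution of weight",
          xlab = "Weight (pounds)")
```

hist( ) also returns the bin boundaries and counts, stored here in h, which can be inspected with h$breaks and h$counts.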

Boxplot
Another way to summarize data measured on a continuous or discrete scale is
to construct a boxplot. It is also often used in exploratory data analysis to show the
shape of the distribution, its central value, and variability. It is especially helpful for
indicating whether a distribution is skewed and whether there are any unusual
observations or outliers in the data set.
The dots above the boxplot are outliers. If we want to hide the outliers in the plot, we
can add the argument outline = F (short for FALSE) to better see the distribution.
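
Both versions of the boxplot are sketched below on simulated, right-skewed weights for two groups. The data frame cdc, its values and the choice of plotting weight by gender are assumptions for the example.

```r
set.seed(1)
# Simulated, right-skewed weights for two groups (all values made up)
cdc <- data.frame(
  gender = rep(c("f", "m"), each = 100),
  weight = c(rlnorm(100, meanlog = 5.0, sdlog = 0.25),
             rlnorm(100, meanlog = 5.2, sdlog = 0.25))
)

# Default: points beyond the whiskers are drawn individually
boxplot(weight ~ gender, data = cdc, ylab = "Weight (pounds)")

# outline = FALSE hides those points; the data themselves are unchanged
b <- boxplot(weight ~ gender, data = cdc, outline = FALSE,
             ylab = "Weight (pounds)")
```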

Scatter Plot
A scatterplot is a useful summary of a set of bivariate data (two variables), usually drawn
before working out a linear correlation coefficient or fitting a regression line, as part of
exploratory data analysis. It gives a good visual picture of the relationship between the two
variables, and aids the interpretation of the correlation coefficient or regression model.
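
A scatterplot and the matching correlation coefficient could be produced as sketched here. The heights and weights are simulated, not taken from the survey data.

```r
set.seed(1)
# Simulated heights (inches) and weights (pounds); all values made up
height <- rnorm(100, mean = 67, sd = 4)
weight <- 5 * height - 165 + rnorm(100, sd = 20)

plot(height, weight,
     xlab = "Height (inches)",
     ylab = "Weight (pounds)",
     main = "Weight against height")

r <- cor(height, weight)   # Pearson correlation coefficient
r
```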

Bar chart
A bar graph is composed of discrete bars that represent different categories of data. The height of
each bar is equal to the quantity or frequency within that category. It is useful for displaying
categorical data and is best used to compare values across categories.
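
As a sketch, a bar chart of a categorical variable can be built from a frequency table. The ratings below are made up; the category names follow the genhlth answers described earlier.

```r
# Made-up general health ratings
genhlth <- c("good", "excellent", "fair", "good", "very good",
             "good", "poor", "very good", "excellent", "good")

# Fix the category order so the bars run from best to worst
genhlth <- factor(genhlth,
                  levels = c("excellent", "very good", "good", "fair", "poor"))

counts <- table(genhlth)   # frequency of each category
counts

barplot(counts, xlab = "General health", ylab = "Frequency")
```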

STATISTICAL PROCEDURE

Test of normality
A normality test gives an objective assessment of whether the data follow a normal distribution.
Suppose we want to perform a normality test for weight within each gender; the code is as follows.
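
The text does not name a specific test, so the sketch below assumes the Shapiro-Wilk test, which is available in base R as shapiro.test( ). The data are simulated and the data frame cdc is a made-up stand-in for the survey data.

```r
set.seed(1)
# Simulated weights by gender (all values made up)
cdc <- data.frame(
  gender = rep(c("f", "m"), each = 50),
  weight = c(rnorm(50, mean = 150, sd = 25),
             rnorm(50, mean = 190, sd = 30))
)

# Split weight by gender, then run the Shapiro-Wilk test on each group
results <- lapply(split(cdc$weight, cdc$gender), shapiro.test)

results$f   # W statistic and p-value for females
results$m   # W statistic and p-value for males
```

A small p-value suggests that the group departs from a normal distribution. Note that shapiro.test( ) accepts between 3 and 5000 observations per call.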

Wilcoxon rank sum test
Now let's compare weight in males and females. Let's say the normality assumption for the
weight of the two groups did not hold, even after several attempts at transformation. Hence, a
non-parametric test is performed and the medians are compared instead.
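
A sketch of this comparison with wilcox.test( ) is shown below. The skewed weights are simulated and the data frame cdc is a made-up stand-in for the survey data.

```r
set.seed(1)
# Simulated, right-skewed weights by gender (all values made up)
cdc <- data.frame(
  gender = rep(c("f", "m"), each = 50),
  weight = c(rlnorm(50, meanlog = 5.0, sdlog = 0.25),
             rlnorm(50, meanlog = 5.2, sdlog = 0.25))
)

# Median weight in each group, for reporting alongside the test
tapply(cdc$weight, cdc$gender, median)

# Two-sample Wilcoxon rank sum test using the formula interface
wt <- wilcox.test(weight ~ gender, data = cdc)
wt
```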

Chi-square test
Baseline characteristics (could be demographics, known risk factors or confounders) are often
compared in observational studies (bivariate analysis) before proceeding to multivariable analysis
(e.g. logistic regression). Many of these characteristics are binary, such as gender and smoking status.

Suppose we wish to compare the proportion of people who had smoked at least 100 cigarettes in their lifetime
between males and females.
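
In base R this comparison can be sketched with table( ) and chisq.test( ). The counts below are invented for illustration and do not come from the survey.

```r
# Made-up responses from 200 respondents
cdc <- data.frame(
  gender   = rep(c("f", "m"), each = 100),
  smoke100 = c(rep(c(1, 0), times = c(40, 60)),   # females: 40 yes, 60 no
               rep(c(1, 0), times = c(55, 45)))   # males:   55 yes, 45 no
)

# Cross-tabulate gender against smoking status
tbl <- table(cdc$gender, cdc$smoke100, dnn = c("gender", "smoke100"))
tbl
prop.table(tbl, margin = 1)   # row proportions: share of smokers within each gender

# For a 2 x 2 table, chisq.test() applies Yates' continuity correction by default
ct <- chisq.test(tbl)
ct
```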

Alternatively, we can use the gmodels package.
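
With gmodels, the CrossTable( ) function prints the cross-tabulation and, with chisq = TRUE, the chi-square test in one step. The sketch below assumes the package has already been installed and falls back to a message otherwise; the counts are made up.

```r
# Made-up responses from 200 respondents
cdc <- data.frame(
  gender   = rep(c("f", "m"), each = 100),
  smoke100 = c(rep(c(1, 0), times = c(40, 60)),
               rep(c(1, 0), times = c(55, 45)))
)

# gmodels is an add-on package, so check that it is available first
if (requireNamespace("gmodels", quietly = TRUE)) {
  # chisq = TRUE adds the chi-square test below the table
  gmodels::CrossTable(cdc$gender, cdc$smoke100, chisq = TRUE)
} else {
  message("gmodels is not installed; run install.packages(\"gmodels\") first")
}
```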


EXTRA

Packages provide additional functions beyond the basic functions that come with R.

The swirl package provides interactive R lessons that run inside R or RStudio.

THANK YOU
