
mydata <- read.csv("nameofthedatafile.csv")

What we do is just read the data file and assign the data to an R object called 'mydata'.
This object is technically called a data frame. Type mydata to view your data frame and see its
columns.
To access one of the columns of the data frame, just use the $ operator:
mydata$column1

This retrieves column1 from your data. Easy, isn't it?
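If you don't have a data file at hand, here is a small sketch of the steps above that you can run anywhere. The file contents and the names column1 and column2 are made up for illustration; the temporary file just stands in for your real CSV.

```r
# Write a tiny made-up CSV to a temporary file so the example runs anywhere.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("column1,column2", "10,a", "20,b", "30,c"), csv_path)

# Read it into a data frame, exactly as you would with a real file name.
mydata <- read.csv(csv_path)

print(mydata)   # same as typing mydata at the console
str(mydata)     # column names and types at a glance

# Two equivalent ways to pull out one column as a vector.
col_a <- mydata$column1
col_b <- mydata[["column1"]]
```

The [[ ]] form is handy when the column name is stored in a variable; otherwise $ is shorter to type.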


Now you can apply several statistical functions to it. For instance, if you want to know the mean
of column1, type
mean(mydata$column1)

This will work if you don't have NA's in your data. But if there are some NA's, the function
will simply return NA. We can fix that by telling R to ignore the NA's:
mean(mydata$column1, na.rm=TRUE)
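To see the difference for yourself, here is a minimal sketch with a made-up vector that contains one missing value. The vector x and its values are just an illustration.

```r
# A made-up vector with one missing entry.
x <- c(2, 4, NA, 6)

mean_with_na <- mean(x)                    # NA, because one value is missing
mean_without_na <- mean(x, na.rm = TRUE)   # 4, the mean of 2, 4 and 6

# Count how many values are missing before deciding to drop them.
n_missing <- sum(is.na(x))                 # 1
```

Counting the NA's first is a good habit: if most of the column is missing, the mean of what is left may not tell you much.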

That's it. To calculate a standard deviation it's the same procedure, but this time use the sd()
function. As you might reckon, the median is computed with the median() function. All easypeasy.
You can also see a bunch of summary statistics of an object with the summary() function:
summary(mydata)
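As a quick sketch of sd(), median() and summary() on one column, assuming a toy data frame with made-up values (sd() and median() take the same na.rm argument as mean()):

```r
# A toy data frame with one missing value in column1.
mydata <- data.frame(column1 = c(1, 2, 3, 4, NA))

sd_col <- sd(mydata$column1, na.rm = TRUE)           # sample standard deviation of 1, 2, 3, 4
median_col <- median(mydata$column1, na.rm = TRUE)   # 2.5

# summary() also works on a single column and reports the number of NA's.
print(summary(mydata$column1))
```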

If you want to add a constant to each of the values of your column1, say 1500, simply type
1500+mydata$column1

The same applies to all the other arithmetic operations.
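Here is a small sketch of that arithmetic, assuming a toy data frame. Note that the operation alone does not change mydata; if you want to keep the result, assign it to a new column (the names column1_plus and column1_half are made up).

```r
# A toy data frame with made-up values.
mydata <- data.frame(column1 = c(10, 20, 30))

# The constant is added to every element of the column.
shifted <- 1500 + mydata$column1               # 1510 1520 1530

# Store results as new columns to keep them in the data frame.
mydata$column1_plus <- mydata$column1 + 1500
mydata$column1_half <- mydata$column1 / 2      # 5 10 15
```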


Sorting a data frame
Ah, this is kind of a nightmare for many beginning users, but there's no need to worry, actually.
Let's see.
The key function here is order(). With order() you can define which column will be the ordering
index, and tell R to sort in ascending or descending order. The syntax is straightforward:
mysorteddata <- mydata[order(mydata$column1),]

If you want it sorted in descending order, just add a minus sign before the column name inside
order(), i.e. order(-mydata$column1). This trick works for numeric columns.
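A short sketch of descending sorts on a toy data frame (values and the label column are made up). Besides the minus sign, order() also has a decreasing argument, which works for text columns as well.

```r
# A toy data frame with made-up values.
mydata <- data.frame(column1 = c(3, 1, 2), label = c("c", "a", "b"))

# Descending sort with the minus sign (numeric columns only).
desc_minus <- mydata[order(-mydata$column1), ]

# Same result using the decreasing argument of order().
desc_flag <- mydata[order(mydata$column1, decreasing = TRUE), ]

print(desc_minus)
```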

Now type mysorteddata and see the new data frame: it's now sorted by column1! Cool.
Now, let's say you want to select only the 15 first rows (those with lower column1 values). Use
the head() function:
low <- head(mysorteddata, 15)

And if you want to select only the 15 last rows (those with higher column1 values), use the tail()
function instead:
high <- tail(mysorteddata, 15)

Then you can apply statistical functions as normal on the new 'low' and 'high' data frames.
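Putting the sorting and selecting steps together, here is a minimal sketch on a toy data frame. It takes 2 rows instead of 15 only because the made-up data has 6 rows; the id column is also made up.

```r
# A toy data frame with made-up values.
mydata <- data.frame(column1 = c(5, 1, 4, 2, 3, 6),
                     id = c("a", "b", "c", "d", "e", "f"))

# Sort in ascending order of column1.
mysorteddata <- mydata[order(mydata$column1), ]

low <- head(mysorteddata, 2)    # the 2 rows with the lowest column1
high <- tail(mysorteddata, 2)   # the 2 rows with the highest column1

mean_low <- mean(low$column1)     # 1.5
mean_high <- mean(high$column1)   # 5.5
```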


Linear regression
Here our buddy is the lm() function. A linear regression comes like this: y = α + βx. In R it
would be like:
lm(y ~ x, somedata)

Here y and x are columns of a data frame called 'somedata'. To see the results of the regression,
use our old friend summary():
summary(lm(y ~ x, somedata))

Is this really this easy? Yes.
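If you want to try it without real data, here is a small sketch with made-up numbers. Storing the fitted model in a variable (called fit here, a name chosen just for this example) lets you pull out the pieces you need with coef() instead of reading them off the printed summary.

```r
# Made-up data that lies roughly on a straight line.
somedata <- data.frame(x = c(1, 2, 3, 4, 5),
                       y = c(3.1, 4.9, 7.2, 8.8, 11.0))

# Fit the regression and keep the result.
fit <- lm(y ~ x, data = somedata)

print(summary(fit))   # full table: estimates, standard errors, R-squared

# Intercept and slope as a named vector: "(Intercept)" and "x".
coefs <- coef(fit)
r_squared <- summary(fit)$r.squared
```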


And if you want to plot your new regression, use the plot() function:
plot(x=somedata$x, y=somedata$y)

RStudio will show the plot right away. To add the regression line, use the abline() function (type
it immediately after the above command, on a new line):
plot(x=somedata$x, y=somedata$y)
abline(lm(y ~ x, somedata))
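The same plot can be dressed up a little. This is only a sketch with made-up data; the axis labels, title and line colour are arbitrary choices, not something the homework asks for.

```r
# Made-up data that lies roughly on a straight line.
somedata <- data.frame(x = c(1, 2, 3, 4, 5),
                       y = c(3.1, 4.9, 7.2, 8.8, 11.0))

# Fit once and reuse the model for the line.
fit <- lm(y ~ x, data = somedata)

# Scatter plot with axis labels and a title.
plot(x = somedata$x, y = somedata$y,
     xlab = "x", ylab = "y", main = "y against x with fitted line")

# Draw the fitted regression line on top of the existing plot.
abline(fit, col = "red", lwd = 2)
```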

And there you have it, guys! Here's all you need to get through this week's homework. If you
want to know more about R and statistics, check out my blog mathsuser.blogspot.com. It's full of
loads of cool stuff.
Last tip: I had trouble with the illiteracy rate questions. Just remember that the
illiteracy rate is the same as 100 minus the literacy rate (both in percent).
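In R that conversion is a one-liner. The column name literacy and the values below are hypothetical; use whatever the column is called in your own data file.

```r
# Hypothetical literacy rates, in percent.
mydata <- data.frame(literacy = c(99.0, 85.5, 60.0))

# Illiteracy rate is 100 minus the literacy rate.
mydata$illiteracy <- 100 - mydata$literacy

print(mydata)
```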
Good luck!