1 Descriptive Stats R

Descriptive Statistics with R
1. Initial steps
   1. Read in the data
   2. What variables are in the file?
2. Measures of central tendency
   1. What to use when?
   2. Sum and mean
   3. Median
   4. Mode
   5. Trimmed mean to remove influence of outliers
3. Measures of variability
   1. Range
   2. Quartiles and IQR
   3. Variance and sd
   4. Mean absolute deviation, median absolute deviation
4. Measures of shape
   1. Skewness
   2. Kurtosis
5. Summary of a variable
6. Describing a data frame
   1. Descriptive statistics separately for each group
   2. Summarizing an entire dataframe
7. Standard scores (z)
Initial Steps
Read in the data
Use the load() function for .Rdata files,
or read.table() / read.csv() for plain-text files.
> setwd("~/Documents/statistics/probability_and_statistics_with_R/navarro_datasets")
> load("aflsmall.Rdata")
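The notes name read.csv() as an alternative but only show load(). A minimal sketch of both routes follows. The data frame and file paths are hypothetical; temporary files are used so the example runs without the Navarro data sets.

```r
# Hypothetical data: a tiny data frame standing in for a real data set.
scores <- data.frame(id = 1:3, margin = c(56, 31, 12))

# Text route: write a CSV file, then read it back with read.csv().
csv_path <- tempfile(fileext = ".csv")
write.csv(scores, file = csv_path, row.names = FALSE)
from_csv <- read.csv(csv_path)

# Binary route: save() stores named objects; load() restores them by name.
rdata_path <- tempfile(fileext = ".Rdata")
save(scores, file = rdata_path)
rm(scores)
loaded_names <- load(rdata_path)   # returns the names of the restored objects

print(from_csv)
print(loaded_names)
```

Note that load() puts objects back under their original names, while read.csv() returns a data frame that you assign to a name of your choice.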
What variables are in the file?

Two ways:
use head()
load lsr package and use who() function
> library(lsr)
> who()
   -- Name --      -- Class --   -- Size --
   afl.finalists   factor        400
   afl.margins     numeric       176
   x               integer       1
Measures of central tendency

What to use when
Measure   Data type
Mean      Ratio, Interval
Median    Ordinal (usually), also Ratio, Interval
Mode      Nominal (usually), also Ordinal, Ratio, Interval
Sum and mean

> sum(afl.margins)
[1] 6213
> sum(afl.margins[1:5])       # sum of a subset of data
[1] 183
> sum(afl.margins[1:5]) / 5   # mean of a subset of data
[1] 36.6
> mean(x = afl.margins)       # x is the argument passed to mean()
[1] 35.30114
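The relationship between sum() and mean() used above can be sketched on a small vector. The values below are illustrative, not the full afl.margins data.

```r
# Illustrative margins (an assumption, not the full afl.margins vector).
margins <- c(56, 31, 56, 8, 32)

total <- sum(margins)               # add up all values
n <- length(margins)                # number of observations
mean_by_hand <- total / n           # definition of the arithmetic mean

print(mean_by_hand)
print(mean(margins))                # same value from the built-in function
```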
Median
Usage:
ordinal data
ratio data
interval data
For median, first sort:
> sort(x = afl.margins)
[1] 0 0 1 1 1 1 2 2 3 3 3 3 3 3 3 3 4 4 5
[20] 6 7 7 8 8 8 8 8 9 9 9 9 9 9 10 10 10 10 10
...
> median(x = afl.margins)
[1] 30.5
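With an even number of observations, as in afl.margins (176 values), the median is the average of the two middle sorted values. A minimal sketch of that rule follows. The helper median_by_hand() and the example vectors are hypothetical.

```r
# Median by hand: sort, then take the middle value (odd n)
# or the average of the two middle values (even n).
median_by_hand <- function(v) {
  s <- sort(v)
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]
  } else {
    (s[n / 2] + s[n / 2 + 1]) / 2
  }
}

odd_example  <- c(8, 1, 5)          # hypothetical values
even_example <- c(8, 1, 5, 30)      # hypothetical values

print(median_by_hand(odd_example))   # middle of 1, 5, 8
print(median_by_hand(even_example))  # average of 5 and 8
```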
Mode
Who has played the most finals?
> print(afl.finalists)
[1] Hawthorn  Melbourne Carlton   Melbourne
[5] Hawthorn  Carlton   Melbourne Carlton
...
Get a frequency table:
> table(afl.finalists)
afl.finalists
        Adelaide         Brisbane          Carlton      Collingwood
              26               25               26               28
        Essendon          Fitzroy        Fremantle          Geelong
              32                0                6               39
        Hawthorn        Melbourne  North Melbourne    Port Adelaide
              27               28               28               17
        Richmond         St Kilda           Sydney       West Coast
               6               24               26               38
Western Bulldogs
              24
Find the mode.

> modeOf(x = afl.finalists)
[1] "Geelong"
> maxFreq(x=afl.finalists)
[1] 39
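modeOf() and maxFreq() come from the lsr package. If that package is not available, the same two numbers can be read off a frequency table in base R. A sketch on a hypothetical factor (not the afl.finalists data):

```r
# Hypothetical factor of team names, not the afl.finalists data.
teams <- factor(c("Geelong", "Carlton", "Geelong", "Sydney", "Geelong", "Carlton"))

freq <- table(teams)                        # frequency of each level
mode_value <- names(freq)[which.max(freq)]  # level with the highest count
mode_count <- max(freq)                     # how often it occurs

print(mode_value)
print(mode_count)
```

which.max() returns only the first maximum, so this sketch reports a single level even when several levels are tied.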
Trimmed mean to remove influence of outliers

> dataset <- c(-15,2,3,4,5,6,7,8,9,12)
> mean(x=dataset)
[1] 4.1
> median(x=dataset)
[1] 5.5
> mean(x=dataset, trim=0.1)   # trim by 10% - one value on either side
[1] 5.5
# for this dataset the trimmed mean happens to equal the median
For afl.margins dataset:

> mean(x=afl.margins, trim=0.05)
[1] 33.75
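What trim does can be sketched by hand: sort the data, drop the same number of values from each end, and average what remains. This uses the ten-value dataset from above; the variable names are illustrative.

```r
dataset <- c(-15, 2, 3, 4, 5, 6, 7, 8, 9, 12)   # same values as above

trim <- 0.1
s <- sort(dataset)
k <- floor(length(s) * trim)            # observations dropped from each end
kept <- s[(k + 1):(length(s) - k)]      # middle portion that remains

print(kept)
print(mean(kept))
print(mean(dataset, trim = trim))       # built-in equivalent
```

Here k is 1, so the outlier -15 and the largest value 12 are dropped before averaging.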
Measures of variability
> range(afl.margins)
[1] 0 116
> quantile(x = afl.margins, probs = c(0.25, 0.75)) # gives 25th and 75th percentile
25% 75%
12.75 50.50
> IQR(x = afl.margins)   # width of the range spanned by the middle half of the data
[1] 37.75
> var(afl.margins)
[1] 679.8345
> sd(afl.margins)
[1] 26.07364
> mean(abs(afl.margins - mean(afl.margins)))   # mean absolute deviation
[1] 21.10124
> mad(afl.margins)   # median absolute deviation (scaled by 1.4826 by default)
[1] 28.9107
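The definitions behind var(), sd(), IQR() and mad() can be checked by hand. The sketch below uses a small hypothetical vector. One point worth noting: mad() multiplies the raw median absolute deviation by a constant of 1.4826 by default, so the 28.9107 above would correspond to a raw value of 19.5.

```r
v <- c(2, 4, 4, 5, 9, 12)               # hypothetical values

# Sample variance: squared deviations from the mean, divided by n - 1.
var_by_hand <- sum((v - mean(v))^2) / (length(v) - 1)
sd_by_hand <- sqrt(var_by_hand)

# IQR is the distance between the 75th and 25th percentiles.
q <- quantile(v, probs = c(0.25, 0.75))
iqr_by_hand <- unname(q[2] - q[1])

# Median absolute deviation: raw, and with R's default scaling constant.
mad_raw <- median(abs(v - median(v)))
mad_scaled <- 1.4826 * mad_raw

print(c(var_by_hand, var(v)))
print(c(sd_by_hand, sd(v)))
print(c(iqr_by_hand, IQR(v)))
print(c(mad_raw, mad(v, constant = 1)))
print(c(mad_scaled, mad(v)))
```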
Measures of shape
Skewness (measure of asymmetry) and kurtosis:
> library(psych)
> skew(x=afl.margins)      # the data are quite skewed
[1] 0.7671555
> kurtosi(x=afl.margins)   # note the spelling!
[1] 0.02962633
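To see what these numbers measure, here is a sketch of one common moment-based definition of skewness and excess kurtosis in base R. This is an illustration on hypothetical data; psych's skew() and kurtosi() offer several variants through their type argument, so their results need not match this sketch exactly.

```r
v <- c(1, 2, 2, 3, 3, 3, 4, 4, 10)      # hypothetical, with a long right tail

n <- length(v)
d <- v - mean(v)
m2 <- sum(d^2) / n                      # second central moment
m3 <- sum(d^3) / n                      # third central moment
m4 <- sum(d^4) / n                      # fourth central moment

skewness <- m3 / m2^(3 / 2)             # > 0 means a longer right tail
excess_kurtosis <- m4 / m2^2 - 3        # 0 for a normal distribution

print(skewness)
print(excess_kurtosis)
```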
Summary of a variable
> summary(object = afl.margins)   # argument is numeric
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   0.00   12.75   30.50   35.30   50.50  116.00
> summary(object = afl.finalists)   # argument is a factor
        Adelaide         Brisbane          Carlton      Collingwood
              26               25               26               28
        Essendon          Fitzroy        Fremantle          Geelong
              32                0                6               39
        Hawthorn        Melbourne  North Melbourne    Port Adelaide
              27               28               28               17
        Richmond         St Kilda           Sydney       West Coast
               6               24               26               38
Western Bulldogs
              24
> f2 <- as.character(afl.finalists)   # factor to character vector
> summary(object = f2)
   Length     Class      Mode
      400 character character
> describe(x = afl.margins)
  var   n mean    sd median trimmed   mad min max range skew kurtosis   se
1   1 176 35.3 26.07   30.5   32.82 28.91   0 116   116 0.77     0.03 1.97
For a logical vector:

e.g. how many blowouts were there?
Blowout = a game in which the winning margin exceeds 50 points.
> blowouts <- afl.margins > 50
> blowouts
[1] TRUE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE
[14] FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
> summary(object = blowouts)

   Mode   FALSE    TRUE    NA's
logical     132      44       0
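Because R treats TRUE as 1 and FALSE as 0 in arithmetic, sum() and mean() on a logical vector give the count and the proportion directly. In the output above, 44 of 176 games are blowouts, a proportion of 0.25. A sketch with hypothetical margins:

```r
margins <- c(56, 31, 70, 8, 32, 51, 50, 12)   # hypothetical winning margins

blowouts <- margins > 50          # TRUE where the margin exceeds 50
n_blowouts <- sum(blowouts)       # TRUE counts as 1, FALSE as 0
prop_blowouts <- mean(blowouts)   # proportion of TRUE values

print(n_blowouts)
print(prop_blowouts)
```

A margin of exactly 50 is not counted, since the comparison is strictly greater than.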
Describing a dataframe
> load("clinicaltrial.Rdata")
> who(TRUE)
   -- Name --    -- Class --   -- Size --
   clin.trial    data.frame    18 x 3
    $drug        factor        18
    $therapy     factor        18
    $mood.gain   numeric       18
Descriptive statistics separately for each group

Three functions:
by()
describeBy()
aggregate()
describeBy() has an argument, group, which specifies the grouping variable.
The following gives statistics broken down by therapy type.
> describeBy(x=clin.trial, group=clin.trial$therapy)

group: no.therapy
          var n mean   sd median trimmed  mad min max range  skew kurtosis   se
drug*       1 9 2.00 0.87    2.0    2.00 1.48 1.0 3.0   2.0  0.00    -1.81 0.29
therapy*    2 9 1.00 0.00    1.0    1.00 0.00 1.0 1.0   0.0   NaN      NaN 0.00
mood.gain   3 9 0.72 0.59    0.5    0.72 0.44 0.1 1.7   1.6  0.51    -1.59 0.20
--------------------------------------------------------------
group: CBT
          var n mean   sd median trimmed  mad min max range  skew kurtosis   se
drug*       1 9 2.00 0.87    2.0    2.00 1.48 1.0 3.0   2.0  0.00    -1.81 0.29
therapy*    2 9 2.00 0.00    2.0    2.00 0.00 2.0 2.0   0.0   NaN      NaN 0.00
mood.gain   3 9 1.04 0.45    1.1    1.04 0.44 0.3 1.8   1.5 -0.03    -1.12 0.15
The by() function has argument FUN, which specifies the name of the function you want to apply
separately to each group.
> by(data = clin.trial, INDICES = clin.trial$therapy, FUN = describe) # same as describeBy()

clin.trial$therapy: no.therapy
          var n mean   sd median trimmed  mad min max range  skew kurtosis   se
drug*       1 9 2.00 0.87    2.0    2.00 1.48 1.0 3.0   2.0  0.00    -1.81 0.29
therapy*    2 9 1.00 0.00    1.0    1.00 0.00 1.0 1.0   0.0   NaN      NaN 0.00
mood.gain   3 9 0.72 0.59    0.5    0.72 0.44 0.1 1.7   1.6  0.51    -1.59 0.20
--------------------------------------------------------------
clin.trial$therapy: CBT
          var n mean   sd median trimmed  mad min max range  skew kurtosis   se
drug*       1 9 2.00 0.87    2.0    2.00 1.48 1.0 3.0   2.0  0.00    -1.81 0.29
therapy*    2 9 2.00 0.00    2.0    2.00 0.00 2.0 2.0   0.0   NaN      NaN 0.00
mood.gain   3 9 1.04 0.45    1.1    1.04 0.44 0.3 1.8   1.5 -0.03    -1.12 0.15
> by(data = clin.trial, INDICES = clin.trial$therapy, FUN = summary)
clin.trial$therapy: no.therapy
      drug         therapy    mood.gain
 placebo :3   no.therapy:9   Min.   :0.1000
 anxifree:3   CBT       :0   1st Qu.:0.3000
 joyzepam:3                  Median :0.5000
                             Mean   :0.7222
                             3rd Qu.:1.3000
                             Max.   :1.7000
--------------------------------------------------------------
clin.trial$therapy: CBT
      drug         therapy    mood.gain
 placebo :3   no.therapy:0   Min.   :0.300
 anxifree:3   CBT       :9   1st Qu.:0.800
 joyzepam:3                  Median :1.100
                             Mean   :1.044
                             3rd Qu.:1.300
                             Max.   :1.800
Use aggregate() to group multiple variables.

e.g. Look at average mood gain for all possible combinations of drug and therapy.
> aggregate(mood.gain ~ drug + therapy, data = clin.trial, FUN = mean)   # formula passed unnamed; newer R versions reject formula=
      drug    therapy mood.gain
1  placebo no.therapy  0.300000
2 anxifree no.therapy  0.400000
3 joyzepam no.therapy  1.466667
4  placebo        CBT  0.600000
5 anxifree        CBT  1.033333
6 joyzepam        CBT  1.500000
> aggregate(mood.gain ~ drug + therapy, data = clin.trial, FUN = sd)   # formula passed unnamed; newer R versions reject formula=
      drug    therapy mood.gain
1  placebo no.therapy 0.2000000
2 anxifree no.therapy 0.2000000
3 joyzepam no.therapy 0.2081666
4  placebo        CBT 0.3000000
5 anxifree        CBT 0.2081666
6 joyzepam        CBT 0.2645751
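The grouping idea can be tried without clinicaltrial.Rdata. The sketch below builds a hypothetical data frame with the same column names as clin.trial (the values are invented) and computes cell means two ways: with aggregate(), and with base R's tapply(), which returns a drug-by-therapy table.

```r
# Hypothetical data frame with the same column names as clin.trial;
# the values are invented for illustration.
trial <- data.frame(
  drug = factor(rep(c("placebo", "anxifree", "joyzepam"), times = 4)),
  therapy = factor(rep(c("no.therapy", "CBT"), each = 6)),
  mood.gain = c(0.5, 0.6, 1.4, 0.3, 0.2, 1.7, 0.6, 1.1, 1.3, 0.9, 0.8, 1.8)
)

# Formula passed unnamed, which works across R versions.
cell_means <- aggregate(mood.gain ~ drug + therapy, data = trial, FUN = mean)
print(cell_means)

# tapply() gives the same means as a drug-by-therapy table.
mean_table <- tapply(trial$mood.gain, list(trial$drug, trial$therapy), mean)
print(mean_table)
```

aggregate() returns one row per combination (long format), which is convenient for further processing; the tapply() table is easier to read at a glance.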
Summarizing an entire dataframe

> summary(clin.trial)
      drug         therapy    mood.gain
 placebo :6   no.therapy:9   Min.   :0.1000
 anxifree:6   CBT       :9   1st Qu.:0.4250
 joyzepam:6                  Median :0.8500
                             Mean   :0.8833
                             3rd Qu.:1.3000
                             Max.   :1.8000
> describe(x=clin.trial)   # load psych package first
          var  n mean   sd median trimmed  mad min max range skew kurtosis   se
drug*       1 18 2.00 0.84   2.00    2.00 1.48 1.0 3.0   2.0 0.00    -1.66 0.20
therapy*    2 18 1.50 0.51   1.50    1.50 0.74 1.0 2.0   1.0 0.00    -2.11 0.12
mood.gain   3 18 0.88 0.53   0.85    0.88 0.67 0.1 1.8   1.7 0.13    -1.44 0.13
Standard scores (z)

> x <- c(3,10,8,4,9,11,6)
> mean(x)
[1] 7.285714
> sd(x)
[1] 3.039424
> z <- (10 - mean(x)) / sd(x)   # calculating z score for 10
> z
[1] 0.8930265
To calculate the percentile rank of the z-score, use pnorm():

> pnorm(0.8930625)
[1] 0.8140881
Interpretation:
z = 0.8930: the individual score is 0.89 sd above the mean.
pnorm value: if 10 had been a score for laziness, then that individual is lazier than about 81.4% of
people, assuming the scores are approximately normally distributed.
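The same calculation can be done for every observation at once, and passing the stored z to pnorm() avoids retyping its digits. A sketch using the seven values from above; the variable names are illustrative.

```r
x <- c(3, 10, 8, 4, 9, 11, 6)           # same values as above

z_all <- (x - mean(x)) / sd(x)          # z-score for every observation
z_scaled <- as.numeric(scale(x))        # scale() does the same centring and scaling

z_10 <- z_all[x == 10]                  # z-score for the value 10
pct_below <- pnorm(z_10)                # proportion below z under a normal curve

print(round(z_all, 2))
print(round(z_10, 4))
print(round(pct_below, 3))
```

scale() returns a one-column matrix, hence the as.numeric() wrapper to get a plain vector back.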
Handling missing values

> partial.data <- c(10, 20, NA, 30)
> mean(x = partial.data)
[1] NA
> mean(x = partial.data, na.rm = TRUE)
[1] 20
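Before removing missing values it helps to know how many there are. The sketch below counts them with is.na() and shows that na.rm = TRUE is also accepted by other summary functions. It reuses the partial.data vector from above.

```r
partial.data <- c(10, 20, NA, 30)       # same values as above

n_missing <- sum(is.na(partial.data))   # count the missing values
complete <- partial.data[!is.na(partial.data)]   # drop them explicitly

print(n_missing)
print(mean(complete))
print(median(partial.data, na.rm = TRUE))   # na.rm also works for median(),
print(sd(partial.data, na.rm = TRUE))       # sd(), sum(), var(), and others
```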
