1. Initial steps
   1. Read in the data
   2. What variables are in the file?
2. Measures of central tendency
   1. What to use when?
   2. Sum and mean
   3. Median
   4. Mode
   5. Trimmed mean to remove influence of outliers
3. Measures of variability
   1. Range
   2. Quartiles and IQR
   3. Variance and sd
   4. Mean absolute deviation, median absolute deviation
4. Measures of shape
   1. Skewness
   2. Kurtosis
5. Summary of a variable
6. Describing a data frame
   1. Descriptive statistics separately for each group
   2. Summarizing an entire data frame
7. Standard scores (z)
Initial Steps
Read in the data
Use the load() function for .Rdata files,
or read.table() / read.csv() for plain-text files.
> setwd("~/Documents/statistics/probability_and_statistics_with_R/navarro_datasets")
> load("aflsmall.Rdata")
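As a minimal sketch of both reading routes, the snippet below writes a small hypothetical dataset to temporary files and reads it back. The object name `margins` and the file paths are made up for illustration and are not part of the aflsmall.Rdata file.

```r
# Hypothetical stand-in data; not the contents of aflsmall.Rdata
margins <- c(56, 31, 56, 8, 32)

# Save to a temporary .Rdata file, then remove the object from the workspace
rdata_path <- tempfile(fileext = ".Rdata")
save(margins, file = rdata_path)
rm(margins)

# load() restores the saved objects and (invisibly) returns their names,
# which answers "what variables are in the file?"
loaded_names <- load(rdata_path)
print(loaded_names)

# The same values via a CSV file, read back as a data frame
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(margin = margins), csv_path, row.names = FALSE)
margins_df <- read.csv(csv_path)
str(margins_df)
```

Capturing the return value of load() is one way to see which objects a file contained without inspecting the whole workspace.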
What to use when?
Measure   Data type
Mean      interval, ratio
Median    ordinal, interval, ratio
Mode      nominal
Median
For median, first sort:
> sort(x = afl.margins)
[1] 0 0 1 1 1 1 2 2 3 3 3 3 3 3 3 3 4 4 5
[20] 6 7 7 8 8 8 8 8 9 9 9 9 9 9 10 10 10 10 10
...
> median(x = afl.margins)
[1] 30.5
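The outline also lists the sum, the mean and the trimmed mean alongside the median. A small sketch with hypothetical values (not the afl.margins data) shows how a single outlier pulls the mean but leaves the median and the trimmed mean largely untouched:

```r
# Hypothetical values with one large outlier
x <- c(2, 4, 6, 8, 100)

total        <- sum(x)              # 120
plain_mean   <- mean(x)             # 120 / 5 = 24, pulled up by the outlier
middle_value <- median(x)           # 6, the middle of the sorted values
trimmed_mean <- mean(x, trim = 0.2) # drops the lowest and highest 20%,
                                    # leaving the mean of 4, 6, 8 = 6

print(c(sum = total, mean = plain_mean,
        median = middle_value, trimmed = trimmed_mean))
```

With five values, trim = 0.2 removes exactly one observation from each end, which is why the trimmed mean here equals the median.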
Mode
Who has played the most finals?
> print(afl.finalists)
[1] Hawthorn  Melbourne Carlton   Melbourne
[5] Hawthorn  Carlton   Melbourne Carlton
...
> table(afl.finalists)
afl.finalists
        Adelaide         Brisbane          Carlton      Collingwood
              26               25               26               28
        Essendon          Fitzroy        Fremantle          Geelong
              32                0                6               39
        Hawthorn        Melbourne  North Melbourne    Port Adelaide
              27               28               28               17
        Richmond         St Kilda           Sydney       West Coast
               6               24               26               38
Western Bulldogs
              24
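In the frequency table the largest count is Geelong's 39, so Geelong is the mode. Base R has no built-in function for the statistical mode, so one possible sketch is to read it off the table programmatically. The vector `finalists` below is hypothetical and much shorter than afl.finalists.

```r
# Hypothetical factor of team names; not the real afl.finalists data
finalists <- factor(c("Geelong", "Carlton", "Geelong",
                      "Sydney", "Geelong", "Carlton"))

counts <- table(finalists)

# Keep every level that attains the maximum count, so ties are not lost
modal_value <- names(counts)[as.vector(counts) == max(counts)]
modal_freq  <- max(counts)

print(modal_value)
print(modal_freq)
```

Returning all tied levels is a design choice; which.max(counts) would return only the first one.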
Measures of variability
> range(afl.margins)
[1] 0 116
> quantile(x = afl.margins, probs = c(0.25, 0.75)) # gives 25th and 75th percentile
25% 75%
12.75 50.50
> IQR(x = afl.margins) # width of the range spanned by the middle half of the data
[1] 37.75
> var(afl.margins)
[1] 679.8345
> sd(afl.margins)
[1] 26.07364
> mean(abs(afl.margins - mean(afl.margins))) # mean absolute deviation
[1] 21.10124
> mad(afl.margins) # median absolute deviation, scaled by the default constant = 1.4826
[1] 28.9107
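The two absolute-deviation measures are easy to confuse, and mad() rescales its result by default. A short sketch on hypothetical values (not afl.margins) makes the difference visible:

```r
# Hypothetical values with one outlier
x <- c(1, 2, 3, 4, 100)

# Mean absolute deviation: average distance from the mean (22)
# deviations are 21, 20, 19, 18, 78, so the result is 156 / 5 = 31.2
aad <- mean(abs(x - mean(x)))

# Median absolute deviation: median distance from the median (3)
# deviations are 2, 1, 0, 1, 97, so the raw result is 1
mad_raw <- mad(x, constant = 1)

# Default mad() multiplies the raw value by 1.4826 so that it is
# comparable to the standard deviation for normally distributed data
mad_scaled <- mad(x)

print(c(aad = aad, mad_raw = mad_raw, mad_scaled = mad_scaled))
```

Passing constant = 1 gives the unscaled median absolute deviation, which is the quantity the name literally describes.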
Measures of shape
Skewness (measure of asymmetry) and kurtosis:
> library(psych)
> skew(x=afl.margins)
[1] 0.7671555
> kurtosi(x=afl.margins)
[1] 0.02962633
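To show what these two numbers measure, here is a base-R sketch of the plain moment-based definitions on hypothetical data. This is an illustration only: the psych functions may apply a small-sample scaling, so their results need not match these formulas exactly.

```r
# Hypothetical, perfectly symmetric values
x <- c(1, 2, 3, 4, 5)

dev <- x - mean(x)   # deviations from the mean: -2, -1, 0, 1, 2
m2  <- mean(dev^2)   # second central moment: 2
m3  <- mean(dev^3)   # third central moment: 0 for symmetric data
m4  <- mean(dev^4)   # fourth central moment: 6.8

# Skewness: standardized third moment (0 means symmetric)
skewness <- m3 / m2^1.5

# Excess kurtosis: standardized fourth moment minus 3,
# so that a normal distribution scores 0
excess_kurtosis <- m4 / m2^2 - 3

print(c(skewness = skewness, excess_kurtosis = excess_kurtosis))
```

A positive skewness, as in the afl.margins output, indicates a longer right tail.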
Summary of a variable
> summary(object = afl.margins) # argument is numeric
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   0.00   12.75   30.50   35.30   50.50  116.00
> summary(object = afl.finalists) # argument is a factor
        Adelaide         Brisbane          Carlton      Collingwood
              26               25               26               28
        Essendon          Fitzroy        Fremantle          Geelong
              32                0                6               39
        Hawthorn        Melbourne  North Melbourne    Port Adelaide
              27               28               28               17
        Richmond         St Kilda           Sydney       West Coast
               6               24               26               38
Western Bulldogs
              24
> f2 <- as.character(afl.finalists)
> summary(object = f2) # argument is a character vector
   Length     Class      Mode
      400 character character
Describing a dataframe
> library(lsr) # provides who()
> load("clinicaltrial.Rdata")
> who(TRUE)
   -- Name --    -- Class --   -- Size --
   clin.trial    data.frame    18 x 3
    $drug        factor        18
    $therapy     factor        18
    $mood.gain   numeric       18
The by() function splits a data frame by the grouping variable given in INDICES and applies the
function named in FUN separately to each group.
...
                             Mean   :0.7222
                             3rd Qu.:1.3000
                             Max.   :1.7000
--------------------------------------------------------------
clin.trial$therapy: CBT
       drug         therapy    mood.gain
 placebo :3   no.therapy:0   Min.   :0.300
 anxifree:3   CBT       :9   1st Qu.:0.800
 joyzepam:3                  Median :1.100
                             Mean   :1.044
                             3rd Qu.:1.300
                             Max.   :1.800
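A call of the following shape produces per-group output like the above. The sketch uses a small hypothetical data frame (`trial`, with made-up values) rather than clin.trial, and adds aggregate() as one alternative when only a single statistic per group is needed.

```r
# Hypothetical data; not the clin.trial data frame
trial <- data.frame(
  therapy   = factor(c("none", "none", "CBT", "CBT")),
  mood_gain = c(0.5, 0.7, 1.0, 1.4)
)

# Apply summary() separately to the rows of each therapy group
group_summaries <- by(data = trial, INDICES = trial$therapy, FUN = summary)
print(group_summaries)

# One number per group: mean mood gain by therapy
group_means <- aggregate(mood_gain ~ therapy, data = trial, FUN = mean)
print(group_means)
```

by() is convenient for a quick look at every column, while aggregate() returns a tidy data frame that is easier to reuse in later steps.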
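Standard scores (z)
Interpretation:
z = 0.8930: the individual's score is 0.89 standard deviations above the mean.
pnorm value: pnorm(0.8930) is about 0.814. If the raw score of 10 were a measure of laziness, this
individual would be lazier than about 81.4% of the people sampled, assuming the scores are roughly
normally distributed.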
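The two steps behind this interpretation can be sketched as follows. The `scores` vector is hypothetical, since the notes do not give the underlying data; only the value z = 0.893 is taken from the notes.

```r
# Hypothetical raw scores; the original data are not given in the notes
scores <- c(4, 6, 8, 10, 12)

# Standard score: distance from the mean in units of standard deviation
z <- (scores - mean(scores)) / sd(scores)

# scale() performs the same centring and scaling in one call
z_scaled <- as.vector(scale(scores))

# Convert the z value from the notes into a proportion, assuming normality
prop_below <- pnorm(0.893)
print(round(prop_below, 3))
```

The percentile reading of pnorm() is only as good as the normality assumption; for clearly skewed data, an empirical proportion such as mean(scores < 10) may be a safer description.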