Statistics - Part A

BASIC STATISTICAL METHODS: Module 2
SHC 797: 2011
Prof. Hannes Gräbe Pr Eng
Department of Civil Engineering
Engineering 1 Room 11-15
Tel. (012) 420 4723
hannes.grabe@up.ac.za
Acknowledgements: Mr. J. Van der Westhuizen, Prof. Alex Visser
What is Statistics about?

• Data Collection
• Summarizing Data
• Interpreting Data
• Drawing Conclusions from Data
Context
• Part 1: Data Description

• Part 2: Probability
• Part 3: Inference
Part 1: Data Presentation
Data Description
• Designing experiments
– Does pre-heating reduce the possibility of rail defects in thermit welds?
• Observational studies
– Polls: The popularity of the President
Data Description
• How can we summarise small amounts of data?
• Example: Bridge column support height

Number   Height (m)
1        8.78
2        8.07
3        8.15
4        8.98
5        9.02
6        7.99
7        8.52
8        8.27
9        8.89
Data Description
• First remember that the following is required before we continue:
– Data needs to be a random sample
– How accurately were the measurements made?
– Notice the difference between measurements and counts
• Measurements: More accurate
• Counts: Always whole numbers
Data Description
• Example Table
– Bridge Column Support Height (nine heights, 7.99 m to 9.02 m)
• Most important thing about numbers in this list
– How low/high a typical one is
– How variable the numbers are
Data Description
• Measures of location: Mean

Bridge Mean Height = (8.78 + 8.07 + 8.15 + 8.98 + 9.02 + 7.99 + 8.52 + 8.27 + 8.89) / 9 = 76.67 / 9 ≈ 8.519 m

• Symbol for mean if observations are referred to as x's:

$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$

• Arithmetic mean or Average
Data Description
• Measures of location: Median
• Arrange measurements in order of size. The median is then the middle one:

Number   Height (m)
6        7.99
2        8.07
3        8.15
8        8.27
7        8.52
1        8.78
9        8.89
4        8.98
5        9.02

Bridge Median Height = 8.52 m
• If there is an even number of observations we take the average of the two middle ones
Data Description
• Measures of location: Mode
– The mode is the observation that occurs most frequently. In our previous example there is no mode.
– This is more useful for counted data. For example number of vehicles per hour over a 24-hour period.
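The three measures of location can be checked on the bridge column heights with a short sketch. It assumes Python's standard `statistics` module; the variable names are illustrative and not part of the course material.

```python
import statistics

# Bridge column support heights (m) from the example table
heights = [8.78, 8.07, 8.15, 8.98, 9.02, 7.99, 8.52, 8.27, 8.89]

mean_height = statistics.mean(heights)      # sum of the observations divided by n
median_height = statistics.median(heights)  # middle value of the ordered list
modes = statistics.multimode(heights)       # every value tied for most frequent

# All nine heights are distinct, so every value is "tied" and there is no useful mode
has_mode = len(modes) < len(heights)

print(f"mean = {mean_height:.3f} m")
print(f"median = {median_height:.2f} m")
print(f"mode exists: {has_mode}")
```

For counted data such as vehicles per hour, `statistics.multimode` would return the one or more counts that occur most often.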
Data Description
• Mean:
– Easy to calculate
– Much statistical theory based on it
– Accurate: means of different samples do not vary very much
– Each observation contributes equally to the calculation
– Sensitive to mistakes
• Median:
– Can be calculated with merely graded data
– Easy to calculate for small samples, not moderate or large samples
– All values have to be stored for the calculation
– Not as sensitive to mistakes
– Not disturbed by outliers
Data Description
• Guidelines for mean, median and modes
– In case of no outliers use mean
– In case of outliers:
[Figure: diagram with the labels "Counts" and "Measurements"]
Vehicles per hour – peak hours have a substantial influence
Data Description
• Other measures of location:
– Trimmed mean: discard the upper and lower 5% of the observations before averaging
– Root mean square = square root of the mean of the squared observations
– Geometric mean of n positive numbers = nth root of their product
– Harmonic mean = 1/(mean of the reciprocals of the observations)
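A minimal sketch of these alternative measures, applied to the bridge heights. It takes the root mean square as the square root of the mean of the squares. The helper `trimmed_mean` is a hypothetical name, and rounding the number of discarded observations down is an assumption made here.

```python
import math
import statistics

heights = [8.78, 8.07, 8.15, 8.98, 9.02, 7.99, 8.52, 8.27, 8.89]

def trimmed_mean(values, fraction=0.05):
    """Mean after discarding `fraction` of the observations from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)  # number dropped per end, rounded down
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

arithmetic = statistics.mean(heights)
rms = math.sqrt(sum(x * x for x in heights) / len(heights))
geometric = statistics.geometric_mean(heights)  # nth root of the product
harmonic = statistics.harmonic_mean(heights)    # 1 / mean of the reciprocals

print(f"trimmed   = {trimmed_mean(heights):.3f}")
print(f"harmonic  = {harmonic:.3f}")
print(f"geometric = {geometric:.3f}")
print(f"mean      = {arithmetic:.3f}")
print(f"rms       = {rms:.3f}")
```

With only nine observations, 5% of the sample is less than one observation, so nothing is trimmed and the trimmed mean equals the ordinary mean. For positive data the harmonic, geometric, arithmetic and root-mean-square values always appear in that increasing order.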
Data Description
• Other measures of location: Weighted means
– Each observation is not counted equally, but is “weighted” according to its size:

$\bar{x}_w = \frac{\sum w(x)\,x}{\sum w(x)}$
Examples
• University class size: A third = 9, another third = 60, last third = 300
• What is the mean class size?
Option 1 (each class counted equally): (9 + 60 + 300)/3 = 369/3 = 123

Option 2 (each class weighted by its size): (9×9 + 60×60 + 300×300)/369 = 93681/369 ≈ 254
Examples
• Performance evaluation calculation:
KPA Weight Rating (/5)
1 50% 4
2 30% 3
3 20% 3
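The weighted mean can be sketched in a few lines. The function name `weighted_mean` is illustrative only. It is applied to the KPA table above and to the class-size example.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# Performance evaluation: ratings out of 5, weighted by KPA weight
ratings = [4, 3, 3]
kpa_weights = [0.5, 0.3, 0.2]
overall_rating = weighted_mean(ratings, kpa_weights)

# Class sizes: equal weights give the mean per class,
# weighting each class by its own size gives the mean as seen by a student
class_sizes = [9, 60, 300]
mean_per_class = weighted_mean(class_sizes, [1, 1, 1])
mean_per_student = weighted_mean(class_sizes, class_sizes)

print(f"overall rating   = {overall_rating:.2f}")
print(f"mean per class   = {mean_per_class:.1f}")
print(f"mean per student = {mean_per_student:.1f}")
```

The overall rating works out to 0.5×4 + 0.3×3 + 0.2×3 = 3.5 out of 5.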
Data Description
• Measure of variation: Range
– The range of a set of numbers is the largest minus the smallest
• The range of bridge height is 9.02 – 7.99 = 1.03 m
– Only two observations contribute directly
– It’s very sensitive to unusually big or small observations
– No standardisation to sample size
– But of use when sample size is the same (Industrial QC)
Data Description
• Measure of variation: IQR and SIQR (Inter-quartile Range and Semi-Inter-quartile range = ½ IQR)
– IQR: Distance between quartiles
– Quartiles: Separate ordered data into 4 equal groups

25% | 1st quartile (Lower) | 25% | 2nd quartile (median) | 25% | 3rd quartile (Upper) | 25%
Data Description
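• To determine the lower and upper quartiles we recommend the following positions in the ordered data (n = number of observations):
– Lower quartile position
• 0.25n + 0.5
– Upper quartile position
• 0.75n + 0.5
– If a position is not a whole number, interpolate between the two neighbouring ordered observations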
Data Description
• Example
– Using the bridge data (ordered from 7.99 m to 9.02 m), calculate the LQ, UQ, IQR and SIQR
Data Description
• Example
– n = 9
– LQ position = 0.25(9) + 0.5 = 2.75
– UQ position = 0.75(9) + 0.5 = 7.25

Order   No.   Height (m)
1       6     7.99
2       2     8.07
3       3     8.15
4       8     8.27
5       7     8.52
6       1     8.78
7       9     8.89
8       4     8.98
9       5     9.02

LQ = 8.07 + 0.75(8.15 – 8.07) = 8.13
UQ = 8.89 + 0.25(8.98 – 8.89) = 8.91
IQR = 8.91 – 8.13 = 0.78
SIQR = 0.39
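The quartile rule can be sketched as follows: find the position 0.25n + 0.5 or 0.75n + 0.5 in the ordered data and interpolate linearly when it falls between two observations. The function names are hypothetical, and clamping the position to the range 1 to n is an assumption added for very small samples.

```python
def value_at_position(ordered, position):
    """Value at a 1-based, possibly fractional, position in sorted data."""
    position = min(max(position, 1), len(ordered))  # keep within 1..n
    lower = int(position)                           # 1-based index of the lower neighbour
    fraction = position - lower
    if fraction == 0:
        return ordered[lower - 1]
    below, above = ordered[lower - 1], ordered[lower]
    return below + fraction * (above - below)

def quartiles(values):
    """Lower and upper quartile using positions 0.25n + 0.5 and 0.75n + 0.5."""
    ordered = sorted(values)
    n = len(ordered)
    lq = value_at_position(ordered, 0.25 * n + 0.5)
    uq = value_at_position(ordered, 0.75 * n + 0.5)
    return lq, uq

heights = [8.78, 8.07, 8.15, 8.98, 9.02, 7.99, 8.52, 8.27, 8.89]
lq, uq = quartiles(heights)
iqr = uq - lq
siqr = iqr / 2

print(f"LQ = {lq:.2f}, UQ = {uq:.2f}, IQR = {iqr:.2f}, SIQR = {siqr:.2f}")
```

Other quartile conventions exist, for example the methods offered by `statistics.quantiles`, and they can give slightly different values for small samples.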
Data Description
• Measure of variation: Mean Absolute Deviation (MAD)

$\text{M.A.D.} = \frac{\sum |x - \bar{x}|}{n}$

Observation height (m)   Deviation from mean   Absolute deviation from mean
x                        x − x̄                 |x − x̄|
8.78                     0.26                  0.26
8.07                     -0.45                 0.45
8.15                     -0.37                 0.37
8.98                     0.46                  0.46
9.02                     0.50                  0.50
7.99                     -0.53                 0.53
8.52                     0.00                  0.00
8.27                     -0.25                 0.25
8.89                     0.37                  0.37
Sum                      0.00                  3.19

M.A.D. = 3.19/9 = 0.354 m
Data Description
• Measure of variation: Standard Deviation
– SD is approximately the average of the difference between each value and the mean.

$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$

– SD is what is most commonly used as a measure of variation
– The square of the SD is called Variance: $\text{var} = s^2$
Data Description
• Standard Deviation - Example

$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$

Equivalent computational form:

$s = \sqrt{\frac{\sum x^2 - \frac{1}{n}\left(\sum x\right)^2}{n - 1}}$

Observation height (m)   Deviation from mean   Squared deviation from mean
x                        x − x̄                 (x − x̄)²
8.78                     0.26                  0.0676
8.07                     -0.45                 0.2025
8.15                     -0.37                 0.1369
8.98                     0.46                  0.2116
9.02                     0.50                  0.2500
7.99                     -0.53                 0.2809
8.52                     0.00                  0.0000
8.27                     -0.25                 0.0625
8.89                     0.37                  0.1369
Sum                      0                     1.3489

$s = \sqrt{1.3489/8} = 0.411$ m
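As a cross-check of the hand calculation, the sketch below computes the mean absolute deviation and the standard deviation of the bridge heights, using both the definition and the computational form. It uses the unrounded mean, so the last digit can differ slightly from a table built on rounded deviations. Variable names are illustrative.

```python
import math
import statistics

heights = [8.78, 8.07, 8.15, 8.98, 9.02, 7.99, 8.52, 8.27, 8.89]
n = len(heights)
mean = sum(heights) / n

# Mean absolute deviation: average distance from the mean
mad = sum(abs(x - mean) for x in heights) / n

# Standard deviation from the definition (divisor n - 1)
sum_sq_dev = sum((x - mean) ** 2 for x in heights)
s = math.sqrt(sum_sq_dev / (n - 1))

# Same quantity from the computational form: sum of x^2 minus (sum of x)^2 / n
s_shortcut = math.sqrt((sum(x * x for x in heights) - sum(heights) ** 2 / n) / (n - 1))

# The standard library uses the same n - 1 definition
s_library = statistics.stdev(heights)

print(f"MAD = {mad:.3f} m, s = {s:.3f} m, variance = {s * s:.4f} m^2")
```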
Data Description
• Standard Deviation – Typical Observations

Description                                         Dataset size
Few observations further than 1 SD from the mean    Small dataset (10 observations)
Few observations further                            Medium-sized dataset
Few observations further                            Large dataset

Data Description: Pictorial

• The box-and-whisker plot
[Figure: box-and-whisker plot with the median and whiskers labelled]
– Can be drawn horizontally or vertically
– Outliers: more than 1 IQR above UQ or more than 1 IQR below LQ
– This is however much more complex
– Transformations (taking the square root or the logarithm) can be used to make this plot more useful
– Skewness: $\frac{\text{UQ} - \text{median}}{\text{median} - \text{LQ}}$

• Cumulative frequency plot
– Vertical axis represents the number/percentage/proportion of observations that are less than or equal to the x value on the horizontal axis
Data Description
• Flood Example: Flood size from 1980 – 1989:

50 12 16 20 17 13 61 26 33 38

[Figure: cumulative frequency plot; vertical axis: frequency (0 to 1), horizontal axis: size of flood (0 to 70)]
Data Description
• Flood Example – IQR
[Figure: cumulative frequency plot; vertical axis: frequency, horizontal axis: size of flood, with the five values below marked]
Minimum = 12, LQ = 16, Median = 23, UQ = 38, Maximum = 61
Data Description
• Large Datasets
• See arm length measurements below.
741 817 846 845 846 833 782 767 786 810
765 694 758 754 754 806 775 798 740 809
759 785 795 830 854 830 789 802 720 816
764 783 747 774 763 781 804 727 809 801
796 791 811 833 757 786 806 796 776 803
801 817 831 811 801 802 834 805 829 817
801 796 706 802 774 767 811 767 830 771
759 751 765 811 727 761 808 777 835 787
788 776 754 812 860 765 763 780 777 737
761 791 757 758 795 708 784 725 800 723
* Anthropometric survey of 100 male car drivers’ arm lengths
Data Description
Example 1
• Tabulate this data
• Choosing classes
– Use convenient round numbers
– Specific classes
– Sufficiently many classes – not wide grouping
– Sufficiently few classes – between 5 and 15 classes
– Each observation must go into only one class

Range          Count of observations   Cumulative count
Up to 699      1                       1
700-719        2                       3
720-739        6                       9
740-759        13                      22
760-769        12                      34
770-779        8                       42
780-789        11                      53
790-799        7                       60
800-809        16                      76
810-829        11                      87
830-849        11                      98
850-869        2                       100
870 and over   0                       100
Data Description
• Stem-and-leaf plots
69 4
70 6 6
71
72 0 7 7 5 3
73 7
74 2 0 7
75 8 4 4 9 7 9 1 4 7 8
76 7 5 4 3 9 7 7 5 1 5 3 1
77 5 4 6 4 1 7 6 7
78 2 6 5 9 3 1 6 7 8 0 3
79 8 5 6 1 6 1 5
80 6 9 2 4 9 1 6 3 1 1 2 5 1 2 8 0
81 7 0 6 1 7 1 7 1 1 2
82 9
83 3 0 0 3 1 4 0 5
84 6 5 6
85 4
86 0
Data Description
• Histogram presentation
– Note the change in range size: where a range is double the size of the selected standard range, the plotted frequency is halved. Rather keep range size constant.
[Figure: histogram; vertical axis: proportion of observations per class width, horizontal axis: arm length (mm), 650 to 900]
Data Description
• Cumulative frequency plot
[Figure: cumulative frequency plot; vertical axis: % observations <= x (0 to 100), horizontal axis: arm length (mm), 680 to 880]
Data Description: example

• The maximum flood in a river for a period of 9 years is:
20, 22, 21, 24, 19, 25, 27, 20, 23 m³/s
– Calculate:
• i) Mode
• ii) Median
• iii) The expected value/mean
• iv) IQR and SIQR
• v) Standard Deviation
• vi) Draw the cumulative frequency plot
• Solution
19, 20, 20, 21, 22, 23, 24, 25, 27
• Mode: 20 m³/s
• Median: 22 m³/s
• Mean: $\bar{x} = \frac{\sum x}{n} = \frac{201}{9} = 22.3$ m³/s
• LQ (position 2.75) = 20.00
• UQ (position 7.25) = 24.25
• IQR = 24.25 – 20.00 = 4.25
• SIQR = 4.25/2 = 2.125
• Skewness = (24.25 – 22)/(22 – 20) = 1.125
Observation   x − x̄    (x − x̄)²
19            -3.33    11.11
20            -2.33    5.44
20            -2.33    5.44
21            -1.33    1.78
22            -0.33    0.11
23            0.67     0.44
24            1.67     2.78
25            2.67     7.11
27            4.67     21.78
Sum           0.03     56.01

$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{\frac{56.01}{8}} \approx 2.65$ m³/s
Data Description
• Cumulative frequency plot: Cumulative proportions
[Figure: proportion of observations less than x (0.0 to 1.0) plotted against arm length class (mm), from "Up to 699" to "850-869"]
Data Description
• Computing the mean and standard deviation of grouped data – Arm length example
– Assume that all the observations in a class were in fact at its mid-point
Class | Mid-point x | Frequency f | xf | x²f
Up to 699 690 1 690 476100
700-719 710 2 1420 1008200
720-739 730 6 4380 3197400
740-759 750 13 9750 7312500
760-769 765 12 9180 7022700
770-779 775 8 6200 4805000
780-789 785 11 8635 6778475
790-799 795 7 5565 4424175
800-809 805 16 12880 10368400
810-829 820 11 9020 7396400
830-849 840 11 9240 7761600
850-869 860 2 1720 1479200
Sum 100 78680 62030150
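Data Description
• The mean of grouped data:

$\bar{x} = \frac{1}{n}\sum xf = \frac{1}{100}(78680) = 786.8$ mm

• The standard deviation of grouped data:

$s = \sqrt{\frac{\sum (x - \bar{x})^2 f}{n - 1}} = \sqrt{\frac{\sum x^2 f - \frac{1}{n}\left(\sum xf\right)^2}{n - 1}} = \sqrt{\frac{62030150 - \frac{1}{100}(78680)^2}{99}} = 35.5$ mm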
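The grouped-data calculation can be reproduced with a short sketch, assuming (as the slide does) that every observation sits at its class mid-point. The list names are illustrative. The mid-points and frequencies are copied from the table.

```python
import math

# Class mid-points (mm) and frequencies from the arm length table
midpoints = [690, 710, 730, 750, 765, 775, 785, 795, 805, 820, 840, 860]
frequencies = [1, 2, 6, 13, 12, 8, 11, 7, 16, 11, 11, 2]

n = sum(frequencies)
sum_xf = sum(x * f for x, f in zip(midpoints, frequencies))
sum_x2f = sum(x * x * f for x, f in zip(midpoints, frequencies))

grouped_mean = sum_xf / n
# Computational form: (sum of x^2 f minus (sum of xf)^2 / n) divided by (n - 1)
grouped_sd = math.sqrt((sum_x2f - sum_xf ** 2 / n) / (n - 1))

print(f"n = {n}, sum xf = {sum_xf}, sum x2f = {sum_x2f}")
print(f"mean = {grouped_mean:.1f} mm, SD = {grouped_sd:.1f} mm")
```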
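Data Description
• Computing the median and quartiles of grouped data
– Interpolate on the cumulative frequency plot. The cumulative proportion is 0.42 at 780 mm and 0.53 at 790 mm, so the median (0.50) lies 0.08 of the way into a step of 0.11:

$x = 780 + \frac{0.08}{0.11}(790 - 780) = 787.3$ mm

[Figure: cumulative frequency plot; vertical axis: cumulative proportion (0.00 to 1.00), horizontal axis: arm length (mm), 680 to 880]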
Data Description
• Same can be done to determine quartiles of grouped data
• For 0.25:
– X = 762.5
• For 0.75:
– X = 809.4
• Inter-quartile range
– 809.4 - 762.5 = 46.9
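The interpolation on the cumulative frequency table can be sketched as below. The function `grouped_quantile` is a hypothetical helper. It treats each class as running up to the start of the next one (for example 780 to 790), which matches the boundaries used in the median calculation, and it assumes observations are spread evenly within a class.

```python
# Upper class boundaries (mm) and cumulative counts from the arm length table
upper_bounds = [700, 720, 740, 760, 770, 780, 790, 800, 810, 830, 850, 870]
cumulative = [1, 3, 9, 22, 34, 42, 53, 60, 76, 87, 98, 100]

def grouped_quantile(p, bounds, cum_counts):
    """Value below which a proportion p of the grouped observations fall."""
    target = p * cum_counts[-1]
    if target < cum_counts[0]:
        raise ValueError("quantile falls in the open-ended first class")
    for i in range(1, len(bounds)):
        if cum_counts[i] >= target and cum_counts[i] > cum_counts[i - 1]:
            lo_b, hi_b = bounds[i - 1], bounds[i]
            lo_c, hi_c = cum_counts[i - 1], cum_counts[i]
            # Linear interpolation inside the class that contains the target
            return lo_b + (target - lo_c) / (hi_c - lo_c) * (hi_b - lo_b)
    raise ValueError("p must not exceed 1")

median = grouped_quantile(0.50, upper_bounds, cumulative)
lq = grouped_quantile(0.25, upper_bounds, cumulative)
uq = grouped_quantile(0.75, upper_bounds, cumulative)
p95 = grouped_quantile(0.95, upper_bounds, cumulative)

print(f"median = {median:.1f}, LQ = {lq:.1f}, UQ = {uq:.1f}, IQR = {uq - lq:.1f}")
print(f"95th percentile = {p95:.0f} mm")
```

The same function gives any percentile, so it also covers a 95th percentile read off the cumulative plot.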
Data Description
• Percentile
– Percentile divide the ordered
observations into one hundred equal
groups.
– EXAMPLE:
•A man of 95th percentile height is
one who is taller than 95 percent
of men and shorter than 5
percent.
Data Description
• 95th Percentile Example
[Figure: cumulative frequency plot of arm length (mm); a cumulative proportion of 0.95 corresponds to 845 mm]
Data Description
• Regression and correlation
– Plotting one variable against another
• Each x value has a corresponding y value
• Plot them against each other
• See example below: Sum of average daily temperatures (in thousands of degrees) and total rainfall (in hundreds of mm) for 1924-1929
Avg Daily Temperatures | Rainfall
3.1 3.6
3.0 3.4
3.2 3.9
3.1 4.7
3.2 2.9
3.3 2.2
Data Description
• Scatter plot
[Figure: scatter plot; vertical axis: rainfall (hundreds of mm), horizontal axis: average daily temperatures (thousands of degrees)]
Data Description
• Regression
– With this term, emphasis is on the equation that predicts y from x
• Correlation
– With this term, emphasis is on the strength of the relationship. That is, how useful is knowing x for the purpose of predicting y? We will only be concerned with the strength of the linear relationship.
Data Description
• The “best” straight line
– Best straight line for predicting y.
– Any straight line obeys the equation y = a + bx for some choice of the constants a and b.
– Therefore choosing the best straight line means choosing a and b.
Data Description
• Regression Analysis
[Figure: observed points y1 to y4 plotted against the fitted line $\hat{y} = a + bx$, with the vertical distance between each point and the line marked]
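Data Description
• Regression Analysis
Sum of squared errors:
$E = \sum_{i=1}^{n} (y - \hat{y})^2$
but $\hat{y} = a + bx$
$\therefore E = \sum_{i=1}^{n} (y - a - bx)^2$
For a minimum, $\frac{\partial E}{\partial a} = \frac{\partial E}{\partial b} = 0$
$\frac{\partial E}{\partial a} = 2\sum_{i=1}^{n} (y - a - bx)(-1) = 0$
$\frac{\partial E}{\partial b} = 2\sum_{i=1}^{n} (y - a - bx)(-x) = 0$
Dividing both conditions by −2:
$\sum_{i=1}^{n} (y - a - bx) = 0$ … eq.1
$\sum_{i=1}^{n} (y - a - bx)(x) = 0$ … eq.2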
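Data Description
From eq.1:
$\sum y - na - b\sum x = 0$
but $\sum y = n\bar{y}$ and $\sum x = n\bar{x}$
$\therefore a = \frac{1}{n}(n\bar{y} - bn\bar{x}) = \bar{y} - b\bar{x}$ … eq.3
From eq.2:
$\sum xy - a\sum x - b\sum x^2 = 0$
Substituting eq.3 into eq.2:
$\sum xy - \sum (\bar{y} - b\bar{x})x - b\sum x^2 = 0$
$\sum xy - \sum (\bar{y}x - b\bar{x}x) - b\sum x^2 = 0$
$\sum xy - \sum \bar{y}x + \sum b\bar{x}x - b\sum x^2 = 0$
$\sum xy - \sum \bar{y}x = -\left(\sum b\bar{x}x - b\sum x^2\right)$
$\sum xy - \sum \bar{y}x = b\left(\sum x^2 - \sum \bar{x}x\right)$
$b = \frac{\sum xy - \sum \bar{y}x}{\sum x^2 - \sum \bar{x}x} = \frac{\sum xy - \bar{y}\sum x}{\sum x^2 - \bar{x}\sum x}$
$b = \frac{\sum xy - \frac{1}{n}\sum y \sum x}{\sum x^2 - \frac{1}{n}\left(\sum x\right)^2}$
$a = \bar{y} - b\bar{x}$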

• Sum Squared Errors – Class Example 1
• ŷ = 3 + x
x   y   ŷ   y − ŷ   (y − ŷ)²
1   4   4   0       0
2   2   5   -3      9
3   6   6   0       0
4   8   7   1       1
Sum of squared errors = 10

• ŷ = 1 + 2x
x   y   ŷ   y − ŷ   (y − ŷ)²
1   4   3   1       1
2   2   5   -3      9
3   6   7   -1      1
4   8   9   -1      1
Sum of squared errors = 12
Worse than the previous choice of the straight line

• ŷ = 8 − x
x   y   ŷ   y − ŷ   (y − ŷ)²
1   4   7   -3      9
2   2   6   -4      16
3   6   5   1       1
4   8   4   4       16
Sum of squared errors = 42
Worse than the previous 2 choices of the straight line
• Calculate the slope: b

$b = \frac{\sum xy - \frac{1}{n}\sum y \sum x}{\sum x^2 - \frac{1}{n}\left(\sum x\right)^2}$

x     y     x²    xy
1     4     1     4
2     2     4     4
3     6     9     18
4     8     16    32
Sum:  10    20    30    58   (Σx, Σy, Σx², Σxy)

$b = \frac{58 - \frac{1}{4} \times 10 \times 20}{30 - \frac{1}{4}(10)^2} = \frac{8}{5} = 1.6$

• Sum Squared Errors – Class Example
• Calculate the intercept: a

$a = \bar{y} - b\bar{x}$
$\bar{y} = \frac{20}{4} = 5$
$\bar{x} = \frac{10}{4} = 2.5$
$a = 5 - 1.6 \times 2.5 = 1$
• Sum Squared Errors – Class Example
• ŷ = 1 + 1.6x
x   y   ŷ     y − ŷ   (y − ŷ)²
1   4   2.6   1.4     1.96
2   2   4.2   -2.2    4.84
3   6   5.8   0.2     0.04
4   8   7.4   0.6     0.36
Sum of squared errors = 7.20
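The class example can be reproduced with a small sketch of the least-squares formulas for a and b. The function names `fit_line` and `sum_squared_errors` are illustrative. The code compares the fitted line with the three trial lines.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y = a + b x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    a = sum_y / n - b * sum_x / n  # a = mean(y) - b * mean(x)
    return a, b

def sum_squared_errors(xs, ys, a, b):
    """Sum of (y - (a + b x))^2 over all points."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

xs = [1, 2, 3, 4]
ys = [4, 2, 6, 8]

a, b = fit_line(xs, ys)
best_sse = sum_squared_errors(xs, ys, a, b)

# Trial lines from the class example, as (a, b) pairs
trial_sse = [sum_squared_errors(xs, ys, ta, tb) for ta, tb in [(3, 1), (1, 2), (8, -1)]]

print(f"a = {a:.2f}, b = {b:.2f}, SSE = {best_sse:.2f}")
print("SSE of trial lines:", trial_sse)
```

No other straight line can give a smaller sum of squared errors than the fitted one, which is why each trial line scores worse.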
Data Description
• How do we know we’ve made a good choice?
– Plot residuals (y − ŷ) against x
• Residuals equally distributed: straight line is a good choice
• Pattern seen in the residuals: straight line is not a good choice
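Data Description
• How do we know we’ve made a good choice?
– Calculate correlation coefficient

$r = \frac{\text{covariance}}{\sqrt{(\text{variance of } x)(\text{variance of } y)}} = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$

where
$S_{xx} = \sum x^2 - \frac{1}{n}\left(\sum x\right)^2$
$S_{yy} = \sum y^2 - \frac{1}{n}\left(\sum y\right)^2$
$S_{xy} = \sum xy - \frac{1}{n}\left(\sum x\right)\left(\sum y\right)$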
Data Description
• Correlation – Example
• ŷ = 1 + 1.6x

x     y     x²    y²    xy
1     4     1     16    4
2     2     4     4     4
3     6     9     36    18
4     8     16    64    32
Sum:  10    20    30    120   58   (Σx, Σy, Σx², Σy², Σxy)

Sxx = 30 − (1/4)(10)² = 5
Syy = 120 − (1/4)(20)² = 20
Sxy = 58 − (1/4)(10)(20) = 8

$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{8}{\sqrt{(5)(20)}} = \frac{8}{10} = 0.8$

r = 1: Straight line
r > 0.98: very good correlation
r > 0.96: good correlation
r < 0.9: bad correlation
Data Description
• Temperature Rainfall example

– Class Example
Temperature (x) Rainfall (y)

3.1 3.6
3.0 3.4
3.2 3.9
3.1 4.7
3.2 2.9
3.3 2.2
Data Description

– Class Example
[Figure: scatter plot; vertical axis: rainfall (hundreds of mm), horizontal axis: temperature (thousands of degrees)]
Data Description
– Class Example

Temperature (x)   Rainfall (y)   x²      y²      xy
3.1               3.6            9.61    12.96   11.16
3.0               3.4            9.00    11.56   10.20
3.2               3.9            10.24   15.21   12.48
3.1               4.7            9.61    22.09   14.57
3.2               2.9            10.24   8.41    9.28
3.3               2.2            10.89   4.84    7.26
Sum: 18.9         20.7           59.59   75.07   64.95
Data Description
– Class Example
Sxx = 59.59 – (1/6)(18.9)² = 0.055
Syy = 75.07 – (1/6)(20.7)² = 3.655
Sxy = 64.95 – (1/6)(18.9)(20.7) = -0.255

$b = \frac{S_{xy}}{S_{xx}} = \frac{-0.255}{0.055} = -4.64$
$a = \bar{y} - b\bar{x} = 3.45 - (-4.64)(3.15) = 18.1$
$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{-0.255}{\sqrt{0.055 \times 3.655}} = -0.57$
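The temperature and rainfall calculation can be checked with a sketch built on the Sxx, Syy and Sxy sums. The function `regression_summary` is a hypothetical helper. The unrounded results can differ in the last digit from hand calculations that use rounded intermediate values.

```python
import math

def regression_summary(xs, ys):
    """Return intercept a, slope b and correlation r from Sxx, Syy and Sxy."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs) - sum_x ** 2 / n
    syy = sum(y * y for y in ys) - sum_y ** 2 / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    b = sxy / sxx
    a = sum_y / n - b * sum_x / n
    r = sxy / math.sqrt(sxx * syy)
    return a, b, r

temperature = [3.1, 3.0, 3.2, 3.1, 3.2, 3.3]  # thousands of degrees
rainfall = [3.6, 3.4, 3.9, 4.7, 2.9, 2.2]     # hundreds of mm

a, b, r = regression_summary(temperature, rainfall)
print(f"a = {a:.1f}, b = {b:.2f}, r = {r:.2f}")

# The four-point class example gives a = 1, b = 1.6 and r = 0.8
a4, b4, r4 = regression_summary([1, 2, 3, 4], [4, 2, 6, 8])
print(f"a = {a4:.1f}, b = {b4:.1f}, r = {r4:.1f}")
```

The negative r reflects that rainfall tended to fall as temperature rose. With |r| well below 0.9, the straight line is a poor predictor for this data.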
Part 2: Probability
Probability
• Definition: The probability of an event is the chance that it will occur.
• Notation
– Px[3] or P[x = 3] is the probability that x = 3
– Fx[3] is the cumulative probability that x is less than or equal to 3
• Additional notes may be found on ClickUp: Study material/Applied Statistics_Van As_2008
Probability
• Three ways of refining this idea:
1) A priori approach
– Sometimes, the experimental set-up is so clear, we know the probabilities in advance of collecting any data
Probability
• A priori approach: Examples
– Coin
• P[x = head] = P[x = tail] = 0.5
– Dice
• P[x = 1] = P[x = 2] = P[x = 3] = P[x = 4] = P[x = 5] = P[x = 6] = 1/6
– Cards
• P[x = 4 of hearts] = 1/52
• P[x = Ace] = 4/52 = 1/13
Probability
2) Empirical
– By having enough experimental data
– Examples
• Break 100 concrete cubes. The strength of 30 cubes is more than 50 MPa
• P[x > 50] = 30/100 = 0.3
Probability
3) Subjective assessments
– What is the probability that it will rain on the 27th of July in Pretoria?
– Argument: During winter it does not rain regularly in Pretoria.
• No experimental information
• Intuition:
– P[x = rain] = 0.05
Probability
• We do not need to worry about the philosophy of probability. Probabilities are much easier to use in practical calculations than they are to philosophise about!
Probability
• Probability Scale

Probability   Event
1.0           Dying
0.9           Pass Statistics
0.5           Coin
0.167         Dice
0             Swim through the Atlantic ocean
Probability
• Definitions
– The probability of event A and event B occurring: P[A ∩ B]
– The probability of event A or event B occurring: P[A ∪ B]
– The probability of event A given that B is occurring: P[A | B]
Probability
• Mutually-exclusive events
– This means they cannot occur together.
– Consequently, the probability of one or the other of two mutually-exclusive events occurring is the sum of their individual probabilities.
Probability
• Addition rule for mutually exclusive events: the probability of one or other of two mutually exclusive events occurring is the sum of their individual probabilities:
• Example:
– P[temp increase] = 0.62
– P[temp unchanged] = 0.23
– Then P[temp increase or unchanged] = 0.62 + 0.23 = 0.85
Probability
• Addition rule for mutually exclusive events: the probabilities of all possible exclusive outcomes add up to 1:
• Example:
– P[temp increase] = 0.62
– P[temp unchanged] = 0.23
– Then P[temp decrease] = 1 - 0.62 - 0.23 = 0.15
Probability
• Multiplication rule for independent events: the probability of more than one event happening is the product of their individual probabilities, if they are independent:
• Example:
– P[dice = 2] = 1/6
– P[draw a King from cards] = 1/13
– Then P[dice = 2 and King] = 1/6 × 1/13 = 1/78
Probability
• Example – Mutually-exclusive events
– Number of students on campus is 50 000
– B.Eng = 10 000
– B.Com = 15 000
– Choosing a student
• P[B.Eng] = 10 000/50 000 = 0.2
• P[B.Com] = 15 000/50 000 = 0.3
P[B.Eng ∪ B.Com] = 0.2 + 0.3 = 0.50

P[B.Eng ∩ B.Com] = 0
Probability
• Venn Diagram – Mutually-exclusive
[Figure: Venn diagram with non-overlapping regions B.Eng, B.Com and Other]
P[B.Eng] = 0.2, P[B.Com] = 0.3, P[Other] = 0.5
Sum of P’s = 1
Probability
• Independent probability
• Two experiments are independent when they do not influence each other.
• Probability of both happening is the product of their individual probabilities
• Example
– What sex is a person? Man – A, Woman – B
– Does the person own a vehicle? Yes – C, No – D
– P[A] = 0.45, P[B] = 0.55, P[C] = 0.8, P[D] = 0.2
Probability
• Example
P[A] = 0.45, P[B] = 0.55, P[C] = 0.8, P[D] = 0.2

                   Man (A)   Woman (B)
Vehicle: Yes (C)   0.36      0.44
Vehicle: No (D)    0.09      0.11

P[A ∩ C] = 0.45 × 0.8 = 0.36
P[A ∩ D] = 0.45 × 0.2 = 0.09
P[B ∩ C] = 0.55 × 0.8 = 0.44
P[B ∩ D] = 0.55 × 0.2 = 0.11
Probability
• General addition Rule
[Figure: Venn diagram of two overlapping events A and C]
A is made up of {A but not C} plus {A and C}
C is made up of {C but not A} plus {A and C}
Therefore A or C is {A} plus {C}, minus {A and C}
“A or B” means “A or B or possibly both” (True or False?)
Probability
• Example
P[A] = 0.45, P[B] = 0.55, P[C] = 0.8, P[D] = 0.2
P[A ∩ C] = 0.45 × 0.8 = 0.36
P[A ∩ D] = 0.45 × 0.2 = 0.09
P[B ∩ C] = 0.55 × 0.8 = 0.44
P[B ∩ D] = 0.55 × 0.2 = 0.11
P[A ∪ C] = P[A] + P[C] − P[A ∩ C] = 0.45 + 0.8 − 0.36 = 0.89
P[B ∪ C] = P[B] + P[C] − P[B ∩ C] = 0.55 + 0.8 − 0.44 = 0.91
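A minimal sketch of the two rules used in this example: the multiplication rule for independent events and the general addition rule. The function names are illustrative. The independence of sex and vehicle ownership is the assumption made in the example itself.

```python
def p_both_independent(p_first, p_second):
    """Multiplication rule: valid only when the two events are independent."""
    return p_first * p_second

def p_either(p_first, p_second, p_both):
    """General addition rule: P[A or C] = P[A] + P[C] - P[A and C]."""
    return p_first + p_second - p_both

p_man, p_woman = 0.45, 0.55
p_vehicle = 0.8

p_man_and_vehicle = p_both_independent(p_man, p_vehicle)
p_woman_and_vehicle = p_both_independent(p_woman, p_vehicle)

p_man_or_vehicle = p_either(p_man, p_vehicle, p_man_and_vehicle)
p_woman_or_vehicle = p_either(p_woman, p_vehicle, p_woman_and_vehicle)

print(f"P[A or C] = {p_man_or_vehicle:.2f}")
print(f"P[B or C] = {p_woman_or_vehicle:.2f}")
```

For mutually exclusive events the joint probability is 0, so `p_either` reduces to the simple addition rule.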
Probability
• Same Example – Probability Tree
Sex: Man (P = 0.45)
– Vehicle: Yes (P = 0.8) → 0.36
– Vehicle: No (P = 0.2) → 0.09
Sex: Woman (P = 0.55)
– Vehicle: Yes (P = 0.8) → 0.44
– Vehicle: No (P = 0.2) → 0.11
Probability
• Class Example
• A concrete beam will fail if the concrete is too weak or the load is too high.
• P[Weak] = 0.2
• P[High] = 0.3
• P[Failure] = ?
Probability
• P[Weak] = 0.2
• P[High] = 0.3
• P[Failure] = ?
Assuming that weak concrete and high load are independent:
P[Failure] = P[Weak] + P[High] − P[Weak ∩ High]
P[Failure] = P[Weak] + P[High] − P[Weak]P[High]
P[Failure] = 0.2 + 0.3 − (0.2)(0.3)
P[Failure] = 0.5 − 0.06 = 0.44

                            Concrete weak: Yes (0.2)   Concrete weak: No (0.8)
Load too high: No (0.7)     0.14                       0.56
Load too high: Yes (0.3)    0.06                       0.24

P[Failure] = 0.14 + 0.06 + 0.24 = 0.44
Probability: example
• Same example – Probability Tree
Concrete: Weak (P = 0.2)
– Load: High (P = 0.3) → 0.06
– Load: OK (P = 0.7) → 0.14
Concrete: OK (P = 0.8)
– Load: High (P = 0.3) → 0.24
– Load: OK (P = 0.7) → 0.56
P[Failure] = 0.06 + 0.14 + 0.24 = 0.44
Probability
• General Multiplication Rule
[Figure: Venn diagram of two overlapping events A and C]
P{A and C} = P{A} × P{C | A}
or
P{A and C} = P{C} × P{A | C}
Where the vertical line | means "given that"
P{C | A} is referred to as a conditional probability
The probability of both A and C occurring is the probability that A occurs multiplied by the probability of C occurring conditional upon A occurring.
• Example: 12 people
• 9 – Native born
• 3 – Foreign born
• If we select 2 people, what is the probability that both are foreign born?
• P[F1] = 3/12 = 0.25
• Once F1 has occurred we know that there are 11 remaining of whom 2 are foreign born.
• P[F2|F1] = 2/11 = 0.1818
• P[F1 and F2] = 0.25 × 0.1818 = 0.04545 = 1/22
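The selection example can be sketched with exact fractions, multiplying the conditional probabilities one draw at a time. The function name `p_all_from_group` is hypothetical, and Python's `fractions` module is used so that the result stays exact.

```python
from fractions import Fraction

def p_all_from_group(group_size, total, draws):
    """Probability that every one of `draws` selections (without replacement)
    comes from a group of `group_size` within `total` people."""
    probability = Fraction(1)
    for already_drawn in range(draws):
        # Conditional probability of the next draw, given the earlier ones
        probability *= Fraction(group_size - already_drawn, total - already_drawn)
    return probability

p_two_foreign = p_all_from_group(group_size=3, total=12, draws=2)
print(p_two_foreign, float(p_two_foreign))
```

Each loop iteration is one application of the general multiplication rule, so the same helper extends to three or more selections.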
– 200 Students
– 77 Accounting (A)
– 64 Law (L)
– 92 Study neither
Complete the Venn diagram by writing down the correct student numbers
[Figure: Venn diagram of A and L inside a total of 200; only A = 44, both = 33, only L = 31, neither = 92]
– Other 3 numbers to total 200 – 92 = 108
– Only Accounting = 108 – 64 = 44
– Only Law = 108 – 77 = 31
– Both = 77 – 44 or 64 – 31 = 33
P[A & L] = 33/200 = 0.165
P[A] × P[L | A] = 77/200 × 33/77 = 0.165
P[Only A | A] = 44/77 = 4/7
Probability: class example
A low-water bridge is designed to allow for a flood
occurring once every 10 years. Damage occurs
during each flood. The bridge is also located in an
active seismic region and the probability of a
destructive earthquake occurring in a year is 30%.
Determine the probability of damage during any
given year assuming that floods and earthquakes are
statistically independent.
• Solution to example
• P[Flood] = 0.1
• P[No flood] = 0.9
• P[Quake] = 0.3
• P[No quake] = 0.7
P[Quake ∩ Flood ] = 0.3 × 0.1 = 0.03

P[Quake ∪ Flood ] = P[Quake] + P[ Flood ] − P[Quake ∩ Flood ]
P[Quake ∪ Flood ] = 0.3 + 0.1 − 0.03 = 0.37
Probability
• Class example
Vehicles are classified into light (<3 ton), medium (3-10 ton) and heavy (>10 ton). Vehicle counts show that 60% are light, 30% medium and 10% heavy vehicles. Vehicles are weighed regularly. It is known that the probability that a vehicle is light and overloaded is 0.12. Calculate the probability that the next vehicle is heavy or overloaded. Assume statistical independence.
Probability
• Class example solution
P[light] = 0.6
P[light and overloaded] = 0.12
P[light and overloaded] = P[light] × P[overloaded]
0.12 = 0.6 × P[overloaded]
P[overloaded] = 0.12/0.6 = 0.2
Probability
• Example (continued)
Vehicle: Light (0.6)
– Overloaded: Yes (0.2) → 0.12
– Overloaded: No (0.8) → 0.48
Vehicle: Medium (0.3)
– Overloaded: Yes (0.2) → 0.06
– Overloaded: No (0.8) → 0.24
Vehicle: Heavy (0.1)
– Overloaded: Yes (0.2) → 0.02
– Overloaded: No (0.8) → 0.08
P[Overloaded ∪ Heavy] = P[Overloaded] + P[Heavy] − P[Overloaded ∩ Heavy]
P[Overloaded ∪ Heavy] = 0.2 + 0.1 − 0.02 = 0.28
Probability
• Exclusive and Independent
– Exclusive:
• Events are ones that never occur together.
– Independent
• The proportion of times A occurs is the same whether or not B occurs
– Using a Venn diagram
[Figure: Venn diagram of A and B with regions w (A only), x (A and B), y (B only) and z (neither)]
• A & B mutually exclusive if x = 0
• A & B independent if w/x = z/y
• w & x refer to events in A
• y & z refer to events outside A
See paragraph 66 NB!
Probability
• Example – Independent features
Total of 200 students, 80 study biology, 90 study neither biology nor geography, and the choice of whether a student does or does not study biology is independent of their choice of studying geography.
Let us use the Venn diagram to determine all required numbers in the Venn diagram.
Probability
• Example solution
[Figure: Venn diagram of B (biology) and G (geography) with regions w, x, y and z]
z = 90 (given)
w + x = 80
Total outside B = 200 − 80 = 120
Number inside G but outside B = 120 – 90 = 30, so y = 30
w/x = z/y = 90/30, so w = 3x
w + x = 4x = 80
x = 20
w = 60
Probability
• Example:
– A disease has an incidence of 0.1%
– The probability of getting a positive test result from a diseased
person is 0.99.
– The probability of getting a positive test result from a healthy
person is 0.02.
– What is the probability that a person who gives a positive test
result really does have the disease?
• Answer:
– Draw tree diagram
P[diseased | +] = (0.001 × 0.99)/[(0.001 × 0.99) + (0.999 × 0.02)] = 0.047
– Conclusion?
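The tree-diagram calculation can be written as a short sketch: the probability of a true positive divided by the probability of any positive. The function and parameter names are illustrative only.

```python
def p_disease_given_positive(incidence, p_pos_if_diseased, p_pos_if_healthy):
    """Probability that a person with a positive test really has the disease."""
    true_positive = incidence * p_pos_if_diseased         # diseased and tests positive
    false_positive = (1 - incidence) * p_pos_if_healthy   # healthy but tests positive
    return true_positive / (true_positive + false_positive)

posterior = p_disease_given_positive(
    incidence=0.001, p_pos_if_diseased=0.99, p_pos_if_healthy=0.02
)
print(f"P[diseased | positive] = {posterior:.3f}")
```

Because the disease is rare, the healthy people who test positive far outnumber the diseased ones, so fewer than 5% of positive results correspond to real cases.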
Probability
• Asking sensitive questions in surveys: the
randomised response method
– Spin a coin twice. Show no-one the results
– If the first spin resulted in a head, answer the question marked H. If the first spin resulted in a tail, answer the question marked T.
• H: Sensitive question
• T: Harmless question – Was the second spin a tail?
Probability
• Asking sensitive questions in surveys: the randomised response method
Spin 1: Head (0.5) → Sensitive Q
– Yes (p) → 0.5p
– No (1 − p) → 0.5 – 0.5p
Spin 1: Tail (0.5) → Harmless Q
– Yes (0.5) → 0.25
– No (0.5) → 0.25
y = 0.5p + 0.25, where y is the overall proportion of "yes" answers
p = 2y – 0.5
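A small sketch of how the survey result would be converted back into an estimate of p. The function names and the survey figure of 35% "yes" answers are hypothetical, chosen only to illustrate the arithmetic.

```python
def expected_yes_fraction(p):
    """Overall proportion of 'yes' answers if a fraction p would truthfully
    answer 'yes' to the sensitive question: y = 0.5 p + 0.25."""
    return 0.5 * p + 0.25

def estimate_sensitive_proportion(yes_fraction):
    """Invert the relationship: p = 2 y - 0.5."""
    return 2 * yes_fraction - 0.5

# Hypothetical survey outcome: 35% of all respondents answered 'yes'
observed_yes = 0.35
estimated_p = estimate_sensitive_proportion(observed_yes)
print(f"estimated p = {estimated_p:.2f}")
```

In a real survey y is itself only an estimate from a sample, so the value of p carries sampling error. An observed y below 0.25 or above 0.75 would give an estimate outside the range 0 to 1.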
