
BASIC STATISTICAL METHODS: Module 2
SHC 797: 2011

Prof. Hannes Gräbe Pr Eng
Department of Civil Engineering
Engineering 1 Room 11-15
Tel. (012) 420 4723
hannes.grabe@up.ac.za

Acknowledgements:
Mr. J. Van der Westhuizen
Prof. Alex Visser

What is Statistics about?


• Data Collection
• Summarizing Data
• Interpreting Data
• Drawing Conclusions
from Data

Context

• Part 1: Data Description


• Part 2: Probability
• Part 3: Inference

Part 1: Data Presentation

Data Description
• Designing experiments
– Does pre-heating reduce
the possibility of rail
defects in thermit welds ?

• Observational studies
– Polls: The popularity of
the President

Data Description
• How can we summarise small amounts of data?
• Example: Bridge column support height

Number   Height (m)
1        8.78
2        8.07
3        8.15
4        8.98
5        9.02
6        7.99
7        8.52
8        8.27
9        8.89

Data Description
• First remember that the following is required
before we continue:
– Data needs to be a random sample
– How accurately were the measurements
made?
– Notice the difference between
measurements and counts
• Measurements: More accurate
• Counts: Always whole numbers

Data Description
• Example: Bridge column support height (same table as above)
• Most important things about the numbers in this list:
– How low/high a typical one is
– How variable the numbers are

Data Description
• Measures of location: Mean

Bridge mean height = (8.78 + 8.07 + 8.15 + 8.98 + 9.02 + 7.99 + 8.52 + 8.27 + 8.89) / 9 = 76.67 / 9 = 8.519 m

• Symbol for the mean if the observations are referred to as x's:

$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$

• Also called the arithmetic mean or average

Data Description
• Measures of location: Median
• Arrange the measurements in order of size. The median is then the middle one:

Number   Height (m)
6        7.99
2        8.07
3        8.15
8        8.27
7        8.52
1        8.78
9        8.89
4        8.98
5        9.02

Bridge median height = 8.52 m

• If there is an even number of observations, we take the average of the two middle ones
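The mean and median of the bridge heights can be checked with a short script. This is a minimal sketch using Python's standard `statistics` module; the variable names are illustrative only.

```python
from statistics import mean, median

# Bridge column support heights (m) from the table in the slides.
heights = [8.78, 8.07, 8.15, 8.98, 9.02, 7.99, 8.52, 8.27, 8.89]

mean_height = mean(heights)      # sum of the observations divided by n
median_height = median(heights)  # middle value of the ordered observations

print(f"mean   = {mean_height:.3f} m")
print(f"median = {median_height:.2f} m")
```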

Data Description
• Measures of location: Mode
– The mode is the observation that
occurs most frequently. In our
previous example there is no mode.
– This is more useful for counted data.
For example number of vehicles per
hour over a 24 hour period.

Data Description
• Mean:
– Easy to calculate
– Much statistical theory is based on it
– Accurate: means of different samples do not vary very much
– Each observation contributes equally to the calculation
– Sensitive to mistakes
• Median:
– Can be calculated with merely graded data
– Easy to calculate for small samples, not for moderate or large samples
– All values have to be stored for the calculation
– Not as sensitive to mistakes
– Not disturbed by outliers

Data Description
• Guidelines for mean, median and modes
– In case of no outliers use the mean
– In case of outliers:
[Figure: diagram distinguishing measurements from counts]
– Example: vehicles per hour, where peak hours have a substantial influence

Data Description
• Other measures of location:
– Trimmed mean: discard the upper and lower 5% of the observations before averaging
– Root mean square = square root of the mean of the squared observations
– Geometric mean of n positive numbers = nth root of their product
– Harmonic mean = 1/(mean of the reciprocals of the observations)

Data Description
• Other measures of location: Weighted means
– Each observation is not counted equally, but is “weighted” according to its size:

$\bar{x}_w = \frac{\sum w(x)\, x}{\sum w(x)}$

Examples
• University class size: A third = 9,
another third= 60, last third = 300
• What is the mean class size?

Option 1: (9+60+300)/3 = 369/3 = 123


Option 2: (9*9+60*60+300*300)/369 =
93681/369 = 254

Examples
• Performance evaluation calculation:

KPA   Weight   Rating (/5)
1     50%      4
2     30%      3
3     20%      3
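Both examples are weighted means, so they can be sketched with one helper. The function name `weighted_mean` is an assumption made for illustration. Reading Option 1 as "mean per class" and Option 2 as "mean per student" is an interpretation, not something the slides state.

```python
def weighted_mean(values, weights):
    """Return sum(w * x) / sum(w)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# Performance evaluation: ratings out of 5, weighted by the KPA weights.
ratings = [4, 3, 3]
kpa_weights = [0.5, 0.3, 0.2]
overall_rating = weighted_mean(ratings, kpa_weights)

# Class sizes: equal weights reproduce Option 1 (one vote per class);
# weighting each class by its own size reproduces Option 2 (one vote per student).
class_sizes = [9, 60, 300]
mean_per_class = weighted_mean(class_sizes, [1, 1, 1])
mean_per_student = weighted_mean(class_sizes, class_sizes)

print(f"overall rating   = {overall_rating:.2f}")
print(f"mean per class   = {mean_per_class:.0f}")
print(f"mean per student = {mean_per_student:.0f}")
```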

Data Description
• Measure of variation: Range
– The range of a set of numbers is the largest minus the smallest
• The range of bridge height is 9.02 – 7.99 = 1.03 m
– Only two observations contribute directly
– It is very sensitive to unusually big or small observations
– No standardisation to sample size
– But of use when the sample size is the same (industrial QC)

Data Description
• Measure of variation: IQR and SIQR
– Inter-quartile range (IQR): the distance between the lower and upper quartiles
– Semi-inter-quartile range (SIQR) = ½ IQR
– Quartiles separate the ordered data into 4 equal groups of 25% each:
1st quartile (lower), 2nd quartile (median), 3rd quartile (upper)

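Data Description
• To determine the lower and upper quartiles we recommend the following positions in the ordered data (n = number of observations):
– Lower quartile position: 0.25n + 0.5
– Upper quartile position: 0.75n + 0.5
– If the position is not a whole number, interpolate linearly between the two neighbouring ordered observations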

Data Description
• Example
– Using the bridge data, calculate the LQ, UQ, IQR and SIQR
– n = 9
– LQ position = 0.25(9) + 0.5 = 2.75
– UQ position = 0.75(9) + 0.5 = 7.25

Order   No.   Height (m)
1       6     7.99
2       2     8.07
3       3     8.15
4       8     8.27
5       7     8.52
6       1     8.78
7       9     8.89
8       4     8.98
9       5     9.02

– LQ = 8.07 + 0.75(8.15 – 8.07) = 8.13
– UQ = 8.89 + 0.25(8.98 – 8.89) = 8.91
– IQR = 8.91 – 8.13 = 0.78
– SIQR = 0.39
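The quartile rule can be sketched in code as follows. The helper name `quartile` is hypothetical, and the linear interpolation between neighbouring ordered values is an assumption that reproduces the slide's 8.13 and 8.91. The sketch assumes at least two observations.

```python
def quartile(sorted_values, fraction):
    """Value at 1-based position fraction * n + 0.5, linearly interpolated."""
    n = len(sorted_values)
    position = fraction * n + 0.5
    lower_index = int(position)       # whole part of the 1-based position
    weight = position - lower_index   # fractional part used for interpolation
    lower = sorted_values[lower_index - 1]
    upper = sorted_values[min(lower_index, n - 1)]
    return lower + weight * (upper - lower)

heights = sorted([8.78, 8.07, 8.15, 8.98, 9.02, 7.99, 8.52, 8.27, 8.89])
lq = quartile(heights, 0.25)
uq = quartile(heights, 0.75)
iqr = uq - lq
siqr = iqr / 2

print(f"LQ = {lq:.2f}, UQ = {uq:.2f}, IQR = {iqr:.2f}, SIQR = {siqr:.2f}")
```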

Data Description
• Measure of variation: Mean Absolute Deviation (MAD)

$\text{M.A.D.} = \frac{\sum |x - \bar{x}|}{n}$

Observation       Deviation from mean   Absolute deviation from mean
Height (m), x     x − x̄                 |x − x̄|
8.78              0.26                  0.26
8.07              -0.45                 0.45
8.15              -0.37                 0.37
8.98              0.46                  0.46
9.02              0.50                  0.50
7.99              -0.53                 0.53
8.52              0.00                  0.00
8.27              -0.25                 0.25
8.89              0.37                  0.37
Sum               0.00                  3.19

M.A.D. = 3.19 / 9 = 0.354
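A minimal sketch of the same calculation is shown below. It uses the unrounded mean, so the result comes out near 0.355 rather than the 0.354 obtained from deviations rounded to two decimals.

```python
# Bridge column support heights (m).
heights = [8.78, 8.07, 8.15, 8.98, 9.02, 7.99, 8.52, 8.27, 8.89]

n = len(heights)
mean_height = sum(heights) / n

# Mean absolute deviation: average distance of each observation from the mean.
mad = sum(abs(x - mean_height) for x in heights) / n

print(f"M.A.D. = {mad:.3f} m")
```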

Data Description
• Measure of variation: Standard Deviation
– SD is approximately the average of the difference between each value and the mean.

$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$

– SD is what is most commonly used as a measure of variation
– The square of the SD is called the variance: $\text{var} = s^2$

Data Description
• Standard Deviation - Example

Observation       Deviation from mean   Squared deviation from mean
Height (m), x     x − x̄                 (x − x̄)²
8.78              0.26                  0.0676
8.07              -0.45                 0.2025
8.15              -0.37                 0.1369
8.98              0.46                  0.2116
9.02              0.50                  0.2500
7.99              -0.53                 0.2809
8.52              0.00                  0.0000
8.27              -0.25                 0.0625
8.89              0.37                  0.1369
Sum               0                     1.3489

$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{\frac{1.3489}{8}} = 0.411$

Equivalent computational form:

$s = \sqrt{\frac{\sum x^2 - \frac{1}{n}\left(\sum x\right)^2}{n - 1}}$
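The two forms of the standard deviation formula can be compared numerically. This sketch also checks them against `statistics.stdev`, which uses the same n − 1 divisor; the variable names are illustrative.

```python
from math import sqrt
from statistics import stdev

heights = [8.78, 8.07, 8.15, 8.98, 9.02, 7.99, 8.52, 8.27, 8.89]
n = len(heights)
mean_height = sum(heights) / n

# Definition: squared deviations from the mean, divided by n - 1.
s_definition = sqrt(sum((x - mean_height) ** 2 for x in heights) / (n - 1))

# Computational form: needs only the sum of x and the sum of x squared.
sum_x = sum(heights)
sum_x2 = sum(x * x for x in heights)
s_computational = sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))

variance = s_definition ** 2

print(f"s = {s_definition:.3f} m, variance = {variance:.4f} m^2")
```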

Data Description
• Standard Deviation – Typical Observations

Description                                         Dataset size
Few observations further than 1 SD from the mean    Small dataset (10 observations)
Few observations further than 2 SD from the mean    Medium-sized dataset (100 observations)
Few observations further than 3 SD from the mean    Large dataset (1000 observations)

Data Description: Pictorial
• The box-and-whisker plot
[Figure: box-and-whisker plot with the median and the whiskers labelled]

Data Description: Pictorial
• The box-and-whisker plot
– Can be drawn horizontally or vertically
– Outliers: more than 1 IQR above UQ, or more than 1 IQR below LQ
– This is however much more complex
– Transformations (taking the square root or the logarithm) can be used to make this plot more useful
– Skewness = (UQ − median) / (median − LQ)

Data Description: Pictorial


• Cumulative frequency plot
– Vertical axis represents the
number/percentage/proportion of
observations that are less than or equal
to the x value on the horizontal axis

Data Description
• Flood Example: Flood size from 1980 – 1989:

50 12 16 20 17 13 61 26 33 38

[Figure: cumulative frequency plot; x-axis: size of flood (0 to 70); y-axis: frequency (0 to 1)]

Data Description
• Flood Example – IQR
• The box-and-whisker plot

[Figure: cumulative frequency plot of flood size with the box-and-whisker plot drawn below the x-axis]

Minimum = 12, LQ = 16, median = 23, UQ = 38, maximum = 61

Data Description
• Large Datasets
• See arm length measurements below.
741 817 846 845 846 833 782 767 786 810

765 694 758 754 754 806 775 798 740 809

759 785 795 830 854 830 789 802 720 816

764 783 747 774 763 781 804 727 809 801

796 791 811 833 757 786 806 796 776 803

801 817 831 811 801 802 834 805 829 817

801 796 706 802 774 767 811 767 830 771

759 751 765 811 727 761 808 777 835 787

788 776 754 812 860 765 763 780 777 737

761 791 757 758 795 708 784 725 800 723

* Anthropometric survey of 100 male car drivers’


arm lengths

Data Description
Example 1
• Tabulate this data
• Choosing classes
– Use convenient round numbers
– Specific classes
– Sufficiently many classes – not wide grouping
– Sufficiently few classes – between 5 and 15 classes
– Each observation must go into only one class

Range          Count of observations   Cumulative
Up to 699      1                       1
700-719        2                       3
720-739        6                       9
740-759        13                      22
760-769        12                      34
770-779        8                       42
780-789        11                      53
790-799        7                       60
800-809        16                      76
810-829        11                      87
830-849        11                      98
850-869        2                       100
870 and over   0                       100

Data Description
• Stem-and-leaf plots
69 4
70 6 6
71
72 0 7 7 5 3
73 7
74 2 0 7
75 8 4 4 9 7 9 1 4 7 8
76 7 5 4 3 9 7 7 5 1 5 3 1
77 5 4 6 4 1 7 6 7
78 2 6 5 9 3 1 6 7 8 0 3
79 8 5 6 1 6 1 5
80 6 9 2 4 9 1 6 3 1 1 2 5 1 2 8 0
81 7 0 6 1 7 1 7 1 1 2
82 9
83 3 0 0 3 1 4 0 5
84 6 5 6
85 4
86 0

Data Description
• Histogram presentation
– Note the change in class width: where a class is double the width of the selected standard class, its frequency is halved. Rather keep the class width constant.

[Figure: histogram; x-axis: arm length (mm), 650 to 900; y-axis: proportion of observations per class width]

Data Description
• Cumulative frequency plot

[Figure: cumulative frequency plot; x-axis: arm length (mm), 680 to 880; y-axis: % observations <= x, 0 to 100]

Data Description: example
• The maximum flood in a river for a period of 9 years is:
20, 22, 21, 24, 19, 25, 27, 20, 23 m³/s
– Calculate:
• i) Mode
• ii) Median
• iii) The expected value/mean
• iv) IQR and SIQR
• v) Standard deviation
• vi) Draw the cumulative frequency plot

Data Description: example
• Solution
19, 20, 20, 21, 22, 23, 24, 25, 27

• Mode: 20 m³/s
• Median: 22 m³/s
• Mean: $\bar{x} = \frac{\sum x}{n} = \frac{201}{9} = 22.3$ m³/s

Data Description: example
• Solution
19, 20, 20, 21, 22, 23, 24, 25, 27

• LQ (position 2.75) = 20.00
• UQ (position 7.25) = 24.25
• IQR = 24.25 – 20.00 = 4.25
• SIQR = 4.25/2 = 2.125
• Skewness = (24.25 – 22)/(22 – 20) = 1.125
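The location and quartile results for this flood example can be reproduced with the sketch below. The helper `quartile` is a hypothetical name; it implements the 0.25n + 0.5 and 0.75n + 0.5 position rule with linear interpolation, which is an assumption consistent with the values above.

```python
from statistics import mean, median, mode

def quartile(sorted_values, fraction):
    """Value at 1-based position fraction * n + 0.5, linearly interpolated."""
    n = len(sorted_values)
    position = fraction * n + 0.5
    lower_index = int(position)
    weight = position - lower_index
    lower = sorted_values[lower_index - 1]
    upper = sorted_values[min(lower_index, n - 1)]
    return lower + weight * (upper - lower)

floods = [20, 22, 21, 24, 19, 25, 27, 20, 23]  # maximum flood per year, m^3/s
ordered = sorted(floods)

flood_mode = mode(floods)
flood_median = median(floods)
flood_mean = mean(floods)
lq = quartile(ordered, 0.25)
uq = quartile(ordered, 0.75)
iqr = uq - lq
siqr = iqr / 2
skewness = (uq - flood_median) / (flood_median - lq)

print(flood_mode, flood_median, round(flood_mean, 1), lq, uq, iqr, siqr, skewness)
```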

Data Description: example

Observation   x − x̄    (x − x̄)²
19            -3.33    11.11
20            -2.33    5.44
20            -2.33    5.44
21            -1.33    1.78
22            -0.33    0.11
23            0.67     0.44
24            1.67     2.78
25            2.67     7.11
27            4.67     21.78
Sum           0.03     56.00

$\sigma = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{\frac{56.00}{8}} = 2.65$ m³/s

Data Description
• Cumulative frequency plot: Cumulative proportions

[Figure: cumulative proportions by class; x-axis: arm length (mm) classes from "Up to 699" to "850-869"; y-axis: proportion of observations less than x, 0.0 to 1.0]

Data Description
• Computing the mean and standard deviation of grouped data – Arm length example
– Assume that all the observations in a class were in fact at its mid-point

Class       x     Frequency, f   xf      x²f
Up to 699   690   1              690     476100
700-719     710   2              1420    1008200
720-739     730   6              4380    3197400
740-759     750   13             9750    7312500
760-769     765   12             9180    7022700
770-779     775   8              6200    4805000
780-789     785   11             8635    6778475
790-799     795   7              5565    4424175
800-809     805   16             12880   10368400
810-829     820   11             9020    7396400
830-849     840   11             9240    7761600
850-869     860   2              1720    1479200
Sum               100            78680   62030150

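Data Description
• The mean of grouped data:

$\bar{x} = \frac{1}{n} \sum x f = \frac{1}{100}(78680) = 786.8$ mm

• The standard deviation of grouped data:

$\sigma = \sqrt{\frac{\sum (x - \bar{x})^2 f}{n - 1}} = \sqrt{\frac{\sum x^2 f - \frac{1}{n}\left(\sum x f\right)^2}{n - 1}} = \sqrt{\frac{62030150 - \frac{1}{100}(78680)^2}{99}} = 35.5$ mm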
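The grouped-data sums can be reproduced from the class mid-points and frequencies. This is a sketch only; the list names are illustrative and the mid-points are the ones tabulated for the arm length example.

```python
from math import sqrt

# Class mid-points (mm) and frequencies from the arm length table.
midpoints = [690, 710, 730, 750, 765, 775, 785, 795, 805, 820, 840, 860]
frequencies = [1, 2, 6, 13, 12, 8, 11, 7, 16, 11, 11, 2]

n = sum(frequencies)
sum_xf = sum(x * f for x, f in zip(midpoints, frequencies))
sum_x2f = sum(x * x * f for x, f in zip(midpoints, frequencies))

grouped_mean = sum_xf / n
grouped_sd = sqrt((sum_x2f - sum_xf ** 2 / n) / (n - 1))

print(f"mean = {grouped_mean:.1f} mm, SD = {grouped_sd:.1f} mm")
```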

Data Description
• Computing the median and quartiles of grouped data
– The cumulative proportion is 0.42 at x = 780 and 0.53 at x = 790, so the median (cumulative proportion 0.50) follows by linear interpolation:

$x = 780 + \frac{0.08}{0.11}(790 - 780) = 787.3$

[Figure: cumulative proportion plot; x-axis: arm length (mm), 680 to 880; y-axis: proportion of observations less than x, 0.00 to 1.00]

Data Description
• Same can be done to determine quartiles of grouped data
• For 0.25:
– X = 762.5
• For 0.75:
– X = 809.4
• Inter-quartile range
– 809.4 - 762.5 = 46.9
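The interpolation for the grouped median and quartiles can be sketched as below. The function name `grouped_quantile` and the class edges are assumptions: the edges are taken as 680, 700, ..., 870 mm, matching the mid-points used for the grouped mean, and the proportion is assumed to be greater than 0.

```python
def grouped_quantile(edges, cumulative_counts, proportion):
    """Interpolate linearly within the class where the cumulative count reaches the target."""
    target = proportion * cumulative_counts[-1]
    prev_edge, prev_count = edges[0], 0
    for edge, count in zip(edges[1:], cumulative_counts):
        if count >= target:
            fraction = (target - prev_count) / (count - prev_count)
            return prev_edge + fraction * (edge - prev_edge)
        prev_edge, prev_count = edge, count
    return edges[-1]

# Assumed class edges (mm) and the cumulative counts at each upper edge.
edges = [680, 700, 720, 740, 760, 770, 780, 790, 800, 810, 830, 850, 870]
cumulative = [1, 3, 9, 22, 34, 42, 53, 60, 76, 87, 98, 100]

median_x = grouped_quantile(edges, cumulative, 0.50)
lower_q = grouped_quantile(edges, cumulative, 0.25)
upper_q = grouped_quantile(edges, cumulative, 0.75)

print(f"median = {median_x:.1f}, LQ = {lower_q:.1f}, UQ = {upper_q:.1f}")
```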

Data Description
• Percentile
– Percentiles divide the ordered observations into one hundred equal groups.
– EXAMPLE:
•A man of 95th percentile height is
one who is taller than 95 percent
of men and shorter than 5
percent.

Data Description
• 95th Percentile Example

[Figure: cumulative proportion plot; x-axis: arm length (mm), 680 to 880; y-axis: proportion of observations less than x; the 95th percentile is read off at 845 mm]

Data Description
• Regression and correlation
– Plotting one variable against another
• Each x value has a corresponding y value
• Plot them against each other
• See example below: Sum of average daily temperatures (in thousands of degrees) and total rainfall (in hundreds of mm) for 1924-1929

Avg Daily Temperatures   Rainfall
3.1                      3.6
3.0                      3.4
3.2                      3.9
3.1                      4.7
3.2                      2.9
3.3                      2.2

Data Description
• Scatter plot

[Figure: scatter plot; x-axis: average daily temperatures (thousands of degrees), 2.9 to 3.3; y-axis: rainfall (hundreds of mm), 1.5 to 5.0]

Data Description
• Regression
– With this term, emphasis is on the
equation that predicts y from x
• Correlation
– With this term, emphasis is on the
strength of the relationship. That is,
how useful is known x for the
purpose of predicting y? We will only
be concerned with the strength of
the linear relationship.

Data Description
• The “best” straight line
– Best straight line for predicting y.
– Any straight line obeys the equation
y = a +bx for some choice of the
constants a and b.
– Therefore choosing the best straight
line means choosing a and b.

Data Description
• Regression Analysis

[Figure: observations y1 to y4 plotted against the fitted line ŷ = a + bx]

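Data Description
• Regression Analysis

Sum of squared errors:

$E = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

but $\hat{y} = a + bx$, therefore

$E = \sum_{i=1}^{n} (y - a - bx)^2$

E is a minimum where $\frac{\partial E}{\partial a} = \frac{\partial E}{\partial b} = 0$:

$\frac{\partial E}{\partial a} = -2 \sum_{i=1}^{n} (y - a - bx) = 0$

$\frac{\partial E}{\partial b} = -2 \sum_{i=1}^{n} (y - a - bx)\,x = 0$

Dividing by −2:

$\sum_{i=1}^{n} (y - a - bx) = 0$ ... eq. 1

$\sum_{i=1}^{n} (y - a - bx)\,x = 0$ ... eq. 2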

Data Description
From eq. 1, $\sum (y - a - bx) = 0$:

$\sum y - na - b \sum x = 0$

but $\sum y = n\bar{y}$ and $\sum x = n\bar{x}$

$\therefore a = \frac{1}{n}(n\bar{y} - bn\bar{x}) = \bar{y} - b\bar{x}$ ... eq. 3

From eq. 2:

$\sum xy - a \sum x - b \sum x^2 = 0$

Substituting eq. 3 into eq. 2:

$\sum xy - (\bar{y} - b\bar{x}) \sum x - b \sum x^2 = 0$

$\sum xy - \bar{y} \sum x + b\bar{x} \sum x - b \sum x^2 = 0$

$\sum xy - \bar{y} \sum x = b \left( \sum x^2 - \bar{x} \sum x \right)$

$b = \frac{\sum xy - \bar{y} \sum x}{\sum x^2 - \bar{x} \sum x}$

$b = \frac{\sum xy - \frac{1}{n} \sum y \sum x}{\sum x^2 - \frac{1}{n} \left( \sum x \right)^2}$

$a = \bar{y} - b\bar{x}$

Data Description: example
• Sum Squared Errors – Class Example 1
• ŷ = 3 + x

x     y     ŷ     y − ŷ   (y − ŷ)²
1     4     4     0       0
2     2     5     -3      9
3     6     6     0       0
4     8     7     1       1
Sum                       10

Data Description: example
• Sum Squared Errors – Class Example 1
• ŷ = 1 + 2x

x     y     ŷ     y − ŷ   (y − ŷ)²
1     4     3     1       1
2     2     5     -3      9
3     6     7     -1      1
4     8     9     -1      1
Sum                       12

Worse than the previous choice of the straight line

Data Description: example
• Sum Squared Errors – Class Example 1
• ŷ = 8 − x

x     y     ŷ     y − ŷ   (y − ŷ)²
1     4     7     -3      9
2     2     6     -4      16
3     6     5     1       1
4     8     4     4       16
Sum                       42

Worse than the previous 2 choices of the straight line

Data Description: example
• Sum Squared Errors – Class Example 1
• Calculate the slope: b

x     y     x²    xy
1     4     1     4
2     2     4     4
3     6     9     18
4     8     16    32
Sum:
10    20    30    58

$b = \frac{\sum xy - \frac{1}{n} \sum y \sum x}{\sum x^2 - \frac{1}{n} \left( \sum x \right)^2} = \frac{58 - \frac{1}{4} \times 10 \times 20}{30 - \frac{1}{4}(10)^2} = \frac{8}{5} = 1.6$

Data Description: example
• Sum Squared Errors – Class Example
• Calculate the intercept: a

x     y     x²    xy
1     4     1     4
2     2     4     4
3     6     9     18
4     8     16    32
Sum:
10    20    30    58

$\bar{y} = \frac{20}{4} = 5$

$\bar{x} = \frac{10}{4} = 2.5$

$a = \bar{y} - b\bar{x} = 5 - 1.6 \times 2.5 = 1$

Data Description: example
• Sum Squared Errors – Class Example
• ŷ = 1 + 1.6x

x     y     ŷ     y − ŷ   (y − ŷ)²
1     4     2.6   1.4     1.96
2     2     4.2   -2.2    4.84
3     6     5.8   0.2     0.04
4     8     7.4   0.6     0.36
Sum                       7.20
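The class example can be checked end to end with a short sketch. The function names `fit_line` and `sum_squared_errors` are illustrative; the slope and intercept follow the least-squares formulas for b and a derived in these slides.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    a = sum_y / n - b * sum_x / n
    return a, b

def sum_squared_errors(xs, ys, a, b):
    """Sum of squared differences between y and the line a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

xs = [1, 2, 3, 4]
ys = [4, 2, 6, 8]

a, b = fit_line(xs, ys)
print(f"best line: y = {a:.1f} + {b:.1f}x")

# Compare the three trial lines from the slides with the fitted line.
for label, (a_try, b_try) in {"3 + x": (3, 1), "1 + 2x": (1, 2), "8 - x": (8, -1)}.items():
    print(label, sum_squared_errors(xs, ys, a_try, b_try))
print("fitted", round(sum_squared_errors(xs, ys, a, b), 2))
```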

Data Description
• How do we know we’ve made a good choice?
– Plot residuals (y − ŷ) against x
– Residuals equally distributed: straight line is a good choice
– Pattern seen in the residuals: straight line is not a good choice

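Data Description
• How do we know we’ve made a good choice?
– Calculate correlation coefficient

$r = \frac{\text{covariance}}{\sqrt{(\text{variance of } x)(\text{variance of } y)}}$

$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$

where

$S_{xx} = \sum x^2 - \frac{1}{n} \left( \sum x \right)^2$

$S_{yy} = \sum y^2 - \frac{1}{n} \left( \sum y \right)^2$

$S_{xy} = \sum xy - \frac{1}{n} \left( \sum x \right) \left( \sum y \right)$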

Data Description
• Correlation – Example
• ŷ = 1 + 1.6x

x     y     x²    y²    xy
1     4     1     16    4
2     2     4     4     4
3     6     9     36    18
4     8     16    64    32
Sum:
10    20    30    120   58

Sxx = 30 − 1/4(10)² = 5
Syy = 120 − 1/4(20)² = 20
Sxy = 58 − 1/4(10)(20) = 8

$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{8}{\sqrt{(5)(20)}} = \frac{8}{10} = 0.8$

– r = 1: straight line
– r > 0.98: very good correlation
– r > 0.96: good correlation
– r < 0.9: bad correlation

Data Description

• Temperature Rainfall example


– Class Example

Temperature (x) Rainfall (y)


3.1 3.6
3.0 3.4
3.2 3.9
3.1 4.7
3.2 2.9
3.3 2.2

Data Description
• Temperature Rainfall example
– Class Example

[Figure: scatter plot; x-axis: temperature (thousands of degrees), 2.9 to 3.3; y-axis: rainfall (hundreds of mm), 1.5 to 5.0]

Data Description
• Temperature Rainfall example
– Class Example

Temperature (x)   Rainfall (y)   x²      y²      xy
3.1               3.6            9.61    12.96   11.16
3.0               3.4            9.00    11.56   10.20
3.2               3.9            10.24   15.21   12.48
3.1               4.7            9.61    22.09   14.57
3.2               2.9            10.24   8.41    9.28
3.3               2.2            10.89   4.84    7.26
Sum:
18.9              20.7           59.59   75.07   64.95

Data Description
• Temperature Rainfall example
– Class Example

Sxx = 59.59 – 1/6(18.9)² = 0.055
Syy = 75.07 – 1/6(20.7)² = 3.655
Sxy = 64.95 – 1/6(18.9)(20.7) = -0.255

$b = \frac{S_{xy}}{S_{xx}} = \frac{-0.255}{0.055} = -4.64$

$a = \bar{y} - b\bar{x} = 3.45 - (-4.64)(3.15) = 18.1$

$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{-0.255}{\sqrt{0.055 \times 3.655}} = -0.57$
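The temperature and rainfall calculation can be reproduced as follows. This is a sketch with illustrative variable names, using the same S_xx, S_yy and S_xy definitions as the slides.

```python
from math import sqrt

temperature = [3.1, 3.0, 3.2, 3.1, 3.2, 3.3]  # x, thousands of degrees
rainfall = [3.6, 3.4, 3.9, 4.7, 2.9, 2.2]     # y, hundreds of mm
n = len(temperature)

sum_x, sum_y = sum(temperature), sum(rainfall)
s_xx = sum(x * x for x in temperature) - sum_x ** 2 / n
s_yy = sum(y * y for y in rainfall) - sum_y ** 2 / n
s_xy = sum(x * y for x, y in zip(temperature, rainfall)) - sum_x * sum_y / n

b = s_xy / s_xx                 # slope
a = sum_y / n - b * sum_x / n   # intercept
r = s_xy / sqrt(s_xx * s_yy)    # correlation coefficient

print(f"b = {b:.2f}, a = {a:.1f}, r = {r:.2f}")
```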

Part 2: Probability

Probability

• Definition: The probability of an event is the


chance that it will occur.
• Notation
– Px[3] or P[x = 3] is the probability that x = 3
– Fx[3] is the cumulative probability that x is less than or equal to 3
• Additional notes may be found on ClickUp:
Study material/Applied Statistics_Van
As_2008

Probability
• Three ways of refining this idea:
1) A priori approach
– Sometimes, the experimental set-up
is so clear, we know the probabilities
in advance of collecting any data

Probability
• A priori approach: Examples
– Coin
• P[x = head] = P[x = tail] = 0.5
– Dice
• P[x = 1] = P[x = 2] = P[x = 3] = P[x = 4]
=P[x = 5] = P[x = 6] = 1/6
– Cards
• P[x = 4 of hearts] = 1/52
• P[x = Ace] = 4/52 = 1/13

Probability
2) Empirical
– By having enough experimental data
– Examples
• Break 100 concrete cubes. The strength of 30 cubes is more than 50 MPa
• P[x > 50] = 30/100 = 0.3

Probability
3) Subjective assessments
– What is the probability that it will
rain on the 27th of July in Pretoria.
– Argument: During winter it does not
rain regularly in Pretoria.
•No experimental information
•Intuition:
–P[x = rain] = 0.05

Probability
• We do not need to worry about the
philosophy of probability. Probabilities
are much easier to use in practical
calculations than they are to
philosophise about!

Probability
• Probability Scale
– 1.0: Dying
– 0.9: Pass Statistics
– 0.5: Coin
– 0.167: Dice
– 0: Swim through the Atlantic ocean

Probability
• Definitions

– The probability of event A and event B


occurring: P[A ∩ B]
– The probability of event A or event B
occurring: P[A ∪ B]
– The probability of event A given that B is
occurring: P[A | B]

Probability
• Mutually-exclusive events
– This means they cannot occur
together.
– Consequently, the probability of one
or the other of two mutually-
exclusive events occurring is the sum
of their individual probabilities.

Probability
• Addition rule for mutually exclusive events:
the probability of one or other of two
mutually exclusive events occurring is the
sum of their individual probabilities:
• Example:
– P[temp increase] = 0.62
– P[temp unchanged] = 0.23
– Then P[temp increase or unchanged] =
0.62 + 0.23 = 0.85

Probability
• Addition rule for mutually exclusive events:
the probabilities of all possible exclusive
outcomes add up to 1:
• Example:
– P[temp increase] = 0.62
– P[temp unchanged] = 0.23
– Then P[temp decrease] = 1 - 0.62 - 0.23 =
0.15

Probability
• Multiplication rule for independent events: the
probability of more than one event happening is
the product of their individual probabilities, if
they are independent:
• Example:
– P[dice = 2] = 1/6
– P[draw a King from cards] = 1/13
– Then P[dice= 2 and King] = 1/6*1/13 = 1/78

Probability
• Example – Mutually-exclusive events
– Number of students on campus is 50 000
– B.Eng = 10 000
– B.Com = 15 000
– Choosing a student
• P[B.Eng] = 10 000/50 000 = 0.2
• P[B.Com] = 15 000/50 000 = 0.3

P[B.Eng ∪ B.Com] = 0.2 + 0.3 = 0.50

P[B.Eng ∩ B.Com] = 0

Probability
• Venn Diagram – Mutually-exclusive

B.Com

B.Eng

Other

P[B.Eng] P[B.Com] P[Other]


Sum of P’s = 1 0.2 0.3 0.5

Probability
• Independent probability
• Where two experiments are not influencing each other.
• Probability of both happening is the product of their
individual probabilities

• Example
– What sex is a person? Man – A, Woman – B
– Does the person own a vehicle? Yes – C, No – D
– P[A] = 0.45, P[B] = 0.55, P[C] = 0.8, P[D] = 0.2

Probability
• Example
No – Vehicle - Yes P[A] = 0.45, P[B] = 0.55, P[C] = 0.8, P[D] = 0.2

Man – Sex – Women

P[ A ∩ C ] = ?
P[ A ∩ D ] = ?
P[B ∩ C ] = ?
P[B ∩ D ] = ?

Probability
• Example: P[A] = 0.45, P[B] = 0.55, P[C] = 0.8, P[D] = 0.2

                   Man (A)   Woman (B)
Vehicle: Yes (C)   0.36      0.44
Vehicle: No (D)    0.09      0.11

P[A ∩ C] = 0.45 × 0.8 = 0.36
P[A ∩ D] = 0.45 × 0.2 = 0.09
P[B ∩ C] = 0.55 × 0.8 = 0.44
P[B ∩ D] = 0.55 × 0.2 = 0.11

Probability
• General addition Rule

A C

A is made up of {A but not C} plus {A and C}


C is made up of {C but not A} plus {A and C}
Therefore A or C is {A} plus {C}, minus {A and C}

“A or B” means “A or B or possibly both” (True or False?)

Probability
• Example: P[A] = 0.45, P[B] = 0.55, P[C] = 0.8, P[D] = 0.2

                   Man (A)   Woman (B)
Vehicle: Yes (C)   0.36      0.44
Vehicle: No (D)    0.09      0.11

P[A ∩ C] = 0.45 × 0.8 = 0.36
P[A ∩ D] = 0.45 × 0.2 = 0.09
P[B ∩ C] = 0.55 × 0.8 = 0.44
P[B ∩ D] = 0.55 × 0.2 = 0.11
P[A ∪ C] = P[A] + P[C] − P[A ∩ C] = 0.45 + 0.8 − 0.36 = 0.89
P[B ∪ C] = P[B] + P[C] − P[B ∩ C] = 0.55 + 0.8 − 0.44 = 0.91
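The general addition rule for this example can be sketched in a few lines. Sex and vehicle ownership are assumed to be independent, as in the slides; the complement check at the end is an added cross-check, not part of the original solution.

```python
p_man, p_woman = 0.45, 0.55        # P[A], P[B]
p_vehicle, p_no_vehicle = 0.8, 0.2  # P[C], P[D]

# Independence assumed: joint probability is the product of the two.
p_man_and_vehicle = p_man * p_vehicle
p_woman_and_vehicle = p_woman * p_vehicle

# General addition rule: P[A or C] = P[A] + P[C] - P[A and C].
p_man_or_vehicle = p_man + p_vehicle - p_man_and_vehicle
p_woman_or_vehicle = p_woman + p_vehicle - p_woman_and_vehicle

# Cross-check: "man or vehicle" is everything except "woman without vehicle".
p_check = 1 - p_woman * p_no_vehicle

print(round(p_man_or_vehicle, 2), round(p_woman_or_vehicle, 2))
```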

Probability
• Same Example – Probability Tree

Man Sex Woman


P=0.45 P=0.55
Vehicle Vehicle
Yes No Yes No
P=0.8 P=0.2 P=0.8 P=0.2

0.36 0.09 0.44 0.11

Probability
• Class Example
• A concrete beam will fail if the
concrete is too weak or the load is too
high.
• P[Weak] = 0.2
• P[High] = 0.3
• P[Failure] = ?

Probability
• P[Weak] = 0.2
• P[High] = 0.3
• P[Failure] = ?

Assuming that concrete strength and load are independent:

P[Failure] = P[Weak] + P[High] − P[Weak ∩ High]
P[Failure] = P[Weak] + P[High] − P[Weak]P[High]
P[Failure] = 0.2 + 0.3 − (0.2)(0.3)
P[Failure] = 0.5 − 0.06 = 0.44

[Figure: Venn diagram of W and H: weak only = 0.14, weak and high = 0.06, high only = 0.24, neither = 0.56]

Probability: example
• The same problem as a probability tree:

Concrete   P     Load   P     Joint probability   Failure?
Weak       0.2   High   0.3   0.06                Yes
Weak       0.2   OK     0.7   0.14                Yes
OK         0.8   High   0.3   0.24                Yes
OK         0.8   OK     0.7   0.56                No

P[Failure] = 0.06 + 0.14 + 0.24 = 0.44
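The probability tree for the beam can be enumerated directly. This is a sketch that assumes concrete strength and load are independent, so each branch probability is a product. The dictionary names are illustrative.

```python
from itertools import product

concrete = {"weak": 0.2, "ok": 0.8}
load = {"high": 0.3, "ok": 0.7}

# Sum the joint probabilities of every branch that leads to failure
# (weak concrete, or a load that is too high, or both).
p_failure = sum(
    p_concrete * p_load
    for (state_c, p_concrete), (state_l, p_load) in product(concrete.items(), load.items())
    if state_c == "weak" or state_l == "high"
)

# Same result from the general addition rule.
p_failure_rule = 0.2 + 0.3 - 0.2 * 0.3

print(round(p_failure, 2))
```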

Probability
• General Multiplication Rule

[Figure: Venn diagram of overlapping events A and C]

P{A and C} = P{A} × P{C | A}
or
P{A and C} = P{C} × P{A | C}

Where the vertical line | means "given that". P{C | A} is referred to as a conditional probability.

The probability of both A and C occurring is the probability that A occurs multiplied by the probability of C occurring conditional upon A occurring.

Probability: example
• 12 people
• 9 – Native born
• 3 – Foreign born
• If we select 2 people, what is the probability that
both are foreign born?
• P[F1] = 3/12 = 0.25
• Once F1 has occurred we know that there are 11 people remaining, of whom 2 are foreign born.
• P[F2|F1] = 2/11 = 0.1818
• P[F1 and F2] = 0.25 x 0.1818 = 0.04545 = 1/22
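The multiplication rule for this example can be checked with exact fractions. The comparison against a direct count of pairs is an added cross-check and not part of the original solution.

```python
from fractions import Fraction
from math import comb

total_people, foreign_born = 12, 3

# General multiplication rule: P[F1 and F2] = P[F1] * P[F2 | F1].
p_first = Fraction(foreign_born, total_people)
p_second_given_first = Fraction(foreign_born - 1, total_people - 1)
p_both = p_first * p_second_given_first

# Cross-check: number of foreign-born pairs over the number of all pairs.
p_both_by_counting = Fraction(comb(foreign_born, 2), comb(total_people, 2))

print(p_both, float(p_both))
```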

Probability: example
– 200 Students
– 77 Accounting
– 64 Law
– 92 Study neither
• Complete the Venn diagram by writing down the correct student numbers

[Figure: blank Venn diagram with overlapping circles A and L; four regions to fill in]

Probability: example
– 200 Students
– 77 Accounting
– 64 Law
– 92 Study neither
– Other 3 numbers total 200 – 92 = 108
– Only Accounting = 108 – 64 = 44
– Only Law = 108 – 77 = 31
– Both = 77 – 44 or 64 – 31 = 33

[Figure: Venn diagram of 200 students: A = 77, L = 64; only A = 44, both = 33, only L = 31, neither = 92]

P[A & L] = 33/200 = 0.165
P[A] × P[L | A] = 77/200 × 33/77 = 0.165
P[Only A | A] = 44/77 = 4/7

Probability: class example
A low-water bridge is designed to allow for a flood
occurring once every 10 years. Damage occurs
during each flood. The bridge is also located in an
active seismic region and the probability of a
destructive earthquake occurring in a year is 30%.
Determine the probability of damage during any
given year assuming that floods and earthquakes are
statistically independent.

Probability: example
• Solution to example
• P[Flood] = 0.1
• P[No flood] = 0.9
• P[Quake] = 0.3
• P[No quake] = 0.7

P[Quake ∩ Flood ] = 0.3 × 0.1 = 0.03


P[Quake ∪ Flood ] = P[Quake] + P[ Flood ] − P[Quake ∩ Flood ]
P[Quake ∪ Flood ] = 0.3 + 0.1 − 0.03 = 0.37

Probability
• Class example
Vehicles are classified into light (<3 ton), medium (3-
10 ton) and heavy (>10 ton). Vehicle counts show
that 60% are light, 30% medium and 10% heavy
vehicles. Vehicles are weighed regularly. It is known
that the probability that a vehicle is light and
overloaded is 0.12. Calculate the probability that the
next vehicle is heavy or overloaded. Assume
statistical independence.

Probability
• Class example solution
P[light] =0.6
P[light and overloaded] = 0.12
P[light and overloaded] =
P[light]P[overloaded]
0.12 = 0.6 x P[overloaded]
P[overloaded] = 0.12/0.6 = 0.2

Probability
• Example (continued)

Vehicle   P     Overloaded   P     Joint probability
Light     0.6   Yes          0.2   0.12
Light     0.6   No           0.8   0.48
Medium    0.3   Yes          0.2   0.06
Medium    0.3   No           0.8   0.24
Heavy     0.1   Yes          0.2   0.02
Heavy     0.1   No           0.8   0.08

P[Overloaded ∪ Heavy] = P[Overloaded] + P[Heavy] − P[Overloaded ∩ Heavy]
P[Overloaded ∪ Heavy] = 0.2 + 0.1 − 0.02 = 0.28
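The vehicle example can be sketched by building the full tree and summing the relevant branches. As in the problem statement, overloading is assumed to be independent of vehicle class. The variable names are illustrative.

```python
p_class = {"light": 0.6, "medium": 0.3, "heavy": 0.1}
p_light_and_overloaded = 0.12

# Independence: P[light and overloaded] = P[light] * P[overloaded].
p_overloaded = p_light_and_overloaded / p_class["light"]

# Joint probability of every (class, overloaded?) branch of the tree.
joint = {
    (vehicle, overloaded): p * (p_overloaded if overloaded else 1 - p_overloaded)
    for vehicle, p in p_class.items()
    for overloaded in (True, False)
}

# Add up the branches where the vehicle is heavy or overloaded (or both).
p_heavy_or_overloaded = sum(
    p for (vehicle, overloaded), p in joint.items() if vehicle == "heavy" or overloaded
)

print(round(p_overloaded, 2), round(p_heavy_or_overloaded, 2))
```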

Probability
• Exclusive and Independent
– Exclusive: events that never occur together.
– Independent: the proportion of times A occurs is the same whether or not B occurs
– Using a Venn diagram with regions w (A only), x (A and B), y (B only) and z (neither):
• A & B are mutually exclusive if x = 0
• A & B are independent if w/x = z/y
• w & x refer to events in A
• y & z refer to events outside A
• See paragraph 66 NB!

Probability
• Example – Independent features
Total of 200 students, 80 study biology, 90
study neither biology nor geography, and the
choice of whether a student does or does
not study biology is independent of their
choice of studying geography.
Let us use the Venn diagram to determine all
required numbers in the Venn diagram.

Probability
• Example solution (Venn diagram regions: w = B only, x = B and G, y = G only, z = neither)
– z = 90 (given)
– w + x = 80
– Total outside B = 200 − 80 = 120
– Number inside G but outside B = 120 – 90 = 30, so y = 30
– Independence: w/x = z/y, so w/x = 90/30
– w = 3x
– w + x = 4x = 80
– x = 20
– w = 60
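The same steps can be sketched with exact arithmetic. The names w, x, y and z follow the Venn diagram regions; the final line is an added check that the resulting counts satisfy P[B and G] = P[B] × P[G].

```python
from fractions import Fraction

total, biology, neither = 200, 80, 90

z = neither
outside_biology = total - biology  # students not studying biology
y = outside_biology - z            # geography only

# Independence condition w/x = z/y with w + x = biology gives x = biology / (1 + z/y).
ratio = Fraction(z, y)
x = biology / (1 + ratio)          # both biology and geography
w = biology - x                    # biology only

# Check: P[B and G] should equal P[B] * P[G].
independent = x / total == Fraction(biology, total) * ((x + y) / total)

print(w, x, y, z, independent)
```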

Probability
• Example:
– A disease has an incidence of 0.1%
– The probability of getting a positive test
result from a diseased person is 0.99.
– The probability of getting a positive test
result from a healthy person is 0.02.
– What is the probability that a person who
gives a positive test result really does have
the disease?

• Answer:
– Draw tree diagram
– P[disease | +] = (0.001 × 0.99) / [(0.001 × 0.99) + (0.999 × 0.02)] = 0.00099 / 0.02097 = 0.047
– Conclusion?
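The tree-diagram calculation can be written out as a short sketch. The variable names are illustrative; the numbers are those given in the example.

```python
prevalence = 0.001          # P[disease]
p_pos_given_disease = 0.99  # P[+ | disease]
p_pos_given_healthy = 0.02  # P[+ | healthy]

# Total probability of a positive result: sum over both branches of the tree.
p_pos = prevalence * p_pos_given_disease + (1 - prevalence) * p_pos_given_healthy

# Conditional probability of disease given a positive result.
p_disease_given_pos = prevalence * p_pos_given_disease / p_pos

print(f"P[+] = {p_pos:.5f}, P[disease | +] = {p_disease_given_pos:.3f}")
```

With these inputs fewer than 5% of positive results come from diseased people, because healthy people vastly outnumber diseased ones.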

Probability
• Asking sensitive questions in surveys: the
randomised response method
– Spin a coin twice. Show no-one the results
– If the first spin resulted in a head, answer the question marked H. If the first spin resulted in a tail, answer the question marked T.
• H: Sensitive question
• T: Harmless question – Was the second spin a
tail?

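Probability
• Asking sensitive questions in surveys: the randomised response method

Spin 1   P     Question      Answer   P       Joint probability
Head     0.5   Sensitive Q   Yes      p       0.5p
Head     0.5   Sensitive Q   No       1 − p   0.5 – 0.5p
Tail     0.5   Harmless Q    Yes      0.5     0.25
Tail     0.5   Harmless Q    No       0.5     0.25

With y the overall proportion of "yes" answers and p the proportion who would answer "yes" to the sensitive question:

y = 0.5p + 0.25
p = 2y – 0.5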
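The estimator p = 2y − 0.5 can be tried out on simulated answers. This is a sketch only: the function names, the seed and the true proportion of 0.3 are all assumptions chosen for illustration, not survey data.

```python
import random

def estimate_sensitive_proportion(yes_fraction):
    """Invert y = 0.5p + 0.25 to recover p from the observed fraction of 'yes' answers."""
    return 2 * yes_fraction - 0.5

def simulate_yes_fraction(true_p, respondents, seed=0):
    """Simulate the two-spin scheme and return the observed fraction of 'yes' answers."""
    rng = random.Random(seed)
    yes = 0
    for _ in range(respondents):
        if rng.random() < 0.5:           # first spin is a head: sensitive question
            yes += rng.random() < true_p
        else:                            # first spin is a tail: was the second spin a tail?
            yes += rng.random() < 0.5
    return yes / respondents

observed_y = simulate_yes_fraction(true_p=0.3, respondents=100_000)
estimated_p = estimate_sensitive_proportion(observed_y)

print(f"observed y = {observed_y:.3f}, estimated p = {estimated_p:.3f}")
```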
