METHODS:
Module 2
SHC 797: 2011
Context
Data Description
• Designing experiments
– Does pre-heating reduce
the possibility of rail
defects in thermit welds ?
• Observational studies
– Polls: The popularity of
the President
Data Description
• How can we summarise small amounts of data?
• Example: Bridge column support height
Number  Height (m)
1       8.78
2       8.07
3       8.15
4       8.98
5       9.02
6       7.99
7       8.52
8       8.27
9       8.89
Data Description
• First remember that the following is required
before we continue:
– Data needs to be a random sample
– How accurately were the measurements
made?
– Notice the difference between
measurements and counts
• Measurements: More accurate
• Counts: Always whole numbers
Data Description
• Example: Bridge Column Support Height (the nine heights tabulated earlier)
• Most important things about the numbers in this list
– How low/high a typical one is
– How variable the numbers are
Data Description
• Arithmetic mean or Average
$\bar{x} = \frac{\sum x}{n}$
Data Description
• Measures of location: Mode
– The mode is the observation that
occurs most frequently. In our
previous example there is no mode.
– This is more useful for counted data.
For example number of vehicles per
hour over a 24 hour period.
Data Description
• Mean:
– Easy to calculate
– Much statistical theory based on it
– Accurate: means of different samples do not vary very much
– Each observation contributes equally to the calculation
– Sensitive to mistakes
• Median:
– Can be calculated with merely graded data
– Easy to calculate for small samples, not moderate or large samples
– All values have to be stored for the calculation
– Not as sensitive to mistakes
– Not disturbed by outliers
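The measures of location above can be checked on the bridge column heights. This is a minimal sketch using Python's standard `statistics` module; the variable names are illustrative and not part of the course material.

```python
import statistics

# Bridge column support heights (m) from the table in this section.
heights = [8.78, 8.07, 8.15, 8.98, 9.02, 7.99, 8.52, 8.27, 8.89]

mean_height = statistics.mean(heights)      # sum of the observations / n
median_height = statistics.median(heights)  # middle value of the sorted list
modes = statistics.multimode(heights)       # every value tied for most frequent

# If every value comes back as a "mode", nothing repeats: there is no mode.
has_mode = len(modes) < len(heights)

print(f"mean = {mean_height:.2f} m, median = {median_height:.2f} m, mode exists: {has_mode}")
```

With nine distinct measurements no value repeats, which matches the remark that this example has no mode.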
Data Description
• Guidelines for mean, median and modes
– In case of no outliers use mean
– In case of outliers:
[Table: choice of measure for measurements and for counts; the entries were not recoverable]
Data Description
• Other measures of location:
– Trimmed mean = mean after discarding the upper and lower 5% of the observations
– Root mean square = square root of the mean of the squared observations
– Geometric mean of n positive numbers = nth root of their product
– Harmonic mean = 1/(mean of the reciprocals of the observations)
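A short sketch of these four measures, assuming Python 3.8 or later for `statistics.geometric_mean`. The helper names `trimmed_mean` and `root_mean_square` and the two-value sample are illustrative only.

```python
import math
import statistics

def trimmed_mean(values, fraction=0.05):
    """Mean after discarding the lowest and highest `fraction` of the sorted values."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)          # number dropped from each end
    kept = ordered[k:len(ordered) - k] if k else ordered
    return statistics.mean(kept)

def root_mean_square(values):
    """Square root of the mean of the squared observations."""
    return math.sqrt(statistics.mean([v * v for v in values]))

sample = [2.0, 8.0]  # hypothetical values chosen so the results are easy to check by hand
print(root_mean_square(sample))           # sqrt((4 + 64) / 2)
print(statistics.geometric_mean(sample))  # square root of 2 * 8
print(statistics.harmonic_mean(sample))   # 1 / mean(1/2, 1/8)
print(trimmed_mean(list(range(1, 21))))   # drops 1 and 20, then averages 2..19
```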
Data Description
• Other measures of location: Weighted
means
– Each observation is not counted
equally, but is “weighted” according
to its size:
$\bar{x}_w = \frac{\sum w(x)\,x}{\sum w(x)}$
Examples
• University class size: a third of the classes have 9 students, another third have 60, and the last third have 300
• What is the mean class size?
Examples
• Performance evaluation calculation:
KPA Weight Rating (/5)
1 50% 4
2 30% 3
3 20% 3
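Both examples can be worked with the weighted mean formula. The sketch below rests on two assumptions that the slides do not spell out: that the class-size question contrasts the average per class with the average experienced per student (each class weighted by its size), and that the KPA score is the weight-averaged rating.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# Class sizes: equal weights give the average per class,
# weighting each class by its own size gives the average seen by a student.
class_sizes = [9, 60, 300]
per_class = weighted_mean(class_sizes, [1, 1, 1])
per_student = weighted_mean(class_sizes, class_sizes)

# Performance evaluation: ratings out of 5 weighted by the KPA weights.
ratings = [4, 3, 3]
kpa_weights = [0.5, 0.3, 0.2]
score = weighted_mean(ratings, kpa_weights)

print(f"per class: {per_class:.0f}, per student: {per_student:.1f}, KPA score: {score:.1f}")
```

The gap between 123 per class and roughly 254 per student shows how strongly the choice of weights can move the answer.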
Data Description
• Measure of variation: Range
– The range of a set of numbers is the
largest minus the smallest
• The range of bridge height is 9.02 – 7.99 =
1.03 m
– Only two observations contribute directly
– It is very sensitive to unusually big or small observations
– It is not standardised for sample size
– But it is of use when the sample size is always the same (industrial QC)
Data Description
• Measure of variation: IQR and SIQR (Inter-quartile Range and Semi-Inter-quartile Range = ½ IQR)
– IQR = distance between the lower and upper quartiles
– The quartiles divide the ordered observations into four groups of 25% each
Data Description
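• To determine the lower and upper quartiles we recommend the following positions in the ordered list (n = number of observations):
– Lower quartile: position 0.25n + 0.5
– Upper quartile: position 0.75n + 0.5
– A fractional position is interpolated between the two neighbouring observations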
Data Description
• Example
– Using the bridge data, calculate the LQ, UQ, IQR and SIQR
Number  Height (m)
6       7.99
2       8.07
3       8.15
8       8.27
7       8.52
1       8.78
9       8.89
4       8.98
5       9.02
Data Description
• Example
– n = 9
– LQ(n) = 0.25(9) + 0.5 = 2.75
– UQ(n) = 0.75(9) + 0.5 = 7.25
Order  No.  Height (m)
1      6    7.99
2      2    8.07
3      3    8.15
4      8    8.27
5      7    8.52
6      1    8.78
7      9    8.89
8      4    8.98
9      5    9.02
LQ = 8.07 + 0.75(8.15 − 8.07) = 8.13
UQ = 8.89 + 0.25(8.98 − 8.89) = 8.91
IQR = 8.91 – 8.13 = 0.78
SIQR = 0.39
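The quartile rule can be sketched as below. The function names are hypothetical, and linear interpolation at a fractional position is assumed because it reproduces the values in the example.

```python
def value_at_position(ordered, position):
    """Interpolate linearly at a 1-based, possibly fractional, position."""
    lower = int(position)          # whole part, e.g. 2 for position 2.75
    fraction = position - lower    # fractional part, e.g. 0.75
    if lower < 1:
        return ordered[0]
    if lower >= len(ordered):
        return ordered[-1]
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])

def quartiles(values):
    """Lower and upper quartile at positions 0.25n + 0.5 and 0.75n + 0.5."""
    ordered = sorted(values)
    n = len(ordered)
    return (value_at_position(ordered, 0.25 * n + 0.5),
            value_at_position(ordered, 0.75 * n + 0.5))

heights = [8.78, 8.07, 8.15, 8.98, 9.02, 7.99, 8.52, 8.27, 8.89]
lq, uq = quartiles(heights)
iqr = uq - lq
print(f"LQ = {lq:.2f}, UQ = {uq:.2f}, IQR = {iqr:.2f}, SIQR = {iqr / 2:.2f}")
```

Other quartile conventions exist (for example the methods offered by `statistics.quantiles`), so results from library routines may differ slightly from this rule.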
Data Description
• Measure of variation: Mean Absolute Deviation (MAD)
$\text{M.A.D} = \frac{\sum |x - \bar{x}|}{n}$
Observation      Deviation from mean   Absolute deviation from mean
Height (m), x    x − x̄                 |x − x̄|
8.78             0.26                  0.26
8.07             -0.45                 0.45
8.15             -0.37                 0.37
8.98             0.46                  0.46
9.02             0.50                  0.50
7.99             -0.53                 0.53
8.52             0.00                  0.00
8.27             -0.25                 0.25
8.89             0.37                  0.37
Sum                                    3.19
$\text{M.A.D} = \frac{3.19}{9} = 0.35$ m
Data Description
• Measure of variation: Standard Deviation
– SD is approximately the average of the difference
between each value and the mean.
$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
– SD is what is most commonly used as a measure of variation
– The square of the SD is called the variance: $\text{var} = s^2$
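Data Description
• Standard Deviation - Example
$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{\frac{\sum x^2 - \frac{1}{n}\left(\sum x\right)^2}{n - 1}}$
Observation      Deviation from mean   Squared deviation from mean
Height (m), x    x − x̄                 (x − x̄)²
8.78             0.26                  0.0676
8.07             -0.45                 0.2025
8.15             -0.37                 0.1369
8.98             0.46                  0.2116
9.02             0.5                   0.25
7.99             -0.53                 0.2809
8.52             0                     0
8.27             -0.25                 0.0625
8.89             0.37                  0.1369
Sum              0                     1.3489
$s = \sqrt{\frac{1.3489}{8}} = 0.41$ m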
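As a check on the two measures of variation, the sketch below computes the mean absolute deviation and the standard deviation of the bridge heights, including the shortcut form of the SD formula. It uses the unrounded mean, so intermediate figures differ slightly from the hand tables, which round the mean to 8.52.

```python
import math
import statistics

heights = [8.78, 8.07, 8.15, 8.98, 9.02, 7.99, 8.52, 8.27, 8.89]
n = len(heights)
mean = statistics.mean(heights)

# Mean absolute deviation: average distance from the mean.
mad = sum(abs(x - mean) for x in heights) / n

# Standard deviation from the definition (divisor n - 1).
sum_sq_dev = sum((x - mean) ** 2 for x in heights)
sd = math.sqrt(sum_sq_dev / (n - 1))

# Shortcut form: sum of squares minus (sum)^2 / n.
sd_shortcut = math.sqrt((sum(x * x for x in heights) - sum(heights) ** 2 / n) / (n - 1))

print(f"MAD = {mad:.2f} m, s = {sd:.2f} m, variance = {sd ** 2:.3f} m^2")
```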
Data Description
• Standard Deviation – Typical
Observations
[Table: typical observations by dataset size; the entries were not recoverable]
Data Description: Pictorial
• The box-and-whisker plot
– Can be drawn horizontally or vertically
– Outliers: more than 1 IQR above UQ or more than 1 IQR below LQ
– This is however much more complex
– Transformations (taking the square root or the logarithm) can be used to make this plot more useful
– Skewness: (UQ − median) / (median − LQ)
Data Description
[Figure: frequency (0 to 0.9) against size of flood (0 to 70)]
Data Description
• Flood Example – IQR
• The box-and-whisker plot
[Figure: box-and-whisker plot drawn above the frequency plot of flood size (0 to 70)]
12 16 23 38 61
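A small sketch of the box-plot quantities for the flood example. It assumes the five numbers printed under the plot are the minimum, LQ, median, UQ and maximum, which the slide does not state explicitly, and it applies the 1 IQR outlier rule quoted earlier.

```python
# Assumed five-number summary read from the flood box plot.
minimum, lq, median, uq, maximum = 12, 16, 23, 38, 61

iqr = uq - lq                                    # inter-quartile range
skewness_ratio = (uq - median) / (median - lq)   # > 1 suggests a longer upper tail

# Outlier limits using the 1 IQR rule from the box-and-whisker slide.
upper_limit = uq + iqr
lower_limit = lq - iqr
max_is_outlier = maximum > upper_limit
min_is_outlier = minimum < lower_limit

print(f"IQR = {iqr}, skewness ratio = {skewness_ratio:.2f}, "
      f"max outlier: {max_is_outlier}, min outlier: {min_is_outlier}")
```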
Data Description
• Large Datasets
• See arm length measurements below.
741 817 846 845 846 833 782 767 786 810
765 694 758 754 754 806 775 798 740 809
759 785 795 830 854 830 789 802 720 816
764 783 747 774 763 781 804 727 809 801
796 791 811 833 757 786 806 796 776 803
801 817 831 811 801 802 834 805 829 817
801 796 706 802 774 767 811 767 830 771
759 751 765 811 727 761 808 777 835 787
788 776 754 812 860 765 763 780 777 737
761 791 757 758 795 708 784 725 800 723
Data Description
Example 1
• Tabulate this data (columns: Range, Count of observations, Cumulative count)
Data Description
• Stem-and-leaf plots
69 4
70 6 6
71
72 0 7 7 5 3
73 7
74 2 0 7
75 8 4 4 9 7 9 1 4 7 8
76 7 5 4 3 9 7 7 5 1 5 3 1
77 5 4 6 4 1 7 6 7
78 2 6 5 9 3 1 6 7 8 0 3
79 8 5 6 1 6 1 5
80 6 9 2 4 9 1 6 3 1 1 2 5 1 2 8 0
81 7 0 6 1 7 1 7 1 1 2
82 9
83 3 0 0 3 1 4 0 5
84 6 5 6
85 4
86 0
Data Description
• Histogram presentation
– Note the change in range size: where a range is double the size of the selected standard range, the plotted frequency is halved. Rather keep the range size constant.
[Figure: histogram of the arm length data; vertical axis: proportion of observations per class width (0 to 16), horizontal axis: 650 to 900]
Data Description
• Cumulative frequency plot
[Figure: cumulative frequency plot; vertical axis: % observations <= x (0 to 100), horizontal axis: 680 to 880]
Data Description: example
• Solution
19,20,20,21,22,23,24,25,27
• Mode: 20 m³/s
• Median: 22 m³/s
• Mean: $\bar{x} = \frac{\sum x}{n} = \frac{201}{9} = 22.3$ m³/s
Data Description: example
Observation   x − x̄    (x − x̄)²
19            -3.33    11.11
20            -2.33    5.44
20            -2.33    5.44
21            -1.33    1.78
22            -0.33    0.11
23            0.67     0.44
24            1.67     2.78
25            2.67     7.11
27            4.67     21.78
Sum           0.03     56.01
$\sigma = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{\frac{56.01}{8}} = 2.65$ m³/s
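The solution can be cross-checked with the standard library. This is only a verification sketch; `statistics.stdev` uses the same n − 1 divisor as the formula above.

```python
import statistics

# Flow observations (m^3/s) from the worked example.
flows = [19, 20, 20, 21, 22, 23, 24, 25, 27]

mode_flow = statistics.mode(flows)      # most frequent value
median_flow = statistics.median(flows)  # middle of the nine ordered values
mean_flow = statistics.mean(flows)      # 201 / 9
sd_flow = statistics.stdev(flows)       # sample standard deviation, divisor n - 1

print(f"mode = {mode_flow}, median = {median_flow}, "
      f"mean = {mean_flow:.1f}, sd = {sd_flow:.2f}")
```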
Data Description
• Cumulative frequency plot: Cumulative
proportions
[Figure: proportion of observations less than x (0.0 to 1.0), plotted at the class limits from "Up to 699" to "850-869"]
Data Description
• Computing the mean and standard
deviation of grouped data – Arm length
example
– Assume that all the observations in a class
were in fact at its mid-point
Class | Midpoint x | Frequency f | xf | x²f
Up to 699 690 1 690 476100
700-719 710 2 1420 1008200
720-739 730 6 4380 3197400
740-759 750 13 9750 7312500
760-769 765 12 9180 7022700
770-779 775 8 6200 4805000
780-789 785 11 8635 6778475
790-799 795 7 5565 4424175
800-809 805 16 12880 10368400
810-829 820 11 9020 7396400
830-849 840 11 9240 7761600
850-869 860 2 1720 1479200
Sum 100 78680 62030150
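Data Description
• The mean of grouped data
$\bar{x} = \frac{1}{n}\sum xf = \frac{1}{100}(78680) = 786.8$ mm
• The standard deviation of grouped data
$\sigma = \sqrt{\frac{\sum (x - \bar{x})^2 f}{n - 1}} = \sqrt{\frac{\sum x^2 f - \frac{1}{n}\left(\sum xf\right)^2}{n - 1}}$
$\sigma = \sqrt{\frac{62030150 - \frac{1}{100}(78680)^2}{99}} = 35.5$ mm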
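The grouped-data calculation can be reproduced from the class mid-points and frequencies in the table. A minimal sketch, with illustrative variable names:

```python
import math

# Class mid-points and frequencies from the arm length table.
midpoints = [690, 710, 730, 750, 765, 775, 785, 795, 805, 820, 840, 860]
freqs = [1, 2, 6, 13, 12, 8, 11, 7, 16, 11, 11, 2]

n = sum(freqs)
sum_xf = sum(x * f for x, f in zip(midpoints, freqs))
sum_x2f = sum(x * x * f for x, f in zip(midpoints, freqs))

grouped_mean = sum_xf / n
grouped_sd = math.sqrt((sum_x2f - sum_xf ** 2 / n) / (n - 1))

print(f"n = {n}, mean = {grouped_mean:.1f} mm, sd = {grouped_sd:.1f} mm")
```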
Data Description
• Computing the median and quartiles of
grouped data
[Figure: cumulative proportion of observations less than x (0.00 to 1.00) against x (680 to 880); the curve passes through 0.42 at x = 780 and 0.53 at x = 790]
Median (proportion 0.50): $x = 780 + \frac{0.08}{0.11}(790 - 780) = 787.3$
Data Description
• The same can be done to determine the quartiles of grouped data
• For 0.25:
– x = 762.5
• For 0.75:
– x = 809.4
• Inter-quartile range
– 809.4 − 762.5 = 46.9
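The interpolation on the cumulative plot can be sketched as a function. Two assumptions are made here: the class boundaries are taken as 700, 720, ... (as in the median calculation, which uses 780 and 790), and the open class "Up to 699" is given a lower edge of 680 to match the plot axis.

```python
# Assumed class boundaries; 680 is a guess for the open lowest class.
edges = [680, 700, 720, 740, 760, 770, 780, 790, 800, 810, 830, 850, 870]
freqs = [1, 2, 6, 13, 12, 8, 11, 7, 16, 11, 11, 2]

def grouped_quantile(p, edges, freqs):
    """Value below which a proportion p of the observations lies, by linear interpolation."""
    target = p * sum(freqs)
    cumulative = 0
    for lower, upper, f in zip(edges, edges[1:], freqs):
        if f > 0 and cumulative + f >= target:
            return lower + (target - cumulative) / f * (upper - lower)
        cumulative += f
    return edges[-1]

q1 = grouped_quantile(0.25, edges, freqs)
median = grouped_quantile(0.50, edges, freqs)
q3 = grouped_quantile(0.75, edges, freqs)
print(f"LQ = {q1:.1f}, median = {median:.1f}, UQ = {q3:.1f}, IQR = {q3 - q1:.1f}")
```

The same function gives any percentile, for example `grouped_quantile(0.95, edges, freqs)` for the 95th.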
Data Description
• Percentile
– Percentiles divide the ordered observations into one hundred equal groups.
– EXAMPLE:
•A man of 95th percentile height is
one who is taller than 95 percent
of men and shorter than 5
percent.
Data Description
• 95th Percentile Example
[Figure: cumulative proportion of observations less than x (0.0 to 1.0) against x (680 to 880), used to read off the 95th percentile]
Data Description
• Regression and correlation
– Plotting one variable against another
• Each x value has a corresponding y value
• Plot them against each other
• See example below: Sum of average daily temperatures (in thousands of degrees) and total rainfall (in hundreds of mm) for 1924-1929
Sum of avg daily temperatures   Rainfall
3.1 3.6
3.0 3.4
3.2 3.9
3.1 4.7
3.2 2.9
3.3 2.2
Data Description
• Scatter plot
[Figure: scatter plot of rainfall (hundreds of mm, 1.5 to 5.0) against sum of average daily temperatures (2.9 to 3.3)]
Data Description
• Regression
– With this term, emphasis is on the
equation that predicts y from x
• Correlation
– With this term, emphasis is on the strength of the relationship. That is, how useful is knowing x for the purpose of predicting y? We will only be concerned with the strength of the linear relationship.
Data Description
• The “best” straight line
– Best straight line for predicting y.
– Any straight line obeys the equation
y = a +bx for some choice of the
constants a and b.
– Therefore choosing the best straight
line means choosing a and b.
Data Description
• Regression Analysis
[Figure: observations y₁ to y₄ plotted about the fitted line ŷ = a + bx]
Data Description
• Regression Analysis
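Sum of squared errors: $E = \sum_{i=1}^{n} (y - \hat{y})^2$
but $\hat{y} = a + bx$, $\therefore E = \sum_{i=1}^{n} (y - a - bx)^2$
For a minimum: $\frac{\partial E}{\partial a} = \frac{\partial E}{\partial b} = 0$
$\frac{\partial E}{\partial a} = 2\sum_{i=1}^{n} (y - a - bx)(-1) = 0$
$\frac{\partial E}{\partial b} = 2\sum_{i=1}^{n} (y - a - bx)(-x) = 0$
Dividing by −2:
$\sum_{i=1}^{n} (y - a - bx) = 0$ ... eq.1
$\sum_{i=1}^{n} (y - a - bx)(x) = 0$ ... eq.2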
Data Description
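From eq.1:
$\sum y - na - b\sum x = 0$
but $\sum y = n\bar{y}$ and $\sum x = n\bar{x}$
$\therefore a = \frac{1}{n}(n\bar{y} - bn\bar{x}) = \bar{y} - b\bar{x}$ ... eq.3
From eq.2:
$\sum xy - \sum ax - b\sum x^2 = 0$
Substituting eq.3 into eq.2:
$\sum xy - \sum (\bar{y} - b\bar{x})x - b\sum x^2 = 0$
$\sum xy - \sum (\bar{y}x - b\bar{x}x) - b\sum x^2 = 0$
$\sum xy - \sum \bar{y}x + \sum b\bar{x}x - b\sum x^2 = 0$
$\sum xy - \sum \bar{y}x = -\left(\sum b\bar{x}x - b\sum x^2\right)$
$\sum xy - \sum \bar{y}x = b\left(\sum x^2 - \sum \bar{x}x\right)$
$b = \frac{\sum xy - \sum \bar{y}x}{\sum x^2 - \sum \bar{x}x} = \frac{\sum xy - \bar{y}\sum x}{\sum x^2 - \bar{x}\sum x} = \frac{\sum xy - \frac{1}{n}\sum y \sum x}{\sum x^2 - \frac{1}{n}\left(\sum x\right)^2}$
$a = \bar{y} - b\bar{x}$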
Data Description: example
• Sum Squared Errors – Class Example 1
• ŷ = 3 + x
x   y   ŷ   y − ŷ   (y − ŷ)²
1   4   4   0       0
2   2   5   -3      9
3   6   6   0       0
4   8   7   1       1
Sum                 10
• ŷ = 1+2x
x   y   ŷ   y − ŷ   (y − ŷ)²
1   4   3   1       1
2   2   5   -3      9
3   6   7   -1      1
4   8   9   -1      1
Sum                 12
• ŷ = 8 − x
x   y   ŷ   y − ŷ   (y − ŷ)²
1   4   7   -3      9
2   2   6   -4      16
3   6   5   1       1
4   8   4   4       16
Sum                 42
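The three trial lines can be compared with a small helper. The line equations ŷ = 3 + x and ŷ = 8 − x are inferred from the ŷ columns of the tables; only ŷ = 1 + 2x is stated on the slide.

```python
xs = [1, 2, 3, 4]
ys = [4, 2, 6, 8]

def sum_squared_errors(a, b):
    """Sum of (y - y_hat)^2 for the line y_hat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# (intercept a, slope b) for each trial line.
trial_lines = {"3 + x": (3, 1), "1 + 2x": (1, 2), "8 - x": (8, -1)}
for label, (a, b) in trial_lines.items():
    print(f"y_hat = {label}: E = {sum_squared_errors(a, b)}")
```

None of these guesses is the minimum; the least-squares line is the one choice of a and b that makes E as small as possible.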
Data Description: example
• Sum Squared Errors – Class Example 1
• Calculate the slope: b
$b = \frac{\sum xy - \frac{1}{n}\sum y \sum x}{\sum x^2 - \frac{1}{n}\left(\sum x\right)^2}$
x     y     x²    xy
1     4     1     4
2     2     4     4
3     6     9     18
4     8     16    32
Sum:  10    20    30    58   (Σx, Σy, Σx², Σxy)
$b = \frac{58 - \frac{1}{4} \times 10 \times 20}{30 - \frac{1}{4}(10)^2} = \frac{8}{5} = 1.6$
• Calculate the intercept: $a = \bar{y} - b\bar{x}$
$\bar{y} = \frac{20}{4} = 5$, $\bar{x} = \frac{10}{4} = 2.5$
$a = 5 - 1.6 \times 2.5 = 1$
Data Description: example
• Sum Squared Errors – Class Example
• ŷ = 1+1.6x
x   y   ŷ     y − ŷ   (y − ŷ)²
1   4   2.6   1.4     1.96
2   2   4.2   -2.2    4.84
3   6   5.8   0.2     0.04
4   8   7.4   0.6     0.36
Sum                   7.20
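A sketch of the least-squares calculation for the class example, using the slope and intercept formulas derived earlier. The function name `fit_line` is illustrative.

```python
def fit_line(xs, ys):
    """Return (a, b) of the least-squares line y_hat = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    a = sum_y / n - b * sum_x / n
    return a, b

xs = [1, 2, 3, 4]
ys = [4, 2, 6, 8]
a, b = fit_line(xs, ys)

# Sum of squared errors for the fitted line.
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
print(f"a = {a:.1f}, b = {b:.1f}, E = {sse:.2f}")
```

The resulting E of 7.20 is lower than for any of the trial lines, as expected for the least-squares fit.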
Data Description
• How do we know we’ve made a good
choice?
– Plot residuals (y-ŷ) against x
– Residuals equally distributed: straight line is a good choice
– Pattern seen in the residuals: straight line is not a good choice
Data Description
• How do we know we’ve made a good
choice?
– Calculate correlation coefficient
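$r = \frac{\text{covariance}}{\sqrt{(\text{variance of } x)(\text{variance of } y)}} = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$
where
$S_{xx} = \sum x^2 - \frac{1}{n}\left(\sum x\right)^2$
$S_{yy} = \sum y^2 - \frac{1}{n}\left(\sum y\right)^2$
$S_{xy} = \sum xy - \frac{1}{n}\left(\sum x\right)\left(\sum y\right)$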
Data Description
• Correlation – Example
• ŷ = 1+1.6x
x     y     x²    y²    xy
1     4     1     16    4
2     2     4     4     4
3     6     9     36    18
4     8     16    64    32
Sum:  10    20    30    120   58   (Σx, Σy, Σx², Σy², Σxy)
$S_{xx} = 30 - \frac{1}{4}(10)^2 = 5$
$S_{yy} = 120 - \frac{1}{4}(20)^2 = 20$
$S_{xy} = 58 - \frac{1}{4}(10)(20) = 8$
$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{8}{\sqrt{(5)(20)}} = \frac{8}{10} = 0.8$
– r = 1: straight line
– r > 0.98: very good correlation
– r > 0.96: good correlation
– r < 0.9: bad correlation
Data Description
[Figure: scatter plot of rainfall (hundreds of mm, 1.5 to 5.0) against temperature (2.9 to 3.3)]
Data Description
• Temperature Rainfall example
– Class Example
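$S_{xx} = 59.59 - \frac{1}{6}(18.9)^2 = 0.055$
$S_{yy} = 75.07 - \frac{1}{6}(20.7)^2 = 3.655$
$S_{xy} = 64.95 - \frac{1}{6}(18.9)(20.7) = -0.255$
$b = \frac{S_{xy}}{S_{xx}} = \frac{-0.255}{0.055} = -4.64$
$a = \bar{y} - b\bar{x} = 3.45 - (-4.64)(3.15) = 18.1$
$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{-0.255}{\sqrt{0.055 \times 3.655}} = -0.57$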
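The temperature and rainfall example can be recomputed from the six data pairs. This sketch keeps full precision, so the intercept comes out near 18.05 rather than the 18.1 obtained with the rounded slope.

```python
import math

# Sum of average daily temperatures (thousands of degrees) and rainfall (hundreds of mm).
temps = [3.1, 3.0, 3.2, 3.1, 3.2, 3.3]
rain = [3.6, 3.4, 3.9, 4.7, 2.9, 2.2]
n = len(temps)

s_xx = sum(x * x for x in temps) - sum(temps) ** 2 / n
s_yy = sum(y * y for y in rain) - sum(rain) ** 2 / n
s_xy = sum(x * y for x, y in zip(temps, rain)) - sum(temps) * sum(rain) / n

b = s_xy / s_xx                        # slope
a = sum(rain) / n - b * sum(temps) / n  # intercept
r = s_xy / math.sqrt(s_xx * s_yy)      # correlation coefficient

print(f"Sxx = {s_xx:.3f}, Syy = {s_yy:.3f}, Sxy = {s_xy:.3f}")
print(f"b = {b:.2f}, a = {a:.2f}, r = {r:.2f}")
```

By the guideline quoted in the correlation example, an r of about −0.57 indicates a weak linear relationship for these six years.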
Part 2: Probability
Probability
Probability
• Three ways of refining this idea:
1) A priori approach
– Sometimes, the experimental set-up
is so clear, we know the probabilities
in advance of collecting any data
Probability
• A priori approach: Examples
– Coin
• P[x = head] = P[x = tail] = 0.5
– Dice
• P[x = 1] = P[x = 2] = P[x = 3] = P[x = 4]
=P[x = 5] = P[x = 6] = 1/6
– Cards
• P[x = 4 of hearts] = 1/52
• P[x = Ace] = 4/52 = 1/13
Probability
2) Empirical
– By having enough experimental data
– Examples
•Break 100 concrete cubes. The strength of 30 cubes is more than 50 MPa
•P[x > 50] = 30/100 = 0.3
Probability
3) Subjective assessments
– What is the probability that it will rain on the 27th of July in Pretoria?
– Argument: During winter it does not
rain regularly in Pretoria.
•No experimental information
•Intuition:
–P[x = rain] = 0.05
Probability
• We do not need to worry about the
philosophy of probability. Probabilities
are much easier to use in practical
calculations than they are to
philosophise about!
Probability
• Probability Scale
Probability   Event
1.0           Dying
0.9           Pass Statistics
0.5           Coin
0.167         Dice
0             Swim through the Atlantic ocean
Probability
• Definitions
Probability
• Mutually-exclusive events
– This means they cannot occur
together.
– Consequently, the probability of one
or the other of two mutually-
exclusive events occurring is the sum
of their individual probabilities.
Probability
• Addition rule for mutually exclusive events: the probability of one or other of two mutually exclusive events occurring is the sum of their individual probabilities: P[A or B] = P[A] + P[B]
• Example:
– P[temp increase] = 0.62
– P[temp unchanged] = 0.23
– Then P[temp increase or unchanged] =
0.62 + 0.23 = 0.85
Probability
• Addition rule for mutually exclusive events:
the probabilities of all possible exclusive
outcomes add up to 1:
• Example:
– P[temp increase] = 0.62
– P[temp unchanged] = 0.23
– Then P[temp decrease] = 1 - 0.62 - 0.23 =
0.15
Probability
• Multiplication rule for independent events: the probability of more than one event happening is the product of their individual probabilities, if they are independent: P[A and B] = P[A] × P[B]
• Example:
– P[dice = 2] = 1/6
– P[draw a King from cards] = 1/13
– Then P[dice= 2 and King] = 1/6*1/13 = 1/78
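The addition and multiplication rules from these examples can be checked with exact fractions. A minimal sketch using Python's `fractions` module; the variable names are illustrative.

```python
from fractions import Fraction

# Addition rule for mutually exclusive events (temperature example).
p_increase = Fraction(62, 100)
p_unchanged = Fraction(23, 100)
p_increase_or_unchanged = p_increase + p_unchanged

# All exclusive outcomes add up to 1, so the remaining outcome follows.
p_decrease = 1 - p_increase - p_unchanged

# Multiplication rule for independent events (die and card example).
p_two = Fraction(1, 6)
p_king = Fraction(1, 13)
p_two_and_king = p_two * p_king

print(float(p_increase_or_unchanged), float(p_decrease), p_two_and_king)
```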
Probability
• Example – Mutually-exclusive events
– Number of students on campus is 50 000
– B.Eng = 10 000
– B.Com = 15 000
– Choosing a student
• P[B.Eng] = 10 000/50 000 = 0.2
• P[B.Com] = 15 000/50 000 = 0.3
Probability
• Venn Diagram – Mutually-exclusive
[Venn diagram: separate, non-overlapping regions for B.Eng and B.Com, with all other students outside]
Probability
• Independent probability
• Two experiments are independent when they do not influence each other.
• The probability of both happening is the product of their individual probabilities
• Example
– What sex is a person? Man – A, Woman – B
– Does the person own a vehicle? Yes – C, No – D
– P[A] = 0.45, P[B] = 0.55, P[C] = 0.8, P[D] = 0.2
Probability
• Example: P[A] = 0.45, P[B] = 0.55, P[C] = 0.8, P[D] = 0.2
P[A ∩ C] = ?
P[A ∩ D] = ?
P[B ∩ C] = ?
P[B ∩ D] = ?
Probability
• Example: P[A] = 0.45, P[B] = 0.55, P[C] = 0.8, P[D] = 0.2
                   A (Man)   B (Woman)
C (Vehicle: Yes)   0.36      0.44
D (Vehicle: No)    0.09      0.11
Probability
• General addition rule: P[A ∪ C] = P[A] + P[C] − P[A ∩ C]
[Venn diagram: overlapping events A and C]
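For the sex and vehicle example, the joint probabilities and the general addition rule can be sketched as follows, assuming the two characteristics are independent as the example states. The dictionary layout is only one convenient representation.

```python
from fractions import Fraction

# A = man, B = woman, C = owns a vehicle, D = does not own a vehicle.
p = {"A": Fraction(45, 100), "B": Fraction(55, 100),
     "C": Fraction(80, 100), "D": Fraction(20, 100)}

# Independence: each joint probability is the product of the two marginals.
joint = {(sex, vehicle): p[sex] * p[vehicle] for sex in "AB" for vehicle in "CD"}

# General addition rule: man OR vehicle owner, without counting the overlap twice.
p_a_or_c = p["A"] + p["C"] - joint[("A", "C")]

for key, value in joint.items():
    print(key, float(value))
print("P[A or C] =", float(p_a_or_c))
```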
Probability
• Same Example – Probability Tree
Probability
• Class Example
• A concrete beam will fail if the
concrete is too weak or the load is too
high.
• P[Weak] = 0.2
• P[High] = 0.3
• P[Failure] = ?
Probability
• P[Weak] = 0.2
• P[High] = 0.3
• P[Failure] = ?
[Venn diagram: W only = 0.14, W and H = 0.06, H only = 0.24]
Probability: example
• P[Weak] = 0.2
• P[High] = 0.3
• P[Failure] = ?
P[Failure] = P[Weak] + P[High] − P[Weak ∩ High]
P[Failure] = P[Weak] + P[High] − P[Weak]P[High] (strength and load taken as independent)
P[Failure] = 0.2 + 0.3 − (0.2)(0.3)
P[Failure] = 0.5 − 0.06 = 0.44
Probability tree:
– Concrete: Weak (P = 0.2) or OK (P = 0.8)
– Load, on each concrete branch: High (P = 0.3) or OK (P = 0.7)
P[Failure] = 0.44
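The beam example can be verified two ways: with the general addition rule, and by reading the tree as "the beam survives only if the concrete is OK and the load is OK". Both assume that strength and load are independent, as the calculation above does.

```python
from fractions import Fraction

p_weak = Fraction(2, 10)
p_high = Fraction(3, 10)

# General addition rule with P[Weak and High] = P[Weak] * P[High].
p_failure_addition = p_weak + p_high - p_weak * p_high

# Tree view: failure is the complement of the single "OK, OK" branch.
p_failure_tree = 1 - (1 - p_weak) * (1 - p_high)

print(float(p_failure_addition), float(p_failure_tree))
```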
Probability
• General multiplication rule: P[A ∩ C] = P[A] × P[C | A]
[Venn diagram: overlapping events A and C]
Probability: example
• 12 people
• 9 – Native born
• 3 – Foreign born
• If we select 2 people, what is the probability that
both are foreign born?
• P[F1] = 3/12 = 0.25
• Once F1 has occurred, we know that 11 people remain, of whom 2 are foreign born.
• P[F2|F1] = 2/11 = 0.1818
• P[F1 and F2] = 0.25 x 0.1818 = 0.04545 = 1/22
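The same answer follows from the general multiplication rule and, as an independent cross-check, from counting the possible pairs. The counting approach is an addition here, not part of the slide.

```python
from fractions import Fraction
from math import comb

# General multiplication rule: P[F1 and F2] = P[F1] * P[F2 | F1].
p_f1 = Fraction(3, 12)
p_f2_given_f1 = Fraction(2, 11)
p_both_foreign = p_f1 * p_f2_given_f1

# Cross-check by counting: pairs of foreign-born people over all pairs of people.
p_by_counting = Fraction(comb(3, 2), comb(12, 2))

print(p_both_foreign, float(p_both_foreign))
```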
Probability: example
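– 200 Students
– 77 Accounting (A)
– 64 Law (L)
– 92 Study neither
Complete the Venn diagram by writing down the correct student numbers
[Venn diagram: overlapping sets A and L, with three unknown regions inside and one unknown region outside]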
Probability: example
– 200 Students
– 77 Accounting (A)
– 64 Law (L)
– 92 Study neither
– Other 3 numbers to total 200 – 92 = 108
– Only Accounting = 108 – 64 = 44
– Only Law = 108 – 77 = 31
– Both = 77 – 44 or 64 – 31 = 33
[Venn diagram: total 200; A only = 44, A and L = 33, L only = 31, neither = 92]
P[A & L] = 33/200 = 0.165
P[A] × P[L | A] = 77/200 × 33/77 = 0.165
P[Only A | A] = 44/77 = 4/7
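The Venn diagram numbers can be derived step by step, as in this sketch. Variable names are illustrative.

```python
from fractions import Fraction

total, accounting, law, neither = 200, 77, 64, 92

at_least_one = total - neither           # students inside the two circles
only_accounting = at_least_one - law     # inside A but not L
only_law = at_least_one - accounting     # inside L but not A
both = accounting + law - at_least_one   # the overlap

p_both = Fraction(both, total)
p_via_conditional = Fraction(accounting, total) * Fraction(both, accounting)
p_only_a_given_a = Fraction(only_accounting, accounting)

print(only_accounting, both, only_law, float(p_both), p_only_a_given_a)
```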
Probability: class example
A low-water bridge is designed to allow for a flood
occurring once every 10 years. Damage occurs
during each flood. The bridge is also located in an
active seismic region and the probability of a
destructive earthquake occurring in a year is 30%.
Determine the probability of damage during any
given year assuming that floods and earthquakes are
statistically independent.
Probability: example
• Solution to example
• P[Flood] = 0.1
• P[No flood] = 0.9
• P[Quake] = 0.3
• P[No quake] = 0.7
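One way to finish this example, offered as a sketch since the slide stops at the four probabilities: the bridge is damaged in a year if a flood or a destructive earthquake (or both) occurs, so with independence the probability of damage is the complement of "no flood and no quake".

```python
from fractions import Fraction

p_flood = Fraction(1, 10)   # one flood every 10 years
p_quake = Fraction(3, 10)   # destructive earthquake in a given year

# Complement of the only undamaged outcome: no flood AND no quake.
p_damage = 1 - (1 - p_flood) * (1 - p_quake)

# Same result from the general addition rule.
p_damage_addition = p_flood + p_quake - p_flood * p_quake

print(float(p_damage))
```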
Probability
• Class example
Vehicles are classified into light (<3 ton), medium (3-
10 ton) and heavy (>10 ton). Vehicle counts show
that 60% are light, 30% medium and 10% heavy
vehicles. Vehicles are weighed regularly. It is known
that the probability that a vehicle is light and
overloaded is 0.12. Calculate the probability that the
next vehicle is heavy or overloaded. Assume
statistical independence.
Probability
• Class example solution
P[light] =0.6
P[light and overloaded] = 0.12
P[light and overloaded] = P[light] × P[overloaded]
0.12 = 0.6 x P[overloaded]
P[overloaded] = 0.12/0.6 = 0.2
Probability
• Example (continued)
[Probability tree] Vehicle: Light (0.6), Medium (0.3), Heavy (0.1)
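The remaining step of the class example is not recoverable from the slide, so the following is a sketch of how it could be completed under the stated independence assumption: find P[overloaded] from the light vehicles, then apply the general addition rule to "heavy or overloaded".

```python
from fractions import Fraction

p_light = Fraction(6, 10)
p_heavy = Fraction(1, 10)
p_light_and_overloaded = Fraction(12, 100)

# Independence: P[light and overloaded] = P[light] * P[overloaded].
p_overloaded = p_light_and_overloaded / p_light

# General addition rule, with P[heavy and overloaded] = P[heavy] * P[overloaded].
p_heavy_or_overloaded = p_heavy + p_overloaded - p_heavy * p_overloaded

print(float(p_overloaded), float(p_heavy_or_overloaded))
```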
Probability
• Exclusive and Independent
– Exclusive:
• Exclusive events are ones that never occur together.
– Independent:
• A and B are independent if the proportion of times A occurs is the same whether or not B occurs
– Using a Venn diagram
Probability
• Example – Independent features
Total of 200 students, 80 study biology, 90
study neither biology nor geography, and the
choice of whether a student does or does
not study biology is independent of their
choice of studying geography.
Let us use the Venn diagram to determine all
required numbers in the Venn diagram.
Probability
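• Example solution
Let w = biology only, x = both, y = geography only, z = neither (Venn diagram with sets B and G)
z = 90 given
w + x = 80
y = 200 − 80 − 90 = 30
Independence: w/x = z/y
w/x = 90/30, so w = 3x
w + x = 4x = 80
x = 20
w = 60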
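The solution can be checked numerically. The sketch assumes the region labels used above (w = biology only, x = both, y = geography only, z = neither), which are inferred from the equations rather than stated on the slide.

```python
from fractions import Fraction

total, biology, neither = 200, 80, 90

not_biology = total - biology          # 120 students do not study biology
geo_only = not_biology - neither       # y: geography but not biology

# Independence: the share studying geography is the same with or without biology.
share_geography = Fraction(geo_only, not_biology)
both = biology * share_geography       # x
bio_only = biology - both              # w

# Check: P[B and G] equals P[B] * P[G].
p_b = Fraction(biology, total)
p_g = (both + geo_only) / total
print(bio_only, both, geo_only, neither, both / total == p_b * p_g)
```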
Probability
• Example:
– A disease has an incidence of 0.1%
– The probability of getting a positive test
result from a diseased person is 0.99.
– The probability of getting a positive test
result from a healthy person is 0.02.
– What is the probability that a person who
gives a positive test result really does have
the disease?
• Answer:
– Draw tree diagram
P[disease | +] = (0.001 × 0.99)/[(0.001 × 0.99) + (0.999 × 0.02)]
= 0.047
– Conclusion?
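The tree diagram calculation can be written out as below. The variable names are illustrative; the numbers are those given in the example.

```python
incidence = 0.001            # P[disease]
p_pos_given_disease = 0.99   # P[+ | disease]
p_pos_given_healthy = 0.02   # P[+ | healthy]

# Total probability of a positive result: sum over both branches of the tree.
p_positive = (incidence * p_pos_given_disease
              + (1 - incidence) * p_pos_given_healthy)

# Share of positive results that come from the diseased branch.
p_disease_given_pos = incidence * p_pos_given_disease / p_positive

print(f"P[+] = {p_positive:.5f}, P[disease | +] = {p_disease_given_pos:.3f}")
```

Because the disease is rare, most positive results come from the much larger healthy group, so fewer than 5% of positives are true cases.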
Probability
• Asking sensitive questions in surveys: the
randomised response method
– Spin a coin twice. Show no-one the results
– If the first spin resulted in a head, answer the question marked H. If the first spin resulted in a tail, answer the question marked T.
• H: Sensitive question
• T: Harmless question – Was the second spin a
tail?
Probability
• Asking sensitive questions in surveys: the
randomised response method
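Spin 1: Head (0.5) → Sensitive Q: Yes (p), No (1 − p)
Spin 1: Tail (0.5) → Harmless Q: Yes (0.5), No (0.5)
y = proportion of all answers that are "Yes", p = proportion who would answer "Yes" to the sensitive question
y = 0.5p + 0.25
p = 2y – 0.5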
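The two formulas can be wrapped in small helpers to show how the survey result is converted back into an estimate of p. The function names and the value p = 0.3 are hypothetical, chosen only for illustration.

```python
def expected_yes_fraction(p):
    """Overall proportion of 'Yes' answers when a fraction p would say yes to the sensitive question."""
    return 0.5 * p + 0.5 * 0.5   # head branch + tail branch of the tree

def estimate_sensitive_proportion(yes_fraction):
    """Invert y = 0.5p + 0.25 to recover p from the observed 'Yes' fraction y."""
    return 2 * yes_fraction - 0.5

y = expected_yes_fraction(0.3)           # hypothetical true proportion of 0.3
p_estimate = estimate_sensitive_proportion(y)
print(f"y = {y:.2f}, estimated p = {p_estimate:.2f}")
```

In a real survey y is itself an estimate from a finite sample, so the recovered p carries sampling error and can even fall slightly outside the range 0 to 1.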